Red Hat Bugzilla – Bug 496536
Kernel crashes correlated with network load
Last modified: 2009-10-21 21:34:23 EDT
Description of problem:
While doing an HTTP install on the local network, the server providing the repo crashed. No useful information was in /var/log/messages after rebooting. The display stayed as it was, but the keyboard and mouse had no effect. There was no response to ping from another machine on my local network.
This happened a bit inconsistently but occurred while the install machine was pulling repo metadata and packages. I saw this with the PAE-184.108.40.206-85.fc11.i686 and 220.127.116.11-97.fc11.i686.PAE kernels. I went back to the 18.104.22.168-68.fc11.i686.PAE and didn't see it, but may have just been lucky.
I don't normally have much local traffic, so the network load is much higher than I would typically see. (Normally I am limited by my network connection to about 1.5 Mb/s instead of 100 Mb/s.)
Version-Release number of selected component (if applicable):
The crashes didn't happen repeatedly in the same places. So I don't think the trigger is 100%.
Steps to Reproduce:
1. I got this to happen while doing a network install of Fedora on another machine, with the repo served via Apache from the machine that crashed.
Can you configure kdump on the server to capture a crash next time it happens?
I think I can do that tonight. I skimmed through:
and that looks like the instructions I need to do this. If they aren't, please let me know.
Yep, you want to follow the "How to configure kdump" section. You might also want to explore configuring /etc/kdump.conf to capture from the initrd rather than the rootfs (although that's not required).
I am still seeing this on kernel 22.214.171.124-102.fc11.i686.PAE.
I am not finding a crash dump after rebooting. The kdump service is running when the system crashes, but the system still ends up unresponsive and there is nothing in /var/crash when I reboot.
I asked around in #fedora-qa but it seems like it is late enough that I am not going to get help there tonight. I can probably try again tomorrow night.
It might be relevant that all of my file systems except /boot are encrypted.
I do have about 200MB in /boot, but I don't know if that is enough space for a typical crash dump. (There is 2 GB of real memory on the system.)
Yeah, that won't be enough space. Do you have an NFS server or another system with an ssh server running? You can configure kdump to dump there. I would recommend the use of ssh. Edit /etc/kdump.conf to include this line:
where <server> is the name of the system you can ssh to.
Then run the following commands from the terminal:
service kdump propagate
service kdump restart
The propagate command will copy your ssh public key to the remote system to automate the ssh process during kdump. It does this with an scp operation, so you'll need to log in to the remote system with your root password during that part of the process. Then when you crash, the vmcore will be copied to that remote system automatically.
The box that is crashing is my network server, but I have two boxes I am in the process of trying to install F11 on with custom disk layouts. I could run a live image on one and use it for storage.
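Before relying on the setup above, it may help to confirm that a crash kernel is actually loaded. The following is only a minimal sketch: the function name kdump_ready is hypothetical, and it assumes the running kernel exposes /sys/kernel/kexec_crash_loaded and was booted with a crashkernel= parameter. The file paths are taken as arguments so the check can be tried against sample files.

```shell
#!/bin/sh
# Sketch (hypothetical helper): report whether a kdump crash kernel looks ready.
# Arguments default to the live system files but can point at sample files.
kdump_ready() {
    cmdline_file=${1:-/proc/cmdline}
    loaded_file=${2:-/sys/kernel/kexec_crash_loaded}

    # Memory for the crash kernel is reserved at boot with crashkernel=.
    if ! grep -q 'crashkernel=' "$cmdline_file" 2>/dev/null; then
        echo "no crashkernel= on kernel command line"
        return 1
    fi

    # Assumption: the kernel reports 1 here once a crash kernel is loaded.
    if [ "$(cat "$loaded_file" 2>/dev/null)" != "1" ]; then
        echo "crash kernel not loaded"
        return 1
    fi

    echo "crash kernel loaded"
    return 0
}
```

Running kdump_ready with no arguments checks the live system. If it reports that the crash kernel is not loaded, a crash will not produce a vmcore no matter where the dump target points.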
I'll play with it some more tonight.
There also seems to be some correlation with when I am running xmms.
Ok, please let us know when you have a vmcore to look over.
I am not finding any evidence that the crash kernel is getting started.
The kdump service seems to be running. But when I trigger the problem or use sysrq to cause a crash nothing happens.
My latest try was putting the dump in /boot (even though there wouldn't be room for the whole thing) and setting a default of shell so that I could try stuff manually after it failed. I never got a shell prompt. In fact the screen never changed from what it was when I hit enter to trigger the crash, despite waiting several hours. When I looked, /boot had a var/crash directory but no file in that directory.
I didn't see any option for kdump to use software raid devices for output. I was willing to risk /boot with just using ext3 when it was really ext3 on top of raid 1, but that didn't appear to work.
I wouldn't be too surprised if this is somehow tied to my sound driver since there have been some changes there with recent kernels around the time I started seeing the problem.
You need to be running a serial console (or be running in text mode) to verify what's going on. kdump is incapable of blanking the screen on all video cards during a kdump operation. Attach a serial console and add the appropriate command line parameter, and you'll be able to see all the messages kdump produces during boot.
That helped a bit. I could see what was happening.
I am getting a message about waiting for the luks device that corresponds to /, then a few messages related to detecting my mouse and usb devices, and after that the characters I type are echoed but nothing else happens.
So I suspect the initrd created doesn't have built-in smarts for handling encrypted root partitions. Is there any standard way around this?
If there isn't, I have some partitions set up for a Fedora 10 instance that I think is obsolete, and I could try to do an unencrypted install of rawhide using them and then see if the problem still shows up. It is some work, but the way things are going it might end up being easier. I would easily have space for a crash dump. So as long as the problem isn't related to using encrypted devices, that approach will likely work.
Is there any standard way around this?
Yes, see comment 6. kdump is meant to be a non-interactive process, and since entering a password to decrypt a drive is by definition interactive, we don't support that. The way around it is to either reserve unencrypted space locally for a dump, or to send the dump to another system via ssh or nfs, which can be configured in /etc/kdump.conf.
Turns out ssh won't work because access to / is still needed to grab the key.
I don't have an nfs server set up right now. But that situation might change tonight if either I get a chance to pick up some hardware that I have at my brother-in-law's or today's rawhide is actually installable on my other machines. If neither of those options works out, I'll look at reusing my obsolete F10 file system. There would only be a couple of directories that I might want stuff from, so I should be able to safely reuse it without too much work.
I am over time this morning, so I won't bug you again 'til tonight and hopefully I'll have the vmcore data for you.
rootfs access should be unneeded for ssh dumping, as the key is copied into the initramfs (where we do all the dump work), at the time you run service kdump start.
sorry, typo above. root fs access should _not_ be needed when using scp to capture a dump.
I think I still had the waiting message mentioning the luks device. I looked at the router and the ethernet lights weren't flashing, strongly suggesting that no traffic was being sent out. Nothing on the console suggested that scp started up (though that might be normal).
I am going to be retrieving my other hardware tonight so I will have another working machine on the local network with lots of space that I can use to capture stuff.
So one way or the other I expect to have something captured tonight.
I got a bit swamped with new rawhide bugs/issues sucking up time and fell behind.
I will take a look and at least make sure that I can reproduce this on the -142 kernel, as I saw at least one comment that indicated there was a network fix.
If I can, I'll try the crash dump again, but I really think it wasn't working correctly when / was on an encrypted device.
This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle.
Changing version to '11'.
More information and reason for this action is here:
ping any update?
I have had crashes that I suspect are related, but since I have started playing with asterisk, the recent ones may have had to do with the dahdi kernel modules.
I have been avoiding playing music when loading the system, so I have been less likely to see the problem.
I will make an attempt to retest this over the weekend with the -213 (or later) kernel with the dahdi modules uninstalled. Even if I still have trouble collecting the traceback, it would still be useful to know if the problem is even still there.
I still haven't got kdump working properly, but I did want to add an update that I am still seeing the problem with kernel 126.96.36.199-45.fc11.i686.PAE.
ok, still no vmcore. I'm closing this. Please re-open it if you manage to get a vmcore or backtrace for us to take a look at. Thanks!
I am running F12 and a 2.6.31 kernel now and am getting crashes that I believe are video related. So seeing if the audio ones are still there is going to need to wait. But if I see them again, and can figure out a way to get a dump for you, I'll reopen the bug. I also have a usb headset, so if it does show up, I should be able to tell if it is related to the motherboard audio hardware driver.