496536 – Kernel crashes correlated with network load

Summary: Kernel crashes correlated with network load

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	11
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Assignee:	Neil Horman
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2009-04-20 00:46 UTC by Bruno Wolff III
Modified:	2009-10-22 01:34 UTC
CC List:	3 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2009-10-21 19:34:07 UTC
Type:	---
Embargoed:
Dependent Products:

Description Bruno Wolff III 2009-04-20 00:46:32 UTC

Description of problem:
While doing an http install on the local network, the server providing the repo crashed. No useful information was in /var/log/messages after rebooting. The display stayed as it was, but the keyboard and mouse had no effect. There was no response to ping from another machine on my local network.
This happened a bit inconsistently but occurred while the install machine was pulling repo metadata and packages. I saw this with the 2.6.29.1-85.fc11.i686.PAE and 2.6.29.1-97.fc11.i686.PAE kernels. I went back to the 2.6.29.1-68.fc11.i686.PAE kernel and didn't see it, but I may have just been lucky.
I don't normally have much local traffic, so the network load is much higher than I would typically see. (Normally I am limited by my network connection to about 1.5 Mb/s instead of 100 Mb/s.)

Version-Release number of selected component (if applicable):


How reproducible:
The crashes didn't happen repeatedly in the same places, so I don't think the trigger is 100% reliable.

Steps to Reproduce:
1. I got this to happen doing a network install of Fedora on another machine while serving the repo via Apache on the machine that crashed.
  
Actual results:


Expected results:


Additional info:

Comment 1 Neil Horman 2009-04-20 13:48:22 UTC

Can you configure kdump on the server to capture a crash next time it happens?

Comment 2 Bruno Wolff III 2009-04-20 14:14:02 UTC

I think I can do that tonight. I skimmed through:
http://fedoraproject.org/wiki/FC6KdumpKexecHowTo
and that looks like the instructions I need to do this. If they aren't, please let me know.

Comment 3 Neil Horman 2009-04-20 15:09:30 UTC

Yep, you want to follow the "How to configure kdump" section.  You might also want to explore configuring /etc/kdump.conf to capture from the initrd rather than the rootfs (although that's not required).

Comment 4 Bruno Wolff III 2009-04-21 04:49:55 UTC

I am still seeing this on kernel 2.6.29.1-102.fc11.i686.PAE.
I am not finding a crash dump after rebooting. The kdump service is running when the system crashes, but the system still ends up unresponsive and there is nothing in /var/crash when I reboot.

Comment 5 Bruno Wolff III 2009-04-21 05:04:03 UTC

I asked around in #fedora-qa but it seems like it is late enough that I am not going to get help there tonight. I can probably try again tomorrow night.
It might be relevant that all of my file systems except /boot are encrypted.
I do have about 200MB in /boot, but I don't know if that is enough space for a typical crash dump. (There is 2 GB of real memory on the system.)
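
The space question above can be sanity-checked with a rough sketch like the one below. It assumes an uncompressed vmcore is about as large as physical memory, which is a simplification, and the helper names (ram_kb, free_kb, dump_fits) are made up for this example.

```shell
#!/bin/sh
# Sketch only: assumes an uncompressed vmcore is roughly the size of RAM.
# The helper names are hypothetical, not part of kdump.

# ram_kb: total memory in kB, read from /proc/meminfo.
ram_kb() {
    awk '/^MemTotal:/ { print $2 }' /proc/meminfo
}

# free_kb DIR: free space in kB on the file system holding DIR.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# dump_fits RAM_KB FREE_KB: succeeds only if FREE_KB can hold RAM_KB.
dump_fits() {
    [ "$2" -ge "$1" ]
}

# Example:
#   dump_fits "$(ram_kb)" "$(free_kb /boot)" || echo "/boot is too small"
```

With the numbers from this report (about 2 GB of memory against about 200 MB free in /boot), dump_fits would fail, so /boot could not hold a full dump.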

Comment 6 Neil Horman 2009-04-21 10:53:37 UTC

Yeah, that won't be enough space.  Do you have an NFS server or another system with an ssh server running?  You can configure kdump to dump there.  I would recommend the use of ssh.  Edit /etc/kdump.conf to include this line:
net root@<server>

where <server> is the name of the system you can ssh to.

Then run the following commands from the terminal:
service kdump propagate
service kdump restart

The propagate command will copy your ssh public key to the remote system to automate the ssh process during kdump (it does this with an scp operation, so you'll need to log in to the remote system with your root password during that part of the process).  Then when you crash, the vmcore will be copied to that remote system automatically.
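
The kdump.conf edit described above could be scripted roughly as follows. This is a sketch, not part of kdump: the function name set_net_target and the host name dumphost are hypothetical, and it only assumes the "net user@host" line format quoted in this comment.

```shell
#!/bin/sh
# Sketch only: set_net_target is a made-up helper, not a kdump tool.

# set_net_target CONF TARGET: make "net TARGET" the only net line in CONF.
set_net_target() {
    conf=$1
    target=$2
    tmp=$(mktemp) || return 1
    # Keep everything except existing "net" lines. grep exits non-zero
    # when nothing is left (or CONF is missing), which is fine here.
    grep -v '^net[[:space:]]' "$conf" > "$tmp" 2>/dev/null || true
    printf 'net %s\n' "$target" >> "$tmp"
    # Write back in place so CONF keeps its owner and permissions.
    cat "$tmp" > "$conf" && rm -f "$tmp"
}

# Example (as root; dumphost is a placeholder for the ssh server):
#   set_net_target /etc/kdump.conf root@dumphost
#   service kdump propagate
#   service kdump restart
```

Running the helper twice with the same target leaves a single net line, so it is safe to repeat while experimenting.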

Comment 7 Bruno Wolff III 2009-04-21 13:26:30 UTC

The box that is crashing is my network server, but I have two boxes I am in the process of trying to install F11 on with custom disk layouts. I could run a live image on one and use it for storage.
I'll play with it some more tonight.

Comment 8 Bruno Wolff III 2009-04-22 01:24:26 UTC

There also seems to be some correlation with when I am running xmms.

Comment 9 Neil Horman 2009-04-22 10:39:54 UTC

Ok, please let us know when you have a vmcore to look over.

Comment 10 Bruno Wolff III 2009-04-22 12:40:48 UTC

I am not finding any evidence that the crash kernel is getting started.
The kdump service seems to be running. But when I trigger the problem or use sysrq to cause a crash, nothing happens.
My latest try was putting the dump in /boot (even though there wouldn't be room for the whole thing) and setting a default of shell so that I could try stuff manually after it failed. I never got a shell prompt. In fact the screen never changed from what it was when I hit enter to trigger the crash, despite waiting several hours. When I looked, /boot had a var/crash directory but no file in that directory.
I didn't see any option for kdump to use software raid devices for output. I was willing to risk /boot with just using ext3 when it was really ext3 on top of raid 1, but that didn't appear to work.
I wouldn't be too surprised if this is somehow tied to my sound driver since there have been some changes there with recent kernels around the time I started seeing the problem.
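
One way to look for evidence that a crash kernel is set up before forcing a test crash is sketched below. The helper names are made up for this example, and it assumes the running kernel exposes /sys/kernel/kexec_crash_loaded and that memory is reserved with a crashkernel= boot parameter.

```shell
#!/bin/sh
# Sketch only; helper names are hypothetical.

# has_param CMDLINE NAME: succeeds if CMDLINE contains a NAME=... parameter.
has_param() {
    case " $1 " in
        *" $2="*) return 0 ;;
        *) return 1 ;;
    esac
}

# crash_kernel_loaded: succeeds if the kernel reports a loaded crash kernel.
# Assumes /sys/kernel/kexec_crash_loaded is available on the running kernel.
crash_kernel_loaded() {
    [ "$(cat /sys/kernel/kexec_crash_loaded 2>/dev/null)" = 1 ]
}

# Example (as root on the machine being debugged):
#   has_param "$(cat /proc/cmdline)" crashkernel || echo "no crashkernel= reservation"
#   crash_kernel_loaded || echo "crash kernel not loaded"
#   echo c > /proc/sysrq-trigger   # forces a crash; only after both checks pass
```

If either check fails, a sysrq-triggered crash would be expected to hang rather than boot the capture kernel.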

Comment 11 Neil Horman 2009-04-22 13:19:17 UTC

You need to be running a serial console (or be running in text mode) to verify what's going on.  kdump is incapable of blanking the screen on all video cards during a kdump operation.  Attach a serial console and add the appropriate command line parameter, and you'll be able to see all the messages kdump produces during boot.
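
As an illustration of the command line parameter mentioned above, the sketch below appends serial console settings to a kernel command line string. The port (ttyS0) and speed (115200) are assumptions and need to match the actual hardware; the function name is made up for this example.

```shell
#!/bin/sh
# Sketch only. Hypothetical values: first serial port, 115200 baud.
SERIAL_CONSOLE='console=tty0 console=ttyS0,115200'

# with_serial_console CMDLINE: print CMDLINE with the serial console
# parameters appended, unless a ttyS console is already present.
with_serial_console() {
    case " $1 " in
        *" console=ttyS"*) printf '%s\n' "$1" ;;
        *) printf '%s %s\n' "$1" "$SERIAL_CONSOLE" ;;
    esac
}

# Example:
#   with_serial_console "$(cat /proc/cmdline)"
```

The resulting string would go on the kernel line of the boot loader configuration. Kernel messages are written to every console listed.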

Comment 12 Bruno Wolff III 2009-04-22 14:01:43 UTC

That helped a bit. I could see what was happening.
I am getting a message about waiting for the luks device that corresponds to / and then a few messages related to detecting my mouse and usb devices, and then characters I type are echoed but nothing else happens.
So I suspect the initrd created doesn't have builtin smarts for handling encrypted root partitions. Is there any standard way around this?

If there isn't, I have some partitions set up for a Fedora 10 instance that I think is obsolete and I could try to do an unencrypted install of rawhide using them and then see if the problem still shows up. It is some work, but the way things are going it might end up being easier. I would easily have space for a crash dump. So as long as the problem isn't related to using encrypted devices, that approach will likely work.

Comment 13 Neil Horman 2009-04-22 14:08:17 UTC

Is there any standard way around this?

Yes, see comment 6.  kdump is meant to be a non-interactive process, and since entering a password to decrypt a drive is by definition interactive, we don't support that.  The way around it is to either reserve unencrypted space locally for a dump, or to send the dump to another system via ssh or nfs, which can be configured in /etc/kdump.conf.

Comment 14 Bruno Wolff III 2009-04-22 14:46:51 UTC

Turns out ssh won't work because access to / is still needed to grab the key.
I don't have an nfs server set up right now. But that situation might change tonight if either I get a chance to pick up some hardware that I have at my brother-in-law's or if today's rawhide is actually installable on my other machines. If neither of those options works out, I'll look at reusing my obsolete F10 file system. There would only be a couple of directories that I might want stuff from, so I should be able to safely reuse it without too much work.
I am over time this morning, so I won't bug you again 'til tonight and hopefully I'll have the vmcore data for you.

Comment 15 Neil Horman 2009-04-22 15:00:43 UTC

rootfs access should be unneeded for ssh dumping, as the key is copied into the initramfs (where we do all the dump work), at the time you run service kdump start.

Comment 16 Neil Horman 2009-04-22 15:01:20 UTC

sorry, typo above.  root fs access should _not_ be needed when using scp to capture a dump.

Comment 17 Bruno Wolff III 2009-04-22 21:00:58 UTC

I think I still had the waiting message mentioning the luks device. I looked at the router and the ethernet lights weren't flashing, strongly suggesting that no traffic was being sent out. Nothing on the console suggested that scp started up (though that might be normal).
I am going to be retrieving my other hardware tonight so I will have another working machine on the local network with lots of space that I can use to capture stuff.
So one way or the other I expect to have something captured tonight.

Comment 18 Neil Horman 2009-04-23 00:13:00 UTC

ok

Comment 19 Neil Horman 2009-05-18 14:25:15 UTC

ping?

Comment 20 Bruno Wolff III 2009-05-18 17:57:28 UTC

I got a bit swamped with new rawhide bugs/issues sucking up time and fell behind.
I will take a look and at least make sure that I can reproduce this on the -142 kernel, as I saw at least one comment that indicated there was a network fix.
If I can I'll try the crash dump again, but I really think it wasn't working correctly when / was on an encrypted device.

Comment 21 Bug Zapper 2009-06-09 14:09:50 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle.
Changing version to '11'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 22 Neil Horman 2009-07-10 10:39:43 UTC

ping, any update?

Comment 23 Bruno Wolff III 2009-07-11 01:44:40 UTC

I have had crashes that I suspect are related, but since I have started playing with asterisk, the recent ones may have had to do with the dahdi kernel modules.
I have been avoiding playing music when loading the system, so I have been less likely to see the problem.
I will make an attempt to retest this over the weekend with the -213 (or later) kernel with the dahdi modules uninstalled. Even if I still have trouble collecting the traceback, it would still be useful to know if the problem is even still there.

Comment 24 Bruno Wolff III 2009-09-05 22:02:00 UTC

I still haven't got kdump working properly, but I did want to add an update that I am still seeing the problem with kernel 2.6.30.5-45.fc11.i686.PAE.

Comment 25 Neil Horman 2009-10-21 19:34:07 UTC

ok, still no vmcore.  I'm closing this.  Please re-open it if you manage to get a vmcore or backtrace for us to take a look at.  Thanks!

Comment 26 Bruno Wolff III 2009-10-22 01:34:23 UTC

I am running F12 and a 2.6.31 kernel now and am getting crashes that I believe are video related. So seeing if the audio ones are still there is going to need to wait. But if I see them again, and can figure out a way to get a dump for you, I'll reopen the bug. I also have a usb headset, so if it does show up, I should be able to tell if it is related to the motherboard audio hardware driver.