Bug 206744

Summary:

Kernel Panic 2.6.17-1.2187_FC5 with samba access

Product:

[Fedora] Fedora

Reporter:

Daniel Rowe <bart>

Component:

kernel

Assignee:

Kernel Maintainer List <kernel-maint>

Status:

CLOSED RAWHIDE

QA Contact:

Brian Brock <bbrock>

Severity:

high

Priority:

medium

CC:

mjc, rhbz001, rtresidd, trevor, wtogami

Hardware:

x86_64

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2006-09-26 02:13:31 UTC

Attachments:

/proc/cpuinfo
/proc/interrupts
lsmod lspci /proc/version
ftpd panic png
nfsd panic png
sshd panic png
cpuinfo
interrupts
lspci

Description Daniel Rowe 2006-09-16 01:52:32 UTC

Description of problem:

Kernel panic when a Samba share served from this box is accessed.

Does not happen with the previous kernel, 2.6.17-1.2174_FC5.

The message dumped to the console is: 'Kernel Panic not syncing: Aiee, killing interrupt
handler'

There is nothing in the logs; the kernel must panic before anything can be
written to the log files. The box locks up completely.

Version-Release number of selected component (if applicable):

2.6.17-1.2187_FC5 samba-3.0.23a-1.fc5.1

How reproducible:

Every time.

Steps to Reproduce:

Access a Samba share on the machine while it is running kernel version
2.6.17-1.2187_FC5.
  
Actual results:

Kernel panic.

Expected results:

Normal operation.

Additional info:

This is a fairly standard machine. The system has some large XFS file systems,
which are served via Samba.

Comment 1 Daniel Rowe 2006-09-16 10:40:50 UTC

I forgot to say that if I shut Samba down or do not allow clients to access the
shares, the machine seems to work fine. I can browse the file systems, start an X
session, browse the Web and do network stuff. It seems to be just Samba and the
new kernel causing the panic.

Comment 2 Daniel Rowe 2006-09-17 09:24:29 UTC

Created attachment 136474
/proc/cpuinfo

Comment 3 Daniel Rowe 2006-09-17 09:26:48 UTC

Created attachment 136475
/proc/interrupts

Comment 4 Daniel Rowe 2006-09-17 09:31:12 UTC

Created attachment 136476
lsmod lspci /proc/version

Comment 5 Daniel Rowe 2006-09-17 15:12:46 UTC

I have just done some testing, and it is not just Samba traffic that causes the
panic. Any network traffic to a service on the machine causes a panic. For
example, if I access a web page via Apache on this machine, it panics.

I am going to try to get a null modem cable so that I can capture the panic
output.

Comment 6 Allen Kistler 2006-09-18 21:35:14 UTC

I get the panic serving NFS from the 2187 kernel. I can mount and chdir a few
levels into the mount fine, but when I try to list a directory, the server panics.
Clients running the 2187 kernel do not panic.

FWIW the 3001 build at http://people.redhat.com/davej/kernels/Fedora/FC5/ does
not exhibit this problem for me.

Comment 7 Daniel Rowe 2006-09-19 00:59:28 UTC

By any chance, are you using the forcedeth network driver?

http://forums.fedoraforum.org/showthread.php?p=611858#post611858

I am, so it could be this.

Comment 8 Allen Kistler 2006-09-19 04:42:28 UTC

Created attachment 136598
ftpd panic png

Comment 9 Allen Kistler 2006-09-19 04:43:25 UTC

Created attachment 136599
nfsd panic png

Comment 10 Allen Kistler 2006-09-19 04:44:25 UTC

Created attachment 136600
sshd panic png

Comment 11 Allen Kistler 2006-09-19 04:51:46 UTC

These kernel panics occur on a machine using e1000.ko.
I am unable to recreate the panic on machines using 3c59x.ko and e100.ko.

I've attached the PNGs for reference, though the test kernel already appears
to have fixed (or at least addressed) the bug.

Comment 12 davidh 2006-09-19 06:23:09 UTC

I'm the other guy at
http://forums.fedoraforum.org/showthread.php?p=611858#post611858 with the
forcedeth driver. Running 32-bit on an AMD.

Comment 13 davidh 2006-09-19 06:24:32 UTC

Created attachment 136606
cpuinfo

/proc/cpuinfo

Comment 14 davidh 2006-09-19 06:26:22 UTC

Created attachment 136607
interrupts

/proc/interrupts

Comment 15 davidh 2006-09-19 06:27:13 UTC

Created attachment 136608
lspci

Comment 16 Hesty 2006-09-20 06:26:24 UTC

Probably the same problem as my bug 206901. Yes, I'm using forcedeth on one of
my NICs. Running an SMP 32-bit kernel.

Comment 17 Daniel Rowe 2006-09-20 10:55:30 UTC

I have installed the 3001 build from
http://people.redhat.com/davej/kernels/Fedora/FC5/ and I can also confirm that
there seems to be no problem with this kernel.

Comment 18 Phil Anderson 2006-09-24 05:02:06 UTC

Also getting kernel panics late during the boot process (just after samba
starts), using the forcedeth driver.  Couldn't find a newer kernel at davej's
site.... looks like he has taken them down.

Comment 19 Trevor Cordes 2006-09-25 04:08:32 UTC

"Me too".  Upgraded to kernel-smp-2.6.17-1.2187_FC5 (skipped 2174) and have had
3 hard freezes since.  I'm trying to figure out how best to see the panic dumps
(my symptom is X just freezes and I see no panic debug output).

And "me too" on using e1000.  I have an Intel PRO/1000 MT Server NIC with jumbo
frames on.  And my hunch is the bug is network related:

Crash:
#1: can't remember
#2: froze immediately when I released the mouse button to drop some files in
Nautilus on an XP file share.
#3: froze when I opened an ssh window from XP to Linux and ran dmesg: it output
about 2 pages of dmesg and froze.

It doesn't appear to be Samba-specific, so someone should change the summary.  If
everyone here has e1000 (esp the higher end ones with all the advanced features)
then I think we have found the culprit.

I am now running 2174 and it hasn't crashed yet (24 hours, light usage but lots
of NFS and remote MythTV).  I no longer have the previous kernel installed but
will go back to that if 2174 crashes.  If I can capture a panic dump I
will post it here.

Comment 20 Trevor Cordes 2006-09-25 04:18:33 UTC

PS: I don't think I'm using "forcedeth", in fact I have no idea what that is. 
But if it's a mod, lsmod doesn't show it on my system.

Comment 21 Allen Kistler 2006-09-25 21:46:21 UTC

pza wrote in Comment #18

> Couldn't find a newer kernel at davej's
> site.... looks like he has taken them down.

The kernel has gone from unofficial testing to official testing.
It's now at

http://download.fedora.redhat.com/pub/fedora/linux/core/updates/testing/5/

as build 2189.  That means if you find bugs in it, you can file them with
Bugzilla, whereas with unofficial tests you could only complain about it on the
lists.  The announcement is at

https://www.redhat.com/archives/fedora-test-list/2006-September/msg00746.html

P.S.  bart can probably close this bug as "Rawhide" if no one objects.

Comment 22 Matt Castelein 2006-09-27 16:25:10 UTC

I am also having this panic with 2187 and the forcedeth driver on a Sun Ultra 20.

Comment 23 Richard Tresidder 2006-10-04 05:00:44 UTC

I've also tried this kernel and experience a hard lock when any network traffic
goes over my eth0 link which is using the e1000 driver.
I have no problems at all with data traversing my eth1 link which is using the
natsemi driver.
I have reverted to the 2174 kernel for the moment; no problems with that one.

Comment 24 Andrey Petrov 2006-10-04 11:46:01 UTC

I'm experiencing the exact same problem (kernel panic) with 2.6.17-1.2187_FC5 on
an x86_64 box while trying to transfer files over HTTP or SCP.
The funny thing is the machine has two ethernet NICs and the problem occurs only
when using the integrated nvidia controller. Serving files to the other network
(via the second NIC) works fine!

Comment 25 Trevor Cordes 2006-10-04 20:26:00 UTC

This just happened on a friend's system too, when it was upgraded to FC5 2187.  His
system has an onboard e1000.  So this bug is not just limited to high-end MT
e1000 server cards.  If it occurs with all e1000's then this is a very nasty bug
indeed.

As an interesting note, my friend had his box (a router/firewall) unplugged from
the LAN (via e1000) for a few days, though it was still on the WAN (via tulip).
 It was 100% stable until the e1000/LAN was plugged in, then it had about 10
panics in 2 days.

As a follow-up to comment #19, my own system has not crashed at all in over a
week using 2174.  My friend's system also seems stable (for 6 hours so far) with
2174.

If this bug has been closed-rawhide, does that mean an errata will be issued for
FC5?  I can't see any mention of anyone here actually finding the source of this
bug and what the code fixes are.  Has it been fixed upstream?

Comment 26 John Holmstadt 2006-10-04 22:31:10 UTC

"Me too"

We've been setting up all of our new servers with FC5 and things had been very
stable until we updated 2 of our systems with 2187. Both systems frequently
crashed on samba share access, and when samba was shut down, everything appeared
to be stable again (Unfortunately samba is a vital service on these systems). I
pulled the 2174 rpm from our stable system, installed it on the unstable box,
and we've been stable ever since.

All of our systems have at least one e1000 interface in use. My NIC is reported
by lspci as "Intel Corporation 82545GM Gigabit Ethernet Controller (rev 04)".

Comment 27 Trevor Cordes 2006-10-05 02:16:27 UTC

Follow-up to comment #21, where this was set to CLOSED RAWHIDE.  I checked the
2189 link and details and I see no mention of anything related to this bug.  Has
someone tested 2189 or newer and shown this bug to be fixed there?  Does anyone
have any info on the cause/fix?  I still don't see why this bug is closed. 
Everyone with an e1000 is going to get bitten by this one -- I predict a new "me
too" posted every day.

Comment 28 Daniel Rowe 2006-10-05 05:14:24 UTC

Hi

It was closed as Rawhide because the dev kernel 3001 and the kernel in Rawhide
fix the problems with the forcedeth Ethernet driver that cause this lockup. I
have tested 3001 and have been running this kernel for the last few weeks
with no problems at all. The 3001 kernel seems to be very stable.

Regards
Daniel

Comment 29 Trevor Cordes 2006-10-20 10:50:04 UTC

Has anyone here tested 2200 to see if that fixes this bug?  I don't have the
guts to do this on my machines -- they are all production servers.

Comment 30 Daniel Rowe 2006-10-20 13:23:46 UTC

2200 works with no problems at all on my machine. I have been up for 3 days so
far with no problems.

Comment 31 John Holmstadt 2006-10-23 19:29:37 UTC

Just updated one of our newly deployed FC5 systems with .2200 and started up 4
rsync sessions. On .2187 just one session kernel-panicked the server in a matter of
seconds. As of this moment, 2 of the 4 have completed successfully and the other
two are still in progress (fairly large transfer). So from my experience, 2200
fixed the issue.

Comment 32 Richard Tresidder 2006-10-23 23:45:01 UTC

I have also applied the .2200 x86_64 kernel and the system has been running
stable for over 24 hours now. (Under .2187, the e1000 driver or something
associated with it locked up the system.)