Bug 187115
Description
Andrew Benham
2006-03-28 16:22:25 UTC
It's not just FC5. It's just been duplicated on an FC4 box where the kernel has recently been updated to 2.6.16. So suspect the kernel.
Has this been fixed in recent kernels? If not, could you please post the oops backtrace?
Not fixed in recent kernels. I can't readily provide a backtrace as the fault totally hangs the kernel - guess it would have to involve a video capture of the console output.
Ok... how about either a binary tethereal or snoop trace of when this happens... something similar to:
tethereal -w /tmp/data.pcap host <client> ; bzip2 /tmp/data.pcap
Created attachment 134306 [details]
bzip2'd XWD X Window Dump data
Trying to duplicate using an FC5 VMWare guest OS. Attached xwd data is the
result of exporting a directory from the FC5 VMWare guest (2.6.17 kernel) to
a Solaris 9 x86 VMWare guest NFS mounting that directory with
"-vers=2,proto=udp"
NFS options.
This isn't hanging the VMWare guest, but is giving it an oops.
My desktop FC5 is running the SMP kernel, the VMWare guest isn't - this may
be important.
Created attachment 134308 [details]
bzip2'd XWD X Window Dump data
Aha. The hang happens when using an SMP kernel. This is my FC5 VMWare guest
again, but running on 2 processors under VMWare. The desktops I reported the
bug on were dual-processor machines too.
Kernel 2.6.17-1.2174_FC5smp
(was 2.6.17-1.2174_FC5 in previous test)
What should I use to read those dumps? Newer and older versions of ethereal don't seem to understand that format...
Use xwud - the files are screendumps of the console showing the backtrace. Or the gimp. You'll find the first screen dump (from the single processor test) is useless as my shell window overlapped the console window. Pah!
Cool... got them... thanks! Note: the dump in Comment #5 is blocked by a terminal window...
Created attachment 134373 [details]
bzip2'd XWD X Window Dump data
New dump of the oops from running the non-SMP kernel. No terminal window
obscuring the backtrace this time!
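For anyone else unsure what an attachment in this report actually contains, a quick look at the leading bytes can tell a bzip2'd XWD screen dump from a pcap capture. This is only a sketch: the function name sniff_attachment is made up for illustration, and it assumes the XWD header stores its file version (7) as a big-endian 32-bit word at offset 4 and that a pcap file starts with the classic 0xa1b2c3d4 magic in either byte order.

```python
import bz2
import struct

# Assumed constants (not taken from this bug report):
PCAP_MAGICS = (b"\xa1\xb2\xc3\xd4", b"\xd4\xc3\xb2\xa1")  # classic pcap, both byte orders
XWD_FILE_VERSION = 7  # version word in the XWD dump header


def sniff_attachment(raw: bytes) -> str:
    """Guess the type of an attachment from its first few bytes."""
    # bzip2 streams start with the ASCII bytes "BZh"; unpack them first.
    data = bz2.decompress(raw) if raw[:3] == b"BZh" else raw

    if data[:4] in PCAP_MAGICS:
        return "pcap capture"

    # XWD header: 32-bit header size, then 32-bit file version, big-endian.
    if len(data) >= 8 and struct.unpack(">I", data[4:8])[0] == XWD_FILE_VERSION:
        return "XWD screen dump"

    return "unknown"
```

A result of "XWD screen dump" means the file is one for xwud or the gimp rather than ethereal.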
Created attachment 134504 [details]
Snoop capture file - version 2 (Ethernet)
Added a snoop capture of the traffic between the machines.
194.217.90.103 is the FC5 machine holding the home directory for
user benhaman.
194.217.90.121 is a Solaris 9 x86 machine attempting to automount
that directory using NFS v2, proto=UDP.
The FC5 machine is running the SMP kernel, and as a result of this
traffic the kernel panics and the machine hangs.
snoop captures can be read by wireshark 0.99.2, because I'm doing
that here.
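As a rough cross-check that an attachment really is a "version 2 (Ethernet)" snoop capture, the 16-byte file header can be read directly. The sketch below follows the layout in RFC 1761 (an 8-byte "snoop" identification pattern, then a big-endian version number and datalink type); the function name read_snoop_header is hypothetical and only two of the datalink codes are listed.

```python
import struct

SNOOP_MAGIC = b"snoop\x00\x00\x00"  # 8-byte identification pattern (RFC 1761)
# Subset of the RFC 1761 datalink type codes; other codes are reported as unknown.
DATALINK_NAMES = {0: "IEEE 802.3", 4: "Ethernet"}


def read_snoop_header(data: bytes) -> tuple:
    """Return (version, datalink name) from the first 16 bytes of a snoop file."""
    if len(data) < 16 or data[:8] != SNOOP_MAGIC:
        raise ValueError("not a snoop capture")
    # Both fields are 32-bit unsigned integers in network (big-endian) byte order.
    version, datalink = struct.unpack(">II", data[8:16])
    return version, DATALINK_NAMES.get(datalink, "unknown (%d)" % datalink)
```

To use it, pass in the first 16 bytes of the bunzip2'd attachment, for example read_snoop_header(open(path, "rb").read(16)), where path is wherever the capture was saved.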
From the snoop trace, it appears the remote quota query that the Solaris box sends is never responded to... so I'm thinking that could be the problem. FC5 is not handling those messages very well... To see if this is the case, kill rpc.rquotad or edit /etc/init.d/nfs to not start rpc.rquotad, and then have the Solaris machine try the mount.
(In reply to comment #14)
> From the snoop trace, it appears the remote quota query that the Solaris box
> sends is never responded to... so I'm thinking that could be the problem.
> FC5 is not handling those messages very well... To see if this is
> the case, kill rpc.rquotad or edit /etc/init.d/nfs to not start rpc.rquotad,
> and then have the Solaris machine try the mount.
OK. I can try that.
It's worth mentioning here that if I remove the "-vers=2,proto=udp" NFS mount options from the Solaris machine, then there's no kernel oops, panics, etc. So if we use TCP for NFS it's OK, UDP for NFS is bad.
Created attachment 134528 [details]
Snoop capture file - version 2 (Ethernet)
No rpc.rquotad running on the FC5 box. Portmapper replies OK to this effect.
Comment 16 should have stated that the kernel still panics. In case you were about to ask: ~/.bash_profile and ~/.bashrc files are the bog standard copies from /etc/skel, and /etc/bashrc is the unmodified file from the setup rpm.
Created attachment 134529 [details]
Snoop capture file - version 2 (Ethernet)
Just for comparison purposes, this trace is from where I commented out the
"-vers=2,proto=udp" NFS mount parameters on the Solaris machine. NFS uses TCP
and everything (including rpc.rquotad) works OK - no kernel panic.
Created attachment 134700 [details]
A patch turning off ACL support for v2
Ok... I'm pretty sure I know what the problem is... The capture file from Comment #6 shows Solaris sending inquiries about ACL support for version 2 of the NFS protocol. Unfortunately, the server is saying yes but the answer should be no. So will just the patch work, or would you like me to supply some test kernels? If so, which machine architectures will be needed?
I'll rebuild a kernel when I'm back from holiday.
Okay... have a nice holiday and thanks for all your help!
Created attachment 135484 [details]
Snoop capture file - version 2 (Ethernet)
That patch didn't help. Wireshark no longer shows the NFSACL traffic, so we
know that I built the kernel OK (!), but the kernel still panics.
I've also tried with rpc.rquotad not running; the kernel still panics.
A new kernel update has been released (Version: 2.6.18-1.2200.fc5) based upon a new upstream kernel release. Please retest against this new kernel, as a large number of patches go into each upstream release, possibly including changes that may address this problem.
This bug has been placed in NEEDINFO state. Due to the large volume of inactive bugs in bugzilla, if this bug is still in this state in two weeks' time, it will be closed.
Should this bug still be relevant after this period, the reporter can reopen the bug at any time. Any other users on the Cc: list of this bug can request that the bug be reopened by adding a comment to the bug.
In the last few updates, some users upgrading from FC4->FC5 have reported that installing a kernel update has left their systems unbootable. If you have been affected by this problem please check you only have one version of device-mapper & lvm2 installed. See bug 207474 for further details.
If this bug is a problem preventing you from installing the release this version is filed against, please see bug 169613.
If this bug has been fixed, but you are now experiencing a different problem, please file a separate bug for the new problem.
Thank you.
Created attachment 143014 [details]
This patch addresses the issue that is causing this oops.
(this is a mass-close to kernel bugs in NEEDINFO state)
As indicated previously there has been no update on the progress of this bug therefore I am closing it as INSUFFICIENT_DATA. Please re-open if the issue still occurs for you and I will try to assist in its resolution. Thank you for taking the time to report the initial bug.
If you believe that this bug was closed in error, please feel free to reopen this bug.