187115 – NFS server panics kernel

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	5
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Steve Dickson
QA Contact:	Ben Levenson
Docs Contact:
URL:
Whiteboard:	MassClosed
Depends On:
Blocks:	218726

Reported:	2006-03-28 16:22 UTC by Andrew Benham
Modified:	2008-01-20 04:40 UTC
CC List:	1 user
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2008-01-20 04:40:35 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
bzip2'd XWD X Window Dump data (15.98 KB, application/x-bzip2) 2006-08-16 13:39 UTC, Andrew Benham	no flags
bzip2'd XWD X Window Dump data (10.42 KB, application/x-bzip2) 2006-08-16 14:09 UTC, Andrew Benham	no flags
bzip2'd XWD X Window Dump data (25.57 KB, application/x-bzip) 2006-08-17 09:40 UTC, Andrew Benham	no flags
Snoop capture file - version 2 (Ethernet) (5.10 KB, application/octet-stream) 2006-08-19 10:36 UTC, Andrew Benham	no flags
Snoop capture file - version 2 (Ethernet) (5.98 KB, application/octet-stream) 2006-08-20 13:02 UTC, Andrew Benham	no flags
Snoop capture file - version 2 (Ethernet) (12.20 KB, application/octet-stream) 2006-08-20 13:31 UTC, Andrew Benham	no flags
A patch turning off ACL support for v2 (1.58 KB, text/x-patch) 2006-08-23 10:23 UTC, Steve Dickson	no flags
Snoop capture file - version 2 (Ethernet) (5.48 KB, application/octet-stream) 2006-09-04 09:39 UTC, Andrew Benham	no flags
This patch address the issue that is causing this oops. (1.94 KB, patch) 2006-12-07 00:58 UTC, Steve Dickson	no flags

Description Andrew Benham 2006-03-28 16:22:25 UTC

Description of problem:

When connecting to a Solaris 8 sparc server from an FC5 desktop,
and automounting the user's home directory from the FC5 desktop,
then NFS settings which worked on FC4 cause the FC5 kernel to panic.


Version-Release number of selected component (if applicable):
nfs-utils-1.0.8.rc2-4.FC5.2

How reproducible:
Every time

Steps to Reproduce:
1. Configure a Solaris 8 sparc machine to automount a user's home
   directory from an FC5 machine, using the automount NFS parameters:
   -vers=2,proto=udp,rsize=4096,wsize=4096
2. From the FC5 machine, get that user to log on to the Solaris 8 machine.
  
Actual results:
FC5 machine kernel panics.


Expected results:
Successful mount, or error.

Additional info:
Removing the automount NFS parameters on the Solaris machine fixes this
problem.  But we shouldn't get a kernel panic.
The parameters were probably added to the Solaris machine to get around
NFS interworking problems with previous Red Hat clients.
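
To reproduce without the automounter, the same options can be passed to a manual NFS mount on the Solaris side. The following is a minimal sketch, not something taken from this report: the host name `fc5host`, the export path and the mount point are placeholders, and the helper name `nfs_mount_cmd` is hypothetical. The function only prints the command so it can be reviewed before being run as root on the Solaris machine.

```shell
#!/bin/sh
# Print (do not run) the Solaris mount command equivalent to the automount entry.
# Solaris syntax: mount -F nfs -o <options> <server>:<path> <mountpoint>
# All arguments are placeholders, not values taken from the bug report.
nfs_mount_cmd() {
    server=$1
    export_path=$2
    mountpoint=$3
    # Default to the options quoted in the steps to reproduce.
    opts=${4:-vers=2,proto=udp,rsize=4096,wsize=4096}
    printf 'mount -F nfs -o %s %s:%s %s\n' "$opts" "$server" "$export_path" "$mountpoint"
}

nfs_mount_cmd fc5host /home/someuser /mnt
```

Passing a fourth argument such as `vers=2,proto=udp` overrides the default option string, which helps when narrowing down which option triggers the panic.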

Comment 1 Andrew Benham 2006-04-10 13:58:46 UTC

It's not just FC5.  It's just been duplicated on an FC4 box where the kernel has
recently been updated to 2.6.16.  So suspect the kernel.

Comment 2 Steve Dickson 2006-07-24 18:10:35 UTC

Has this been fixed in recent kernels? If not, could you please post
the oops backtrace?

Comment 3 Andrew Benham 2006-07-28 14:01:08 UTC

Not fixed in recent kernels.

I can't readily provide a backtrace as the fault totally hangs the kernel - guess
it would have to involve a video capture of the console output.

Comment 4 Steve Dickson 2006-07-28 17:47:22 UTC

Ok... how about either a binary tethereal or snoop trace of
when this happens... something similar to:
   tethereal -w /tmp/data.pcap host <client> ; bzip2 /tmp/data.pcap
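
If the capture is taken on the Solaris side instead, snoop can write a binary capture file in much the same way. A rough sketch, assuming snoop's `-o` option for the output file and a `host` filter; the helper name `capture_cmd`, the peer name `fc5host` and the output paths are placeholders. The function only prints the command to run.

```shell
#!/bin/sh
# Print (do not run) a packet-capture command for the given tool and peer host.
# snoop uses -o to write packets to a file; tethereal uses -w.
# The peer host name and output path are placeholders.
capture_cmd() {
    tool=$1
    peer=$2
    out=$3
    case $tool in
        snoop)     printf 'snoop -o %s host %s\n' "$out" "$peer" ;;
        tethereal) printf 'tethereal -w %s host %s\n' "$out" "$peer" ;;
        *)         echo "unknown tool: $tool" >&2; return 1 ;;
    esac
}

capture_cmd snoop fc5host /tmp/data.snoop
```

Either capture file can then be compressed with bzip2 before attaching it.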

Comment 5 Andrew Benham 2006-08-16 13:39:53 UTC

Created attachment 134306
bzip2'd XWD X Window Dump data

Trying to duplicate using an FC5 VMWare guest OS.  Attached xwd data is the
result of exporting a directory from the FC5 VMWare guest (2.6.17 kernel) to
a Solaris 9 x86 VMWare guest NFS mounting that directory with
     "-vers=2,proto=udp"
NFS options.

This isn't hanging the VMWare guest, but is giving it an oops.

My desktop FC5 is running the SMP kernel, the VMWare guest isn't - this may
be important.

Comment 6 Andrew Benham 2006-08-16 14:09:46 UTC

Created attachment 134308
bzip2'd XWD X Window Dump data

Aha.  The hang happens when using an SMP kernel.  This is my FC5 VMWare guest
again, but running on 2 processors under VMWare.  The desktops I reported the
bug on were dual-processor machines too.

Kernel 2.6.17-1.2174_FC5smp

(was 2.6.17-1.2174_FC5 in previous test)

Comment 7 Steve Dickson 2006-08-16 15:48:37 UTC

What should I use to read those dumps? Newer and older versions
of ethereal don't seem to understand that format...

Comment 8 Andrew Benham 2006-08-16 15:52:34 UTC

Use xwud - the files are screendumps of the console showing the backtrace.

Comment 9 Andrew Benham 2006-08-16 15:56:59 UTC

Or the gimp.

You'll find the first screen dump (from the single processor test) is useless
as my shell window overlapped the console window. Pah!

Comment 10 Steve Dickson 2006-08-16 16:00:04 UTC

cool... got them... thanks!

Comment 11 Steve Dickson 2006-08-16 16:02:59 UTC

Note: the dump in Comment #5 is blocked by a terminal window...

Comment 12 Andrew Benham 2006-08-17 09:40:10 UTC

Created attachment 134373
bzip2'd XWD X Window Dump data

New dump of the oops from running the non-SMP kernel.  No terminal window
obscuring the backtrace this time!

Comment 13 Andrew Benham 2006-08-19 10:36:54 UTC

Created attachment 134504
Snoop capture file - version 2 (Ethernet)

Added a snoop capture of the traffic between the machines.

194.217.90.103 is the FC5 machine holding the home directory for
user benhaman.
194.217.90.121 is a Solaris 9 x86 machine attempting to automount
that directory using NFS v2, proto=UDP.

The FC5 machine is running the SMP kernel, and as a result of this
traffic the kernel panics and the machine hangs.

snoop captures can be read by wireshark 0.99.2, because I'm doing
that here.

Comment 14 Steve Dickson 2006-08-19 12:28:55 UTC

From the snoop trace, it appears the remote quota query that the Solaris box
sends is never responded to... so I'm thinking that could be the problem:
FC5 is not handling those messages very well. To see if this is
the case, kill rpc.rquotad or edit /etc/init.d/nfs to not start rpc.rquotad,
and then have the Solaris machine try the mount.
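
Before retrying the mount, it may help to confirm that rquotad is really gone from the portmapper's table. A small sketch that filters `rpcinfo -p` output for an RPC program number (100011 is the registered number for rquotad). The helper name `rpc_registered` is hypothetical, and the sample table is illustrative, not output from either machine in this report.

```shell
#!/bin/sh
# Succeed if the given RPC program number appears in `rpcinfo -p` output on stdin.
# 100011 is the registered RPC program number for rquotad.
rpc_registered() {
    awk -v prog="$1" '$1 == prog { found = 1 } END { exit !found }'
}

# Illustrative sample in the `rpcinfo -p` column layout (not from this bug).
sample='   program vers proto   port
    100000    2   udp    111  portmapper
    100003    2   udp   2049  nfs
    100005    1   udp    892  mountd'

if printf '%s\n' "$sample" | rpc_registered 100011; then
    echo "rquotad registered"
else
    echo "rquotad not registered"
fi
```

On the real server the same check would read `rpcinfo -p localhost | rpc_registered 100011`.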

Comment 15 Andrew Benham 2006-08-20 10:26:17 UTC

(In reply to comment #14)
> From the snoop trace, it appears the remote quota query that the Solaris box
> sends is never responded to... so I'm thinking that could be the problem:
> FC5 is not handling those messages very well. To see if this is
> the case, kill rpc.rquotad or edit /etc/init.d/nfs to not start rpc.rquotad,
> and then have the Solaris machine try the mount.

OK. I can try that.  It's worth mentioning here that if I remove the
"-vers=2,proto=udp" NFS mount options from the Solaris machine, then
there's no kernel oops, panics, etc.  So if we use TCP for NFS it's OK,
UDP for NFS is bad.

Comment 16 Andrew Benham 2006-08-20 13:02:44 UTC

Created attachment 134528
Snoop capture file - version 2 (Ethernet)

No rpc.rquotad running on the FC5 box.  Portmapper replies OK to this effect.

Comment 17 Andrew Benham 2006-08-20 13:20:49 UTC

Comment 16 should have stated that the kernel still panics.

In case you were about to ask: ~/.bash_profile and ~/.bashrc files are the bog
standard copies from /etc/skel, and /etc/bashrc is the unmodified file from the
setup rpm.

Comment 18 Andrew Benham 2006-08-20 13:31:44 UTC

Created attachment 134529
Snoop capture file - version 2 (Ethernet)

Just for comparison purposes, this trace is from where I commented out the
"-vers=2,proto=udp" NFS mount parameters on the Solaris machine.  NFS uses TCP
and everything (including rquotad) works OK - no kernel panic.

Comment 19 Steve Dickson 2006-08-23 10:23:40 UTC

Created attachment 134700
A patch turning off ACL support for v2

Ok... I'm pretty sure I know what the problem is... the capture file from
Comment #6 shows Solaris sending inquiries about ACL support for version 2 of
the NFS protocol. Unfortunately, the server is saying yes but the answer
should be no.

So will just the patch work, or would you like me to supply some test kernels?
If so, which machine architectures will be needed?
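
One way to see which ACL versions the server advertises is to list what is registered with the portmapper under RPC program 100227, the number used by the NFS ACL side protocol. A hedged sketch: the helper name `rpc_versions` is hypothetical, the sample table is illustrative only, and whether the attached patch changes the portmapper registration is an assumption that has not been verified here.

```shell
#!/bin/sh
# Print "version/proto" for each registration of one RPC program,
# reading `rpcinfo -p` output on stdin.
# 100227 is the RPC program number used by the NFS ACL side protocol.
rpc_versions() {
    awk -v prog="$1" '$1 == prog { print $2 "/" $3 }'
}

# Illustrative sample in the `rpcinfo -p` column layout (not from this bug).
sample='   program vers proto   port
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100227    2   udp   2049  nfs_acl
    100227    3   udp   2049  nfs_acl'

printf '%s\n' "$sample" | rpc_versions 100227
```

Against a live server the input would come from `rpcinfo -p <server>`; a `2/udp` line for program 100227 would mean the server still advertises ACL support for NFS v2 over UDP.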

Comment 20 Andrew Benham 2006-08-23 10:56:37 UTC

I'll rebuild a kernel when I'm back from holiday

Comment 21 Steve Dickson 2006-08-23 13:18:02 UTC

Okay.... have a nice holiday and thanks for all your help!

Comment 22 Andrew Benham 2006-09-04 09:39:10 UTC

Created attachment 135484
Snoop capture file - version 2 (Ethernet)

That patch didn't help.  Wireshark no longer shows the NFSACL traffic, so we
know that I built the kernel OK (!), but the kernel still panics.

I've also tried with rpc.rquotad not running; the kernel still panics.

Comment 23 Dave Jones 2006-10-16 18:55:01 UTC

A new kernel update has been released (Version: 2.6.18-1.2200.fc5)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

In the last few updates, some users upgrading from FC4->FC5
have reported that installing a kernel update has left their
systems unbootable. If you have been affected by this problem
please check you only have one version of device-mapper & lvm2
installed.  See bug 207474 for further details.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different
problem, please file a separate bug for the new problem.

Thank you.

Comment 24 Steve Dickson 2006-12-07 00:58:54 UTC

Created attachment 143014
This patch address the issue that is causing this oops.

Comment 25 Jon Stanley 2008-01-20 04:40:35 UTC

(this is a mass-close to kernel bugs in NEEDINFO state)

As indicated previously there has been no update on the progress of this bug
therefore I am closing it as INSUFFICIENT_DATA. Please re-open if the issue
still occurs for you and I will try to assist in its resolution. Thank you for
taking the time to report the initial bug.

If you believe that this bug was closed in error, please feel free to reopen
this bug.
