Bug 218726

Summary:	NFS server panics kernel
Product:	[Fedora] Fedora	Reporter:	Steve Dickson <steved>
Component:	kernel	Assignee:	Steve Dickson <steved>
Status:	CLOSED RAWHIDE	QA Contact:	Brian Brock <bbrock>
Severity:	high	Docs Contact:
Priority:	medium
Version:	6	CC:	davej, wtogami
Target Milestone:	---
Target Release:	---
Hardware:	i386
OS:	Linux
Whiteboard:	NeedsRetesting
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2007-03-09 18:55:05 UTC	Type:	---
Bug Depends On:	187115
Bug Blocks:

Description Steve Dickson 2006-12-07 00:59:54 UTC

+++ This bug was initially created as a clone of Bug #187115 +++

Description of problem:

When connecting to a Solaris 8 sparc server from an FC5 desktop,
and automounting the user's home directory from the FC5 desktop,
then NFS settings which worked on FC4 cause the FC5 kernel to panic.


Version-Release number of selected component (if applicable):
nfs-utils-1.0.8.rc2-4.FC5.2

How reproducible:
Every time

Steps to Reproduce:
1. Configure a Solaris 8 sparc machine to automount a user's home
   directory from an FC5 machine, using the automount NFS parameters:
   -vers=2,proto=udp,rsize=4096,wsize=4096
2. From the FC5 machine, get that user to logon to the Solaris 8 machine
  
Actual results:
FC5 machine kernel panics.


Expected results:
Successful mount, or error.

Additional info:
Removing the automount NFS parameters on the Solaris machine fixes this
problem.  But we shouldn't get a kernel panic.
The parameters were probably added to the Solaris machine to get around 
NFS interworking problems with previous RedHat clients.

-- Additional comment from andrew.benham on 2006-04-10 09:58 EST --
It's not just FC5.  It's just been duplicated on an FC4 box where the kernel has
recently been updated to 2.6.16.  So suspect the kernel.


-- Additional comment from steved on 2006-07-24 14:10 EST --
Has this been fixed in recent kernels? If not, could you please post
the oops backtrace?

-- Additional comment from andrew.benham on 2006-07-28 10:01 EST --
Not fixed in recent kernels.

I can't readily provide a backtrace as the fault totally hangs the kernel. I guess
getting one would have to involve a video capture of the console output.

-- Additional comment from steved on 2006-07-28 13:47 EST --
Ok... how about either a binary tethereal or snoop trace of
when this happens... something similar to:
   tethereal -w /tmp/data.pcap host <client> ; bzip2 /tmp/data.pcap



-- Additional comment from andrew.benham on 2006-08-16 09:39 EST --
Created an attachment (id=134306)
bzip2'd XWD X Window Dump data

Trying to duplicate using an FC5 VMWare guest OS.  Attached xwd data is the
result of exporting a directory from the FC5 VMWare guest (2.6.17 kernel) to
a Solaris 9 x86 VMWare guest NFS mounting that directory with
     "-vers=2,proto=udp"
NFS options.

This isn't hanging the VMWare guest, but is giving it an oops.

My desktop FC5 is running the SMP kernel, the VMWare guest isn't - this may
be important.

-- Additional comment from andrew.benham on 2006-08-16 10:09 EST --
Created an attachment (id=134308)
bzip2'd XWD X Window Dump data

Aha.  The hang happens when using an SMP kernel.  This is my FC5 VMWare guest
again, but running on 2 processors under VMWare.  The desktops I reported the
bug on were dual-processor machines too.

Kernel 2.6.17-1.2174_FC5smp

(was 2.6.17-1.2174_FC5 in previous test)

-- Additional comment from steved on 2006-08-16 11:48 EST --
What should I use to read those dumps? Newer and older versions
of ethereal don't seem to understand that format...

-- Additional comment from andrew.benham on 2006-08-16 11:52 EST --
Use xwud - the files are screendumps of the console showing the backtrace.


-- Additional comment from andrew.benham on 2006-08-16 11:56 EST --
Or the gimp.

You'll find the first screen dump (from the single processor test) is useless
as my shell window overlapped the console window. Pah!


-- Additional comment from steved on 2006-08-16 12:00 EST --
cool... got them... thanks!


-- Additional comment from steved on 2006-08-16 12:02 EST --
Note: the dump in Comment #5 is blocked by a terminal window...

-- Additional comment from andrew.benham on 2006-08-17 05:40 EST --
Created an attachment (id=134373)
bzip2'd XWD X Window Dump data

New dump of the oops from running the non-SMP kernel.  No terminal window
obscuring the backtrace this time!

-- Additional comment from andrew.benham on 2006-08-19 06:36 EST --
Created an attachment (id=134504)
Snoop capture file - version 2 (Ethernet)

Added a snoop capture of the traffic between the machines.

194.217.90.103 is the FC5 machine holding the home directory for
user benhaman.
194.217.90.121 is a Solaris 9 x86 machine attempting to automount
that directory using NFS v2, proto=UDP.

The FC5 machine is running the SMP kernel, and as a result of this
traffic the kernel panics and the machine hangs.

snoop captures can be read by wireshark 0.99.2, because I'm doing
that here.

-- Additional comment from steved on 2006-08-19 08:28 EST --
From the snoop trace, it appears that the remote quota query the Solaris box
sends is never responded to... so I'm thinking that could be the problem.
FC5 is not handling those messages very well... To see if this is
the case, kill rpc.rquotad or edit /etc/init.d/nfs to not start rpc.rquotad,
and then have the Solaris machine try the mount.
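
Before retrying the mount, it may help to confirm that rquotad is really gone from the portmapper. The following is a minimal sketch, not a tested procedure: the function name rquotad_registered is made up for illustration, and it assumes the usual column layout of "rpcinfo -p" output (program, version, protocol, port, service). 100011 is the RPC program number assigned to rquotad.

```shell
# Sketch only: exits 0 if the portmapper listing on stdin contains rquotad
# (RPC program number 100011), and exits 1 otherwise.
rquotad_registered() {
    awk '$1 == 100011 { found = 1 } END { exit !found }'
}

# Intended use on the FC5 server (needs rpcinfo and a running portmapper):
#   rpcinfo -p localhost | rquotad_registered && echo "rquotad is still registered"
```

If the function still reports a registration after the daemon has been killed, the portmapper entry is probably stale and the test would not be conclusive.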

-- Additional comment from andrew.benham on 2006-08-20 06:26 EST --
(In reply to comment #14)
> From the snoop trace, it appears that the remote quota query the Solaris box
> sends is never responded to... so I'm thinking that could be the problem.
> FC5 is not handling those messages very well... To see if this is
> the case, kill rpc.rquotad or edit /etc/init.d/nfs to not start rpc.rquotad,
> and then have the Solaris machine try the mount.

OK. I can try that.  It's worth mentioning here that if I remove the
"-vers=2,proto=udp" NFS mount options from the Solaris machine, then
there's no kernel oops, panics, etc.  So if we use TCP for NFS it's OK,
UDP for NFS is bad.


-- Additional comment from andrew.benham on 2006-08-20 09:02 EST --
Created an attachment (id=134528)
Snoop capture file - version 2 (Ethernet)

No rpc.rquotad running on the FC5 box.  Portmapper replies OK to this effect.


-- Additional comment from andrew.benham on 2006-08-20 09:20 EST --
Comment 16 should have stated that the kernel still panics.

In case you were about to ask: ~/.bash_profile and ~/.bashrc files are the bog
standard copies from /etc/skel, and /etc/bashrc is the unmodified file from the
setup rpm.


-- Additional comment from andrew.benham on 2006-08-20 09:31 EST --
Created an attachment (id=134529)
Snoop capture file - version 2 (Ethernet)

Just for comparison purposes, this trace is from where I commented out the
"-vers=2,proto=udp" NFS mount parameters on the Solaris machine.  NFS uses TCP
and everything (including rquotad) works OK - no kernel panic.


-- Additional comment from steved on 2006-08-23 06:23 EST --
Created an attachment (id=134700)
A patch turning off ACL support for v2

Ok... I'm pretty sure I know what the problem is... the capture file from
Comment #6 shows Solaris sending inquiries about ACL support for version 2
of the NFS protocol. Unfortunately, the server is saying yes but the answer
should be no.

So will just the patch work, or would you like me to supply some test kernels?
If so, which machine architectures will be needed?
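
One way to see what the server advertises for the ACL side protocol is to list its nfs_acl registrations. The following is a minimal sketch under assumptions: the function name nfsacl_versions is made up for illustration, it assumes the server registers nfs_acl (RPC program number 100227) with the portmapper, and it assumes the usual "rpcinfo -p" column layout. Whether the patched kernel drops the version 2 registration is not confirmed here.

```shell
# Sketch only: print "version protocol" for every nfs_acl registration
# (RPC program number 100227) found in a portmapper listing on stdin.
nfsacl_versions() {
    awk '$1 == 100227 { print $2, $3 }'
}

# Intended use from any machine that can reach the server's portmapper:
#   rpcinfo -p <server> | nfsacl_versions
```

A line starting with "2" in the output would suggest the server still claims ACL support for NFS version 2.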

-- Additional comment from andrew.benham on 2006-08-23 06:56 EST --
I'll rebuild a kernel when I'm back from holiday

-- Additional comment from steved on 2006-08-23 09:18 EST --
Okay.... have a nice holiday and thanks for all your help!



-- Additional comment from andrew.benham on 2006-09-04 05:39 EST --
Created an attachment (id=135484)
Snoop capture file - version 2 (Ethernet)

That patch didn't help.  Wireshark no longer shows the NFSACL traffic, so we
know that I built the kernel OK (!), but the kernel still panics.

I've also tried with rpc.rquotad not running, kernel still panics.

-- Additional comment from davej on 2006-10-16 14:55 EST --

A new kernel update has been released (Version: 2.6.18-1.2200.fc5)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

In the last few updates, some users upgrading from FC4->FC5
have reported that installing a kernel update has left their
systems unbootable. If you have been affected by this problem
please check you only have one version of device-mapper & lvm2
installed.  See bug 207474 for further details.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different
problem, please file a separate bug for the new problem.

Thank you.

-- Additional comment from steved on 2006-12-06 19:58 EST --
Created an attachment (id=143014)
This patch addresses the issue that is causing this oops.

Comment 2 Steve Dickson 2007-03-09 18:55:05 UTC

I believe this has been fixed in the 2.6.19 kernel...