430279 – Kernel oops w/ nfs mounts, particularly w/ bash auto-completion

Summary: Kernel oops w/ nfs mounts, particularly w/ bash auto-completion

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	8
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Jeff Layton
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2008-01-25 19:22 UTC by Stephen Warren
Modified:	2014-06-18 07:37 UTC
CC List:	1 user
Fixed In Version:	2.6.24.3-12
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-03-12 19:51:54 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
Re-diffed (vs. 2.6.23) patch from comment #7. (473 bytes, text/x-patch) 2008-02-25 21:27 UTC, Stephen Warren

Description Stephen Warren 2008-01-25 19:22:27 UTC

Description of problem:
kernel BUG at fs/nfs/namespace.c:108!

Version-Release number of selected component (if applicable):
kernel-2.6.23.14-107.fc8.i686

How reproducible:
Often

Steps to Reproduce:

I have an NFS share mounted as follows (/etc/fstab entry):
  netapp39:/vol/projects1/builds  /mnt/builds  nfs  rsize=32768,wsize=32768,soft,intr  0 0

If I then type "cd /mnt/builds" and hit TAB a few times in bash to perform
completion, the kernel often gives me an Oops message (shown below), and kills
my login/ssh/...
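
The call trace below enters the kernel through sys_stat64 from the bash process, which is consistent with completion listing the directory and then calling stat() on the candidates. A minimal sketch of that access pattern follows, assuming that this is roughly what completion does; the helper name `stat_entries` is hypothetical, and the demo runs against a throwaway local directory rather than the NFS mount.

```python
import os
import stat
import tempfile


def stat_entries(directory):
    """Stat every entry in `directory`, roughly as filename completion might.

    Returns a dict mapping each entry name to "dir", "file", "other",
    or an "error: ..." string if the stat() call failed.
    """
    results = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        try:
            # Each stat() makes the kernel walk the path; on an NFS mount
            # that walk goes through the NFS client code.
            mode = os.stat(path).st_mode
        except OSError as exc:
            results[name] = "error: " + str(exc)
            continue
        if stat.S_ISDIR(mode):
            results[name] = "dir"
        elif stat.S_ISREG(mode):
            results[name] = "file"
        else:
            results[name] = "other"
    return results


if __name__ == "__main__":
    # Local demonstration only; point it at a mount such as /mnt/builds
    # to exercise the same readdir-then-stat pattern over NFS.
    with tempfile.TemporaryDirectory() as tmp:
        os.mkdir(os.path.join(tmp, "subdir"))
        open(os.path.join(tmp, "notes.txt"), "w").close()
        print(stat_entries(tmp))
```

Running something like this in a loop against the affected mount might help reproduce the problem without interactive TAB presses, though that is untested here.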

Additional info:

Jan 25 12:10:25 swarren-lx1 kernel: kernel BUG at fs/nfs/namespace.c:108!
Jan 25 12:10:25 swarren-lx1 kernel: invalid opcode: 0000 [#1] SMP
Jan 25 12:10:25 swarren-lx1 kernel: Modules linked in: nvidia(P)(U) nfsd exportfs auth_rpcgss rfcomm l2cap bluetooth autofs4 nfs lockd nfs_acl sunrpc loop dm_multipath ipv6 snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device firewire_ohci snd_pcm_oss firewire_core forcedeth k8temp crc_itu_t hwmon i2c_nforce2 snd_mixer_oss snd_pcm i2c_core snd_timer snd_page_alloc serio_raw button snd_hwdep snd soundcore parport_pc pcspkr parport sr_mod cdrom sg floppy pata_amd dm_snapshot dm_zero dm_mirror dm_mod sata_nv ata_generic libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd
Jan 25 12:10:25 swarren-lx1 kernel: CPU:    0
Jan 25 12:10:25 swarren-lx1 kernel: EIP:    0060:[<f9177509>]    Tainted: P       VLI
Jan 25 12:10:25 swarren-lx1 kernel: EFLAGS: 00210246   (2.6.23.14-107.fc8 #1)
Jan 25 12:10:25 swarren-lx1 kernel: EIP is at nfs_follow_mountpoint+0x37/0x34a [nfs]
Jan 25 12:10:25 swarren-lx1 kernel: eax: f7dabc00   ebx: e49e2cc0   ecx: f918b7e0   edx: f5ef8f04
Jan 25 12:10:25 swarren-lx1 kernel: esi: f5ef8f04   edi: 00000000   ebp: da04a900   esp: f5ef8cb0
Jan 25 12:10:25 swarren-lx1 kernel: ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
Jan 25 12:10:25 swarren-lx1 kernel: Process bash (pid: 4124, ti=f5ef8000 task=f337cc20 task.ti=f5ef8000)
Jan 25 12:10:25 swarren-lx1 kernel: Stack: f919c240 00001000 c06f7340 f6c79ee0 f7385a80 00000000 c0430006 da04a300
Jan 25 12:10:25 swarren-lx1 kernel:        da04a348 f5ef8000 c8402b0a 00000781 f337cc20 00000002 000041ff 00000009
Jan 25 12:10:25 swarren-lx1 kernel:        00000000 00000000 00001000 00000000 00001000 00000000 00000000 f5ef8f38
Jan 25 12:10:25 swarren-lx1 kernel: Call Trace:
Jan 25 12:10:25 swarren-lx1 kernel:  [<c0430006>] do_exit+0x37/0x6fc
Jan 25 12:10:25 swarren-lx1 kernel:  [<f9179dca>] nfs3_decode_dirent+0x1b/0x163 [nfs]
Jan 25 12:10:25 swarren-lx1 kernel:  [<c046bd8e>] page_address+0x78/0x98
Jan 25 12:10:25 swarren-lx1 kernel:  [<f8b603e6>] rpcauth_lookup_credcache+0x4c/0x183 [sunrpc]
Jan 25 12:10:25 swarren-lx1 kernel:  [<f916bdf9>] nfs_access_get_cached+0x97/0xd7 [nfs]
Jan 25 12:10:25 swarren-lx1 kernel:  [<f8b5ff5c>] put_rpccred+0x2c/0xc0 [sunrpc]
Jan 25 12:10:25 swarren-lx1 kernel:  [<f916bfc9>] nfs_permission+0x190/0x19c [nfs]
Jan 25 12:10:25 swarren-lx1 kernel:  [<c0490198>] dput+0x30/0xd7
Jan 25 12:10:25 swarren-lx1 kernel:  [<c048746f>] __follow_mount+0x1e/0x60
Jan 25 12:10:25 swarren-lx1 kernel:  [<c04875c0>] do_lookup+0x4f/0x140
Jan 25 12:10:25 swarren-lx1 kernel:  [<c048950d>] __link_path_walk+0x8c5/0xbaf
Jan 25 12:10:25 swarren-lx1 kernel:  [<c05461e3>] n_tty_receive_buf+0xc77/0xcc3
Jan 25 12:10:25 swarren-lx1 kernel:  [<c048983b>] link_path_walk+0x44/0xb3
Jan 25 12:10:25 swarren-lx1 kernel:  [<c0489b23>] do_path_lookup+0x162/0x1c7
Jan 25 12:10:25 swarren-lx1 kernel:  [<c048896d>] getname+0x59/0xad
Jan 25 12:10:25 swarren-lx1 kernel:  [<c048a2f7>] __user_walk_fd+0x2f/0x40
Jan 25 12:10:25 swarren-lx1 kernel:  [<c0483f37>] vfs_stat_fd+0x19/0x40
Jan 25 12:10:25 swarren-lx1 kernel:  [<c0483feb>] sys_stat64+0xf/0x23
Jan 25 12:10:25 swarren-lx1 kernel:  [<c0459322>] audit_syscall_exit+0x2aa/0x2c6
Jan 25 12:10:25 swarren-lx1 kernel:  [<c045904e>] audit_syscall_entry+0x10d/0x137
Jan 25 12:10:25 swarren-lx1 kernel:  [<c0407f4d>] do_syscall_trace+0xd7/0xde
Jan 25 12:10:25 swarren-lx1 kernel:  [<c040518a>] syscall_call+0x7/0xb
Jan 25 12:10:25 swarren-lx1 kernel:  =======================
Jan 25 12:10:25 swarren-lx1 kernel: Code: 00 00 8b 40 0c f6 05 8c c5 b7 f8 01 8b 80 9c 00 00 00 8b a8 64 01 00 00 74 0c c7 04 24 d3 e9 18 f9 e8 ef 69 2b c7 3b 5b 18 75 04 <0f> 0b eb fe f6 05 8c c5 b7 f8 01 74 14 c7 44 24 04 98 b8 18 f9
Jan 25 12:10:25 swarren-lx1 kernel: EIP: [<f9177509>] nfs_follow_mountpoint+0x37/0x34a [nfs] SS:ESP 0068:f5ef8cb0
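
For triage, it can help to pull the `[<address>] symbol+0xoffset/0xsize [module]` entries out of an oops like the one above and see which frames belong to the nfs module. The following is a rough sketch of such a parser; the names `FRAME_RE` and `parse_frames` are hypothetical, and the pattern is only an assumption based on the frame format visible in this log, not a general oops parser.

```python
import re

# Matches entries such as "[<f916bfc9>] nfs_permission+0x190/0x19c [nfs]".
# The trailing "[module]" part is optional; \s+ also tolerates frames that
# were wrapped across two lines when the log was pasted.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(log_text):
    """Return one dict per stack frame found in `log_text`, in log order."""
    frames = []
    for match in FRAME_RE.finditer(log_text):
        frames.append({
            "addr": int(match.group("addr"), 16),
            "symbol": match.group("symbol"),
            "offset": int(match.group("offset"), 16),
            "size": int(match.group("size"), 16),
            "module": match.group("module"),  # None for built-in kernel code
        })
    return frames


if __name__ == "__main__":
    # Three frames copied from the call trace in this report.
    sample = """\
 [<c0490198>] dput+0x30/0xd7
 [<f916bfc9>] nfs_permission+0x190/0x19c [nfs]
 [<c0483feb>] sys_stat64+0xf/0x23
"""
    for frame in parse_frames(sample):
        print(frame["symbol"], frame["module"])
```

Filtering the result on `frame["module"] == "nfs"` gives a quick view of how much of the trace is inside the NFS client.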

Comment 1 Chuck Ebbert 2008-01-25 19:53:12 UTC

There are fixes in 2.6.24 for some causes of this problem, but those errors are
triggered by a buggy NFS server. Is the server software up-to-date?

Comment 2 Stephen Warren 2008-01-29 01:23:47 UTC

I asked our IT department, and they state the server has the latest OS loaded.

Which specific bug # / patch # / version should the server have from netapp to
solve this?

Comment 3 Stephen Warren 2008-01-30 19:11:01 UTC

Apparently the specific software version on the netapp is 7.2.2.

Is there a fix available for this issue from netapp?

Comment 4 Chuck Ebbert 2008-01-31 23:40:10 UTC

(In reply to comment #3)
> Apparently the specific software version on the netapp is 7.2.2.
> 
> Is there a fix available for this issue from netapp?
> 

7.2.4 is the latest version.

Comment 5 Stephen Warren 2008-02-05 18:27:48 UTC

IT says this:

I need confirmation of the bug before upgrading this filer.  This filer has been
on 7.2.2 for at least 259 days.  This is the first report of any client crash.

Can anyone tell me the specific netapp bug number that causes this that will be
fixed by upgrading to 7.2.4?

Comment 6 Stephen Warren 2008-02-15 17:36:39 UTC

Apparently, netapp is not aware of this problem.

Chuck, please tell me what the NFS server bug is exactly, and which netapp bug #
was assigned to this issue.

Comment 7 Jeff Layton 2008-02-25 13:04:00 UTC

According to Neil's description of this patch, it certainly is suggestive that
this is a bug in Linux that's being triggered by buggy server behavior:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=4c1fe2f78a08e2c514a39c91a0eb7b55bbd3c0d2

I'm afraid I don't have any info on whether this is a NetApp bug that's already
fixed. If you're hitting this regularly, you might want to try backporting this
patch to 2.6.23. Otherwise, it sounds like this is already fixed in 2.6.24.

Comment 8 Jeff Layton 2008-02-25 13:05:56 UTC

Your best bet though is probably to get some network captures of the problem
behavior, verify whether the problem is what we think it is, and open a case
with NetApp about it.

Comment 9 Stephen Warren 2008-02-25 17:55:47 UTC

FYI, I have supplied packet captures to our IT department, who I assume are in
communication with NetApp support. We also have a 7.2.4 test server, and it
seems like that fixes the issue, in brief testing. We'll see if/when it gets
rolled out to the production server.

Comment 10 Stephen Warren 2008-02-25 21:27:08 UTC

Created attachment 295844 [details]
Re-diffed (vs. 2.6.23) patch from comment #7.

Comment 11 Stephen Warren 2008-02-25 21:27:25 UTC

I added the patch in comment #7 to the kernel RPM and rebuilt and it *does*
appear to fix the issue.

I've attached the patch to this bug report (basically, just re-diffed against
the 2.6.23 kernel).

Is there any chance of including it in the standard Fedora kernel releases? I'd
rather not be stuck rebuilding my kernel each time a new one comes out :-)

Comment 12 Chuck Ebbert 2008-02-29 23:30:11 UTC

Kernel 2.6.24.3-12 is in the updates-testing repository now... please test.
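
To check whether a given machine already runs a kernel at or beyond the fixed build, the version strings from this report can be compared numerically. The sketch below is a simplified comparison, not RPM's own version-comparison algorithm; the function names are hypothetical, and it assumes strings shaped like `2.6.23.14-107.fc8`.

```python
import re


def parse_kernel_version(version_release):
    """Split e.g. "kernel-2.6.23.14-107.fc8.i686" into ((2, 6, 23, 14), 107).

    Simplified: only the dotted version and the leading number of the
    release are compared; suffixes such as ".fc8" are ignored.
    """
    if version_release.startswith("kernel-"):
        version_release = version_release[len("kernel-"):]
    version, _, release = version_release.partition("-")
    version_parts = tuple(int(part) for part in version.split("."))
    match = re.match(r"\d+", release)
    release_number = int(match.group()) if match else 0
    return version_parts, release_number


def is_at_least(running, fixed):
    """True if `running` is the same as or newer than `fixed`."""
    return parse_kernel_version(running) >= parse_kernel_version(fixed)


if __name__ == "__main__":
    fixed = "2.6.24.3-12"
    for running in ("2.6.23.14-107.fc8", "2.6.24.3-12.fc8"):
        print(running, is_at_least(running, fixed))
```

On a live system the running string could come from `platform.release()`, assuming it follows the same version-release shape.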

Comment 13 Stephen Warren 2008-03-07 18:09:32 UTC

Hmmm. For some reason, I didn't see your earlier comment; must have missed the
email. Sorry.

Anyway, the update just came in via the updates repository, and does appear to
have solved the problem. I am awaiting final confirmation that the netapp wasn't
also upgraded in the meantime, to establish that it was the kernel fix that
solved the issue. I'll close out the bug when I get that.

Thanks.

Comment 14 Jeff Layton 2008-03-12 18:52:13 UTC

Setting to NEEDINFO based on last comment.

Comment 15 Stephen Warren 2008-03-12 19:51:54 UTC

Well, IT can't be bothered to answer my question.

Since I already tested the fix in a previous kernel, and since I don't see the
issue in the current kernel, I'll assume the kernel update (rather than a netapp
upgrade) fixed the issue.

Hence, you can close this out. (I would do that myself, but I'm not sure whether
to choose upstream/errata/rawhide/...)

Thanks.

Comment 16 Stephen Warren 2008-03-12 19:54:51 UTC

Ick. Clicking in the resolution list in order to see what I might want to select
forcibly selected the option to close the bug. Damn Javascript.

Oh well, you can change the resolution to whatever is appropriate...

Comment 17 Jeff Layton 2008-03-12 20:08:29 UTC

Thanks for following up!

Since it did look like an actual bug, I'll change this to CURRENTRELEASE.
