215727 – kernel BUG at fs/gfs2/glock.c:1193! (Recursive attempt for a glock over NFS)

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	GFS-kernel
Sub Component:
Version:	6
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Steve Whitehouse
QA Contact:
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	224125
Depends On:
Blocks:	218479

Reported:	2006-11-15 14:08 UTC by Bryan Holty
Modified:	2007-11-30 22:11 UTC
CC List:	3 users
Fixed In Version:	2.6.19-1.2895
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-01-15 09:28:50 UTC
Type:	---
Embargoed:
Dependent Products:

Description Bryan Holty 2006-11-15 14:08:59 UTC

Description of problem:
The kernel hits a BUG on a recursive glock request when a GFS2 filesystem is
exported over NFS.

Nov 15 07:40:44 cfs1 kernel: original: gfs2_readdir+0x5b/0xa8 [gfs2]
Nov 15 07:40:44 cfs1 kernel: pid : 15863
Nov 15 07:40:44 cfs1 kernel: lock type : 2 lock state : 1
Nov 15 07:40:44 cfs1 kernel: new: gfs2_getattr+0x28/0x58 [gfs2]
Nov 15 07:40:44 cfs1 kernel: pid : 15863
Nov 15 07:40:44 cfs1 kernel: lock type : 2 lock state : 1
Nov 15 07:40:44 cfs1 kernel: ------------[ cut here ]------------
Nov 15 07:40:44 cfs1 kernel: kernel BUG at fs/gfs2/glock.c:1193!
Nov 15 07:40:44 cfs1 kernel: invalid opcode: 0000 [#1]
Nov 15 07:40:44 cfs1 kernel: SMP
Nov 15 07:40:44 cfs1 kernel: last sysfs file:
/fs/gfs2/CFS:cfs_data/counters/reclaimed
Nov 15 07:40:44 cfs1 kernel: Modules linked in: nfsd exportfs lockd nfs_acl md5
sctp lock_dlm gfs2 dlm configfs autofs4 hidp rfcomm l2cap bluetooth sunrpc
dm_mirror dm_multipath dm_mod video sbs i2c_ec button battery asus_acpi ac ipv6
parport_pc lp parport sg
floppy i2c_piix4 i2c_core pcspkr e100 ide_cd cdrom mii serio_raw aic7xxx
scsi_transport_spi qla2xxx scsi_transport_fc sd_mod scsi_mod ext3 jbd ehci_hcd
ohci_hcd uhci_hcd
Nov 15 07:40:44 cfs1 kernel: CPU:    0
Nov 15 07:40:44 cfs1 kernel: EIP:    0060:[<d0c79a60>]    Not tainted VLI
Nov 15 07:40:44 cfs1 kernel: EFLAGS: 00010296   (2.6.18-1.2849.fc6 #1)
Nov 15 07:40:44 cfs1 kernel: EIP is at gfs2_glock_nq+0xfd/0x1bb [gfs2]
Nov 15 07:40:44 cfs1 kernel: eax: 00000020   ebx: c5832e54   ecx: ffffffff  
edx: 00000046
Nov 15 07:40:44 cfs1 kernel: esi: c5832a84   edi: cba59730   ebp: cba59730  
esp: c5832a58
Nov 15 07:40:44 cfs1 kernel: ds: 007b   es: 007b   ss: 0068
Nov 15 07:40:44 cfs1 kernel: Process nfsd (pid: 15863, ti=c5832000 task=c50d6bf0
task.ti=c5832000)
Nov 15 07:40:44 cfs1 kernel: Stack: d0c90b67 00000002 00000001 ca1c3000 00000000
c5832a84 cced7aa0 cced7aa0
Nov 15 07:40:44 cfs1 kernel:        c5832af4 d0c856d5 c5832a84 c5832a84 c5832a84
cba59730 c50d6bf0 00000003
Nov 15 07:40:44 cfs1 kernel:        00000008 00000000 00000002 00000000 00000001
dead4ead ffffffff ffffffff
Nov 15 07:40:44 cfs1 kernel: Call Trace:
Nov 15 07:40:44 cfs1 kernel:  [<d0c856d5>] gfs2_getattr+0x2f/0x58 [gfs2]
Nov 15 07:40:44 cfs1 kernel:  [<c04777d9>] vfs_getattr+0x40/0x9b
Nov 15 07:40:44 cfs1 kernel:  [<d0d44daa>] encode_post_op_attr+0x37/0x20b [nfsd]
Nov 15 07:40:44 cfs1 kernel:  [<d0d45492>] encode_entry+0x19c/0x421 [nfsd]
Nov 15 07:40:44 cfs1 kernel:  [<d0c834a9>] filldir_func+0x46/0xb6 [gfs2]
Nov 15 07:40:44 cfs1 kernel:  [<d0c74061>] do_filldir_main+0x149/0x189 [gfs2]
Nov 15 07:40:44 cfs1 kernel:  [<d0c745ed>] gfs2_dir_read+0x484/0x4d1 [gfs2]
Nov 15 07:40:44 cfs1 kernel:  [<d0c83b9b>] gfs2_readdir+0x87/0xa8 [gfs2]
Nov 15 07:40:44 cfs1 kernel:  [<c047f6e0>] vfs_readdir+0x66/0x90
Nov 15 07:40:44 cfs1 kernel:  [<d0d3da84>] nfsd_readdir+0x6e/0xc5 [nfsd]
Nov 15 07:40:44 cfs1 kernel:  [<d0d44b49>] nfsd3_proc_readdirplus+0xfd/0x1be [nfsd]
Nov 15 07:40:44 cfs1 kernel:  [<d0d3a0d5>] nfsd_dispatch+0xc5/0x180 [nfsd]
Nov 15 07:40:44 cfs1 kernel:  [<d0bcfb9f>] svc_process+0x3bd/0x631 [sunrpc]
Nov 15 07:40:44 cfs1 kernel:  [<d0d3a604>] nfsd+0x19a/0x2ea [nfsd]
Nov 15 07:40:44 cfs1 kernel:  [<c0404dab>] kernel_thread_helper+0x7/0x10
Nov 15 07:40:44 cfs1 kernel: DWARF2 unwinder stuck at kernel_thread_helper+0x7/0x10
Nov 15 07:40:44 cfs1 kernel: Leftover inexact backtrace:
Nov 15 07:40:44 cfs1 kernel:  =======================
Nov 15 07:40:44 cfs1 kernel: Code: 00 c7 04 24 5a 0b c9 d0 89 44 24 04 e8 81 bd
7a ef 8b 47 2c 8b 57 14 89 44 24 08 89 54 24 04 c7 04 24 67 0b c9 d0 e8 67 bd 7a
ef <0f> 0b a9 04 5e 0a c9 d0 8b 5e 0c 8d 4f 54 8b 47 54 eb 07 39 58
Nov 15 07:40:44 cfs1 kernel: EIP: [<d0c79a60>] gfs2_glock_nq+0xfd/0x1bb [gfs2]
SS:ESP 0068:c5832a58



Version-Release number of selected component (if applicable):
2.6.18-1.2849.fc6 #1 SMP Fri Nov 10 12:45:28 EST 2006 i686 i686 i386 GNU/Linux
gfs2-utils-0.1.7-1.fc6
nfs-utils-1.0.10-1.fc6
nfs-utils-lib-1.0.8-7.2

Steps to Reproduce:
1. Create a GFS2 volume.
2. Export the GFS2 volume over NFS.
3. Mount the NFS export on a remote machine.
4. Run 'ls' on the remotely mounted NFS volume.

  
Actual results:
The kernel reports a BUG.
The 'ls' command on the remotely mounted NFS volume hangs.

Expected results:
No BUG.
'ls' on the remotely mounted NFS volume completes successfully.


Additional info:
Thanks.

Comment 1 Steve Whitehouse 2006-11-22 13:43:26 UTC

This is probably caused by trying to take the directory lock a second time when
stat is called on the directory entry for '.'.

I wonder if it's worth suggesting to the NFS people that there should be a new
export operation for readdirplus, which is what causes all the problems here.
That way we could not only fix this, but also take full advantage of knowing the
order of the required locks for the individual stat operations, so that the
locks can be requested early. I think the whole thing would work a lot faster,
as well as correctly.

In the meantime, I can't think of an easy fix, as most of the problem is in the
NFS code.
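
The "request the locks early" idea above can be sketched in userspace C. This
is only one possible reading of the proposal, not an existing NFS or GFS2
interface: plan_readdirplus_locks() and its parameters are hypothetical names,
and sorting by lock number is an assumed ordering policy. Given the lock numbers
needed to stat each entry, it drops the directory's own lock (already held by
readdir, which is what the '.' entry maps to), then sorts and de-duplicates the
rest so they could be requested in one consistent order.

```c
#include <stddef.h>
#include <stdlib.h>

/* Ascending comparison of two lock numbers for qsort(). */
static int cmp_lock_number(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;
    return (x > y) - (x < y);
}

/*
 * Hypothetical helper: build a lock acquisition plan in place.
 *  - locks:    lock numbers needed to stat each directory entry
 *  - n:        number of entries in 'locks'
 *  - dir_lock: lock number of the directory itself (already held)
 * Returns how many lock numbers remain at the front of 'locks'.
 */
size_t plan_readdirplus_locks(unsigned long *locks, size_t n,
                              unsigned long dir_lock)
{
    size_t kept = 0;
    size_t out = 0;

    /* Skip the directory's own lock: taking it again would recurse. */
    for (size_t i = 0; i < n; i++)
        if (locks[i] != dir_lock)
            locks[kept++] = locks[i];

    /* Assumed policy: request the remaining locks in ascending order. */
    qsort(locks, kept, sizeof(locks[0]), cmp_lock_number);

    /* Remove duplicates so each lock is requested only once. */
    for (size_t i = 0; i < kept; i++)
        if (out == 0 || locks[out - 1] != locks[i])
            locks[out++] = locks[i];

    return out;
}
```

A caller would pass the lock numbers gathered while walking the directory and
then request the first 'out' entries of the array before doing any stat work.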

Comment 2 Chris Nigbur 2006-11-22 19:47:32 UTC

The client in this case was a SLES 9 system.  When I tried the same mount and
'ls' between two Fedora machines, things worked a little better: I was able to
get the initial listing, but then I started getting stale descriptor errors.

Comment 3 Steve Whitehouse 2006-11-23 15:38:57 UTC

I suspect the difference between the two NFS clients comes down to whether they
issue a readdir or a readdirplus operation. It's the latter that causes the
problem here, since NFS's "filldir" callback also calls stat on each entry.
GFS2's stat doesn't recognise that it's being called with the directory already
locked, so it tries to take the lock again, which causes this error. I therefore
suspect that the directory entry in question is '.', since any other entry would
request a different lock from the one covering the directory.

The big problem here is that GFS2's stat has no way of knowing whether or not
it's being called from NFS. I am actively looking at solutions, so I hope to
have a patch shortly.
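
The failure described above can be modelled with a small userspace C sketch.
Everything here is a hypothetical stand-in (struct model_glock and the model_*
functions are not the GFS2 kernel API), and the "guarded" variant is only one
conceivable approach, not necessarily what the eventual patch does. The model
tracks holders by pid, returns an error on a recursive request where the kernel
would BUG, and shows that only the '.' entry hits the directory's own lock.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_HOLDERS 8

/* Hypothetical model of a glock: a lock number plus the pids holding it. */
struct model_glock {
    unsigned long number;
    int holder_pids[MAX_HOLDERS];
    size_t nr_holders;
};

/* True if 'pid' already holds 'gl'. */
bool model_locked_by(const struct model_glock *gl, int pid)
{
    for (size_t i = 0; i < gl->nr_holders; i++)
        if (gl->holder_pids[i] == pid)
            return true;
    return false;
}

/* Take the lock. Returns -1 on a recursive request (the kernel BUGs instead). */
int model_glock_nq(struct model_glock *gl, int pid)
{
    if (model_locked_by(gl, pid))
        return -1;
    if (gl->nr_holders == MAX_HOLDERS)
        return -2;
    gl->holder_pids[gl->nr_holders++] = pid;
    return 0;
}

/* Drop the lock held by 'pid', if any. */
void model_glock_dq(struct model_glock *gl, int pid)
{
    for (size_t i = 0; i < gl->nr_holders; i++) {
        if (gl->holder_pids[i] == pid) {
            gl->holder_pids[i] = gl->holder_pids[--gl->nr_holders];
            return;
        }
    }
}

/* stat that always takes the lock, as in the failing call path. */
int model_getattr_naive(struct model_glock *gl, int pid)
{
    int err = model_glock_nq(gl, pid);
    if (err)
        return err;
    model_glock_dq(gl, pid);
    return 0;
}

/* Assumed alternative: skip locking when this pid already holds the lock. */
int model_getattr_guarded(struct model_glock *gl, int pid)
{
    if (model_locked_by(gl, pid))
        return 0;
    return model_getattr_naive(gl, pid);
}

typedef int (*getattr_fn)(struct model_glock *, int);

/* readdirplus model: lock the directory, then stat "." and one child entry.
 * Returns the number of failed stat calls, or -1 if the directory lock fails. */
int model_readdirplus(struct model_glock *dir, struct model_glock *child,
                      int pid, getattr_fn getattr)
{
    const char *names[] = { ".", "file" };
    int failures = 0;

    if (model_glock_nq(dir, pid))
        return -1;
    for (size_t i = 0; i < 2; i++) {
        /* "." resolves to the directory itself, so its lock is already held. */
        struct model_glock *target = strcmp(names[i], ".") == 0 ? dir : child;
        if (getattr(target, pid))
            failures++;
    }
    model_glock_dq(dir, pid);
    return failures;
}
```

With model_getattr_naive the '.' entry fails while the child entry succeeds,
which matches the log above, where the original holder (gfs2_readdir) and the
new request (gfs2_getattr) both come from pid 15863.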

Comment 4 Steve Whitehouse 2006-11-27 10:07:27 UTC

I've just pushed a patch to fix this to the gfs2-2.6-nmw.git git tree at kernel.org.

Comment 5 Russell Cattelan 2006-12-05 18:44:49 UTC

Steve, could you please attach the patch, or at least add the git reference to
the fix?

What about other trees? Should this be pushed to them as well?

Comment 6 Steve Whitehouse 2007-01-15 09:28:50 UTC

Fixed in FC-6 2.6.19-1.2895

Comment 7 Steve Whitehouse 2007-01-24 09:47:12 UTC

*** Bug 224125 has been marked as a duplicate of this bug. ***
