Description of problem:
Kernel bug when attempted recursive glock. NFS mounted over GFS.

Nov 15 07:40:44 cfs1 kernel: original: gfs2_readdir+0x5b/0xa8 [gfs2]
Nov 15 07:40:44 cfs1 kernel: pid : 15863
Nov 15 07:40:44 cfs1 kernel: lock type : 2 lock state : 1
Nov 15 07:40:44 cfs1 kernel: new: gfs2_getattr+0x28/0x58 [gfs2]
Nov 15 07:40:44 cfs1 kernel: pid : 15863
Nov 15 07:40:44 cfs1 kernel: lock type : 2 lock state : 1
Nov 15 07:40:44 cfs1 kernel: ------------[ cut here ]------------
Nov 15 07:40:44 cfs1 kernel: kernel BUG at fs/gfs2/glock.c:1193!
Nov 15 07:40:44 cfs1 kernel: invalid opcode: 0000 [#1]
Nov 15 07:40:44 cfs1 kernel: SMP
Nov 15 07:40:44 cfs1 kernel: last sysfs file: /fs/gfs2/CFS:cfs_data/counters/reclaimed
Nov 15 07:40:44 cfs1 kernel: Modules linked in: nfsd exportfs lockd nfs_acl md5 sctp lock_dlm gfs2 dlm configfs autofs4 hidp rfcomm l2cap bluetooth sunrpc dm_mirror dm_multipath dm_mod video sbs i2c_ec button battery asus_acpi ac ipv6 parport_pc lp parport sg floppy i2c_piix4 i2c_core pcspkr e100 ide_cd cdrom mii serio_raw aic7xxx scsi_transport_spi qla2xxx scsi_transport_fc sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Nov 15 07:40:44 cfs1 kernel: CPU: 0
Nov 15 07:40:44 cfs1 kernel: EIP: 0060:[<d0c79a60>] Not tainted VLI
Nov 15 07:40:44 cfs1 kernel: EFLAGS: 00010296 (2.6.18-1.2849.fc6 #1)
Nov 15 07:40:44 cfs1 kernel: EIP is at gfs2_glock_nq+0xfd/0x1bb [gfs2]
Nov 15 07:40:44 cfs1 kernel: eax: 00000020 ebx: c5832e54 ecx: ffffffff edx: 00000046
Nov 15 07:40:44 cfs1 kernel: esi: c5832a84 edi: cba59730 ebp: cba59730 esp: c5832a58
Nov 15 07:40:44 cfs1 kernel: ds: 007b es: 007b ss: 0068
Nov 15 07:40:44 cfs1 kernel: Process nfsd (pid: 15863, ti=c5832000 task=c50d6bf0 task.ti=c5832000)
Nov 15 07:40:44 cfs1 kernel: Stack: d0c90b67 00000002 00000001 ca1c3000 00000000 c5832a84 cced7aa0 cced7aa0
Nov 15 07:40:44 cfs1 kernel: c5832af4 d0c856d5 c5832a84 c5832a84 c5832a84 cba59730 c50d6bf0 00000003
Nov 15 07:40:44 cfs1 kernel: 00000008 00000000 00000002 00000000 00000001 dead4ead ffffffff ffffffff
Nov 15 07:40:44 cfs1 kernel: Call Trace:
Nov 15 07:40:44 cfs1 kernel: [<d0c856d5>] gfs2_getattr+0x2f/0x58 [gfs2]
Nov 15 07:40:44 cfs1 kernel: [<c04777d9>] vfs_getattr+0x40/0x9b
Nov 15 07:40:44 cfs1 kernel: [<d0d44daa>] encode_post_op_attr+0x37/0x20b [nfsd]
Nov 15 07:40:44 cfs1 kernel: [<d0d45492>] encode_entry+0x19c/0x421 [nfsd]
Nov 15 07:40:44 cfs1 kernel: [<d0c834a9>] filldir_func+0x46/0xb6 [gfs2]
Nov 15 07:40:44 cfs1 kernel: [<d0c74061>] do_filldir_main+0x149/0x189 [gfs2]
Nov 15 07:40:44 cfs1 kernel: [<d0c745ed>] gfs2_dir_read+0x484/0x4d1 [gfs2]
Nov 15 07:40:44 cfs1 kernel: [<d0c83b9b>] gfs2_readdir+0x87/0xa8 [gfs2]
Nov 15 07:40:44 cfs1 kernel: [<c047f6e0>] vfs_readdir+0x66/0x90
Nov 15 07:40:44 cfs1 kernel: [<d0d3da84>] nfsd_readdir+0x6e/0xc5 [nfsd]
Nov 15 07:40:44 cfs1 kernel: [<d0d44b49>] nfsd3_proc_readdirplus+0xfd/0x1be [nfsd]
Nov 15 07:40:44 cfs1 kernel: [<d0d3a0d5>] nfsd_dispatch+0xc5/0x180 [nfsd]
Nov 15 07:40:44 cfs1 kernel: [<d0bcfb9f>] svc_process+0x3bd/0x631 [sunrpc]
Nov 15 07:40:44 cfs1 kernel: [<d0d3a604>] nfsd+0x19a/0x2ea [nfsd]
Nov 15 07:40:44 cfs1 kernel: [<c0404dab>] kernel_thread_helper+0x7/0x10
Nov 15 07:40:44 cfs1 kernel: DWARF2 unwinder stuck at kernel_thread_helper+0x7/0x10
Nov 15 07:40:44 cfs1 kernel: Leftover inexact backtrace:
Nov 15 07:40:44 cfs1 kernel: =======================
Nov 15 07:40:44 cfs1 kernel: Code: 00 c7 04 24 5a 0b c9 d0 89 44 24 04 e8 81 bd 7a ef 8b 47 2c 8b 57 14 89 44 24 08 89 54 24 04 c7 04 24 67 0b c9 d0 e8 67 bd 7a ef <0f> 0b a9 04 5e 0a c9 d0 8b 5e 0c 8d 4f 54 8b 47 54 eb 07 39 58
Nov 15 07:40:44 cfs1 kernel: EIP: [<d0c79a60>] gfs2_glock_nq+0xfd/0x1bb [gfs2] SS:ESP 0068:c5832a58

Version-Release number of selected component (if applicable):
2.6.18-1.2849.fc6 #1 SMP Fri Nov 10 12:45:28 EST 2006 i686 i686 i386 GNU/Linux
gfs2-utils-0.1.7-1.fc6
nfs-utils-1.0.10-1.fc6
nfs-utils-lib-1.0.8-7.2

How reproducible:
1. Create gfs volume
2. Export gfs volume over nfs
3. Remotely mount nfs volume.
4. 'ls' on remotely mounted nfs volume.

Steps to Reproduce:
1.
2.
3.

Actual results:
BUG report in kernel. remotely mounted nfs volume 'ls' command hangs.

Expected results:
No BUG. successful remotely mounted nfs volume 'ls'

Additional info:
Thanks.
This is probably due to trying to take the directory lock again when calling stat on the directory entry for '.'. I wonder if it's worth suggesting to the NFS people that there should be a new export operation for readdirplus, which is what causes all the problems here. That way we could not only fix this, but also take full advantage of knowing the order of the required locks for the individual stat operations in order to request the locks early. The whole thing would work a lot faster, I think, as well as correctly. In the meantime, I can't think of an easy fix, as most of the problem is in the NFS code.
The client in this case was a SLES 9 system. When I attempted the same mount and ls between two Fedora machines, things worked a little better. I was able to get the initial listing, but then I started getting stale descriptor errors.
I suspect the difference between the two NFS clients is due to whether they do a readdir or a readdirplus operation. It's the latter that's causing the problem here, since NFS's "filldir" callback also calls stat on each entry. GFS2's stat doesn't recognise that it's being called with the directory already locked, so it tries to get the lock again, causing this error. As a result, I suspect that the directory entry in question is '.', since otherwise it would be requesting a different lock from the one covering the directory. The big problem here is that there is no way for GFS2's stat to know whether or not it's being called from NFS. I am actively looking at solutions, so I hope to have a patch shortly.
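To make the suspected sequence concrete, here is a small toy model in Python. It is only a sketch of the locking pattern described above, not GFS2 or nfsd code: the `Glock` and `Inode` classes and every function name in it are hypothetical, lock modes are ignored, and the only behaviour modelled is that a second request for the same lock by the same pid is treated as an error, as in the "original: / new:" lines of the log.

```python
class RecursiveLockError(Exception):
    """Raised when a pid asks for a lock it already holds."""


class Glock:
    """Toy non-recursive lock (hypothetical; lock modes are ignored)."""

    def __init__(self, name):
        self.name = name
        self.holders = {}  # pid -> name of the function that took the lock

    def acquire(self, pid, caller):
        if pid in self.holders:
            # Same pid, same lock: report both call sites and fail.
            raise RecursiveLockError(
                f"original: {self.holders[pid]} new: {caller} pid: {pid}")
        self.holders[pid] = caller

    def release(self, pid):
        del self.holders[pid]


class Inode:
    """Toy inode that owns exactly one lock."""

    def __init__(self, name):
        self.name = name
        self.glock = Glock(name)


def toy_getattr(inode, pid):
    """Stat an inode; takes the inode's own lock for the duration."""
    inode.glock.acquire(pid, "toy_getattr")
    try:
        return {"name": inode.name}
    finally:
        inode.glock.release(pid)


def toy_readdir(directory, entries, pid, filldir):
    """Walk entries with the directory lock held, calling filldir on each."""
    directory.glock.acquire(pid, "toy_readdir")
    try:
        return [filldir(name, inode, pid) for name, inode in entries]
    finally:
        directory.glock.release(pid)


def plain_filldir(name, inode, pid):
    """readdir-style callback: records the name only, takes no lock."""
    return name


def readdirplus_filldir(name, inode, pid):
    """readdirplus-style callback: also stats each entry."""
    return (name, toy_getattr(inode, pid))
```

In this model, a listing that includes '.' succeeds with `plain_filldir` but raises `RecursiveLockError` with `readdirplus_filldir`, because '.' refers to the directory whose lock `toy_readdir` already holds. A listing without '.' succeeds with either callback, since every stat then takes a different lock.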
I've just pushed a patch to fix this to the gfs2-2.6-nmw.git git tree at kernel.org.
Steve, could you please attach the patch, or at least add the git reference to the fix? What about other trees? Should this be pushed to them as well?
Fixed in FC-6 2.6.19-1.2895
*** Bug 224125 has been marked as a duplicate of this bug. ***