Description of problem:
When trying to remove the same directory structure on a GFS2 file system from multiple nodes, one node will panic.

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
f8d5e974
*pde = ea8fa067
Oops: 0000 [#1]
SMP
last sysfs file: /devices/pci0000:00/0000:00:02.0/0000:01:1f.0/0000:03:02.1/irq
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth lock_dlm gfs2 dlm configfs sunrpc ipv6 xfrm_nalgo crypto_api dm_multipath video sbs backlight i2c_ec button battery asus_acpi ac lp e7xxx_edac i2c_i801 parport_pc edac_mc parport i2c_core ide_cd floppy e1000 intel_rng cdrom sg pcspkr dm_snapshot dm_zero dm_mirror dm_mod qla2xxx scsi_transport_fc ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
CPU:    1
EIP:    0060:[<f8d5e974>]    Not tainted VLI
EFLAGS: 00010246   (2.6.18-101.el5 #1)
EIP is at gfs2_glock_nq+0x12f/0x23c [gfs2]
eax: 00000001   ebx: 00000000   ecx: f4e61010   edx: fffffffb
esi: f4e60fc0   edi: f492f614   ebp: f492f614   esp: f48eadbc
ds: 007b   es: 007b   ss: 0068
Process rm (pid: 3198, ti=f48ea000 task=f7f82aa0 task.ti=f48ea000)
Stack: 00000000 00000000 f65ed000 00000001 f4e60fc0 00000001 00000000 f8d5eaa6
       00000001 f8d6e149 f4e60fc0 00000000 00000000 00000001 00000005 00000001
       00020a55 f8d56a21 00000001 f4efb414 f4efb414 f5290800 00000000 f4e7d000
Call Trace:
 [<f8d5eaa6>] gfs2_glock_nq_m+0x25/0xd7 [gfs2]
 [<f8d6e149>] gfs2_rlist_alloc+0x50/0x5a [gfs2]
 [<f8d56a21>] do_strip+0x1aa/0x401 [gfs2]
 [<f8d648d1>] gfs2_meta_read+0xf/0x51 [gfs2]
 [<f8d55761>] recursive_scan+0xeb/0x16c [gfs2]
 [<f8d55894>] trunc_dealloc+0xb2/0xf3 [gfs2]
 [<f8d56877>] do_strip+0x0/0x401 [gfs2]
 [<c0430000>] get_signal_to_deliver+0x328/0x39f
 [<f8d6ade9>] gfs2_delete_inode+0xdb/0x178 [gfs2]
 [<f8d6ad4c>] gfs2_delete_inode+0x3e/0x178 [gfs2]
 [<f8d6ad0e>] gfs2_delete_inode+0x0/0x178 [gfs2]
 [<c0486ab1>] generic_delete_inode+0xa5/0x10f
 [<c0486555>] iput+0x64/0x66
 [<c0485545>] dput+0xd5/0xed
 [<c047d7a5>] __lookup_hash+0x94/0xe1
 [<c047f141>] do_unlinkat+0x57/0x10e
 [<c0407f12>] do_syscall_trace+0xab/0xb1
 [<c0404f17>] syscall_call+0x7/0xb
 =======================
Code: 89 f0 e8 d8 eb ff ff e9 fb 00 00 00 8b 43 1c a8 40 75 16 f6 46 14 10 74 10 8b 44 24 04 83 7c 24 04 00 0f 44 c3 89 44 24 04 8b 1b <8b> 03 0f 18 00 90 8d 47 38 39 c3 0f 85 51 ff ff ff 83 7c 24 04
EIP: [<f8d5e974>] gfs2_glock_nq+0x12f/0x23c [gfs2] SS:ESP 0068:f48eadbc
 <0>Kernel panic - not syncing: Fatal exception

Version-Release number of selected component (if applicable):
kernel-2.6.18-101.el5

How reproducible:
100%

Steps to Reproduce:
1. create a directory structure (genesis -i 5000)
2. on all nodes `rm -rf gendir*`
3. panic

Actual results:
See message above

Expected results:
The directory structure should be removed with some rm commands returning errors that files do not exist.

Additional info:
We've got at least two problems here, but I've got one solved and the other shouldn't be tough. I'll be posting a patch for the first one shortly.
Created attachment 314011 [details] Patch to fix the first problem This patch fixes the first problem, which has multiple symptoms. The problem was that a holder was being enqueued on the glock's holders list, then, when the error was discovered (because the other node did the delete so this node couldn't), the holder was not removed from the list. When the function exits, the memory for the holder gets reused for a multitude of purposes. Often, it is reused for the same purpose (another holder), which causes the holders list to get hosed: the list pointing to itself, or the list pointers getting zeroed out, etc. Much badness. We need to fix this in RHEL5. This patch allows the code to go further without all the weird symptoms. However, there is still a glock hang that occurs when you attempt simultaneous deletes, so this should not be considered a "complete" solution at this time.
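To make the failure mode described above concrete, here is a minimal user-space sketch. It is not the GFS2 code and not the attached patch: `toy_glock`, `toy_holder`, `toy_unlink` and the error value are all hypothetical stand-ins. It only shows the rule that a holder living on a function's stack has to be unlinked from the list on every exit path, including the error path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins, NOT struct gfs2_glock / struct gfs2_holder. */
struct toy_holder {
    struct toy_holder *prev, *next;
};

struct toy_glock {
    struct toy_holder head;          /* sentinel of a circular list */
};

static void toy_glock_init(struct toy_glock *gl)
{
    gl->head.prev = gl->head.next = &gl->head;
}

static void toy_enqueue(struct toy_glock *gl, struct toy_holder *gh)
{
    /* Append gh at the tail of the holders list. */
    gh->prev = gl->head.prev;
    gh->next = &gl->head;
    gl->head.prev->next = gh;
    gl->head.prev = gh;
}

static void toy_dequeue(struct toy_holder *gh)
{
    /* Unlink gh and make it point at itself so it is safe to reuse. */
    gh->prev->next = gh->next;
    gh->next->prev = gh->prev;
    gh->prev = gh->next = gh;
}

static int toy_list_empty(const struct toy_glock *gl)
{
    return gl->head.next == &gl->head;
}

/*
 * inode_gone != 0 models "the other node already did the delete".
 * The return value -2 is an arbitrary stand-in for an errno.
 */
static int toy_unlink(struct toy_glock *gl, int inode_gone)
{
    struct toy_holder gh;            /* lives on this function's stack */
    int error = 0;

    toy_enqueue(gl, &gh);
    if (inode_gone)
        error = -2;

    /*
     * Dequeue on the error path too. Returning with gh still linked
     * would leave the list pointing into dead stack memory.
     */
    toy_dequeue(&gh);
    return error;
}
```

After a call such as `toy_unlink(&gl, 1)`, the list is empty again even though the call failed, which is the property the buggy error path was missing.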
If you apply the patch from comment #2 and try to recreate the problem, you will likely get into a glock hang. The problem should not be too difficult to get sorted out. Basically, there are two nodes, each of which is waiting for the other. These snippets from the glock dumps explain it all:

roth-01:
G:  s:UN n:3/340598 f:l t:EX d:EX/0 l:0 a:0 r:4
 H: s:EX f:W e:0 p:2934 [rm] gfs2_unlink+0xa4/0x199 [gfs2] ffff81004fa13de8
G:  s:EX n:2/3449f5 f:D t:EX d:UN/204737000 l:0 a:0 r:4
 H: s:EX f:H e:0 p:2934 [rm] gfs2_unlink+0x64/0x199 [gfs2] ffff81004fa13d78
 I: n:3291/3426805 t:4 f:0x00000010

roth-03:
G:  s:EX n:3/340598 f:Dy t:EX d:UN/204130000 l:0 a:0 r:4
 H: s:EX f:H e:0 p:3049 [rm] gfs2_rmdir+0x93/0x182 [gfs2] ffff81006b3c1df8
 R: n:3409304
G:  s:UN n:2/3449f5 f:l t:EX d:EX/0 l:0 a:0 r:4
 H: s:EX f:W e:0 p:3049 [rm] gfs2_rmdir+0x6f/0x182 [gfs2] ffff81006b3c1dc0

So apparently gfs2_unlink on roth-01 holds the inode glock (2/3449f5) and waits for the rgrp glock (3/340598), while gfs2_rmdir on roth-03 holds that rgrp glock and waits for that inode glock. I would hope it would be easy to fix with a lock ordering switch, but I haven't dug through it.
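The held/waiting pattern in those dumps can be written down as a small check. This is only an illustrative sketch: `toy_lock`, `toy_node` and `toy_is_abba` are hypothetical names, and reading the `n:` numbers as hexadecimal is an assumption. It encodes "each node holds the lock the other one is waiting for", which is the classic two-party (ABBA) deadlock.

```c
#include <assert.h>

/* Hypothetical model of one "n:type/number" glock name from the dump. */
struct toy_lock {
    unsigned type;                   /* 2 = inode, 3 = rgrp in the dump */
    unsigned long long number;
};

/* What one node currently holds (f:H) and what it waits for (f:W). */
struct toy_node {
    struct toy_lock held;
    struct toy_lock wanted;
};

static int toy_same_lock(struct toy_lock a, struct toy_lock b)
{
    return a.type == b.type && a.number == b.number;
}

/* Returns 1 if a and b each hold exactly what the other is waiting for. */
static int toy_is_abba(const struct toy_node *a, const struct toy_node *b)
{
    return toy_same_lock(a->held, b->wanted) &&
           toy_same_lock(b->held, a->wanted);
}
```

Filling in roth-01 as holding 2/3449f5 and wanting 3/340598, and roth-03 as the reverse, makes `toy_is_abba` return 1.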
Requesting flags for inclusion in RHEL5.3. We absolutely need to do this fix.
I think the problem is that gfs2_rmdir uses nq_m_sync, which sorts the holders by address, but gfs2_unlink doesn't do that sorting; it does dip followed by ip, followed by rgrpd. So I guess I'll let Steve decide which it should be. We either need to make gfs2_unlink use gfs2_glock_nq_m or rmdir do them individually. Since there are a bunch of other places that use gfs2_glock_nq_m, my personal preference is to make gfs2_unlink use gfs2_glock_nq_m as well. Perhaps I'll give it a try; it would make my previous patch obsolete and make the code less messy as well. If it works, I'll attach another patch.
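As a rough illustration of why a shared sort helps, the sketch below sorts a set of lock requests by one fixed key before they would be acquired. Everything here is hypothetical (`toy_req`, `toy_req_cmp`, `toy_sort_reqs`), and the key used (number, then type) is an assumption chosen for the example, not a statement of the GFS2 lock ordering rule. The point is only that two code paths that start with the same locks in different orders end up acquiring them in the same order.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical lock request, identified by a (type, number) pair. */
struct toy_req {
    unsigned type;
    unsigned long long number;
};

/* Order by number first, then by type (an assumed key for this sketch). */
static int toy_req_cmp(const void *pa, const void *pb)
{
    const struct toy_req *a = pa;
    const struct toy_req *b = pb;

    if (a->number != b->number)
        return a->number < b->number ? -1 : 1;
    if (a->type != b->type)
        return a->type < b->type ? -1 : 1;
    return 0;
}

/* Put requests into the single agreed order before acquiring them. */
static void toy_sort_reqs(struct toy_req *reqs, size_t n)
{
    qsort(reqs, n, sizeof(*reqs), toy_req_cmp);
}
```

With the two locks from the glock dumps, an "inode then rgrp" array and an "rgrp then inode" array both come out of `toy_sort_reqs` in the same order, so neither path can hold one lock while waiting for the other in the opposite order.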
Created attachment 314040 [details] Patch to fix both problems This patch fixes it the "right" way. By using gfs2_glock_nq_m rather than acquiring the glocks individually, it avoids deadlocks and avoids the problem described in comment #2. I'll submit this to cluster-devel for upstream.
I submitted my revised patch to cluster-devel for scrutiny. If Steve likes it and pushes it upstream, I can submit it for inclusion in the RHEL5 kernel. One more note before I forget: I noticed that function gfs2_glock_dq_m does this:

    for (x = 0; x < num_gh; x++)
        gfs2_glock_dq(&ghs[x]);

It seems like we could get into trouble here. If the original sort order of the locks is different, can't we get into a situation where it does:

    (1) lock(a)
    (2) lock(b)
    ...
    (3) unlock(a)
    (4) unlock(b)

And doesn't that break locking? Can't someone sneak in and grab a glock for (b) then wait for (a) between steps 3 and 4? If so, we have many other potential hangs. It just seems like if gfs2_glock_nq_m sorts the locks, then gfs2_glock_dq_m should sort them in the same fashion and unlock them in reverse order, otherwise we're exposed. I need to speak with Steve about this.
Please also fix gfs2_link which appears to have the same problem.
gfs2_glock_nq_m doesn't know all the information required to sort locks in the correct order. The ordering is explained in the Documentation/filesystems/gfs2-glock.txt file. I'd like to see as little use of gfs2_glock_nq_m as possible. Ideally it would only ever be used for rgrps but I guess we still need it in some cases in rename for inodes.
Also, regarding comment #7, the unlock order makes no difference.
*** Bug 458303 has been marked as a duplicate of this bug. ***
Created attachment 314124 [details] Revised patch This RHEL5 patch gets rid of the gfs2_glock_nq_m calls in favor of the preferred individual calls. There are potential hangs in all the code that calls gfs2_glock_nq_m except for rgrps, so I've now changed link, unlink, rmdir and rename. This version has gotten more testing than the last one did, and I couldn't confuse it, hang it or cause it to panic.
Created attachment 314125 [details] Revised patch (upstream version) This is an upstream version of the revised patch. The revised patch doesn't apply cleanly to the upstream GFS2, so I made this one.
RHEL5 patch was tested on the roth-0{1,3} cluster, and accepted into the upstream NMW git tree. I posted it to rhkernel-list too. Reassigning to Don Zickus for inclusion in the RHEL5.3 kernel.
*** Bug 459843 has been marked as a duplicate of this bug. ***
in kernel-2.6.18-107.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-0225.html