458289 – GFS2: rm on multiple nodes causes panic

Bug 458289 - GFS2: rm on multiple nodes causes panic

Summary: GFS2: rm on multiple nodes causes panic

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.3
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Robert Peterson
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	458303 459843 (view as bug list)
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2008-08-07 14:28 UTC by Nate Straz
Modified:	2009-01-20 20:19 UTC (History)
CC List:	8 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-01-20 20:19:17 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
Patch to to fix the first problem (616 bytes, patch) 2008-08-11 22:30 UTC, Robert Peterson	no flags	Details \| Diff
Patch to fix both problems (1.34 KB, patch) 2008-08-12 03:34 UTC, Robert Peterson	no flags	Details \| Diff
Revised patch (3.13 KB, patch) 2008-08-12 18:22 UTC, Robert Peterson	no flags	Details \| Diff
Revised patch (upstream version) (3.33 KB, patch) 2008-08-12 18:24 UTC, Robert Peterson	no flags	Details \| Diff
Show Obsolete (2) View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2009:0225	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 5.3 kernel security and bug fix update	2009-01-20 16:06:24 UTC

Description Nate Straz 2008-08-07 14:28:33 UTC

Description of problem:

When trying to remove the same directory structure on a GFS2 file system from multiple nodes, one node will panic.

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
f8d5e974
*pde = ea8fa067
Oops: 0000 [#1]
SMP 
last sysfs file: /devices/pci0000:00/0000:00:02.0/0000:01:1f.0/0000:03:02.1/irq
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth lock_dlm gfs2 dlm configfs sunrpc ipv6 xfrm_nalgo crypto_api dm_multipath video sbs backlight i2c_ec button battery asus_acpi ac lp e7xxx_edac i2c_i801 parport_pc edac_mc parport i2c_core ide_cd floppy e1000 intel_rng cdrom sg pcspkr dm_snapshot dm_zero dm_mirror dm_mod qla2xxx scsi_transport_fc ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
CPU:    1
EIP:    0060:[<f8d5e974>]    Not tainted VLI
EFLAGS: 00010246   (2.6.18-101.el5 #1) 
EIP is at gfs2_glock_nq+0x12f/0x23c [gfs2]
eax: 00000001   ebx: 00000000   ecx: f4e61010   edx: fffffffb
esi: f4e60fc0   edi: f492f614   ebp: f492f614   esp: f48eadbc
ds: 007b   es: 007b   ss: 0068
Process rm (pid: 3198, ti=f48ea000 task=f7f82aa0 task.ti=f48ea000)
Stack: 00000000 00000000 f65ed000 00000001 f4e60fc0 00000001 00000000 f8d5eaa6 
       00000001 f8d6e149 f4e60fc0 00000000 00000000 00000001 00000005 00000001 
       00020a55 f8d56a21 00000001 f4efb414 f4efb414 f5290800 00000000 f4e7d000 
Call Trace:
 [<f8d5eaa6>] gfs2_glock_nq_m+0x25/0xd7 [gfs2]
 [<f8d6e149>] gfs2_rlist_alloc+0x50/0x5a [gfs2]
 [<f8d56a21>] do_strip+0x1aa/0x401 [gfs2]
 [<f8d648d1>] gfs2_meta_read+0xf/0x51 [gfs2]
 [<f8d55761>] recursive_scan+0xeb/0x16c [gfs2]
 [<f8d55894>] trunc_dealloc+0xb2/0xf3 [gfs2]
 [<f8d56877>] do_strip+0x0/0x401 [gfs2]
 [<c0430000>] get_signal_to_deliver+0x328/0x39f
 [<f8d6ade9>] gfs2_delete_inode+0xdb/0x178 [gfs2]
 [<f8d6ad4c>] gfs2_delete_inode+0x3e/0x178 [gfs2]
 [<f8d6ad0e>] gfs2_delete_inode+0x0/0x178 [gfs2]
 [<c0486ab1>] generic_delete_inode+0xa5/0x10f
 [<c0486555>] iput+0x64/0x66
 [<c0485545>] dput+0xd5/0xed
 [<c047d7a5>] __lookup_hash+0x94/0xe1
 [<c047f141>] do_unlinkat+0x57/0x10e
 [<c0407f12>] do_syscall_trace+0xab/0xb1
 [<c0404f17>] syscall_call+0x7/0xb
 =======================
Code: 89 f0 e8 d8 eb ff ff e9 fb 00 00 00 8b 43 1c a8 40 75 16 f6 46 14 10 74 10 8b 44 24 04 83 7c 24 04 00 0f 44 c3 89 44 24 04 8b 1b <8b> 03 0f 18 00 90 8d 47 38 39 c3 0f 85 51 ff ff ff 83 7c 24 04 
EIP: [<f8d5e974>] gfs2_glock_nq+0x12f/0x23c [gfs2] SS:ESP 0068:f48eadbc
 <0>Kernel panic - not syncing: Fatal exception



Version-Release number of selected component (if applicable):
kernel-2.6.18-101.el5

How reproducible:
100%

Steps to Reproduce:
1. create a directory structure (genesis -i 5000)
2. on all nodes `rm -rf gendir*`
3. panic
  
Actual results:

See message above

Expected results:
The directory structure should be removed with some rm commands returning errors that files do not exist.

Additional info:

Comment 1 Robert Peterson 2008-08-11 22:21:02 UTC

We've got at least two problems here, but I've got one solved and the
other shouldn't be tough.  I'll be posting a patch for the first one
shortly.

Comment 2 Robert Peterson 2008-08-11 22:30:05 UTC

Created attachment 314011 [details]
Patch to to fix the first problem

This patch fixes the first problem, which has multiple symptoms.

The problem was that a glock was being enqueued on the holders list,
then, when the error is discovered (because the other node did the
delete so this node couldn't), the glock was not deallocated off the
list, but when the function exits, the memory for the holder gets
reused for a multitude of purposes.  Often, the holder is reused for
the same purpose (another holder) which causes the holders list to
get hosed: the list pointing to itself or the list pointers getting
zeroed out, etc.  Much badness.

We need to fix this in RHEL5.  This patch allows the code to go
further without all the weird symptoms.  However, there is still a
glock hang that occurs when you attempt simultaneous deletes, so this
should not be considered a "complete" solution at this time.

Comment 3 Robert Peterson 2008-08-11 22:34:24 UTC

If you apply the patch from comment #2 and try to recreate the problem,
you will likely get into a glock hang.  The problem should not be too
difficult to get sorted out.  Basically, there are two nodes, each of
which is waiting for the other.  These snippets from the glock dumps
explain it all:

roth-01:
G:  s:UN n:3/340598 f:l t:EX d:EX/0 l:0 a:0 r:4
 H: s:EX f:W e:0 p:2934 [rm] gfs2_unlink+0xa4/0x199 [gfs2] ffff81004fa13de8
G:  s:EX n:2/3449f5 f:D t:EX d:UN/204737000 l:0 a:0 r:4
 H: s:EX f:H e:0 p:2934 [rm] gfs2_unlink+0x64/0x199 [gfs2] ffff81004fa13d78
 I: n:3291/3426805 t:4 f:0x00000010

roth-03:
G:  s:EX n:3/340598 f:Dy t:EX d:UN/204130000 l:0 a:0 r:4
 H: s:EX f:H e:0 p:3049 [rm] gfs2_rmdir+0x93/0x182 [gfs2] ffff81006b3c1df8
 R: n:3409304
G:  s:UN n:2/3449f5 f:l t:EX d:EX/0 l:0 a:0 r:4
 H: s:EX f:W e:0 p:3049 [rm] gfs2_rmdir+0x6f/0x182 [gfs2] ffff81006b3c1dc0

So apparently gfs2_unlink grabs a inode glock and waits for a rgrp glock,
but gfs2_rmdir apparently grabs that rgrp glock and wait for that inode
glock.  I would hope it would be easy to fix with a lock ordering switch,
but I haven't dug through it.

Comment 4 Robert Peterson 2008-08-11 22:36:22 UTC

Requesting flags for inclusion in RHEL5.3.  We absolutely need to do
this fix.

Comment 5 Robert Peterson 2008-08-12 02:54:03 UTC

I think the problem is that gfs2_rmdir uses nq_m_sync, which sorts
the holders by address, but gfs2_unlink doesn't do that sorting; it
does dip followed by ip, followed by rgrpd.  So I guess I'll let
Steve decide which it should be.  We either need to make gfs2_unlink
use gfs2_glock_nq_m or rmdir do them individually.  Since there are
a bunch of other places that use gfs2_glock_nq_m, my personal
preference is to make gfs2_unlink use gfs2_glock_nq_m as well.
Perhaps I'll give it a try; it would make my previous patch obsolete
and make the code less messy as well.  If it works, I'll attach
another patch.

Comment 6 Robert Peterson 2008-08-12 03:34:20 UTC

Created attachment 314040 [details]
Patch to fix both problems

This patch fixes it the "right" way.  By using gfs2_glock_nq_m
rather than individually, it avoids deadlocks and avoids the problem
described in comment #2.  I'll submit this to cluster-devel for
upstream.

Comment 7 Robert Peterson 2008-08-12 03:54:23 UTC

I submitted by revised patch to cluster-devel for scrutiny.  If Steve
likes it and pushes it upstream, I can submit it for inclusion in the
RHEL5 kernel.

One more note before I forget:  I noticed that function gfs2_glock_dq_m
does this:

	for (x = 0; x < num_gh; x++)
		gfs2_glock_dq(&ghs[x]);

It seems like we could get into trouble here.  If the original sort
order of the locks is different, can't we get into a situation where
it does:  (1) lock(a) (2) lock(b); ... (3) unlock(a); (4) unlock(b);
And doesn't that break locking?  Can't someone sneak in and grab
a glock for (b) then wait for (a) between steps 3 and 4?  If so, we
have many other potential hangs.

It just seems like if gfs2_glock_nq_m sorts the locks, then
gfs2_glock_dq_m should sort them in the same fashion and unlock them
in reverse order, otherwise we're exposed.  I need to speak with Steve
about this.

Comment 8 Steve Whitehouse 2008-08-12 09:16:57 UTC

Please also fix gfs2_link which appears to have the same problem.

Comment 9 Steve Whitehouse 2008-08-12 09:23:10 UTC

gfs2_glock_nq_m doesn't know all the information required to sort locks in the correct order. The ordering is explained in the Documentation/filesystems/gfs2-glock.txt file. I'd like to see as little use of gfs2_glock_nq_m as possible. Ideally it would only ever be used for rgrps but I guess we still need it in some cases in rename for inodes.

Comment 10 Steve Whitehouse 2008-08-12 09:25:26 UTC

Also, regarding comment #7, the unlock order makes no difference.

Comment 11 Nate Straz 2008-08-12 12:10:00 UTC

*** Bug 458303 has been marked as a duplicate of this bug. ***

Comment 12 Robert Peterson 2008-08-12 18:22:46 UTC

Created attachment 314124 [details]
Revised patch

This RHEL5 patch gets rid of the gfs2_glock_nq_m calls in favor of
the more favored individual calls.  There are potential hangs in all
the code that calls gfs2_glock_nq_m except for rgrps, so I've now
changed link, unlink, rmdir and rename.  This version has gotten more
testing than the last one did, and I couldn't confuse it, hang it or
cause it to panic.

Comment 13 Robert Peterson 2008-08-12 18:24:31 UTC

Created attachment 314125 [details]
Revised patch (upstream version)

This is an upstream version of the revised patch.  The revised patch
doesn't apply cleanly to the upstream GFS2, so I made this one.

Comment 15 Robert Peterson 2008-08-13 16:39:32 UTC

RHEL5 patch was tested on the roth-0{1,3} cluster, and accepted
into the upstream NMW git tree.  I posted it to rhkernel-list too.
Reassigning to Don Zickus for inclusion in the RHEL5.3 kernel.

Comment 17 Steve Whitehouse 2008-08-26 07:48:26 UTC

*** Bug 459843 has been marked as a duplicate of this bug. ***

Comment 18 Don Zickus 2008-09-03 03:41:22 UTC

in kernel-2.6.18-107.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 22 errata-xmlrpc 2009-01-20 20:19:17 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html

Note You need to log in before you can comment on or make changes to this bug.