Bug 690555 - GFS2: resource group bitmap corruption resulting in panics and withdraws
Summary: GFS2: resource group bitmap corruption resulting in panics and withdraws
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.8
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: high
Target Milestone: rc
Assignee: Robert Peterson
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 711519
 
Reported: 2011-03-24 16:19 UTC by Adam Drew
Modified: 2018-11-26 19:38 UTC (History)
16 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Multiple GFS2 nodes attempted to unlink, rename, or manipulate files at the same time, causing various forms of file system corruption, panics, and withdraws. This update adds multiple checks of the dinode's i_nlink value to ensure that inode operations such as link, unlink, or rename no longer cause the aforementioned problems.
Clone Of:
: 702057 (view as bug list)
Environment:
Last Closed: 2011-07-21 09:58:06 UTC
Target Upstream Version:
Embargoed:


Attachments
writechaos.c (1.42 KB, text/plain)
2011-03-24 18:53 UTC, Robert Peterson
no flags Details
Test script provided by customer (17.93 KB, text/x-python)
2011-03-31 18:39 UTC, Adam Drew
no flags Details
Initial patch (4.13 KB, patch)
2011-04-07 03:03 UTC, Robert Peterson
no flags Details | Diff
Second patch (5.89 KB, patch)
2011-04-25 21:49 UTC, Robert Peterson
no flags Details | Diff
Untested example of my thoughts wrt the uptodate flag (5.09 KB, patch)
2011-04-27 13:41 UTC, Steve Whitehouse
no flags Details | Diff
Glock / Directory history tarball (117.86 KB, application/x-gzip)
2011-04-29 19:18 UTC, Robert Peterson
no flags Details
Simpler recreation program (1.34 KB, text/plain)
2011-05-03 19:29 UTC, Robert Peterson
no flags Details
Glock / Directory history tarball (268.49 KB, application/octet-stream)
2011-05-03 19:50 UTC, Robert Peterson
no flags Details
Third patch (5.11 KB, patch)
2011-05-04 15:49 UTC, Robert Peterson
no flags Details | Diff
The 5th RHEL5 patch (7.04 KB, patch)
2011-05-05 19:09 UTC, Robert Peterson
no flags Details | Diff
Sixth version of the patch (4.54 KB, patch)
2011-05-10 17:04 UTC, Robert Peterson
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2011:1065 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.7 kernel security and bug fix update 2011-07-21 09:21:37 UTC

Description Adam Drew 2011-03-24 16:19:48 UTC
Description of problem:
Experienced GFS2 hangs on kernel 2.6.18-248 in the course of working through bug 656032 (see comments 183 through 190). Once the traces were analyzed, it appeared that the issue resulted from the work queue not calling run_workqueue. From comment 190 in 656032:

I've been analyzing the data Adam mentioned in comment #183.
The problem does not appear to be related to the patches I've
made for these bugs.  In fact, I tend to believe it's not a
GFS2 bug at all, but a bug in the workqueue kernel code.
The data Adam provided seems to be enough to see what's going
on to some extent.  This is what I see:

All nodes are waiting for glock 2/4c8a8a.  That lock is mastered
on ibmblade03, who only knows this about the glock:

G:  s:EX n:2/4c8a8a f:yq t:EX d:SH/0 l:0 a:0 r:3 m:50
 I: n:1346036/5016202 t:4 f:0x10 d:0x00000003 s:2048/2048

However, the dlm knows this:

Resource ffff81036604ac80 Name (len=24) "       2          4c8a8a"  
Master Copy
Granted Queue
018744f4 EX
Conversion Queue
Waiting Queue
027f45a8 -- (PR) Remote:   2 012444cc
0134448f -- (PR) Remote:   1 022138b0
036e454d -- (PR) Remote:   4 000243df

The dlm "_locks" file has:

 id nodeid remid   pid  xid exflags flags sts grmode rqmode time_ms r_nodeid r_len      r_name
18744f4 0  0       8652  0     0      0    2    5      -1    0         0       24 "       2          4c8a8a"
27f45a8 2  12444cc 8003  0     0    10000  1   -1       3  209982009   0       24 "       2          4c8a8a"
134448f 1  22138b0 21045 0     0    10000  1   -1       3  209979505   0       24 "       2          4c8a8a"
36e454d 4  243df   6839  0     0    10000  1   -1       3  209972811   0       24 "       2          4c8a8a"

So dlm says the glock is being held by pid 8652 on node3.
That pid isn't even in gfs2:

 glock_workque S ffff810001015120     0  8652    281          8653  8651
(L-TLB)
  ffff81081ad65e70 0000000000000046 0000000000000001 ffffffff8002d2c9
  0000000300000000 000000000000000a ffff81082f5be860 ffff81011cb9d080
  000179b6cc403da2 00000000000008da ffff81082f5bea48 000000028002e45b
 Call Trace:
  [<ffffffff8002d2c9>] wake_up_bit+0x11/0x22
  [<ffffffff800a26f0>] keventd_create_kthread+0x0/0xc4
  [<ffffffff8004a2ac>] worker_thread+0x0/0x122
  [<ffffffff8004a36c>] worker_thread+0xc0/0x122
  [<ffffffff8008e433>] default_wake_function+0x0/0xe
  [<ffffffff800a26f0>] keventd_create_kthread+0x0/0xc4
  [<ffffffff800a26f0>] keventd_create_kthread+0x0/0xc4
  [<ffffffff80032b9b>] kthread+0xfe/0x132
  [<ffffffff8005dfb1>] child_rip+0xa/0x11
  [<ffffffff800a26f0>] keventd_create_kthread+0x0/0xc4
  [<ffffffff80032a9d>] kthread+0x0/0x132
  [<ffffffff8005dfa7>] child_rip+0x0/0x11

This tells me either (1) the work queue code has not yet
called run_workqueue which ought to have called function
glock_work_func, or (2) function glock_work_func has finished
after the work was queued without having done anything.
Case (1) would be a bug in the workqueue code.  Case (2)
could still be a gfs2 bug.  At any rate, this should likely
be treated as a new bug.

As I've mentioned in older bug records, there has been a lot
of work and changes to the kernel's workqueue code to solve
bugs like this, since 2.6.18 was first released.  Some of these
have been ported, but many have not.

Version-Release number of selected component (if applicable):
2.6.18-248

How reproducible:
Can reproduce at customer site after running for approximately 4 days.


Steps to Reproduce:
1. Run customer workload. Multiple nodes accessing the same GFS2 filesystem. Nodes perform jobs processing data on disk. One node assumes the task of deleting files en masse from the shared storage.
2. After 4+ days of intense activity a sudden hang occurs where all PIDs accessing GFS2 fall into D state. 
3. Look at the call traces from hung_task_timeout, the glocks, and the dlm data, and observe the data mentioned above
  
Actual results:
GFS2 gets hung up waiting for glock_work_func to run but it either already ran and did nothing or did not run because run_workqueue was never called.

Expected results:
The GFS2 work queue proceeds without issue.

Additional info:
Further data on the issue:

No specialized lock_dlm messages:
[adrew@adrew-virts MAr23]$ grep lock_dlm dmsg/*
dmsg/ibmblade01_dmsg: [<ffffffff88a953d0>] :lock_dlm:gdlm_do_unlock+0x3b/0x84
dmsg/ibmblade01_dmsg: [<ffffffff88a953d0>] :lock_dlm:gdlm_do_unlock+0x3b/0x84
dmsg/ibmblade01_dmsg: [<ffffffff88a952c1>] :lock_dlm:gdlm_bast+0x0/0xd4
dmsg/ibmblade01_dmsg:lock_dlm      S ffff810001004420     0 11426    281       
 11427 11339 (L-TLB)
dmsg/ibmblade01_dmsg: [<ffffffff88a967aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade01_dmsg: [<ffffffff88a96882>] :lock_dlm:gdlm_thread+0xd8/0x1c6
dmsg/ibmblade01_dmsg: [<ffffffff88a967aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade01_dmsg:lock_dlm      S 0000000000000282     0 11456    281       
 11457 11448 (L-TLB)
dmsg/ibmblade01_dmsg: [<ffffffff88a967aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade01_dmsg: [<ffffffff88a96882>] :lock_dlm:gdlm_thread+0xd8/0x1c6
dmsg/ibmblade01_dmsg: [<ffffffff88a967aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade01_dmsg:lock_dlm      S ffff810001004420     0 11488    281       
 11490 11480 (L-TLB)
dmsg/ibmblade01_dmsg: [<ffffffff88a967aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade01_dmsg: [<ffffffff88a96882>] :lock_dlm:gdlm_thread+0xd8/0x1c6
dmsg/ibmblade01_dmsg: [<ffffffff88a967aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade01_dmsg: [<ffffffff88a95594>] :lock_dlm:gdlm_ast+0x0/0x313
dmsg/ibmblade02_dmsg:GFS2: fsid=: Trying to join cluster "lock_dlm",
"rhel55:appl01"
dmsg/ibmblade02_dmsg:GFS2: fsid=: Trying to join cluster "lock_dlm",
"rhel55:data01"
dmsg/ibmblade02_dmsg:GFS2: fsid=: Trying to join cluster "lock_dlm",
"rhel55:data02"
dmsg/ibmblade02_dmsg: [<ffffffff889ed3d0>] :lock_dlm:gdlm_do_unlock+0x3b/0x84
dmsg/ibmblade02_dmsg: [<ffffffff889ed3d0>] :lock_dlm:gdlm_do_unlock+0x3b/0x84
dmsg/ibmblade02_dmsg: [<ffffffff889ed3d0>] :lock_dlm:gdlm_do_unlock+0x3b/0x84
dmsg/ibmblade02_dmsg: [<ffffffff889ed2c1>] :lock_dlm:gdlm_bast+0x0/0xd4
dmsg/ibmblade02_dmsg:lock_dlm      S ffff810001004420     0  8271    281       
  8272  8192 (L-TLB)
dmsg/ibmblade02_dmsg: [<ffffffff889ee7aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade02_dmsg: [<ffffffff889ee882>] :lock_dlm:gdlm_thread+0xd8/0x1c6
dmsg/ibmblade02_dmsg: [<ffffffff889ee7aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade02_dmsg:lock_dlm      S ffff81000100caa0     0  8297    281       
  8298  8283 (L-TLB)
dmsg/ibmblade02_dmsg: [<ffffffff889ee7aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade02_dmsg: [<ffffffff889ee882>] :lock_dlm:gdlm_thread+0xd8/0x1c6
dmsg/ibmblade02_dmsg: [<ffffffff889ee7aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade02_dmsg:lock_dlm      S ffff81000100caa0     0  8321    281       
  8322  8311 (L-TLB)
dmsg/ibmblade02_dmsg: [<ffffffff889ee7aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade02_dmsg: [<ffffffff889ee882>] :lock_dlm:gdlm_thread+0xd8/0x1c6
dmsg/ibmblade02_dmsg: [<ffffffff889ee7aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade03_dmsg:GFS2: fsid=: Trying to join cluster "lock_dlm",
"rhel55:appl01"
dmsg/ibmblade03_dmsg:GFS2: fsid=: Trying to join cluster "lock_dlm",
"rhel55:data01"
dmsg/ibmblade03_dmsg:GFS2: fsid=: Trying to join cluster "lock_dlm",
"rhel55:data02"
dmsg/ibmblade03_dmsg: [<ffffffff88a973d0>] :lock_dlm:gdlm_do_unlock+0x3b/0x84
dmsg/ibmblade03_dmsg: [<ffffffff88a972c1>] :lock_dlm:gdlm_bast+0x0/0xd4
dmsg/ibmblade03_dmsg:lock_dlm      S ffff810001004420     0  8919    281       
  8921  8840 (L-TLB)
dmsg/ibmblade03_dmsg: [<ffffffff88a987aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade03_dmsg: [<ffffffff88a98882>] :lock_dlm:gdlm_thread+0xd8/0x1c6
dmsg/ibmblade03_dmsg: [<ffffffff88a987aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade03_dmsg:lock_dlm      S 0000000000000282     0  8943    281       
  8945  8933 (L-TLB)
dmsg/ibmblade03_dmsg: [<ffffffff88a987aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade03_dmsg: [<ffffffff88a98882>] :lock_dlm:gdlm_thread+0xd8/0x1c6
dmsg/ibmblade03_dmsg: [<ffffffff88a987aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade03_dmsg:lock_dlm      S 0000000000000282     0  8967    281       
  8968  8958 (L-TLB)
dmsg/ibmblade03_dmsg: [<ffffffff88a987aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade03_dmsg: [<ffffffff88a98882>] :lock_dlm:gdlm_thread+0xd8/0x1c6
dmsg/ibmblade03_dmsg: [<ffffffff88a987aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade04_dmsg:GFS2: fsid=: Trying to join cluster "lock_dlm",
"rhel55:appl01"
dmsg/ibmblade04_dmsg:GFS2: fsid=: Trying to join cluster "lock_dlm",
"rhel55:data01"
dmsg/ibmblade04_dmsg:GFS2: fsid=: Trying to join cluster "lock_dlm",
"rhel55:data02"
dmsg/ibmblade04_dmsg: [<ffffffff88a983d0>] :lock_dlm:gdlm_do_unlock+0x3b/0x84
dmsg/ibmblade04_dmsg: [<ffffffff88a983d0>] :lock_dlm:gdlm_do_unlock+0x3b/0x84
dmsg/ibmblade04_dmsg: [<ffffffff88a983d0>] :lock_dlm:gdlm_do_unlock+0x3b/0x84
dmsg/ibmblade04_dmsg: [<ffffffff88a982c1>] :lock_dlm:gdlm_bast+0x0/0xd4
dmsg/ibmblade04_dmsg:lock_dlm      S 0000000000000282     0  8713    281       
  8714  8633 (L-TLB)
dmsg/ibmblade04_dmsg: [<ffffffff88a997aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade04_dmsg: [<ffffffff88a99882>] :lock_dlm:gdlm_thread+0xd8/0x1c6
dmsg/ibmblade04_dmsg: [<ffffffff88a997aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade04_dmsg:lock_dlm      S ffff810001004420     0  8739    281       
  8740  8724 (L-TLB)
dmsg/ibmblade04_dmsg: [<ffffffff88a997aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade04_dmsg: [<ffffffff88a99882>] :lock_dlm:gdlm_thread+0xd8/0x1c6
dmsg/ibmblade04_dmsg: [<ffffffff88a997aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade04_dmsg: [<ffffffff88a98265>] :lock_dlm:gdlm_put_lock+0x0/0x5
dmsg/ibmblade04_dmsg:lock_dlm      S ffff810001004420     0  8763    281       
  8764  8753 (L-TLB)
dmsg/ibmblade04_dmsg: [<ffffffff88a997aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade04_dmsg: [<ffffffff88a99882>] :lock_dlm:gdlm_thread+0xd8/0x1c6
dmsg/ibmblade04_dmsg: [<ffffffff88a997aa>] :lock_dlm:gdlm_thread+0x0/0x1c6

Here's where we hung up first:
Mar 18 14:52:16 ibmblade01 kernel: dcs_operation D ffff81011cab3cf0     0 21045
 19117         21053       (NOTLB)
Mar 18 14:52:16 ibmblade01 kernel:  ffff81056a3b3ce8 0000000000000086
0000000000000000 ffff8108216b6800
Mar 18 14:52:16 ibmblade01 kernel:  0000000000000018 0000000000000009
ffff8100549ce860 ffff810061b8f7a0
Mar 18 14:52:16 ibmblade01 kernel:  00008eee45f4237d 0000000000002834
ffff8100549cea48 00000001889f26d8
Mar 18 14:52:16 ibmblade01 kernel: Call Trace:
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff889f0d68>]
:dlm:dlm_lock+0x117/0x129
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff88a95594>]
:lock_dlm:gdlm_ast+0x0/0x313
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff88a1a0af>]
:gfs2:just_schedule+0x0/0xd
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff88a1a0b8>]
:gfs2:just_schedule+0x9/0xd
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff800639fa>]
__wait_on_bit+0x40/0x6e
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff88a1a0af>]
:gfs2:just_schedule+0x0/0xd
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff80063a94>]
out_of_line_wait_on_bit+0x6c/0x78
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff800a2936>]
wake_bit_function+0x0/0x23
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff88a1a06e>]
:gfs2:gfs2_glock_wait+0x37/0x78
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff88a28653>]
:gfs2:gfs2_getattr+0x85/0xc4
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff88a2864b>]
:gfs2:gfs2_getattr+0x7d/0xc4
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff8000e4c0>] vfs_getattr+0x2d/0xa9
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff80028a95>] vfs_stat_fd+0x32/0x4a
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff80062ff2>]
thread_return+0x62/0xfe
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff8003a2b4>]
fcntl_setlk+0x243/0x273
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff80023851>] sys_newstat+0x19/0x31
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff8005d229>] tracesys+0x71/0xe0
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0

Same PID days later:
Mar 21 14:39:55 ibmblade01 kernel: dcs_operation D ffff81011cab3cf0     0 21045
 19117         21061       (NOTLB)
Mar 21 14:39:55 ibmblade01 kernel:  ffff81056a3b3ce8 0000000000000086
0000000000000000 ffff8108216b6800
Mar 21 14:39:55 ibmblade01 kernel:  0000000000000018 0000000000000009
ffff8100549ce860 ffff810061b8f7a0
Mar 21 14:39:55 ibmblade01 kernel:  00008eee45f4237d 0000000000002834
ffff8100549cea48 00000001889f26d8
Mar 21 14:39:55 ibmblade01 kernel: Call Trace:
Mar 21 14:39:55 ibmblade01 kernel:  [<ffffffff889f0d68>]
:dlm:dlm_lock+0x117/0x129
Mar 21 14:39:55 ibmblade01 kernel:  [<ffffffff88a95594>]
:lock_dlm:gdlm_ast+0x0/0x313
Mar 21 14:39:55 ibmblade01 kernel:  [<ffffffff88a1a0af>]
:gfs2:just_schedule+0x0/0xd
Mar 21 14:39:55 ibmblade01 kernel:  [<ffffffff88a1a0b8>]
:gfs2:just_schedule+0x9/0xd
Mar 21 14:39:55 ibmblade01 kernel:  [<ffffffff800639fa>]
__wait_on_bit+0x40/0x6e
Mar 21 14:39:55 ibmblade01 kernel:  [<ffffffff88a1a0af>]
:gfs2:just_schedule+0x0/0xd
Mar 21 14:39:56 ibmblade01 kernel:  [<ffffffff80063a94>]
out_of_line_wait_on_bit+0x6c/0x78
Mar 21 14:39:56 ibmblade01 kernel:  [<ffffffff800a2936>]
wake_bit_function+0x0/0x23
Mar 21 14:39:56 ibmblade01 kernel:  [<ffffffff88a1a06e>]
:gfs2:gfs2_glock_wait+0x37/0x78
Mar 21 14:39:56 ibmblade01 kernel:  [<ffffffff88a28653>]
:gfs2:gfs2_getattr+0x85/0xc4
Mar 21 14:39:56 ibmblade01 kernel:  [<ffffffff88a2864b>]
:gfs2:gfs2_getattr+0x7d/0xc4
Mar 21 14:39:56 ibmblade01 kernel:  [<ffffffff8000e4c0>] vfs_getattr+0x2d/0xa9
Mar 21 14:39:56 ibmblade01 kernel:  [<ffffffff80028a95>] vfs_stat_fd+0x32/0x4a
Mar 21 14:39:56 ibmblade01 kernel:  [<ffffffff80062ff2>]
thread_return+0x62/0xfe
Mar 21 14:39:56 ibmblade01 kernel:  [<ffffffff8003a2b4>]
fcntl_setlk+0x243/0x273
Mar 21 14:39:56 ibmblade01 kernel:  [<ffffffff80023851>] sys_newstat+0x19/0x31
Mar 21 14:39:56 ibmblade01 kernel:  [<ffffffff8005d229>] tracesys+0x71/0xe0
Mar 21 14:39:56 ibmblade01 kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0

Same call trace, so we're definitely hung up.

Other nodes @ same time frame:
Mar 18 14:50:58 ibmblade02 kernel: dcs_operation D ffff81000100caa0     0  1379
 13868          1382       (NOTLB)
Mar 18 14:50:58 ibmblade02 kernel:  ffff8105b8b43ce8 0000000000000082
ffff81034bb414b0 ffffffff8000d5a8
Mar 18 14:50:58 ibmblade02 kernel:  ffff8105b8b43c68 0000000000000009
ffff81043b6e8080 ffff81011cb8e100
Mar 18 14:50:58 ibmblade02 kernel:  00008e5876e301ce 0000000000021dd4
ffff81043b6e8268 00000001b8b43c68
Mar 18 14:50:58 ibmblade02 kernel: Call Trace:
Mar 18 14:50:58 ibmblade02 kernel:  [<ffffffff8000d5a8>] dput+0x2c/0x114
Mar 18 14:50:58 ibmblade02 kernel:  [<ffffffff889720af>]
:gfs2:just_schedule+0x0/0xd
Mar 18 14:50:58 ibmblade02 kernel:  [<ffffffff889720b8>]
:gfs2:just_schedule+0x9/0xd
Mar 18 14:50:58 ibmblade02 kernel:  [<ffffffff800639fa>]
__wait_on_bit+0x40/0x6e
Mar 18 14:50:58 ibmblade02 kernel:  [<ffffffff889720af>]
:gfs2:just_schedule+0x0/0xd
Mar 18 14:50:58 ibmblade02 kernel:  [<ffffffff80063a94>]
out_of_line_wait_on_bit+0x6c/0x78
Mar 18 14:50:58 ibmblade02 kernel:  [<ffffffff800a2936>]
wake_bit_function+0x0/0x23
Mar 18 14:50:58 ibmblade02 kernel:  [<ffffffff8002cc76>]
mntput_no_expire+0x19/0x89
Mar 18 14:50:58 ibmblade02 kernel:  [<ffffffff8897206e>]
:gfs2:gfs2_glock_wait+0x37/0x78
Mar 18 14:50:58 ibmblade02 kernel:  [<ffffffff88980653>]
:gfs2:gfs2_getattr+0x85/0xc4
Mar 18 14:50:58 ibmblade02 kernel:  [<ffffffff8898064b>]
:gfs2:gfs2_getattr+0x7d/0xc4
Mar 18 14:50:58 ibmblade02 kernel:  [<ffffffff8000e4c0>] vfs_getattr+0x2d/0xa9
Mar 18 14:50:58 ibmblade02 kernel:  [<ffffffff80028a95>] vfs_stat_fd+0x32/0x4a
Mar 18 14:50:58 ibmblade02 kernel:  [<ffffffff8003a2b4>]
fcntl_setlk+0x243/0x273
Mar 18 14:50:58 ibmblade02 kernel:  [<ffffffff80023851>] sys_newstat+0x19/0x31
Mar 18 14:50:58 ibmblade02 kernel:  [<ffffffff8005d229>] tracesys+0x71/0xe0
Mar 18 14:50:59 ibmblade02 kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0

Mar 18 14:52:06 ibmblade04 kernel: dcs_operation D ffff810001015120     0  6839
  4799          6845       (NOTLB)
Mar 18 14:52:06 ibmblade04 kernel:  ffff8102b312fbb8 0000000000000086
0000000000000000 ffff81081a5ff800
Mar 18 14:52:06 ibmblade04 kernel:  0000000000000018 0000000000000009
ffff81008c0eb040 ffff81011cb9d080
Mar 18 14:52:06 ibmblade04 kernel:  00008eefac0d05f9 00000000000004ff
ffff81008c0eb228 00000002889f56d8
Mar 18 14:52:06 ibmblade04 kernel: Call Trace:
Mar 18 14:52:06 ibmblade04 kernel:  [<ffffffff88a1d0af>]
:gfs2:just_schedule+0x0/0xd
Mar 18 14:52:06 ibmblade04 kernel:  [<ffffffff88a1d0b8>]
:gfs2:just_schedule+0x9/0xd
Mar 18 14:52:06 ibmblade04 kernel:  [<ffffffff800639fa>]
__wait_on_bit+0x40/0x6e
Mar 18 14:52:06 ibmblade04 kernel:  [<ffffffff88a1d0af>]
:gfs2:just_schedule+0x0/0xd
Mar 18 14:52:06 ibmblade04 kernel:  [<ffffffff80063a94>]
out_of_line_wait_on_bit+0x6c/0x78
Mar 18 14:52:06 ibmblade04 kernel:  [<ffffffff800a2936>]
wake_bit_function+0x0/0x23
Mar 18 14:52:06 ibmblade04 kernel:  [<ffffffff88a1d06e>]
:gfs2:gfs2_glock_wait+0x37/0x78
Mar 18 14:52:06 ibmblade04 kernel:  [<ffffffff88a2d2a8>]
:gfs2:gfs2_permission+0x83/0xd3
Mar 18 14:52:06 ibmblade04 kernel:  [<ffffffff88a2d2a0>]
:gfs2:gfs2_permission+0x7b/0xd3
Mar 18 14:52:06 ibmblade04 kernel:  [<ffffffff8000d081>] do_lookup+0x65/0x1e6
Mar 18 14:52:06 ibmblade04 kernel:  [<ffffffff8000db08>] permission+0x81/0xc8
Mar 18 14:52:06 ibmblade04 kernel:  [<ffffffff80009a1c>]
__link_path_walk+0x173/0xfd1
Mar 18 14:52:06 ibmblade04 kernel:  [<ffffffff800a2908>]
autoremove_wake_function+0x0/0x2e
Mar 18 14:52:06 ibmblade04 kernel:  [<ffffffff8000eb7d>]
link_path_walk+0x45/0xb8
Mar 18 14:52:06 ibmblade04 kernel:  [<ffffffff8000ce9c>]
do_path_lookup+0x294/0x310
Mar 18 14:52:06 ibmblade04 kernel:  [<ffffffff8001298f>] getname+0x15b/0x1c2
Mar 18 14:52:06 ibmblade04 kernel:  [<ffffffff80023b21>]
__user_walk_fd+0x37/0x4c
Mar 18 14:52:06 ibmblade04 kernel:  [<ffffffff80028a7e>] vfs_stat_fd+0x1b/0x4a
Mar 18 14:52:06 ibmblade04 kernel:  [<ffffffff8000e60b>]
free_pages_and_swap_cache+0x67/0x7e
Mar 18 14:52:06 ibmblade04 kernel:  [<ffffffff80023851>] sys_newstat+0x19/0x31
Mar 18 14:52:07 ibmblade04 kernel:  [<ffffffff8005d229>] tracesys+0x71/0xe0
Mar 18 14:52:07 ibmblade04 kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0

Glocks:
[adrew@adrew-virts gfs2_hang_248_w_DLM_ko_Mar_19_11_RH_333124]$ grep 'p:21045'
ibmblade01-hangdata/gfs2/rhel55\:data01/glocks  | egrep -i 'f:aw|f:ew|f:w'
 H: s:SH f:aW e:0 p:21045 [dcs_operation_m] gfs2_getattr+0x7d/0xc4 [gfs2]
[adrew@adrew-virts gfs2_hang_248_w_DLM_ko_Mar_19_11_RH_333124]$ grep 'p:21045'
ibmblade01-hangdata/gfs2/rhel55\:data02/glocks  | egrep -i 'f:aw|f:ew|f:w'
[adrew@adrew-virts gfs2_hang_248_w_DLM_ko_Mar_19_11_RH_333124]$

Only 1 waiter for the pid that hung up first:
G:  s:UN n:2/4c8a8a f:lq t:SH d:EX/0 l:0 a:0 r:12 m:0
 H: s:SH f:aW e:0 p:21045 [dcs_operation_m] gfs2_getattr+0x7d/0xc4 [gfs2]
 H: s:SH f:aW e:0 p:21076 [dcs_operation_m] gfs2_getattr+0x7d/0xc4 [gfs2]
 H: s:SH f:aW e:0 p:21068 [dcs_operation_m] gfs2_getattr+0x7d/0xc4 [gfs2]
 H: s:SH f:aW e:0 p:21061 [dcs_operation_m] gfs2_getattr+0x7d/0xc4 [gfs2]
 H: s:SH f:aW e:0 p:13484 [dcs_operation_m] gfs2_getattr+0x7d/0xc4 [gfs2]
 H: s:SH f:aW e:0 p:16445 [dcs_filreg_mgr_] gfs2_permission+0x7b/0xd3 [gfs2]
 H: s:SH f:aW e:0 p:16446 [dcs_filreg_mgr_] gfs2_permission+0x7b/0xd3 [gfs2]
 H: s:SH f:aW e:0 p:16447 [dcs_filreg_mgr_] gfs2_permission+0x7b/0xd3 [gfs2]
 H: s:SH f:aW e:0 p:16448 [dcs_filreg_mgr_] gfs2_permission+0x7b/0xd3 [gfs2]

That looks like our culprit then.

Just in case it is useful:
G:  s:SH n:5/4c8a8a f:q t:SH d:EX/0 l:0 a:0 r:3 m:200
 H: s:SH f:EH e:0 p:11954 [(ended)] gfs2_inode_lookup+0x11a/0x284 [gfs2]

I checked the full call traces in the logs on ibmblade01. Every single one of
those waiters on 2/4c8a8a is hung up. 21045 is in dlm:dlm_lock from
gfs2_glock_wait but everything else is in dput & just_schedule.





Mar 18 14:52:16 ibmblade01 kernel: INFO: task dcs_operation_m:21045 blocked for
more than 120 seconds.
Mar 18 14:52:16 ibmblade01 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 18 14:52:16 ibmblade01 kernel: dcs_operation D ffff81011cab3cf0     0 21045
 19117         21053       (NOTLB)
Mar 18 14:52:16 ibmblade01 kernel:  ffff81056a3b3ce8 0000000000000086
0000000000000000 ffff8108216b6800
Mar 18 14:52:16 ibmblade01 kernel:  0000000000000018 0000000000000009
ffff8100549ce860 ffff810061b8f7a0
Mar 18 14:52:16 ibmblade01 kernel:  00008eee45f4237d 0000000000002834
ffff8100549cea48 00000001889f26d8
Mar 18 14:52:16 ibmblade01 kernel: Call Trace:
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff889f0d68>]
:dlm:dlm_lock+0x117/0x129
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff88a95594>]
:lock_dlm:gdlm_ast+0x0/0x313
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff88a1a0af>]
:gfs2:just_schedule+0x0/0xd
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff88a1a0b8>]
:gfs2:just_schedule+0x9/0xd
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff800639fa>]
__wait_on_bit+0x40/0x6e
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff88a1a0af>]
:gfs2:just_schedule+0x0/0xd
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff80063a94>]
out_of_line_wait_on_bit+0x6c/0x78
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff800a2936>]
wake_bit_function+0x0/0x23
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff88a1a06e>]
:gfs2:gfs2_glock_wait+0x37/0x78
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff88a28653>]
:gfs2:gfs2_getattr+0x85/0xc4
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff88a2864b>]
:gfs2:gfs2_getattr+0x7d/0xc4
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff8000e4c0>] vfs_getattr+0x2d/0xa9
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff80028a95>] vfs_stat_fd+0x32/0x4a
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff80062ff2>]
thread_return+0x62/0xfe
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff8003a2b4>]
fcntl_setlk+0x243/0x273
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff80023851>] sys_newstat+0x19/0x31
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff8005d229>] tracesys+0x71/0xe0
Mar 18 14:52:16 ibmblade01 kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Mar 18 14:52:16 ibmblade01 kernel: 
Mar 18 14:52:16 ibmblade01 kernel: INFO: task dcs_operation_m:21061 blocked for
more than 120 seconds.
Mar 18 14:52:16 ibmblade01 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 18 14:52:16 ibmblade01 kernel: dcs_operation D ffff810001015120     0 21061
 19117         21068 21055 (NOTLB)
Mar 18 14:52:16 ibmblade01 kernel:  ffff81059a2f3ce8 0000000000000082
ffff810465ddcb70 ffffffff8000d5a8
Mar 18 14:52:17 ibmblade01 kernel:  ffff81059a2f3c68 0000000000000009
ffff810061b8f7a0 ffff81011cb9d080
Mar 18 14:52:17 ibmblade01 kernel:  00008eefe41c6ead 0000000000012302
ffff810061b8f988 000000029a2f3c68
Mar 18 14:52:17 ibmblade01 kernel: Call Trace:
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff8000d5a8>] dput+0x2c/0x114
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff88a1a0af>]
:gfs2:just_schedule+0x0/0xd
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff88a1a0b8>]
:gfs2:just_schedule+0x9/0xd
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff800639fa>]
__wait_on_bit+0x40/0x6e
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff88a1a0af>]
:gfs2:just_schedule+0x0/0xd
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff80063a94>]
out_of_line_wait_on_bit+0x6c/0x78
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff800a2936>]
wake_bit_function+0x0/0x23
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff8002cc76>]
mntput_no_expire+0x19/0x89
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff88a1a06e>]
:gfs2:gfs2_glock_wait+0x37/0x78
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff88a28653>]
:gfs2:gfs2_getattr+0x85/0xc4
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff88a2864b>]
:gfs2:gfs2_getattr+0x7d/0xc4
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff8000e4c0>] vfs_getattr+0x2d/0xa9
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff80028a95>] vfs_stat_fd+0x32/0x4a
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff8003a2b4>]
fcntl_setlk+0x243/0x273
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff80023851>] sys_newstat+0x19/0x31
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff8005d229>] tracesys+0x71/0xe0
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Mar 18 14:52:17 ibmblade01 kernel: 
Mar 18 14:52:17 ibmblade01 kernel: INFO: task dcs_operation_m:21068 blocked for
more than 120 seconds.
Mar 18 14:52:17 ibmblade01 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 18 14:52:17 ibmblade01 kernel: dcs_operation D ffff810001015120     0 21068
 19117         21076 21061 (NOTLB)
Mar 18 14:52:17 ibmblade01 kernel:  ffff81011d999ce8 0000000000000082
ffff810465ddcb70 ffffffff8000d5a8
Mar 18 14:52:17 ibmblade01 kernel:  ffff81011d999c68 000000000000000a
ffff8103e29b2040 ffff81011cb9d080
Mar 18 14:52:17 ibmblade01 kernel:  00008eefddf035e3 000000000000f41b
ffff8103e29b2228 000000021d999c68
Mar 18 14:52:17 ibmblade01 kernel: Call Trace:
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff8000d5a8>] dput+0x2c/0x114
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff88a1a0af>]
:gfs2:just_schedule+0x0/0xd
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff88a1a0b8>]
:gfs2:just_schedule+0x9/0xd
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff800639fa>]
__wait_on_bit+0x40/0x6e
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff88a1a0af>]
:gfs2:just_schedule+0x0/0xd
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff80063a94>]
out_of_line_wait_on_bit+0x6c/0x78
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff800a2936>]
wake_bit_function+0x0/0x23
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff8002cc76>]
mntput_no_expire+0x19/0x89
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff88a1a06e>]
:gfs2:gfs2_glock_wait+0x37/0x78
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff88a28653>]
:gfs2:gfs2_getattr+0x85/0xc4
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff88a2864b>]
:gfs2:gfs2_getattr+0x7d/0xc4
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff8000e4c0>] vfs_getattr+0x2d/0xa9
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff80028a95>] vfs_stat_fd+0x32/0x4a
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff8000e60b>]
free_pages_and_swap_cache+0x67/0x7e
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff80033987>]
do_setitimer+0xd0/0x67b
Mar 18 14:52:17 ibmblade01 kernel:  [<ffffffff80028f23>]
do_sigaction+0x76/0x199
Mar 18 14:52:18 ibmblade01 kernel:  [<ffffffff80023851>] sys_newstat+0x19/0x31
Mar 18 14:52:18 ibmblade01 kernel:  [<ffffffff8005d229>] tracesys+0x71/0xe0
Mar 18 14:52:18 ibmblade01 kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Mar 18 14:52:18 ibmblade01 kernel: 
Mar 18 14:52:18 ibmblade01 kernel: INFO: task dcs_operation_m:21076 blocked for
more than 120 seconds.
Mar 18 14:52:18 ibmblade01 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 18 14:52:18 ibmblade01 kernel: dcs_operation D ffff81011cb9f0c0     0 21076
 19117         21083 21068 (NOTLB)
Mar 18 14:52:18 ibmblade01 kernel:  ffff810345687ce8 0000000000000086
ffff810465ddcb70 ffffffff8000d5a8
Mar 18 14:52:18 ibmblade01 kernel:  ffff810345687c68 0000000000000009
ffff8104040d1820 ffff81011cb9f0c0
Mar 18 14:52:18 ibmblade01 kernel:  00008eef6b271bc4 000000000000ab7d
ffff8104040d1a08 0000000345687c68
Mar 18 14:52:18 ibmblade01 kernel: Call Trace:
Mar 18 14:52:18 ibmblade01 kernel:  [<ffffffff8000d5a8>] dput+0x2c/0x114
Mar 18 14:52:18 ibmblade01 kernel:  [<ffffffff8000985a>] __d_lookup+0xb0/0xff
Mar 18 14:52:18 ibmblade01 kernel:  [<ffffffff88a1a0af>]
:gfs2:just_schedule+0x0/0xd
Mar 18 14:52:18 ibmblade01 kernel:  [<ffffffff88a1a0b8>]
:gfs2:just_schedule+0x9/0xd
Mar 18 14:52:18 ibmblade01 kernel:  [<ffffffff800639fa>]
__wait_on_bit+0x40/0x6e
Mar 18 14:52:18 ibmblade01 kernel:  [<ffffffff88a1a0af>]
:gfs2:just_schedule+0x0/0xd
Mar 18 14:52:18 ibmblade01 kernel:  [<ffffffff80063a94>]
out_of_line_wait_on_bit+0x6c/0x78
Mar 18 14:52:18 ibmblade01 kernel:  [<ffffffff800a2936>]
wake_bit_function+0x0/0x23
Mar 18 14:52:18 ibmblade01 kernel:  [<ffffffff8002cc76>]
mntput_no_expire+0x19/0x89
Mar 18 14:52:18 ibmblade01 kernel:  [<ffffffff88a1a06e>]
:gfs2:gfs2_glock_wait+0x37/0x78
Mar 18 14:52:18 ibmblade01 kernel:  [<ffffffff88a28653>]
:gfs2:gfs2_getattr+0x85/0xc4
Mar 18 14:52:18 ibmblade01 kernel:  [<ffffffff88a2864b>]
:gfs2:gfs2_getattr+0x7d/0xc4
Mar 18 14:52:18 ibmblade01 kernel:  [<ffffffff8000e4c0>] vfs_getattr+0x2d/0xa9
Mar 18 14:52:18 ibmblade01 kernel:  [<ffffffff80028a95>] vfs_stat_fd+0x32/0x4a
Mar 18 14:52:18 ibmblade01 kernel:  [<ffffffff80062ff2>]
thread_return+0x62/0xfe
Mar 18 14:52:18 ibmblade01 kernel:  [<ffffffff80023851>] sys_newstat+0x19/0x31
Mar 18 14:52:18 ibmblade01 kernel:  [<ffffffff8005d229>] tracesys+0x71/0xe0
Mar 18 14:52:18 ibmblade01 kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Mar 18 14:52:18 ibmblade01 kernel: 
Mar 18 14:54:16 ibmblade01 kernel: INFO: task dcs_operation_m:21045 blocked for
more than 120 seconds.
Mar 18 14:54:16 ibmblade01 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 18 14:54:16 ibmblade01 kernel: dcs_operation D ffff81011cab3cf0     0 21045
 19117         21053       (NOTLB)
Mar 18 14:54:16 ibmblade01 kernel:  ffff81056a3b3ce8 0000000000000086
0000000000000000 ffff8108216b6800
Mar 18 14:54:16 ibmblade01 kernel:  0000000000000018 0000000000000009
ffff8100549ce860 ffff810061b8f7a0
Mar 18 14:54:16 ibmblade01 kernel:  00008eee45f4237d 0000000000002834
ffff8100549cea48 00000001889f26d8
Mar 18 14:54:16 ibmblade01 kernel: Call Trace:
Mar 18 14:54:16 ibmblade01 kernel:  [<ffffffff889f0d68>]
:dlm:dlm_lock+0x117/0x129
Mar 18 14:54:16 ibmblade01 kernel:  [<ffffffff88a95594>]
:lock_dlm:gdlm_ast+0x0/0x313
Mar 18 14:54:16 ibmblade01 kernel:  [<ffffffff88a1a0af>]
:gfs2:just_schedule+0x0/0xd
Mar 18 14:54:16 ibmblade01 kernel:  [<ffffffff88a1a0b8>]
:gfs2:just_schedule+0x9/0xd
Mar 18 14:54:16 ibmblade01 kernel:  [<ffffffff800639fa>]
__wait_on_bit+0x40/0x6e
Mar 18 14:54:16 ibmblade01 kernel:  [<ffffffff88a1a0af>]
:gfs2:just_schedule+0x0/0xd
Mar 18 14:54:16 ibmblade01 kernel:  [<ffffffff80063a94>]
out_of_line_wait_on_bit+0x6c/0x78
Mar 18 14:54:16 ibmblade01 kernel:  [<ffffffff800a2936>]
wake_bit_function+0x0/0x23
Mar 18 14:54:16 ibmblade01 kernel:  [<ffffffff88a1a06e>]
:gfs2:gfs2_glock_wait+0x37/0x78
Mar 18 14:54:16 ibmblade01 kernel:  [<ffffffff88a28653>]
:gfs2:gfs2_getattr+0x85/0xc4
Mar 18 14:54:16 ibmblade01 kernel:  [<ffffffff88a2864b>]
:gfs2:gfs2_getattr+0x7d/0xc4
Mar 18 14:54:16 ibmblade01 kernel:  [<ffffffff8000e4c0>] vfs_getattr+0x2d/0xa9
Mar 18 14:54:16 ibmblade01 kernel:  [<ffffffff80028a95>] vfs_stat_fd+0x32/0x4a
Mar 18 14:54:16 ibmblade01 kernel:  [<ffffffff80062ff2>]
thread_return+0x62/0xfe
Mar 18 14:54:16 ibmblade01 kernel:  [<ffffffff8003a2b4>]
fcntl_setlk+0x243/0x273
Mar 18 14:54:16 ibmblade01 kernel:  [<ffffffff80023851>] sys_newstat+0x19/0x31
Mar 18 14:54:16 ibmblade01 kernel:  [<ffffffff8005d229>] tracesys+0x71/0xe0
Mar 18 14:54:16 ibmblade01 kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Mar 18 14:54:16 ibmblade01 kernel: 

Node ibmblade02:
G:  s:UN n:2/4c8a8a f:lq t:SH d:EX/0 l:0 a:0 r:6 m:50
 H: s:SH f:W e:0 p:1412 [dcs_operation_m] gfs2_drevalidate+0xff/0x219 [gfs2]
 H: s:SH f:aW e:0 p:1379 [dcs_operation_m] gfs2_getattr+0x7d/0xc4 [gfs2]
 H: s:SH f:aW e:0 p:1382 [dcs_operation_m] gfs2_getattr+0x7d/0xc4 [gfs2]

Node ibmblade03:
G:  s:EX n:2/4c8a8a f:yq t:EX d:SH/0 l:0 a:0 r:3 m:50
 I: n:1346036/5016202 t:4 f:0x10 d:0x00000003 s:2048/2048

Node ibmblade04:
G:  s:UN n:2/4c8a8a f:lq t:SH d:EX/0 l:0 a:0 r:6 m:0
 H: s:SH f:aW e:0 p:6839 [dcs_operation_m] gfs2_permission+0x7b/0xd3 [gfs2]
 H: s:SH f:aW e:0 p:7197 [dcs_output_mgr] gfs2_permission+0x7b/0xd3 [gfs2]
 H: s:SH f:aW e:0 p:11394 [dcs_operation_m] gfs2_getattr+0x7d/0xc4 [gfs2]

Comment 1 Robert Peterson 2011-03-24 18:53:46 UTC
Created attachment 487410 [details]
writechaos.c

Good news: I believe I've recreated this problem on my cluster.
I started with a fresh mkfs.gfs2.  I used -J32 but I don't think
that's important, since I'm not doing much with metadata.
Then, simultaneously on all four nodes I ran this little test
program I wrote with these parameters:

time /home/bob/tools/writechaos /mnt/gfs2 4 1000000

The first three times I ran it, there was no hang.
On the fourth attempt, node3 ended much quicker than the
other nodes and the other three nodes hung.

Basically, all four nodes are competing for the same EX glock
to write data to the same file.  Somewhere along the line it
gets hung up.  Here are the final glock states for each node:

roth-01:
G:  s:EX n:2/84a1 f:lDpyiq t:UN d:UN/3000 l:0 a:1 r:5 m:0
 H: s:EX f:W e:0 p:19242 [writechaos] gfs2_write_begin+0x61/0x368 [gfs2]
 I: n:25/33953 t:8 f:0x10 d:0x00000000 s:214944/214944

 id nodeid remid   pid  xid exflags flags sts grmode rqmode time_ms r_nodeid r_len      r_name
32f0077 2 3e60017 26562 0 0 10000 2 5 -1 23 0 24 "       2            84a1"
28e006a 0 0 19015 0 0 0 1 -1 5 0 0 24 "       2            84a1"
30f0075 4 17d0027 26644 0 0 10000 1 -1 5 5 0 24 "       2            84a1"

Process 19015 is the glock_workqueue, same as the customer site.

roth-02:
G:  s:UN n:2/84a1 f:lq t:EX d:EX/0 l:0 a:0 r:4 m:0
 H: s:EX f:W e:0 p:26779 [writechaos] gfs2_write_begin+0x61/0x368 [gfs2]

roth-03:
none

roth-04:
G:  s:EX n:2/84a1 f:lDpiq t:UN d:UN/7000 l:0 a:0 r:5 m:0
 H: s:EX f:W e:0 p:26863 [writechaos] gfs2_write_begin+0x61/0x368 [gfs2]
 I: n:25/33953 t:8 f:0x10 d:0x00000000 s:214944/214944

Now that I can recreate it, I can add sufficient instrumentation
to solve this in a timely manner.  BTW, each run of 1000000
writes only takes 2m30s.
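
For anyone without the attachment, here is a minimal sketch of the kind of contention test described above. It is NOT the attached writechaos.c; the file name, layout, and arguments are illustrative only. Every node runs it against the same file on the shared GFS2 mount, so all nodes fight over the same inode glock in EX:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

/* usage: writelike <gfs2 mountpoint> <node#> <write count> */
int main(int argc, char **argv)
{
        char path[4096], buf[512];
        long i, count;
        int node, fd;

        if (argc != 4) {
                fprintf(stderr, "usage: %s <mountpoint> <node#> <count>\n", argv[0]);
                return 1;
        }
        node = atoi(argv[2]);
        count = atol(argv[3]);
        snprintf(path, sizeof(path), "%s/chaosfile", argv[1]);
        fd = open(path, O_CREAT | O_RDWR, 0644);  /* same file on every node */
        if (fd < 0) {
                perror("open");
                return 1;
        }
        memset(buf, 'a' + (node % 26), sizeof(buf));
        for (i = 0; i < count; i++) {
                /* each buffered write grabs the inode glock in EX on this node */
                if (write(fd, buf, sizeof(buf)) < 0) {
                        perror("write");
                        break;
                }
                if ((i % 1024) == 1023)
                        lseek(fd, 0, SEEK_SET);  /* keep the file small, keep contending */
        }
        close(fd);
        return 0;
}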

Comment 2 Robert Peterson 2011-03-24 20:09:26 UTC
Now that I can recreate the problem, I'm adding instrumentation.
I added a glock flag GLF_CB_OWED to indicate that a callback is
owed. I set the flag in function do_xmote right before we call
gfs2_lm_lock and I clear the flag in function blocking_cb and in
the LM_CB_ASYNC case, to indicate we got the callback.
After recreating the failure, I found the flag left on.  It
displays as "o":

G:  s:UN n:2/84a1 f:lqo t:EX d:EX/0 l:0 a:0 r:4 m:0
 H: s:EX f:W e:0 p:4758 [writechaos] gfs2_write_begin+0x61/0x368 [gfs2]

So it seems as if Steve's intuition was right: We're missing
a callback.  The problem could be in dlm or lock_dlm.  It's
time to add Dave T to the cc list, and add instrumentation to
lock_dlm.

Comment 3 Steve Whitehouse 2011-03-25 09:56:29 UTC
Can you move the flag down the stack a bit? If we had a flag that was set when the dlm returned success for a shared/deferred/exclusive lock mode and was cleared for null or unlocked lock modes, then we could also clear it when a demote request arrives. If that was left on, then it would confirm what is going on. It is a big step forward that we can recreate the issue locally at least.

Comment 4 Robert Peterson 2011-03-25 21:45:02 UTC
Update from the tests I performed today:
In the scenario I recreated, gfs2 was, with each run of the
test program, adjusting its hold time on the critical glock.
When the value got down to zero (0) it started doing lots of
journal log flushes, which tied up the system doing IO.
So it was not hung at all, it was just making VERY slow
progress due to all the IO.  Dumping the locks appeared to
show it hung, but in reality, we were watching a cluster in
motion (not hung) so the glock values were not stable.
The excessive IO was making it look hung.  In my case, the
test program went from 2.5 minutes to 40+ minutes per run!

I created a special version of gfs2 that does two things:
(1) Adds instrumentation to lock_dlm to detect any cases in
which dlm fails to send us a callback.  (2) Changes the
minimum hold time to 10.  That makes my test program run
longer.  This is not a solution, but it may make things run
a lot better.  I still need to track down why we're doing so
many log flushes as the value gets shorter. The test modules
are on my people page as:

gfs2.690555.2011mar25.ko and lock_dlm.690555.2011mar25.ko

They're compiled against the 2.6.18-248.el5.bz656032-x86_64
kernel.

Comment 5 Robert Peterson 2011-03-28 02:46:05 UTC
Here is my (as yet unproven) theory of what's happening:
Function do_xmote is called for a dinode glock.  It submits
a TRY_1CB lock to dlm, but doesn't get it due to "TRYFAILED".
So it tries again in rapid succession because the glock hold
time value is so short.  With each try, it calls
go_xmote_th which is essentially inode_go_sync.  Each call
to inode_go_sync calls gfs2_log_flush, which does disk IO,
followed by gfs2_ail_empty_gl, which also calls gfs2_log_flush
a second time for even more disk IO.

So if my theory is correct, this all goes back to the problem
that function do_xmote sets the TRY_1CB bit on.  When we've
tried to remove it, everything seemed to blow up.  It's long
been a mystery that we've never had the luxury of time to
solve.  But now I think we are being forced to take the time
to figure out why.  Once we can get rid of the TRY lock in
favor of a blocking lock, we should be able to shorten the
hold time back to 1 or even 0 again.

Comment 10 Adam Drew 2011-03-31 16:35:48 UTC
Caught the panic on the serial console:

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at fs/locks.c:2080
invalid opcode: 0000 [1] SMP 
last sysfs file: /devices/platform/host1/session2/target1:0:0/1:0:0:1/timeout
CPU 0 
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth dm_log_clustered(U) lock_dlm(U) gfs2(U) dlm configfs lockd sunrpc dm_round_robin sg be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy snd_intel8x0 snd_ac97_codec ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm i2c_piix4 snd_timer snd soundcore snd_page_alloc pcspkr virtio_net i2c_core tpm_tis 8139too virtio_balloon ide_cd 8139cp tpm mii serio_raw cdrom tpm_bios dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod virtio_blk virtio_pci virtio_ring virtio ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 5210, comm: python Tainted: G      2.6.18-248.el5.bz656032 #1
RIP: 0010:[<ffffffff8002766c>]  [<ffffffff8002766c>] locks_remove_flock+0xe2/0x124
RSP: 0018:ffff810015abbda8  EFLAGS: 00010246
RAX: 0000000000000081 RBX: ffff8100146e0788 RCX: 0000000000000000
RDX: ffff810015683100 RSI: ffff810015abbda8 RDI: ffffffff80335c60
RBP: ffff8100156745c0 R08: 0000000000000000 R09: 0000000000000000
R10: ffff810015abbda8 R11: 00000000000000b0 R12: ffff8100146e0688
R13: ffff8100146e0688 R14: ffff81000e5cce80 R15: ffff81001440eb70
FS:  0000000000000000(0000) GS:ffffffff80427000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000003c6e49a6e0 CR3: 0000000000201000 CR4: 00000000000006e0
Process python (pid: 5210, threadinfo ffff810015aba000, task ffff810015683100)
Stack:  0000000000000000 0000000000000000 0000000000000000 0000000000000000
 0000000000000000 0000000000000000 000000000000145a 0000000000000000
 0000000000000000 0000000000000000 ffff8100156745c0 0000000000000202
Call Trace:
 [<ffffffff80012bd8>] __fput+0x94/0x1bd
 [<ffffffff80023d8b>] filp_close+0x5c/0x64
 [<ffffffff80039227>] put_files_struct+0x63/0xae
 [<ffffffff8001597d>] do_exit+0x376/0x955
 [<ffffffff8004948c>] cpuset_exit+0x0/0x88
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0


Code: 0f 0b 68 0b ae 2b 80 c2 20 08 eb fe 48 89 c3 48 8b 03 48 85 
RIP  [<ffffffff8002766c>] locks_remove_flock+0xe2/0x124
 RSP <ffff810015abbda8>

I took a screenshot of the other ones and it was in the same code in fs/locks.c. This was on a brand new FS mounted on one node running 20 threads of reading and writing to files.

A

Comment 11 Steve Whitehouse 2011-03-31 18:03:16 UTC
Adam, can you attach the python script to the bug?

Comment 12 Adam Drew 2011-03-31 18:39:48 UTC
Created attachment 489194 [details]
Test script provided by customer

This is the test script that the customer developed for filesystem testing. It wasn't intended as a reproducer, but it was found that the panics could be reproduced with this script outside the context of the usual application tests.

I've verified that I can produce panics and withdraws with this script, but I had to experiment to arrive at just the right combination of parameters. If the test is too aggressive, the system will just get bogged down and then fenced. If the test is too light, it'll take an absurd number of runs to encounter a problem. I invoke the tool like so:

fs-test.py --filesize 50000 --loop 2 --repeat 100 --concurrency 15 --delopt 1 --quiet --filename /mnt/gfs2/PID_01 4

The customer used the following settings:

filesystem_test.py --filesize 1000000 --loop 100 --repeat 2 --concurrency 20 --delopt 1 --quiet --filename /dcs/appl01/DATA/01_PID 4

Their system is much more robust than mine so I think you'll need to tune concurrency and size to find the right fit for your system.

The code forks a number of children that then proceed to create files, EX lock them, fill them with data, and then delete them. There's more nuance to it than that, but those are the essentials. It is test case 4 in the code.

The customer developed the tool as a general filesystem performance test and it is pretty robust. I am going to work on basing a simpler test on its behavior to see if we can simplify and independently arrive at the same symptom.
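
A stripped-down C sketch of that create / EX-lock / fill / delete cycle (not the attached Python script; all names and parameter values here are illustrative) would look roughly like this, with each child playing the role of one of the forked workers:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/file.h>
#include <sys/wait.h>

static void worker(const char *dir, int id, int loops, long filesize)
{
        char path[4096], buf[4096];
        long done;
        int i, fd;

        memset(buf, 'x', sizeof(buf));
        for (i = 0; i < loops; i++) {
                snprintf(path, sizeof(path), "%s/stress.%d.%d", dir, id, i);
                fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0644);
                if (fd < 0 || flock(fd, LOCK_EX) < 0)   /* EX lock the new file */
                        _exit(1);
                for (done = 0; done < filesize; done += sizeof(buf))
                        if (write(fd, buf, sizeof(buf)) < 0)    /* fill with data */
                                _exit(1);
                flock(fd, LOCK_UN);
                close(fd);
                unlink(path);           /* like --delopt 1: delete after each pass */
        }
}

int main(int argc, char **argv)
{
        int children = (argc > 2) ? atoi(argv[2]) : 15; /* like --concurrency */
        int i;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <dir-on-gfs2> [children]\n", argv[0]);
                return 1;
        }
        for (i = 0; i < children; i++) {
                if (fork() == 0) {
                        worker(argv[1], getpid(), 100, 50000);  /* like --loop/--filesize */
                        _exit(0);
                }
        }
        while (wait(NULL) > 0)
                ;
        return 0;
}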

Comment 13 Robert Peterson 2011-04-02 14:25:21 UTC
Okay, I think I understand what's going on in the error I
reported in comment #7.  The hard part is understanding
the glock state machine, which is absurdly complex.

The best way to explain what's going on is by starting with
a pseudo-code snippet of function gfs2_glock_nq:

gfs2_glock_nq
   run_queue
      do_promote
         ret = glops->go_lock(gh);
            rgrp_go_lock
               gfs2_rgrp_bh_get
                  gfs2_meta_read - re-reads bitmaps
      do_xmote
         glops->go_xmote_th
            meta_go_sync
              gfs2_log_flush
              gfs2_meta_sync - syncs bitmaps to the media
                 struct address_space *mapping = gl->gl_aspace->i_mapping;
                 filemap_fdatawrite(mapping);
                 error = filemap_fdatawait(mapping);

Here is the view from 20,000 feet:
What I think is the problem, in a nutshell, is that in
function run_queue, it calls do_promote before do_xmote.

Function do_xmote is like a "top half" for glocks, and for
rgrps, its job is to sync the data to the media before the
lock is granted to the next waiter.

Function do_promote is like a "bottom half" for glocks,
and it finishes what was started.  In the case of rgrps,
the "bottom half" not only does the promotion as far as
dlm is concerned, it also calls "go_lock" which has the
effect of re-reading the rgrp bitmaps from the media.

So I believe the problem is that the bottom half is being
executed before the top half.  In other words, the bitmap
is being re-read before the previous IO is being synced to
the media.  Instead, we should be syncing the data to the
media before we call go_lock.

But it's not as simple as all that.  Function do_xmote makes
the assumption that the promotion has happened.  We can't
just reverse the order.

So this appears to be a bug in the glock state machine,
which is probably the second most complex piece of code in
gfs2.  But understanding the problem is the most difficult
part, and now that I know what's going on, I should be able
to design a fix that won't break everything else.
I've tried a couple of prototypes, but they failed for
various reasons.  I need to think of a proper way to fix it.
Unfortunately, Steve is away on holiday, so I won't be able
to consult with him, and he's the only one who thoroughly
understands the glock state machine.

Comment 14 Robert Peterson 2011-04-04 20:29:38 UTC
Just an update for people out there who are interested in the
progress of this bug.  Regarding the code snippet of comment #13:
Way back in the ancient history of GFS2, the code path used
to look something like this:

gfs2_glock_nq
   add_to_queue(gh);
   run_queue(gl);
      rq_promote
         glops->go_xmote_th
            inode_go_xmote_th
               gfs2_glock_xmote_th
                  glops->go_sync
                  gfs2_lm_lock()
                  if (!async)
                     xmote_bh
   if (!async)
      glock_wait_internal
         glops->go_lock
	    rgrp_go_lock
	       gfs2_rgrp_bh_get
	          gfs2_meta_read - re-reads bitmaps

So the go_sync (top half) was normally executed before the
go_lock (bottom half), which guarantees any writes happen
before the glock changes holders.  I've tried several ways
to make the sync happen first, but there are pitfalls and
problems with every design I've tried.  I'll keep trying to
design a better way to do this, but it's extremely complex.
We can't just go back to the previous method above because
the glock state machine has changed too much in the past
five years.

Comment 15 Robert Peterson 2011-04-07 03:03:58 UTC
Created attachment 490467 [details]
Initial patch

This patch seems to fix the problem with bitmap corruption,
and also the problem with the "writechaos" test program.

This patch adds a "go_sync" glops function for rgrp metadata
that is executed earlier in the glock cycle to sync the
data to media before releasing the glock.  It's nearly
identical to the go_xmote_th function, which makes it a bit
redundant.  I need to do more testing on it and see if I
can eliminate the go_xmote_th.  Previous attempts to do that
ended in failure, but could have been due to problems with
my instrumentation and other experimental code.  This patch
applies to the -255 kernel which contains the fixes for
bug #656032, bug #650494 and others.

However, if I run the scenario from bug #650494 repeatedly,
I can recreate another problem.  I'm not sure if it's related.
Here is the symptom:

GFS2: fsid=hlab_cluster:svr_home.0: fatal: filesystem consistency error
GFS2: fsid=hlab_cluster:svr_home.0:   inode = 38966 920929
GFS2: fsid=hlab_cluster:svr_home.0:   function = gfs2_change_nlink, file = /home/bob/gfs2.690555/inode.c, line = 426
GFS2: fsid=hlab_cluster:svr_home.0: about to withdraw this file system
GFS2: fsid=hlab_cluster:svr_home.0: telling LM to withdraw
GFS2: fsid=hlab_cluster:svr_home.0: withdrawn

Call Trace:
 [<ffffffff8868879c>] :gfs2:gfs2_lm_withdraw+0xd3/0x100
 [<ffffffff8868bd59>] :gfs2:gfs2_getbuf+0x108/0x117
 [<ffffffff8867ec89>] :gfs2:gfs2_dirent_prev+0x0/0x4e
 [<ffffffff8868bd7f>] :gfs2:gfs2_meta_read+0x17/0x65
 [<ffffffff8868a6e2>] :gfs2:gfs2_pin+0xb4/0xbc
 [<ffffffff8869c76f>] :gfs2:gfs2_consist_inode_i+0x43/0x48
 [<ffffffff88686def>] :gfs2:gfs2_change_nlink+0x56/0x104
 [<ffffffff88692940>] :gfs2:gfs2_unlink+0x12a/0x199
 [<ffffffff8869287a>] :gfs2:gfs2_unlink+0x64/0x199
 [<ffffffff88692898>] :gfs2:gfs2_unlink+0x82/0x199
 [<ffffffff886928ba>] :gfs2:gfs2_unlink+0xa4/0x199
 [<ffffffff8869914d>] :gfs2:gfs2_rindex_hold+0x2b/0x8d
 [<ffffffff8008fd34>] __cond_resched+0x1c/0x44
 [<ffffffff8004a076>] vfs_unlink+0xc4/0x120
 [<ffffffff8003c629>] do_unlinkat+0xaa/0x141
 [<ffffffff8006153d>] sysenter_tracesys+0x48/0x83
 [<ffffffff8006149d>] sysenter_do_call+0x1e/0x76

GFS2: fsid=hlab_cluster:svr_home.0: gfs2_delete_inode: -5
INFO: task gfs2_logd:10540 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
gfs2_logd     D ffffffff80153e37     0 10540    403         10541 10539 (L-TLB)
 ffff81067f233df0 0000000000000046 0000000000000002 ffff81047df31d30
 0000000000000002 000000000000000a ffff81047d7b1040 ffff81087fcc0820
 00000316c87faa2b 00000000000088b0 ffff81047d7b1228 000000147e1fb250
Call Trace:
 [<ffffffff8006ec35>] do_gettimeofday+0x40/0x90
 [<ffffffff800a279c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800637ce>] io_schedule+0x3f/0x67
 [<ffffffff88689682>] :gfs2:gfs2_log_flush+0x361/0x472
 [<ffffffff800a29b4>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88688d24>] :gfs2:gfs2_ail1_empty+0x36/0x95
 [<ffffffff88689c6d>] :gfs2:gfs2_logd+0xa2/0x15c
 [<ffffffff88689bcb>] :gfs2:gfs2_logd+0x0/0x15c
 [<ffffffff80032733>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a279c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff88685ffc>] :gfs2:inode_go_demote_ok+0x0/0x26
 [<ffffffff80032635>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

We've seen this symptom before, so I'm inclined to think it
might be a different problem.  It could also be similar:
It could be a result of dinode glocks not having one of the
"go_sync" glops, as I added for rgrp glocks, and therefore
their commits to media are happening too late.

The "proper" fix is probably to scrap and redesign the entire
glock state machine.  But design issues are for upstream, not
minor releases, so I'm just looking for a simple way to
circumvent the problem for this release.

Comment 16 Steve Whitehouse 2011-04-11 10:33:18 UTC
This doesn't look right to me.... what should be happening is this:

->go_lock() runs immediately before the first holder is granted
->go_unlock() runs immediately after the final holder is dropped

The glock cannot change state when there is a (granted) holder queued. When we call ->go_unlock() we take the GLF_LOCK while we drop the spinlock in order to prevent any state changes from occurring during this interval.

Comment 17 Robert Peterson 2011-04-11 15:34:18 UTC
Yes, go_unlock is called when the last holder releases the glock.
The trouble is, that's not the end of the rgrp's buffers.
They hang around as long as rd_bh_count remains positive, which means
during the journal flush process.  So long after the glock is
relinquished, there are still buffers floating around.  During that
time, another holder can hold the glock, which causes go_lock, which
calls gfs2_rgrp_bh_get, which calls gfs2_meta_read to re-read the
bitmaps from the media, thus losing bits.  And I suspect it's the
same issue for dinode glocks as well, which is likely the cause
of the problems we've seen in the past where a dinode is being
unlinked and somehow mistakenly ends up with a block count.

Comment 18 Steve Whitehouse 2011-04-11 15:46:20 UTC
gfs2_meta_read should be a no-op if the pages/buffers are uptodate.

When the glock is dropped, the buffers should all be synced back (if dirty) and then invalidated, so that none of them should still be in the journal if the glock is no longer held. We always do a journal flush as part of the process of writeback when the glock is dropped.

Ideally I'd like to be rid of the gfs2_rgrp_bh_get/put functions being called from ->go_lock/unlock as it would make the code rather simpler. Also it would allow locking an rgrp without reading the bitmap in, something that will be required to make ->statfs faster in the future. In order to do that we need to audit all users of the rgrp glock and ensure that they read the bitmaps in when required themselves.

Comment 19 Robert Peterson 2011-04-12 20:50:45 UTC
I've established that in the failing scenario, the glock is being
transferred from one node to the node that reports the problem.

The rgrp write is done correctly on the node releasing the lock.
The rgrp read, however, is bypassed on the failing node due to 
rd_bh_count != 0.  So really this problem is an issue of
mismanaging rd_bh_count in some way.

Note that this error only manifests when a bitmap OTHER THAN the
initial one (on the rgrp block) is changed.  I initially thought
the problem was with the management of the buffers, i.e., when
function trans_add_bh adds a buffer, it calls gfs2_rgrp_bh_hold.
That makes the most sense given the symptom I just mentioned.
However, it seems to be more complicated than that.  After
extensive analysis, that code all looks correct to me.  I'm
continuing to backtrack with my "glock history" instrumentation.

Comment 20 Robert Peterson 2011-04-16 14:20:20 UTC
I know what I said earlier in bug #656032 about the corrupt
bits not being set by gfs2_setbit, but apparently they are.
It seems like I'm dealing with multiple problems, but they
may all be the same problem manifesting in different ways.

What I now know is this:

The sequence of events appears simple or complex depending
on whether other nodes in the cluster are involved.
What seems to be happening is this:

(1) The dinode block starts out in "unlinked" status.

(2) We probably try to reclaim the dinode, which calls a
    final iput.  Function iput_final calls drop_inode, which
    eventually gets down to gfs2_delete_inode, and that calls
    gfs2_dinode_dealloc which changes the block from UNLINKED
    to FREE via gfs2_free_di.

(3) Sometime shortly thereafter, vfs calls us to unlink the
    inode, which does a lookup, then calls gfs2_unlink.
    That finds the block improperly freed.  You can't go
    from free back to unlinked.

I've been all over the code paths and I don't see where any
bad decisions are made or blocks are improperly calculated.

If steps (1) and (2) were done on a different node, then
step (3) will flag the error on a different node.  That's when
the symptoms make it look like I previously thought, which
is that a rgrp_iget re-reads the bitmap with the bad bit setting.

I spent a lot of time checking and making sure the glock
is properly locking, writing and reading the rgrp at the
right times.

In fact, the re-read seems proper, and the bad bit settings
are due to the fact that steps (1) and (2) occurred on the other
node.  When the bitmap is not re-read, it's correct because
the rgrp glock was not passed to another node in the cluster.

BTW, regarding step (3) the problem can also appear in the
rename code, when rename decrements the link count (on a file,
not a directory).

I did some experiments on the problem:

(1) At first I thought the problem might be due to function
    gfs2_drop_inode setting the nlink count to zero when it
    sees a glock demote.  I temporarily disabled the check.
    It made the failure harder to recreate, but the problem
    still did occur if another node was _not_ involved.
(2) I tried adding a check for gfs2_inode_dropped (nlink==0)
    to function gfs2_unlink, but again, it just makes the
    problem harder to recreate.  Then it only recreates
    when other nodes are involved, and not in the single-node
    case.

The big question is why gfs2_unlink is called for a dinode
that's already been deleted.  The unlink keys off directory
entries, so maybe someone isn't removing a dirent before the
dinode itself is unlinked.  But I checked the code, and it
looks to be correct.

So I still need to backtrack, and now I'm in less familiar
territory.  Maybe a process is improperly setting the demote
bit which causes a node to improperly set the nlink to 0,
(all happening before step (1) in the sequence above)
which causes it to be improperly unlinked, then freed
(reclaimed) while there are still live dirents pointing to it.

Either that, or maybe the bitmap calculations are wrong.
But I've checked those pretty carefully too and they also
seem correct.

Comment 24 Robert Peterson 2011-04-20 04:11:03 UTC
Here's the latest on the problem I'm working on:
I've rebased my code to the latest kernel. I threw away a
lot of my instrumentation that was cluttering up everything.
As a result I am now able to recreate the problem again.

My instrumentation indicates that the problem goes back to
someone calling function gfs2_dir_del while unlinking a file,
for example "DOCNET0.LST."  It finds the correct dirent on the
correct leaf, and it has the correct hash value, but for
some odd reason, the de_inum.no_addr is overwritten with
the disk address of the dinode that was just previously
unlinked.  In other words:

op #  no_addr  de_hash     leaf blk   filename   operation
----- -------  ----------  -------- -----------  ---------
1771: 0xe1887  0xd73a3e37  0x1ac67e DOCNET0.LSO  gfs2_dir_del
1772: 0xe1887  0x5d5ff7db  0x1eb31a DOCNET0.LST  gfs2_dir_del

A previous entry for when DOCNET0.LST was added indicates
the block address should have been 0x1cc68f but was overwritten.

The first unlink was to delete file "DOCNET0.LSO", and it
sets the dinode to "unlinked".  Shortly thereafter, the dinode
is set "free" by the auto-reclaim process.

The second unlink is to delete file "DOCNET0.LST", but it
mistakenly has the previous delete's inum in the actual buffer.
(It actually comes in big endian from the buffer and has
to be converted).
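
For reference, the on-disk address is converted on read along these
lines (illustrative fragment only):

/* the dirent's inode address is stored big endian on disk */
u64 no_addr = be64_to_cpu(dent->de_inum.no_addr);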

There is further churn, but the end result is that GFS2
then tries to unlink the block again, but now it's in a free
state, which is clearly wrong.

The bad news is that this is buffer corruption which is a
royal pain to track down.  A needle in a haystack.

The good news is that the corruption looks very distinct, so
I hopefully can track it down.  The needle is neon red, so
I have a pretty good chance of spotting it in the haystack.
It's not random garbage, it's a specific value, so I just have to
figure out where that value is coming from.  The big clue here
is that the block address overwriting the buffer was just used
in the previous unlink operation, so I just have to trace it
through.

Comment 25 Steve Whitehouse 2011-04-21 10:29:54 UTC
Bob, is there any pattern to the position in the leaf block where the problem occurs?

In particular, does the problem always occur when the entry is either the first or second entry in the leaf block? I wonder whether the issue is a confusion of "prev" and "cur" entries and whether the check for !gfs2_dirent_sentinel(dent) in __gfs2_dirent_find() is causing problems for unlink here.
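
For context, a "sentinel" dirent is the unused placeholder entry whose
inode address is zero; from memory the check is roughly the following
(the RHEL5 source may differ slightly):

static inline int gfs2_dirent_sentinel(const struct gfs2_dirent *dent)
{
        return dent->de_inum.no_addr == 0;      /* no inode => empty slot */
}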

Comment 27 Robert Peterson 2011-04-25 21:49:50 UTC
Created attachment 494765 [details]
Second patch

All the problems are happening on a jdata directory.  Examining
the journals with instrumentation, it's apparent that leaf
block updates are missing from the journal.  Looks like they're
getting dropped rather than synced to disk.  My belief is that
the problem is a statement in function inode_go_sync that looks
like this:

if (ip && !S_ISREG(ip->i_inode.i_mode))
   ip = NULL;

It basically says, "Don't sync the metadata if this is a directory."
That line has been in gfs2 for many years now.  It was added for
performance reasons.  I need to talk to Steve Whitehouse about it,
but the patch basically makes an exception for jdata directories,
and it seems to fix the problem.  Lots of testing needs to be
done on it, including: (1) many more iterations of the scenario
I've been using, (2) The writechaos program, (3) The customer test
script previously posted.
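
The exception described above looks roughly like this (a sketch of the
idea only, not the posted patch; the gfs2_is_jdata() test is my
illustration of it, not necessarily what the patch actually does):

/* keep ip (and thus the sync) for jdata directories instead of
 * nulling it out for every non-regular file */
if (ip && !S_ISREG(ip->i_inode.i_mode) &&
    !(S_ISDIR(ip->i_inode.i_mode) && gfs2_is_jdata(ip)))
        ip = NULL;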

This patch includes the following (in order of appearance):

(1) A minor correction to a BUG_ON that I found.
(2) Comment corrections.
(3) Removal of the try_lock that seems to circumvent the problem
    but it causes other coherency problems detected by the
    writechaos program I previously attached.
(4) A minor change to delay the clearing of the glock GLF_DIRTY
    flag, which seems more proper.
(5) The actual fix I described above.
(6) Better error reporting when bitmap errors are detected.

FWIW, the journals look something like this:
(1) Node 1 does absolutely nothing with the jdata directory.
(2) Node 2 does several directory adds and unlinks, and I see
    the update numbers (my instrumentation) occasionally
    jumping.  It removes a directory, but that update goes
    mysteriously missing, along with several others.
    But directory history (more instrumentation) shows an
    uninterrupted sequence of updates.
(3) Node 3 does a handful of directory updates, and I also
    see its update numbers (my instrumentation) jumping in
    the journal, but not in the directory history.

So what I should be seeing in the journal is this:

           NODE 1                   NODE 2
1. directory update 1200
2. directory update 1201
3. directory update 1202 (handoff)
4.                            directory update 1203
5.                            directory update 1204 (handoff)
6. directory update 1205
   etc.

Instead what I'm seeing is something like this:
1. directory update 1200
2. directory update 1201
3. directory update 1202 (handoff)
4.                            directory update 1203
5.                            directory update 1208 (handoff)
6. directory update 1212
   (error detected)

Comment 28 Steve Whitehouse 2011-04-26 10:06:23 UTC
The test:

if (ip && !S_ISREG(ip->i_inode.i_mode))
   ip = NULL;

does not skip metadata writeback, but data writeback and that is simply because only regular files have any data to write back. If blocks are landing up in the data address space for directories, then that is a problem which needs to be resolved.

The paths for a directory with the jdata flag, and one without are identical. The flag doesn't affect the directory at all, it is just a default for any files (or subdirectories) which are created in the directory.

There should be no difference in the clearing of the dirty bit. It is always set under the read side of the log rwsem and it is always cleared under the write side of that rwsem, so there should be no difference at all if the test and clear are split up.

Comment 29 Steve Whitehouse 2011-04-26 10:17:58 UTC
Also, glock_free() already disposes of the aspace, so there is no need to do that separately.

Comment 30 Steve Whitehouse 2011-04-27 10:27:45 UTC
I'm wondering about the GFS2_RDF_UPTODATE flag. This is reset in read_rindex_entry() and will cause the content of the rgrp to be updated next time it is used, whatever state it is in. It may be that we never reread the rindex after it has been read for the first time. On the other hand, if it did happen, then it looks like it might cause a problem since it doesn't appear to be synchronised to ensure that there are never pending rgrp updates waiting to be written back.

We could probably get rid of that flag completely if we ensured that its uptodate status matched that of the rgrp's data itself. In other words, we'd always copy in the data when we'd sent an I/O request which had returned successfully. That way the uptodateness of the rgrp could be tested simply by looking at the uptodateness of the associated bh.
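
Something along these lines, as an untested illustration only (field
names from memory, not the attached patch):

/* derive the rgrp's uptodate state from its first bitmap buffer
 * instead of keeping a separate GFS2_RDF_UPTODATE flag */
static int rgrp_is_uptodate(struct gfs2_rgrpd *rgd)
{
        struct buffer_head *bh = rgd->rd_bits[0].bi_bh;

        return bh && buffer_uptodate(bh);
}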

It does appear that the uptodate state of rgrps is overly complicated since we have to worry about both the rgrp itself and the rindex as well. So this would simplify the situation. I've been trying to convince myself that the clearing of the GFS2_RDF_UPTODATE flag in the rgrp's go_inval is actually correct from a locking standpoint. I can't find any obvious races, but we need to bear in mind that this flag operation is not atomic, so it must be covered by appropriate locking in case any of the other flag operations might overlap with it.

Comment 31 Steve Whitehouse 2011-04-27 13:41:59 UTC
Created attachment 495234 [details]
Untested example of my thoughts wrt the uptodate flag

Hopefully this will make my thoughts a bit clearer, even if it is not exactly right. It shows the idea to simplify the code a bit anyway. Potentially we could go further and get rid of the mutex and make the function a bit cleaner still.

Also, we need to stop using the rindex spin lock to cover the rgrp ref counts, since that lock is global. Either an atomic_t or a per-rgrp spinlock would be preferable. That can wait until later though.

I did also wonder whether it would be a good plan to add a test to the ->go_inval for the rgrp to ensure that at that stage there were no bh's left hanging off it. Would be a good thing for a test patch, even if we don't put that in upstream.

Comment 32 Robert Peterson 2011-04-27 13:55:29 UTC
This isn't a rgrp or bitmap problem, this is apparently a
glock problem.  I now have definitive proof that two different
nodes were making changes to the same directory at the same time:

Excerpt from my instrumentation on hlab1:
Entry PID   Function         dir dinode addr Leaf block Dirent filename update#
----- ----- ---------------- --------------- ---------- --------------- -------
1447: 18041 get_leaf read    0x130a95/000077 l:0x130c71 none 9 1f8
1448: 18041 gfs2_dir_del 7   0x130a95/14931d l:0x130c71 LAUNCH.LSO 4 1f9
1449: 10595 gfs2_unpin       0x000077/000077 l:0x130c71 none 15 1fa
1450: 10595 remove_from_ail3 0x000077/000077 l:0x130c71 none 12 1fb
...
1452: 18041 get_leaf read    0x130a95/000077 l:0x130c71 none 9 1fb
1453: 18041 gfs2_dir_del 7   0x130a95/14931d l:0x130c71 LAUNCH.LSO 5 1fc

The last number (e.g. 1f8) is the revision number for the leaf block.
Interpretation: hlab1 read the leaf, deleted a dirent from it,
thus revising the leaf from 1f8 to 1f9.  Then it unpinned it and
removed it from the ail.  The label "remove_from_ail3" means it
was called from gfs2_ail_empty_gl().  This looks like a normal
directory unlink.

Excerpt from my instrumentation on hlab2:
623: 13923 get_leaf read    0x130a95/000077 l:0x130c71 none 9 1f8
...
626: 13923 gfs2_dir_del 7   0x130a95/09cfab l:0x130c71 RLAUNCH_HLAB2.TRO 4 1f9
627: 10590 gfs2_unpin       0x000077/000077 l:0x130c71 none 15 1fa
628: 10590 remove_from_ail3 0x000077/000077 l:0x130c71 none 12 1fb
629: 13923 get_leaf read    0x130a95/000077 l:0x130c71 none 9 1fb

Interpretation:
Meanwhile, hlab2 read in the leaf block and found revision 1f8 which
is the version of the leaf before hlab1 did its deleting from the
leaf.  Then it proceeds to do its own delete, updating the leaf
from 1f8 to 1f9, unpinning and removing it from the ail, just like
hlab1 did.

This is proof that both nodes were updating the leaf at the same
time.  Yet each node should have had the glock locked exclusively
while they were updating the leaf.  So this has got to be a glock
holding problem.

Earlier versions of my instrumentation had glock history, so I'm
going to have to go back and reinstate it to see what's happening
to the glock during this period.

The problem is that it's getting harder and harder to
recreate with my instrumentation.  Adding more may make it nearly
impossible.

Consequences:
If the instrumentation didn't stop things there, after hlab1
re-reads the leaf block, it would find its changes to the leaf
missing, so it would try to unlink the file again.  The second
unlink would catch the fact that the block is already free in the
bitmap, and the assert_withdraw is thrown, which is the symptom
I've been chasing for many weeks now.

Comment 33 Robert Peterson 2011-04-29 19:18:48 UTC
Created attachment 495847 [details]
Glock / Directory history tarball

What we've got here is three nodes all making changes to
the same directory at the same time, thus causing corruption
with my patch that removes the try lock.  We need that patch to
solve the customer's problem, but obviously we can't ship it
until we fix what's causing the corruption.

This is a zipped tarball of directory "intec28".  It contains
six files: Three are the glock dumps for the three nodes, which
aren't particularly important.  The other three are the
problem.28.hlab{1,2,3} files, which contain my latest combined
glock and history trace information for my latest recreation of
the failure.  Pay particular attention to the items I marked
with "Bob:" near the end of each of these files.

All three nodes were sharing a glock in SH mode, which is fine.
Then they all wanted to make changes to the directory, so they
tried to promote to an EX mode, which is fine too.  The problem 
is that all three got it, and so they were all incorrectly given
permission to modify the leaf block.  One of them detected the
corruption; in this case it was hlab3.

I need to go over this with a fine-tooth comb and figure out
where the breakdown was.  My hunch is that the glock state
machine called do_promote in a case where it shouldn't have.

Comment 34 Robert Peterson 2011-04-29 19:51:40 UTC
Legend for the glock/directory history:

GLOCK History:
G 41029: do_error4              8682, f:    lq tgt: SH, dmt: EX, St: UN LEmt g:75677
            H: s:SH f:aW e:0 p:9576 [PWLPOST.EXE] gfs2_permission+0x7b/0xd3 [gfs2]
            H: s:SH f:aW e:0 p:9575 [PWLPOST.EXE] gfs2_permission+0x7b/0xd3 [gfs2]

G:          This is a glock history entry (D for directory history event)
41029:      This is the history number.  In this case, this is the 41029th entry logged
do_error4:  Point of interest or calling function.  do_error4 is the do_error from run_queue
            responsible for failing any queued "try" locks.
8682        This is the pid of the process that logged the event
f: lq       These are the glock flags, same as a glock dump
tgt: SH     Glock's target glock state, same as a glock dump (Target lock is SH lock)
dmt: EX     Glock's demote glock state, same as a glock dump (Demote to EX lock)
st: UN      Glock's current state, same as a glock dump (this glock is unlocked)
LEmt        Short for "Log Element list is empty"
g:75677     This is a generation number. More important for directory history.
            The H: lines are the first two glock holders queued at the time of the event,
            same as a glock dump

Directory History:
D 41048: get_leaf read          9576 0x130a95/000077 ig:18446744073709551519 l:0x1273542 none 9 332

D:          This is a directory event
41048:      This is the history number.
get_leaf read: The event.  This means that get_leaf was called and the block was read (!uptodate)
9576        This is the pid of the process that logged the event
0x130a95    This is the directory's block address
000077      This is the dent's block address (if known, or 77 if not known)
ig:X        This is the dinode's generation number.  Not always reliable
l:0x1273542 This is the leaf block address (this happens to be the leaf where corruption was found)
none        This is the dirent file name, or "none" if not applicable for this event
9           This is debug information (function caller identifier)
332         This is the leaf block's generation number (which is updated with every event logged)
            It's okay for this to skip backwards a little because logging an event is sometimes
            done after the leaf is written to the media, so the next re-read may cause it
            to go backward.

Comment 36 Robert Peterson 2011-05-03 14:45:50 UTC
Here is the current status of things.

1. For years, customers have, under rare circumstances, complained
   about GFS2 withdrawls due to file system corruption but we've
   never been able to find the cause of the corruption.  We've always
   closed these bug records as INSUFFICIENT_DATA and asked for a
   scenario that can recreate the problem.  No customer has given us
   a scenario.
2. The symptoms are: (1) Invalid bitmap transition: A bit in the
   bitmap is transitioned from "free" to "unlinked", which is invalid.
   (2) Inodes being deleted end up with an incorrect block count,
   so that after a delete there are supposedly still blocks left to
   delete.  (3) Other misc. problems with gfs2 metadata that seems
   to revert to older versions.
3. While testing the fix for bug #656032, we stumbled upon a scenario
   that recreates the corruption.  To do this, we are using a very
   complex multi-node application written by one of our customers
   (not the customer for this bug).
4. In debugging the problem, we determined that most customers are
   spared the corruption problem by a section of code in function
   do_xmote that turns all normal glocks into "try" glocks. This code
   affects the timing and denies some glocks in cases where they
   should be waiting.  This piece of code should not be necessary and
   has likely caused other problems and symptoms we saw while
   debugging bug #656032, so we know this code needs to be removed.
   This may also be adversely affecting performance.
5. To debug the problem, we removed that section of code, and it
   makes the problem easier to recreate.  We decided to focus on the
   first symptom, the invalid bitmap transition.  To debug that, we
   added instrumentation to log bitmap transitions and the management
   of resource groups.  This was a very long and arduous task but
   it eventually paid off and told us that the problem was, in fact,
   due to a dinode that was unlinked twice (it should not have
   existed when the second unlink occurred).
6. We initially pointed our fingers at the GFS2 code responsible
   for reclaiming unlinked dinodes, because that's been our most
   problematic area in the past.  In fact, it was an area of code
   we just fixed for the fourth or fifth time, in the name of bug
   #656032.  After intense debugging and complex instrumentation,
   that section of code was exonerated.  The problem is due to a
   leaf block that has a directory entry removed, only to have it
   magically reappear.
7. We proceeded to add more instrumentation to log the history of
   all directory operations.  After more long, arduous debugging,
   it appeared that the leaf block was reverting to an older version.
   This indicates a cluster coherency problem, which should be
   policed by the glock code.
8. We proceeded to add more instrumentation to log the history of
   glocks.  This proved to be insufficient, so we proceeded to add
   more instrumentation to log some history for lock_dlm, which is
   the lock harness between GFS2 and the distributed lock manager,
   DLM.
9. Now we're facing several problems:
   (a) The more instrumentation we add, the harder the problem is
       to recreate.  With our current level of instrumentation, it
       takes about a half-hour to recreate.
   (b) With this level of instrumentation, the corruption is spread
       over a large span of instrumented data.  The problem can go
       undetected for a very long time, and by the time it's caught,
       the data we've instrumented has often been lost due to memory
       constraints.
   (c) The sheer amount of data we're gathering due to the
       instrumentation makes our analysis very slow and painstaking.
       First, the instrumentation tells us the block in the bitmap
       that is marked wrong.  We have to backtrack from that to the
       leaf block.  From the leaf block, we have to find its most
       recent update.  That tells us the leaf block revision number.
       Then we have to backtrack through all the data on all the
       nodes to find out what revisions were done to that leaf by
       all the nodes.  From there, we have to backtrack the glock
       operations to find out which node really had the DLM's
       lock to make updates and which one didn't.
   (d) Half the time, one of the other types of corruption is
       recreated, so our directory history does not help us to
       find what caused the corruption at all.
10. We continue to try to get a set of "clean" data from the
    instrumentation, which should tell us what's going wrong at
    the glock level.  It's my belief that this is either a glock
    problem or a dlm problem.  There's currently no way to know.
    I'm betting it's a glock problem.  If it's a dlm problem, we
    will need the help of the DLM experts, and likely more
    instrumentation will be needed at that level.  I really hope
    it doesn't come down to that.

Comment 37 Robert Peterson 2011-05-03 19:29:22 UTC
Created attachment 496619 [details]
Simpler recreation program

This simple program allows you to do simultaneous unlinkat
commands, which easily recreates the second problem (not the
directory leaf problem, but the one where function
gfs2_change_nlink is trying to decrement i_nlink, when, in
fact, i_nlink == 0).  I don't know if it will recreate the
directory leaf problem, but it might.

I've gathered what I hope is good glock history for this
problem and I'm starting to analyze it now.  Much simpler
program means much simpler set of data to analyze.

Comment 38 Robert Peterson 2011-05-03 19:41:00 UTC
Regarding comment #37: To recreate the problem I used my four
node roth cluster and did this:

1. Compile the program with:
   gcc -o unlinkat unlinkat.c
2. Mount a gfs2 file system as /mnt/gfs2/
3. Use a program like cssh to simultaneously type on multiple
   nodes:

   ./unlinkat 0 4 (on node 1)
   ./unlinkat 1 4 (on node 2)
   ./unlinkat 2 4 (on node 3)
   ./unlinkat 3 4 (on node 4)

After the files are all created in the directory
/mnt/gfs2/bigdir/, press <enter> on all nodes at the same time
so they all try to "unlinkat" a file simultaneously.  If it
does not recreate immediately, press <enter> again.
For me, the second problem recreated about my 7th or 8th
simultaneous unlinkat.

By "simpler set of data to analyze" I mean I've got somewhere
on the order of 10,000 log entries to sift through, rather than
the customer scenario that had 150,000 - 300,000 entries on
each and every node.
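
For reference, the reproducer is along these lines (a rough sketch
only; the attached unlinkat.c is the authoritative version, and the
file naming and file count here are my assumptions):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

#define DIR_PATH "/mnt/gfs2/bigdir"
#define NFILES   100

int main(int argc, char **argv)
{
        int node  = argc > 1 ? atoi(argv[1]) : 0;   /* this node's index */
        int nodes = argc > 2 ? atoi(argv[2]) : 1;   /* total node count  */
        char path[64];
        int i;

        mkdir(DIR_PATH, 0755);                      /* ok if it exists   */
        for (i = 0; i < NFILES; i++) {              /* each node creates */
                if (i % nodes != node)              /* its share         */
                        continue;
                snprintf(path, sizeof(path), DIR_PATH "/file%03d", i);
                close(open(path, O_CREAT | O_WRONLY, 0644));
        }

        for (i = 0; i < NFILES; i++) {              /* one unlink per    */
                if (getchar() == EOF)               /* <enter> press;    */
                        break;                      /* all nodes race on */
                snprintf(path, sizeof(path), DIR_PATH "/file%03d", i);
                unlinkat(AT_FDCWD, path, 0);        /* the same name     */
        }
        return 0;
}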

Comment 39 Robert Peterson 2011-05-03 19:50:05 UTC
Created attachment 496626 [details]
Glock / Directory history tarball

Tarball of data gathered from unlinkat.c recreation program

Comment 44 Robert Peterson 2011-05-04 15:49:23 UTC
Created attachment 496831 [details]
Third patch

This patch combines three patches:
(1) The patch suggested by Steve in comment #42
(2) The patch Steve gave in bug #701577
(3) The patch we need that eliminates the "try lock" stuff.

The good news is that it completely fixes the "unlinkat"
problem and program I posted with comment #37, what I've
been calling the "second" or "change_nlink" problem.
With only the third patch (to eliminate the "try lock")
my test program could recreate an i_nlink problem on the
first few iterations of the test program.  With this combined
patch, I ran through a thousand iterations of the program with
no failures.

The bad news is that it does not fix the "first" problem,
which is the bitmap inconsistency problem.  With the customer
app, I'm still able to recreate this problem in about a
half-hour.

I have not tried Adam's script, nor my original test program,
writechaos, with the patch.

Since we're obviously dealing with two or more problems, should
we treat this as two bugs and open another bug record?  I hope
the customer would not view that negatively.

Comment 46 John Hadad 2011-05-04 16:38:47 UTC
Hi Bob.

If splitting the problem into two bug reports would result in a more efficient
path to complete resolution, we see no problem with this approach.

Thanks for checking with us.

John Hadad
Intec/CSG

Comment 48 RHEL Program Management 2011-05-05 09:29:38 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 51 Robert Peterson 2011-05-05 19:09:18 UTC
Created attachment 497195 [details]
The 5th RHEL5 patch

Here is the RHEL5 patch with Steve's revisions.  This has passed
the GFS2 coherency test on two clusters of various speeds.  It
has also passed several combinations of the customer's script,
plus a thousand iterations of the two reproducer programs.
It has not undergone our full battery of QE tests.  I'm going to
build this into a test kernel and post to my people page ASAP.
I'll post the link when it's ready, but there's lots of prep
work so it probably won't be today.

Comment 52 Robert Peterson 2011-05-06 02:03:06 UTC
Build 3304784 is complete.  I loaded the test kernel rpms
to my people page.  You can find them here:
http://people.redhat.com/rpeterso/Experimental/RHEL5.x/gfs2/kernel*
This kernel has been successful under some heavy cluster
workloads, including the aforementioned tests, but it has
not gone through the Red Hat cluster QE process, so there are
no promises.  It is, however, very promising.

Comment 54 John Hadad 2011-05-07 17:52:17 UTC
Hi Bob.

259 kernel test results.

The test ran for about 90 min at which point the system hung (processes stuck in 'D' state).
I collected the usual hang data and uploaded it to the FTP dropbox as gfs2_hang_259_May_6_11_RH_333124.tgz 
For reference (as you look at the data), the test ran from about 17:30 and the hang occurred around 19:00 (May 6).
None of the nodes panicked (as they did with the previous kernel/patch).

FYI, I re-created all the GFS2 filesystems before starting the test (as I always do).

I collected some performance data while the test was running so that I can compare to previous runs/kernels.
The rate at which the data was flowing through the system was close to 194 (around 200 sec/batch). Acceptable but a bit faster would significantly improve our chances. 
The delete rate was not good: 1.5 - 3.5 files/sec

I will keep the system in this state in case you need further info from it.

John H
Intec/CSG

Comment 58 Robert Peterson 2011-05-10 17:04:31 UTC
Created attachment 498105 [details]
Sixth version of the patch

This is the version I posted today.

Comment 59 Robert Peterson 2011-05-10 17:06:22 UTC
Posted for inclusion into 5.7.  Changing status to POST.

Comment 61 Jarod Wilson 2011-05-13 22:20:06 UTC
Patch(es) available in kernel-2.6.18-261.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

Comment 63 Alan Brown 2011-05-18 18:04:03 UTC
Bob - one thing not covered in this ticket is that the node(s) where the FS has withdrawn usually needs to be rebooted before the filesystem can be fscked - this is disruptive at best.

Can that part of the problem be addressed too?

Comment 64 Adam Drew 2011-05-18 18:10:39 UTC
(In reply to comment #63)
> Bob - one thing not covered in this ticket is that the node(s) where the FS has
> withdrawn usually needs to be rebooted before the filesystem can be fscked -
> this is disruptive at best.
> 
> Can that part of the problem be addressed too?

Hi Alan,

That would be a separate request for enhancement. This bug deals with a specific issue. If you are a Red Hat customer you should open a case with support stating that you'd like to request that we add a facility that allows us to unmount withdrawn GFS2 filesystems without a reboot.

Thanks

Comment 65 Nate Straz 2011-06-03 13:32:21 UTC
I haven't had much luck reproducing this.  Do we have more recent results from the customer?

Comment 68 Nate Straz 2011-06-13 14:38:02 UTC
Marking VERIFIED based on customer feedback.  I have not been able to reproduce this in house.

Comment 69 Martin Prpič 2011-07-12 11:51:55 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Multiple GFS2 nodes attempted to unlink, rename, or manipulate files at the same time, causing various forms of file system corruption, panics, and withdraws. This update adds multiple checks for dinode's i_nlink value to assure inode operations such as link, unlink, or rename no longer cause the aforementioned problems.

Comment 70 errata-xmlrpc 2011-07-21 09:58:06 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html

