Description of problem:

Experienced GFS2 hangs on kernel 2.6.18-248 in the course of working through bug 656032 (see comments 183 through 190). Once the traces were analyzed, it appears that the issue results from the work queue not calling run_workqueue. From comment 190 in 656032:

I've been analyzing the data Adam mentioned in comment #183. The problem does not appear to be related to the patches I've made for these bugs. In fact, I tend to believe it's not a GFS2 bug at all, but a bug in the workqueue kernel code. The data Adam provided seems to be enough to see what's going on to some extent. This is what I see:

All nodes are waiting for glock 2/4c8a8a. That lock is mastered on ibmblade03, which only knows this about the glock:

G: s:EX n:2/4c8a8a f:yq t:EX d:SH/0 l:0 a:0 r:3 m:50
 I: n:1346036/5016202 t:4 f:0x10 d:0x00000003 s:2048/2048

However, the dlm knows this:

Resource ffff81036604ac80 Name (len=24) " 2 4c8a8a"
Master Copy
Granted Queue
018744f4 EX
Conversion Queue
Waiting Queue
027f45a8 -- (PR) Remote: 2 012444cc
0134448f -- (PR) Remote: 1 022138b0
036e454d -- (PR) Remote: 4 000243df

The dlm "_locks" file has:

id      nodeid remid   pid   xid exflags flags sts grmode rqmode time_ms   r_nodeid r_len r_name
18744f4 0      0       8652  0   0       0     2   5      -1     0         0        24    " 2 4c8a8a"
27f45a8 2      12444cc 8003  0   0       10000 1   -1     3      209982009 0        24    " 2 4c8a8a"
134448f 1      22138b0 21045 0   0       10000 1   -1     3      209979505 0        24    " 2 4c8a8a"
36e454d 4      243df   6839  0   0       10000 1   -1     3      209972811 0        24    " 2 4c8a8a"

So dlm says the glock is being held by pid 8652 on node3. That pid isn't even in gfs2:

glock_workque S ffff810001015120     0  8652    281          8653  8651 (L-TLB)
 ffff81081ad65e70 0000000000000046 0000000000000001 ffffffff8002d2c9
 0000000300000000 000000000000000a ffff81082f5be860 ffff81011cb9d080
 000179b6cc403da2 00000000000008da ffff81082f5bea48 000000028002e45b
Call Trace:
 [<ffffffff8002d2c9>] wake_up_bit+0x11/0x22
 [<ffffffff800a26f0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8004a2ac>] worker_thread+0x0/0x122
 [<ffffffff8004a36c>] worker_thread+0xc0/0x122
 [<ffffffff8008e433>] default_wake_function+0x0/0xe
 [<ffffffff800a26f0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800a26f0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032b9b>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a26f0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032a9d>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

This tells me either (1) the work queue code has not yet called run_workqueue, which ought to have called function glock_work_func, or (2) function glock_work_func has finished after the work was queued without having done anything. Case (1) would be a bug in the workqueue code. Case (2) could still be a gfs2 bug. At any rate, this should likely be treated as a new bug. As I've mentioned in older bug records, there has been a lot of work and changes to the kernel's workqueue code to solve bugs like this since 2.6.18 was first released. Some of these have been ported, but many have not.

Version-Release number of selected component (if applicable):
2.6.18-248

How reproducible:
Can reproduce at the customer site after running for approximately 4 days.

Steps to Reproduce:
1. Run the customer workload: multiple nodes accessing the same GFS2 filesystem. Nodes perform jobs processing data on disk. One node assumes the task of deleting files en masse from the shared storage.
2. After 4+ days of intense activity a sudden hang occurs where all PIDs accessing GFS2 fall into D state.
3. Look at the call traces from hung_task_timeout, the glocks, and the dlm data, and observe the data mentioned above.

Actual results:
GFS2 gets hung up waiting for glock_work_func to run, but it either already ran and did nothing, or did not run because run_workqueue was never called.

Expected results:
The GFS2 work queue proceeds without issue.

Additional info:
Further data on the issue. No specialized lock_dlm messages:

[adrew@adrew-virts MAr23]$ grep lock_dlm dmsg/*
dmsg/ibmblade01_dmsg: [<ffffffff88a953d0>] :lock_dlm:gdlm_do_unlock+0x3b/0x84
dmsg/ibmblade01_dmsg: [<ffffffff88a953d0>] :lock_dlm:gdlm_do_unlock+0x3b/0x84
dmsg/ibmblade01_dmsg: [<ffffffff88a952c1>] :lock_dlm:gdlm_bast+0x0/0xd4
dmsg/ibmblade01_dmsg:lock_dlm S ffff810001004420 0 11426 281 11427 11339 (L-TLB)
dmsg/ibmblade01_dmsg: [<ffffffff88a967aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade01_dmsg: [<ffffffff88a96882>] :lock_dlm:gdlm_thread+0xd8/0x1c6
dmsg/ibmblade01_dmsg: [<ffffffff88a967aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade01_dmsg:lock_dlm S 0000000000000282 0 11456 281 11457 11448 (L-TLB)
dmsg/ibmblade01_dmsg: [<ffffffff88a967aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade01_dmsg: [<ffffffff88a96882>] :lock_dlm:gdlm_thread+0xd8/0x1c6
dmsg/ibmblade01_dmsg: [<ffffffff88a967aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade01_dmsg:lock_dlm S ffff810001004420 0 11488 281 11490 11480 (L-TLB)
dmsg/ibmblade01_dmsg: [<ffffffff88a967aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade01_dmsg: [<ffffffff88a96882>] :lock_dlm:gdlm_thread+0xd8/0x1c6
dmsg/ibmblade01_dmsg: [<ffffffff88a967aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade01_dmsg: [<ffffffff88a95594>] :lock_dlm:gdlm_ast+0x0/0x313
dmsg/ibmblade02_dmsg:GFS2: fsid=: Trying to join cluster "lock_dlm", "rhel55:appl01"
dmsg/ibmblade02_dmsg:GFS2: fsid=: Trying to join cluster "lock_dlm", "rhel55:data01"
dmsg/ibmblade02_dmsg:GFS2: fsid=: Trying to join cluster "lock_dlm", "rhel55:data02"
dmsg/ibmblade02_dmsg: [<ffffffff889ed3d0>] :lock_dlm:gdlm_do_unlock+0x3b/0x84
dmsg/ibmblade02_dmsg: [<ffffffff889ed3d0>] :lock_dlm:gdlm_do_unlock+0x3b/0x84
dmsg/ibmblade02_dmsg: [<ffffffff889ed3d0>] :lock_dlm:gdlm_do_unlock+0x3b/0x84
dmsg/ibmblade02_dmsg: [<ffffffff889ed2c1>] :lock_dlm:gdlm_bast+0x0/0xd4
dmsg/ibmblade02_dmsg:lock_dlm S ffff810001004420 0 8271 281 8272 8192 (L-TLB)
dmsg/ibmblade02_dmsg: [<ffffffff889ee7aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade02_dmsg: [<ffffffff889ee882>] :lock_dlm:gdlm_thread+0xd8/0x1c6
dmsg/ibmblade02_dmsg: [<ffffffff889ee7aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade02_dmsg:lock_dlm S ffff81000100caa0 0 8297 281 8298 8283 (L-TLB)
dmsg/ibmblade02_dmsg: [<ffffffff889ee7aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade02_dmsg: [<ffffffff889ee882>] :lock_dlm:gdlm_thread+0xd8/0x1c6
dmsg/ibmblade02_dmsg: [<ffffffff889ee7aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade02_dmsg:lock_dlm S ffff81000100caa0 0 8321 281 8322 8311 (L-TLB)
dmsg/ibmblade02_dmsg: [<ffffffff889ee7aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade02_dmsg: [<ffffffff889ee882>] :lock_dlm:gdlm_thread+0xd8/0x1c6
dmsg/ibmblade02_dmsg: [<ffffffff889ee7aa>] :lock_dlm:gdlm_thread+0x0/0x1c6
dmsg/ibmblade03_dmsg:GFS2: fsid=: Trying to join cluster "lock_dlm", "rhel55:appl01"
dmsg/ibmblade03_dmsg:GFS2: fsid=: Trying to join cluster "lock_dlm", "rhel55:data01"
dmsg/ibmblade03_dmsg:GFS2: fsid=: Trying to join cluster "lock_dlm", "rhel55:data02"
dmsg/ibmblade03_dmsg: [<ffffffff88a973d0>] :lock_dlm:gdlm_do_unlock+0x3b/0x84
dmsg/ibmblade03_dmsg: [<ffffffff88a972c1>]
:lock_dlm:gdlm_bast+0x0/0xd4 dmsg/ibmblade03_dmsg:lock_dlm S ffff810001004420 0 8919 281 8921 8840 (L-TLB) dmsg/ibmblade03_dmsg: [<ffffffff88a987aa>] :lock_dlm:gdlm_thread+0x0/0x1c6 dmsg/ibmblade03_dmsg: [<ffffffff88a98882>] :lock_dlm:gdlm_thread+0xd8/0x1c6 dmsg/ibmblade03_dmsg: [<ffffffff88a987aa>] :lock_dlm:gdlm_thread+0x0/0x1c6 dmsg/ibmblade03_dmsg:lock_dlm S 0000000000000282 0 8943 281 8945 8933 (L-TLB) dmsg/ibmblade03_dmsg: [<ffffffff88a987aa>] :lock_dlm:gdlm_thread+0x0/0x1c6 dmsg/ibmblade03_dmsg: [<ffffffff88a98882>] :lock_dlm:gdlm_thread+0xd8/0x1c6 dmsg/ibmblade03_dmsg: [<ffffffff88a987aa>] :lock_dlm:gdlm_thread+0x0/0x1c6 dmsg/ibmblade03_dmsg:lock_dlm S 0000000000000282 0 8967 281 8968 8958 (L-TLB) dmsg/ibmblade03_dmsg: [<ffffffff88a987aa>] :lock_dlm:gdlm_thread+0x0/0x1c6 dmsg/ibmblade03_dmsg: [<ffffffff88a98882>] :lock_dlm:gdlm_thread+0xd8/0x1c6 dmsg/ibmblade03_dmsg: [<ffffffff88a987aa>] :lock_dlm:gdlm_thread+0x0/0x1c6 dmsg/ibmblade04_dmsg:GFS2: fsid=: Trying to join cluster "lock_dlm", "rhel55:appl01" dmsg/ibmblade04_dmsg:GFS2: fsid=: Trying to join cluster "lock_dlm", "rhel55:data01" dmsg/ibmblade04_dmsg:GFS2: fsid=: Trying to join cluster "lock_dlm", "rhel55:data02" dmsg/ibmblade04_dmsg: [<ffffffff88a983d0>] :lock_dlm:gdlm_do_unlock+0x3b/0x84 dmsg/ibmblade04_dmsg: [<ffffffff88a983d0>] :lock_dlm:gdlm_do_unlock+0x3b/0x84 dmsg/ibmblade04_dmsg: [<ffffffff88a983d0>] :lock_dlm:gdlm_do_unlock+0x3b/0x84 dmsg/ibmblade04_dmsg: [<ffffffff88a982c1>] :lock_dlm:gdlm_bast+0x0/0xd4 dmsg/ibmblade04_dmsg:lock_dlm S 0000000000000282 0 8713 281 8714 8633 (L-TLB) dmsg/ibmblade04_dmsg: [<ffffffff88a997aa>] :lock_dlm:gdlm_thread+0x0/0x1c6 dmsg/ibmblade04_dmsg: [<ffffffff88a99882>] :lock_dlm:gdlm_thread+0xd8/0x1c6 dmsg/ibmblade04_dmsg: [<ffffffff88a997aa>] :lock_dlm:gdlm_thread+0x0/0x1c6 dmsg/ibmblade04_dmsg:lock_dlm S ffff810001004420 0 8739 281 8740 8724 (L-TLB) dmsg/ibmblade04_dmsg: [<ffffffff88a997aa>] :lock_dlm:gdlm_thread+0x0/0x1c6 dmsg/ibmblade04_dmsg: [<ffffffff88a99882>] :lock_dlm:gdlm_thread+0xd8/0x1c6 dmsg/ibmblade04_dmsg: [<ffffffff88a997aa>] :lock_dlm:gdlm_thread+0x0/0x1c6 dmsg/ibmblade04_dmsg: [<ffffffff88a98265>] :lock_dlm:gdlm_put_lock+0x0/0x5 dmsg/ibmblade04_dmsg:lock_dlm S ffff810001004420 0 8763 281 8764 8753 (L-TLB) dmsg/ibmblade04_dmsg: [<ffffffff88a997aa>] :lock_dlm:gdlm_thread+0x0/0x1c6 dmsg/ibmblade04_dmsg: [<ffffffff88a99882>] :lock_dlm:gdlm_thread+0xd8/0x1c6 dmsg/ibmblade04_dmsg: [<ffffffff88a997aa>] :lock_dlm:gdlm_thread+0x0/0x1c6 Here's where we hung up first: Mar 18 14:52:16 ibmblade01 kernel: dcs_operation D ffff81011cab3cf0 0 21045 19117 21053 (NOTLB) Mar 18 14:52:16 ibmblade01 kernel: ffff81056a3b3ce8 0000000000000086 0000000000000000 ffff8108216b6800 Mar 18 14:52:16 ibmblade01 kernel: 0000000000000018 0000000000000009 ffff8100549ce860 ffff810061b8f7a0 Mar 18 14:52:16 ibmblade01 kernel: 00008eee45f4237d 0000000000002834 ffff8100549cea48 00000001889f26d8 Mar 18 14:52:16 ibmblade01 kernel: Call Trace: Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff889f0d68>] :dlm:dlm_lock+0x117/0x129 Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff88a95594>] :lock_dlm:gdlm_ast+0x0/0x313 Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff88a1a0af>] :gfs2:just_schedule+0x0/0xd Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff88a1a0b8>] :gfs2:just_schedule+0x9/0xd Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff800639fa>] __wait_on_bit+0x40/0x6e Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff88a1a0af>] :gfs2:just_schedule+0x0/0xd Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff80063a94>] 
out_of_line_wait_on_bit+0x6c/0x78 Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff800a2936>] wake_bit_function+0x0/0x23 Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff88a1a06e>] :gfs2:gfs2_glock_wait+0x37/0x78 Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff88a28653>] :gfs2:gfs2_getattr+0x85/0xc4 Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff88a2864b>] :gfs2:gfs2_getattr+0x7d/0xc4 Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff8000e4c0>] vfs_getattr+0x2d/0xa9 Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff80028a95>] vfs_stat_fd+0x32/0x4a Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff80062ff2>] thread_return+0x62/0xfe Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff8003a2b4>] fcntl_setlk+0x243/0x273 Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff80023851>] sys_newstat+0x19/0x31 Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff8005d229>] tracesys+0x71/0xe0 Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0 Same PID days later: Mar 21 14:39:55 ibmblade01 kernel: dcs_operation D ffff81011cab3cf0 0 21045 19117 21061 (NOTLB) Mar 21 14:39:55 ibmblade01 kernel: ffff81056a3b3ce8 0000000000000086 0000000000000000 ffff8108216b6800 Mar 21 14:39:55 ibmblade01 kernel: 0000000000000018 0000000000000009 ffff8100549ce860 ffff810061b8f7a0 Mar 21 14:39:55 ibmblade01 kernel: 00008eee45f4237d 0000000000002834 ffff8100549cea48 00000001889f26d8 Mar 21 14:39:55 ibmblade01 kernel: Call Trace: Mar 21 14:39:55 ibmblade01 kernel: [<ffffffff889f0d68>] :dlm:dlm_lock+0x117/0x129 Mar 21 14:39:55 ibmblade01 kernel: [<ffffffff88a95594>] :lock_dlm:gdlm_ast+0x0/0x313 Mar 21 14:39:55 ibmblade01 kernel: [<ffffffff88a1a0af>] :gfs2:just_schedule+0x0/0xd Mar 21 14:39:55 ibmblade01 kernel: [<ffffffff88a1a0b8>] :gfs2:just_schedule+0x9/0xd Mar 21 14:39:55 ibmblade01 kernel: [<ffffffff800639fa>] __wait_on_bit+0x40/0x6e Mar 21 14:39:55 ibmblade01 kernel: [<ffffffff88a1a0af>] :gfs2:just_schedule+0x0/0xd Mar 21 14:39:56 ibmblade01 kernel: [<ffffffff80063a94>] out_of_line_wait_on_bit+0x6c/0x78 Mar 21 14:39:56 ibmblade01 kernel: [<ffffffff800a2936>] wake_bit_function+0x0/0x23 Mar 21 14:39:56 ibmblade01 kernel: [<ffffffff88a1a06e>] :gfs2:gfs2_glock_wait+0x37/0x78 Mar 21 14:39:56 ibmblade01 kernel: [<ffffffff88a28653>] :gfs2:gfs2_getattr+0x85/0xc4 Mar 21 14:39:56 ibmblade01 kernel: [<ffffffff88a2864b>] :gfs2:gfs2_getattr+0x7d/0xc4 Mar 21 14:39:56 ibmblade01 kernel: [<ffffffff8000e4c0>] vfs_getattr+0x2d/0xa9 Mar 21 14:39:56 ibmblade01 kernel: [<ffffffff80028a95>] vfs_stat_fd+0x32/0x4a Mar 21 14:39:56 ibmblade01 kernel: [<ffffffff80062ff2>] thread_return+0x62/0xfe Mar 21 14:39:56 ibmblade01 kernel: [<ffffffff8003a2b4>] fcntl_setlk+0x243/0x273 Mar 21 14:39:56 ibmblade01 kernel: [<ffffffff80023851>] sys_newstat+0x19/0x31 Mar 21 14:39:56 ibmblade01 kernel: [<ffffffff8005d229>] tracesys+0x71/0xe0 Mar 21 14:39:56 ibmblade01 kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0 Same call trace so we're definitely hung up. 
Other nodes @ same time frame: Mar 18 14:50:58 ibmblade02 kernel: dcs_operation D ffff81000100caa0 0 1379 13868 1382 (NOTLB) Mar 18 14:50:58 ibmblade02 kernel: ffff8105b8b43ce8 0000000000000082 ffff81034bb414b0 ffffffff8000d5a8 Mar 18 14:50:58 ibmblade02 kernel: ffff8105b8b43c68 0000000000000009 ffff81043b6e8080 ffff81011cb8e100 Mar 18 14:50:58 ibmblade02 kernel: 00008e5876e301ce 0000000000021dd4 ffff81043b6e8268 00000001b8b43c68 Mar 18 14:50:58 ibmblade02 kernel: Call Trace: Mar 18 14:50:58 ibmblade02 kernel: [<ffffffff8000d5a8>] dput+0x2c/0x114 Mar 18 14:50:58 ibmblade02 kernel: [<ffffffff889720af>] :gfs2:just_schedule+0x0/0xd Mar 18 14:50:58 ibmblade02 kernel: [<ffffffff889720b8>] :gfs2:just_schedule+0x9/0xd Mar 18 14:50:58 ibmblade02 kernel: [<ffffffff800639fa>] __wait_on_bit+0x40/0x6e Mar 18 14:50:58 ibmblade02 kernel: [<ffffffff889720af>] :gfs2:just_schedule+0x0/0xd Mar 18 14:50:58 ibmblade02 kernel: [<ffffffff80063a94>] out_of_line_wait_on_bit+0x6c/0x78 Mar 18 14:50:58 ibmblade02 kernel: [<ffffffff800a2936>] wake_bit_function+0x0/0x23 Mar 18 14:50:58 ibmblade02 kernel: [<ffffffff8002cc76>] mntput_no_expire+0x19/0x89 Mar 18 14:50:58 ibmblade02 kernel: [<ffffffff8897206e>] :gfs2:gfs2_glock_wait+0x37/0x78 Mar 18 14:50:58 ibmblade02 kernel: [<ffffffff88980653>] :gfs2:gfs2_getattr+0x85/0xc4 Mar 18 14:50:58 ibmblade02 kernel: [<ffffffff8898064b>] :gfs2:gfs2_getattr+0x7d/0xc4 Mar 18 14:50:58 ibmblade02 kernel: [<ffffffff8000e4c0>] vfs_getattr+0x2d/0xa9 Mar 18 14:50:58 ibmblade02 kernel: [<ffffffff80028a95>] vfs_stat_fd+0x32/0x4a Mar 18 14:50:58 ibmblade02 kernel: [<ffffffff8003a2b4>] fcntl_setlk+0x243/0x273 Mar 18 14:50:58 ibmblade02 kernel: [<ffffffff80023851>] sys_newstat+0x19/0x31 Mar 18 14:50:58 ibmblade02 kernel: [<ffffffff8005d229>] tracesys+0x71/0xe0 Mar 18 14:50:59 ibmblade02 kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0 Mar 18 14:52:06 ibmblade04 kernel: dcs_operation D ffff810001015120 0 6839 4799 6845 (NOTLB) Mar 18 14:52:06 ibmblade04 kernel: ffff8102b312fbb8 0000000000000086 0000000000000000 ffff81081a5ff800 Mar 18 14:52:06 ibmblade04 kernel: 0000000000000018 0000000000000009 ffff81008c0eb040 ffff81011cb9d080 Mar 18 14:52:06 ibmblade04 kernel: 00008eefac0d05f9 00000000000004ff ffff81008c0eb228 00000002889f56d8 Mar 18 14:52:06 ibmblade04 kernel: Call Trace: Mar 18 14:52:06 ibmblade04 kernel: [<ffffffff88a1d0af>] :gfs2:just_schedule+0x0/0xd Mar 18 14:52:06 ibmblade04 kernel: [<ffffffff88a1d0b8>] :gfs2:just_schedule+0x9/0xd Mar 18 14:52:06 ibmblade04 kernel: [<ffffffff800639fa>] __wait_on_bit+0x40/0x6e Mar 18 14:52:06 ibmblade04 kernel: [<ffffffff88a1d0af>] :gfs2:just_schedule+0x0/0xd Mar 18 14:52:06 ibmblade04 kernel: [<ffffffff80063a94>] out_of_line_wait_on_bit+0x6c/0x78 Mar 18 14:52:06 ibmblade04 kernel: [<ffffffff800a2936>] wake_bit_function+0x0/0x23 Mar 18 14:52:06 ibmblade04 kernel: [<ffffffff88a1d06e>] :gfs2:gfs2_glock_wait+0x37/0x78 Mar 18 14:52:06 ibmblade04 kernel: [<ffffffff88a2d2a8>] :gfs2:gfs2_permission+0x83/0xd3 Mar 18 14:52:06 ibmblade04 kernel: [<ffffffff88a2d2a0>] :gfs2:gfs2_permission+0x7b/0xd3 Mar 18 14:52:06 ibmblade04 kernel: [<ffffffff8000d081>] do_lookup+0x65/0x1e6 Mar 18 14:52:06 ibmblade04 kernel: [<ffffffff8000db08>] permission+0x81/0xc8 Mar 18 14:52:06 ibmblade04 kernel: [<ffffffff80009a1c>] __link_path_walk+0x173/0xfd1 Mar 18 14:52:06 ibmblade04 kernel: [<ffffffff800a2908>] autoremove_wake_function+0x0/0x2e Mar 18 14:52:06 ibmblade04 kernel: [<ffffffff8000eb7d>] link_path_walk+0x45/0xb8 Mar 18 14:52:06 ibmblade04 kernel: [<ffffffff8000ce9c>] 
do_path_lookup+0x294/0x310
Mar 18 14:52:06 ibmblade04 kernel:  [<ffffffff8001298f>] getname+0x15b/0x1c2
Mar 18 14:52:06 ibmblade04 kernel:  [<ffffffff80023b21>] __user_walk_fd+0x37/0x4c
Mar 18 14:52:06 ibmblade04 kernel:  [<ffffffff80028a7e>] vfs_stat_fd+0x1b/0x4a
Mar 18 14:52:06 ibmblade04 kernel:  [<ffffffff8000e60b>] free_pages_and_swap_cache+0x67/0x7e
Mar 18 14:52:06 ibmblade04 kernel:  [<ffffffff80023851>] sys_newstat+0x19/0x31
Mar 18 14:52:07 ibmblade04 kernel:  [<ffffffff8005d229>] tracesys+0x71/0xe0
Mar 18 14:52:07 ibmblade04 kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0

Glocks:

[adrew@adrew-virts gfs2_hang_248_w_DLM_ko_Mar_19_11_RH_333124]$ grep 'p:21045' ibmblade01-hangdata/gfs2/rhel55\:data01/glocks | egrep -i 'f:aw|f:ew|f:w'
 H: s:SH f:aW e:0 p:21045 [dcs_operation_m] gfs2_getattr+0x7d/0xc4 [gfs2]
[adrew@adrew-virts gfs2_hang_248_w_DLM_ko_Mar_19_11_RH_333124]$ grep 'p:21045' ibmblade01-hangdata/gfs2/rhel55\:data02/glocks | egrep -i 'f:aw|f:ew|f:w'
[adrew@adrew-virts gfs2_hang_248_w_DLM_ko_Mar_19_11_RH_333124]$

Only 1 waiter for the pid that hung up first:

G: s:UN n:2/4c8a8a f:lq t:SH d:EX/0 l:0 a:0 r:12 m:0
 H: s:SH f:aW e:0 p:21045 [dcs_operation_m] gfs2_getattr+0x7d/0xc4 [gfs2]
 H: s:SH f:aW e:0 p:21076 [dcs_operation_m] gfs2_getattr+0x7d/0xc4 [gfs2]
 H: s:SH f:aW e:0 p:21068 [dcs_operation_m] gfs2_getattr+0x7d/0xc4 [gfs2]
 H: s:SH f:aW e:0 p:21061 [dcs_operation_m] gfs2_getattr+0x7d/0xc4 [gfs2]
 H: s:SH f:aW e:0 p:13484 [dcs_operation_m] gfs2_getattr+0x7d/0xc4 [gfs2]
 H: s:SH f:aW e:0 p:16445 [dcs_filreg_mgr_] gfs2_permission+0x7b/0xd3 [gfs2]
 H: s:SH f:aW e:0 p:16446 [dcs_filreg_mgr_] gfs2_permission+0x7b/0xd3 [gfs2]
 H: s:SH f:aW e:0 p:16447 [dcs_filreg_mgr_] gfs2_permission+0x7b/0xd3 [gfs2]
 H: s:SH f:aW e:0 p:16448 [dcs_filreg_mgr_] gfs2_permission+0x7b/0xd3 [gfs2]

That looks like our culprit then. Just in case it is useful:

G: s:SH n:5/4c8a8a f:q t:SH d:EX/0 l:0 a:0 r:3 m:200
 H: s:SH f:EH e:0 p:11954 [(ended)] gfs2_inode_lookup+0x11a/0x284 [gfs2]

I checked the full call traces in the logs on ibmblade01. Every single one of those waiters on 2/4c8a8a is hung up. 21045 is in dlm:dlm_lock from gfs2_glock_wait, but everything else is in dput & just_schedule.

Mar 18 14:52:16 ibmblade01 kernel: INFO: task dcs_operation_m:21045 blocked for more than 120 seconds.
Mar 18 14:52:16 ibmblade01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 18 14:52:16 ibmblade01 kernel: dcs_operation D ffff81011cab3cf0 0 21045 19117 21053 (NOTLB) Mar 18 14:52:16 ibmblade01 kernel: ffff81056a3b3ce8 0000000000000086 0000000000000000 ffff8108216b6800 Mar 18 14:52:16 ibmblade01 kernel: 0000000000000018 0000000000000009 ffff8100549ce860 ffff810061b8f7a0 Mar 18 14:52:16 ibmblade01 kernel: 00008eee45f4237d 0000000000002834 ffff8100549cea48 00000001889f26d8 Mar 18 14:52:16 ibmblade01 kernel: Call Trace: Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff889f0d68>] :dlm:dlm_lock+0x117/0x129 Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff88a95594>] :lock_dlm:gdlm_ast+0x0/0x313 Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff88a1a0af>] :gfs2:just_schedule+0x0/0xd Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff88a1a0b8>] :gfs2:just_schedule+0x9/0xd Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff800639fa>] __wait_on_bit+0x40/0x6e Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff88a1a0af>] :gfs2:just_schedule+0x0/0xd Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff80063a94>] out_of_line_wait_on_bit+0x6c/0x78 Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff800a2936>] wake_bit_function+0x0/0x23 Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff88a1a06e>] :gfs2:gfs2_glock_wait+0x37/0x78 Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff88a28653>] :gfs2:gfs2_getattr+0x85/0xc4 Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff88a2864b>] :gfs2:gfs2_getattr+0x7d/0xc4 Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff8000e4c0>] vfs_getattr+0x2d/0xa9 Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff80028a95>] vfs_stat_fd+0x32/0x4a Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff80062ff2>] thread_return+0x62/0xfe Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff8003a2b4>] fcntl_setlk+0x243/0x273 Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff80023851>] sys_newstat+0x19/0x31 Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff8005d229>] tracesys+0x71/0xe0 Mar 18 14:52:16 ibmblade01 kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0 Mar 18 14:52:16 ibmblade01 kernel: Mar 18 14:52:16 ibmblade01 kernel: INFO: task dcs_operation_m:21061 blocked for more than 120 seconds. Mar 18 14:52:16 ibmblade01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
Mar 18 14:52:16 ibmblade01 kernel: dcs_operation D ffff810001015120 0 21061 19117 21068 21055 (NOTLB) Mar 18 14:52:16 ibmblade01 kernel: ffff81059a2f3ce8 0000000000000082 ffff810465ddcb70 ffffffff8000d5a8 Mar 18 14:52:17 ibmblade01 kernel: ffff81059a2f3c68 0000000000000009 ffff810061b8f7a0 ffff81011cb9d080 Mar 18 14:52:17 ibmblade01 kernel: 00008eefe41c6ead 0000000000012302 ffff810061b8f988 000000029a2f3c68 Mar 18 14:52:17 ibmblade01 kernel: Call Trace: Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff8000d5a8>] dput+0x2c/0x114 Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff88a1a0af>] :gfs2:just_schedule+0x0/0xd Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff88a1a0b8>] :gfs2:just_schedule+0x9/0xd Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff800639fa>] __wait_on_bit+0x40/0x6e Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff88a1a0af>] :gfs2:just_schedule+0x0/0xd Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff80063a94>] out_of_line_wait_on_bit+0x6c/0x78 Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff800a2936>] wake_bit_function+0x0/0x23 Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff8002cc76>] mntput_no_expire+0x19/0x89 Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff88a1a06e>] :gfs2:gfs2_glock_wait+0x37/0x78 Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff88a28653>] :gfs2:gfs2_getattr+0x85/0xc4 Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff88a2864b>] :gfs2:gfs2_getattr+0x7d/0xc4 Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff8000e4c0>] vfs_getattr+0x2d/0xa9 Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff80028a95>] vfs_stat_fd+0x32/0x4a Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff8003a2b4>] fcntl_setlk+0x243/0x273 Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff80023851>] sys_newstat+0x19/0x31 Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff8005d229>] tracesys+0x71/0xe0 Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0 Mar 18 14:52:17 ibmblade01 kernel: Mar 18 14:52:17 ibmblade01 kernel: INFO: task dcs_operation_m:21068 blocked for more than 120 seconds. Mar 18 14:52:17 ibmblade01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
Mar 18 14:52:17 ibmblade01 kernel: dcs_operation D ffff810001015120 0 21068 19117 21076 21061 (NOTLB) Mar 18 14:52:17 ibmblade01 kernel: ffff81011d999ce8 0000000000000082 ffff810465ddcb70 ffffffff8000d5a8 Mar 18 14:52:17 ibmblade01 kernel: ffff81011d999c68 000000000000000a ffff8103e29b2040 ffff81011cb9d080 Mar 18 14:52:17 ibmblade01 kernel: 00008eefddf035e3 000000000000f41b ffff8103e29b2228 000000021d999c68 Mar 18 14:52:17 ibmblade01 kernel: Call Trace: Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff8000d5a8>] dput+0x2c/0x114 Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff88a1a0af>] :gfs2:just_schedule+0x0/0xd Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff88a1a0b8>] :gfs2:just_schedule+0x9/0xd Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff800639fa>] __wait_on_bit+0x40/0x6e Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff88a1a0af>] :gfs2:just_schedule+0x0/0xd Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff80063a94>] out_of_line_wait_on_bit+0x6c/0x78 Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff800a2936>] wake_bit_function+0x0/0x23 Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff8002cc76>] mntput_no_expire+0x19/0x89 Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff88a1a06e>] :gfs2:gfs2_glock_wait+0x37/0x78 Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff88a28653>] :gfs2:gfs2_getattr+0x85/0xc4 Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff88a2864b>] :gfs2:gfs2_getattr+0x7d/0xc4 Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff8000e4c0>] vfs_getattr+0x2d/0xa9 Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff80028a95>] vfs_stat_fd+0x32/0x4a Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff8000e60b>] free_pages_and_swap_cache+0x67/0x7e Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff80033987>] do_setitimer+0xd0/0x67b Mar 18 14:52:17 ibmblade01 kernel: [<ffffffff80028f23>] do_sigaction+0x76/0x199 Mar 18 14:52:18 ibmblade01 kernel: [<ffffffff80023851>] sys_newstat+0x19/0x31 Mar 18 14:52:18 ibmblade01 kernel: [<ffffffff8005d229>] tracesys+0x71/0xe0 Mar 18 14:52:18 ibmblade01 kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0 Mar 18 14:52:18 ibmblade01 kernel: Mar 18 14:52:18 ibmblade01 kernel: INFO: task dcs_operation_m:21076 blocked for more than 120 seconds. Mar 18 14:52:18 ibmblade01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
Mar 18 14:52:18 ibmblade01 kernel: dcs_operation D ffff81011cb9f0c0 0 21076 19117 21083 21068 (NOTLB) Mar 18 14:52:18 ibmblade01 kernel: ffff810345687ce8 0000000000000086 ffff810465ddcb70 ffffffff8000d5a8 Mar 18 14:52:18 ibmblade01 kernel: ffff810345687c68 0000000000000009 ffff8104040d1820 ffff81011cb9f0c0 Mar 18 14:52:18 ibmblade01 kernel: 00008eef6b271bc4 000000000000ab7d ffff8104040d1a08 0000000345687c68 Mar 18 14:52:18 ibmblade01 kernel: Call Trace: Mar 18 14:52:18 ibmblade01 kernel: [<ffffffff8000d5a8>] dput+0x2c/0x114 Mar 18 14:52:18 ibmblade01 kernel: [<ffffffff8000985a>] __d_lookup+0xb0/0xff Mar 18 14:52:18 ibmblade01 kernel: [<ffffffff88a1a0af>] :gfs2:just_schedule+0x0/0xd Mar 18 14:52:18 ibmblade01 kernel: [<ffffffff88a1a0b8>] :gfs2:just_schedule+0x9/0xd Mar 18 14:52:18 ibmblade01 kernel: [<ffffffff800639fa>] __wait_on_bit+0x40/0x6e Mar 18 14:52:18 ibmblade01 kernel: [<ffffffff88a1a0af>] :gfs2:just_schedule+0x0/0xd Mar 18 14:52:18 ibmblade01 kernel: [<ffffffff80063a94>] out_of_line_wait_on_bit+0x6c/0x78 Mar 18 14:52:18 ibmblade01 kernel: [<ffffffff800a2936>] wake_bit_function+0x0/0x23 Mar 18 14:52:18 ibmblade01 kernel: [<ffffffff8002cc76>] mntput_no_expire+0x19/0x89 Mar 18 14:52:18 ibmblade01 kernel: [<ffffffff88a1a06e>] :gfs2:gfs2_glock_wait+0x37/0x78 Mar 18 14:52:18 ibmblade01 kernel: [<ffffffff88a28653>] :gfs2:gfs2_getattr+0x85/0xc4 Mar 18 14:52:18 ibmblade01 kernel: [<ffffffff88a2864b>] :gfs2:gfs2_getattr+0x7d/0xc4 Mar 18 14:52:18 ibmblade01 kernel: [<ffffffff8000e4c0>] vfs_getattr+0x2d/0xa9 Mar 18 14:52:18 ibmblade01 kernel: [<ffffffff80028a95>] vfs_stat_fd+0x32/0x4a Mar 18 14:52:18 ibmblade01 kernel: [<ffffffff80062ff2>] thread_return+0x62/0xfe Mar 18 14:52:18 ibmblade01 kernel: [<ffffffff80023851>] sys_newstat+0x19/0x31 Mar 18 14:52:18 ibmblade01 kernel: [<ffffffff8005d229>] tracesys+0x71/0xe0 Mar 18 14:52:18 ibmblade01 kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0 Mar 18 14:52:18 ibmblade01 kernel: Mar 18 14:54:16 ibmblade01 kernel: INFO: task dcs_operation_m:21045 blocked for more than 120 seconds. Mar 18 14:54:16 ibmblade01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
Mar 18 14:54:16 ibmblade01 kernel: dcs_operation D ffff81011cab3cf0 0 21045 19117 21053 (NOTLB) Mar 18 14:54:16 ibmblade01 kernel: ffff81056a3b3ce8 0000000000000086 0000000000000000 ffff8108216b6800 Mar 18 14:54:16 ibmblade01 kernel: 0000000000000018 0000000000000009 ffff8100549ce860 ffff810061b8f7a0 Mar 18 14:54:16 ibmblade01 kernel: 00008eee45f4237d 0000000000002834 ffff8100549cea48 00000001889f26d8 Mar 18 14:54:16 ibmblade01 kernel: Call Trace: Mar 18 14:54:16 ibmblade01 kernel: [<ffffffff889f0d68>] :dlm:dlm_lock+0x117/0x129 Mar 18 14:54:16 ibmblade01 kernel: [<ffffffff88a95594>] :lock_dlm:gdlm_ast+0x0/0x313 Mar 18 14:54:16 ibmblade01 kernel: [<ffffffff88a1a0af>] :gfs2:just_schedule+0x0/0xd Mar 18 14:54:16 ibmblade01 kernel: [<ffffffff88a1a0b8>] :gfs2:just_schedule+0x9/0xd Mar 18 14:54:16 ibmblade01 kernel: [<ffffffff800639fa>] __wait_on_bit+0x40/0x6e Mar 18 14:54:16 ibmblade01 kernel: [<ffffffff88a1a0af>] :gfs2:just_schedule+0x0/0xd Mar 18 14:54:16 ibmblade01 kernel: [<ffffffff80063a94>] out_of_line_wait_on_bit+0x6c/0x78 Mar 18 14:54:16 ibmblade01 kernel: [<ffffffff800a2936>] wake_bit_function+0x0/0x23 Mar 18 14:54:16 ibmblade01 kernel: [<ffffffff88a1a06e>] :gfs2:gfs2_glock_wait+0x37/0x78 Mar 18 14:54:16 ibmblade01 kernel: [<ffffffff88a28653>] :gfs2:gfs2_getattr+0x85/0xc4 Mar 18 14:54:16 ibmblade01 kernel: [<ffffffff88a2864b>] :gfs2:gfs2_getattr+0x7d/0xc4 Mar 18 14:54:16 ibmblade01 kernel: [<ffffffff8000e4c0>] vfs_getattr+0x2d/0xa9 Mar 18 14:54:16 ibmblade01 kernel: [<ffffffff80028a95>] vfs_stat_fd+0x32/0x4a Mar 18 14:54:16 ibmblade01 kernel: [<ffffffff80062ff2>] thread_return+0x62/0xfe Mar 18 14:54:16 ibmblade01 kernel: [<ffffffff8003a2b4>] fcntl_setlk+0x243/0x273 Mar 18 14:54:16 ibmblade01 kernel: [<ffffffff80023851>] sys_newstat+0x19/0x31 Mar 18 14:54:16 ibmblade01 kernel: [<ffffffff8005d229>] tracesys+0x71/0xe0 Mar 18 14:54:16 ibmblade01 kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0 Mar 18 14:54:16 ibmblade01 kernel: Node ibmblade02: G: s:UN n:2/4c8a8a f:lq t:SH d:EX/0 l:0 a:0 r:6 m:50 H: s:SH f:W e:0 p:1412 [dcs_operation_m] gfs2_drevalidate+0xff/0x219 [gfs2] H: s:SH f:aW e:0 p:1379 [dcs_operation_m] gfs2_getattr+0x7d/0xc4 [gfs2] H: s:SH f:aW e:0 p:1382 [dcs_operation_m] gfs2_getattr+0x7d/0xc4 [gfs2] Node ibmblade03: G: s:EX n:2/4c8a8a f:yq t:EX d:SH/0 l:0 a:0 r:3 m:50 I: n:1346036/5016202 t:4 f:0x10 d:0x00000003 s:2048/2048 Node ibmblade04: G: s:UN n:2/4c8a8a f:lq t:SH d:EX/0 l:0 a:0 r:6 m:0 H: s:SH f:aW e:0 p:6839 [dcs_operation_m] gfs2_permission+0x7b/0xd3 [gfs2] H: s:SH f:aW e:0 p:7197 [dcs_output_mgr] gfs2_permission+0x7b/0xd3 [gfs2] H: s:SH f:aW e:0 p:11394 [dcs_operation_m] gfs2_getattr+0x7d/0xc4 [gfs2]
Created attachment 487410 [details]
writechaos.c

Good news: I believe I've recreated this problem on my cluster. I started with a fresh mkfs.gfs2. I used -J32, but I don't think that's important, since I'm not doing much with metadata. Then, simultaneously on all four nodes, I ran this little test program I wrote with these parameters:

time /home/bob/tools/writechaos /mnt/gfs2 4 1000000

The first three times I ran it, there was no hang. On the fourth attempt, node3 ended much quicker than the other nodes, and the other three nodes hung. Basically, all four nodes are competing for the same EX glock to write data to the same file. Somewhere along the line it gets hung up. Here are the final glock states for each node:

roth-01:
G: s:EX n:2/84a1 f:lDpyiq t:UN d:UN/3000 l:0 a:1 r:5 m:0
 H: s:EX f:W e:0 p:19242 [writechaos] gfs2_write_begin+0x61/0x368 [gfs2]
 I: n:25/33953 t:8 f:0x10 d:0x00000000 s:214944/214944

id      nodeid remid   pid   xid exflags flags sts grmode rqmode time_ms r_nodeid r_len r_name
32f0077 2      3e60017 26562 0   0       10000 2   5      -1     23      0        24    " 2 84a1"
28e006a 0      0       19015 0   0       0     1   -1     5      0       0        24    " 2 84a1"
30f0075 4      17d0027 26644 0   0       10000 1   -1     5      5       0        24    " 2 84a1"

Process 19015 is the glock_workqueue, same as the customer site.

roth-02:
G: s:UN n:2/84a1 f:lq t:EX d:EX/0 l:0 a:0 r:4 m:0
 H: s:EX f:W e:0 p:26779 [writechaos] gfs2_write_begin+0x61/0x368 [gfs2]

roth-03: none

roth-04:
G: s:EX n:2/84a1 f:lDpiq t:UN d:UN/7000 l:0 a:0 r:5 m:0
 H: s:EX f:W e:0 p:26863 [writechaos] gfs2_write_begin+0x61/0x368 [gfs2]
 I: n:25/33953 t:8 f:0x10 d:0x00000000 s:214944/214944

Now that I can recreate it, I can add sufficient instrumentation to solve this in a timely manner. BTW, each run of 1000000 writes only takes 2m30s.
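[Editor's note: the writechaos.c attachment is not reproduced here. Purely as an illustration of the kind of test described above (every node hammering the same shared file so they all fight over its one EX glock), a minimal sketch might look like the following. The shared file name, write size, and argument handling are assumptions, not taken from the actual attachment.]

/* Hypothetical sketch of a "writechaos"-style test: run the same binary on
 * every cluster node against the shared GFS2 mount; all nodes write to one
 * common file so each write forces the file's EX glock to bounce between
 * nodes.  Assumed usage: writechaos <mountpoint> <nodes> <writes>
 * (the <nodes> argument is accepted but unused in this simplified sketch).
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char path[4096], buf[512];
	long i, writes;
	int fd;

	if (argc != 4) {
		fprintf(stderr, "usage: %s <mountpoint> <nodes> <writes>\n", argv[0]);
		return 1;
	}
	writes = atol(argv[3]);
	snprintf(path, sizeof(path), "%s/chaosfile", argv[1]); /* shared file (assumed name) */
	memset(buf, 'x', sizeof(buf));

	fd = open(path, O_CREAT | O_WRONLY, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	for (i = 0; i < writes; i++) {
		/* Rewrite the same small region over and over; with several
		 * nodes doing this at once, every write contends for the
		 * inode's EX glock. */
		if (pwrite(fd, buf, sizeof(buf), (i % 64) * sizeof(buf)) < 0) {
			perror("pwrite");
			break;
		}
	}
	close(fd);
	return 0;
}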
Now that I can recreate the problem, I'm adding instrumentation. I added a glock flag, GLF_CB_OWED, to indicate that a callback is owed. I set the flag in function do_xmote right before we call gfs2_lm_lock, and I clear the flag in function blocking_cb and in the LM_CB_ASYNC case, to indicate we got the callback. After recreating the failure, I found the flag left on. It displays as "o":

G: s:UN n:2/84a1 f:lqo t:EX d:EX/0 l:0 a:0 r:4 m:0
 H: s:EX f:W e:0 p:4758 [writechaos] gfs2_write_begin+0x61/0x368 [gfs2]

So it seems as if Steve's intuition was right: we're missing a callback. The problem could be in dlm or lock_dlm. It's time to add Dave T to the cc list, and add instrumentation to lock_dlm.
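[Editor's note: for readers unfamiliar with this style of instrumentation, the idea is simply a "request outstanding" debug bit: set it when the asynchronous request is handed to the lock manager, clear it in the completion callback, and anything still set at dump time marks a callback that never arrived. A simplified userspace analogue of that pattern, assuming invented names and not the actual GFS2/dlm code, is sketched below.]

/* Simplified userspace analogue of the GLF_CB_OWED instrumentation idea:
 * set a "callback owed" bit when an asynchronous request is issued and
 * clear it in the callback; a bit still set later indicates a lost callback.
 * Illustration only -- not GFS2 or dlm code.
 */
#include <stdbool.h>
#include <stdio.h>

struct fake_lock {
	const char *name;
	bool cb_owed;              /* analogue of the GLF_CB_OWED glock flag */
};

/* Pretend to hand the request to the lock manager. */
static void issue_async_request(struct fake_lock *lk,
				void (*complete)(struct fake_lock *))
{
	lk->cb_owed = true;        /* set right before the request goes out */
	if (complete)
		complete(lk);      /* a lost callback simply never runs */
}

static void completion_callback(struct fake_lock *lk)
{
	lk->cb_owed = false;       /* cleared when the callback arrives */
}

int main(void)
{
	struct fake_lock ok   = { "2/84a1 (callback delivered)", false };
	struct fake_lock lost = { "2/84a1 (callback lost)", false };

	issue_async_request(&ok, completion_callback);
	issue_async_request(&lost, NULL);   /* simulate the missing callback */

	printf("%s: cb_owed=%d\n", ok.name, ok.cb_owed);
	printf("%s: cb_owed=%d\n", lost.name, lost.cb_owed);
	return 0;
}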
Can you move the flag down the stack a bit? If we had a flag that was set when the dlm returned success for a shared/deferred/exclusive lock mode, and cleared for the null or unlocked lock modes, then we could also clear it when a demote request arrives. If that was left on, then it would confirm what is going on. It is a big step forward that we can recreate the issue locally, at least.
Update from the tests I performed today: In the scenario I recreated, gfs2 was, with each run of the test program, adjusting its hold time on the critical glock. When the value got down to zero (0), it started doing lots of journal log flushes, which tied up the system doing IO. So it was not hung at all; it was just making VERY slow progress due to all the IO. Dumping the locks appeared to show it hung, but in reality we were watching a cluster in motion (not hung), so the glock values were not stable. The excessive IO was making it look hung. In my case, the test program went from 2.5 minutes to 40+ minutes per run!

I created a special version of gfs2 that does two things: (1) Adds instrumentation to lock_dlm to detect any cases in which dlm fails to send us a callback. (2) Changes the minimum hold time to 10. That makes my test program run longer. This is not a solution, but it may make things run a lot better. I still need to track down why we're doing so many log flushes as the value gets shorter.

The test modules are on my people page as gfs2.690555.2011mar25.ko and lock_dlm.690555.2011mar25.ko. They're compiled against the 2.6.18-248.el5.bz656032-x86_64 kernel.
Here is my (as yet unproven) theory of what's happening: Function do_xmote is called for a dinode glock. It submits a TRY_1CB lock to dlm, but doesn't get it due to "TRYFAILED". So it tries again in rapid succession, based on the shortness of the glock hold time value. With each try, it calls go_xmote_th, which is essentially inode_go_sync. Each call to inode_go_sync calls gfs2_log_flush, which does disk IO, followed by gfs2_ail_empty_gl, which calls gfs2_log_flush a second time for even more disk IO.

So if my theory is correct, this all goes back to the problem that function do_xmote sets the TRY_1CB bit. When we've tried to remove it in the past, everything seemed to blow up. It's long been a mystery that we've never had the luxury of time to solve, but now I think we are being forced to take the time to figure out why. Once we can get rid of the TRY lock in favor of a blocking lock, we should be able to shorten the hold time back to 1 or even 0 again.
Caught the panic on the serial console: ----------- [cut here ] --------- [please bite here ] --------- Kernel BUG at fs/locks.c:2080 invalid opcode: 0000 [1] SMP last sysfs file: /devices/platform/host1/session2/target1:0:0/1:0:0:1/timeout CPU 0 Modules linked in: autofs4 hidp rfcomm l2cap bluetooth dm_log_clustered(U) lock_dlm(U) gfs2(U) dlm configfs lockd sunrpc dm_round_robin sg be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy snd_intel8x0 snd_ac97_codec ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm i2c_piix4 snd_timer snd soundcore snd_page_alloc pcspkr virtio_net i2c_core tpm_tis 8139too virtio_balloon ide_cd 8139cp tpm mii serio_raw cdrom tpm_bios dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod virtio_blk virtio_pci virtio_ring virtio ext3 jbd uhci_hcd ohci_hcd ehci_hcd Pid: 5210, comm: python Tainted: G 2.6.18-248.el5.bz656032 #1 RIP: 0010:[<ffffffff8002766c>] [<ffffffff8002766c>] locks_remove_flock+0xe2/0x124 RSP: 0018:ffff810015abbda8 EFLAGS: 00010246 RAX: 0000000000000081 RBX: ffff8100146e0788 RCX: 0000000000000000 RDX: ffff810015683100 RSI: ffff810015abbda8 RDI: ffffffff80335c60 RBP: ffff8100156745c0 R08: 0000000000000000 R09: 0000000000000000 R10: ffff810015abbda8 R11: 00000000000000b0 R12: ffff8100146e0688 R13: ffff8100146e0688 R14: ffff81000e5cce80 R15: ffff81001440eb70 FS: 0000000000000000(0000) GS:ffffffff80427000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000003c6e49a6e0 CR3: 0000000000201000 CR4: 00000000000006e0 Process python (pid: 5210, threadinfo ffff810015aba000, task ffff810015683100) Stack: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 000000000000145a 0000000000000000 0000000000000000 0000000000000000 ffff8100156745c0 0000000000000202 Call Trace: [<ffffffff80012bd8>] __fput+0x94/0x1bd [<ffffffff80023d8b>] filp_close+0x5c/0x64 [<ffffffff80039227>] put_files_struct+0x63/0xae [<ffffffff8001597d>] do_exit+0x376/0x955 [<ffffffff8004948c>] cpuset_exit+0x0/0x88 [<ffffffff8005d28d>] tracesys+0xd5/0xe0 Code: 0f 0b 68 0b ae 2b 80 c2 20 08 eb fe 48 89 c3 48 8b 03 48 85 RIP [<ffffffff8002766c>] locks_remove_flock+0xe2/0x124 RSP <ffff810015abbda8> I took a screenshot of the other ones and it was in the same code in fs/locks.c. This was on a brand new FS mounted on one node running 20 threads of reading and writing to files. A
Adam, can you attach the python script to the bug?
Created attachment 489194 [details]
Test script provided by customer

This is the test script that the customer developed for filesystem testing. It wasn't intended as a reproducer, but it was found that the panics could be reproduced outside of the context of the usual application tests by using that script. I've verified that I can produce panics and withdraws with this script, but I had to experiment to arrive at just the right combination of parameters. If the test is too aggressive, the system will just get bogged down and then fenced. If the test is too light, it'll take an absurd number of runs to encounter a problem. I invoke the tool like so:

fs-test.py --filesize 50000 --loop 2 --repeat 100 --concurrency 15 --delopt 1 --quiet --filename /mnt/gfs2/PID_01 4

The customer used the following settings:

filesystem_test.py --filesize 1000000 --loop 100 --repeat 2 --concurrency 20 --delopt 1 --quiet --filename /dcs/appl01/DATA/01_PID 4

Their system is much more robust than mine, so I think you'll need to tune concurrency and size to find the right fit for your system. The code forks a number of children that then proceed to create files, EX lock them, fill them with data, and then delete them. There's more nuance to it than that, but those are the essentials. It is test case 4 in the code. The customer developed the tool as a general filesystem performance test and it is pretty robust. I am going to work on basing a simpler test on its behavior, to see if we can simplify and independently arrive at the same symptom.
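[Editor's note: the attached script is Python; purely to illustrate the core loop described above (create a file, take an exclusive lock, fill it with data, delete it, across many child processes), a stripped-down C equivalent might look like the sketch below. The flock()-based locking, file naming, sizes, and concurrency value are assumptions about the script's behavior, not taken from it.]

/* Minimal sketch of the reproducer's essential behavior: each child
 * repeatedly creates a file on the shared GFS2 mount, locks it exclusively,
 * fills it with data, and deletes it.  Illustrative only; the real test
 * script (Python) has many more options and nuances.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/file.h>
#include <sys/wait.h>
#include <unistd.h>

#define FILESIZE   50000           /* assumed, mirrors --filesize 50000 */
#define ITERATIONS 100

static void churn(const char *dir, int child)
{
	char path[4096], buf[4096];
	ssize_t left;
	int i, fd;

	memset(buf, 'x', sizeof(buf));
	for (i = 0; i < ITERATIONS; i++) {
		snprintf(path, sizeof(path), "%s/testfile_%d_%d", dir, child, i);
		fd = open(path, O_CREAT | O_WRONLY, 0644);
		if (fd < 0)
			continue;
		flock(fd, LOCK_EX);                 /* EX lock the file */
		for (left = FILESIZE; left > 0; left -= sizeof(buf))
			write(fd, buf,
			      left > (ssize_t)sizeof(buf) ? sizeof(buf) : left);
		flock(fd, LOCK_UN);
		close(fd);
		unlink(path);                       /* then delete it */
	}
}

int main(int argc, char **argv)
{
	int c, concurrency = 15;                    /* mirrors --concurrency */

	if (argc < 2) {
		fprintf(stderr, "usage: %s <directory>\n", argv[0]);
		return 1;
	}
	for (c = 0; c < concurrency; c++)
		if (fork() == 0) {                  /* children do the work */
			churn(argv[1], c);
			_exit(0);
		}
	while (wait(NULL) > 0)
		;
	return 0;
}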
Okay, I think I understand what's going on in the error I reported in comment #7. The hard part is understanding the glock state machine, which is absurdly complex. The best way to explain what's going on is by starting with a pseudo-code snippet of function gfs2_glock_nq:

gfs2_glock_nq
   run_queue
      do_promote
         ret = glops->go_lock(gh);
            rgrp_go_lock
               gfs2_rgrp_bh_get
                  gfs2_meta_read - re-reads bitmaps
      do_xmote
         glops->go_xmote_th
            meta_go_sync
               gfs2_log_flush
               gfs2_meta_sync - syncs bitmaps to the media
                  struct address_space *mapping = gl->gl_aspace->i_mapping;
                  filemap_fdatawrite(mapping);
                  error = filemap_fdatawait(mapping);

Here is the view from 20,000 feet: What I think is the problem, in a nutshell, is that function run_queue calls do_promote before do_xmote. Function do_xmote is like a "top half" for glocks, and for rgrps, its job is to sync the data to the media before the lock is granted to the next waiter. Function do_promote is like a "bottom half" for glocks, and it finishes what was started. In the case of rgrps, the "bottom half" not only does the promotion as far as dlm is concerned, it also calls "go_lock", which has the effect of re-reading the rgrp bitmaps from the media.

So I believe the problem is that the bottom half is being executed before the top half. In other words, the bitmap is being re-read before the previous IO is synced to the media. Instead, we should be syncing the data to the media before we call go_lock. But it's not as simple as all that. Function do_xmote makes the assumption that the promotion has happened. We can't just reverse the order. So this appears to be a bug in the glock state machine, which is probably the second most complex piece of code in gfs2. But understanding the problem is the most difficult part, and now that I know what's going on, I should be able to design a fix that won't break everything else. I've tried a couple of prototypes, but they failed for various reasons. I need to think of a proper way to fix it. Unfortunately, Steve is away on holiday, so I won't be able to consult with him, and he's the only one who thoroughly understands the glock state machine.
Just an update for people out there who are interested in the progress of this bug. Regarding the code snippet of comment #13: way back in the ancient history of GFS2, the code path used to look something like this:

gfs2_glock_nq
   add_to_queue(gh);
   run_queue(gl);
      rq_promote
         glops->go_xmote_th
            inode_go_xmote_th
               gfs2_glock_xmote_th
                  glops->go_sync
                  gfs2_lm_lock()
                  if (!async)
                     xmote_bh
   if (!async)
      glock_wait_internal
         glops->go_lock
            rgrp_go_lock
               gfs2_rgrp_bh_get
                  gfs2_meta_read - re-reads bitmaps

So the go_sync (top half) was normally executed before the go_lock (bottom half), which guarantees any writes happen before the glock changes holders. I've tried several ways to make the sync happen first, but there are pitfalls and problems with every design I've tried. I'll keep trying to design a better way to do this, but it's extremely complex. We can't just go back to the previous method above because the glock state machine has changed too much in the past five years.
Created attachment 490467 [details]
Initial patch

This patch seems to fix the problem with bitmap corruption, and also the problem with the "writechaos" test program. This patch adds a "go_sync" glops function for rgrp metadata that is executed earlier in the glock cycle, to sync the data to media before releasing the glock. It's nearly identical to the go_xmote_th function, which makes it a bit redundant. I need to do more testing on it and see if I can eliminate the go_xmote_th. Previous attempts to do that ended in failure, but that could have been due to problems with my instrumentation and other experimental code.

This patch applies to the -255 kernel, which contains the fixes for bug #656032, bug #650494 and others. However, if I run the scenario from bug #650494 repeatedly, I can recreate another problem. I'm not sure if it's related. Here is the symptom:

GFS2: fsid=hlab_cluster:svr_home.0: fatal: filesystem consistency error
GFS2: fsid=hlab_cluster:svr_home.0: inode = 38966 920929
GFS2: fsid=hlab_cluster:svr_home.0: function = gfs2_change_nlink, file = /home/bob/gfs2.690555/inode.c, line = 426
GFS2: fsid=hlab_cluster:svr_home.0: about to withdraw this file system
GFS2: fsid=hlab_cluster:svr_home.0: telling LM to withdraw
GFS2: fsid=hlab_cluster:svr_home.0: withdrawn
Call Trace:
 [<ffffffff8868879c>] :gfs2:gfs2_lm_withdraw+0xd3/0x100
 [<ffffffff8868bd59>] :gfs2:gfs2_getbuf+0x108/0x117
 [<ffffffff8867ec89>] :gfs2:gfs2_dirent_prev+0x0/0x4e
 [<ffffffff8868bd7f>] :gfs2:gfs2_meta_read+0x17/0x65
 [<ffffffff8868a6e2>] :gfs2:gfs2_pin+0xb4/0xbc
 [<ffffffff8869c76f>] :gfs2:gfs2_consist_inode_i+0x43/0x48
 [<ffffffff88686def>] :gfs2:gfs2_change_nlink+0x56/0x104
 [<ffffffff88692940>] :gfs2:gfs2_unlink+0x12a/0x199
 [<ffffffff8869287a>] :gfs2:gfs2_unlink+0x64/0x199
 [<ffffffff88692898>] :gfs2:gfs2_unlink+0x82/0x199
 [<ffffffff886928ba>] :gfs2:gfs2_unlink+0xa4/0x199
 [<ffffffff8869914d>] :gfs2:gfs2_rindex_hold+0x2b/0x8d
 [<ffffffff8008fd34>] __cond_resched+0x1c/0x44
 [<ffffffff8004a076>] vfs_unlink+0xc4/0x120
 [<ffffffff8003c629>] do_unlinkat+0xaa/0x141
 [<ffffffff8006153d>] sysenter_tracesys+0x48/0x83
 [<ffffffff8006149d>] sysenter_do_call+0x1e/0x76
GFS2: fsid=hlab_cluster:svr_home.0: gfs2_delete_inode: -5
INFO: task gfs2_logd:10540 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
gfs2_logd D ffffffff80153e37 0 10540 403 10541 10539 (L-TLB)
 ffff81067f233df0 0000000000000046 0000000000000002 ffff81047df31d30
 0000000000000002 000000000000000a ffff81047d7b1040 ffff81087fcc0820
 00000316c87faa2b 00000000000088b0 ffff81047d7b1228 000000147e1fb250
Call Trace:
 [<ffffffff8006ec35>] do_gettimeofday+0x40/0x90
 [<ffffffff800a279c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800637ce>] io_schedule+0x3f/0x67
 [<ffffffff88689682>] :gfs2:gfs2_log_flush+0x361/0x472
 [<ffffffff800a29b4>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88688d24>] :gfs2:gfs2_ail1_empty+0x36/0x95
 [<ffffffff88689c6d>] :gfs2:gfs2_logd+0xa2/0x15c
 [<ffffffff88689bcb>] :gfs2:gfs2_logd+0x0/0x15c
 [<ffffffff80032733>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a279c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff88685ffc>] :gfs2:inode_go_demote_ok+0x0/0x26
 [<ffffffff80032635>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

We've seen this symptom before, so I'm inclined to think it might be a different problem.
It could also be similar: it could be a result of dinode glocks not having one of the "go_sync" glops that I added for rgrp glocks, and therefore their commits to media happening too late. The "proper" fix is probably to scrap and redesign the entire glock state machine. But design issues are for upstream, not minor releases, so I'm just looking for a simple way to circumvent the problem for this release.
This doesn't look right to me... what should be happening is this:

->go_lock() runs immediately before the first holder is granted
->go_unlock() runs immediately after the final holder is dropped

The glock cannot change state when there is a (granted) holder queued. When we call ->go_unlock() we take the GLF_LOCK while we drop the spinlock, in order to prevent any state changes from occurring during this interval.
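[Editor's note: as a plain illustration of the invariant described here, where the "go_lock" work brackets the first holder and the "go_unlock" work brackets the last one, a simplified userspace analogue might look like the following. This is not GFS2 code, just the first-holder/last-holder callback pattern.]

/* Userspace analogue of the go_lock/go_unlock bracket: the "lock" callback
 * runs only when the holder count goes 0 -> 1, and the "unlock" callback
 * only when it returns to 0.  Illustration only, not the glock code.
 */
#include <pthread.h>
#include <stdio.h>

struct bracketed_lock {
	pthread_mutex_t mu;
	int holders;
	void (*go_lock)(void);     /* e.g. bring per-lock state in */
	void (*go_unlock)(void);   /* e.g. release per-lock buffers */
};

static void nq(struct bracketed_lock *bl)   /* enqueue a holder */
{
	pthread_mutex_lock(&bl->mu);
	if (bl->holders++ == 0)
		bl->go_lock();             /* first holder granted */
	pthread_mutex_unlock(&bl->mu);
}

static void dq(struct bracketed_lock *bl)   /* dequeue a holder */
{
	pthread_mutex_lock(&bl->mu);
	if (--bl->holders == 0)
		bl->go_unlock();           /* final holder dropped */
	pthread_mutex_unlock(&bl->mu);
}

static void fake_go_lock(void)   { puts("go_lock: first holder granted"); }
static void fake_go_unlock(void) { puts("go_unlock: final holder dropped"); }

int main(void)
{
	struct bracketed_lock bl = { PTHREAD_MUTEX_INITIALIZER, 0,
				     fake_go_lock, fake_go_unlock };
	nq(&bl);   /* go_lock fires here */
	nq(&bl);   /* second holder: no callback */
	dq(&bl);
	dq(&bl);   /* go_unlock fires here */
	return 0;
}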
Yes, go_unlock is called when the last holder releases the glock. The trouble is, that's not the end of the rgrp's buffers. They hang around as long as rd_bh_count remains positive, which means during the journal flush process. So long after the glock is relinquished, there are still buffers floating around. During that time, another holder can hold the glock, which causes go_lock, which calls gfs2_rgrp_bh_get, which calls gfs2_meta_read to re-read the bitmaps from the media, thus losing bits. And I suspect it's the same issue for dinode glocks as well, which is likely the cause of the problems we've seen in the past where a dinode is being unlinked and somehow mistakenly ends up with a block count.
gfs2_meta_read should be a no-op if the pages/buffers are uptodate. When the glock is dropped, the buffers should all be synced back (if dirty) and then invalidated, so that none of them should still be in the journal if the glock is no longer held. We always do a journal flush as part of the process of writeback when the glock is dropped. Ideally I'd like to be rid of the gfs2_rgrp_bh_get/put functions being called from ->go_lock/unlock as it would make the code rather simpler. Also it would allow locking an rgrp without reading the bitmap in, something that will be required to make ->statfs faster in the future. In order to do that we need to audit all users of the rgrp glock and ensure that they read the bitmaps in when required themselves.
I've established that in the failing scenario, the glock is being transferred from one node to the node that reports the problem. The rgrp write is done correctly on the node releasing the lock. The rgrp read, however, is bypassed on the failing node due to rd_bh_count != 0. So really this problem is an issue of mismanaging rd_bh_count in some way. Note that this error only manifests when a bitmap OTHER THAN the initial one (on the rgrp block) is changed. I initially thought the problem was with the management of the buffers, i.e., when function trans_add_bh adds a buffer, it calls gfs2_rgrp_bh_hold. That makes the most sense given the symptom I just mentioned. However, it seems to be more complicated than that. After extensive analysis, that code all looks correct to me. I'm continuing to backtrack with my "glock history" instrumentation.
I know what I said earlier in bug #656032 about the corrupt bits not being set by gfs2_setbit, but apparently they are. It seems like I'm dealing with multiple problems, but they may all be the same problem manifesting in different ways. What I now know is this: the sequence of events appears simple or complex depending on whether other nodes in the cluster are involved. What seems to be happening is this:

(1) The dinode block starts out in "unlinked" status.
(2) We probably try to reclaim the dinode, which calls a final iput. Function iput_final calls drop_inode, which eventually gets down to gfs2_delete_inode, and that calls gfs2_dinode_dealloc, which changes the block from UNLINKED to FREE via gfs2_free_di.
(3) Sometime shortly thereafter, vfs calls us to unlink the inode, which does a lookup, then calls gfs2_unlink. That finds the block improperly freed. You can't go from free back to unlinked.

I've been all over the code paths and I don't see where any bad decisions are made or blocks are improperly calculated. If steps (1) and (2) were done on a different node, then step (3) will flag the error on a different node. That's when the symptoms make it look like what I previously thought, which is that a rgrp_iget re-reads the bitmap with the bad bit setting. I spent a lot of time checking and making sure the glock is properly locking, writing and reading the rgrp at the right times. In fact, the re-read seems proper, and the bad bit settings are due to the fact that steps (1) and (2) occurred on the other node. When the bitmap is not re-read, it's correct because the rgrp glock was not passed to another node in the cluster. BTW, regarding step (3): the problem can also appear in the rename code, when rename decrements the link count (on a file, not a directory).

I did some experiments on the problem:

(1) At first I thought the problem might be due to function gfs2_drop_inode setting the nlink count to zero when it sees a glock demote. I temporarily disabled the check. It made the failure harder to recreate, but the problem still did occur if another node was _not_ involved.
(2) I tried adding a check for gfs2_inode_dropped (nlink==0) to function gfs2_unlink, but again, it just makes the problem harder to recreate. Then it only recreates when other nodes are involved, and not in the single-node case.

The big question is why gfs2_unlink is called for a dinode that's already been deleted. The unlink keys off directory entries, so maybe someone isn't removing a dirent before the dinode itself is unlinked. But I checked the code, and it looks to be correct. So I still need to backtrack, and now I'm in less familiar territory. Maybe a process is improperly setting the demote bit, which causes a node to improperly set the nlink to 0 (all happening before step (1) in the sequence above), which causes it to be improperly unlinked, then freed (reclaimed) while there are still live dirents pointing to it. Either that, or maybe the bitmap calculations are wrong. But I've checked those pretty carefully too and they also seem correct.
Here's the latest on the problem I'm working on: I've rebased my code to the latest kernel. I threw away a lot of my instrumentation that was cluttering up everything. As a result, I am now able to recreate the problem again. My instrumentation indicates that the problem goes back to someone calling function gfs2_dir_del while unlinking a file, for example "DOCNET0.LST". It finds the correct dirent on the correct leaf, and it has the correct hash value, but for some odd reason the de_inum.no_addr is overwritten with the disk address of the dinode that was just previously unlinked. In other words:

op #  no_addr de_hash    leaf blk filename    operation
----- ------- ---------- -------- ----------- ---------
1771: 0xe1887 0xd73a3e37 0x1ac67e DOCNET0.LSO gfs2_dir_del
1772: 0xe1887 0x5d5ff7db 0x1eb31a DOCNET0.LST gfs2_dir_del

A previous entry for when DOCNET0.LST was added indicates the block address should have been 0x1cc68f, but it was overwritten. The first unlink was to delete file "DOCNET0.LSO", and it sets the dinode to "unlinked". Shortly thereafter, the dinode is set "free" by the auto-reclaim process. The second unlink is to delete file "DOCNET0.LST", but it mistakenly has the previous delete's inum in the actual buffer. (It actually comes in big endian from the buffer and has to be converted.) There is further churn, but the end result is that GFS2 then tries to unlink the block again, but now it's in a free state, which is clearly wrong.

The bad news is that this is buffer corruption, which is a royal pain to track down. A needle in a haystack. The good news is that the corruption looks very distinct, so I can hopefully track it down. The needle is neon red, so I have a pretty good chance of spotting it in the haystack. It's not random garbage, it's a specific value, so I just have to figure out where that value is coming from. The big clue here is that the block address overwriting the buffer was just used in the previous unlink operation, so I just have to trace it through.
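[Editor's note on the parenthetical about endianness: on-disk GFS2 numbers are stored big endian, so an instrumented dump has to convert a dirent's block address to CPU byte order before printing it. A trivial, hypothetical illustration of that conversion, not taken from the instrumentation itself:]

/* Hypothetical illustration of converting a big-endian on-disk block
 * address (as it sits in the buffer) to CPU byte order before using it.
 */
#include <endian.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Pretend this 64-bit value was read straight out of a dirent in a
	 * disk buffer; on disk it is stored big endian. */
	uint64_t ondisk_no_addr = htobe64(0xe1887);   /* simulate the on-disk form */
	uint64_t no_addr = be64toh(ondisk_no_addr);   /* convert before comparing/printing */

	printf("no_addr = 0x%llx\n", (unsigned long long)no_addr);
	return 0;
}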
Bob, is there any pattern to the position in the leaf block where the problem occurs? In particular, does the problem always occur when the entry is either the first or second entry in the leaf block? I wonder whether the issue is a confusion of "prev" and "cur" entries and whether the check for !gfs2_dirent_sentinel(dent) in __gfs2_dirent_find() is causing problems for unlink here.
Created attachment 494765 [details]
Second patch

All the problems are happening on a jdata directory. Examining the journals with instrumentation, it's apparent that leaf block updates are missing from the journal. It looks like they're getting dropped rather than synced to disk. My belief is that the problem is a statement in function inode_go_sync that looks like this:

	if (ip && !S_ISREG(ip->i_inode.i_mode))
		ip = NULL;

It basically says, "Don't sync the metadata if this is a directory." That line has been in gfs2 for many years now. It was added for performance reasons. I need to talk to Steve Whitehouse about it, but the patch basically makes an exception for jdata directories, and it seems to fix the problem. Lots of testing needs to be done on it, including: (1) many more iterations of the scenario I've been using, (2) the writechaos program, (3) the customer test script previously posted.

This patch includes the following (in order of appearance):
(1) A minor correction to a BUG_ON that I found.
(2) Comment corrections.
(3) Removal of the try_lock that seems to circumvent the problem but causes other coherency problems detected by the writechaos program I previously attached.
(4) A minor change to delay the clearing of the glock GLF_DIRTY flag, which seems more proper.
(5) The actual fix I described above.
(6) Better error reporting when bitmap errors are detected.

FWIW, the journals look something like this:
(1) Node 1 does absolutely nothing with the jdata directory.
(2) Node 2 does several directory adds and unlinks, and I see the update numbers (my instrumentation) occasionally jumping. It removes a directory, but that update goes mysteriously missing, along with several others. But the directory history (more instrumentation) shows an uninterrupted sequence of updates.
(3) Node 3 does a handful of directory updates, and I also see its update numbers (my instrumentation) jumping in the journal, but not in the directory history.

So what I should be seeing in the journal is this:

NODE 1 NODE 2
1. directory update 1200
2. directory update 1201
3. directory update 1202
(handoff)
4. directory update 1203
5. directory update 1204
(handoff)
6. directory update 1205
etc.

Instead, what I'm seeing is something like this:

1. directory update 1200
2. directory update 1201
3. directory update 1202
(handoff)
4. directory update 1203
5. directory update 1208
(handoff)
6. directory update 1212 (error detected)
The test:

    if (ip && !S_ISREG(ip->i_inode.i_mode))
        ip = NULL;

does not skip metadata writeback, but data writeback, and that is simply because only regular files have any data to write back. If blocks are landing up in the data address space for directories, then that is a problem which needs to be resolved. The paths for a directory with the jdata flag and one without are identical. The flag doesn't affect the directory at all; it is just a default for any files (or subdirectories) which are created in the directory. There should be no difference in the clearing of the dirty bit. It is always set under the read side of the log rwsem and it is always cleared under the write side of that rwsem, so there should be no difference at all if the test and clear are split up.
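To make the metadata/data distinction concrete, this is the rough shape of inode_go_sync, paraphrased from memory of the code of that era (the RHEL5 backport may differ in detail): nulling ip only skips the writeback of the inode's own data mapping, while the glock's metadata mapping is flushed regardless.

    /* Paraphrased sketch, not the exact source. */
    static void inode_go_sync(struct gfs2_glock *gl)
    {
            struct gfs2_inode *ip = gl->gl_object;
            struct address_space *metamapping = gl->gl_aspace->i_mapping;

            if (ip && !S_ISREG(ip->i_inode.i_mode))
                    ip = NULL;              /* only regular files have data pages */

            /* ... */
            gfs2_log_flush(gl->gl_sbd, gl);         /* journal first */
            filemap_fdatawrite(metamapping);        /* metadata always written back */
            if (ip) {                               /* data pages: regular files only */
                    filemap_fdatawrite(ip->i_inode.i_mapping);
                    filemap_fdatawait(ip->i_inode.i_mapping);
            }
            filemap_fdatawait(metamapping);
            /* ... */
    }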
Also, glock_free() already disposes of the aspace, so there is no need to do that separately.
I'm wondering about the GFS2_RDF_UPTODATE flag. This is reset in read_rindex_entry() and will cause the content of the rgrp to be updated next time it is used, whatever state it is in. It may be that we never reread the rindex after it has been read for the first time. On the other hand, if it did happen, then it looks like it might cause a problem, since it doesn't appear to be synchronised to ensure that there are never pending rgrp updates waiting to be written back.

We could probably get rid of that flag completely if we ensured that its uptodate status matched that of the rgrp's data itself. In other words, we'd always copy in the data when we'd sent an I/O request which had returned successfully. That way the uptodateness of the rgrp could be tested simply by looking at the uptodateness of the associated bh. It does appear that the uptodate state of rgrps is overly complicated since we have to worry about both the rgrp itself and the rindex as well. So this would simplify the situation.

I've been trying to convince myself that the clearing of the GFS2_RDF_UPTODATE flag in the rgrp's go_inval is actually correct from a locking standpoint. I can't find any obvious races, but we need to bear in mind that this flags operation is not atomic so must be covered by appropriate locking in case any of the other flag operations might overlap with it.
Created attachment 495234 [details]
Untested example of my thoughts wrt the uptodate flag

Hopefully this will make my thoughts a bit clearer, even if it is not exactly right. It shows the idea to simplify the code a bit anyway. Potentially we could go further and get rid of the mutex and make the function a bit cleaner still. Also, we need to stop using the rindex spin lock to cover the rgrp ref counts, since that lock is global. Either an atomic_t or a per-rgrp spinlock would be preferable. That can wait until later though. I did also wonder whether it would be a good plan to add a test to the ->go_inval for the rgrp to ensure that at that stage there were no bh's left hanging off it. Would be a good thing for a test patch, even if we don't put that in upstream.
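As a throwaway illustration of the ref count point (invented names, not the real gfs2_rgrpd layout): replacing a count guarded by the global rindex spin lock with a per-rgrp atomic_t would look something like this, removing filesystem-wide serialisation on what is really per-rgrp state.

    #include <asm/atomic.h>     /* 2.6.18-era location of atomic_t */

    /* Illustration only -- invented names, not the actual GFS2 fields. */
    struct rgrp_sketch {
            atomic_t rd_ref;    /* per-rgrp count, no global lock needed */
    };

    static inline void rgrp_ref_get(struct rgrp_sketch *rgd)
    {
            atomic_inc(&rgd->rd_ref);
    }

    static inline int rgrp_ref_put(struct rgrp_sketch *rgd)
    {
            /* Returns true when the last reference is dropped. */
            return atomic_dec_and_test(&rgd->rd_ref);
    }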
This isn't a rgrp or bitmap problem; this is apparently a glock problem. I now have definitive proof that two different nodes were making changes to the same directory at the same time.

Excerpt from my instrumentation on hlab1:

Entry  PID    Function          dir dinode addr  Leaf block  Dirent filename  update#
-----  -----  ----------------  ---------------  ----------  ---------------  -------
1447:  18041  get_leaf read     0x130a95/000077  l:0x130c71  none          9  1f8
1448:  18041  gfs2_dir_del 7    0x130a95/14931d  l:0x130c71  LAUNCH.LSO    4  1f9
1449:  10595  gfs2_unpin        0x000077/000077  l:0x130c71  none         15  1fa
1450:  10595  remove_from_ail3  0x000077/000077  l:0x130c71  none         12  1fb
...
1452:  18041  get_leaf read     0x130a95/000077  l:0x130c71  none          9  1fb
1453:  18041  gfs2_dir_del 7    0x130a95/14931d  l:0x130c71  LAUNCH.LSO    5  1fc

The last number (e.g. 1f8) is the revision number for the leaf block.

Interpretation: hlab1 read the leaf, deleted a dirent from it, thus revising the leaf from 1f8 to 1f9. Then it unpinned it and removed it from the ail. The label "remove_from_ail3" means it was called from gfs2_ail_empty_gl(). This looks like a normal directory unlink.

Excerpt from my instrumentation on hlab2:

 623:  13923  get_leaf read     0x130a95/000077  l:0x130c71  none               9  1f8
...
 626:  13923  gfs2_dir_del 7    0x130a95/09cfab  l:0x130c71  RLAUNCH_HLAB2.TRO  4  1f9
 627:  10590  gfs2_unpin        0x000077/000077  l:0x130c71  none              15  1fa
 628:  10590  remove_from_ail3  0x000077/000077  l:0x130c71  none              12  1fb
 629:  13923  get_leaf read     0x130a95/000077  l:0x130c71  none               9  1fb

Interpretation: Meanwhile, hlab2 read in the leaf block and found revision 1f8, which is the version of the leaf before hlab1 did its deleting from the leaf. Then it proceeds to do its own delete, updating the leaf from 1f8 to 1f9, unpinning it and removing it from the ail, just like hlab1 did. This is proof that both nodes were updating the leaf at the same time. Yet each node should have had the glock locked exclusively while it was updating the leaf. So this has got to be a glock holding problem.

Earlier versions of my instrumentation had glock history, so I'm going to have to go back and reinstate it to see what's happening to the glock during this period. The problem is that it is getting harder and harder to recreate with my instrumentation. Adding more may make it nearly impossible.

Consequences: If the instrumentation didn't stop things there, after hlab1 re-reads the leaf block it would find its changes to the leaf missing, so it would try to unlink the file again. The second unlink would catch the fact that the block is already free in the bitmap, and the assert_withdraw is thrown, which is the symptom I've been chasing for many weeks now.
Created attachment 495847 [details]
Glock / Directory history tarball

What we've got here is three nodes all making changes to the same directory at the same time, thus causing corruption with my patch that removes the try lock. We need that patch to solve the customer's problem, but obviously we can't ship it until we fix what's causing the corruption.

This is a zipped tarball of directory "intec28". It contains six files: three are the glock dumps for the three nodes, which aren't particularly important. The other three are the problem.28.hlab{1,2,3} files, which contain my latest combined glock and history trace information for my latest recreation of the failure. Pay particular attention to the items I marked with "Bob:" near the end of each of these files.

All three nodes were sharing a glock in SH mode, which is fine. Then they all wanted to make changes to the directory, so they tried to promote to EX mode, which is fine too. The problem is that all three got it, and so they were all incorrectly given permission to modify the leaf block. One of them detected the corruption; in this case it was hlab3. I need to go over this with a fine-tooth comb and figure out where the breakdown was. My hunch is that the glock state machine called do_promote in a case where it shouldn't have.
Legend for the glock/directory history:

GLOCK History:

G 41029: do_error4 8682, f: lq tgt: SH, dmt: EX, St: UN LEmt g:75677
 H: s:SH f:aW e:0 p:9576 [PWLPOST.EXE] gfs2_permission+0x7b/0xd3 [gfs2]
 H: s:SH f:aW e:0 p:9575 [PWLPOST.EXE] gfs2_permission+0x7b/0xd3 [gfs2]

G:         This is a glock history entry (D for directory history event)
41029:     This is the history number. In this case, this is the 41029th entry logged
do_error4: Point of interest or calling function. do_error4 is the do_error from
           run_queue responsible for failing any queued "try" locks
8682       This is the pid of the process that logged the event
f: lq      These are the glock flags, same as a glock dump
tgt: SH    Glock's target glock state, same as a glock dump (target lock is SH lock)
dmt: EX    Glock's demote glock state, same as a glock dump (demote to EX lock)
st: UN     Glock's current state, same as a glock dump (this glock is unlocked)
LEmt       Short for "Log Element list is empty"
g:75677    This is a generation number. More important for directory history.

The H: lines are the first two glock holders queued at the time of the event, same as a
glock dump.

Directory History:

D 41048: get_leaf read 9576 0x130a95/000077 ig:18446744073709551519 l:0x1273542 none 9 332

D:             This is a directory event
41048:         This is the history number.
get_leaf read: The event. This means that get_leaf was called and the block was read
               (!uptodate)
9576           This is the pid of the process that logged the event
0x130a95       This is the directory's block address
000077         This is the dent's block address (if known, or 77 if not known)
ig:X           This is the dinode's generation number. Not always reliable
l:0x1273542    This is the leaf block address (this happens to be the leaf where
               corruption was found)
none           This is the dirent file name, or "none" if not applicable for this event
9              This is debug information (function caller identifier)
332            This is the leaf block's generation number (which is updated with every
               event logged). It's okay for this to skip backwards a little because
               logging an event is sometimes done after the leaf is written to the media,
               so the next re-read may cause it to go backward.
Here is the current status of things.

1. For years, customers have, under rare circumstances, complained about GFS2 withdrawals due to file system corruption, but we've never been able to find the cause of the corruption. We've always closed these bug records as INSUFFICIENT_DATA and asked for a scenario that can recreate the problem. No customer has given us a scenario.

2. The symptoms are: (1) Invalid bitmap transition: a bit in the bitmap is transitioned from "free" to "unlinked", which is invalid. (2) Inodes being deleted have their block count get off, so that after a delete there are supposedly still blocks left to delete. (3) Other misc. problems with gfs2 metadata that seems to revert to older versions.

3. While testing the fix for bug #656032, we stumbled upon a scenario that recreates the corruption. To do this, we are using a very complex multi-node application written by one of our customers (not the customer for this bug).

4. In debugging the problem, we determined that most customers are spared the corruption problem by a section of code in function do_xmote that turns all normal glocks into "try" glocks. This code affects the timing and denies some glocks in cases where they should be waiting. This piece of code should not be necessary and has likely caused other problems and symptoms we saw while debugging bug #656032, so we know this code needs to be removed. This may also be adversely affecting performance.

5. To debug the problem, we removed that section of code, and it makes the problem easier to recreate. We decided to focus on the first symptom, the invalid bitmap transition. To debug that, we added instrumentation to log bitmap transitions and the management of resource groups. This was a very long and arduous task, but it eventually paid off and told us that the problem was, in fact, due to a dinode that was unlinked twice (it should not have existed when the second unlink occurred).

6. We initially pointed our fingers at the GFS2 code responsible for reclaiming unlinked dinodes, because that's been our most problematic area in the past. In fact, it was an area of code we just fixed for the fourth or fifth time, in the name of bug #656032. After intense debugging and complex instrumentation, that section of code was exonerated. The problem is due to a leaf block that has a directory entry removed, only to have it magically reappear.

7. We proceeded to add more instrumentation to log the history of all directory operations. After more long, arduous debugging, it appeared that the leaf block was reverting to an older version. This indicates a cluster coherency problem, which should be policed by the glock code.

8. We proceeded to add more instrumentation to log the history of glocks. This proved to be insufficient, and we proceeded to add more instrumentation to log some history for lock_dlm, which is the lock harness between GFS2 and the distributed lock manager, DLM.

9. Now we're facing several problems: (a) The more instrumentation we add, the harder the problem is to recreate. With our current level of instrumentation, it takes about a half-hour to recreate. (b) With this level of instrumentation, the corruption is spread over a large span of instrumented data. The problem can go undetected for a very long time, and by the time it's caught, the data we've instrumented has often been lost due to memory constraints. (c) The sheer amount of data we're gathering due to the instrumentation makes our analysis very slow and painstaking.
First, the instrumentation tells us the block in the bitmap that is marked wrong. We have to backtrack from that to the leaf block. From the leaf block, we have to find its most recent update. That tells us the leaf block revision number. Then we have to backtrack through all the data on all the nodes to find out what revisions were done to that leaf by all the nodes. From there, we have to backtrack the glock operations to find out which node really had the DLM's lock to make updates and which one didn't. (d) Half the time, one of the other types of corruption is recreated, so our directory history does not help us to find what caused the corruption at all.

10. We continue to try to get a set of "clean" data from the instrumentation, which should tell us what's going wrong at the glock level. It's my belief that this is either a glock problem or a dlm problem. There's currently no way to know. I'm betting it's a glock problem. If it's a dlm problem, we will need the help of the DLM experts, and likely more instrumentation will be needed at that level. I really hope it doesn't come down to that.
Created attachment 496619 [details]
Simpler recreation program

This simple program allows you to do simultaneous unlinkat commands, which easily recreates the second problem (not the directory leaf problem, but the one where function gfs2_change_nlink is trying to decrement i_nlink when, in fact, i_nlink == 0). I don't know if it will recreate the directory leaf problem, but it might. I've gathered what I hope is good glock history for this problem and I'm starting to analyze it now. A much simpler program means a much simpler set of data to analyze.
Regarding comment #37: To recreate the problem I used my four-node roth cluster and did this:

1. Compile the program with: gcc -o unlinkat unlinkat.c
2. Mount a gfs2 file system as /mnt/gfs2/
3. Use a program like cssh to simultaneously type on multiple nodes:
   ./unlinkat 0 4 (on node 1)
   ./unlinkat 1 4 (on node 2)
   ./unlinkat 2 4 (on node 3)
   ./unlinkat 3 4 (on node 4)

After the files are all created in the directory /mnt/gfs2/bigdir/, press <enter> simultaneously on all nodes to simultaneously try to "unlinkat" a file. If it does not recreate immediately, press <enter> again. For me, the second problem recreated on about my 7th or 8th simultaneous unlinkat.

By "simpler set of data to analyze" I mean I've got somewhere on the order of 10,000 log entries to sift through, rather than the customer scenario that had 150,000 - 300,000 entries on each and every node.
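Since attachment 496619 isn't reproduced inline here, below is a hypothetical stand-in for unlinkat.c that shows the shape of the test; the file names, file count, and directory layout are guesses, not the actual attachment.

    /*
     * Hypothetical stand-in for the attached unlinkat.c (illustration only).
     * Usage: ./unlinkat <node_index> <node_count>
     * Each node creates its share of files in /mnt/gfs2/bigdir, then on every
     * <enter> all nodes call unlinkat() on the same file at once.
     */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define DIRPATH "/mnt/gfs2/bigdir"
    #define NFILES  100     /* arbitrary for the sketch */

    int main(int argc, char **argv)
    {
            int idx, count, dfd, i;
            char name[64];

            if (argc != 3) {
                    fprintf(stderr, "usage: %s <node_index> <node_count>\n", argv[0]);
                    return 1;
            }
            idx = atoi(argv[1]);
            count = atoi(argv[2]);

            mkdir(DIRPATH, 0755);   /* EEXIST is fine if the directory exists */
            dfd = open(DIRPATH, O_RDONLY | O_DIRECTORY);
            if (dfd < 0) {
                    perror("open dir");
                    return 1;
            }

            /* Each node creates every count'th file so the directory fills up. */
            for (i = idx; i < NFILES; i += count) {
                    int fd;
                    snprintf(name, sizeof(name), "file.%04d", i);
                    fd = openat(dfd, name, O_CREAT | O_WRONLY, 0644);
                    if (fd >= 0)
                            close(fd);
            }

            /* On each <enter>, every node tries to unlink the same file. */
            for (i = 0; i < NFILES; i++) {
                    printf("press <enter> to unlinkat file.%04d\n", i);
                    if (getchar() == EOF)
                            break;
                    snprintf(name, sizeof(name), "file.%04d", i);
                    if (unlinkat(dfd, name, 0) < 0)
                            perror(name);   /* ENOENT expected on the losers */
            }
            close(dfd);
            return 0;
    }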
Created attachment 496626 [details]
Glock / Directory history tarball

Tarball of data gathered from unlinkat.c recreation program
Created attachment 496831 [details]
Third patch

This patch combines three patches:
(1) The patch suggested by Steve in comment #42
(2) The patch Steve gave in bug #701577
(3) The patch we need that eliminates the "try lock" stuff.

The good news is that it completely fixes the "unlinkat" problem and program I posted with comment #37, what I've been calling the "second" or "change_nlink" problem. With only the third patch (to eliminate the "try lock"), my test program could recreate an i_nlink problem in the first few iterations. With this combined patch, I ran through a thousand iterations of the program with no failures.

The bad news is that it does not fix the "first" problem, which is the bitmap inconsistency problem. With the customer app, I'm still able to recreate this problem in about a half-hour. I have not tried Adam's script, nor my original test program, writechaos, with the patch.

Since we're obviously dealing with two or more problems, should we treat this as two bugs and open another bug record? I hope the customer would not view that negatively.
Hi Bob. If splitting the problem into two bug reports would result in a more efficient path to complete resolution, we see no problem with this approach. Thanks for checking with us. John Hadad Intec/CSG
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Created attachment 497195 [details]
The 5th RHEL5 patch

Here is the RHEL5 patch with Steve's revisions. This has passed the GFS2 coherency test on two clusters of various speeds. It has also passed several combinations of the customer's script, plus a thousand iterations of the two reproducer programs. It has not undergone our full battery of QE tests. I'm going to build this into a test kernel and post to my people page ASAP. I'll post the link when it's ready, but there's lots of prep work so it probably won't be today.
Build 3304784 is complete. I loaded the test kernel rpms to my people page. You can find them here: http://people.redhat.com/rpeterso/Experimental/RHEL5.x/gfs2/kernel* This kernel has been successful under some heavy cluster workloads, including the aforementioned tests, but it has not gone through the Red Hat cluster QE process, so there are no promises. It is, however, very promising.
Hi Bob. -259 kernel test results: The test ran for about 90 min, at which point the system hung (processes stuck in 'D' state). I collected the usual hang data and uploaded it to the FTP dropbox as gfs2_hang_259_May_6_11_RH_333124.tgz

For reference (as you look at the data), the test ran from about 17:30 and the hang occurred around 19:00 (May 6). None of the nodes panicked (as they did with the previous kernel/patch). FYI, I re-created all the GFS2 filesystems before starting the test (as I always do).

I collected some performance data while the test was running so that I can compare to previous runs/kernels. The rate at which the data was flowing through the system was close to 194 (around 200 sec/batch). Acceptable, but a bit faster would significantly improve our chances. The delete rate was not good: 1.5 - 3.5 files/sec

I will keep the system in this state in case you need further info from it.

John H
Intec/CSG
Created attachment 498105 [details]
Sixth version of the patch

This is the version I posted today.
Posted for inclusion into 5.7. Changing status to POST.
Patch(es) available in kernel-2.6.18-261.el5 You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5 Detailed testing feedback is always welcomed.
Bob - one thing not covered in this ticket is that the node(s) where the FS has withdrawn usually needs to be rebooted before the filesystem can be fscked - this is disruptive at best. Can that part of the problem be addressed too?
(In reply to comment #63) > Bob - one thing not covered in this ticket is that the node(s) where the FS has > withdrawn usually needs to be rebooted before the filesystem can be fscked - > this is disruptive at best. > > Can that part of the problem be addressed too? Hi Alan, That would be a separate request for enhancement. This bug deals with a specific issue. If you are a Red Hat customer you should open a case with support stating that you'd like to request that we add a facility that allows us to unmount withdrawn GFS2 filesystems without a reboot. Thanks
I haven't had much luck reproducing this. Do we have more recent results from the customer?
Marking VERIFIED based on customer feedback. I have not been able to reproduce this in house.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Multiple GFS2 nodes attempting to unlink, rename, or otherwise manipulate the same files at the same time could cause various forms of file system corruption, panics, and withdrawals. This update adds multiple checks of a dinode's i_nlink value to ensure that inode operations such as link, unlink, or rename no longer cause these problems.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-1065.html