Description of problem:
As part of his testing of the patch for bug #517145, Jaroslav Kortus discovered that a GFS withdraw now hangs in 5.5 whereas it did not before. The regression was not caused by the patch for bug #517145; it was caused by another patch that made GFS start using the standard freeze/thaw file system mechanism. When a GFS file system withdraw occurs, the kernel generates a uevent that causes gfs_controld to use dm to isolate the storage. The gfs_controld program does so by calling dmsetup with the suspend parameter. That freezes the file system, so it can't respond to the withdraw.

Based on an idea from Steve Whitehouse, I patched gfs_controld so that when it calls dmsetup, it passes the --nolockfs and --noflush parameters. With this patch in place, the hang did not occur for me. I've given a patched version of gfs_controld to jkortus to try and hope to get feedback today. If it indeed fixes the problem, that means: (1) this bug's gfs patch did not cause a regression and can therefore be placed back into ON_QA or maybe even VERIFIED; and (2) we need to open a new blocker bug record and get the gfs_controld patch into 5.5.

Version-Release number of selected component (if applicable):
5.5

How reproducible:
Easy

Steps to Reproduce:
1. Mount a gfs file system from three nodes at /mnt/gfs
2. On one of the nodes, run: gfs_tool withdraw /mnt/gfs

Actual results:
The withdraw hangs and the gfs mount point cannot be unfrozen. It becomes unusable until the cluster is rebooted.

Expected results:
The withdraw should not hang and the mount point can be unfrozen.

Additional info:
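For illustration only, here is a minimal sketch of the isolation step after the patch (this is not the actual gfs_controld code, and the device name vg0-gfslv is a made-up placeholder; the real daemon resolves the dm device backing the withdrawn file system itself):

```shell
#!/bin/sh
# Hypothetical device-mapper name standing in for the device that backs
# the withdrawn GFS file system.
DM_NAME="vg0-gfslv"

# Before the fix: a plain "dmsetup suspend" went through the freeze/thaw
# hooks and froze the file system, deadlocking the in-progress withdraw.
# After the fix: --nolockfs skips the fs freeze and --noflush skips
# waiting on outstanding I/O, so the suspend cannot block on the fs.
SUSPEND_CMD="dmsetup suspend --nolockfs --noflush $DM_NAME"

# Printed rather than executed here, since actually running it requires
# root and a live device-mapper target.
echo "$SUSPEND_CMD"
```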
Created attachment 397846 [details]
Proposed patch

Here is my proposed patch that seems to fix the problem.
Sorry, bits of the problem description above were copied over from the other bug #517145 so they do not apply. This bug was opened to address the problem mentioned as (2) above.
On RHEL5.4 + RHN updates the withdrawal process succeeds:

Mar 4 12:35:19 a2 kernel: GFS: fsid=a_cluster:vedder0.1: withdrawing from cluster at user's request
Mar 4 12:35:19 a2 kernel: GFS: fsid=a_cluster:vedder0.1: about to withdraw from the cluster
Mar 4 12:35:19 a2 kernel: GFS: fsid=a_cluster:vedder0.1: telling LM to withdraw
Mar 4 12:35:20 a2 kernel: GFS: fsid=a_cluster:vedder0.1: withdrawn
Mar 4 12:35:20 a2 kernel:
Mar 4 12:35:20 a2 kernel: Call Trace:
Mar 4 12:35:20 a2 kernel:  [<a000000100013b40>] show_stack+0x40/0xa0
Mar 4 12:35:20 a2 kernel:                                 sp=e00000010e5a7bd0 bsp=e00000010e5a1298
Mar 4 12:35:20 a2 kernel:  [<a000000100013bd0>] dump_stack+0x30/0x60
Mar 4 12:35:20 a2 kernel:                                 sp=e00000010e5a7da0 bsp=e00000010e5a1280
Mar 4 12:35:20 a2 kernel:  [<a00000020331df40>] gfs_lm_withdraw+0x1e0/0x220 [gfs]
Mar 4 12:35:20 a2 kernel:                                 sp=e00000010e5a7da0 bsp=e00000010e5a1218
Mar 4 12:35:20 a2 kernel:  [<a000000203348600>] gfs_proc_read+0xaa0/0xd60 [gfs]
Mar 4 12:35:20 a2 kernel:                                 sp=e00000010e5a7de0 bsp=e00000010e5a11b8
Mar 4 12:35:20 a2 kernel:  [<a000000100177300>] vfs_read+0x200/0x3a0
Mar 4 12:35:20 a2 kernel:                                 sp=e00000010e5a7e20 bsp=e00000010e5a1168
Mar 4 12:35:20 a2 kernel:  [<a0000001001779d0>] sys_read+0x70/0xe0
Mar 4 12:35:20 a2 kernel:                                 sp=e00000010e5a7e20 bsp=e00000010e5a10f0
Mar 4 12:35:20 a2 kernel:  [<a00000010000bd70>] __ia64_trace_syscall+0xd0/0x110
Mar 4 12:35:20 a2 kernel:                                 sp=e00000010e5a7e30 bsp=e00000010e5a10f0
Mar 4 12:35:20 a2 kernel:  [<a000000000010620>] __start_ivt_text+0xffffffff00010620/0x400
Mar 4 12:35:20 a2 kernel:                                 sp=e00000010e5a8000 bsp=e00000010e5a10f0
The patch was pushed to the RHEL55 branch of the cluster git tree for inclusion into 5.5. Changing status to POST until I can do a build.
Hm, I added a comment with some questions about this last week, but that comment seems to be completely missing... To recap: the --nolockfs looks correct; I remember when this bug was introduced by gfs2 adding the lockfs hooks, but apparently whoever added those didn't think of this problem or test it. The --noflush I'm not sure about; it depends on what happens to unflushed buffers when you suspend with --noflush. Are they all completed, with errors? Or are they left outstanding until the resume? The former should be fine; the latter would be dangerous and would defeat the purpose of the suspend (which is to wait for all writes to be gone so that the node doesn't need to be fenced).
Verified as in description. Setting needinfo to Bob to clarify comment 6.
Regarding comment #6: what happens to the I/O depends on the target that is installed rather than on the flushing. As far as I can tell from the man page, the flushing is something that was supposed to happen before the new target was installed. It should be easy enough to verify. The intent is that the new dm target remains in place until either the machine is rebooted or a umount succeeds. That must by definition invalidate all the buffers, since they are all in the address spaces of the inodes, which will have been deallocated for the umount to be successful. So either should be safe. The question is whether one or the other would make it more likely for umount to succeed. I suspect it makes no difference, but let's try it and see.
The decision to implement withdraw using dmsetup suspend was based on the premise that no outstanding writes or dirty buffers would exist for the given device once dmsetup returned. Otherwise the fs is open to being corrupted. So, if there are cases where that is not true, then we need to either change something so that it is, detect those cases and panic instead of withdrawing, or advise people to use the panic option. (Panic instead of withdraw is almost always preferable anyway, and should really be made the default behavior.)
I did some testing on this. The --noflush option seems to make no difference. In both cases, the withdraw returns normally, but any subsequent attempt to umount will hang, producing one of the following call traces (gfs and gfs2 respectively):

INFO: task umount.gfs:3717 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
umount.gfs    D ffff81000237eaa0     0  3717   3716                (NOTLB)
 ffff810066ae1c08 0000000000000086 0000000000000001 ffffffff800e3452
 ffff81006afa7ac8 0000000000000007 ffff810066f3c820 ffff8100026e4100
 0000024d4dd063e8 0000000000099e19 ffff810066f3ca08 0000000100000010
Call Trace:
 [<ffffffff800e3452>] block_read_full_page+0x259/0x276
 [<ffffffff8006f1f5>] do_gettimeofday+0x40/0x90
 [<ffffffff80028adc>] sync_page+0x0/0x43
 [<ffffffff800647ea>] io_schedule+0x3f/0x67
 [<ffffffff80028b1a>] sync_page+0x3e/0x43
 [<ffffffff8006492e>] __wait_on_bit_lock+0x36/0x66
 [<ffffffff8003ff92>] __lock_page+0x5e/0x64
 [<ffffffff800a1bd2>] wake_bit_function+0x0/0x23
 [<ffffffff8000c2e7>] do_generic_mapping_read+0x1df/0x354
 [<ffffffff8000d0fb>] file_read_actor+0x0/0x159
 [<ffffffff8000c5a8>] __generic_file_aio_read+0x14c/0x198
 [<ffffffff800c78fb>] generic_file_read+0xac/0xc5
 [<ffffffff800a1ba4>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8012e042>] selinux_file_permission+0x9f/0xb6
 [<ffffffff8000b6b0>] vfs_read+0xcb/0x171
 [<ffffffff80011c01>] sys_read+0x45/0x6e
 [<ffffffff8005e28d>] tracesys+0xd5/0xe0

With dmsetup --nolockfs and --noflush and gfs2:

INFO: task gfs2_logd:3145 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
gfs2_logd     D ffff810002376420     0  3145     27          3146  3143 (L-TLB)
 ffff810069b67cc0 0000000000000046 ffff81007ef70c33 ffff81007fb92178
 ffffffff800154ce 0000000000000009 ffff8100691e8040 ffffffff80309b60
 000000499756fb9a 0000000000006545 ffff8100691e8228 0000000000000282
Call Trace:
 [<ffffffff800154ce>] sync_buffer+0x0/0x3f
 [<ffffffff8006f1f5>] do_gettimeofday+0x40/0x90
 [<ffffffff800154ce>] sync_buffer+0x0/0x3f
 [<ffffffff800647ea>] io_schedule+0x3f/0x67
 [<ffffffff80015509>] sync_buffer+0x3b/0x3f
 [<ffffffff80064a16>] __wait_on_bit+0x40/0x6e
 [<ffffffff800154ce>] sync_buffer+0x0/0x3f
 [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80064ab0>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff800a1bd2>] wake_bit_function+0x0/0x23
 [<ffffffff8003aca8>] sync_dirty_buffer+0x96/0xcb
 [<ffffffff88626dc8>] :gfs2:log_write_header+0x10e/0x336
 [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff886273ac>] :gfs2:gfs2_log_flush+0x3bc/0x472
 [<ffffffff886269b5>] :gfs2:gfs2_ail1_empty+0x1a/0x95
 [<ffffffff8862793c>] :gfs2:gfs2_logd+0xa2/0x15c
 [<ffffffff8862789a>] :gfs2:gfs2_logd+0x0/0x15c
 [<ffffffff80032bdc>] kthread+0xfe/0x132
 [<ffffffff8005efb1>] child_rip+0xa/0x11
 [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032ade>] kthread+0x0/0x132
 [<ffffffff8005efa7>] child_rip+0x0/0x11

INFO: task umount.gfs2:3195 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
umount.gfs2   D ffff810002376420     0  3195   3194                (NOTLB)
 ffff8100681bfdb8 0000000000000082 ffff81000237eb18 ffff8100681bfd28
 ffff81007f756080 0000000000000007 ffff8100698c5080 ffffffff80309b60
 0000004afa9aff98 0000000000097576 ffff8100698c5268 0000000000000000
Call Trace:
 [<ffffffff80065613>] __down_write_nested+0x7a/0x92
 [<ffffffff8862700f>] :gfs2:gfs2_log_flush+0x1f/0x472
 [<ffffffff8862746d>] :gfs2:gfs2_meta_syncfs+0xb/0x37
 [<ffffffff8862e0ac>] :gfs2:gfs2_kill_sb+0x25/0x76
 [<ffffffff800e4d41>] deactivate_super+0x6a/0x82
 [<ffffffff800ee830>] sys_umount+0x245/0x27b
 [<ffffffff800b878c>] audit_syscall_entry+0x180/0x1b3
 [<ffffffff8005e28d>] tracesys+0xd5/0xe0

So we have to decide whether this is worth respinning the cman errata again this late in the build cycle to remove --noflush. My personal opinion is no, it's not worth respinning; we can deal with the umount problem and device sync issues in 5.6 or 5.5.z. Opinions?
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0266.html