Red Hat Bugzilla – Bug 908093
gfs2: withdraw does not wait for gfs_controld
Last modified: 2013-11-21 10:14:12 EST
Description of problem:

rhel6 withdraw handling, original design:

When gfs2 initiates a withdraw in the kernel (due to some error), it notifies gfs_controld (via uevent). gfs_controld attempts to block all i/o on the local node (using dmsetup), and when done leaves the mount group. When it leaves the mount group, the other nodes all block the fs in the kernel (prevent it from acquiring/using dlm locks), remove the leaving/withdrawing node, and wait for all nodes to reach a barrier. They then tell the withdrawing node to finish its withdraw, and they recover its journal.

When gfs_controld on the withdrawing node is told by the others to complete its withdraw, it writes "1" to the lock_dlm "withdraw" sysfs file. gfs2 in the kernel is then supposed to call

  dlm_release_lockspace(ls->dlm_lockspace, 2);

dlm_release_lockspace() on the withdrawing node leaves the dlm lockspace (via dlm_controld). Leaving the lockspace releases the locks the node holds, allowing gfs2 on the other nodes to acquire them. Only the gfs2 node doing journal recovery for this node should be allowed to use these granted locks, because they may protect metadata that needs recovery.

Once gfs2 journal recovery is done, the kernel notifies gfs_controld. gfs_controld on all nodes then resumes fs activity.

The problem arose when the lock_dlm module was merged into gfs2 itself (at which point I stopped maintaining it), in commit f057f6cdf64175db1151b1f5d110e29904f119a1 in the rhel6 kernel.

Before this commit, withdraw waited as described above, in this function:

-static void gdlm_withdraw(void *lockspace)
-{
-       struct gdlm_ls *ls = lockspace;
-
-       kobject_uevent(&ls->kobj, KOBJ_OFFLINE);
-
-       wait_event_interruptible(ls->wait_control,
-                                test_bit(DFL_WITHDRAW, &ls->flags));
-
-       dlm_release_lockspace(ls->dlm_lockspace, 2);
-       gdlm_release_threads(ls);
-       gdlm_kobject_release(ls);
-}

and continued via this function, called by gfs_controld:

-static ssize_t withdraw_store(struct gdlm_ls *ls, const char *buf, size_t len)
-{
-       ssize_t ret = len;
-       int val;
-
-       val = simple_strtol(buf, NULL, 0);
-
-       if (val == 1)
-               set_bit(DFL_WITHDRAW, &ls->flags);
-       else
-               ret = -EINVAL;
-       wake_up(&ls->wait_control);
-       return ret;
-}

After the commit, gfs2 no longer waits for gfs_controld to write "1" to the lock_dlm "withdraw" sysfs file. Instead, it just does:

  kobject_uevent(&sdp->sd_kobj, KOBJ_OFFLINE);
  dlm_release_lockspace(ls->ls_dlm, 2);

The result is that dlm locks are released and may be used by gfs2 on the non-withdrawing nodes before gfs2 has been blocked/prepared for recovery by gfs_controld. gfs2 on those nodes could then use portions of the fs that had been locked by the withdrawing node and are still unrecovered, so they may see inconsistencies or possibly corrupt the fs. It may also prevent the node doing journal recovery from acquiring the locks it needs.
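The ordering described above (report the withdraw, wait for the daemon's "1", and only then release the lockspace) can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the `wd_*` names and the state enum are hypothetical and do not exist in gfs2 or dlm, no real blocking is modelled, and the extra state check in `wd_store()` is a simplification of this model rather than behaviour of the original `withdraw_store()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
enum wd_state { WD_MOUNTED, WD_OFFLINE_SENT, WD_ACKED, WD_RELEASED };

struct wd_model {
    enum wd_state state;
};

/* Step 1: the kernel reports the withdraw (stands in for the KOBJ_OFFLINE uevent). */
static void wd_send_offline(struct wd_model *m)
{
    assert(m != NULL);
    if (m->state == WD_MOUNTED)
        m->state = WD_OFFLINE_SENT;
}

/*
 * Step 2: the daemon writes a value to the "withdraw" file.
 * Only 1 is accepted, and (an assumption of this model) only after
 * the offline event was sent. Returns 0 on success, -1 otherwise.
 */
static int wd_store(struct wd_model *m, int val)
{
    assert(m != NULL);
    if (val != 1 || m->state != WD_OFFLINE_SENT)
        return -1;
    m->state = WD_ACKED;
    return 0;
}

/*
 * Step 3: release the lockspace. The model refuses to do so until the
 * daemon has acknowledged, which is the ordering the regression lost.
 */
static bool wd_release_lockspace(struct wd_model *m)
{
    assert(m != NULL);
    if (m->state != WD_ACKED)
        return false;
    m->state = WD_RELEASED;
    return true;
}
```

In this model, calling `wd_release_lockspace()` straight after `wd_send_offline()` fails, which corresponds to the window in which the regressed code hands locks to other nodes before they have been blocked for recovery.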
I came to look at this based on what I saw upstream, where there is no more gfs_controld. Without gfs_controld, I expected the withdraw would just hang or panic. But it is much worse, given the problem described above. What happens is that the node just removes itself from the dlm lockspace, its locks are granted to other nodes, and nothing is done at the gfs2 level. This means the parts of the file system it was using at the time of withdraw are left in an inconsistent state that is not recovered. Other nodes can quickly run across the mess left by the withdrawn node.

I would advise replacing withdraw with panic to avoid this source of corruption.

(Note that in a cluster file system, as opposed to a local file system, a single node should be much quicker and more willing to sacrifice itself by panicking for the good of the "whole". If one node fails quickly, the others can quickly get on to recovering it and continue using the fs themselves. But if one node insists on trying to maintain some partial function in dubious circumstances, it can very easily disrupt all the other nodes. Thus one localized problem affecting one node is turned into a global problem affecting all nodes. In the case of a local fs, no harm comes from attempting to maintain fs access in some limited form.)
Created attachment 693853
First draft of upstream fix

Not yet tested; this is the first draft of a patch to fix this issue.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release.
Upstream patch: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=fd95e81cb1c74c9acd2356821faa9f24c2fec365
Created attachment 714660
RHEL6 crosswrite patch

This is a RHEL6 patch crosswritten from the upstream version. I'll post it after I do some withdraw testing with it.
Patch(es)
*** Bug 951970 has been marked as a duplicate of this bug. ***
Dell requesting access to this bug to understand if they are seeing a similar issue.
Verified in kernel-2.6.32-423.el6:

-bash-4.1$ for i in `seq 1 3`; do qarsh root@dash-0$i uname -r; done
2.6.32-345.el6.x86_64
2.6.32-345.el6.x86_64
2.6.32-345.el6.x86_64
-bash-4.1$ for i in `seq 1 3`; do qarsh root@dash-0$i rpm -q gfs2-utils; done
gfs2-utils-3.0.12.1-49.el6.x86_64
gfs2-utils-3.0.12.1-49.el6.x86_64
gfs2-utils-3.0.12.1-49.el6.x86_64
-bash-4.1$ for i in `seq 1 3`; do qarsh root@dash-0$i ps aux |grep gfs_controld; done
root 3531 0.6 0.0 129128 1632 ? Ssl 17:54 0:00 gfs_controld -w 0
root 3491 0.6 0.0 129128 1632 ? Ssl 17:54 0:00 gfs_controld -w 0
root 3470 0.6 0.0 129128 1632 ? Ssl 17:54 0:00 gfs_controld -w 0
-bash-4.1$ for i in `seq 1 3`; do qarsh root@dash-0$i mount.gfs2 /dev/sda1 /mnt/mygfs2; done
-bash-4.1$ for i in `seq 1 3`; do qarsh root@dash-0$i grep gfs2 /proc/mounts; done
/dev/sda1 /mnt/mygfs2 gfs2 rw,seclabel,relatime,hostdata=jid=0 0 0
/dev/sda1 /mnt/mygfs2 gfs2 rw,seclabel,relatime,hostdata=jid=1 0 0
/dev/sda1 /mnt/mygfs2 gfs2 rw,seclabel,relatime,hostdata=jid=2 0 0

[root@dash-01 ~]# echo 1 > /sys/fs/gfs2/dash\:mygfs2/withdraw
<returns to prompt>

-bash-3.2$ console dash-01
[Enter `^Ec?' for help]
GFS2: fsid=dash:mygfs2.0: withdrawing from cluster at user's request
GFS2: fsid=dash:mygfs2.0: about to withdraw this file system
GFS2: fsid=dash:mygfs2.0: telling LM to unmount
GFS2: fsid=dash:mygfs2.0: withdrawn
Pid: 3702, comm: bash Not tainted 2.6.32-345.el6.x86_64 #1
Call Trace:
 [<ffffffffa0301e02>] ? gfs2_lm_withdraw+0x102/0x130 [gfs2]
 [<ffffffff81280e6c>] ? simple_strtoull+0x2c/0x50
 [<ffffffffa03008dd>] ? withdraw_store+0x7d/0x90 [gfs2]
 [<ffffffffa0300375>] ? gfs2_attr_store+0x25/0x30 [gfs2]
 [<ffffffff811f92e5>] ? sysfs_write_file+0xe5/0x170
 [<ffffffff81180c38>] ? vfs_write+0xb8/0x1a0
 [<ffffffff81181531>] ? sys_write+0x51/0x90
 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b

-bash-4.1$ for i in `seq 1 3`; do qarsh root@dash-0$i uname -r; done
2.6.32-423.el6.x86_64
2.6.32-423.el6.x86_64
2.6.32-423.el6.x86_64
-bash-4.1$ for i in `seq 1 3`; do qarsh root@dash-0$i rpm -q gfs2-utils; done
gfs2-utils-3.0.12.1-59.el6.x86_64
gfs2-utils-3.0.12.1-59.el6.x86_64
gfs2-utils-3.0.12.1-59.el6.x86_64
-bash-4.1$ for i in `seq 1 3`; do qarsh root@dash-0$i ps aux |grep gfs_controld; done
root 3480 0.5 0.0 129128 1624 ? Ssl 18:14 0:00 gfs_controld -w 0
root 3473 0.5 0.0 129128 1628 ? Ssl 18:14 0:00 gfs_controld -w 0
root 3510 0.5 0.0 129128 1628 ? Ssl 18:14 0:00 gfs_controld -w 0
-bash-4.1$ for i in `seq 1 3`; do qarsh root@dash-0$i mount.gfs2 /dev/sda1 /mnt/mygfs2; done
-bash-4.1$ for i in `seq 1 3`; do qarsh root@dash-0$i grep gfs2 /proc/mounts; done
/dev/sda1 /mnt/mygfs2 gfs2 rw,seclabel,relatime,hostdata=jid=0 0 0
/dev/sda1 /mnt/mygfs2 gfs2 rw,seclabel,relatime,hostdata=jid=1 0 0
/dev/sda1 /mnt/mygfs2 gfs2 rw,seclabel,relatime,hostdata=jid=2 0 0

[root@dash-01 ~]# echo 1 > /sys/fs/gfs2/dash\:mygfs2/withdraw
<hangs forever>

-bash-3.2$ console dash-01
[Enter `^Ec?' for help]
GFS2: fsid=dash:mygfs2.0: withdrawing from cluster at user's request
GFS2: fsid=dash:mygfs2.0: about to withdraw this file system
INFO: task bash:1828 blocked for more than 120 seconds.
Not tainted 2.6.32-423.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bash D 0000000000000000 0 1828 1824 0x00000080
 ffff88014bd35c88 0000000000000086 ffff88014bd35c38 ffffffff8100988e
 ffff88014bd35c28 ffff88014a6184d8 00000000016184e0 ffff8800282143c0
 ffff8801492ef058 ffff88014bd35fd8 000000000000fbc8 ffff8801492ef058
Call Trace:
 [<ffffffff8100988e>] ? __switch_to+0x26e/0x320
 [<ffffffff81528110>] ? thread_return+0x4e/0x76e
 [<ffffffff81528fb5>] schedule_timeout+0x215/0x2e0
 [<ffffffff81528c33>] wait_for_common+0x123/0x180
 [<ffffffff81065ff0>] ? default_wake_function+0x0/0x20
 [<ffffffff81528d4d>] wait_for_completion+0x1d/0x20
 [<ffffffffa032c48a>] gfs2_lm_withdraw+0x10a/0x160 [gfs2]
 [<ffffffff8128ca7c>] ? simple_strtoull+0x2c/0x50
 [<ffffffffa032b77d>] withdraw_store+0x7d/0x90 [gfs2]
 [<ffffffffa032a5f5>] gfs2_attr_store+0x25/0x30 [gfs2]
 [<ffffffff812040d5>] sysfs_write_file+0xe5/0x170
 [<ffffffff81189208>] vfs_write+0xb8/0x1a0
 [<ffffffff81189b01>] sys_write+0x51/0x90
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task umount:3682 blocked for more than 120 seconds.
Not tainted 2.6.32-423.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
umount D 0000000000000000 0 3682 3681 0x00000080
 ffff88014c777b28 0000000000000082 ffff880028216840 ffff880028216840
 ffff880028216840 0000000000000001 0000000000000000 0000000000000000
 ffff88014c775af8 ffff88014c777fd8 000000000000fbc8 ffff88014c775af8
Call Trace:
 [<ffffffff81060d13>] ? perf_event_task_sched_out+0x33/0x70
 [<ffffffff81528fb5>] schedule_timeout+0x215/0x2e0
 [<ffffffff81066002>] ? default_wake_function+0x12/0x20
 [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
 [<ffffffff81528c33>] wait_for_common+0x123/0x180
 [<ffffffff81065ff0>] ? default_wake_function+0x0/0x20
 [<ffffffff811a69ae>] ? ifind_fast+0x5e/0xb0
 [<ffffffff81528d4d>] wait_for_completion+0x1d/0x20
 [<ffffffff812059c8>] sysfs_addrm_finish+0x228/0x270
 [<ffffffff81205b13>] sysfs_remove_dir+0xa3/0xf0
 [<ffffffff81284616>] kobject_del+0x16/0x40
 [<ffffffff812846ae>] kobject_release+0x6e/0x240
 [<ffffffff81284640>] ? kobject_release+0x0/0x240
 [<ffffffff81285bb7>] kref_put+0x37/0x70
 [<ffffffff81284547>] kobject_put+0x27/0x60
 [<ffffffffa032a7f7>] gfs2_sys_fs_del+0x47/0x50 [gfs2]
 [<ffffffffa03294dc>] gfs2_put_super+0x18c/0x220 [gfs2]
 [<ffffffff8118b4db>] generic_shutdown_super+0x5b/0xe0
 [<ffffffff8118b591>] kill_block_super+0x31/0x50
 [<ffffffffa031b103>] gfs2_kill_sb+0x73/0x80 [gfs2]
 [<ffffffff8118bd67>] deactivate_super+0x57/0x80
 [<ffffffff811aad3f>] mntput_no_expire+0xbf/0x110
 [<ffffffff811ab88b>] sys_umount+0x7b/0x3a0
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2013-1645.html