Bug 908093 - gfs2: withdraw does not wait for gfs_controld
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: rc
Assigned To: Steve Whitehouse
QA Contact: Cluster QE
Keywords: ZStream
Duplicates: 951970
Blocks: 927308 961662
Reported: 2013-02-05 15:57 EST by David Teigland
Modified: 2013-11-21 10:14 EST (History)
CC List: 9 users

Fixed In Version: kernel-2.6.32-381.el6
Doc Type: Bug Fix
Doc Text:
When an inconsistency is detected in a GFS2 file system after an I/O operation, the kernel performs the withdraw operation on the local node. However, the kernel previously did not wait for an acknowledgement from the GFS control daemon (gfs_controld) before proceeding with the withdraw operation. Therefore, if isolating the GFS2 file system from the shared storage failed, the kernel was unaware of the problem, and an I/O operation to the shared block device could still be performed after the withdraw was logged as successful. This could corrupt the file system or prevent recovery of the node's journal. This patch modifies the GFS2 code so that the withdraw operation no longer proceeds without an acknowledgement from gfs_controld, and the GFS2 file system can no longer become corrupted after performing the withdraw operation.
Last Closed: 2013-11-21 10:14:12 EST
Type: Bug


Attachments
First draft of upstream fix (2.51 KB, patch)
2013-02-06 05:19 EST, Steve Whitehouse
RHEL6 crosswrite patch (2.83 KB, patch)
2013-03-22 13:02 EDT, Robert Peterson

Description David Teigland 2013-02-05 15:57:43 EST
Description of problem:

rhel6 withdraw handling original design:

When gfs2 initiates a withdraw in the kernel (due to some error), it notifies
gfs_controld (via uevent). gfs_controld attempts to block all i/o on the local
node (using dmsetup), and when done leaves the mount group.

When it leaves the mount group, the other nodes all block the fs in the
kernel (prevent it from acquiring/using dlm locks), remove the leaving/withdrawing node, and wait for all nodes to reach a barrier.
They then tell the withdrawing node to finish its withdraw, and they
recover its journal.

When gfs_controld on the withdrawing node is told to complete its withdraw
from the others, it writes "1" to the lock_dlm "withdraw" sysfs file.
gfs2 in the kernel is then supposed to call
dlm_release_lockspace(ls->dlm_lockspace, 2);

dlm_release_lockspace() on the withdrawing node leaves the dlm lockspace
(via dlm_controld).  Leaving the lockspace will release locks it holds,
allowing gfs2 on the other nodes to acquire them.  Only the gfs2 node doing
journal recovery for this node should be allowed to use these granted locks
because they may protect metadata that needs recovery.

Once gfs2 journal recovery is done, the kernel notifies gfs_controld.
gfs_controld on all nodes then resumes fs activity.


The problem arose when the lock_dlm module was merged into gfs2 itself.
(at which point I stopped maintaining it)
(commit f057f6cdf64175db1151b1f5d110e29904f119a1 in the rhel6 kernel)

Before this commit, withdraw waited as described above, in this function:

static void gdlm_withdraw(void *lockspace)
{
       struct gdlm_ls *ls = lockspace;

       kobject_uevent(&ls->kobj, KOBJ_OFFLINE);

       wait_event_interruptible(ls->wait_control,
                                test_bit(DFL_WITHDRAW, &ls->flags));

       dlm_release_lockspace(ls->dlm_lockspace, 2);
       gdlm_release_threads(ls);
       gdlm_kobject_release(ls);
}

and continued via this function called by gfs_controld:

static ssize_t withdraw_store(struct gdlm_ls *ls, const char *buf, size_t len)
{
       ssize_t ret = len;
       int val;

       val = simple_strtol(buf, NULL, 0);

       if (val == 1)
               set_bit(DFL_WITHDRAW, &ls->flags);
       else
               ret = -EINVAL;
       wake_up(&ls->wait_control);
       return ret;
}


After the commit, gfs2 no longer waited for gfs_controld to write "1" to
the lock_dlm "withdraw" sysfs file.  Instead, it just does:

kobject_uevent(&sdp->sd_kobj, KOBJ_OFFLINE);
dlm_release_lockspace(ls->ls_dlm, 2);

The result is that the dlm locks are released, and may be used by gfs2 on the
non-withdrawing nodes, before gfs2 has been blocked/prepared for recovery
by gfs_controld.  This can lead to gfs2 on those nodes using portions of
the fs that had been locked by the withdrawing node and are still unrecovered,
so they may see inconsistencies or possibly corrupt the fs.
It may also prevent the node doing journal recovery from acquiring the
locks it needs.

Comment 2 David Teigland 2013-02-05 16:40:24 EST
I came to look at this based on what I saw upstream where there is no
more gfs_controld.  Without gfs_controld, I expected the withdraw would
just hang or panic.  But it is much worse given the problem described
above.  What happens is that the node just removes itself from the
dlm lockspace, its locks are granted to other nodes, and nothing
is done at the gfs2 level.  This means the parts of the file system
it was using at the time of withdraw are left in some inconsistent
state that is not recovered.  Other nodes can quickly run across the
mess left by the withdrawn node.  I would advise replacing withdraw with
panic to avoid this source of corruption.

(Note that in a cluster file system, as opposed to a local file system,
a single node should be much quicker and more willing to sacrifice itself
by panicking for the good of the whole.  If one node fails quickly,
the others can quickly get on with recovering it, and continue using the
fs themselves.  But if one node insists on trying to maintain some
partial function in dubious circumstances, it can very easily disrupt
all the other nodes.  Thus one localized problem, affecting one node,
is turned into a global problem affecting all nodes.  In the case
of a local fs, there is no harm in attempting to maintain fs access
in some limited form.)
Comment 3 Steve Whitehouse 2013-02-06 05:19:50 EST
Created attachment 693853 [details]
First draft of upstream fix

Not yet tested, this is the first draft of a patch to fix this issue.
Comment 4 RHEL Product and Program Management 2013-02-06 05:21:05 EST
This request was evaluated by Red Hat Product Management for
inclusion in a Red Hat Enterprise Linux release.  Product
Management has requested further review of this request by
Red Hat Engineering, for potential inclusion in a Red Hat
Enterprise Linux release for currently deployed products.
This request is not yet committed for inclusion in a release.
Comment 8 Robert Peterson 2013-03-22 13:02:57 EDT
Created attachment 714660 [details]
RHEL6 crosswrite patch

This is a RHEL6 patch crosswritten from the upstream version.
I'll post it after I do some withdraw testing with it.
Comment 12 Jarod Wilson 2013-05-24 11:48:01 EDT
Patch(es)
Comment 15 Robert Peterson 2013-08-20 09:21:23 EDT
*** Bug 951970 has been marked as a duplicate of this bug. ***
Comment 16 Joe Donohue 2013-10-04 12:38:35 EDT
Dell is requesting access to this bug to determine whether they are seeing a similar issue.
Comment 18 Justin Payne 2013-10-18 19:25:49 EDT
Verified in kernel-2.6.32-423.el6.  First, the old behavior on kernel-2.6.32-345.el6 for comparison:

-bash-4.1$ for i in `seq 1 3`; do qarsh root@dash-0$i uname -r; done
2.6.32-345.el6.x86_64
2.6.32-345.el6.x86_64
2.6.32-345.el6.x86_64
-bash-4.1$ for i in `seq 1 3`; do qarsh root@dash-0$i rpm -q gfs2-utils; done
gfs2-utils-3.0.12.1-49.el6.x86_64
gfs2-utils-3.0.12.1-49.el6.x86_64
gfs2-utils-3.0.12.1-49.el6.x86_64
-bash-4.1$ for i in `seq 1 3`; do qarsh root@dash-0$i ps aux |grep gfs_controld; done
root      3531  0.6  0.0 129128  1632 ?        Ssl  17:54   0:00 gfs_controld -w 0
root      3491  0.6  0.0 129128  1632 ?        Ssl  17:54   0:00 gfs_controld -w 0
root      3470  0.6  0.0 129128  1632 ?        Ssl  17:54   0:00 gfs_controld -w 0
-bash-4.1$ for i in `seq 1 3`; do qarsh root@dash-0$i mount.gfs2 /dev/sda1 /mnt/mygfs2; done
-bash-4.1$ for i in `seq 1 3`; do qarsh root@dash-0$i grep gfs2 /proc/mounts; done
/dev/sda1 /mnt/mygfs2 gfs2 rw,seclabel,relatime,hostdata=jid=0 0 0
/dev/sda1 /mnt/mygfs2 gfs2 rw,seclabel,relatime,hostdata=jid=1 0 0
/dev/sda1 /mnt/mygfs2 gfs2 rw,seclabel,relatime,hostdata=jid=2 0 0

[root@dash-01 ~]# echo 1 > /sys/fs/gfs2/dash\:mygfs2/withdraw 
<returns to prompt>

-bash-3.2$ console dash-01
[Enter `^Ec?' for help]
GFS2: fsid=dash:mygfs2.0: withdrawing from cluster at user's request
GFS2: fsid=dash:mygfs2.0: about to withdraw this file system
GFS2: fsid=dash:mygfs2.0: telling LM to unmount
GFS2: fsid=dash:mygfs2.0: withdrawn
Pid: 3702, comm: bash Not tainted 2.6.32-345.el6.x86_64 #1
Call Trace:
 [<ffffffffa0301e02>] ? gfs2_lm_withdraw+0x102/0x130 [gfs2]
 [<ffffffff81280e6c>] ? simple_strtoull+0x2c/0x50
 [<ffffffffa03008dd>] ? withdraw_store+0x7d/0x90 [gfs2]
 [<ffffffffa0300375>] ? gfs2_attr_store+0x25/0x30 [gfs2]
 [<ffffffff811f92e5>] ? sysfs_write_file+0xe5/0x170
 [<ffffffff81180c38>] ? vfs_write+0xb8/0x1a0
 [<ffffffff81181531>] ? sys_write+0x51/0x90
 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b



-bash-4.1$ for i in `seq 1 3`; do qarsh root@dash-0$i uname -r; done
2.6.32-423.el6.x86_64
2.6.32-423.el6.x86_64
2.6.32-423.el6.x86_64
-bash-4.1$ for i in `seq 1 3`; do qarsh root@dash-0$i rpm -q gfs2-utils; done
gfs2-utils-3.0.12.1-59.el6.x86_64
gfs2-utils-3.0.12.1-59.el6.x86_64
gfs2-utils-3.0.12.1-59.el6.x86_64
-bash-4.1$ for i in `seq 1 3`; do qarsh root@dash-0$i ps aux |grep gfs_controld; done
root      3480  0.5  0.0 129128  1624 ?        Ssl  18:14   0:00 gfs_controld -w 0
root      3473  0.5  0.0 129128  1628 ?        Ssl  18:14   0:00 gfs_controld -w 0
root      3510  0.5  0.0 129128  1628 ?        Ssl  18:14   0:00 gfs_controld -w 0
-bash-4.1$ for i in `seq 1 3`; do qarsh root@dash-0$i mount.gfs2 /dev/sda1 /mnt/mygfs2; done
-bash-4.1$ for i in `seq 1 3`; do qarsh root@dash-0$i grep gfs2 /proc/mounts; done
/dev/sda1 /mnt/mygfs2 gfs2 rw,seclabel,relatime,hostdata=jid=0 0 0
/dev/sda1 /mnt/mygfs2 gfs2 rw,seclabel,relatime,hostdata=jid=1 0 0
/dev/sda1 /mnt/mygfs2 gfs2 rw,seclabel,relatime,hostdata=jid=2 0 0

[root@dash-01 ~]# echo 1 > /sys/fs/gfs2/dash\:mygfs2/withdraw
<hangs forever>

-bash-3.2$ console dash-01
[Enter `^Ec?' for help]
GFS2: fsid=dash:mygfs2.0: withdrawing from cluster at user's request
GFS2: fsid=dash:mygfs2.0: about to withdraw this file system
INFO: task bash:1828 blocked for more than 120 seconds.
      Not tainted 2.6.32-423.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bash          D 0000000000000000     0  1828   1824 0x00000080
 ffff88014bd35c88 0000000000000086 ffff88014bd35c38 ffffffff8100988e
 ffff88014bd35c28 ffff88014a6184d8 00000000016184e0 ffff8800282143c0
 ffff8801492ef058 ffff88014bd35fd8 000000000000fbc8 ffff8801492ef058
Call Trace:
 [<ffffffff8100988e>] ? __switch_to+0x26e/0x320
 [<ffffffff81528110>] ? thread_return+0x4e/0x76e
 [<ffffffff81528fb5>] schedule_timeout+0x215/0x2e0
 [<ffffffff81528c33>] wait_for_common+0x123/0x180
 [<ffffffff81065ff0>] ? default_wake_function+0x0/0x20
 [<ffffffff81528d4d>] wait_for_completion+0x1d/0x20
 [<ffffffffa032c48a>] gfs2_lm_withdraw+0x10a/0x160 [gfs2]
 [<ffffffff8128ca7c>] ? simple_strtoull+0x2c/0x50
 [<ffffffffa032b77d>] withdraw_store+0x7d/0x90 [gfs2]
 [<ffffffffa032a5f5>] gfs2_attr_store+0x25/0x30 [gfs2]
 [<ffffffff812040d5>] sysfs_write_file+0xe5/0x170
 [<ffffffff81189208>] vfs_write+0xb8/0x1a0
 [<ffffffff81189b01>] sys_write+0x51/0x90
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task umount:3682 blocked for more than 120 seconds.
      Not tainted 2.6.32-423.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
umount        D 0000000000000000     0  3682   3681 0x00000080
 ffff88014c777b28 0000000000000082 ffff880028216840 ffff880028216840
 ffff880028216840 0000000000000001 0000000000000000 0000000000000000
 ffff88014c775af8 ffff88014c777fd8 000000000000fbc8 ffff88014c775af8
Call Trace:
 [<ffffffff81060d13>] ? perf_event_task_sched_out+0x33/0x70
 [<ffffffff81528fb5>] schedule_timeout+0x215/0x2e0
 [<ffffffff81066002>] ? default_wake_function+0x12/0x20
 [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
 [<ffffffff81528c33>] wait_for_common+0x123/0x180
 [<ffffffff81065ff0>] ? default_wake_function+0x0/0x20
 [<ffffffff811a69ae>] ? ifind_fast+0x5e/0xb0
 [<ffffffff81528d4d>] wait_for_completion+0x1d/0x20
 [<ffffffff812059c8>] sysfs_addrm_finish+0x228/0x270
 [<ffffffff81205b13>] sysfs_remove_dir+0xa3/0xf0
 [<ffffffff81284616>] kobject_del+0x16/0x40
 [<ffffffff812846ae>] kobject_release+0x6e/0x240
 [<ffffffff81284640>] ? kobject_release+0x0/0x240
 [<ffffffff81285bb7>] kref_put+0x37/0x70
 [<ffffffff81284547>] kobject_put+0x27/0x60
 [<ffffffffa032a7f7>] gfs2_sys_fs_del+0x47/0x50 [gfs2]
 [<ffffffffa03294dc>] gfs2_put_super+0x18c/0x220 [gfs2]
 [<ffffffff8118b4db>] generic_shutdown_super+0x5b/0xe0
 [<ffffffff8118b591>] kill_block_super+0x31/0x50
 [<ffffffffa031b103>] gfs2_kill_sb+0x73/0x80 [gfs2]
 [<ffffffff8118bd67>] deactivate_super+0x57/0x80
 [<ffffffff811aad3f>] mntput_no_expire+0xbf/0x110
 [<ffffffff811ab88b>] sys_umount+0x7b/0x3a0
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Comment 19 errata-xmlrpc 2013-11-21 10:14:12 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-1645.html
