Description of problem:
As part of his testing of the patch for bug #517145, Jaroslav Kortus discovered that a GFS withdraw now hangs in 5.5 whereas it did not before. The regression was not caused by the patch for bug #517145; it was caused by another patch that made GFS start using the standard freeze/thaw file system mechanism. When a GFS file system withdraw occurs, the kernel generates a uevent that causes gfs_controld to use dm to isolate the storage. The gfs_controld program does so by calling dmsetup with the suspend parameter. That freezes the file system, so it can't respond to the withdraw.

Based on an idea from Steve Whitehouse, I patched gfs_controld so that when it calls dmsetup, it passes the --nolockfs and --noflush parameters. With this patch in place, the hang did not occur for me. I've given a patched version of gfs_controld to jkortus to try and hope to get feedback today. If it indeed fixes the problem, that means: (1) this bug's gfs patch did not cause a regression and can therefore be placed back into ON_QA or maybe even VERIFIED; and (2) we need to open a new blocker bug record and get the gfs_controld patch into 5.5.

Version-Release number of selected component (if applicable):
5.5

How reproducible:
Easy

Steps to Reproduce:
1. Mount a gfs file system from three nodes at /mnt/gfs
2. On one of the nodes, run: gfs_tool withdraw /mnt/gfs

Actual results:
The withdraw hangs and the gfs mount point cannot be unfrozen. It becomes unusable until the cluster is rebooted.

Expected results:
The withdraw should not hang and the mount point can be unfrozen.

Additional info:
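For illustration only, here is a minimal sketch of the isolation step after the patch (this is not the actual gfs_controld code, and the device name vg0-gfslv is a made-up placeholder; the real daemon resolves the dm device backing the withdrawn file system itself):

```shell
#!/bin/sh
# Hypothetical device-mapper name standing in for the device that backs
# the withdrawn GFS file system.
DM_NAME="vg0-gfslv"

# Before the fix: a plain "dmsetup suspend" went through the freeze/thaw
# hooks and froze the file system, deadlocking the in-progress withdraw.
# After the fix: --nolockfs skips the fs freeze and --noflush skips
# waiting on outstanding I/O, so the suspend cannot block on the fs.
SUSPEND_CMD="dmsetup suspend --nolockfs --noflush $DM_NAME"

# Printed rather than executed here, since actually running it requires
# root and a live device-mapper target.
echo "$SUSPEND_CMD"
```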
Created attachment 397846 [details]
Proposed patch

Here is my proposed patch that seems to fix the problem.
Sorry, bits of the problem description above were copied over from the other bug #517145 so they do not apply. This bug was opened to address the problem mentioned as (2) above.
On RHEL5.4 + RHN updates the withdrawal process succeeds:

Mar 4 12:35:19 a2 kernel: GFS: fsid=a_cluster:vedder0.1: withdrawing from cluster at user's request
Mar 4 12:35:19 a2 kernel: GFS: fsid=a_cluster:vedder0.1: about to withdraw from the cluster
Mar 4 12:35:19 a2 kernel: GFS: fsid=a_cluster:vedder0.1: telling LM to withdraw
Mar 4 12:35:20 a2 kernel: GFS: fsid=a_cluster:vedder0.1: withdrawn
Mar 4 12:35:20 a2 kernel:
Mar 4 12:35:20 a2 kernel: Call Trace:
Mar 4 12:35:20 a2 kernel:  [<a000000100013b40>] show_stack+0x40/0xa0
Mar 4 12:35:20 a2 kernel:                                 sp=e00000010e5a7bd0 bsp=e00000010e5a1298
Mar 4 12:35:20 a2 kernel:  [<a000000100013bd0>] dump_stack+0x30/0x60
Mar 4 12:35:20 a2 kernel:                                 sp=e00000010e5a7da0 bsp=e00000010e5a1280
Mar 4 12:35:20 a2 kernel:  [<a00000020331df40>] gfs_lm_withdraw+0x1e0/0x220 [gfs]
Mar 4 12:35:20 a2 kernel:                                 sp=e00000010e5a7da0 bsp=e00000010e5a1218
Mar 4 12:35:20 a2 kernel:  [<a000000203348600>] gfs_proc_read+0xaa0/0xd60 [gfs]
Mar 4 12:35:20 a2 kernel:                                 sp=e00000010e5a7de0 bsp=e00000010e5a11b8
Mar 4 12:35:20 a2 kernel:  [<a000000100177300>] vfs_read+0x200/0x3a0
Mar 4 12:35:20 a2 kernel:                                 sp=e00000010e5a7e20 bsp=e00000010e5a1168
Mar 4 12:35:20 a2 kernel:  [<a0000001001779d0>] sys_read+0x70/0xe0
Mar 4 12:35:20 a2 kernel:                                 sp=e00000010e5a7e20 bsp=e00000010e5a10f0
Mar 4 12:35:20 a2 kernel:  [<a00000010000bd70>] __ia64_trace_syscall+0xd0/0x110
Mar 4 12:35:20 a2 kernel:                                 sp=e00000010e5a7e30 bsp=e00000010e5a10f0
Mar 4 12:35:20 a2 kernel:  [<a000000000010620>] __start_ivt_text+0xffffffff00010620/0x400
Mar 4 12:35:20 a2 kernel:                                 sp=e00000010e5a8000 bsp=e00000010e5a10f0
The patch was pushed to the RHEL55 branch of the cluster git tree for inclusion into 5.5. Changing status to POST until I can do a build.
Hm, I added a comment with some questions about this last week, but that comment seems to be completely missing... To recap: the --nolockfs looks correct; I remember when this bug was introduced by gfs2 adding the lockfs hooks, but apparently whoever added those didn't think of this problem or test it. The --noflush I'm not sure about; it depends on what happens to unflushed buffers when you suspend with --noflush. Are they all completed, with errors? Or are they left outstanding until the resume? The former should be fine; the latter would be dangerous and would defeat the purpose of the suspend (which is to wait for all writes to be gone so that the node doesn't need to be fenced).
Verified as in description. Setting needinfo to Bob to clarify comment 6.
Regarding comment #6: what happens to the I/O depends on the target that is installed rather than on the flushing. As far as I can tell from the man page, the flushing is something that was supposed to happen before the new target was installed. It should be easy enough to verify. The intent is that the new dm target remains in place until either the machine is rebooted or a umount succeeds. That must by definition invalidate all the buffers, since they are all in the address spaces of the inodes, which will have been deallocated for the umount to be successful. So either should be safe. The question is whether one or the other would make it more likely for umount to succeed. I suspect it makes no difference, but let's try it and see.
The decision to implement withdraw using dmsetup suspend was based on the premise that no outstanding writes or dirty buffers would exist for the given device once dmsetup returned. Otherwise the fs is open to being corrupted. So, if there are cases where that is not true, then we need to either change something so that it is, detect those cases and panic instead of withdrawing, or advise people to use the panic option. (Panic instead of withdraw is almost always preferable anyway, and should really be made the default behavior.)
I did some testing on this. The --noflush option seems to make no difference. In both cases, the withdraw returns normally, but any subsequent attempt to umount will hang, producing one of the following call traces (gfs and gfs2 respectively):

INFO: task umount.gfs:3717 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
umount.gfs    D ffff81000237eaa0     0  3717   3716                (NOTLB)
 ffff810066ae1c08 0000000000000086 0000000000000001 ffffffff800e3452
 ffff81006afa7ac8 0000000000000007 ffff810066f3c820 ffff8100026e4100
 0000024d4dd063e8 0000000000099e19 ffff810066f3ca08 0000000100000010
Call Trace:
 [<ffffffff800e3452>] block_read_full_page+0x259/0x276
 [<ffffffff8006f1f5>] do_gettimeofday+0x40/0x90
 [<ffffffff80028adc>] sync_page+0x0/0x43
 [<ffffffff800647ea>] io_schedule+0x3f/0x67
 [<ffffffff80028b1a>] sync_page+0x3e/0x43
 [<ffffffff8006492e>] __wait_on_bit_lock+0x36/0x66
 [<ffffffff8003ff92>] __lock_page+0x5e/0x64
 [<ffffffff800a1bd2>] wake_bit_function+0x0/0x23
 [<ffffffff8000c2e7>] do_generic_mapping_read+0x1df/0x354
 [<ffffffff8000d0fb>] file_read_actor+0x0/0x159
 [<ffffffff8000c5a8>] __generic_file_aio_read+0x14c/0x198
 [<ffffffff800c78fb>] generic_file_read+0xac/0xc5
 [<ffffffff800a1ba4>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8012e042>] selinux_file_permission+0x9f/0xb6
 [<ffffffff8000b6b0>] vfs_read+0xcb/0x171
 [<ffffffff80011c01>] sys_read+0x45/0x6e
 [<ffffffff8005e28d>] tracesys+0xd5/0xe0

With dmsetup --nolockfs and --noflush and gfs2:

INFO: task gfs2_logd:3145 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
gfs2_logd     D ffff810002376420     0  3145     27          3146  3143 (L-TLB)
 ffff810069b67cc0 0000000000000046 ffff81007ef70c33 ffff81007fb92178
 ffffffff800154ce 0000000000000009 ffff8100691e8040 ffffffff80309b60
 000000499756fb9a 0000000000006545 ffff8100691e8228 0000000000000282
Call Trace:
 [<ffffffff800154ce>] sync_buffer+0x0/0x3f
 [<ffffffff8006f1f5>] do_gettimeofday+0x40/0x90
 [<ffffffff800154ce>] sync_buffer+0x0/0x3f
 [<ffffffff800647ea>] io_schedule+0x3f/0x67
 [<ffffffff80015509>] sync_buffer+0x3b/0x3f
 [<ffffffff80064a16>] __wait_on_bit+0x40/0x6e
 [<ffffffff800154ce>] sync_buffer+0x0/0x3f
 [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80064ab0>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff800a1bd2>] wake_bit_function+0x0/0x23
 [<ffffffff8003aca8>] sync_dirty_buffer+0x96/0xcb
 [<ffffffff88626dc8>] :gfs2:log_write_header+0x10e/0x336
 [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff886273ac>] :gfs2:gfs2_log_flush+0x3bc/0x472
 [<ffffffff886269b5>] :gfs2:gfs2_ail1_empty+0x1a/0x95
 [<ffffffff8862793c>] :gfs2:gfs2_logd+0xa2/0x15c
 [<ffffffff8862789a>] :gfs2:gfs2_logd+0x0/0x15c
 [<ffffffff80032bdc>] kthread+0xfe/0x132
 [<ffffffff8005efb1>] child_rip+0xa/0x11
 [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032ade>] kthread+0x0/0x132
 [<ffffffff8005efa7>] child_rip+0x0/0x11

INFO: task umount.gfs2:3195 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
umount.gfs2   D ffff810002376420     0  3195   3194                (NOTLB)
 ffff8100681bfdb8 0000000000000082 ffff81000237eb18 ffff8100681bfd28
 ffff81007f756080 0000000000000007 ffff8100698c5080 ffffffff80309b60
 0000004afa9aff98 0000000000097576 ffff8100698c5268 0000000000000000
Call Trace:
 [<ffffffff80065613>] __down_write_nested+0x7a/0x92
 [<ffffffff8862700f>] :gfs2:gfs2_log_flush+0x1f/0x472
 [<ffffffff8862746d>] :gfs2:gfs2_meta_syncfs+0xb/0x37
 [<ffffffff8862e0ac>] :gfs2:gfs2_kill_sb+0x25/0x76
 [<ffffffff800e4d41>] deactivate_super+0x6a/0x82
 [<ffffffff800ee830>] sys_umount+0x245/0x27b
 [<ffffffff800b878c>] audit_syscall_entry+0x180/0x1b3
 [<ffffffff8005e28d>] tracesys+0xd5/0xe0

So we have to decide whether this is worth respinning the cman errata again this late in the build cycle to remove --noflush. My personal opinion is no, it's not worth respinning; we can deal with the umount problem and device sync issues in 5.6 or 5.5.z. Opinions?
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0266.html