Bug 496884 - GFS2: withdraw after I/O error on file system A hangs other file systems
Status: CLOSED DUPLICATE of bug 229081
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Steve Whitehouse
QA Contact: Cluster QE
Depends On:
Blocks: 533192
Reported: 2009-04-21 10:42 EDT by Nate Straz
Modified: 2010-01-27 06:04 EST (History)
6 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2010-01-27 06:04:24 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments (Terms of Use)
glock and dlm dumps from all nodes and sysrq-t from dash-02 (22.74 KB, application/x-bzip2)
2009-04-21 10:50 EDT, Nate Straz

Description Nate Straz 2009-04-21 10:42:04 EDT
Description of problem:

While running revolver with a new error target test case (where a device-mapper error target is inserted underneath the file system in order to cause a withdraw), the `dmsetup suspend` command hung, causing the test to hang.  It appears that GFS2 is stuck in a memory allocation.

dmsetup       D ffff8101049c3338     0  9650   9645                     (NOTLB)
 ffff810138877a28 0000000000000082 ffff81000001ac10 000280d200000000
 0000000000000002 0000000000000006 ffff8101388f4040 ffff8101388f47a0
 000000665a23bf58 00000000001fec83 ffff8101388f4228 0000000000000000
Call Trace:
 [<ffffffff8000f61e>] __alloc_pages+0x65/0x2ce
 [<ffffffff88623150>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff88623159>] :gfs2:just_schedule+0x9/0xe
 [<ffffffff80063ad7>] __wait_on_bit+0x40/0x6e
 [<ffffffff88623150>] :gfs2:just_schedule+0x0/0xe
 [<ffffffff80063b71>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff8009dc93>] wake_bit_function+0x0/0x23
 [<ffffffff8862314b>] :gfs2:gfs2_glock_wait+0x2b/0x30
 [<ffffffff88632c4a>] :gfs2:gfs2_write_inode+0x5f/0x159
 [<ffffffff88632c42>] :gfs2:gfs2_write_inode+0x57/0x159
 [<ffffffff8002ff59>] __writeback_single_inode+0x1e9/0x328
 [<ffffffff800c8c8f>] zone_statistics+0x3e/0x6d
 [<ffffffff8000af0f>] get_page_from_freelist+0x377/0x3fa
 [<ffffffff80020ea2>] sync_sb_inodes+0x1b5/0x26f
 [<ffffffff880fc4ef>] :dm_mod:dev_suspend+0x0/0x18c
 [<ffffffff800edfb5>] sync_inodes_sb+0x99/0xa9
 [<ffffffff800dcb34>] __fsync_super+0x10/0x9e
 [<ffffffff800dcc05>] freeze_bdev+0x43/0x82
 [<ffffffff880f966f>] :dm_mod:dm_suspend+0x113/0x2db
 [<ffffffff8008a425>] default_wake_function+0x0/0xe
 [<ffffffff880fc54c>] :dm_mod:dev_suspend+0x5d/0x18c
 [<ffffffff880fce4c>] :dm_mod:ctl_ioctl+0x210/0x25b
 [<ffffffff8004219d>] do_ioctl+0x55/0x6b
 [<ffffffff800302e0>] vfs_ioctl+0x248/0x261
 [<ffffffff8004c70f>] sys_ioctl+0x59/0x78
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

GFS2 and DLM locks:

G:  s:UN n:2/67edb85 f:l t:EX d:UN/0 l:0 a:0 r:4
 H: s:EX f:W e:0 p:9545 [d_doio] gfs2_write_begin+0x53/0x2a1 [gfs2]
Resource ffff8101399d0ac0 Name (len=24) "       2         67edb85"
Local Copy, Master is node 3
Granted Queue
Conversion Queue
Waiting Queue
00e80001 -- (EX) Master:     03f70006

G:  s:SH n:2/67edb85 f:ldr t:EX d:UN/0 l:0 a:0 r:7
 H: s:EX f:W e:0 p:9547 [d_doio] gfs2_write_begin+0x53/0x2a1 [gfs2]
 H: s:EX f:W e:0 p:9650 [dmsetup] gfs2_write_inode+0x57/0x159 [gfs2]
 I: n:21/108977029 t:8 f:0x00000010

Resource ffff81013a72b7c0 Name (len=24) "       2         67edb85"
Local Copy, Master is node 3
Granted Queue
00860001 PR Master:     00280009
Conversion Queue
Waiting Queue

G:  s:UN n:2/67edb85 f:l t:SH d:EX/0 l:0 a:0 r:4
 H: s:SH f:W e:0 p:9551 [d_doio] gfs2_readpage+0x61/0x183 [gfs2]

Resource ffff8101396572c0 Name (len=24) "       2         67edb85"
Master Copy
Granted Queue
00280009 PR Remote:   2 00860001
Conversion Queue
Waiting Queue
03f70006 -- (EX) Remote:   1 00e80001
01bb0002 -- (PR)

Version-Release number of selected component (if applicable):

How reproducible:
Comment 1 Nate Straz 2009-04-21 10:44:25 EDT
Strike the memory allocation comment.  That's probably just leftover junk on the stack.  It appears from the glock dumps that GFS2 on dash-02 is waiting on a lock which it thinks is still shared, but the other two nodes think is unlocked.  DLM thinks the lock was already granted.
Comment 2 Nate Straz 2009-04-21 10:50:55 EDT
Created attachment 340543 [details]
glock and dlm dumps from all nodes and sysrq-t from dash-02

Here are the raw dumps from dash-0{1,2,3} and the sysrq-t from dash-02 where the dmsetup suspend was running.
Comment 3 Nate Straz 2009-04-22 17:28:13 EDT
I've done a little more digging into this and it appears that causing a withdraw on one GFS2 file system causes the others to hang.  In the test case we insert an error target underneath the file system by using dmsetup suspend.  This part succeeds and the file system detects the error and withdraws.

Buffer I/O error on device dm-2, logical block 9119
lost page write due to I/O error on dm-2
GFS2: fsid=dash:dash0.1: fatal: I/O error
GFS2: fsid=dash:dash0.1:   block = 9119
GFS2: fsid=dash:dash0.1:   function = log_write_header, file = fs/gfs2/log.c, l2
GFS2: fsid=dash:dash0.1: about to withdraw this file system
GFS2: fsid=dash:dash0.1: telling LM to withdraw

[root@dash-02 ~]# ps alx
5     0  8581     1 -100  -  20608   716 429493 Ss   ?          0:00 /sbin/gfs_controld
4     0 10196  8581  19   0  67604   808 -      D    ?          0:00 dmsetup suspend /dev/mapper/dash-dash0

This causes gfs_controld to suspend the device.  Any actions on the other file systems also hang.

0     0 10207  9665  17   0  73880   844 just_s D+   pts/0      0:00 ls --color=tty

Steps to reproduce:
1. create two GFS2 file systems and mount them
2. place an error target under the GFS2 file system
  a. dmsetup suspend /dev/mapper/$vg-$lv
  b. dmsetup load $vg-$lv --table "0 $length error $start_sector_from_linear"
  c. dmsetup resume /dev/mapper/$vg-$lv
3. try to access other file system
Comment 4 Steve Whitehouse 2009-04-23 05:46:06 EDT
Do you have a stack trace for the dmsetup suspend (pid 10196)? I can't see it in the attached info. It's obviously blocking, waiting for something.

I didn't know that just_schedule would show up in the WCHAN field of processes - that probably needs renaming to make it more obvious what's going on. It basically means that the process is waiting for a glock.

I suspect that the issue comes down to sharing the glock workqueue between filesystems. If you suspend a device just as the workqueue is trying to write back something or other, then that will block any other filesystem that queues some work on that same CPU.

That doesn't explain why the dmsetup started by gfs_controld has hung, though.
Comment 5 Nate Straz 2009-04-23 07:42:02 EDT
I recreated it again and here is the backtrace from the dmsetup process which gfs_controld starts:

PID: 22534  TASK: ffff81012962d080  CPU: 0   COMMAND: "dmsetup"
 #0 [ffff810129e7fa98] schedule at ffffffff80063045
 #1 [ffff810129e7fb70] __down_write_nested at ffffffff800646d4
 #2 [ffff810129e7fbb0] gfs2_log_flush at ffffffff885f9d6d
 #3 [ffff810129e7fc10] gfs2_sync_fs at ffffffff886045de
 #4 [ffff810129e7fc20] quota_sync_sb at ffffffff800ff3ba
 #5 [ffff810129e7fc60] __fsync_super at ffffffff800dcb73
 #6 [ffff810129e7fc80] freeze_bdev at ffffffff800dcc39
 #7 [ffff810129e7fca0] dm_suspend at ffffffff880f966f
 #8 [ffff810129e7fd00] dev_suspend at ffffffff880fc54c
 #9 [ffff810129e7fd30] ctl_ioctl at ffffffff880fce4c
#10 [ffff810129e7feb0] do_ioctl at ffffffff80042320
#11 [ffff810129e7fed0] vfs_ioctl at ffffffff80030424
#12 [ffff810129e7ff40] sys_ioctl at ffffffff8004c894
#13 [ffff810129e7ff80] tracesys at ffffffff8005d28d (via system_call)
    RIP: 000000352a6cbc77  RSP: 00007fffe3d6c6e8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: ffffffff8005d28d  RCX: ffffffffffffffff
    RDX: 0000000013666820  RSI: 00000000c138fd06  RDI: 0000000000000000
    RBP: 0000000000000000   R8: 0000003529e11e57   R9: 0000000013666850
    R10: 0000003529e11d59  R11: 0000000000000246  R12: 0000003529e11d59
    R13: 0000000000000000  R14: 0000003529e11d59  R15: 0000000000000000
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b
Comment 6 Steve Whitehouse 2009-04-23 08:22:51 EDT
So the issue appears to be that the dm suspend calls into the (now withdrawn) fs in order to flush the log, etc. This is stuck because the log flush is stuck - no surprise there, as that is what caused the withdraw in the first place.

The solution to this is to improve the way in which gfs2 deals with I/O errors so that we can at least retry writes (once should be plenty) and then do our best to handle the problem locally rather than withdrawing the complete fs. There is a long standing bz open against this problem, #253948.

The trouble is that it's not trivial to solve, since there are a lot of different points at which we send I/O to the journal in the current code, and they all need handling in a slightly different way.

I don't think there is much chance of being able to solve this for 5.4 now, so this will have to be a 5.5 thing at least, I think.

We could make the freeze of gfs2 (i.e. the journal flush & sync) fail, but that would mean that the whole dm suspend would fail too, so that doesn't really get us any further along the road. I'd be happier with a solution which is internal to gfs2, so that we prevent any further I/O at that level and thus avoid the userspace call, except for the final leaving of the cluster.

I need to think about this a bit more...
Comment 7 Steve Whitehouse 2009-05-27 09:23:45 EDT
I'm not so sure what we can do about this, for the reasons outlined in comment #6. Ideally we'd like to ensure that GFS2 has stopped all its I/O internally, but that isn't so easy to do. That would remove the need for any external dmsetup suspend calls, which is really what the issue is here.

So it looks like this will involve both a kernel and a userland fix. The big issue is how to ensure that all in-flight I/O has finished. Stopping new I/O is trivial by comparison. Maybe we can query the block device to see if its queue is empty.
Comment 8 David Teigland 2009-05-27 15:07:05 EDT
First, stepping way back so we all understand the bigger picture -- the specifics of what withdraw is and what it is supposed to do.

The ultimate goal is high availability.  When gfs on a node sees an i/o error, it needs to quickly get out of the way; otherwise it will prevent other nodes from using the fs -- low availability.  There are two good options for getting quickly out of the way so as to not prevent fs access from other nodes:

1. panic.  The other nodes will see a membership failure, fence, recover the journal, and continue working.  This is very effective, and for a long time this was the only option; people still want this option (see the other bz about a mount option meaning panic-on-error).  IMO this should be the default behavior instead of 2; it gives the best results.

2. withdraw.  This has never been a very good or a very useful idea, and has never worked very well.  The idea is to get quickly out of the way like 1, but without a panic or fencing reset, allowing the node a more controlled shutdown, or allowing other fs's or services to continue running that don't depend on the withdrawn fs. (SAN fencing instead of power fencing would generally give the same benefits.)

So, withdraw is getting out of the way of other nodes (so they can access the fs) without going through a full failure-recovery of the node.  The necessary steps are,
1. node1 detects i/o error on fs foo
2. node1 waits for all writes on device for fs foo to complete
3. node1 tells other nodes that it has withdrawn foo, and is done
4. other nodes recover the journal on foo that was used by node1
5. other nodes can continue using fs foo
6. later, the admin can shut other stuff down on node1 and reset it

In RHEL4, gfs does step 2 itself using the "diaper device" which counts the number of outstanding writes on the device; withdraw waits for that count to reach 0.   In RHEL5 we changed to use "dmsetup suspend" on the device which also waits for all writes to complete before returning.

By installing the dm error target to test withdraw, you are in a sense preemptively doing what the withdraw would do itself.  So the test may in some ways interfere with the system.  You might consider other ways of injecting i/o errors.

Now, to the hang, which looks to have a very simple explanation: like all other gfs functions that do i/o, gfs2_sync_fs() needs to check the SHUTDOWN flag (set if withdrawing) and just return if it's set.
Comment 9 Steve Whitehouse 2009-05-28 03:28:31 EDT
The problem is the "just return" though. It's only correct to return 0 when it has been called by the withdraw code; otherwise it should return an error, and the problem is how it knows what it has been called by...

So we need to be careful here. I don't want to give the impression to users that they've successfully suspended a withdrawn filesystem, in case they are using that as a sign that they can take a correct and up-to-date backup (if they are using snapshots, for example).

We do have plans to improve the error handling, so maybe this should just wait until we can address that. What we do on an I/O error should depend on the type of I/O that we were doing at the time. Sometimes it is possible to recover from such things, or alternatively we might just be able to return an error to the user and carry on. That's not true all the time though, so we will still need the option of withdraw or something like it.
Comment 10 David Teigland 2009-05-28 10:24:25 EDT
It sounds like you're describing the SHUTDOWN flag, which is only set on withdraw.
Comment 11 Steve Whitehouse 2009-05-28 10:28:00 EDT
Yes, because that's the only way the fs can know if a withdraw has occurred after the fact.
Comment 12 Steve Whitehouse 2010-01-27 06:04:24 EST
We already have a bug which covers this issue. Marking this as a dup of that one.

*** This bug has been marked as a duplicate of bug 229081 ***
