Bug 465856 - GFS2: recovery stuck
GFS2: recovery stuck
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel (Show other bugs)
All Linux
medium Severity medium
: rc
: ---
Assigned To: Steve Whitehouse
Martin Jenner
Depends On:
  Show dependency treegraph
Reported: 2008-10-06 14:22 EDT by Nate Straz
Modified: 2009-01-20 15:19 EST (History)
10 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2009-01-20 15:19:35 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments (Terms of Use)
Stack traces and glock dumps from dash-01 and dash-02 (80.95 KB, application/x-gzip)
2008-10-06 14:34 EDT, Nate Straz
no flags Details
group_tool output and dumps, case #2 (59.56 KB, application/x-gzip)
2008-11-06 11:41 EST, Nate Straz
no flags Details
Data collected from cluster for case #3 (237.42 KB, application/x-gzip)
2008-11-06 12:59 EST, Nate Straz
no flags Details
gfs2 and dlm lock dumps for case3 (22.68 KB, application/x-gzip)
2008-11-06 15:28 EST, David Teigland
no flags Details
debug notes (19.16 KB, text/plain)
2008-11-06 15:36 EST, David Teigland
no flags Details
Second part of test patch (7.47 KB, patch)
2008-11-14 10:44 EST, Steve Whitehouse
no flags Details | Diff
Test program (6.87 KB, text/plain)
2008-11-18 06:01 EST, Steve Whitehouse
no flags Details
Updated second patch (6.88 KB, patch)
2008-11-18 06:04 EST, Steve Whitehouse
no flags Details | Diff
Test program (8.11 KB, text/x-csrc)
2008-11-18 06:07 EST, Steve Whitehouse
no flags Details

  None (edit)
Description Nate Straz 2008-10-06 14:22:24 EDT
Description of problem:

While running revolver on kernel-2.6.18-116.gfs2abhi.001, recovery hung in GFS2.

[root@dash-01 ~]# group_tool
type             level name     id       state       
fence            0     default  00010001 none        [1 2 3]
dlm              1     clvmd    00020001 none        [1 2 3]
dlm              1     dash0    00040001 none        [1 2]
dlm              1     dash1    00060001 none        [1 2]
dlm              1     dash2    00080001 none        [1 2]
gfs              2     dash0    00030001 FAIL_START_WAIT [1 2]
gfs              2     dash1    00050001 FAIL_START_WAIT [1 2]
gfs              2     dash2    00070001 FAIL_START_WAIT [1 2]

I'm attaching a tarball with stack traces and glock dumps for nodes 1 and 2.
Node 3 in this case was shot during this iteration of revolver and shouldn't not have participated in recovery.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.  Run revolver
Comment 1 Nate Straz 2008-10-06 14:34:20 EDT
Created attachment 319586 [details]
Stack traces and glock dumps from dash-01 and dash-02
Comment 2 Steve Whitehouse 2008-10-22 12:17:38 EDT
Need to discover if this is reproduceable.
Comment 3 Steve Whitehouse 2008-10-30 09:39:17 EDT
Just checked this again to see if maybe it was related to the memory allocation problems that we've been seeing. I can see no evidence of that in the traces though, so I think this must be something else.
Comment 5 Nate Straz 2008-11-06 11:41:09 EST
Created attachment 322752 [details]
group_tool output and dumps, case #2

I was able to hit this today.  It happened after one node was shot.  In the attached tarball is the output from `group_tool -v` as gtv-$node and `group_tool dump` as gtd-$node.  I was not able to get stack traces from the cluster.
Comment 7 David Teigland 2008-11-06 12:13:40 EST
group_tool -v shows that gfs recovery is complete everywhere except on
tank-04.  We need to look at data from gfs (gfs_controld and gfs-kernel)
to find out why recovery has not completed.

gfs_controld data can be collected from /var/log/messages and
group_tool dump gfs <fsname>.  gfs-kernel data from /var/log/messages
Comment 8 Nate Straz 2008-11-06 12:59:28 EST
Created attachment 322766 [details]
Data collected from cluster for case #3

The moon must be in the right phase today.  Here is another reproduction and I've gathered the information which Dave requested in comment #7.
Comment 9 David Teigland 2008-11-06 15:28:25 EST
Created attachment 322782 [details]
gfs2 and dlm lock dumps for case3
Comment 10 David Teigland 2008-11-06 15:36:30 EST
Created attachment 322785 [details]
debug notes

This attached file is the notes and excerpts from various debug info
I've taken.  I've traced things back into gfs2 to the point where I
can't say much more.

gfs2_recoverd on morph-03 and tank-03 is blocked trying to acquire
journal glocks for the failed journals.  The dlm has granted the
only journal glock request it's gotten, for glock 9,4 in fs 0 on
morph-03.  Someone who can decipher the gfs2 lockdump can probably
say more about why recoverd isn't getting the glocks it needs.
Comment 11 Steve Whitehouse 2008-11-07 05:42:02 EST
Well I can see whats going on now, at least partially. Its a pity that the flags which are in the glock dump are only the "in core" flags and not the on disk flags.

The stack trace gave the clue.... one of the glock workqueues gets stuck because the truncate in progress flag has been set on one of the inodes, so that it heads off trying to truncate the inode and no doubt then gets stuck due to the recovery that has yet to take place.

The reason for the flag is that when truncating large inodes, the transaction might have to be split up into smaller transactions since an inode may have more metadata than will fit into a single journal transaction. The flag is set when the transaction is started in order to signal that the truncation isn't complete.

The idea is that when the next node looks at the inode, it will see the truncate in progress flag, and finish the job. The trouble being that this occurs in the context of the glock workqueue, so that if something happens (such as a recovery) blocking the other locks (e.g. rgrp locks/transaction lock) then everything stops.

I must admit that I've wondered why the job of finishing the truncate was stuck in the glock code and not done as part of the recovery code, which would seem to make more sense. Maybe Ken had a good reason, but its not obvious to me, what that was.

It would be nice if we can figure out which inode caused the problem. I can easily add a patch to print out the on disk flags in the glock dump, but we might also be able to get the same info from the filesystem itself, assuming that its been preserved. Its just a question of looking in the inode blocks that are listed in the lock dump (there are very few in fact) to see which of them has the truncate in progress bit set. I guess we can do that with Bob's gfs2_edit tool.
Comment 12 Steve Whitehouse 2008-11-07 09:02:03 EST
Ok, so looking at fs #2, and looking at the first inode which is waiting for a lock, I see:

gfs2_edit - Global File System Editor (use with extreme caution)
Block #50890    (0xc6ca)      of 126579712 (0x78B7400) (disk inode)

  mh_magic              0x01161970                (hex)
  mh_type               4                         0x4
  mh_format             400                       0x190
  no_formal_ino         33                        0x21
  no_addr               50890                     0xc6ca
  di_mode               0100666                   (decimal)
  di_uid                0                         0x0
  di_gid                0                         0x0
  di_nlink              1                         0x1
  di_size               0                         0x0
  di_blocks             4                         0x4
  di_atime              1225986779                0x491312db
  di_mtime              1225989932                0x49131f2c
  di_ctime              1225989932                0x49131f2c
  di_major              0                         0x0
  di_minor              0                         0x0
  di_goal_meta          42259338                  0x284d38a
  di_goal_data          42259338                  0x284d38a
  di_flags              0x20000000                (hex)

As expected, its got the "truncate in progress" flag set, and sure enough it has 4 blocks, but zero size which is pretty much the kind of thing I'd expect to see with an inode in that state.
Comment 13 Steve Whitehouse 2008-11-07 09:20:35 EST
I guess I should have added the bit from the glock dump too:

G:  s:EX n:2/c6ca f:l t:EX d:EX/0 l:0 a:0 r:6
 H: s:EX f:W e:0 p:3554 [d_doio] gfs2_setattr+0x23/0x33a [gfs2]
 I: n:33/50890 t:8 f:0x00000010

so the lock is an exclusive one which has been got in order to do a setattr operation. I think I'll do a patch to extend the details which we print for inodes. I think it would be useful to have a few more bits of info, and it was always my intention to expand on the information printed in relation to the individual objects related to each glock.

Also, another thought... I'd expect recovery to go something like this:

1. "block locks" is issued to stop locking activity
2. dead node's locks are reset/recovered
3. recovery proceeds for the dead node's journal
4. "block locks" is unlocked

So we've got an inode with the "truncate in progress" bit set. Since we grab the lock on that inode and keep it for the whole of the truncation, even if we divide that truncation into multiple transactions, the only time we should ever see an inode with that bit set is after a node failure.

In that case though, there are two possible scenarios:

1. the "setattr" node issued its lock request before "block locks" in which case it has a queued lock request. In that case the lock_dlm module should be delaying the reply until after recovery since "block locks" is issued before the dead node's locks are recovered (1 & 2 above)

2. the "setattr" node issued its lock request after "block locks" in which case it should have found its request queued up until after that point.

So, I'm slightly confused now as to how this is happening.
Comment 14 Steve Whitehouse 2008-11-07 09:53:38 EST
Ok, so hows this for an idea....

Suppose we have a truncate which is interrupted by revolver killing a node such that its left with the "truncate in progress" bit on. The next time its accessed, which might be long after recovery has taken place, it will try to complete the truncate.

Now also suppose that we get the glock for the inode in question just at the very moment that "block locks" is being set due to a different node being shot by revolver. That way, when the truncate tries to complete, its glock request for rgrps would be stopped, and since its running in the context of the glock workqueue, that would also cause other locking activity to get stuck too.

So the key seems to be to realise that the inode with truncate in progress set is not one created by the very latest "shooting" of revolver, but in fact is left over from some previous one.

I wonder if we can move the truncatiom fix-up into journal recovery. Currently the journal recovery is fairly simple: it just copies the blocks back to the in-place blocks. Potentially we could also include the truncation fix up in that area too, so that we would no longer have to deal with it in the glock code. The only gotcha that I can see is that it might be possible for the transactions involved in truncating an inode to be split such that the first part of the truncation is in the journal, which is then flushed and the node then dies before starting the second part of the truncation. That would leave us with no record in the journal of the missing transaction.

Anyway, I'll keep looking at possible solutions.
Comment 16 Steve Whitehouse 2008-11-13 09:21:04 EST
The idea, as per comment #14 failed since there are too many dependancies relating to recovery. We can't create a thread on demand for finishing off truncation, since that implies a GFP_KERNEL allocation which we must not do from the glock workqueue context.

So the search is on for a suitable thread in which to deal with this truncation issue. Both recoverd and logd are out since there are deadlock problems with both of them. We can't use glockd or scand as we plan to remove those anyway. That leaves quotad, which already has two jobs, but would seem to be the most suitable overall, so thats my current plan.
Comment 17 Steve Whitehouse 2008-11-14 10:44:38 EST
Created attachment 323590 [details]
Second part of test patch

Earlier on, I posted the first part of the patch for this bz to cluster-devel. Thats is tested, and aside from the fact that I forgot to remove a couple of debugging printk's is in its final form.

The second (untested) part of the patch is attached. My next plan is to write a little test program to ensure that I can test it properly.
Comment 18 Steve Whitehouse 2008-11-18 06:01:54 EST
Created attachment 323880 [details]
Test program

Program used to test the fix. It mounts an fs, creates some files with various amounts of data and then unmount the fs, marks the files half their original size and sets the "truncate in progress" bit. It then remounts the fs, writes a single byte to the start of each file (to ensure that we get an EX lock) and checks to see if the number of blocks in the file has shrunk, and that the size is correct.

There are a few rough edges, but it works well enough to verify the fix for this particular bug.
Comment 19 Steve Whitehouse 2008-11-18 06:04:14 EST
Created attachment 323881 [details]
Updated second patch
Comment 20 Steve Whitehouse 2008-11-18 06:07:52 EST
Created attachment 323884 [details]
Test program

I managed to upload an old version the first time, this is the correct one.
Comment 21 Steve Whitehouse 2008-11-18 06:13:24 EST
The first patch of the pair is now in my -nmw git tree. I'll post the second one after I've done a bit more testing.
Comment 22 Steve Whitehouse 2008-11-18 06:39:36 EST
Ok, the second patch is now posted too. Abhi, can you create a RHEL build with the two kernel patches in it, so that Nate can test?
Comment 24 Steve Whitehouse 2008-11-25 03:31:25 EST
Can someone please set the remaining required flags?
Comment 27 Don Zickus 2008-12-02 17:19:17 EST
in kernel-2.6.18-125.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 30 errata-xmlrpc 2009-01-20 15:19:35 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.


Note You need to log in before you can comment on or make changes to this bug.