Description of problem:
While running revolver on kernel-2.6.18-116.gfs2abhi.001, recovery hung in GFS2.

[root@dash-01 ~]# group_tool
type  level name    id       state
fence 0     default 00010001 none
[1 2 3]
dlm   1     clvmd   00020001 none
[1 2 3]
dlm   1     dash0   00040001 none
[1 2]
dlm   1     dash1   00060001 none
[1 2]
dlm   1     dash2   00080001 none
[1 2]
gfs   2     dash0   00030001 FAIL_START_WAIT
[1 2]
gfs   2     dash1   00050001 FAIL_START_WAIT
[1 2]
gfs   2     dash2   00070001 FAIL_START_WAIT
[1 2]

I'm attaching a tarball with stack traces and glock dumps for nodes 1 and 2. Node 3 was shot during this iteration of revolver and shouldn't have participated in recovery.

Version-Release number of selected component (if applicable):
kernel-2.6.18-116.gfs2abhi.001

How reproducible:
Unknown

Steps to Reproduce:
1. Run revolver
Created attachment 319586 [details] Stack traces and glock dumps from dash-01 and dash-02
Need to discover whether this is reproducible.
Just checked this again to see if maybe it was related to the memory allocation problems that we've been seeing. I can see no evidence of that in the traces though, so I think this must be something else.
Created attachment 322752 [details] group_tool output and dumps, case #2

I was able to hit this today. It happened after one node was shot. In the attached tarball is the output from `group_tool -v` as gtv-$node and `group_tool dump` as gtd-$node. I was not able to get stack traces from the cluster.
group_tool -v shows that gfs recovery is complete everywhere except on tank-04. We need to look at data from gfs (gfs_controld and gfs-kernel) to find out why recovery has not completed. gfs_controld data can be collected from /var/log/messages and from `group_tool dump gfs <fsname>`; gfs-kernel data can be collected from /var/log/messages.
Created attachment 322766 [details] Data collected from cluster for case #3

The moon must be in the right phase today. Here is another reproduction, and I've gathered the information which Dave requested in comment #7.
Created attachment 322782 [details] gfs2 and dlm lock dumps for case3
Created attachment 322785 [details] debug notes

This attached file contains notes and excerpts from the various debug info I've collected. I've traced things back into gfs2 to the point where I can't say much more. gfs2_recoverd on morph-03 and tank-03 is blocked trying to acquire journal glocks for the failed journals. The dlm has granted the only journal glock request it's gotten, for glock 9,4 in fs 0 on morph-03. Someone who can decipher the gfs2 lockdump can probably say more about why recoverd isn't getting the glocks it needs.
Well, I can see what's going on now, at least partially. It's a pity that the flags in the glock dump are only the "in core" flags and not the on-disk flags. The stack trace gave the clue: one of the glock workqueues gets stuck because the "truncate in progress" flag has been set on one of the inodes, so it heads off trying to truncate the inode and no doubt then gets stuck due to the recovery that has yet to take place.

The reason for the flag is that when truncating large inodes, the operation might have to be split up into smaller transactions, since an inode may have more metadata than will fit into a single journal transaction. The flag is set when the first transaction is started in order to signal that the truncation isn't complete. The idea is that when the next node looks at the inode, it will see the "truncate in progress" flag and finish the job. The trouble is that this occurs in the context of the glock workqueue, so if something happens (such as a recovery) blocking the other locks (e.g. rgrp locks or the transaction lock), then everything stops.

I must admit that I've wondered why the job of finishing the truncate was put in the glock code and not done as part of the recovery code, which would seem to make more sense. Maybe Ken had a good reason, but it's not obvious to me what that was.

It would be nice if we could figure out which inode caused the problem. I can easily add a patch to print the on-disk flags in the glock dump, but we might also be able to get the same info from the filesystem itself, assuming that it's been preserved. It's just a question of looking in the inode blocks that are listed in the lock dump (there are very few, in fact) to see which of them has the "truncate in progress" bit set. I guess we can do that with Bob's gfs2_edit tool.
Ok, so looking at fs #2, and looking at the first inode which is waiting for a lock, I see:

gfs2_edit - Global File System Editor (use with extreme caution)

Block #50890    (0xc6ca) of 126579712 (0x78B7400) (disk inode)

Dinode:
  mh_magic      0x01161970 (hex)
  mh_type       4          0x4
  mh_format     400        0x190
  no_formal_ino 33         0x21
  no_addr       50890      0xc6ca
  di_mode       0100666    (decimal)
  di_uid        0          0x0
  di_gid        0          0x0
  di_nlink      1          0x1
  di_size       0          0x0
  di_blocks     4          0x4
  di_atime      1225986779 0x491312db
  di_mtime      1225989932 0x49131f2c
  di_ctime      1225989932 0x49131f2c
  di_major      0          0x0
  di_minor      0          0x0
  di_goal_meta  42259338   0x284d38a
  di_goal_data  42259338   0x284d38a
  di_flags      0x20000000 (hex)

As expected, it's got the "truncate in progress" flag set, and sure enough it has 4 blocks but zero size, which is pretty much the kind of thing I'd expect to see for an inode in that state.
I guess I should have added the bit from the glock dump too:

G:  s:EX n:2/c6ca f:l t:EX d:EX/0 l:0 a:0 r:6
 H: s:EX f:W e:0 p:3554 [d_doio] gfs2_setattr+0x23/0x33a [gfs2]
 I: n:33/50890 t:8 f:0x00000010

So the lock is an exclusive one which has been acquired in order to do a setattr operation.

I think I'll do a patch to extend the details which we print for inodes. It would be useful to have a few more bits of info, and it was always my intention to expand on the information printed in relation to the individual objects related to each glock.

Also, another thought... I'd expect recovery to go something like this:

1. "block locks" is issued to stop locking activity
2. the dead node's locks are reset/recovered
3. recovery proceeds for the dead node's journal
4. "block locks" is unlocked

So we've got an inode with the "truncate in progress" bit set. Since we grab the lock on that inode and keep it for the whole of the truncation, even if we divide that truncation into multiple transactions, the only time we should ever see an inode with that bit set is after a node failure. In that case, though, there are two possible scenarios:

1. the "setattr" node issued its lock request before "block locks", in which case it has a queued lock request. In that case the lock_dlm module should be delaying the reply until after recovery, since "block locks" is issued before the dead node's locks are recovered (steps 1 & 2 above)
2. the "setattr" node issued its lock request after "block locks", in which case it should have found its request queued up until after that point

So I'm slightly confused now as to how this is happening.
Ok, so how's this for an idea... Suppose we have a truncate which is interrupted by revolver killing a node, such that the inode is left with the "truncate in progress" bit on. The next time it's accessed, which might be long after recovery has taken place, we will try to complete the truncate. Now also suppose that we get the glock for the inode in question at the very moment that "block locks" is being set due to a different node being shot by revolver. That way, when the truncate tries to complete, its glock requests for rgrps would be stopped, and since it's running in the context of the glock workqueue, that would also cause other locking activity to get stuck too.

So the key seems to be to realise that the inode with "truncate in progress" set was not created by the very latest "shooting" of revolver, but is in fact left over from some previous one.

I wonder if we can move the truncation fix-up into journal recovery. Currently the journal recovery is fairly simple: it just copies the blocks back to the in-place blocks. Potentially we could include the truncation fix-up in that area too, so that we would no longer have to deal with it in the glock code. The only gotcha that I can see is that it might be possible for the transactions involved in truncating an inode to be split such that the first part of the truncation is in the journal, which is then flushed, and the node then dies before starting the second part of the truncation. That would leave us with no record in the journal of the missing transaction.

Anyway, I'll keep looking at possible solutions.
The idea, as per comment #14, failed since there are too many dependencies relating to recovery. We can't create a thread on demand for finishing off truncation, since that implies a GFP_KERNEL allocation, which we must not do from the glock workqueue context. So the search is on for a suitable existing thread in which to deal with this truncation issue. Both recoverd and logd are out since there are deadlock problems with both of them. We can't use glockd or scand as we plan to remove those anyway. That leaves quotad, which already has two jobs, but which would seem to be the most suitable overall, so that's my current plan.
Created attachment 323590 [details] Second part of test patch

Earlier on, I posted the first part of the patch for this bz to cluster-devel. That is tested and, aside from the fact that I forgot to remove a couple of debugging printks, is in its final form. The second (untested) part of the patch is attached. My next plan is to write a little test program to ensure that I can test it properly.
Created attachment 323880 [details] Test program

Program used to test the fix. It mounts an fs and creates some files with various amounts of data, then unmounts the fs, marks the files as half their original size, and sets the "truncate in progress" bit. It then remounts the fs, writes a single byte to the start of each file (to ensure that we get an EX lock), and checks whether the number of blocks in each file has shrunk and that the size is correct. There are a few rough edges, but it works well enough to verify the fix for this particular bug.
Created attachment 323881 [details] Updated second patch
Created attachment 323884 [details] Test program

I managed to upload an old version the first time; this is the correct one.
The first patch of the pair is now in my -nmw git tree. I'll post the second one after I've done a bit more testing.
Ok, the second patch is now posted too. Abhi, can you create a RHEL build with the two kernel patches in it, so that Nate can test?
Can someone please set the remaining required flags?
In kernel-2.6.18-125.el5.

You can download this test kernel from http://people.redhat.com/dzickus/el5
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html