Description of problem:
While running revolver with three GFS2 file systems I got into a state where recovery for the third file system would not complete.

[GFS2] [revolver] Scenario iteration 3.6 started at Fri Jan 8 12:40:25 CST 2010
[GFS2] [revolver] Sleeping 5 minute(s) to let the I/O get its lock count up...
[GFS2] [revolver] Senario: DLM lose cluster communication on Quorum
[GFS2] [revolver] (will need to physically kill last node)
[GFS2] [revolver]
[GFS2] [revolver] Those picked to face the revolver... morph-04 morph-01
[GFS2] [revolver] Killing network interfaces on morph-04...
[GFS2] [revolver] Feeling lucky morph-01? Well do ya? Go'head make my day...
[GFS2] [revolver] Didn't receive heartbeat for 2 seconds
[GFS2] [revolver]
[GFS2] [revolver] Verifying nodes were removed from cluster
[GFS2] [revolver] Verified morph-01 was removed on morph-03
[GFS2] [revolver] Verified morph-04 was removed on morph-03
[GFS2] [revolver] Verifying that the dueler(s) are alive ...
[GFS2] [revolver] All killed nodes are back up (able to be pinged), making sure they're qarshable...
[GFS2] [revolver] still not all qarshable, sleeping another 10 seconds
[GFS2] [revolver] All killed nodes are now qarshable
[GFS2] [revolver]
[GFS2] [revolver] Verifying that recovery properly took place (on the nodes that stayed in the cluster)
[GFS2] [revolver] checking that all of the cluster nodes are now/still cman members...
[GFS2] [revolver] checking fence recovery (state of each service)...
[GFS2] [revolver] checking dlm recovery (state of each service)...
[GFS2] [revolver] checking gfs recovery (state of each service)...
[GFS2] [revolver] gfs recovery is hung
[GFS2] [revolver] appears to be hung in FAIL_START_WAIT
[GFS2] [revolver] FI_engine: recover() method failed

[root@morph-03 ~]# group_tool
type             level name            id       state
fence            0     default         00010001 none
[1 3 4]
dlm              1     clvmd           00020001 none
[1 3 4]
dlm              1     morph-cluster0  00040001 none
[3]
dlm              1     morph-cluster1  00060001 none
[3]
dlm              1     morph-cluster2  00080001 none
[3]
gfs              2     morph-cluster0  00030001 none
[3]
gfs              2     morph-cluster1  00050001 none
[3]
gfs              2     morph-cluster2  00070001 FAIL_START_WAIT
[3]

[root@morph-03 morph-cluster:morph-cluster2]# cat /tmp/glocks.morph-cluster2
G: s:UN n:2/635a f:l t:EX d:EX/0 l:0 a:0 r:4
 H: s:EX f:cW e:0 p:2973 [gfs2_quotad] gfs2_statfs_sync+0x46/0x17d [gfs2]
G: s:SH n:5/18 f: t:SH d:EX/0 l:0 a:0 r:3
 H: s:SH f:EH e:0 p:2956 [(ended)] gfs2_inode_lookup+0x12d/0x1f0 [gfs2]
G: s:SH n:5/202b f: t:SH d:EX/0 l:0 a:0 r:3
 H: s:SH f:EH e:0 p:2956 [(ended)] gfs2_inode_lookup+0x12d/0x1f0 [gfs2]
G: s:SH n:5/19 f: t:SH d:EX/0 l:0 a:0 r:3
 H: s:SH f:EH e:0 p:2956 [(ended)] gfs2_inode_lookup+0x12d/0x1f0 [gfs2]
G: s:EX n:9/1 f: t:EX d:EX/0 l:0 a:0 r:3
 H: s:EX f:eH e:0 p:2956 [(ended)] gfs2_glock_nq_num+0x31/0x54 [gfs2]
G: s:SH n:5/6374 f: t:SH d:EX/0 l:0 a:0 r:3
 H: s:SH f:EH e:0 p:2956 [(ended)] gfs2_inode_lookup+0x12d/0x1f0 [gfs2]
G: s:SH n:5/635b f: t:SH d:EX/0 l:0 a:0 r:3
 H: s:SH f:EH e:0 p:2956 [(ended)] gfs2_inode_lookup+0x12d/0x1f0 [gfs2]
G: s:SH n:5/299e0fb f: t:SH d:EX/0 l:0 a:0 r:3
 H: s:SH f:EH e:0 p:2988 [d_doio] gfs2_inode_lookup+0x12d/0x1f0 [gfs2]
G: s:SH n:2/202b f: t:SH d:EX/0 l:0 a:0 r:4
 H: s:SH f:eEcH e:0 p:2956 [(ended)] init_journal+0x1e7/0x3aa [gfs2]
 I: n:5/8235 t:8 f:0x11 d:0x00000200 s:33554432/33554432
G: s:SH n:2/403d f: t:SH d:EX/0 l:0 a:0 r:4
 H: s:SH f:ecH e:0 p:2971 [gfs2_recoverd] gfs2_recover_journal+0xbf/0x814 [gfs2]
 I: n:6/16445 t:8 f:0x10 d:0x00000200 s:33554432/33554432
G: s:SH n:5/17 f: t:SH d:EX/0 l:0 a:0 r:3
 H: s:SH f:EH e:0 p:2956 [(ended)] gfs2_inode_lookup+0x12d/0x1f0 [gfs2]
G: s:EX n:2/6155 f: t:EX d:EX/0 l:0 a:0 r:4
 H: s:EX f:H e:0 p:2956 [(ended)] init_per_node+0x1f6/0x27d [gfs2]
 I: n:13/24917 t:8 f:0x10 d:0x00000200 s:1048576/1048576
G: s:SH n:5/6155 f: t:SH d:EX/0 l:0 a:0 r:3
 H: s:SH f:EH e:0 p:2956 [(ended)] gfs2_inode_lookup+0x12d/0x1f0 [gfs2]
G: s:SH n:4/0 f: t:SH d:EX/0 l:0 a:0 r:2
G: s:EX n:9/2 f: t:EX d:EX/0 l:0 a:0 r:3
 H: s:EX f:tecH e:0 p:2971 [gfs2_recoverd] gfs2_glock_nq_num+0x31/0x54 [gfs2]
G: s:UN n:2/6374 f: t:UN d:EX/0 l:0 a:0 r:2
G: s:SH n:5/6154 f: t:SH d:EX/0 l:0 a:0 r:3
 H: s:SH f:EH e:0 p:2956 [(ended)] gfs2_inode_lookup+0x12d/0x1f0 [gfs2]
G: s:EX n:2/6153 f: t:EX d:EX/0 l:0 a:0 r:4
 H: s:EX f:H e:0 p:2956 [(ended)] init_per_node+0x15d/0x27d [gfs2]
 I: n:11/24915 t:8 f:0x10 d:0x00000201 s:16/16
G: s:UN n:2/6359 f: t:UN d:EX/0 l:0 a:0 r:2
G: s:SH n:1/1 f: t:SH d:EX/0 l:0 a:0 r:3
 H: s:SH f:eEH e:0 p:2956 [(ended)] gfs2_glock_nq_num+0x31/0x54 [gfs2]
G: s:SH n:2/16 f: t:SH d:EX/0 l:0 a:0 r:3
 I: n:1/22 t:4 f:0x10 d:0x00000001 s:3864/3864
G: s:SH n:5/635a f: t:SH d:EX/0 l:0 a:0 r:3
 H: s:SH f:EH e:0 p:2956 [(ended)] gfs2_inode_lookup+0x12d/0x1f0 [gfs2]
G: s:SH n:2/17 f: t:SH d:EX/0 l:0 a:0 r:3
 I: n:2/23 t:4 f:0x10 d:0x00000201 s:3864/3864
G: s:UN n:1/2 f:l t:SH d:EX/0 l:0 a:0 r:5
 H: s:SH f:epcW e:0 p:2971 [gfs2_recoverd] gfs2_recover_journal+0x182/0x814 [gfs2]
 H: s:SH f:W e:0 p:2988 [d_doio] gfs2_do_trans_begin+0xae/0x119 [gfs2]
G: s:EX n:2/299e0fb f: t:EX d:UN/0 l:0 a:0 r:5
 H: s:EX f:H e:0 p:2988 [d_doio] gfs2_write_begin+0x47/0x34f [gfs2]
 H: s:EX f:W e:0 p:188 [pdflush] gfs2_write_inode+0x55/0x15a [gfs2]
 I: n:21/43639035 t:8 f:0x10 d:0x00000000 s:28562/28562
G: s:SH n:5/6359 f: t:SH d:EX/0 l:0 a:0 r:3
 H: s:SH f:EH e:0 p:2956 [(ended)] gfs2_inode_lookup+0x12d/0x1f0 [gfs2]
G: s:UN n:2/19 f: t:UN d:EX/0 l:0 a:0 r:2
G: s:SH n:5/6153 f: t:SH d:EX/0 l:0 a:0 r:3
 H: s:SH f:EH e:0 p:2956 [(ended)] gfs2_inode_lookup+0x12d/0x1f0 [gfs2]
G: s:UN n:1/3 f: t:UN d:EX/0 l:0 a:0 r:2
G: s:SH n:2/604f f: t:SH d:EX/0 l:0 a:0 r:2
G: s:SH n:5/604f f: t:SH d:EX/0 l:0 a:0 r:2
G: s:SH n:5/403d f: t:SH d:EX/0 l:0 a:0 r:3
 H: s:SH f:EH e:0 p:2956 [(ended)] gfs2_inode_lookup+0x12d/0x1f0 [gfs2]
G: s:EX n:2/6154 f: t:EX d:EX/0 l:0 a:0 r:4
 H: s:EX f:H e:0 p:2956 [(ended)] init_per_node+0x1ab/0x27d [gfs2]
 I: n:12/24916 t:8 f:0x10 d:0x00000201 s:24/24
G: s:SH n:2/18 f: t:SH d:EX/0 l:0 a:0 r:3
 I: n:3/24 t:4 f:0x10 d:0x00000201 s:3864/3864
G: s:UN n:2/635b f: t:UN d:EX/0 l:0 a:0 r:2
G: s:SH n:5/16 f: t:SH d:EX/0 l:0 a:0 r:3
 H: s:SH f:EH e:0 p:2956 [(ended)] gfs2_inode_lookup+0x12d/0x1f0 [gfs2]

crash> bt 2973
PID: 2973   TASK: f2567550  CPU: 0   COMMAND: "gfs2_quotad"
 #0 [f24f6e4c] schedule at c061c528
 #1 [f24f6ec4] just_schedule at f8f04119
 #2 [f24f6ec8] __wait_on_bit at c061cdb2
 #3 [f24f6ee0] out_of_line_wait_on_bit at c061ce37
 #4 [f24f6f18] gfs2_glock_wait at f8f0410d
 #5 [f24f6f28] gfs2_statfs_sync at f8f197bb
 #6 [f24f6f78] quotad_check_timeo at f8f137fc
 #7 [f24f6f88] gfs2_quotad at f8f15172
 #8 [f24f6fcc] kthread at c0435f11
 #9 [f24f6fe4] kernel_thread_helper at c0405c51

crash> bt 2971
PID: 2971   TASK: f319a550  CPU: 1   COMMAND: "gfs2_recoverd"
 #0 [f2076dd0] schedule at c061c528
 #1 [f2076e48] just_schedule at f8f04119
 #2 [f2076e4c] __wait_on_bit at c061cdb2
 #3 [f2076e64] out_of_line_wait_on_bit at c061ce37
 #4 [f2076e9c] gfs2_glock_wait at f8f0410d
 #5 [f2076eac] gfs2_recover_journal at f8f16014
 #6 [f2076fc0] gfs2_recoverd at f8f166bd
 #7 [f2076fcc] kthread at c0435f11
 #8 [f2076fe4] kernel_thread_helper at c0405c51

crash> bt 2988
PID: 2988   TASK: f2096000  CPU: 1   COMMAND: "d_doio"
 #0 [f2098cac] schedule at c061c528
 #1 [f2098d24] just_schedule at f8f04119
 #2 [f2098d28] __wait_on_bit at c061cdb2
 #3 [f2098d40] out_of_line_wait_on_bit at c061ce37
 #4 [f2098d78] gfs2_glock_wait at f8f0410d
 #5 [f2098d88] gfs2_do_trans_begin at f8f1a4c6
 #6 [f2098da4] gfs2_write_begin at f8f0cc56
 #7 [f2098dd8] gfs2_file_buffered_write at f8f0e2d9
 #8 [f2098e40] __gfs2_file_aio_write_nolock at f8f0e6e4
 #9 [f2098ec4] gfs2_file_write_nolock at f8f0e897
#10 [f2098f60] gfs2_file_write at f8f0e9c8
#11 [f2098f84] vfs_write at c0475fe5
#12 [f2098f9c] sys_write at c04765d4
#13 [f2098fb8] system_call at c0404f10
    EAX: ffffffda  EBX: 00000004  ECX: 08feac00  EDX: 00000941
    DS:  007b      ESI: 0057eca0  ES:  007b      EDI: 08feab9f
    SS:  007b      ESP: bfd31294  EBP: bfd312c8
    CS:  0073      EIP: 00f1a402  ERR: 00000004  EFLAGS: 00000246

crash> bt 188
PID: 188    TASK: f7fa2550  CPU: 0   COMMAND: "pdflush"
 #0 [f7f8adcc] schedule at c061c528
 #1 [f7f8ae44] just_schedule at f8f04119
 #2 [f7f8ae48] __wait_on_bit at c061cdb2
 #3 [f7f8ae60] out_of_line_wait_on_bit at c061ce37
 #4 [f7f8ae98] gfs2_glock_wait at f8f0410d
 #5 [f7f8aea8] gfs2_write_inode at f8f1367d
 #6 [f7f8aef8] __writeback_single_inode at c0494433
 #7 [f7f8af34] sync_sb_inodes at c049481f
 #8 [f7f8af54] writeback_inodes at c0494a6e
 #9 [f7f8af68] wb_kupdate at c045e00f
#10 [f7f8afa4] pdflush at c045e53c
#11 [f7f8afcc] kthread at c0435f11
#12 [f7f8afe4] kernel_thread_helper at c0405c51

crash> bt 2959
PID: 2959   TASK: f2096550  CPU: 0   COMMAND: "lock_dlm"
 #0 [f208bf30] schedule at c061c528
 #1 [f208bfa8] gdlm_thread at f8e80534
 #2 [f208bfcc] kthread at c0435f11
 #3 [f208bfe4] kernel_thread_helper at c0405c51

Version-Release number of selected component (if applicable):
kernel-2.6.18-183.el5

How reproducible:
Unknown; it happened on iteration 3.6 in revolver.

Steps to Reproduce:
1. run revolver

Actual results:

Expected results:

Additional info:
I can't see anything wrong with this:

G: s:UN n:1/2 f:l t:SH d:EX/0 l:0 a:0 r:5
 H: s:SH f:epcW e:0 p:2971 [gfs2_recoverd] gfs2_recover_journal+0x182/0x814 [gfs2]
 H: s:SH f:W e:0 p:2988 [d_doio] gfs2_do_trans_begin+0xae/0x119 [gfs2]

The question is: what is holding the lock in a different mode? The only time we take the transaction lock in a mode other than shared is when freezing the fs. So it looks like another node has probably frozen the fs, which is what is preventing recovery from proceeding.
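For reference, a standalone sketch (ordinary userspace C, not GFS2 or dlm source) of the standard dlm lock-mode compatibility matrix. GFS2's SH state maps to dlm PR, so the recovery thread's PR request on the transaction lock can only be blocked by a holder in CW, PW or EX, i.e. something stronger than shared, which for this lock should only ever mean a freeze:

/*
 * Illustrative only: the standard dlm lock-mode compatibility matrix.
 * compat[held][requested] is 1 if the request can be granted alongside
 * the currently held mode.
 */
#include <stdio.h>

enum mode { NL, CR, CW, PR, PW, EX };
static const char *name[] = { "NL", "CR", "CW", "PR", "PW", "EX" };

static const int compat[6][6] = {
	/*         NL CR CW PR PW EX */
	/* NL */ {  1, 1, 1, 1, 1, 1 },
	/* CR */ {  1, 1, 1, 1, 1, 0 },
	/* CW */ {  1, 1, 1, 0, 0, 0 },
	/* PR */ {  1, 1, 0, 1, 0, 0 },
	/* PW */ {  1, 1, 0, 0, 0, 0 },
	/* EX */ {  1, 0, 0, 0, 0, 0 },
};

int main(void)
{
	enum mode held;

	/* Which granted modes would block the recovery thread's PR request? */
	for (held = NL; held <= EX; held++)
		printf("held %s: PR request %s\n", name[held],
		       compat[held][PR] ? "granted" : "blocked");
	return 0;
}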
The other two nodes had been rebooted by revolver; morph-03 should be the only node left with valid locks.
Well, GFS2 thinks something else is holding that lock, so I guess the next step is to look at the dlm lock dumps and see what the dlm thinks has happened.
Created attachment 382989 [details]
dlm lock dump from morph-03, morph-cluster2 lockspace

Here are the dlm locks still on morph-03.
So this is the corresponding dlm lock:

Resource f1c2aac0 Name (len=24) "       1               2"
Master Copy
Granted Queue
012b0002 NL
Conversion Queue
Waiting Queue

and it claims that there is one NL mode lock, which seems to be local. There is nothing in the queue for some reason. I would expect to see at least the SH lock request. It's not at all obvious what is going on here.
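One way to read that resource name: assuming lock_dlm builds it as "%8x%16llx" from the glock type and number (that format string is my assumption, not something taken from this dump), the 24-byte name decodes to glock type 1 (non-disk), number 2, i.e. the transaction lock that appears as n:1/2 in the glock dump above. A quick sketch:

#include <stdio.h>

int main(void)
{
	unsigned int type = 1;          /* non-disk glock type, per the n:1/2 entry above */
	unsigned long long number = 2;  /* the transaction lock */
	char resname[25];

	/* assumed name format: 8 hex chars of type, 16 hex chars of number */
	snprintf(resname, sizeof(resname), "%8x%16llx", type, number);
	printf("Name (len=24) \"%s\"\n", resname);
	return 0;
}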
This looks very similar to #570363.
Dave, can you take a look at this and see if you can figure out what's going wrong? We have a node stuck on the transaction lock. This lock is only ever taken in the SH (PR) mode except when the fs is being frozen. Since there is no fsfreeze going on here, we'd not expect to see anything except the following transitions:

new -> PR
PR -> NL
NL -> PR
NL -> unlocked

It looks like the stuck node has sent a request to the dlm, but for some strange reason it doesn't appear to show up in the dlm lock dump. Bearing in mind that this report was filed prior to the recent dlm fix (rhel6/upstream), I suppose it is just possible that this is the same thing. I'm not sure how we can prove that, though.
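For anyone following along, a minimal userspace sketch (not kernel code) of the glock-state to dlm-mode mapping assumed above, together with those expected transitions:

#include <stdio.h>

struct map { const char *glock_state; const char *dlm_mode; };

static const struct map mode_map[] = {
	{ "LM_ST_UNLOCKED (UN)",  "NL / unlock" },
	{ "LM_ST_SHARED (SH)",    "PR" },
	{ "LM_ST_DEFERRED (DF)",  "CW" },
	{ "LM_ST_EXCLUSIVE (EX)", "EX" },
};

static const char *trans_lock_transitions[] = {
	"new -> PR", "PR -> NL", "NL -> PR", "NL -> unlocked",
};

int main(void)
{
	size_t i;

	for (i = 0; i < sizeof(mode_map) / sizeof(mode_map[0]); i++)
		printf("%-22s => %s\n", mode_map[i].glock_state, mode_map[i].dlm_mode);
	for (i = 0; i < sizeof(trans_lock_transitions) / sizeof(trans_lock_transitions[0]); i++)
		printf("expected transition: %s\n", trans_lock_transitions[i]);
	return 0;
}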
I'd look at the NOEXP conditions in lock_dlm and the cases where locks granted during recovery need to be dropped by lock_dlm and reacquired. There were definitely some subtle conditions and uncommon corner cases in those parts of lock_dlm -- I was pretty sure that stuff would break as a result of the lock_dlm rewrite/rework early in RHEL5 (which went along with me transferring lock_dlm ownership to the gfs2 folks). The main rule was that only locks used for doing journal recovery could be acquired until journal recovery was complete. If the dlm granted a lock apart from this (which happens due to dlm recovery), lock_dlm needed to release it and attempt to reacquire it.
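To make that rule concrete, here is a toy model of it in plain C (my own illustration, not lock_dlm source): any grant that arrives while journal recovery is still pending, and that is not for a recovery (NOEXP) lock, gets dropped and retried later.

#include <stdbool.h>
#include <stdio.h>

struct lock_req {
	const char *name;
	bool noexp;     /* needed for journal recovery */
	bool granted;
};

static bool recovery_pending = true;

/* Called when the (simulated) dlm grants a request. */
static void grant_callback(struct lock_req *req)
{
	if (recovery_pending && !req->noexp) {
		/* Grant arrived mid-recovery: drop it and retry later. */
		printf("%s: granted during recovery -> drop and reacquire later\n",
		       req->name);
		req->granted = false;
		return;
	}
	req->granted = true;
	printf("%s: granted\n", req->name);
}

int main(void)
{
	struct lock_req journal = { "journal glock (NOEXP)", true, false };
	struct lock_req trans   = { "transaction glock",     false, false };

	grant_callback(&journal);  /* allowed: recovery needs it */
	grant_callback(&trans);    /* deferred until recovery completes */

	recovery_pending = false;  /* journal recovery finished */
	grant_callback(&trans);    /* retried grant now completes */
	return 0;
}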
The rule is still the same, and the code in lock_dlm which deals with that is pretty much unchanged in rhel5. We've just fixed a similar (the same?) bug in rhel6, where the problem was that glocks with noexp requests queued after recovery had started were occasionally getting stuck. This happened when the reply from the DLM arrived after recovery had started but before the noexp request had been queued (i.e. there was a race with another lock-acquiring process). That was relatively easy to fix, since the GLF_FROZEN flag indicates which glocks are frozen with pending replies. It will be more complicated to fix in rhel5: most of the knowledge about frozen locks lives in lock_dlm, but the place that needs updating to check for the particular condition found in rhel6 is in gfs2 itself (in the glock state machine). So, assuming for the moment that we have the same problem here, the fix will have to be different, and I suspect more complicated.
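Roughly the ordering involved, as a toy model in C (only the notion of a "frozen" glock and the GLF_FROZEN name come from the rhel6 code described above; everything else is illustrative):

#include <stdbool.h>
#include <stdio.h>

struct glock {
	bool frozen;          /* a dlm reply is parked, pending recovery */
	bool noexp_waiting;   /* a recovery (noexp) holder is queued */
};

static void dlm_reply(struct glock *gl, bool recovery_started)
{
	if (recovery_started) {
		gl->frozen = true;   /* park the reply until recovery is done */
		printf("reply parked (glock frozen)\n");
	} else {
		printf("reply delivered normally\n");
	}
}

static void queue_noexp(struct glock *gl, bool check_frozen)
{
	gl->noexp_waiting = true;
	if (check_frozen && gl->frozen) {
		/* the rhel6-style fix: deliver the parked reply now */
		gl->frozen = false;
		gl->noexp_waiting = false;
		printf("frozen glock thawed, noexp holder proceeds\n");
	} else if (gl->frozen) {
		printf("noexp holder stuck behind parked reply (the hang)\n");
	}
}

int main(void)
{
	struct glock gl = { false, false };

	dlm_reply(&gl, true);      /* reply races in after recovery starts  */
	queue_noexp(&gl, false);   /* unfixed behaviour: hangs              */
	queue_noexp(&gl, true);    /* with the extra check: proceeds        */
	return 0;
}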
This might be related to #656032
Bob, I'm wondering whether we should use this bz to apply the transaction lock bit of the #656032 patch. I'd prefer to use the upstream solution, which is the one-liner here:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=846f40455276617275284a4b76b89311b4aed0b9

There is no need to add the flush back in the umount path, because it is already there (just hidden in one of the functions that are already called in that code path). That way we can get part of the #656032 patch in sooner rather than later, since we know that this does fix a real bug.
Excellent idea. Requesting ack flags to get this into a release.
Created attachment 475208 [details]
Patch to fix the problem

Here is the RHEL5 crosswrite patch.
The patch was posted to rhkernel-list for inclusion into RHEL5.7. Changing status to POST.
Cloned for a RHEL6 crosswrite as bug #672600
in kernel-2.6.18-242.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.
I ran the load for over a week and have not been able to reproduce this recovery error. Calling it verified on 2.6.18-261.el5.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html