Bug 553803

Summary: GFS2: recovery stuck on transaction lock
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.5
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: urgent
Reporter: Nate Straz <nstraz>
Assignee: Robert Peterson <rpeterso>
QA Contact: Cluster QE <mspqa-list>
CC: adas, bmarzins, brsmith, cww, dhoward, rpeterso, ssaha, swhiteho, teigland
Target Milestone: rc
Keywords: ZStream
Doc Type: Bug Fix
Cloned As: 672600
Bug Depends On: 570263, 590878
Bug Blocks: 672600, 733678
Last Closed: 2011-07-21 10:29:01 UTC

Attachments:
- dlm lock dump from morph-03, morph-cluster2 lockspace
- Patch to fix the problem

Description Nate Straz 2010-01-08 22:18:31 UTC
Description of problem:

While running revolver with three GFS2 file systems I got into a state where recovery for the third file system would not complete.

[GFS2] [revolver] Scenario iteration 3.6 started at Fri Jan  8 12:40:25 CST 2010
[GFS2] [revolver] Sleeping 5 minute(s) to let the I/O get its lock count up...
[GFS2] [revolver] Scenario: DLM lose cluster communication on Quorum
[GFS2] [revolver]          (will need to physically kill last node)
[GFS2] [revolver] 
[GFS2] [revolver] Those picked to face the revolver... morph-04 morph-01 
[GFS2] [revolver] Killing network interfaces on morph-04...
[GFS2] [revolver] Feeling lucky morph-01? Well do ya? Go'head make my day...
[GFS2] [revolver] Didn't receive heartbeat for 2 seconds
[GFS2] [revolver] 
[GFS2] [revolver] Verifying nodes were removed from cluster
[GFS2] [revolver] Verified morph-01 was removed on morph-03
[GFS2] [revolver] Verified morph-04 was removed on morph-03
[GFS2] [revolver] Verifying that the dueler(s) are alive
...
[GFS2] [revolver] All killed nodes are back up (able to be pinged), making sure they're qarshable...
[GFS2] [revolver] still not all qarshable, sleeping another 10 seconds
[GFS2] [revolver] All killed nodes are now qarshable
[GFS2] [revolver] 
[GFS2] [revolver] Verifying that recovery properly took place (on the nodes that stayed in the cluster)
[GFS2] [revolver] checking that all of the cluster nodes are now/still cman members...
[GFS2] [revolver] checking fence recovery (state of each service)...
[GFS2] [revolver] checking dlm recovery (state of each service)...
[GFS2] [revolver] checking gfs recovery (state of each service)...
[GFS2] [revolver] gfs recovery is hung
[GFS2] [revolver] appears to be hung in FAIL_START_WAIT
[GFS2] [revolver] FI_engine: recover() method failed

[root@morph-03 ~]# group_tool
type             level name            id       state
fence            0     default         00010001 none
[1 3 4]
dlm              1     clvmd           00020001 none
[1 3 4]
dlm              1     morph-cluster0  00040001 none
[3]
dlm              1     morph-cluster1  00060001 none
[3]
dlm              1     morph-cluster2  00080001 none
[3]
gfs              2     morph-cluster0  00030001 none
[3]
gfs              2     morph-cluster1  00050001 none
[3]
gfs              2     morph-cluster2  00070001 FAIL_START_WAIT
[3]

[root@morph-03 morph-cluster:morph-cluster2]# cat /tmp/glocks.morph-cluster2 
G:  s:UN n:2/635a f:l t:EX d:EX/0 l:0 a:0 r:4
 H: s:EX f:cW e:0 p:2973 [gfs2_quotad] gfs2_statfs_sync+0x46/0x17d [gfs2]
G:  s:SH n:5/18 f: t:SH d:EX/0 l:0 a:0 r:3
 H: s:SH f:EH e:0 p:2956 [(ended)] gfs2_inode_lookup+0x12d/0x1f0 [gfs2]
G:  s:SH n:5/202b f: t:SH d:EX/0 l:0 a:0 r:3
 H: s:SH f:EH e:0 p:2956 [(ended)] gfs2_inode_lookup+0x12d/0x1f0 [gfs2]
G:  s:SH n:5/19 f: t:SH d:EX/0 l:0 a:0 r:3
 H: s:SH f:EH e:0 p:2956 [(ended)] gfs2_inode_lookup+0x12d/0x1f0 [gfs2]
G:  s:EX n:9/1 f: t:EX d:EX/0 l:0 a:0 r:3
 H: s:EX f:eH e:0 p:2956 [(ended)] gfs2_glock_nq_num+0x31/0x54 [gfs2]
G:  s:SH n:5/6374 f: t:SH d:EX/0 l:0 a:0 r:3
 H: s:SH f:EH e:0 p:2956 [(ended)] gfs2_inode_lookup+0x12d/0x1f0 [gfs2]
G:  s:SH n:5/635b f: t:SH d:EX/0 l:0 a:0 r:3
 H: s:SH f:EH e:0 p:2956 [(ended)] gfs2_inode_lookup+0x12d/0x1f0 [gfs2]
G:  s:SH n:5/299e0fb f: t:SH d:EX/0 l:0 a:0 r:3
 H: s:SH f:EH e:0 p:2988 [d_doio] gfs2_inode_lookup+0x12d/0x1f0 [gfs2]
G:  s:SH n:2/202b f: t:SH d:EX/0 l:0 a:0 r:4
 H: s:SH f:eEcH e:0 p:2956 [(ended)] init_journal+0x1e7/0x3aa [gfs2]
 I: n:5/8235 t:8 f:0x11 d:0x00000200 s:33554432/33554432
G:  s:SH n:2/403d f: t:SH d:EX/0 l:0 a:0 r:4
 H: s:SH f:ecH e:0 p:2971 [gfs2_recoverd] gfs2_recover_journal+0xbf/0x814 [gfs2]
 I: n:6/16445 t:8 f:0x10 d:0x00000200 s:33554432/33554432
G:  s:SH n:5/17 f: t:SH d:EX/0 l:0 a:0 r:3
 H: s:SH f:EH e:0 p:2956 [(ended)] gfs2_inode_lookup+0x12d/0x1f0 [gfs2]
G:  s:EX n:2/6155 f: t:EX d:EX/0 l:0 a:0 r:4
 H: s:EX f:H e:0 p:2956 [(ended)] init_per_node+0x1f6/0x27d [gfs2]
 I: n:13/24917 t:8 f:0x10 d:0x00000200 s:1048576/1048576
G:  s:SH n:5/6155 f: t:SH d:EX/0 l:0 a:0 r:3
 H: s:SH f:EH e:0 p:2956 [(ended)] gfs2_inode_lookup+0x12d/0x1f0 [gfs2]
G:  s:SH n:4/0 f: t:SH d:EX/0 l:0 a:0 r:2
G:  s:EX n:9/2 f: t:EX d:EX/0 l:0 a:0 r:3
 H: s:EX f:tecH e:0 p:2971 [gfs2_recoverd] gfs2_glock_nq_num+0x31/0x54 [gfs2]
G:  s:UN n:2/6374 f: t:UN d:EX/0 l:0 a:0 r:2
G:  s:SH n:5/6154 f: t:SH d:EX/0 l:0 a:0 r:3
 H: s:SH f:EH e:0 p:2956 [(ended)] gfs2_inode_lookup+0x12d/0x1f0 [gfs2]
G:  s:EX n:2/6153 f: t:EX d:EX/0 l:0 a:0 r:4
 H: s:EX f:H e:0 p:2956 [(ended)] init_per_node+0x15d/0x27d [gfs2]
 I: n:11/24915 t:8 f:0x10 d:0x00000201 s:16/16
G:  s:UN n:2/6359 f: t:UN d:EX/0 l:0 a:0 r:2
G:  s:SH n:1/1 f: t:SH d:EX/0 l:0 a:0 r:3
 H: s:SH f:eEH e:0 p:2956 [(ended)] gfs2_glock_nq_num+0x31/0x54 [gfs2]
G:  s:SH n:2/16 f: t:SH d:EX/0 l:0 a:0 r:3
 I: n:1/22 t:4 f:0x10 d:0x00000001 s:3864/3864
G:  s:SH n:5/635a f: t:SH d:EX/0 l:0 a:0 r:3
 H: s:SH f:EH e:0 p:2956 [(ended)] gfs2_inode_lookup+0x12d/0x1f0 [gfs2]
G:  s:SH n:2/17 f: t:SH d:EX/0 l:0 a:0 r:3
 I: n:2/23 t:4 f:0x10 d:0x00000201 s:3864/3864
G:  s:UN n:1/2 f:l t:SH d:EX/0 l:0 a:0 r:5
 H: s:SH f:epcW e:0 p:2971 [gfs2_recoverd] gfs2_recover_journal+0x182/0x814 [gfs2]
 H: s:SH f:W e:0 p:2988 [d_doio] gfs2_do_trans_begin+0xae/0x119 [gfs2]
G:  s:EX n:2/299e0fb f: t:EX d:UN/0 l:0 a:0 r:5
 H: s:EX f:H e:0 p:2988 [d_doio] gfs2_write_begin+0x47/0x34f [gfs2]
 H: s:EX f:W e:0 p:188 [pdflush] gfs2_write_inode+0x55/0x15a [gfs2]
 I: n:21/43639035 t:8 f:0x10 d:0x00000000 s:28562/28562
G:  s:SH n:5/6359 f: t:SH d:EX/0 l:0 a:0 r:3
 H: s:SH f:EH e:0 p:2956 [(ended)] gfs2_inode_lookup+0x12d/0x1f0 [gfs2]
G:  s:UN n:2/19 f: t:UN d:EX/0 l:0 a:0 r:2
G:  s:SH n:5/6153 f: t:SH d:EX/0 l:0 a:0 r:3
 H: s:SH f:EH e:0 p:2956 [(ended)] gfs2_inode_lookup+0x12d/0x1f0 [gfs2]
G:  s:UN n:1/3 f: t:UN d:EX/0 l:0 a:0 r:2
G:  s:SH n:2/604f f: t:SH d:EX/0 l:0 a:0 r:2
G:  s:SH n:5/604f f: t:SH d:EX/0 l:0 a:0 r:2
G:  s:SH n:5/403d f: t:SH d:EX/0 l:0 a:0 r:3
 H: s:SH f:EH e:0 p:2956 [(ended)] gfs2_inode_lookup+0x12d/0x1f0 [gfs2]
G:  s:EX n:2/6154 f: t:EX d:EX/0 l:0 a:0 r:4
 H: s:EX f:H e:0 p:2956 [(ended)] init_per_node+0x1ab/0x27d [gfs2]
 I: n:12/24916 t:8 f:0x10 d:0x00000201 s:24/24
G:  s:SH n:2/18 f: t:SH d:EX/0 l:0 a:0 r:3
 I: n:3/24 t:4 f:0x10 d:0x00000201 s:3864/3864
G:  s:UN n:2/635b f: t:UN d:EX/0 l:0 a:0 r:2
G:  s:SH n:5/16 f: t:SH d:EX/0 l:0 a:0 r:3
 H: s:SH f:EH e:0 p:2956 [(ended)] gfs2_inode_lookup+0x12d/0x1f0 [gfs2]


crash> bt 2973
PID: 2973   TASK: f2567550  CPU: 0   COMMAND: "gfs2_quotad"
 #0 [f24f6e4c] schedule at c061c528
 #1 [f24f6ec4] just_schedule at f8f04119
 #2 [f24f6ec8] __wait_on_bit at c061cdb2
 #3 [f24f6ee0] out_of_line_wait_on_bit at c061ce37
 #4 [f24f6f18] gfs2_glock_wait at f8f0410d
 #5 [f24f6f28] gfs2_statfs_sync at f8f197bb
 #6 [f24f6f78] quotad_check_timeo at f8f137fc
 #7 [f24f6f88] gfs2_quotad at f8f15172
 #8 [f24f6fcc] kthread at c0435f11
 #9 [f24f6fe4] kernel_thread_helper at c0405c51
crash> bt 2971
PID: 2971   TASK: f319a550  CPU: 1   COMMAND: "gfs2_recoverd"
 #0 [f2076dd0] schedule at c061c528
 #1 [f2076e48] just_schedule at f8f04119
 #2 [f2076e4c] __wait_on_bit at c061cdb2
 #3 [f2076e64] out_of_line_wait_on_bit at c061ce37
 #4 [f2076e9c] gfs2_glock_wait at f8f0410d
 #5 [f2076eac] gfs2_recover_journal at f8f16014
 #6 [f2076fc0] gfs2_recoverd at f8f166bd
 #7 [f2076fcc] kthread at c0435f11
 #8 [f2076fe4] kernel_thread_helper at c0405c51
crash> bt 2988
PID: 2988   TASK: f2096000  CPU: 1   COMMAND: "d_doio"
 #0 [f2098cac] schedule at c061c528
 #1 [f2098d24] just_schedule at f8f04119
 #2 [f2098d28] __wait_on_bit at c061cdb2
 #3 [f2098d40] out_of_line_wait_on_bit at c061ce37
 #4 [f2098d78] gfs2_glock_wait at f8f0410d
 #5 [f2098d88] gfs2_do_trans_begin at f8f1a4c6
 #6 [f2098da4] gfs2_write_begin at f8f0cc56
 #7 [f2098dd8] gfs2_file_buffered_write at f8f0e2d9
 #8 [f2098e40] __gfs2_file_aio_write_nolock at f8f0e6e4
 #9 [f2098ec4] gfs2_file_write_nolock at f8f0e897
#10 [f2098f60] gfs2_file_write at f8f0e9c8
#11 [f2098f84] vfs_write at c0475fe5
#12 [f2098f9c] sys_write at c04765d4
#13 [f2098fb8] system_call at c0404f10
    EAX: ffffffda  EBX: 00000004  ECX: 08feac00  EDX: 00000941 
    DS:  007b      ESI: 0057eca0  ES:  007b      EDI: 08feab9f
    SS:  007b      ESP: bfd31294  EBP: bfd312c8
    CS:  0073      EIP: 00f1a402  ERR: 00000004  EFLAGS: 00000246 
crash> bt 188
PID: 188    TASK: f7fa2550  CPU: 0   COMMAND: "pdflush"
 #0 [f7f8adcc] schedule at c061c528
 #1 [f7f8ae44] just_schedule at f8f04119
 #2 [f7f8ae48] __wait_on_bit at c061cdb2
 #3 [f7f8ae60] out_of_line_wait_on_bit at c061ce37
 #4 [f7f8ae98] gfs2_glock_wait at f8f0410d
 #5 [f7f8aea8] gfs2_write_inode at f8f1367d
 #6 [f7f8aef8] __writeback_single_inode at c0494433
 #7 [f7f8af34] sync_sb_inodes at c049481f
 #8 [f7f8af54] writeback_inodes at c0494a6e
 #9 [f7f8af68] wb_kupdate at c045e00f
#10 [f7f8afa4] pdflush at c045e53c
#11 [f7f8afcc] kthread at c0435f11
#12 [f7f8afe4] kernel_thread_helper at c0405c51
crash> bt 2959
PID: 2959   TASK: f2096550  CPU: 0   COMMAND: "lock_dlm"
 #0 [f208bf30] schedule at c061c528
 #1 [f208bfa8] gdlm_thread at f8e80534
 #2 [f208bfcc] kthread at c0435f11
 #3 [f208bfe4] kernel_thread_helper at c0405c51

Version-Release number of selected component (if applicable):
kernel-2.6.18-183.el5

How reproducible:
Unknown; it happened on iteration 3.6 of revolver.


Steps to Reproduce:
1. run revolver
  
Actual results:


Expected results:


Additional info:

Comment 1 Steve Whitehouse 2010-01-11 13:17:27 UTC
I can't see anything wrong with this:

G:  s:UN n:1/2 f:l t:SH d:EX/0 l:0 a:0 r:5
 H: s:SH f:epcW e:0 p:2971 [gfs2_recoverd] gfs2_recover_journal+0x182/0x814
[gfs2]
 H: s:SH f:W e:0 p:2988 [d_doio] gfs2_do_trans_begin+0xae/0x119 [gfs2]


The question is: what is holding the lock in a different mode? The only time we take the transaction lock in a mode other than shared is when freezing the fs. So it looks like another node has probably frozen the fs, which is what is preventing recovery from proceeding.
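
To illustrate the point above, here is a minimal sketch (simplified and written from memory, not the literal RHEL5 GFS2 source) of the two ways the transaction glock gets taken: every transaction acquires it shared, and only the freeze path takes a non-shared mode, which is why a frozen fs blocks both writers and journal recovery:

/* Sketch only -- simplified from the behaviour described above, not the
 * literal GFS2 code.  gfs2_glock_nq_init(), LM_ST_SHARED, LM_ST_EXCLUSIVE
 * and GL_NOCACHE are real GFS2 identifiers; sd_trans_gl stands for the
 * per-superblock transaction glock. */

static int sketch_trans_begin(struct gfs2_sbd *sdp, struct gfs2_holder *gh)
{
	/* Normal path: every transaction holds the transaction glock
	 * SHARED, so transactions on all nodes can run concurrently. */
	return gfs2_glock_nq_init(sdp->sd_trans_gl, LM_ST_SHARED,
				  GL_NOCACHE, gh);
}

static int sketch_freeze_fs(struct gfs2_sbd *sdp, struct gfs2_holder *gh)
{
	/* Freeze path: the only caller that takes the lock in a
	 * non-shared mode (shown as exclusive here for illustration).
	 * While held, no node can start a transaction -- including the
	 * transactions needed for journal recovery. */
	return gfs2_glock_nq_init(sdp->sd_trans_gl, LM_ST_EXCLUSIVE,
				  GL_NOCACHE, gh);
}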

Comment 2 Nate Straz 2010-01-11 14:34:55 UTC
The other two nodes had been rebooted by revolver, morph-03 should be the only node left with valid locks.

Comment 3 Steve Whitehouse 2010-01-11 14:39:08 UTC
Well, GFS2 thinks something else is holding that lock, so I guess the next step is to look at the dlm lock dumps to see what it thinks has happened.

Comment 4 Nate Straz 2010-01-11 14:48:59 UTC
Created attachment 382989 [details]
dlm lock dump from morph-03, morph-cluster2 lockspace

Here are the dlm locks still on morph-03.

Comment 5 Steve Whitehouse 2010-01-11 15:13:30 UTC
So this is the corresponding dlm lock:

Resource f1c2aac0 Name (len=24) "       1               2"
Master Copy
Granted Queue
012b0002 NL
Conversion Queue
Waiting Queue


and it claims that there is one NL mode lock, which seems to be local. There is nothing in the queue for some reason; I would expect to see at least the SH lock request. It's not at all obvious what is going on here.
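
For reference, the missing SH request would reach the DLM as a NL -> PR conversion on the existing lock. A rough sketch of such a request at the kernel DLM API level is shown below (the helper name and argument plumbing are illustrative, not lock_dlm's actual code); while it is outstanding it should appear on the resource's conversion queue in a lock dump like the one above:

#include <linux/dlm.h>

/* Illustrative only -- not the lock_dlm source.  A NL -> PR conversion of
 * an existing lock (identified by lksb->sb_lkid) as it would be submitted
 * to the DLM.  Until the completion AST fires, the request should be
 * visible on the resource's conversion queue in the dlm lock dump, which
 * is exactly what is missing above. */
static int sketch_convert_nl_to_pr(dlm_lockspace_t *ls, struct dlm_lksb *lksb,
				   void (*ast)(void *astarg),
				   void (*bast)(void *astarg, int mode),
				   void *astarg)
{
	return dlm_lock(ls, DLM_LOCK_PR, lksb, DLM_LKF_CONVERT,
			NULL, 0,	/* no resource name needed for a conversion */
			0,		/* no parent lock */
			ast, astarg, bast);
}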

Comment 6 Steve Whitehouse 2010-04-21 12:58:32 UTC
This looks very similar to #570363.

Comment 7 Steve Whitehouse 2010-05-11 10:34:21 UTC
Dave, can you take a look at this and see if you can figure out what's going wrong? We have a node stuck on the transaction lock. This lock is only ever taken in the SH (PR) mode except when the fs is being frozen. Since there is no fsfreeze going on here, we'd not expect to see anything except the following transitions:

new -> PR
PR -> NL
NL -> PR
NL -> unlocked

It looks like the stuck node has sent a request to the dlm, but for some strange reason it doesn't appear to show up in the dlm lock dump.

Bearing in mind that this report was prior to the recent dlm fix (rhel6/upstream), I suppose that it is just possible that this is the same thing. I'm not sure how we can prove that though.
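
For anyone mapping the glock terminology above onto the DLM modes in the lock dump, lock_dlm translates glock states to DLM modes roughly as follows (reconstructed from memory of the lock_dlm module, so treat it as a sketch):

/* Sketch of lock_dlm's glock-state -> DLM-mode mapping (from memory).
 * This is why the comments above write "SH (PR)": a shared glock request
 * becomes a PR (protected read) DLM lock, and an unlocked glock is held
 * as NL (null) on the DLM side. */
static int sketch_make_mode(unsigned int lmstate)
{
	switch (lmstate) {
	case LM_ST_UNLOCKED:
		return DLM_LOCK_NL;
	case LM_ST_EXCLUSIVE:
		return DLM_LOCK_EX;
	case LM_ST_DEFERRED:
		return DLM_LOCK_CW;
	case LM_ST_SHARED:
		return DLM_LOCK_PR;
	}
	return -1;	/* unreachable for valid states */
}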

Comment 8 David Teigland 2010-05-11 21:20:57 UTC
I'd look at the NOEXP conditions in lock_dlm and the cases where locks granted during recovery need to be dropped by lock_dlm and reacquired.  There were definitely some subtle conditions and uncommon corner cases in those parts of lock_dlm -- I was pretty sure that stuff would break as a result of the lock_dlm rewrite/rework early in RHEL5 (which went along with me transferring the lock_dlm ownership to the gfs2 folks).

The main rule was that only locks used for doing journal recovery could be acquired until journal recovery was complete.  If the dlm granted a lock apart from this (which happens due to dlm recovery), lock_dlm needed to release them and attempt to reacquire them.
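
The rule David describes could be summarised by the following check (purely illustrative, with hypothetical names -- not the lock_dlm implementation):

#include <linux/types.h>

/* Illustrative only -- hypothetical helper, not lock_dlm code.  Until
 * journal recovery has finished, only requests flagged NOEXP (those used
 * to perform the recovery itself) may be satisfied; anything else the DLM
 * grants during its own recovery must be released by lock_dlm and
 * reacquired once journal recovery completes. */
static bool may_keep_granted_lock(bool journal_recovery_done, bool is_noexp)
{
	if (journal_recovery_done)
		return true;		/* normal operation: keep as usual */
	return is_noexp;		/* during recovery: NOEXP locks only */
}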

Comment 9 Steve Whitehouse 2010-07-14 12:22:40 UTC
The rule is still the same, and the code in lock_dlm which deals with that is pretty much unchanged in rhel5. We've just fixed a similar (the same?) bug in rhel6, where the problem was that noexp glock requests queued after recovery had started were occasionally getting stuck. This was happening when the reply from the DLM arrived after recovery had started, but before the noexp request had been queued (i.e. there was a race with another lock-acquiring process).

That was relatively easy to fix, since the GLF_FROZEN flag indicates which glocks are frozen with pending replies. It will be more complicated to fix in rhel5, since most of the knowledge regarding frozen locks is within lock_dlm, but the place which needs to be updated in order to check for the particular condition found in rhel6 is in gfs2 (in the glock state machine).

So, assuming for the moment that we have the same problem here, the fix will have to be different, and I suspect more complicated.
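
From memory, the rhel6/upstream check being described sits in the glock queueing path and looks roughly like the fragment below (a sketch, not the exact patch): a NOEXP request that finds the glock frozen with a DLM reply pending un-freezes that reply so recovery is not left waiting behind it.

/* Sketch from memory of the rhel6/upstream-style check in the glock
 * queueing path (gfs2_glock_nq()); not the literal patch text.
 * LM_FLAG_NOEXP, GLF_FROZEN and GLF_REPLY_PENDING are real upstream
 * identifiers. */
if (unlikely((gh->gh_flags & LM_FLAG_NOEXP) &&
	     test_and_clear_bit(GLF_FROZEN, &gl->gl_flags)))
	/* a DLM reply was frozen when recovery started; mark it pending
	 * so the state machine processes it now instead of stalling the
	 * recovery-critical NOEXP request behind it */
	set_bit(GLF_REPLY_PENDING, &gl->gl_flags);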

Comment 10 Steve Whitehouse 2010-12-01 11:51:38 UTC
This might be related to #656032

Comment 11 Steve Whitehouse 2011-01-24 13:08:12 UTC
Bob, I'm wondering whether we should use this bz to apply the transaction lock bit of the #656032 patch. I'd prefer to use the upstream solution, which is the one-liner here:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=846f40455276617275284a4b76b89311b4aed0b9

There is no need to add the flush back in the umount path, because it is already there (just hidden in one of the functions that are already called in that code path).

That way we can get part of the #656032 patch in sooner rather than later, since we know that this does fix a real bug.

Comment 12 Robert Peterson 2011-01-24 14:21:31 UTC
Excellent idea.  Requesting ack flags to get this into a release.

Comment 13 Robert Peterson 2011-01-25 17:06:23 UTC
Created attachment 475208 [details]
Patch to fix the problem

Here is the RHEL5 crosswrite patch.

Comment 14 Robert Peterson 2011-01-25 17:22:52 UTC
The patch was posted to rhkernel-list for inclusion into RHEL5.7.
Changing status to POST.

Comment 15 Robert Peterson 2011-01-25 17:25:26 UTC
Cloned for a RHEL6 crosswrite as bug #672600

Comment 17 Jarod Wilson 2011-02-02 05:35:23 UTC
in kernel-2.6.18-242.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 19 Nate Straz 2011-05-26 15:45:00 UTC
I ran the load for over a week and have not been able to reproduce this recovery error.  Calling it verified on 2.6.18-261.el5.

Comment 20 errata-xmlrpc 2011-07-21 10:29:01 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html