Bug 570263 - GFS2: journal recovery stuck after multiple node failures
Summary: GFS2: journal recovery stuck after multiple node failures
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5
Hardware: All
OS: Linux
low
medium
Target Milestone: rc
: ---
Assignee: Steve Whitehouse
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 553803 672600
TreeView+ depends on / blocked
 
Reported: 2010-03-03 17:26 UTC by Nate Straz
Modified: 2011-01-25 17:19 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-01-20 09:47:18 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Complete logs from morph cluster (26.56 KB, application/x-gzip)
2010-03-03 17:27 UTC, Nate Straz
no flags Details

Description Nate Straz 2010-03-03 17:26:21 UTC
Description of problem:

I was running recovery tests on a three node cluster (morph-0[134]) with three file systems (morph-cluster[012]) when recovery of morph-cluster0 became hung.

Two nodes were targeted to fail, morph-01 via network outage and morph-03 via reboot.  The scenario unfolded like so:

Mar  2 20:47:40 morph-01 qarshd[3247]: Running cmdline: for i in $(find /sys/class/net/* -maxdepth 0 -type d -printf "%f "); do ifdown $i; done 
Mar  2 20:47:40 morph-03 qarshd[3095]: Running cmdline: reboot -fin > /dev/null 2>&1 & 

Mar  2 20:47:50 morph-01 openais[2435]: [TOTEM] entering GATHER state from 12. 
Mar  2 20:47:51 morph-04 openais[2259]: [TOTEM] The token was lost in the OPERATIONAL state. 

Mar  2 20:48:10 morph-01 openais[2435]: [CLM  ] New Configuration: 
Mar  2 20:48:10 morph-01 openais[2435]: [CLM  ] 	r(0) ip(10.15.89.61)  
Mar  2 20:48:10 morph-01 openais[2435]: [CLM  ] 	r(0) ip(10.15.89.64)  
Mar  2 20:48:11 morph-04 openais[2259]: [CLM  ] New Configuration: 
Mar  2 20:48:11 morph-04 openais[2259]: [CLM  ] 	r(0) ip(10.15.89.61)  
Mar  2 20:48:11 morph-04 openais[2259]: [CLM  ] 	r(0) ip(10.15.89.64)  

Mar  2 20:48:11 morph-04 fenced[2278]: fencing deferred to morph-01
Mar  2 20:48:10 morph-01 fenced[2454]: morph-03 not a cluster member after 0 sec post_fail_delay
Mar  2 20:48:10 morph-01 fenced[2454]: fencing node "morph-03"
Mar  2 20:48:16 morph-01 fenced[2454]: fence "morph-03" success

Mar  2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster0.0: jid=1: Trying to acquire journal lock...
Mar  2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster1.0: jid=1: Trying to acquire journal lock...
Mar  2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster2.0: jid=1: Trying to acquire journal lock...
Mar  2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster0.0: jid=1: Busy
Mar  2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster1.0: jid=1: Looking at journal...
Mar  2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster2.0: jid=1: Looking at journal...
Mar  2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster1.0: jid=1: Acquiring the transaction lock...
Mar  2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster1.0: jid=1: Replaying journal...
Mar  2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster1.0: jid=1: Replayed 0 of 0 blocks
Mar  2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster1.0: jid=1: Found 1 revoke tags
Mar  2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster1.0: jid=1: Journal replayed in 1s
Mar  2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster1.0: jid=1: Done
Mar  2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster2.0: jid=1: Acquiring the transaction lock...
Mar  2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster2.0: jid=1: Replaying journal...
Mar  2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster2.0: jid=1: Replayed 0 of 0 blocks
Mar  2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster2.0: jid=1: Found 1 revoke tags
Mar  2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster2.0: jid=1: Journal replayed in 0s
Mar  2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster2.0: jid=1: Done

Mar  2 20:48:17 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster0.2: jid=1: Trying to acquire journal lock...
Mar  2 20:48:17 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster1.2: jid=1: Trying to acquire journal lock...
Mar  2 20:48:17 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster2.2: jid=1: Trying to acquire journal lock...
Mar  2 20:48:17 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster2.2: jid=1: Busy
Mar  2 20:48:17 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster1.2: jid=1: Busy
Mar  2 20:48:17 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster0.2: jid=1: Looking at journal...
Mar  2 20:48:17 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster0.2: jid=1: Acquiring the transaction lock...
Mar  2 20:48:17 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster0.2: jid=1: Replaying journal...
Mar  2 20:48:17 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster0.2: jid=1: Replayed 0 of 0 blocks
Mar  2 20:48:17 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster0.2: jid=1: Found 1 revoke tags
Mar  2 20:48:17 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster0.2: jid=1: Journal replayed in 1s
Mar  2 20:48:17 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster0.2: jid=1: Done

Mar  2 20:48:27 morph-01 openais[2435]: [TOTEM] The network interface is down. 
Mar  2 20:48:27 morph-04 openais[2259]: [TOTEM] The token was lost in the OPERATIONAL state. 
Mar  2 20:48:47 morph-04 openais[2259]: [CLM  ] New Configuration: 
Mar  2 20:48:47 morph-04 openais[2259]: [CLM  ] 	r(0) ip(10.15.89.64)  
Mar  2 20:48:48 morph-04 ccsd[2251]: Cluster is not quorate.  Refusing connection. 

Mar  2 20:49:37 morph-03 syslogd 1.4.1: restart.
Mar  2 20:49:41 morph-03 openais[2277]: [CLM  ] New Configuration: 
Mar  2 20:49:41 morph-03 openais[2277]: [CLM  ] 	r(0) ip(10.15.89.63)  
Mar  2 20:49:41 morph-03 openais[2277]: [CLM  ] 	r(0) ip(10.15.89.64)  
Mar  2 20:49:43 morph-04 openais[2259]: [CLM  ] New Configuration: 
Mar  2 20:49:43 morph-04 openais[2259]: [CLM  ] 	r(0) ip(10.15.89.63)  
Mar  2 20:49:43 morph-04 openais[2259]: [CLM  ] 	r(0) ip(10.15.89.64)  

Mar  2 20:49:47 morph-03 fenced[2296]: fencing deferred to prior member
Mar  2 20:49:43 morph-04 fenced[2278]: morph-01 not a cluster member after 0 sec post_fail_delay
Mar  2 20:49:43 morph-04 fenced[2278]: fencing node "morph-01"
Mar  2 20:49:48 morph-04 fenced[2278]: fence "morph-01" success
Mar  2 20:49:48 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster0.2: jid=0: Trying to acquire journal lock...
Mar  2 20:49:48 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster1.2: jid=0: Trying to acquire journal lock...
Mar  2 20:49:48 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster2.2: jid=0: Trying to acquire journal lock...
Mar  2 20:49:48 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster2.2: jid=0: Looking at journal...
Mar  2 20:49:48 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster0.2: jid=0: Looking at journal...
Mar  2 20:49:48 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster1.2: jid=0: Looking at journal...
Mar  2 20:49:48 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster1.2: jid=0: Acquiring the transaction lock...
Mar  2 20:49:48 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster1.2: jid=0: Replaying journal...
Mar  2 20:49:48 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster1.2: jid=0: Replayed 0 of 0 blocks
Mar  2 20:49:48 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster1.2: jid=0: Found 0 revoke tags
Mar  2 20:49:48 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster1.2: jid=0: Journal replayed in 0s
Mar  2 20:49:48 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster1.2: jid=0: Done
Mar  2 20:49:48 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster2.2: jid=0: Done
Mar  2 20:49:49 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster0.2: jid=0: Acquiring the transaction lock...

Mar  2 20:51:13 morph-01 syslogd 1.4.1: restart.
Mar  2 20:51:15 morph-01 openais[2454]: [CLM  ] New Configuration: 
Mar  2 20:51:15 morph-01 openais[2454]: [CLM  ] 	r(0) ip(10.15.89.61)  
Mar  2 20:51:15 morph-01 openais[2454]: [CLM  ] 	r(0) ip(10.15.89.63)  
Mar  2 20:51:15 morph-01 openais[2454]: [CLM  ] 	r(0) ip(10.15.89.64)  
Mar  2 20:51:14 morph-03 openais[2277]: [CLM  ] New Configuration: 
Mar  2 20:51:14 morph-03 openais[2277]: [CLM  ] 	r(0) ip(10.15.89.61)  
Mar  2 20:51:14 morph-03 openais[2277]: [CLM  ] 	r(0) ip(10.15.89.63)  
Mar  2 20:51:14 morph-03 openais[2277]: [CLM  ] 	r(0) ip(10.15.89.64)  
Mar  2 20:51:16 morph-04 openais[2259]: [CLM  ] New Configuration: 
Mar  2 20:51:16 morph-04 openais[2259]: [CLM  ] 	r(0) ip(10.15.89.61)  
Mar  2 20:51:16 morph-04 openais[2259]: [CLM  ] 	r(0) ip(10.15.89.63)  
Mar  2 20:51:16 morph-04 openais[2259]: [CLM  ] 	r(0) ip(10.15.89.64)  

Mar  2 20:51:16 morph-04 kernel: INFO: task gfs2_quotad:2738 blocked for more than 120 seconds.
Mar  2 20:51:16 morph-04 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar  2 20:51:16 morph-04 kernel: gfs2_quotad   D 00000148  2812  2738     11          2769  2737 (L-TLB)
Mar  2 20:51:16 morph-04 kernel:        f5e65ec0 00000046 37cfd7e2 00000148 00000000 f638ec00 f5e65e80 00000008 
Mar  2 20:51:16 morph-04 kernel:        f7308550 37d06da5 00000148 000095c3 00000001 f730865c c2013744 f7088c80 
Mar  2 20:51:16 morph-04 kernel:        f8e7d21c c20140e4 f62c9d70 00000000 f5142954 00000018 f62c9d50 ffffffff 
Mar  2 20:51:16 morph-04 kernel: Call Trace:
Mar  2 20:51:16 morph-04 kernel:  [<f8e7d21c>] gdlm_bast+0x0/0x78 [lock_dlm]
Mar  2 20:51:16 morph-04 kernel:  [<f8f0210e>] just_schedule+0x5/0x8 [gfs2]
Mar  2 20:51:16 morph-04 kernel:  [<c061cdfd>] __wait_on_bit+0x33/0x58
Mar  2 20:51:16 morph-04 kernel:  [<f8f02109>] just_schedule+0x0/0x8 [gfs2]
Mar  2 20:51:16 morph-04 kernel:  [<f8f02109>] just_schedule+0x0/0x8 [gfs2]
Mar  2 20:51:16 morph-04 kernel:  [<c061ce84>] out_of_line_wait_on_bit+0x62/0x6a
Mar  2 20:51:16 morph-04 kernel:  [<c0435ff4>] wake_bit_function+0x0/0x3c
Mar  2 20:51:16 morph-04 kernel:  [<f8f02102>] gfs2_glock_wait+0x27/0x2e [gfs2]
Mar  2 20:51:16 morph-04 kernel:  [<f8f17784>] gfs2_statfs_sync+0x4d/0x17d [gfs2]
Mar  2 20:51:16 morph-04 kernel:  [<f8f1777d>] gfs2_statfs_sync+0x46/0x17d [gfs2]
Mar  2 20:51:16 morph-04 kernel:  [<f8f117e6>] quotad_check_timeo+0x19/0x4f [gfs2]
Mar  2 20:51:16 morph-04 kernel:  [<f8f1315f>] gfs2_quotad+0xb7/0x1c7 [gfs2]
Mar  2 20:51:16 morph-04 kernel:  [<c0435fc7>] autoremove_wake_function+0x0/0x2d
Mar  2 20:51:16 morph-04 kernel:  [<f8f130a8>] gfs2_quotad+0x0/0x1c7 [gfs2]
Mar  2 20:51:16 morph-04 kernel:  [<c0435f03>] kthread+0xc0/0xed
Mar  2 20:51:16 morph-04 kernel:  [<c0435e43>] kthread+0x0/0xed
Mar  2 20:51:16 morph-04 kernel:  [<c0405c53>] kernel_thread_helper+0x7/0x10
Mar  2 20:51:16 morph-04 kernel:  =======================

Mar  2 20:53:16 morph-04 kernel: INFO: task gfs2_recoverd:2735 blocked for more than 120 seconds.
Mar  2 20:53:16 morph-04 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar  2 20:53:16 morph-04 kernel: gfs2_recoverd D 00000156  2476  2735     11          2737  2701 (L-TLB)
Mar  2 20:53:16 morph-04 kernel:        f5e64e44 00000046 80e0fd68 00000156 c041eb94 00000000 00000000 0000000a 
Mar  2 20:53:16 morph-04 kernel:        f7f5aaa0 80e2a1e2 00000156 0001a47a 00000000 f7f5abac c200c900 f727cac0 
Mar  2 20:53:16 morph-04 kernel:        ffffffa9 00000000 c200d2a0 c2013744 00000020 00000000 c200c944 f5e64e78 
Mar  2 20:53:16 morph-04 kernel: Call Trace:
Mar  2 20:53:16 morph-04 kernel:  [<c041eb94>] __wake_up+0x2a/0x3d
Mar  2 20:53:16 morph-04 kernel:  [<f8f0210e>] just_schedule+0x5/0x8 [gfs2]
Mar  2 20:53:16 morph-04 kernel:  [<c061cdfd>] __wait_on_bit+0x33/0x58
Mar  2 20:53:16 morph-04 kernel:  [<f8f02109>] just_schedule+0x0/0x8 [gfs2]
Mar  2 20:53:16 morph-04 kernel:  [<f8f02109>] just_schedule+0x0/0x8 [gfs2]
Mar  2 20:53:16 morph-04 kernel:  [<c061ce84>] out_of_line_wait_on_bit+0x62/0x6a
Mar  2 20:53:16 morph-04 kernel:  [<c0435ff4>] wake_bit_function+0x0/0x3c
Mar  2 20:53:16 morph-04 kernel:  [<f8f02102>] gfs2_glock_wait+0x27/0x2e [gfs2]
Mar  2 20:53:16 morph-04 kernel:  [<f8f14001>] gfs2_recover_journal+0x189/0x814 [gfs2]
Mar  2 20:53:16 morph-04 kernel:  [<f8f13ffa>] gfs2_recover_journal+0x182/0x814 [gfs2]
Mar  2 20:53:16 morph-04 kernel:  [<f8f13f37>] gfs2_recover_journal+0xbf/0x814 [gfs2]
Mar  2 20:53:16 morph-04 kernel:  [<f8f0357f>] gfs2_glock_nq_num+0x31/0x54 [gfs2]
Mar  2 20:53:16 morph-04 kernel:  [<c042cd69>] process_timeout+0x0/0x5
Mar  2 20:53:16 morph-04 kernel:  [<f8f1468c>] gfs2_recoverd+0x0/0x66 [gfs2]
Mar  2 20:53:16 morph-04 kernel:  [<f8f146aa>] gfs2_recoverd+0x1e/0x66 [gfs2]
Mar  2 20:53:16 morph-04 kernel:  [<c0435f03>] kthread+0xc0/0xed
Mar  2 20:53:16 morph-04 kernel:  [<c0435e43>] kthread+0x0/0xed
Mar  2 20:53:16 morph-04 kernel:  [<c0405c53>] kernel_thread_helper+0x7/0x10
Mar  2 20:53:16 morph-04 kernel:  =======================

Current glock state:

================================================================================
morph-04
GFS2 Glock (1, 2)
G:  s:UN n:1/2 f:l t:SH d:EX/0 l:0 a:0 r:5
 H: s:SH f:epcW e:0 p:2735 [gfs2_recoverd] gfs2_recover_journal+0x182/0x814 [gfs2]
 H: s:SH f:W e:0 p:187 [pdflush] gfs2_do_trans_begin+0xae/0x119 [gfs2]

================================================================================
2 25434
dlm
Resource "       2            635a" master morph-04 1G/0C/0W
Granted Queue
022f0002 NL
No Conversion
None Waiting

morph-04
GFS2 Glock (2, 25434)
G:  s:UN n:2/635a f:l t:EX d:EX/0 l:0 a:0 r:4
 H: s:EX f:cW e:0 p:2738 [gfs2_quotad] gfs2_statfs_sync+0x46/0x17d [gfs2]

================================================================================
2 21493534
dlm
Resource "       2         147f71e" master morph-04 1G/0C/0W
Granted Queue
00eb0002 EX
No Conversion
None Waiting

morph-04
GFS2 Glock (2, 21493534)
G:  s:EX n:2/147f71e f: t:EX d:UN/0 l:0 a:0 r:5
 H: s:EX f:H e:0 p:187 [pdflush] gfs2_write_inode+0x55/0x15a [gfs2]
 H: s:EX f:W e:0 p:2968 [d_doio] gfs2_mmap+0x2e/0x65 [gfs2]
 I: n:21/21493534 t:8 f:0x10 d:0x00000000 s:25300/25300

================================================================================


Version-Release number of selected component (if applicable):
kernel-2.6.18-189.el5
cman-2.0.115-31.el5
gfs2-utils-0.1.62-18.el5

How reproducible:
Unknown

Comment 1 Nate Straz 2010-03-03 17:27:27 UTC
Created attachment 397619 [details]
Complete logs from morph cluster

Comment 2 Steve Whitehouse 2010-04-21 12:49:41 UTC
GFS2 uses the transaction lock for one reason only, and that is suspending the filesystem. So assuming that no suspends were being done at the time, there should only be PR requests going to the DLM, so that no conflicting locks should have been around.

It looks like the lock as been requested with all the right flags, so I think that we are probably waiting on dlm recovery in this case.

Dave, can you spot anything that looks odd in the above logs?

Comment 3 David Teigland 2010-04-21 18:35:20 UTC
We really need more data to figure this out; nothing seems very interesting in /var/log/messages.  It looks like userspace recovery may have not re-enabled dlm/gfs in the kernel, data from group_tool would tell us.  Or, if it's dlm kernel recovery that's not completing, then <dlm log_debug="1"/>, or at least a ps -o pid,stat,cmd,wchan.

Comment 4 Steve Whitehouse 2010-09-22 10:24:08 UTC
Nate, can you reproduce this in order to get the info which Dave is requesting?

Comment 5 Steve Whitehouse 2011-01-18 10:25:15 UTC
This seems to have been stuck for a long time now. I'm going to close this is no more information is available.


Note You need to log in before you can comment on or make changes to this bug.