Description of problem: I was running recovery tests on a three node cluster (morph-0[134]) with three file systems (morph-cluster[012]) when recovery of morph-cluster0 became hung. Two nodes were targeted to fail, morph-01 via network outage and morph-03 via reboot. The scenario unfolded like so: Mar 2 20:47:40 morph-01 qarshd[3247]: Running cmdline: for i in $(find /sys/class/net/* -maxdepth 0 -type d -printf "%f "); do ifdown $i; done Mar 2 20:47:40 morph-03 qarshd[3095]: Running cmdline: reboot -fin > /dev/null 2>&1 & Mar 2 20:47:50 morph-01 openais[2435]: [TOTEM] entering GATHER state from 12. Mar 2 20:47:51 morph-04 openais[2259]: [TOTEM] The token was lost in the OPERATIONAL state. Mar 2 20:48:10 morph-01 openais[2435]: [CLM ] New Configuration: Mar 2 20:48:10 morph-01 openais[2435]: [CLM ] r(0) ip(10.15.89.61) Mar 2 20:48:10 morph-01 openais[2435]: [CLM ] r(0) ip(10.15.89.64) Mar 2 20:48:11 morph-04 openais[2259]: [CLM ] New Configuration: Mar 2 20:48:11 morph-04 openais[2259]: [CLM ] r(0) ip(10.15.89.61) Mar 2 20:48:11 morph-04 openais[2259]: [CLM ] r(0) ip(10.15.89.64) Mar 2 20:48:11 morph-04 fenced[2278]: fencing deferred to morph-01 Mar 2 20:48:10 morph-01 fenced[2454]: morph-03 not a cluster member after 0 sec post_fail_delay Mar 2 20:48:10 morph-01 fenced[2454]: fencing node "morph-03" Mar 2 20:48:16 morph-01 fenced[2454]: fence "morph-03" success Mar 2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster0.0: jid=1: Trying to acquire journal lock... Mar 2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster1.0: jid=1: Trying to acquire journal lock... Mar 2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster2.0: jid=1: Trying to acquire journal lock... Mar 2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster0.0: jid=1: Busy Mar 2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster1.0: jid=1: Looking at journal... Mar 2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster2.0: jid=1: Looking at journal... Mar 2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster1.0: jid=1: Acquiring the transaction lock... Mar 2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster1.0: jid=1: Replaying journal... Mar 2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster1.0: jid=1: Replayed 0 of 0 blocks Mar 2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster1.0: jid=1: Found 1 revoke tags Mar 2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster1.0: jid=1: Journal replayed in 1s Mar 2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster1.0: jid=1: Done Mar 2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster2.0: jid=1: Acquiring the transaction lock... Mar 2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster2.0: jid=1: Replaying journal... Mar 2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster2.0: jid=1: Replayed 0 of 0 blocks Mar 2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster2.0: jid=1: Found 1 revoke tags Mar 2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster2.0: jid=1: Journal replayed in 0s Mar 2 20:48:16 morph-01 kernel: GFS2: fsid=morph-cluster:morph-cluster2.0: jid=1: Done Mar 2 20:48:17 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster0.2: jid=1: Trying to acquire journal lock... Mar 2 20:48:17 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster1.2: jid=1: Trying to acquire journal lock... Mar 2 20:48:17 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster2.2: jid=1: Trying to acquire journal lock... Mar 2 20:48:17 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster2.2: jid=1: Busy Mar 2 20:48:17 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster1.2: jid=1: Busy Mar 2 20:48:17 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster0.2: jid=1: Looking at journal... Mar 2 20:48:17 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster0.2: jid=1: Acquiring the transaction lock... Mar 2 20:48:17 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster0.2: jid=1: Replaying journal... Mar 2 20:48:17 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster0.2: jid=1: Replayed 0 of 0 blocks Mar 2 20:48:17 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster0.2: jid=1: Found 1 revoke tags Mar 2 20:48:17 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster0.2: jid=1: Journal replayed in 1s Mar 2 20:48:17 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster0.2: jid=1: Done Mar 2 20:48:27 morph-01 openais[2435]: [TOTEM] The network interface is down. Mar 2 20:48:27 morph-04 openais[2259]: [TOTEM] The token was lost in the OPERATIONAL state. Mar 2 20:48:47 morph-04 openais[2259]: [CLM ] New Configuration: Mar 2 20:48:47 morph-04 openais[2259]: [CLM ] r(0) ip(10.15.89.64) Mar 2 20:48:48 morph-04 ccsd[2251]: Cluster is not quorate. Refusing connection. Mar 2 20:49:37 morph-03 syslogd 1.4.1: restart. Mar 2 20:49:41 morph-03 openais[2277]: [CLM ] New Configuration: Mar 2 20:49:41 morph-03 openais[2277]: [CLM ] r(0) ip(10.15.89.63) Mar 2 20:49:41 morph-03 openais[2277]: [CLM ] r(0) ip(10.15.89.64) Mar 2 20:49:43 morph-04 openais[2259]: [CLM ] New Configuration: Mar 2 20:49:43 morph-04 openais[2259]: [CLM ] r(0) ip(10.15.89.63) Mar 2 20:49:43 morph-04 openais[2259]: [CLM ] r(0) ip(10.15.89.64) Mar 2 20:49:47 morph-03 fenced[2296]: fencing deferred to prior member Mar 2 20:49:43 morph-04 fenced[2278]: morph-01 not a cluster member after 0 sec post_fail_delay Mar 2 20:49:43 morph-04 fenced[2278]: fencing node "morph-01" Mar 2 20:49:48 morph-04 fenced[2278]: fence "morph-01" success Mar 2 20:49:48 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster0.2: jid=0: Trying to acquire journal lock... Mar 2 20:49:48 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster1.2: jid=0: Trying to acquire journal lock... Mar 2 20:49:48 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster2.2: jid=0: Trying to acquire journal lock... Mar 2 20:49:48 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster2.2: jid=0: Looking at journal... Mar 2 20:49:48 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster0.2: jid=0: Looking at journal... Mar 2 20:49:48 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster1.2: jid=0: Looking at journal... Mar 2 20:49:48 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster1.2: jid=0: Acquiring the transaction lock... Mar 2 20:49:48 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster1.2: jid=0: Replaying journal... Mar 2 20:49:48 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster1.2: jid=0: Replayed 0 of 0 blocks Mar 2 20:49:48 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster1.2: jid=0: Found 0 revoke tags Mar 2 20:49:48 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster1.2: jid=0: Journal replayed in 0s Mar 2 20:49:48 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster1.2: jid=0: Done Mar 2 20:49:48 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster2.2: jid=0: Done Mar 2 20:49:49 morph-04 kernel: GFS2: fsid=morph-cluster:morph-cluster0.2: jid=0: Acquiring the transaction lock... Mar 2 20:51:13 morph-01 syslogd 1.4.1: restart. Mar 2 20:51:15 morph-01 openais[2454]: [CLM ] New Configuration: Mar 2 20:51:15 morph-01 openais[2454]: [CLM ] r(0) ip(10.15.89.61) Mar 2 20:51:15 morph-01 openais[2454]: [CLM ] r(0) ip(10.15.89.63) Mar 2 20:51:15 morph-01 openais[2454]: [CLM ] r(0) ip(10.15.89.64) Mar 2 20:51:14 morph-03 openais[2277]: [CLM ] New Configuration: Mar 2 20:51:14 morph-03 openais[2277]: [CLM ] r(0) ip(10.15.89.61) Mar 2 20:51:14 morph-03 openais[2277]: [CLM ] r(0) ip(10.15.89.63) Mar 2 20:51:14 morph-03 openais[2277]: [CLM ] r(0) ip(10.15.89.64) Mar 2 20:51:16 morph-04 openais[2259]: [CLM ] New Configuration: Mar 2 20:51:16 morph-04 openais[2259]: [CLM ] r(0) ip(10.15.89.61) Mar 2 20:51:16 morph-04 openais[2259]: [CLM ] r(0) ip(10.15.89.63) Mar 2 20:51:16 morph-04 openais[2259]: [CLM ] r(0) ip(10.15.89.64) Mar 2 20:51:16 morph-04 kernel: INFO: task gfs2_quotad:2738 blocked for more than 120 seconds. Mar 2 20:51:16 morph-04 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Mar 2 20:51:16 morph-04 kernel: gfs2_quotad D 00000148 2812 2738 11 2769 2737 (L-TLB) Mar 2 20:51:16 morph-04 kernel: f5e65ec0 00000046 37cfd7e2 00000148 00000000 f638ec00 f5e65e80 00000008 Mar 2 20:51:16 morph-04 kernel: f7308550 37d06da5 00000148 000095c3 00000001 f730865c c2013744 f7088c80 Mar 2 20:51:16 morph-04 kernel: f8e7d21c c20140e4 f62c9d70 00000000 f5142954 00000018 f62c9d50 ffffffff Mar 2 20:51:16 morph-04 kernel: Call Trace: Mar 2 20:51:16 morph-04 kernel: [<f8e7d21c>] gdlm_bast+0x0/0x78 [lock_dlm] Mar 2 20:51:16 morph-04 kernel: [<f8f0210e>] just_schedule+0x5/0x8 [gfs2] Mar 2 20:51:16 morph-04 kernel: [<c061cdfd>] __wait_on_bit+0x33/0x58 Mar 2 20:51:16 morph-04 kernel: [<f8f02109>] just_schedule+0x0/0x8 [gfs2] Mar 2 20:51:16 morph-04 kernel: [<f8f02109>] just_schedule+0x0/0x8 [gfs2] Mar 2 20:51:16 morph-04 kernel: [<c061ce84>] out_of_line_wait_on_bit+0x62/0x6a Mar 2 20:51:16 morph-04 kernel: [<c0435ff4>] wake_bit_function+0x0/0x3c Mar 2 20:51:16 morph-04 kernel: [<f8f02102>] gfs2_glock_wait+0x27/0x2e [gfs2] Mar 2 20:51:16 morph-04 kernel: [<f8f17784>] gfs2_statfs_sync+0x4d/0x17d [gfs2] Mar 2 20:51:16 morph-04 kernel: [<f8f1777d>] gfs2_statfs_sync+0x46/0x17d [gfs2] Mar 2 20:51:16 morph-04 kernel: [<f8f117e6>] quotad_check_timeo+0x19/0x4f [gfs2] Mar 2 20:51:16 morph-04 kernel: [<f8f1315f>] gfs2_quotad+0xb7/0x1c7 [gfs2] Mar 2 20:51:16 morph-04 kernel: [<c0435fc7>] autoremove_wake_function+0x0/0x2d Mar 2 20:51:16 morph-04 kernel: [<f8f130a8>] gfs2_quotad+0x0/0x1c7 [gfs2] Mar 2 20:51:16 morph-04 kernel: [<c0435f03>] kthread+0xc0/0xed Mar 2 20:51:16 morph-04 kernel: [<c0435e43>] kthread+0x0/0xed Mar 2 20:51:16 morph-04 kernel: [<c0405c53>] kernel_thread_helper+0x7/0x10 Mar 2 20:51:16 morph-04 kernel: ======================= Mar 2 20:53:16 morph-04 kernel: INFO: task gfs2_recoverd:2735 blocked for more than 120 seconds. Mar 2 20:53:16 morph-04 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Mar 2 20:53:16 morph-04 kernel: gfs2_recoverd D 00000156 2476 2735 11 2737 2701 (L-TLB) Mar 2 20:53:16 morph-04 kernel: f5e64e44 00000046 80e0fd68 00000156 c041eb94 00000000 00000000 0000000a Mar 2 20:53:16 morph-04 kernel: f7f5aaa0 80e2a1e2 00000156 0001a47a 00000000 f7f5abac c200c900 f727cac0 Mar 2 20:53:16 morph-04 kernel: ffffffa9 00000000 c200d2a0 c2013744 00000020 00000000 c200c944 f5e64e78 Mar 2 20:53:16 morph-04 kernel: Call Trace: Mar 2 20:53:16 morph-04 kernel: [<c041eb94>] __wake_up+0x2a/0x3d Mar 2 20:53:16 morph-04 kernel: [<f8f0210e>] just_schedule+0x5/0x8 [gfs2] Mar 2 20:53:16 morph-04 kernel: [<c061cdfd>] __wait_on_bit+0x33/0x58 Mar 2 20:53:16 morph-04 kernel: [<f8f02109>] just_schedule+0x0/0x8 [gfs2] Mar 2 20:53:16 morph-04 kernel: [<f8f02109>] just_schedule+0x0/0x8 [gfs2] Mar 2 20:53:16 morph-04 kernel: [<c061ce84>] out_of_line_wait_on_bit+0x62/0x6a Mar 2 20:53:16 morph-04 kernel: [<c0435ff4>] wake_bit_function+0x0/0x3c Mar 2 20:53:16 morph-04 kernel: [<f8f02102>] gfs2_glock_wait+0x27/0x2e [gfs2] Mar 2 20:53:16 morph-04 kernel: [<f8f14001>] gfs2_recover_journal+0x189/0x814 [gfs2] Mar 2 20:53:16 morph-04 kernel: [<f8f13ffa>] gfs2_recover_journal+0x182/0x814 [gfs2] Mar 2 20:53:16 morph-04 kernel: [<f8f13f37>] gfs2_recover_journal+0xbf/0x814 [gfs2] Mar 2 20:53:16 morph-04 kernel: [<f8f0357f>] gfs2_glock_nq_num+0x31/0x54 [gfs2] Mar 2 20:53:16 morph-04 kernel: [<c042cd69>] process_timeout+0x0/0x5 Mar 2 20:53:16 morph-04 kernel: [<f8f1468c>] gfs2_recoverd+0x0/0x66 [gfs2] Mar 2 20:53:16 morph-04 kernel: [<f8f146aa>] gfs2_recoverd+0x1e/0x66 [gfs2] Mar 2 20:53:16 morph-04 kernel: [<c0435f03>] kthread+0xc0/0xed Mar 2 20:53:16 morph-04 kernel: [<c0435e43>] kthread+0x0/0xed Mar 2 20:53:16 morph-04 kernel: [<c0405c53>] kernel_thread_helper+0x7/0x10 Mar 2 20:53:16 morph-04 kernel: ======================= Current glock state: ================================================================================ morph-04 GFS2 Glock (1, 2) G: s:UN n:1/2 f:l t:SH d:EX/0 l:0 a:0 r:5 H: s:SH f:epcW e:0 p:2735 [gfs2_recoverd] gfs2_recover_journal+0x182/0x814 [gfs2] H: s:SH f:W e:0 p:187 [pdflush] gfs2_do_trans_begin+0xae/0x119 [gfs2] ================================================================================ 2 25434 dlm Resource " 2 635a" master morph-04 1G/0C/0W Granted Queue 022f0002 NL No Conversion None Waiting morph-04 GFS2 Glock (2, 25434) G: s:UN n:2/635a f:l t:EX d:EX/0 l:0 a:0 r:4 H: s:EX f:cW e:0 p:2738 [gfs2_quotad] gfs2_statfs_sync+0x46/0x17d [gfs2] ================================================================================ 2 21493534 dlm Resource " 2 147f71e" master morph-04 1G/0C/0W Granted Queue 00eb0002 EX No Conversion None Waiting morph-04 GFS2 Glock (2, 21493534) G: s:EX n:2/147f71e f: t:EX d:UN/0 l:0 a:0 r:5 H: s:EX f:H e:0 p:187 [pdflush] gfs2_write_inode+0x55/0x15a [gfs2] H: s:EX f:W e:0 p:2968 [d_doio] gfs2_mmap+0x2e/0x65 [gfs2] I: n:21/21493534 t:8 f:0x10 d:0x00000000 s:25300/25300 ================================================================================ Version-Release number of selected component (if applicable): kernel-2.6.18-189.el5 cman-2.0.115-31.el5 gfs2-utils-0.1.62-18.el5 How reproducible: Unknown
Created attachment 397619 [details] Complete logs from morph cluster
GFS2 uses the transaction lock for one reason only, and that is suspending the filesystem. So assuming that no suspends were being done at the time, there should only be PR requests going to the DLM, so that no conflicting locks should have been around. It looks like the lock as been requested with all the right flags, so I think that we are probably waiting on dlm recovery in this case. Dave, can you spot anything that looks odd in the above logs?
We really need more data to figure this out; nothing seems very interesting in /var/log/messages. It looks like userspace recovery may have not re-enabled dlm/gfs in the kernel, data from group_tool would tell us. Or, if it's dlm kernel recovery that's not completing, then <dlm log_debug="1"/>, or at least a ps -o pid,stat,cmd,wchan.
Nate, can you reproduce this in order to get the info which Dave is requesting?
This seems to have been stuck for a long time now. I'm going to close this is no more information is available.