Description of problem:
From Dave: The problem here appears to be stuck gfs/lock_dlm recovery on morph-02, which is in recover state 2 (morph-03 is in recover state 4, which is complete). To classify this further we'd need output from /proc/cluster/lock_dlm/debug (especially on morph-02) and possibly info on the dlm/lock_dlm kernel threads.

Version-Release number of selected component (if applicable):
DLM <CVS> (built Jan 18 2005 13:36:03) installed

How reproducible:
Sometimes
I hit this while running revolver on the morph cluster. Three of the five nodes were taken down (so quorum was lost) and then brought back up. This caused recovery to get stuck. As a side effect the mounting of the filesystems was hung.

morph-02 and morph-03 were the nodes left up:

[root@morph-02 root]# cat /proc/cluster/services
Service Name GID LID State Code
Fence Domain: "default" 1 2 run - [2 5 1 4 3]
DLM Lock Space: "clvmd" 3 3 run - [2 5 1 4 3]
DLM Lock Space: "gfs0" 4 4 update U-4,1,1 [2 5 1]
DLM Lock Space: "gfs1" 6 6 run - [2 5]
GFS Mount Group: "gfs0" 5 5 recover 2 - [2 5]
GFS Mount Group: "gfs1" 7 7 run - [2 5]

[root@morph-03 root]# cat /proc/cluster/services
Service Name GID LID State Code
Fence Domain: "default" 1 2 run - [2 5 1 4 3]
DLM Lock Space: "clvmd" 3 3 run - [2 5 1 4 3]
DLM Lock Space: "gfs0" 4 4 update U-4,1,1 [2 5 1]
DLM Lock Space: "gfs1" 6 6 run - [2 5]
GFS Mount Group: "gfs0" 5 5 recover 4 - [2 5]
GFS Mount Group: "gfs1" 7 7 run - [2 5]

morph-01, morph-04, and morph-05 were the nodes shot:

[root@morph-01 root]# cat /proc/cluster/services
Service Name GID LID State Code
Fence Domain: "default" 1 2 run - [5 2 1 4 3]
DLM Lock Space: "clvmd" 3 3 run - [5 2 1 4 3]
DLM Lock Space: "gfs0" 4 4 join S-6,20,3 [5 2 1]

[root@morph-04 root]# cat /proc/cluster/services
Service Name GID LID State Code
Fence Domain: "default" 1 2 run - [2 5 1 4 3]
DLM Lock Space: "clvmd" 3 3 run - [2 5 1 4 3]
DLM Lock Space: "gfs1" 6 4 join S-6,20,3 [2 5 4]

[root@morph-05 root]# cat /proc/cluster/services
Service Name GID LID State Code
Fence Domain: "default" 1 2 run - [5 1 4 2 3]
DLM Lock Space: "clvmd" 3 3 run - [5 1 4 2 3]
morph-02:
CMAN: quorum lost, blocking activity
CMAN: quorum regained, resuming activity
GFS: fsid=morph-cluster:gfs1.2: jid=4: Trying to acquire journal lock...
GFS: fsid=morph-cluster:gfs0.2: jid=4: Trying to acquire journal lock...
GFS: fsid=morph-cluster:gfs0.2: jid=4: Looking at journal...
GFS: fsid=morph-cluster:gfs0.2: jid=4: Acquiring the transaction lock...
GFS: fsid=morph-cluster:gfs1.2: jid=4: Looking at journal...
GFS: fsid=morph-cluster:gfs1.2: jid=4: Acquiring the transaction lock...
GFS: fsid=morph-cluster:gfs0.2: jid=4: Replaying journal...
GFS: fsid=morph-cluster:gfs1.2: jid=4: Replaying journal...
GFS: fsid=morph-cluster:gfs1.2: jid=4: Replayed 182 of 182 blocks
GFS: fsid=morph-cluster:gfs1.2: jid=4: replays = 182, skips = 0, sames = 0
GFS: fsid=morph-cluster:gfs1.2: jid=4: Journal replayed in 3s
GFS: fsid=morph-cluster:gfs1.2: jid=4: Done
GFS: fsid=morph-cluster:gfs1.2: jid=3: Trying to acquire journal lock...
GFS: fsid=morph-cluster:gfs1.2: jid=3: Looking at journal...
GFS: fsid=morph-cluster:gfs1.2: jid=3: Done
GFS: fsid=morph-cluster:gfs1.2: jid=1: Trying to acquire journal lock...
GFS: fsid=morph-cluster:gfs1.2: jid=1: Busy
GFS: fsid=morph-cluster:gfs0.2: jid=4: Replayed 2727 of 2870 blocks
GFS: fsid=morph-cluster:gfs0.2: jid=4: replays = 2727, skips = 65, sames = 78
GFS: fsid=morph-cluster:gfs0.2: jid=4: Journal replayed in 17s
GFS: fsid=morph-cluster:gfs0.2: jid=4: Done
GFS: fsid=morph-cluster:gfs0.2: jid=3: Trying to acquire journal lock...
GFS: fsid=morph-cluster:gfs0.2: jid=3: Busy
GFS: fsid=morph-cluster:gfs0.2: jid=1: Trying to acquire journal lock...
GFS: fsid=morph-cluster:gfs0.2: jid=1: Looking at journal...
GFS: fsid=morph-cluster:gfs0.2: jid=1: Acquiring the transaction lock...
lock_dlm: cancel 1,2 flags 400
lock_dlm: cancel 1,2 complete
GFS: fsid=morph-cluster:gfs0.2: jid=1: Replaying journal...
GFS: fsid=morph-cluster:gfs0.2: jid=1: Replayed 1024 of 1025 blocks
GFS: fsid=morph-cluster:gfs0.2: jid=1: replays = 1024, skips = 0, sames = 1
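A quick way to see which journal recovery never finished in a log like the one above is to track the last message per (fsid, jid) pair. The following is only a sketch: the function name `unfinished_journals` is made up for illustration, it assumes one kernel message per line, and it assumes that "Done" and "Busy" are the only messages after which this node is finished with a journal ("Busy" presumably meaning another node holds the journal lock).

```python
import re

# Matches one GFS journal-recovery message, e.g.
# "GFS: fsid=morph-cluster:gfs0.2: jid=1: Replaying journal..."
_JOURNAL_RE = re.compile(r"GFS: fsid=(\S+): jid=(\d+): (.*)")

# Assumption: after either of these messages this node is done with the journal.
_TERMINAL = ("Done", "Busy")


def unfinished_journals(log_text):
    """Return [(fsid, jid, last_message)] for journals with no terminal message."""
    last = {}
    for line in log_text.splitlines():
        m = _JOURNAL_RE.search(line)
        if m:
            fsid, jid, msg = m.group(1), int(m.group(2)), m.group(3).strip()
            last[(fsid, jid)] = msg  # later messages overwrite earlier ones
    return [
        (fsid, jid, msg)
        for (fsid, jid), msg in last.items()
        if not msg.startswith(_TERMINAL)
    ]


# Excerpt of the morph-02 messages from this report.
SAMPLE_LOG = """\
GFS: fsid=morph-cluster:gfs1.2: jid=1: Trying to acquire journal lock...
GFS: fsid=morph-cluster:gfs1.2: jid=1: Busy
GFS: fsid=morph-cluster:gfs0.2: jid=4: Journal replayed in 17s
GFS: fsid=morph-cluster:gfs0.2: jid=4: Done
GFS: fsid=morph-cluster:gfs0.2: jid=1: Replaying journal...
GFS: fsid=morph-cluster:gfs0.2: jid=1: Replayed 1024 of 1025 blocks
GFS: fsid=morph-cluster:gfs0.2: jid=1: replays = 1024, skips = 0, sames = 1
"""

for fsid, jid, msg in unfinished_journals(SAMPLE_LOG):
    print(f"{fsid} jid={jid} stopped after: {msg}")
```

On the excerpt, this points at gfs0 jid=1, which logged its replay counts but never logged "Done".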
I'll try to reproduce this and gather more info.
Removing from the Blocker list; if it is reproducible, it will go back on the list.
What do you know, I reproduced it. Back on the blocker list you go.

5 node cluster (morph-01 - morph-05), all running I/O to 5 GFS. I shoot morph-05, recovery ends up stuck.

All have the same view of the nodes in the cluster:

[root@morph-01 ~]# cat /proc/cluster/nodes
Node Votes Exp Sts Name
1 1 5 M morph-01.lab.msp.redhat.com
2 1 5 M morph-05.lab.msp.redhat.com
3 1 5 M morph-04.lab.msp.redhat.com
4 1 5 M morph-03.lab.msp.redhat.com
5 1 5 M morph-02.lab.msp.redhat.com

Services:

[root@morph-01 ~]# cat /proc/cluster/services
Service Name GID LID State Code
Fence Domain: "default" 1 2 run - [1 5 4 3 2]
DLM Lock Space: "clvmd" 3 4 recover 2 - [1 5 4 3]
DLM Lock Space: "gfs0" 4 5 recover 2 - [1 5 4 3]
DLM Lock Space: "gfs1" 6 7 recover 2 - [1 5 4 3]
DLM Lock Space: "gfs2" 8 9 recover 2 - [1 5 4 3]
DLM Lock Space: "gfs3" 10 11 recover 2 - [1 5 4 3]
DLM Lock Space: "gfs4" 12 13 recover 2 - [1 5 4 3]
GFS Mount Group: "gfs0" 5 6 recover 0 - [1 5 4 3]
GFS Mount Group: "gfs1" 7 8 recover 0 - [1 5 4 3]
GFS Mount Group: "gfs2" 9 10 recover 0 - [1 5 4 3]
GFS Mount Group: "gfs3" 11 12 recover 0 - [1 5 4 3]
GFS Mount Group: "gfs4" 13 14 recover 0 - [1 5 4 3]

[root@morph-02 ~]# cat /proc/cluster/services
Service Name GID LID State Code
Fence Domain: "default" 1 2 run - [1 5 4 3 2]
DLM Lock Space: "clvmd" 3 4 recover 2 - [1 5 4 3]
DLM Lock Space: "gfs0" 4 5 recover 2 - [1 5 4 3]
DLM Lock Space: "gfs1" 6 7 recover 2 - [1 5 4 3]
DLM Lock Space: "gfs2" 8 9 recover 2 - [1 5 4 3]
DLM Lock Space: "gfs3" 10 11 recover 2 - [1 5 4 3]
DLM Lock Space: "gfs4" 12 13 recover 2 - [1 5 4 3]
GFS Mount Group: "gfs0" 5 6 recover 0 - [1 5 4 3]
GFS Mount Group: "gfs1" 7 8 recover 0 - [1 5 4 3]
GFS Mount Group: "gfs2" 9 10 recover 0 - [1 5 4 3]
GFS Mount Group: "gfs3" 11 12 recover 0 - [1 5 4 3]
GFS Mount Group: "gfs4" 13 14 recover 0 - [1 5 4 3]

[root@morph-03 ~]# cat /proc/cluster/services
Service Name GID LID State Code
Fence Domain: "default" 1 2 run - [1 4 5 3 2]
DLM Lock Space: "clvmd" 3 4 recover 2 - [1 4 5 3]
DLM Lock Space: "gfs0" 4 5 recover 2 - [1 4 5 3]
DLM Lock Space: "gfs1" 6 7 recover 2 - [1 4 5 3]
DLM Lock Space: "gfs2" 8 9 recover 2 - [1 4 5 3]
DLM Lock Space: "gfs3" 10 11 recover 2 - [1 4 5 3]
DLM Lock Space: "gfs4" 12 13 recover 2 - [1 4 5 3]
GFS Mount Group: "gfs0" 5 6 recover 0 - [1 4 5 3]
GFS Mount Group: "gfs1" 7 8 recover 0 - [1 4 5 3]
GFS Mount Group: "gfs2" 9 10 recover 0 - [1 4 5 3]
GFS Mount Group: "gfs3" 11 12 recover 0 - [1 4 5 3]
GFS Mount Group: "gfs4" 13 14 recover 0 - [1 4 5 3]

[root@morph-04 ~]# cat /proc/cluster/services
Service Name GID LID State Code
Fence Domain: "default" 1 2 run - [1 3 4 5 2]
DLM Lock Space: "clvmd" 3 4 recover 2 - [1 3 4 5]
DLM Lock Space: "gfs0" 4 5 recover 2 - [1 3 4 5]
DLM Lock Space: "gfs1" 6 7 recover 2 - [1 3 4 5]
DLM Lock Space: "gfs2" 8 9 recover 2 - [1 3 4 5]
DLM Lock Space: "gfs3" 10 11 recover 2 - [1 3 4 5]
DLM Lock Space: "gfs4" 12 13 recover 2 - [1 3 4 5]
GFS Mount Group: "gfs0" 5 6 recover 0 - [1 3 4 5]
GFS Mount Group: "gfs1" 7 8 recover 0 - [1 3 4 5]
GFS Mount Group: "gfs2" 9 10 recover 0 - [1 3 4 5]
GFS Mount Group: "gfs3" 11 12 recover 0 - [1 3 4 5]
GFS Mount Group: "gfs4" 13 14 recover 0 - [1 3 4 5]

[root@morph-05 ~]# cat /proc/cluster/services
Service Name GID LID State Code
Fence Domain: "default" 1 2 run - [5 4 1 3 2]
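With this many service groups per node, it can help to pull out only the ones that are not in the run state. Below is a rough sketch of a parser for the /proc/cluster/services listings shown in this report. The field names (`gid`, `lid`, `state`, `step`, `code`, `members`) are guesses based on the column header, not a documented format, and `parse_services` / `not_running` are hypothetical helper names.

```python
import re
from typing import List, NamedTuple, Optional


class Service(NamedTuple):
    kind: str            # "Fence Domain", "DLM Lock Space" or "GFS Mount Group"
    name: str
    gid: int
    lid: int
    state: str           # e.g. "run", "recover", "update", "join"
    step: Optional[int]  # number following the state, e.g. 2 in "recover 2"
    code: str            # "-" or something like "U-4,1,1"
    members: List[int]   # node IDs listed in brackets


# \s+ is used between fields so the pattern does not care whether the
# bracketed member list is on the same line or on the following one.
_SERVICE_RE = re.compile(
    r'(Fence Domain|DLM Lock Space|GFS Mount Group):\s+"([^"]+)"\s+'
    r'(\d+)\s+(\d+)\s+([a-z]+)(?:\s+(\d+))?\s+(\S+)\s+\[([\d\s]*)\]'
)


def parse_services(text: str) -> List[Service]:
    """Parse every service entry found in the given text."""
    services = []
    for m in _SERVICE_RE.finditer(text):
        kind, name, gid, lid, state, step, code, members = m.groups()
        services.append(Service(
            kind, name, int(gid), int(lid), state,
            int(step) if step is not None else None,
            code, [int(n) for n in members.split()],
        ))
    return services


def not_running(services: List[Service]) -> List[Service]:
    """Keep only the services whose state is something other than "run"."""
    return [s for s in services if s.state != "run"]


# A few entries copied from the morph-01 listing in this comment.
SAMPLE_SERVICES = """\
Service Name GID LID State Code
Fence Domain: "default" 1 2 run - [1 5 4 3 2]
DLM Lock Space: "clvmd" 3 4 recover 2 - [1 5 4 3]
DLM Lock Space: "gfs0" 4 5 recover 2 - [1 5 4 3]
GFS Mount Group: "gfs0" 5 6 recover 0 - [1 5 4 3]
"""

for s in not_running(parse_services(SAMPLE_SERVICES)):
    print(f'{s.kind} "{s.name}": {s.state} {s.step} members={s.members}')
```

Run against each node's listing, this gives a compact per-node summary of which lock spaces and mount groups are still in recover, update, or join.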
Here's the debug info: [root@morph-01 ~]# cat /proc/cluster/lock_dlm/debug 30023 8225 lk 11,30023 id d037c 0,5 4 6982 qc 11,30023 0,5 id d037c sts 0 0 8225 req 7,30023 ex 0-7fffffffffffffff lkf 2000 wait 1 8225 lk 7,30023 id 0 -1,5 2000 8225 lk 11,30023 id d037c 5,0 4 6982 qc 11,30023 5,0 id d037c sts 0 0 6982 qc 7,30023 -1,5 id 17002a sts 0 0 8225 ex plock 0 8108 en punlock 7,46 8108 lk 11,46 id 8003c 0,5 4 6982 qc 11,46 0,5 id 8003c sts 0 0 8108 remove 7,46 8108 un 7,46 1d01a2 5 0 6982 qc 7,46 5,5 id 1d01a2 sts -65538 0 8108 lk 11,46 id 8003c 5,0 4 6982 qc 11,46 5,0 id 8003c sts 0 0 8108 ex punlock 0 8108 en plock 7,46 8108 lk 11,46 id 8003c 0,5 4 6982 qc 11,46 0,5 id 8003c sts 0 0 8108 req 7,46 ex 0-7fffffffffffffff lkf 2000 wait 1 8108 lk 7,46 id 0 -1,5 2000 8108 lk 11,46 id 8003c 5,0 4 6982 qc 11,46 5,0 id 8003c sts 0 0 6982 qc 7,46 -1,5 id 2a02a8 sts 0 0 8108 ex plock 0 8173 en punlock 7,4b 8173 lk 11,4b id 300ee 0,5 4 6982 qc 11,4b 0,5 id 300ee sts 0 0 8173 remove 7,4b 8173 un 7,4b 1301af 5 0 6982 qc 7,4b 5,5 id 1301af sts -65538 0 8173 lk 11,4b id 300ee 5,0 4 6982 qc 11,4b 5,0 id 300ee sts 0 0 8173 ex punlock 0 8173 en plock 7,4d 8222 en punlock 7,20028 8222 lk 11,20028 id a0144 0,5 4 6982 qc 11,20028 0,5 id a0144 sts 0 0 8222 remove 7,20028 8222 un 7,20028 1b0010 5 0 6982 qc 7,20028 5,5 id 1b0010 sts -65538 0 8222 lk 11,20028 id a0144 5,0 4 6982 qc 11,20028 5,0 id a0144 sts 0 0 8221 lk 11,20028 id a0144 0,5 4 6982 qc 11,20028 0,5 id a0144 sts 0 0 8221 req 7,20028 ex 0-7fffffffffffffff lkf 2000 wait 1 8221 lk 7,20028 id 0 -1,5 2000 8176 en plock 7,2002d 8176 lk 11,2002d id 403c0 0,5 4 8221 lk 11,20028 id a0144 5,0 4 6982 qc 11,2002d 0,5 id 403c0 sts 0 0 8222 ex punlock 0 6982 qc 11,20028 5,0 id a0144 sts 0 0 8176 req 7,2002d ex 2c1f80-2db511 lkf 2000 wait 1 8176 lk 7,2002d id 0 -1,5 2000 8222 en plock 7,1002d 8176 lk 11,2002d id 403c0 5,0 4 6982 qc 11,2002d 5,0 id 403c0 sts 0 0 6982 qc 7,2002d -1,5 id 1003f9 sts 0 0 6982 qc 7,20028 -1,5 id 10011e sts 0 0 
8221 ex plock 0 8176 ex plock 0 8102 en punlock 7,20029 8102 lk 11,20029 id b0349 0,5 4 6982 qc 11,20029 0,5 id b0349 sts 0 0 8104 en punlock 7,49 8104 lk 11,49 id 703fb 0,5 4 6982 qc 11,49 0,5 id 703fb sts 0 0 8102 remove 7,20029 8102 un 7,20029 1b026c 5 0 8104 remove 7,49 8104 un 7,49 250200 5 0 6982 qc 7,20029 5,5 id 1b026c sts -65538 0 6982 qc 7,49 5,5 id 250200 sts -65538 0 8102 lk 11,20029 id b0349 5,0 4 8104 lk 11,49 id 703fb 5,0 4 6982 qc 11,20029 5,0 id b0349 sts 0 0 6982 qc 11,49 5,0 id 703fb sts 0 0 8102 ex punlock 0 8172 en punlock 7,20029 8172 lk 11,20029 id 503dd 0,5 4 6982 qc 11,20029 0,5 id 503dd sts 0 0 8102 en plock 7,20029 8172 remove 7,20029 8172 un 7,20029 1200f5 5 0 6982 qc 7,20029 5,5 id 1200f5 sts -65538 0 8102 lk 11,20029 id b0349 0,5 4 8172 lk 11,20029 id 503dd 5,0 4 6982 qc 11,20029 5,0 id 503dd sts 0 0 8104 ex punlock 0 6982 qc 11,20029 0,5 id b0349 sts 0 0 8172 ex punlock 0 8172 en plock 7,20027 8172 lk 11,20027 id 70137 0,5 4 6982 qc 11,20027 0,5 id 70137 sts 0 0 8172 req 7,20027 ex 0-7fffffffffffffff lkf 2000 wait 1 8172 lk 7,20027 id 0 -1,5 2000 8172 lk 11,20027 id 70137 5,0 4 6982 qc 11,20027 5,0 id 70137 sts 0 0 6982 qc 7,20027 -1,5 id 1101da sts 0 0 8172 ex plock 0 8102 req 7,20029 ex 6255-6a01 lkf 2000 wait 1 8102 lk 7,20029 id 0 -1,5 2000 8102 lk 11,20029 id b0349 5,0 4 6982 qc 7,20029 -1,5 id 1e0339 sts 0 0 6982 qc 11,20029 5,0 id b0349 sts 0 0 8102 ex plock 0 8140 en punlock 7,20028 8140 lk 11,20028 id a01a0 0,5 4 6982 qc 11,20028 0,5 id a01a0 sts 0 0 8140 remove 7,20028 8140 un 7,20028 1a0255 5 0 6982 qc 7,20028 5,5 id 1a0255 sts -65538 0 8140 lk 11,20028 id a01a0 5,0 4 6982 qc 11,20028 5,0 id a01a0 sts 0 0 8140 ex punlock 0 8217 en punlock 7,1002d 8174 en punlock 7,4d 8140 en plock 7,20028 8176 en punlock 7,2002d 8105 en punlock 7,50 8190 en punlock 7,4f 8104 en plock 7,49 8225 en punlock 7,30023 8106 en punlock 7,44 8163 en punlock 7,4c 8107 en punlock 7,4f 8221 en punlock 7,20028 8171 en punlock 7,20028 8108 en punlock 
7,46 8172 en punlock 7,20027 8102 en punlock 7,20029 7791 un 2,5002a b0128 5 0 7628 un 2,1002c 40360 5 0 7714 un 2,59 c0183 5 0 7560 un 2,300a3 502b6 5 0 7483 un 2,373 d01ed 5 0 [root@morph-02 ~]# cat /proc/cluster/lock_dlm/debug b id 4035b 5,0 4 6536 qc 11,14af9bb 5,0 id 4035b sts 0 0 7372 en punlock 7,14bf9bb 7372 lk 11,14bf9bb id 40162 0,5 4 6536 qc 11,14bf9bb 0,5 id 40162 sts 0 0 7372 remove 7,14bf9bb 7372 un 7,14bf9bb d039b 5 0 6536 qc 7,14bf9bb 5,5 id d039b sts -65538 0 7372 lk 11,14bf9bb id 40162 5,0 4 6536 qc 11,14bf9bb 5,0 id 40162 sts 0 0 7372 ex punlock 0 6536 qc 7,14af9bb -1,5 id e00cb sts 0 0 7306 en punlock 7,14af9d0 7306 lk 11,14af9d0 id 30071 0,5 4 6536 qc 11,14af9d0 0,5 id 30071 sts 0 0 7306 remove 7,14af9d0 7306 un 7,14af9d0 110045 5 0 6536 qc 7,14af9d0 5,5 id 110045 sts -65538 0 7306 lk 11,14af9d0 id 30071 5,0 4 6536 qc 11,14af9d0 5,0 id 30071 sts 0 0 7306 ex punlock 0 7306 en plock 7,14af9d0 7306 lk 11,14af9d0 id 30071 0,5 4 6536 qc 11,14af9d0 0,5 id 30071 sts 0 0 7306 req 7,14af9d0 ex 0-7fffffffffffffff lkf 2000 wait 1 7306 lk 7,14af9d0 id 0 -1,5 2000 7306 lk 11,14af9d0 id 30071 5,0 4 6536 qc 11,14af9d0 5,0 id 30071 sts 0 0 7268 en plock 7,14af9c2 7268 lk 11,14af9c2 id 502c2 0,5 4 6536 qc 11,14af9c2 0,5 id 502c2 sts 0 0 7268 req 7,14af9c2 ex 0-2e4ca1 lkf 2000 wait 1 7268 lk 7,14af9c2 id 0 -1,5 2000 7268 lk 11,14af9c2 id 502c2 5,0 4 6536 qc 7,14af9c2 -1,5 id 11026e sts 0 0 6536 qc 11,14af9c2 5,0 id 502c2 sts 0 0 7268 ex plock 0 7398 en punlock 7,14cf9b0 6536 qc 7,14af9d0 -1,5 id 140347 sts 0 0 7398 lk 11,14cf9b0 id 10135 0,5 4 7306 ex plock 0 6536 qc 11,14cf9b0 0,5 id 10135 sts 0 0 7398 remove 7,14cf9b0 7398 un 7,14cf9b0 90289 5 0 6536 qc 7,14cf9b0 5,5 id 90289 sts -65538 0 7398 lk 11,14cf9b0 id 10135 5,0 4 6536 qc 11,14cf9b0 5,0 id 10135 sts 0 0 7398 ex punlock 0 7398 en plock 7,14cf9b1 7400 lk 11,14cf9b0 id 10135 0,5 4 6536 qc 11,14cf9b0 0,5 id 10135 sts 0 0 7400 req 7,14cf9b0 ex 0-7fffffffffffffff lkf 2000 wait 1 7400 lk 7,14cf9b0 id 0 -1,5 
2000 7400 lk 11,14cf9b0 id 10135 5,0 4 6536 qc 11,14cf9b0 5,0 id 10135 sts 0 0 7332 en punlock 7,14af9c6 7332 lk 11,14af9c6 id 102a8 0,5 4 6536 qc 11,14af9c6 0,5 id 102a8 sts 0 0 7332 remove 7,14af9c6 7332 un 7,14af9c6 c02dd 5 0 6536 qc 7,14af9c6 5,5 id c02dd sts -65538 0 7332 lk 11,14af9c6 id 102a8 5,0 4 6536 qc 11,14af9c6 5,0 id 102a8 sts 0 0 7332 ex punlock 0 7332 en plock 7,14af9c5 7332 lk 11,14af9c5 id 30124 0,5 4 6536 qc 11,14af9c5 0,5 id 30124 sts 0 0 7332 req 7,14af9c5 ex 0-7fffffffffffffff lkf 2000 wait 1 7332 lk 7,14af9c5 id 0 -1,5 2000 7332 lk 11,14af9c5 id 30124 5,0 4 6536 qc 11,14af9c5 5,0 id 30124 sts 0 0 6536 qc 7,14af9c5 -1,5 id d0232 sts 0 0 7332 ex plock 0 7305 en punlock 7,14af9d9 7305 lk 11,14af9d9 id 50025 0,5 4 6536 qc 11,14af9d9 0,5 id 50025 sts 0 0 7305 remove 7,14af9d9 7305 un 7,14af9d9 a01e3 5 0 6536 qc 7,14af9d9 5,5 id a01e3 sts -65538 0 7305 lk 11,14af9d9 id 50025 5,0 4 6536 qc 11,14af9d9 5,0 id 50025 sts 0 0 7303 lk 11,14af9d9 id 50025 0,5 4 6536 qc 11,14af9d9 0,5 id 50025 sts 0 0 7303 req 7,14af9d9 ex 0-7fffffffffffffff lkf 2000 wait 1 7303 lk 7,14af9d9 id 0 -1,5 2000 7303 lk 11,14af9d9 id 50025 5,0 4 6536 qc 11,14af9d9 5,0 id 50025 sts 0 0 7305 ex punlock 0 7305 en plock 7,14af9cb 7305 lk 11,14af9cb id 20245 0,5 4 6536 qc 11,14af9cb 0,5 id 20245 sts 0 0 7305 req 7,14af9cb ex 0-7fffffffffffffff lkf 2000 wait 1 7305 lk 7,14af9cb id 0 -1,5 2000 7305 lk 11,14af9cb id 20245 5,0 4 6536 qc 11,14af9cb 5,0 id 20245 sts 0 0 6536 qc 7,14af9d9 -1,5 id f00ef sts 0 0 7303 ex plock 0 7372 en plock 7,14bf9bb 7372 lk 11,14bf9bb id 40162 0,5 4 6536 qc 11,14bf9bb 0,5 id 40162 sts 0 0 7359 ex plock 0 7372 req 7,14bf9bb ex 1ee13b-2be9a5 lkf 2000 wait 1 7372 lk 7,14bf9bb id 0 -1,5 2000 7372 lk 11,14bf9bb id 40162 5,0 4 6536 qc 11,14bf9bb 5,0 id 40162 sts 0 0 7339 en punlock 7,14cf9b6 7266 en punlock 7,14af9c1 7310 en punlock 7,14af9d4 7338 en punlock 7,14cf9b5 7359 en punlock 7,14af9bb 7329 en punlock 7,14af9bb 7403 en punlock 7,14cf9b9 7394 en punlock 
7,14af9bf 7268 en punlock 7,14af9c2 7332 en punlock 7,14af9c5 7304 en punlock 7,14af9c7 7399 en punlock 7,14cf9b1 7341 en punlock 7,14cf9b0 7306 en punlock 7,14af9d0 7303 en punlock 7,14af9d9 6781 un 2,14af9e1 10194 5 0 6713 un 2,14af9e1 40334 5 0 6858 un 2,14af9bf 3012c 5 0 7012 un 2,14af9bf 10391 5 0 6935 un 2,14af9bd 40190 5 0 [root@morph-03 ~]# cat /proc/cluster/lock_dlm/debug req 7,296f33e ex 6d77-7056 lkf 2000 wait 1 7262 lk 7,296f33e id 0 -1,5 2000 7262 lk 11,296f33e id 200e4 5,0 4 6527 qc 7,296f33e -1,5 id 1000fe sts 0 0 6527 qc 11,296f33e 5,0 id 200e4 sts 0 0 7262 ex plock 0 7309 en punlock 7,296f347 7309 lk 11,296f347 id 20190 0,5 4 7296 en punlock 7,295f344 6527 qc 11,296f347 0,5 id 20190 sts 0 0 7296 lk 11,295f344 id 202f6 0,5 4 6527 qc 11,295f344 0,5 id 202f6 sts 0 0 7309 remove 7,296f347 7309 un 7,296f347 120379 5 0 7296 remove 7,295f344 7296 un 7,295f344 13003f 5 0 6527 qc 7,296f347 5,5 id 120379 sts -65538 0 6527 qc 7,295f344 5,5 id 13003f sts -65538 0 7296 lk 11,295f344 id 202f6 5,0 4 7309 lk 11,296f347 id 20190 5,0 4 6527 qc 11,295f344 5,0 id 202f6 sts 0 0 6527 qc 11,296f347 5,0 id 20190 sts 0 0 7296 ex punlock 0 7309 ex punlock 0 7296 en plock 7,295f344 7296 lk 11,295f344 id 202f6 0,5 4 6527 qc 11,295f344 0,5 id 202f6 sts 0 0 7296 req 7,295f344 ex 2ea870-2ecd2e lkf 2000 wait 1 7296 lk 7,295f344 id 0 -1,5 2000 7296 lk 11,295f344 id 202f6 5,0 4 6527 qc 11,295f344 5,0 id 202f6 sts 0 0 7309 en plock 7,296f347 7309 lk 11,296f347 id 20190 0,5 4 6527 qc 11,296f347 0,5 id 20190 sts 0 0 7309 req 7,296f347 ex 68c7-7522 lkf 2000 wait 1 7309 lk 7,296f347 id 0 -1,5 2000 7309 lk 11,296f347 id 20190 5,0 4 6527 qc 11,296f347 5,0 id 20190 sts 0 0 7332 en punlock 7,297f33d 7332 lk 11,297f33d id 900be 0,5 4 6527 qc 11,297f33d 0,5 id 900be sts 0 0 7332 remove 7,297f33d 7332 un 7,297f33d 1002a3 5 0 6527 qc 7,297f33d 5,5 id 1002a3 sts -65538 0 7332 lk 11,297f33d id 900be 5,0 4 6527 qc 11,297f33d 5,0 id 900be sts 0 0 7332 ex punlock 0 7332 en plock 7,297f33d 7332 lk 
11,297f33d id 900be 0,5 4 6527 qc 11,297f33d 0,5 id 900be sts 0 0 7332 req 7,297f33d ex 4f22-634a lkf 2000 wait 1 7332 lk 7,297f33d id 0 -1,5 2000 7332 lk 11,297f33d id 900be 5,0 4 6527 qc 7,297f33d -1,5 id 601a7 sts 0 0 6527 qc 11,297f33d 5,0 id 900be sts 0 0 7332 ex plock 0 7383 en punlock 7,295f356 7383 lk 11,295f356 id 203f0 0,5 4 6527 qc 11,295f356 0,5 id 203f0 sts 0 0 7383 remove 7,295f356 7383 un 7,295f356 b02cb 5 0 6527 qc 7,295f356 5,5 id b02cb sts -65538 0 7383 lk 11,295f356 id 203f0 5,0 4 6527 qc 11,295f356 5,0 id 203f0 sts 0 0 7383 ex punlock 0 7383 en plock 7,295f356 7383 lk 11,295f356 id 203f0 0,5 4 6527 qc 11,295f356 0,5 id 203f0 sts 0 0 7383 req 7,295f356 ex 0-4909 lkf 2000 wait 1 7383 lk 7,295f356 id 0 -1,5 2000 7383 lk 11,295f356 id 203f0 5,0 4 6527 qc 11,295f356 5,0 id 203f0 sts 0 0 7358 en punlock 7,295f345 7358 lk 11,295f345 id 40098 0,5 4 6527 qc 11,295f345 0,5 id 40098 sts 0 0 7358 remove 7,295f345 7358 un 7,295f345 100135 5 0 6527 qc 7,295f345 5,5 id 100135 sts -65538 0 7358 lk 11,295f345 id 40098 5,0 4 6527 qc 11,295f345 5,0 id 40098 sts 0 0 7358 ex punlock 0 7359 lk 11,295f345 id 40098 0,5 4 7358 en plock 7,296f351 6527 qc 11,295f345 0,5 id 40098 sts 0 0 7359 req 7,295f345 ex 0-7fffffffffffffff lkf 2000 wait 1 7359 lk 7,295f345 id 0 -1,5 2000 7359 lk 11,295f345 id 40098 5,0 4 6527 qc 11,295f345 5,0 id 40098 sts 0 0 6527 qc 7,295f356 -1,5 id 150113 sts 0 0 7383 ex plock 0 6527 qc 7,295f344 -1,5 id 120036 sts 0 0 6527 qc 7,296f347 -1,5 id 1201b2 sts 0 0 6527 qc 7,295f345 -1,5 id f0262 sts 0 0 7296 ex plock 0 7359 ex plock 0 7309 ex plock 0 7296 en punlock 7,295f344 7296 lk 11,295f344 id 202f6 0,5 4 6527 qc 11,295f344 0,5 id 202f6 sts 0 0 7296 remove 7,295f344 7296 un 7,295f344 120036 5 0 6527 qc 7,295f344 5,5 id 120036 sts -65538 0 7296 lk 11,295f344 id 202f6 5,0 4 6527 qc 11,295f344 5,0 id 202f6 sts 0 0 7296 ex punlock 0 7296 en plock 7,295f344 7296 lk 11,295f344 id 202f6 0,5 4 6527 qc 11,295f344 0,5 id 202f6 sts 0 0 7296 req 7,295f344 ex 
2ed993-2edf4b lkf 2000 wait 1 7296 lk 7,295f344 id 0 -1,5 2000 7296 lk 11,295f344 id 202f6 5,0 4 6527 qc 11,295f344 5,0 id 202f6 sts 0 0 7257 en punlock 7,295f344 7332 en punlock 7,297f33d 7383 en punlock 7,295f356 7361 en punlock 7,296f351 7309 en punlock 7,296f347 7262 en punlock 7,296f33e 7350 en punlock 7,295f344 7359 en punlock 7,295f345 6781 un 2,295f34a 4024f 5 0 6704 un 2,297f349 20075 5 0 6858 un 2,295f34d 8037a 5 0 7012 un 2,295f347 200c8 5 0 6935 un 2,295f348 302df 5 0 [root@morph-04 ~]# cat /proc/cluster/lock_dlm/debug cfa 0,5 id 100ba sts 0 0 7252 lk 11,3e18cf9 id 4011c 0,5 4 7251 remove 7,3e18cfa 7251 un 7,3e18cfa f03b6 5 0 6520 qc 11,3e18cf9 0,5 id 4011c sts 0 0 6520 qc 7,3e18cfa 5,5 id f03b6 sts -65538 0 7252 req 7,3e18cf9 ex 0-5c94 lkf 2000 wait 1 7252 lk 7,3e18cf9 id 0 -1,5 2000 7251 lk 11,3e18cfa id 100ba 5,0 4 6520 qc 11,3e18cfa 5,0 id 100ba sts 0 0 7251 ex punlock 0 7252 lk 11,3e18cf9 id 4011c 5,0 4 6520 qc 7,3e18cf9 -1,5 id 1000b9 sts 0 0 6520 qc 11,3e18cf9 5,0 id 4011c sts 0 0 7252 ex plock 0 7251 en plock 7,3e18cfa 7251 lk 11,3e18cfa id 100ba 0,5 4 6520 qc 11,3e18cfa 0,5 id 100ba sts 0 0 7251 req 7,3e18cfa ex 2ed19c-2ed6a4 lkf 2000 wait 1 7251 lk 7,3e18cfa id 0 -1,5 2000 7251 lk 11,3e18cfa id 100ba 5,0 4 6520 qc 11,3e18cfa 5,0 id 100ba sts 0 0 7354 en punlock 7,3e38cf1 7356 en punlock 7,3e18d05 7354 lk 11,3e38cf1 id 300bf 0,5 4 7356 lk 11,3e18d05 id 303c1 0,5 4 6520 qc 11,3e38cf1 0,5 id 300bf sts 0 0 6520 qc 11,3e18d05 0,5 id 303c1 sts 0 0 7354 remove 7,3e38cf1 7354 un 7,3e38cf1 d020a 5 0 7356 remove 7,3e18d05 7356 un 7,3e18d05 e02c4 5 0 6520 qc 7,3e38cf1 5,5 id d020a sts -65538 0 6520 qc 7,3e18d05 5,5 id e02c4 sts -65538 0 7354 lk 11,3e38cf1 id 300bf 5,0 4 7356 lk 11,3e18d05 id 303c1 5,0 4 6520 qc 11,3e38cf1 5,0 id 300bf sts 0 0 6520 qc 11,3e18d05 5,0 id 303c1 sts 0 0 7356 ex punlock 0 7354 ex punlock 0 7356 en plock 7,3e18d05 7356 lk 11,3e18d05 id 303c1 0,5 4 6520 qc 11,3e18d05 0,5 id 303c1 sts 0 0 7356 req 7,3e18d05 ex 6251-71e5 lkf 2000 
wait 1 7354 en plock 7,3e38cf1 7356 lk 7,3e18d05 id 0 -1,5 2000 7354 lk 11,3e38cf1 id 300bf 0,5 4 6520 qc 11,3e38cf1 0,5 id 300bf sts 0 0 7354 req 7,3e38cf1 ex 2ed260-2edb1d lkf 2000 wait 1 7356 lk 11,3e18d05 id 303c1 5,0 4 7354 lk 7,3e38cf1 id 0 -1,5 2000 6520 qc 11,3e18d05 5,0 id 303c1 sts 0 0 7354 lk 11,3e38cf1 id 300bf 5,0 4 6520 qc 11,3e38cf1 5,0 id 300bf sts 0 0 6520 qc 7,3e18cfa -1,5 id 1402e6 sts 0 0 7251 ex plock 0 6520 qc 7,3e38cf1 -1,5 id c0012 sts 0 0 6520 qc 7,3e18d05 -1,5 id c022c sts 0 0 7354 ex plock 0 7356 ex plock 0 7354 en punlock 7,3e38cf1 7354 lk 11,3e38cf1 id 300bf 0,5 4 6520 qc 11,3e38cf1 0,5 id 300bf sts 0 0 7354 remove 7,3e38cf1 7354 un 7,3e38cf1 c0012 5 0 7297 en punlock 7,3e18cfa 6520 qc 7,3e38cf1 5,5 id c0012 sts -65538 0 7297 lk 11,3e18cfa id 100fd 0,5 4 6520 qc 11,3e18cfa 0,5 id 100fd sts 0 0 7287 en punlock 7,3e18cf9 7354 lk 11,3e38cf1 id 300bf 5,0 4 6520 qc 11,3e38cf1 5,0 id 300bf sts 0 0 7354 ex punlock 0 7287 lk 11,3e18cf9 id 3006b 0,5 4 7297 remove 7,3e18cfa 6520 qc 11,3e18cf9 0,5 id 3006b sts 0 0 7297 un 7,3e18cfa 15033d 5 0 7287 remove 7,3e18cf9 7287 un 7,3e18cf9 e01d8 5 0 6520 qc 7,3e18cfa 5,5 id 15033d sts -65538 0 6520 qc 7,3e18cf9 5,5 id e01d8 sts -65538 0 7297 lk 11,3e18cfa id 100fd 5,0 4 6520 qc 11,3e18cfa 5,0 id 100fd sts 0 0 7287 lk 11,3e18cf9 id 3006b 5,0 4 7354 en plock 7,3e38cf1 7297 ex punlock 0 7354 lk 11,3e38cf1 id 300bf 0,5 4 6520 qc 11,3e18cf9 5,0 id 3006b sts 0 0 6520 qc 11,3e38cf1 0,5 id 300bf sts 0 0 7297 en plock 7,3e18cfa 7354 req 7,3e38cf1 ex 2edb1e-2edf69 lkf 2000 wait 1 7297 lk 11,3e18cfa id 100fd 0,5 4 7354 lk 7,3e38cf1 id 0 -1,5 2000 7287 ex punlock 0 6520 qc 11,3e18cfa 0,5 id 100fd sts 0 0 7354 lk 11,3e38cf1 id 300bf 5,0 4 6520 qc 11,3e38cf1 5,0 id 300bf sts 0 0 7297 req 7,3e18cfa ex 2edd49-2edfa1 lkf 2000 wait 1 7287 en plock 7,3e18cf9 7297 lk 7,3e18cfa id 0 -1,5 2000 7287 lk 11,3e18cf9 id 3006b 0,5 4 6520 qc 11,3e18cf9 0,5 id 3006b sts 0 0 7297 lk 11,3e18cfa id 100fd 5,0 4 7287 req 7,3e18cf9 ex 
72a5-74cc lkf 2000 wait 1 7287 lk 7,3e18cf9 id 0 -1,5 2000 6520 qc 11,3e18cfa 5,0 id 100fd sts 0 0 7287 lk 11,3e18cf9 id 3006b 5,0 4 6520 qc 7,3e18cf9 -1,5 id 130312 sts 0 0 6520 qc 11,3e18cf9 5,0 id 3006b sts 0 0 7287 ex plock 0 6520 qc 7,3e38cf1 -1,5 id 110217 sts 0 0 7354 ex plock 0 6520 qc 7,3e18cfa -1,5 id 1c01dd sts 0 0 7297 ex plock 0 7252 en punlock 7,3e18cf9 7251 en punlock 7,3e18cfa 7356 en punlock 7,3e18d05 7354 en punlock 7,3e38cf1 7287 en punlock 7,3e18cf9 7297 en punlock 7,3e18cfa 6774 un 2,3e18d1a 1030a 5 0 6697 un 2,3e18d0e 40283 5 0 7005 un 2,3e18cfb 301f3 5 0 6928 un 2,3e18cfe 40281 5 0 6851 un 2,3e18d0a 50194 5 0
The original report shows stuck gfs/lock_dlm recovery (state recover 2). Comment 5 shows stuck dlm recovery where /proc/cluster/dlm_debug might help. "ps -e -o pid,wchan=WIDE-WCHAN-COLUMN -o cmd" might also be useful.
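For the ps output requested above, something like the following could narrow it down to the dlm/gfs/lock_dlm kernel threads and show where each one is sleeping. This is only a sketch: it assumes the three-column layout produced by that exact ps command (pid, wchan, cmd), and `cluster_threads` and `group_by_wchan` are hypothetical helper names. The sample lines use thread names and wait channels of the kind this command prints on these nodes.

```python
import re
from collections import defaultdict

# Kernel threads show up in ps as "[name]"; keep only the cluster-related ones.
_THREAD_RE = re.compile(r"^\[((?:dlm_|gfs_|lock_dlm)\S*)\]$")


def cluster_threads(ps_text):
    """Return [(pid, wchan, thread_name)] for dlm/gfs/lock_dlm kernel threads."""
    found = []
    for line in ps_text.splitlines():
        parts = line.split(None, 2)  # pid, wchan, rest of the command line
        if len(parts) != 3 or not parts[0].isdigit():
            continue  # header or malformed line
        pid, wchan, cmd = parts
        m = _THREAD_RE.match(cmd.strip())
        if m:
            found.append((int(pid), wchan, m.group(1)))
    return found


def group_by_wchan(threads):
    """Map (thread_name, wchan) -> list of pids, to spot unusual wait channels."""
    groups = defaultdict(list)
    for pid, wchan, name in threads:
        groups[(name, wchan)].append(pid)
    return dict(groups)


SAMPLE_PS = """\
  PID WIDE-WCHAN-COLUMN CMD
 6622 dlm_recoverd      [dlm_recoverd]
 6815 dlm_async         [lock_dlm1]
 6819 wait_on_buffer    [gfs_recoverd]
 6821 glock_wait_intern [gfs_quotad]
 6887 wait_on_buffer    [gfs_recoverd]
 6964 -                 [gfs_recoverd]
 7610 wait_on_buffer    growfiles -i 0 -N 500 -n 4 -b
"""

for (name, wchan), pids in sorted(group_by_wchan(cluster_threads(SAMPLE_PS)).items()):
    print(f"{name:<14} {wchan:<18} {pids}")
```

Grouping by (thread, wchan) makes it easy to see, for example, which gfs_recoverd threads are blocked in wait_on_buffer while others show no wait channel.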
Hit this again on a 5 node cluster, 7 gfs. Shot 4 nodes, only one left up was morph-02. Here's all the info you asked for: [root@morph-02 ~]# cat /proc/cluster/services Service Name GID LID State Code Fence Domain: "default" 1 2 run - [3 2 1 4 5] DLM Lock Space: "clvmd" 3 4 run - [3] DLM Lock Space: "gfs0" 4 5 run - [3] DLM Lock Space: "gfs1" 6 7 run - [3] DLM Lock Space: "gfs2" 8 9 run - [3] DLM Lock Space: "gfs3" 10 11 run - [3] DLM Lock Space: "gfs4" 12 13 run - [3] DLM Lock Space: "gfs5" 14 15 run - [3] DLM Lock Space: "gfs6" 16 17 run - [3] GFS Mount Group: "gfs0" 5 6 recover 2 - [3] GFS Mount Group: "gfs1" 7 8 recover 2 - [3] GFS Mount Group: "gfs2" 9 10 run - [3] GFS Mount Group: "gfs3" 11 12 run - [3] GFS Mount Group: "gfs5" 15 16 run - [3] GFS Mount Group: "gfs6" 17 18 run - [3] [root@morph-02 ~]# cat /proc/cluster/lock_dlm/debug b7e7 0,5 id 10225 sts 0 0 7794 req 7,ebb7e7 ex 0-7fffffffffffffff lkf 2000 wait 1 7794 lk 7,ebb7e7 id 0 -1,5 2000 7794 lk 11,ebb7e7 id 10225 5,0 4 6619 qc 7,ebb7e7 -1,5 id e102c8 sts 0 0 6619 qc 11,ebb7e7 5,0 id 10225 sts 0 0 7794 ex plock 0 7620 en punlock 7,ebb7e4 7811 en punlock 7,ebb7ee 7620 lk 11,ebb7e4 id 101af 0,5 4 7811 lk 11,ebb7ee id 201d7 0,5 4 6619 qc 11,ebb7e4 0,5 id 101af sts 0 0 6619 qc 11,ebb7ee 0,5 id 201d7 sts 0 0 7620 remove 7,ebb7e4 7620 un 7,ebb7e4 113013c 5 0 6619 qc 7,ebb7e4 5,5 id 113013c sts -65538 0 7811 remove 7,ebb7ee 7811 un 7,ebb7ee f603cf 5 0 7620 lk 11,ebb7e4 id 101af 5,0 4 6619 qc 7,ebb7ee 5,5 id f603cf sts -65538 0 6619 qc 11,ebb7e4 5,0 id 101af sts 0 0 7620 ex punlock 0 7620 en plock 7,ebb7e4 7620 lk 11,ebb7e4 id 101af 0,5 4 6619 qc 11,ebb7e4 0,5 id 101af sts 0 0 7620 req 7,ebb7e4 ex 0-7fffffffffffffff lkf 2000 wait 1 7620 lk 7,ebb7e4 id 0 -1,5 2000 7620 lk 11,ebb7e4 id 101af 5,0 4 6619 qc 7,ebb7e4 -1,5 id 125019e sts 0 0 6619 qc 11,ebb7e4 5,0 id 101af sts 0 0 7620 ex plock 0 7811 lk 11,ebb7ee id 201d7 5,0 4 6619 qc 11,ebb7ee 5,0 id 201d7 sts 0 0 7811 ex punlock 0 7811 en plock 7,ebb7ee 7811 lk 
11,ebb7ee id 201d7 0,5 4 6619 qc 11,ebb7ee 0,5 id 201d7 sts 0 0 7811 req 7,ebb7ee ex 2e1138-2e230a lkf 2000 wait 1 7811 lk 7,ebb7ee id 0 -1,5 2000 7811 lk 11,ebb7ee id 201d7 5,0 4 6619 qc 7,ebb7ee -1,5 id f702f2 sts 0 0 6619 qc 11,ebb7ee 5,0 id 201d7 sts 0 0 7811 ex plock 0 7641 en punlock 7,ebbc52 7641 lk 11,ebbc52 id 103c1 0,5 4 6619 qc 11,ebbc52 0,5 id 103c1 sts 0 0 7641 remove 7,ebbc52 7641 un 7,ebbc52 f20135 5 0 6619 qc 7,ebbc52 5,5 id f20135 sts -65538 0 7641 lk 11,ebbc52 id 103c1 5,0 4 6619 qc 11,ebbc52 5,0 id 103c1 sts 0 0 7641 ex punlock 0 7641 en plock 7,ebbc5f 7641 lk 11,ebbc5f id 102a4 0,5 4 6619 qc 11,ebbc5f 0,5 id 102a4 sts 0 0 7641 req 7,ebbc5f ex 0-7fffffffffffffff lkf 2000 wait 1 7641 lk 7,ebbc5f id 0 -1,5 2000 7794 en punlock 7,ebb7e7 7794 lk 11,ebb7e7 id 10225 0,5 4 7641 lk 11,ebbc5f id 102a4 5,0 4 6619 qc 7,ebbc5f -1,5 id f401f1 sts 0 0 6619 qc 11,ebb7e7 0,5 id 10225 sk 7,edb746 7746 lk 11,edb746 id 2006d 0,5 4 6619 qc 11,edb746 0,5 id 2006d sts 0 0 7621 en punlock 7,ebb7e2 7746 remove 7,edb746 7746 un 7,edb746 112032b 5 0 6619 qc 7,edb746 5,5 id 112032b sts -65538 0 7621 lk 11,ebb7e2 id 2003a 0,5 4 7746 lk 11,edb746 id 2006d 5,0 4 6619 qc 11,ebb7e2 0,5 id 2003a sts 0 0 6619 qc 11,edb746 5,0 id 2006d sts 0 0 7746 ex punlock 0 7746 en plock 7,edb74e 7621 remove 7,ebb7e2 7621 un 7,ebb7e2 11202bd 5 0 6619 qc 7,ebb7e2 5,5 id 11202bd sts -65538 0 7621 lk 11,ebb7e2 id 2003a 5,0 4 6619 qc 11,ebb7e2 5,0 id 2003a sts 0 0 7621 ex punlock 0 7621 en plock 7,ebb7e5 7621 lk 11,ebb7e5 id 10097 0,5 4 6619 qc 11,ebb7e5 0,5 id 10097 sts 0 0 7621 req 7,ebb7e5 ex 0-7fffffffffffffff lkf 2000 wait 1 7621 lk 7,ebb7e5 id 0 -1,5 2000 7621 lk 11,ebb7e5 id 10097 5,0 4 6619 qc 7,ebb7e5 -1,5 id 11e00d6 sts 0 0 6619 qc 11,ebb7e5 5,0 id 10097 sts 0 0 7621 ex plock 0 7622 lk 11,ebb7e2 id 2003a 0,5 4 6619 qc 11,ebb7e2 0,5 id 2003a sts 0 0 7622 req 7,ebb7e2 ex 0-7fffffffffffffff lkf 2000 wait 1 7622 lk 7,e22 lk 11,ebb7e2 id 2003a 5,0 4 6619 qc 7,ebb7e2 -1,5 id 11202dc sts 0 0 
6619 qc 11,ebb7e2 5,0 id 2003a sts 0 0 7622 ex plock 0 7639 en punlock 7,ebbc5e 7639 lk 11,ebbc5e id 302a7 0,5 4 6619 qc 11,ebbc5e 0,5 id 302a7 sts 0 0 7639 remove 7,ebbc5e 7639 un 7,ebbc5e e70074 5 0 6619 qc 7,ebbc5e 5,5 id e70074 sts -65538 0 7639 lk 11,ebbc5e id 302a7 5,0 4 6619 qc 11,ebbc5e 5,0 id 302a7 sts 0 0 7639 ex punlock 0 7639 en plock 7,ebbc5f 7644 lk 11,ebbc5e id 302a7 0,5 4 6619 qc 11,ebbc5e 0,5 id 302a7 sts 0 0 7644 req 7,ebbc5e ex 0-7fffffffffffffff lkf 2000 wait 1 7644 lk 7,ebbc5e id 0 -1,5 2000 7644 lk 11,ebbc5e id 302a7 5,0 4 6619 qc 7,ebbc5e -1,5 id 1000185 sts 0 0 6619 qc 11,ebbc5e 5,0 id 302a7 sts 0 0 7644 ex plock 0 7621 en punlock 7,ebb7e5 7621 lk 11,ebb7e5 id 10097 0,5 4 6619 qc 11,ebb7e5 0,5 id 10097 sts 0 0 7621 remove 7,ebb7e5 7621 un 7,ebb7e5 11e00d6 5 0 6619 qc 7,ebb7e5 5,5 id 11e00d6 sts -65538 0 7621 lk 11,ebb7e5 id 10097 5,0 4 6619 qc 11,ebb7e5 5,0 id 10097 sts 0 0 7621 ex punlock 0 7621 en plock 7,ebb7e4 [root@morph-02 ~]# cat /proc/cluster/dlm_debug 1 " 11 gfs2 resent 7 requests gfs1 recover event 71 finished gfs2 recover event 71 finished gfs4 move flags 0,0,1 ids 65,71,71 gfs0 processed 0 requests gfs0 resend marked requests gfs0 resend e0222 lq 3 flg 1200008 node 0/0 " 8 gfs0 resend 9034e lq 1 flg 200000 node -1/-1 " 7 gfs0 resend 702e0 lq 1 flg 200000 node -1/-1 " 7 gfs0 resend 703bd lq 1 flg 200000 node -1/-1 " 5 gfs0 resend b02fa lq 1 flg 200000 node -1/-1 " 7 gfs0 resent 5 requests gfs0 recover event 71 finished gfs4 process held requests gfs4 processed 0 requests gfs4 resend marked requests gfs4 resend d01ab lq 3 flg 1200008 node 0/0 " 8 gfs4 resend 60116 lq 1 flg 200000 node -1/-1 " 7 gfs4 resend c03ec lq 1 flg 200000 node -1/-1 " 7 gfs4 resend a018d lq 1 flg 200000 node -1/-1 " 7 gfs4 resend 8011f lq 1 flg 200000 node -1/-1 " 5 gfs4 resent 5 requests gfs4 recover event 71 finished [root@morph-02 ~]# ps -e -o pid,wchan=WIDE-WCHAN-COLUMN -o cmd PID WIDE-WCHAN-COLUMN CMD 1 - init [3] 2 migration_thread [migration/0] 3 
ksoftirqd [ksoftirqd/0] 4 migration_thread [migration/1] 5 ksoftirqd [ksoftirqd/1] 6 worker_thread [events/0] 7 worker_thread [events/1] 8 worker_thread [khelper] 9 worker_thread [kacpid] 41 worker_thread [kblockd/0] 42 worker_thread [kblockd/1] 43 hub_thread [khubd] 52 pdflush [pdflush] 53 - [pdflush] 55 worker_thread [aio/0] 56 worker_thread [aio/1] 54 kswapd [kswapd0] 129 serio_thread [kseriod] 198 - [scsi_eh_0] 199 - [qla2300_0_dpc] 224 worker_thread [kmirrord/0] 225 worker_thread [kmirrord/1] 241 kjournald [kjournald] 1104 - udevd 1464 kjournald [kjournald] 1810 - syslogd -m 0 1814 syslog klogd -x 1824 - irqbalance 1834 - portmap 1853 - rpc.statd 1883 - rpc.idmapd 1985 - /usr/sbin/smartd 1994 - /usr/sbin/acpid 2005 - cupsd 2040 - /usr/sbin/sshd 2053 - xinetd -stayalive -pidfile /var/run/xinetd.pid 2072 - sendmail: rejecting connections on daemon MTA: load average: 93 2080 pause sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue 2135 - gpm -m /dev/input/mice -t imps2 2188 - crond 2213 - xfs -droppriv -daemon 2230 - /usr/sbin/atd 2239 - dbus-daemon-1 --system 2250 - cups-config-daemon 2260 - hald 2267 - /sbin/agetty ttyS0 115200 vt100 2268 - /sbin/mingetty tty1 2269 - /sbin/mingetty tty2 2270 - /sbin/mingetty tty3 2271 - /sbin/mingetty tty4 2272 - /sbin/mingetty tty5 2273 - /sbin/mingetty tty6 5871 - ccsd 5893 cluster_kthread [cman_comms] 5895 serviced [cman_serviced] 5894 membership_kthrea [cman_memb] 5972 hello_kthread [cman_hbeat] 6031 rt_sigsuspend fenced 6618 - clvmd 6619 dlm_astd [dlm_astd] 6620 dlm_recvd [dlm_recvd] 6621 dlm_sendd [dlm_sendd] 6622 dlm_recoverd [dlm_recoverd] 6814 dlm_recoverd [dlm_recoverd] 6815 dlm_async [lock_dlm1] 6816 dlm_async [lock_dlm2] 6817 - [gfs_scand] 6818 gfs_glockd [gfs_glockd] 6819 wait_on_buffer [gfs_recoverd] 6820 - [gfs_logd] 6821 glock_wait_intern [gfs_quotad] 6822 - [gfs_inoded] 6882 dlm_recoverd [dlm_recoverd] 6883 dlm_async [lock_dlm1] 6884 dlm_async [lock_dlm2] 6885 - [gfs_scand] 6886 gfs_glockd [gfs_glockd] 
6887 wait_on_buffer [gfs_recoverd] 6888 - [gfs_logd] 6889 glock_wait_intern [gfs_quotad] 6890 - [gfs_inoded] 6959 dlm_recoverd [dlm_recoverd] 6960 dlm_async [lock_dlm1] 6961 dlm_async [lock_dlm2] 6962 - [gfs_scand] 6963 gfs_glockd [gfs_glockd] 6964 - [gfs_recoverd] 6965 - [gfs_logd] 6966 - [gfs_quotad] 6967 - [gfs_inoded] 7036 dlm_recoverd [dlm_recoverd] 7037 dlm_async [lock_dlm1] 7038 dlm_async [lock_dlm2] 7039 - [gfs_scand] 7040 - [gfs_glockd] 7041 - [gfs_recoverd] 7042 - [gfs_logd] 7043 - [gfs_quotad] 7044 - [gfs_inoded] 7113 dlm_recoverd [dlm_recoverd] 7114 dlm_async [lock_dlm1] 7115 dlm_async [lock_dlm2] 7116 - [gfs_scand] 7117 gfs_glockd [gfs_glockd] 7118 - [gfs_recoverd] 7119 - [gfs_logd] 7120 - [gfs_quotad] 7121 - [gfs_inoded] 7190 dlm_recoverd [dlm_recoverd] 7191 dlm_async [lock_dlm1] 7192 dlm_async [lock_dlm2] 7193 - [gfs_scand] 7194 gfs_glockd [gfs_glockd] 7195 - [gfs_recoverd] 7196 - [gfs_logd] 7197 - [gfs_quotad] 7198 - [gfs_inoded] 7267 dlm_recoverd [dlm_recoverd] 7268 - [lock_dlm1] 7269 - [lock_dlm2] 7270 - [gfs_scand] 7271 gfs_glockd [gfs_glockd] 7272 - [gfs_recoverd] 7273 - [gfs_logd] 7274 - [gfs_quotad] 7275 - [gfs_inoded] 7446 - sshd: root@notty 7448 - sshd: root@notty 7450 - sshd: root@notty 7451 - sshd: root@notty 7454 - sshd: root@notty 7456 wait bash -c PATH=/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/msp/c 7457 wait bash -c PATH=/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/msp/c 7465 - sshd: root@notty 7468 wait bash -c PATH=/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/msp/c 7481 wait bash -c PATH=/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/msp/c 7482 wait bash -c PATH=/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/msp/c 7541 wait bash -c PATH=/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/msp/c 7565 - sshd: root@notty 7568 wait bash -c 
PATH=/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/msp/c 7586 pipe_wait /usr/bin/perl -w /tmp/STS/gfs/bin/revolver_load_gen -r /tmp/STS -L HEAVY -m LOCK_DLM 7587 pipe_wait /usr/bin/perl -w /tmp/STS/gfs/bin/revolver_load_gen -r /tmp/STS -L HEAVY -m LOCK_DLM 7588 pipe_wait /usr/bin/perl -w /tmp/STS/gfs/bin/revolver_load_gen -r /tmp/STS -L HEAVY -m LOCK_DLM 7597 wait sh -c PATH=$PATH:/tmp/STS/bin; /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7586 - 7598 wait sh -c PATH=$PATH:/tmp/STS/bin; /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7587 - 7599 pipe_wait /usr/bin/perl -w /tmp/STS/gfs/bin/revolver_load_gen -r /tmp/STS -L HEAVY -m LOCK_DLM 7602 pipe_wait /usr/bin/perl -w /tmp/STS/gfs/bin/revolver_load_gen -r /tmp/STS -L HEAVY -m LOCK_DLM 7605 pipe_wait /usr/bin/perl -w /tmp/STS/gfs/bin/revolver_load_gen -r /tmp/STS -L HEAVY -m LOCK_DLM 7608 - /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7587 -l /tmp/revolver/7587/revolver_l 7609 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7587 -l /tmp/revolver/7587/revolver_l 7610 wait_on_buffer growfiles -i 0 -N 500 -n 4 -b 7611 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7587 -l /tmp/revolver/7587/revolver_l 7612 wait sh -c iogen -f sync -m sequential -s read,write,readv,writev -t 1b -T 30000 30000:rwsy 7613 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7587 -l /tmp/revolver/7587/revolver_l 7614 wait accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7615 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7587 -l /tmp/revolver/7587/revolver_l 7616 wait sh -c iogen -f buffered -m sequential -s read,write,readv,writev -t 1b -T 6000b 6000b: 7617 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7587 -l /tmp/revolver/7587/revolver_l 7618 wait genesis -n 500 -d 150 -p 4 7619 wait_local accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7620 - accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 
accrdfile5 7621 wait_on_buffer accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7622 wait_local accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7623 wait_on_buffer genesis -n 500 -d 150 -p 4 7624 wait_on_buffer genesis -n 500 -d 150 -p 4 7625 wait_on_buffer genesis -n 500 -d 150 -p 4 7626 wait_on_buffer genesis -n 500 -d 150 -p 4 7627 wait_on_buffer growfiles -i 0 -N 500 -n 4 -b 7628 - growfiles -i 0 -N 500 -n 4 -b 7629 - growfiles -i 0 -N 500 -n 4 -b 7630 - /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7586 -l /tmp/revolver/7586/revolver_l 7631 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7586 -l /tmp/revolver/7586/revolver_l 7632 wait sh -c iogen -f buffered -m sequential -s read,write,readv,writev -t 1b -T 6000b 6000b: 7633 pipe_wait iogen -f buffered -m sequential -s read write readv writev -t 1b -T 6000b 6000b:rwbufl 7634 lock_page doio -avk 7635 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7586 -l /tmp/revolver/7586/revolver_l 7636 wait accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7637 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7586 -l /tmp/revolver/7586/revolver_l 7638 wait genesis -n 500 -d 150 -p 4 7639 - accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7640 - genesis -n 500 -d 150 -p 4 7641 wait_local accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7642 wait_on_buffer genesis -n 500 -d 150 -p 4 7643 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7586 -l /tmp/revolver/7586/revolver_l 7644 wait_on_buffer accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7645 - growfiles -i 0 -N 500 -n 4 -b 7646 - growfiles -i 0 -N 500 -n 4 -b 7647 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7586 -l /tmp/revolver/7586/revolver_l 7648 wait sh -c iogen -f sync -m sequential -s read,write,readv,writev -t 1b -T 30000 30000:rwsy 7649 - genesis -n 500 -d 150 -p 4 7650 
glock_wait_intern genesis -n 500 -d 150 -p 4 7651 - accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7652 wait_on_buffer growfiles -i 0 -N 500 -n 4 -b 7653 - growfiles -i 0 -N 500 -n 4 -b 7654 pipe_wait iogen -f sync -m sequential -s read write readv writev -t 1b -T 30000 30000:rwsynclarg 7655 - doio -avk 7657 wait sh -c PATH=$PATH:/tmp/STS/bin; /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7599 - 7659 wait sh -c PATH=$PATH:/tmp/STS/bin; /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7605 - 7660 - /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7605 -l /tmp/revolver/7605/revolver_l 7661 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7605 -l /tmp/revolver/7605/revolver_l 7662 wait accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7663 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7605 -l /tmp/revolver/7605/revolver_l 7664 wait genesis -n 500 -d 150 -p 4 7665 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7605 -l /tmp/revolver/7605/revolver_l 7666 wait sh -c iogen -f sync -m sequential -s read,write,readv,writev -t 1b -T 30000 30000:rwsy 7667 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7605 -l /tmp/revolver/7605/revolver_l 7668 wait_local accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7669 wait_local accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7671 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7605 -l /tmp/revolver/7605/revolver_l 7672 wait_local accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7673 - growfiles -i 0 -N 500 -n 4 -b 7674 wait_async accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7675 - genesis -n 500 -d 150 -p 4 7676 glock_wait_intern genesis -n 500 -d 150 -p 4 7677 glock_wait_intern genesis -n 500 -d 150 -p 4 7678 - genesis -n 500 -d 150 -p 4 7670 wait sh -c iogen -f buffered -m sequential -s read,write,readv,writev -t 1b -T 6000b 6000b: 7679 - 
growfiles -i 0 -N 500 -n 4 -b 7680 glock_wait_intern growfiles -i 0 -N 500 -n 4 -b 7681 glock_wait_intern growfiles -i 0 -N 500 -n 4 -b 7682 pipe_wait iogen -f sync -m sequential -s read write readv writev -t 1b -T 30000 30000:rwsynclarg 7683 pipe_wait iogen -f buffered -m sequential -s read write readv writev -t 1b -T 6000b 6000b:rwbufl 7684 wait_async doio -avk 7685 wait_async doio -avk 7686 pipe_wait iogen -f buffered -m sequential -s read write readv writev -t 1b -T 6000b 6000b:rwbufl 7687 pipe_wait iogen -f sync -m sequential -s read write readv writev -t 1b -T 30000 30000:rwsynclarg 7688 - doio -avk 7689 - doio -avk 7728 wait sh -c PATH=$PATH:/tmp/STS/bin; /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7602 - 7729 - /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7602 -l /tmp/revolver/7602/revolver_l 7730 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7602 -l /tmp/revolver/7602/revolver_l 7731 wait sh -c iogen -f buffered -m sequential -s read,write,readv,writev -t 1b -T 6000b 6000b: 7732 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7602 -l /tmp/revolver/7602/revolver_l 7733 wait sh -c iogen -f sync -m sequential -s read,write,readv,writev -t 1b -T 30000 30000:rwsy 7734 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7602 -l /tmp/revolver/7602/revolver_l 7735 wait genesis -n 500 -d 150 -p 4 7736 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7602 -l /tmp/revolver/7602/revolver_l 7737 - growfiles -i 0 -N 500 -n 4 -b 7738 pipe_wait iogen -f buffered -m sequential -s read write readv writev -t 1b -T 6000b 6000b:rwbufl 7739 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7602 -l /tmp/revolver/7602/revolver_l 7740 wait accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7741 wait_on_buffer genesis -n 500 -d 150 -p 4 7742 - doio -avk 7743 - genesis -n 500 -d 150 -p 4 7744 wait_on_buffer genesis -n 500 -d 150 -p 4 7745 wait_on_buffer genesis -n 500 -d 150 -p 4 7746 wait_local accordion 
-p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7747 wait_on_buffer accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7748 wait_local accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7749 - growfiles -i 0 -N 500 -n 4 -b 7750 - growfiles -i 0 -N 500 -n 4 -b 7751 - growfiles -i 0 -N 500 -n 4 -b 7752 - accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7753 pipe_wait iogen -f sync -m sequential -s read write readv writev -t 1b -T 30000 30000:rwsynclarg 7754 - doio -avk 7755 - /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7599 -l /tmp/revolver/7599/revolver_l 7756 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7599 -l /tmp/revolver/7599/revolver_l 7757 - growfiles -i 0 -N 500 -n 4 -b 7758 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7599 -l /tmp/revolver/7599/revolver_l 7759 wait accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7761 - growfiles -i 0 -N 500 -n 4 -b 7760 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7599 -l /tmp/revolver/7599/revolver_l 7762 wait_on_buffer growfiles -i 0 -N 500 -n 4 -b 7763 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7599 -l /tmp/revolver/7599/revolver_l 7764 wait sh -c iogen -f buffered -m sequential -s read,write,readv,writev -t 1b -T 6000b 6000b: 7765 - growfiles -i 0 -N 500 -n 4 -b 7766 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7599 -l /tmp/revolver/7599/revolver_l 7767 wait sh -c iogen -f sync -m sequential -s read,write,readv,writev -t 1b -T 30000 30000:rwsy 7768 pipe_wait iogen -f sync -m sequential -s read write readv writev -t 1b -T 30000 30000:rwsynclarg 7769 - doio -avk 7770 wait genesis -n 500 -d 150 -p 4 7771 wait_on_buffer genesis -n 500 -d 150 -p 4 7773 wait_on_buffer genesis -n 500 -d 150 -p 4 7772 wait_on_buffer accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7774 - genesis -n 500 -d 150 -p 4 7775 wait_on_buffer genesis -n 500 -d 150 
-p 4 7776 wait_local accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7777 wait_on_buffer accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7778 pipe_wait iogen -f buffered -m sequential -s read write readv writev -t 1b -T 6000b 6000b:rwbufl 7779 wait_on_buffer accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7780 pipe_wait /usr/bin/perl -w /tmp/STS/gfs/bin/revolver_load_gen -r /tmp/STS -L HEAVY -m LOCK_DLM 7781 wait_on_buffer doio -avk 7785 wait sh -c PATH=$PATH:/tmp/STS/bin; /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7780 - 7786 - /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7780 -l /tmp/revolver/7780/revolver_l 7787 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7780 -l /tmp/revolver/7780/revolver_l 7788 wait accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7789 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7780 -l /tmp/revolver/7780/revolver_l 7790 - growfiles -i 0 -N 500 -n 4 -b 7792 wait_local accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7791 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7780 -l /tmp/revolver/7780/revolver_l 7794 - accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7795 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7780 -l /tmp/revolver/7780/revolver_l 7793 wait sh -c iogen -f sync -m sequential -s read,write,readv,writev -t 1b -T 30000 30000:rwsy 7797 wait_local accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7796 wait genesis -n 500 -d 150 -p 4 7798 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7780 -l /tmp/revolver/7780/revolver_l 7799 wait sh -c iogen -f buffered -m sequential -s read,write,readv,writev -t 1b -T 6000b 6000b: 7800 - accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7801 - growfiles -i 0 -N 500 -n 4 -b 7802 wait_on_buffer growfiles -i 0 -N 500 -n 4 -b 7803 - growfiles -i 0 -N 500 -n 4 
-b 7804 wait_on_buffer genesis -n 500 -d 150 -p 4 7805 wait_on_buffer genesis -n 500 -d 150 -p 4 7806 wait_on_buffer genesis -n 500 -d 150 -p 4 7807 wait_on_buffer genesis -n 500 -d 150 -p 4 7808 pipe_wait iogen -f sync -m sequential -s read write readv writev -t 1b -T 30000 30000:rwsynclarg 7809 - doio -avk 7810 pipe_wait iogen -f buffered -m sequential -s read write readv writev -t 1b -T 6000b 6000b:rwbufl 7811 - doio -avk 7813 wait sh -c PATH=$PATH:/tmp/STS/bin; /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7588 - 7814 - /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7588 -l /tmp/revolver/7588/revolver_l 7815 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7588 -l /tmp/revolver/7588/revolver_l 7817 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7588 -l /tmp/revolver/7588/revolver_l 7816 wait sh -c iogen -f buffered -m sequential -s read,write,readv,writev -t 1b -T 6000b 6000b: 7818 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7588 -l /tmp/revolver/7588/revolver_l 7819 wait accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7821 pipe_wait iogen -f buffered -m sequential -s read write readv writev -t 1b -T 6000b 6000b:rwbufl 7820 wait_local accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7822 wait_async doio -avk 7823 wait_async accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7824 wait_local accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7825 wait genesis -n 500 -d 150 -p 4 7826 - genesis -n 500 -d 150 -p 4 7827 wait_local accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 7828 glock_wait_intern genesis -n 500 -d 150 -p 4 7829 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7588 -l /tmp/revolver/7588/revolver_l 7830 - genesis -n 500 -d 150 -p 4 7831 - growfiles -i 0 -N 500 -n 4 -b 7832 - genesis -n 500 -d 150 -p 4 7833 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7588 -l 
/tmp/revolver/7588/revolver_l 7834 wait sh -c iogen -f sync -m sequential -s read,write,readv,writev -t 1b -T 30000 30000:rwsy 7835 glock_wait_intern growfiles -i 0 -N 500 -n 4 -b 7836 glock_wait_intern growfiles -i 0 -N 500 -n 4 -b 7837 - growfiles -i 0 -N 500 -n 4 -b 7838 pipe_wait iogen -f sync -m sequential -s read write readv writev -t 1b -T 30000 30000:rwsynclarg 7839 wait_async doio -avk 9019 - sshd: root@pts/0 9021 wait -bash 9059 - ps -e -o pid,wchan=WIDE-WCHAN-COLUMN -o cmd
WHOA! I take that last comment back! It just took about 15-20 minutes, is all. :( I'll try more recoveries with very high lock counts to see whether this is just taking a SUPER long time to recover.
While gfs0 and gfs1 are in state "recover 2", gfs should be doing journal recovery for those two filesystems. At the end there should be a message from gfs stating how long each journal replay took; it would be interesting to see what that showed. I'm betting that the journal recovery time will account for the 15-20 minutes you waited. The ps output shows that a bunch of processes, including the first two gfs_recoverd threads, are blocked on "wait_on_buffer". If I'm not mistaken, this means they are waiting for I/O to complete. I've seen situations where the storage device, the drivers, or something else below the fs hangs for a long time and all I/O stalls until the problem is resolved. This could be the situation. It's also possible that there's no I/O hang, just an I/O bottleneck: the I/O on all the running filesystems could be starving the I/O from the gfs_recoverd threads on the two unrecovered filesystems. Running top during this time would probably tell a lot about which of these was going on.
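To make the wait-channel observation above easier to repeat, here is a minimal Python sketch that tallies the WCHAN column from saved `ps -e -o pid,wchan=WIDE-WCHAN-COLUMN -o cmd` output and lists where each gfs_recoverd thread is sleeping. The function name `tally_wchan` and the sample list are illustrative only (the rows are copied from the ps output in this report), and it assumes one process per line with whitespace-separated PID, wchan and command.

```python
from collections import Counter


def tally_wchan(ps_lines):
    """Count processes per wait channel and collect gfs_recoverd threads.

    Expects lines shaped like "PID WCHAN CMD...", as printed by
    `ps -e -o pid,wchan=WIDE-WCHAN-COLUMN -o cmd`.
    """
    counts = Counter()
    recoverd = []
    for line in ps_lines:
        parts = line.split(None, 2)
        if len(parts) < 3 or not parts[0].isdigit():
            continue  # skip the header row and anything malformed
        pid, wchan, cmd = parts
        counts[wchan] += 1
        if cmd.strip() == "[gfs_recoverd]":
            recoverd.append((int(pid), wchan))
    return counts, recoverd


# A few rows taken from the ps output in this report.
sample = [
    "PID WIDE-WCHAN-COLUMN CMD",
    "6819 wait_on_buffer [gfs_recoverd]",
    "6887 wait_on_buffer [gfs_recoverd]",
    "6964 - [gfs_recoverd]",
    "7610 wait_on_buffer growfiles -i 0 -N 500 -n 4 -b",
    "7650 glock_wait_intern genesis -n 500 -d 150 -p 4",
]

counts, recoverd = tally_wchan(sample)
for wchan, n in counts.most_common():
    print(f"{n:4d} {wchan}")
for pid, wchan in recoverd:
    print(f"gfs_recoverd pid {pid} sleeping in {wchan}")
```

Taking two snapshots a minute or so apart and comparing the tallies would show whether the same threads stay parked in wait_on_buffer; it cannot by itself distinguish a hung device from a saturated one.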
Hit this bug for real this time. The DLM service is actually still stuck in the recovery state. 5 node cluster, 4 gfs. Shot 2 nodes (morph-03 and morph-04). [root@morph-01 ~]# cat /proc/cluster/services Service Name GID LID State Code Fence Domain: "default" 1 2 run - [2 3 1 4 5] DLM Lock Space: "clvmd" 2 3 recover 4 - [2 3 1] DLM Lock Space: "gfs0" 3 4 recover 2 - [2 3 1] DLM Lock Space: "gfs1" 5 6 recover 2 - [2 3 1] DLM Lock Space: "gfs2" 7 8 recover 2 - [2 3 1] DLM Lock Space: "gfs3" 9 10 recover 2 - [2 3 1] GFS Mount Group: "gfs0" 4 5 recover 0 - [2 3 1] GFS Mount Group: "gfs1" 6 7 recover 0 - [2 3 1] GFS Mount Group: "gfs2" 8 9 recover 0 - [2 3 1] GFS Mount Group: "gfs3" 10 11 recover 0 - [2 3 1] [root@morph-01 ~]# cat /proc/cluster/lock_dlm/debug nlock 0 6425 en plock 7,10412 6425 lk 11,10412 id 200ec 0,5 4 5905 qc 11,10412 0,5 id 200ec sts 0 0 6425 req 7,10412 ex 0-43b4 lkf 2000 wait 1 6425 lk 7,10412 id 0 -1,5 2000 6425 lk 11,10412 id 200ec 5,0 4 5905 qc 7,10412 -1,5 id a0209 sts 0 0 5905 qc 11,10412 5,0 id 200ec sts 0 0 6425 ex plock 0 6328 en punlock 7,2040f 6328 lk 11,2040f id 201e1 0,5 4 5905 qc 11,2040f 0,5 id 201e1 sts 0 0 6328 remove 7,2040f 6328 un 7,2040f d0039 5 0 5905 qc 7,2040f 5,5 id d0039 sts -65538 0 6328 lk 11,2040f id 201e1 5,0 4 5905 qc 11,2040f 5,0 id 201e1 sts 0 0 6328 ex punlock 0 6328 en plock 7,2040f 6328 lk 11,2040f id 201e1 0,5 4 5905 qc 11,2040f 0,5 id 201e1 sts 0 0 6328 req 7,2040f ex 2ec536-2edcef lkf 2000 wait 1 6328 lk 7,2040f id 0 -1,5 2000 6328 lk 11,2040f id 201e1 5,0 4 5905 qc 11,2040f 5,0 id 201e1 sts 0 0 5905 qc 7,2040f -1,5 id c03f6 sts 0 0 6328 ex plock 0 6328 en punlock 7,2040f 6328 lk 11,2040f id 201e1 0,5 4 5905 qc 11,2040f 0,5 id 201e1 sts 0 0 6328 remove 7,2040f 6328 un 7,2040f c03f6 5 0 5905 qc 7,2040f 5,5 id c03f6 sts -65538 0 6328 lk 11,2040f id 201e1 5,0 4 5905 qc 11,2040f 5,0 id 201e1 sts 0 0 6430 en punlock 7,55 6328 ex punlock 0 6430 lk 11,55 id 30332 0,5 4 6328 en plock 7,2040f 5905 qc 11,55 0,5 id 30332 
sts 0 0 6328 lk 11,2040f id 201e1 0,5 4 5905 qc 11,2040f 0,5 id 201e1 sts 0 0 6430 remove 7,55 6430 un 7,55 d00df 5 0 6328 req 7,2040f ex 2edcf0-2edfe4 lkf 2000 wait 1 6328 lk 7,2040f id 0 -1,5 2000 5905 qc 7,55 5,5 id d00df sts -65538 0 6430 lk 11,55 id 30332 5,0 4 6328 lk 11,2040f id 201e1 5,0 4 5905 qc 11,55 5,0 id 30332 sts 0 0 5905 qc 11,2040f 5,0 id 201e1 sts 0 0 6430 ex punlock 0 5905 qc 7,2040f -1,5 id a0380 sts 0 0 6328 ex plock 0 6430 en plock 7,55 6430 lk 11,55 id 30332 0,5 4 5905 qc 11,55 0,5 id 30332 sts 0 0 6430 req 7,55 ex 2abfa7-2da36b lkf 2000 wait 1 6430 lk 7,55 id 0 -1,5 2000 6430 lk 11,55 id 30332 5,0 4 6328 en punlock 7,2040f 5905 qc 11,55 5,0 id 30332 sts 0 0 6328 lk 11,2040f id 201e1 0,5 4 5905 qc 11,2040f 0,5 id 201e1 sts 0 0 6328 remove 7,2040f 6328 un 7,2040f a0380 5 0 5905 qc 7,2040f 5,5 id a0380 sts -65538 0 6328 lk 11,2040f id 201e1 5,0 4 5905 qc 11,2040f 5,0 id 201e1 sts 0 0 6328 ex punlock 0 5905 qc 7,55 -1,5 id b033c sts 0 0 6430 ex plock 0 6272 en punlock 7,10415 6272 lk 11,10415 id 10148 0,5 4 5905 qc 11,10415 0,5 id 10148 sts 0 0 6272 remove 7,10415 6272 un 7,10415 b0169 5 0 5905 qc 7,10415 5,5 id b0169 sts -65538 0 6272 lk 11,10415 id 10148 5,0 4 5905 qc 11,10415 5,0 id 10148 sts 0 0 6272 ex punlock 0 6272 en plock 7,10415 6272 lk 11,10415 id 10148 0,5 4 5905 qc 11,10415 0,5 id 10148 sts 0 0 6425 en punlock 7,10412 6425 lk 11,10412 id 200ec 0,5 4 5905 qc 11,10412 0,5 id 200ec sts 0 0 6425 remove 7,10412 6425 un 7,10412 a0209 5 0 5905 qc 7,10412 5,5 id a0209 sts -65538 0 6425 lk 11,10412 id 200ec 5,0 4 5905 qc 11,10412 5,0 id 200ec sts 0 0 6425 ex punlock 0 6425 en plock 7,10412 6425 lk 11,10412 id 200ec 0,5 4 5905 qc 11,10412 0,5 id 200ec sts 0 0 6425 req 7,10412 ex 43b4-7127 lkf 2000 wait 1 6425 lk 7,10412 id 0 -1,5 2000 6425 lk 11,10412 id 200ec 5,0 4 5905 qc 7,10412 -1,5 id a0057 sts 0 0 5905 qc 11,10412 5,0 id 200ec sts 0 0 6425 ex plock 0 6381 en punlock 7,10414 6381 lk 11,10414 id 30372 0,5 4 5905 qc 11,10414 0,5 id 30372 
sts 0 0 6381 remove 7,10414 6381 un 7,10414 c003b 5 0 6272 req 7,10415 ex 64f4-72e5 lkf 2000 wait 1 6272 lk 7,10415 id 0 -1,5 2000 5905 qc 7,10414 5,5 id c003b sts -65538 0 6430 en punlock 7,55 6430 lk 11,55 id 30332 0,5 4 5905 qc 7,10415 -1,5 id 90090 sts 0 0 5905 qc 11,55 0,5 id 30332 sts 0 0 6272 lk 11,10415 id 10148 5,0 4 6381 lk 11,10414 id 30372 5,0 4 6430 remove 7,55 6430 un 7,55 b033c 5 0 5905 qc 11,10415 5,0 id 10148 sts 0 0 5905 qc 11,10414 5,0 id 30372 sts 0 0 6272 ex plock 0 5905 qc 7,55 5,5 id b033c sts -65538 0 6430 lk 11,55 id 30332 5,0 4 5905 qc 11,55 5,0 id 30332 sts 0 0 6430 ex punlock 0 6430 en plock 7,55 6381 ex punlock 0 6328 en plock 7,2040f 6381 en plock 7,10414 6384 en punlock 7,10415 6425 en punlock 7,10412 6272 en punlock 7,10415 6026 un 2,2058c 30153 5 0 6065 un 2,19 100c3 3 0 6104 un 2,609b7 3018f 5 0 6143 un 2,19 102f3 3 0 [root@morph-02 ~]# cat /proc/cluster/lock_dlm/debug 5,0 4 5922 qc 11,1a00274 5,0 id 20320 sts 0 0 5922 qc 7,1a00274 -1,5 id 83037b sts 0 0 6456 ex plock 0 6443 en punlock 7,19e0273 6443 lk 11,19e0273 id 50274 0,5 4 5922 qc 11,19e0273 0,5 id 50274 sts 0 0 6443 remove 7,19e0273 6443 un 7,19e0273 7f014a 5 0 5922 qc 7,19e0273 5,5 id 7f014a sts -65538 0 6443 lk 11,19e0273 id 50274 5,0 4 5922 qc 11,19e0273 5,0 id 50274 sts 0 0 6443 ex punlock 0 6443 en plock 7,19e0273 6443 lk 11,19e0273 id 50274 0,5 4 5922 qc 11,19e0273 0,5 id 50274 sts 0 0 6443 req 7,19e0273 ex 5fc7-6cbb lkf 2000 wait 1 6443 lk 7,19e0273 id 0 -1,5 2000 6443 lk 11,19e0273 id 50274 5,0 4 5922 qc 7,19e0273 -1,5 id 8e02ec sts 0 0 5922 qc 11,19e0273 5,0 id 50274 sts 0 0 6443 ex plock 0 6415 en punlock 7,1a00270 6415 lk 11,1a00270 id 10134 0,5 4 5922 qc 11,1a00270 0,5 id 10134 sts 0 0 6415 remove 7,1a00270 6415 un 7,1a00270 8b005d 5 0 5922 qc 7,1a00270 5,5 id 8b005d sts -65538 0 6415 lk 11,1a00270 id 10134 5,0 4 5922 qc 11,1a00270 5,0 id 10134 sts 0 0 6415 ex punlock 0 6415 en plock 7,1a00270 6415 lk 11,1a00270 id 10134 0,5 4 5922 qc 11,1a00270 0,5 id 10134 sts 
0 0 6415 req 7,1a00270 ex 0-101e lkf 2000 wait 1 6415 lk 7,1a00270 id 0 -1,5 2000 6415 lk 11,1a00270 id 10134 5,0 4 5922 qc 7,1a00270 -1,5 id 7900e0 sts 0 0 5922 qc 11,1a00270 5,0 id 10134 sts 0 0 6415 ex plock 0 6426 en punlock 7,19e0282 6426 lk 11,19e0282 id 30290 0,5 4 5922 qc 11,19e0282 0,5 id 30290 sts 0 0 6426 remove 7,19e0282 6426 un 7,19e0282 87025c 5 0 5922 qc 7,19e0282 5,5 id 87025c sts -65538 0 6426 lk 11,19e0282 id 30290 5,0 4 5922 qc 11,19e0282 5,0 id 30290 sts 0 0 6426 ex punlock 0 6426 en plock 7,19e0282 6426 lk 11,19e0282 id 30290 0,5 4 5922 qc 11,19e0282 0,5 id 30290 sts 0 0 6426 req 7,19e0282 ex 0-3ba4 lkf 2000 wait 1 6426 lk 7,19e0282 id 0 -1,5 2000 6426 lk 11,19e0282 id 30290 5,0 4 5922 qc 11,19e0282 5,0 id 30290 sts 0 0 6428 en punlock 7,19e0280 6428 lk 11,19e0280 id 300ae 0,5 4 5922 qc 11,19e0280 0,5 id 300ae sts 0 0 6428 remove 7,19e0280 6428 un 7,19e0280 8a01c3 5 0 5922 qc 7,19e0280 5,5 id 8a01c3 sts -65538 0 6428 lk 11,19e0280 id 300ae 5,0 4 5922 qc 7,19e0282 -1,5 id 930297 sts 0 0 5922 qc 11,19e0280 5,0 id 300ae sts 0 0 6428 ex punlock 0 6428 en plock 7,19e0280 6428 lk 11,19e0280 id 300ae 0,5 4 5922 qc 11,19e0280 0,5 id 300ae sts 0 0 6428 req 7,19e0280 ex 2ed897-2edab9 lkf 2000 wait 1 6428 lk 7,19e0280 id 0 -1,5 2000 6428 lk 11,19e0280 id 300ae 5,0 4 5922 qc 7,19e0280 -1,5 id 8902e7 sts 0 0 5922 qc 11,19e0280 5,0 id 300ae sts 0 0 6428 ex plock 0 6426 ex plock 0 6428 en punlock 7,19e0280 6428 lk 11,19e0280 id 300ae 0,5 4 5922 qc 11,19e0280 0,5 id 300ae sts 0 0 6428 remove 7,19e0280 6428 un 7,19e0280 8902e7 5 0 5922 qc 7,19e0280 5,5 id 8902e7 sts -65538 0 6428 lk 11,19e0280 id 300ae 5,0 4 6456 en punlock 7,1a00274 5922 qc 11,19e0280 5,0 id 300ae sts 0 0 6456 lk 11,1a00274 id 20320 0,5 4 6428 ex punlock 0 5922 qc 11,1a00274 0,5 id 20320 sts 0 0 6443 en punlock 7,19e0273 6443 lk 11,19e0273 id 50274 0,5 4 6456 remove 7,1a00274 5922 qc 11,19e0273 0,5 id 50274 sts 0 0 6456 un 7,1a00274 83037b 5 0 5922 qc 7,1a00274 5,5 id 83037b sts -65538 0 6456 
lk 11,1a00274 id 20320 5,0 4 5922 qc 11,1a00274 5,0 id 20320 sts 0 0 6456 ex punlock 0 6456 en plock 7,1a00274 6456 lk 11,1a00274 id 20320 0,5 4 5922 qc 11,1a00274 0,5 id 20320 sts 0 0 6456 req 7,1a00274 ex 6060-70da lkf 2000 wait 1 6456 lk 7,1a00274 id 0 -1,5 2000 6456 lk 11,1a00274 id 20320 5,0 4 5922 qc 11,1a00274 5,0 id 20320 sts 0 0 6443 remove 7,19e0273 6443 un 7,19e0273 8e02ec 5 0 5922 qc 7,19e0273 5,5 id 8e02ec sts -65538 0 6443 lk 11,19e0273 id 50274 5,0 4 5922 qc 11,19e0273 5,0 id 50274 sts 0 0 6443 ex punlock 0 5922 qc 7,1a00274 -1,5 id 98007d sts 0 0 6443 en plock 7,19e0273 6443 lk 11,19e0273 id 50274 0,5 4 5922 qc 11,19e0273 0,5 id 50274 sts 0 0 6443 req 7,19e0273 ex 0-1dee lkf 2000 wait 1 6443 lk 7,19e0273 id 0 -1,5 2000 6443 lk 11,19e0273 id 50274 5,0 4 6456 ex plock 0 6428 en plock 7,19e0280 6426 en punlock 7,19e0282 6415 en punlock 7,1a00270 6456 en punlock 7,1a00274 6116 un 2,1a10fdd 1e018e 5 0 6196 un 2,1aa03ce 5303a4 5 0 6156 un 2,1a04889 630244 5 0 6085 un 2,1a70551 48030e 5 0 [root@morph-05 ~]# cat /proc/cluster/lock_dlm/debug lk 11,678a01d id 202b7 0,5 4 5907 qc 11,678a01d 0,5 id 202b7 sts 0 0 6410 req 7,678a01d ex 2ed490-2ed762 lkf 2000 wait 1 6410 lk 7,678a01d id 0 -1,5 2000 6410 lk 11,678a01d id 202b7 5,0 4 5907 qc 7,678a01d -1,5 id 4000a6 sts 0 0 5907 qc 11,678a01d 5,0 id 202b7 sts 0 0 6410 ex plock 0 6317 en punlock 7,678a009 6317 lk 11,678a009 id 1026e 0,5 4 5907 qc 11,678a009 0,5 id 1026e sts 0 0 6317 remove 7,678a009 6317 un 7,678a009 390318 5 0 5907 qc 7,678a009 5,5 id 390318 sts -65538 0 6317 lk 11,678a009 id 1026e 5,0 4 5907 qc 11,678a009 5,0 id 1026e sts 0 0 6317 ex punlock 0 6317 en plock 7,678a009 6317 lk 11,678a009 id 1026e 0,5 4 5907 qc 11,678a009 0,5 id 1026e sts 0 0 6317 req 7,678a009 ex 0-6628 lkf 2000 wait 1 6317 lk 7,678a009 id 0 -1,5 2000 6317 lk 11,678a009 id 1026e 5,0 4 5907 qc 11,678a009 5,0 id 1026e sts 0 0 6255 en punlock 7,678a006 6255 lk 11,678a006 id 30185 0,5 4 5907 qc 11,678a006 0,5 id 30185 sts 0 0 6255 remove 
7,678a006 6255 un 7,678a006 2c026c 5 0 5907 qc 7,678a006 5,5 id 2c026c sts -65538 0 6255 lk 11,678a006 id 30185 5,0 4 5907 qc 11,678a006 5,0 id 30185 sts 0 0 6255 ex punlock 0 6255 en plock 7,678a006 6255 lk 11,678a006 id 30185 0,5 4 5907 qc 11,678a006 0,5 id 30185 sts 0 0 6255 req 7,678a006 ex 6ee8-7453 lkf 2000 wait 1 6255 lk 7,678a006 id 0 -1,5 2000 6255 lk 11,678a006 id 30185 5,0 4 5907 qc 7,678a006 -1,5 id 230345 sts 0 0 5907 qc 11,678a006 5,0 id 30185 sts 0 0 6255 ex plock 0 6357 en punlock 7,678a015 6357 lk 11,678a015 id 302d4 0,5 4 5907 qc 11,678a015 0,5 id 302d4 sts 0 0 6357 remove 7,678a015 6357 un 7,678a015 2e028d 5 0 5907 qc 7,678a015 5,5 id 2e028d sts -65538 0 6357 lk 11,678a015 id 302d4 5,0 4 5907 qc 11,678a015 5,0 id 302d4 sts 0 0 6357 ex punlock 0 6357 en plock 7,678a015 6357 lk 11,678a015 id 302d4 0,5 4 5907 qc 11,678a015 0,5 id 302d4 sts 0 0 6357 req 7,678a015 ex 72c3-74ea lkf 2000 wait 1 6357 lk 7,678a015 id 0 -1,5 2000 6357 lk 11,678a015 id 302d4 5,0 4 5907 qc 11,678a015 5,0 id 302d4 sts 0 0 6413 en punlock 7,678a00e 6413 lk 11,678a00e id 40180 0,5 4 5907 qc 11,678a00e 0,5 id 40180 sts 0 0 6413 remove 7,678a00e 6413 un 7,678a00e 420182 5 0 5907 qc 7,678a00e 5,5 id 420182 sts -65538 0 6413 lk 11,678a00e id 40180 5,0 4 5907 qc 11,678a00e 5,0 id 40180 sts 0 0 6413 ex punlock 0 6413 en plock 7,678a010 6413 lk 11,678a010 id 400c4 0,5 4 5907 qc 11,678a010 0,5 id 400c4 sts 0 0 6413 req 7,678a010 ex 0-7fffffffffffffff lkf 2000 wait 1 6413 lk 7,678a010 id 0 -1,5 2000 6413 lk 11,678a010 id 400c4 5,0 4 5907 qc 11,678a010 5,0 id 400c4 sts 0 0 6412 lk 11,678a00e id 40180 0,5 4 5907 qc 11,678a00e 0,5 id 40180 sts 0 0 6412 req 7,678a00e ex 0-7fffffffffffffff lkf 2000 wait 1 6412 lk 7,678a00e id 0 -1,5 2000 6412 lk 11,678a00e id 40180 5,0 4 5907 qc 11,678a00e 5,0 id 40180 sts 0 0 6410 en punlock 7,678a01d 6410 lk 11,678a01d id 202b7 0,5 4 5907 qc 11,678a01d 0,5 id 202b7 sts 0 0 6410 remove 7,678a01d 6410 un 7,678a01d 4000a6 5 0 5907 qc 7,678a01d 5,5 id 4000a6 
sts -65538 0 6410 lk 11,678a01d id 202b7 5,0 4 5907 qc 11,678a01d 5,0 id 202b7 sts 0 0 6410 ex punlock 0 5907 qc 7,679a007 -1,5 id 290032 sts 0 0 5907 qc 7,678a009 -1,5 id 3603ca sts 0 0 5907 qc 7,678a015 -1,5 id 37018d sts 0 0 5907 qc 7,678a00e -1,5 id 3f01d4 sts 0 0 6412 ex plock 0 6317 ex plock 0 6357 ex plock 0 6356 ex plock 0 6410 en plock 7,678a01d 6410 lk 11,678a01d id 202b7 0,5 4 5907 qc 11,678a01d 0,5 id 202b7 sts 0 0 6410 req 7,678a01d ex 0-2292b5 lkf 2000 wait 1 6410 lk 7,678a01d id 0 -1,5 2000 6410 lk 11,678a01d id 202b7 5,0 4 5907 qc 7,678a01d -1,5 id 310385 sts 0 0 5907 qc 11,678a01d 5,0 id 202b7 sts 0 0 6410 ex plock 0 5907 qc 7,678a018 -1,5 id 3b0112 sts 0 0 5907 qc 7,678a011 -1,5 id 3a0043 sts 0 0 5907 qc 7,678a010 -1,5 id 3001d6 sts 0 0 6415 ex plock 0 6411 ex plock 0 6413 ex plock 0 6412 en punlock 7,678a00e 6356 en punlock 7,679a007 6255 en punlock 7,678a006 6317 en punlock 7,678a009 6411 en punlock 7,678a011 6357 en punlock 7,678a015 6413 en punlock 7,678a010 6415 en punlock 7,678a018 6410 en punlock 7,678a01d 6009 un 2,67aa392 15027f 5 0 6048 un 2,67ba2c2 180132 5 0 6087 un 2,67da870 1203c7 5 0 6126 un 2,679ae48 220177 5 0
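For comparing nodes quickly, the following Python sketch pulls out every service in a saved /proc/cluster/services listing that is not in the "run" state. The column layout (type, quoted name, GID, LID, state with an optional numeric sub-state, code, member list) is inferred from the listings pasted in this report, not from any documented format, and the names `LINE_RE` and `not_running` are hypothetical.

```python
import re

# Layout assumed from the listings in this report, for example:
#   DLM Lock Space: "gfs0" 3 4 recover 2 - [2 3 1]
LINE_RE = re.compile(
    r'^(?P<service>[^:]+):\s+"(?P<name>[^"]+)"\s+'
    r'(?P<gid>\d+)\s+(?P<lid>\d+)\s+'
    r'(?P<state>\w+(?:\s+\d+)?)\s+(?P<code>\S+)\s+'
    r'\[(?P<nodes>[^\]]*)\]'
)


def not_running(lines):
    """Return (service type, name, state) for every entry not in state 'run'."""
    stuck = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # header row or unrecognised line
        if m.group("state") != "run":
            stuck.append((m.group("service"), m.group("name"), m.group("state")))
    return stuck


# Rows taken from the morph-01 listing in this comment.
sample = [
    "Service Name GID LID State Code",
    'Fence Domain: "default" 1 2 run - [2 3 1 4 5]',
    'DLM Lock Space: "clvmd" 2 3 recover 4 - [2 3 1]',
    'DLM Lock Space: "gfs0" 3 4 recover 2 - [2 3 1]',
    'GFS Mount Group: "gfs0" 4 5 recover 0 - [2 3 1]',
]

stuck = not_running(sample)
for service, name, state in stuck:
    print(f"{service} {name!r}: {state}")
```

Running this against the listing from each surviving node and diffing the results shows at a glance which lockspaces and mount groups disagree about their recovery state.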
more info:

[root@morph-01 ~]# cat /proc/cluster/dlm_debug
0,1,0 ids 13,21,13
gfs0 move use event 21
gfs0 recover event 21
gfs0 remove node 4
gfs0 remove node 5
clvmd move flags 0,1,0 ids 4,21,4
clvmd move use event 21
clvmd recover event 21
clvmd remove node 4
clvmd remove node 5
gfs2 total nodes 3
gfs2 rebuild resource directory
gfs0 total nodes 3
gfs0 rebuild resource directory
gfs1 total nodes 3
gfs1 rebuild resource directory
clvmd total nodes 3
clvmd rebuild resource directory
clvmd rebuilt 1 resources
clvmd purge requests
clvmd purged 0 requests
gfs3 rebuilt 6411 resources
gfs3 purge requests
gfs3 purged 0 requests
gfs2 rebuilt 6559 resources
gfs2 purge requests
gfs2 purged 0 requests
gfs0 rebuilt 6638 resources
gfs0 purge requests
gfs0 purged 0 requests
gfs1 rebuilt 6698 resources
gfs1 purge requests
gfs1 purged 0 requests
clvmd mark waiting requests
clvmd marked 0 requests
clvmd purge locks of departed nodes
clvmd purged 0 locks
clvmd update remastered resources
clvmd updated 0 resources
clvmd rebuild locks
clvmd rebuilt 0 locks
clvmd recover event 21 done

[root@morph-02 ~]# cat /proc/cluster/dlm_debug
move flags 1,0,0 ids 81,81,81
gfs3 move flags 0,1,0 ids 81,83,81
gfs3 move use event 83
gfs3 recover event 83
gfs3 remove node 4
gfs3 remove node 5
gfs2 move flags 0,1,0 ids 79,83,79
gfs2 move use event 83
gfs2 recover event 83
gfs2 remove node 4
gfs2 remove node 5
gfs1 move flags 0,1,0 ids 77,83,77
gfs1 move use event 83
gfs1 recover event 83
gfs1 remove node 4
gfs1 remove node 5
gfs0 move flags 0,1,0 ids 75,83,75
gfs0 move use event 83
gfs0 recover event 83
gfs0 remove node 4
gfs0 remove node 5
clvmd move flags 0,1,0 ids 66,83,66
clvmd move use event 83
clvmd recover event 83
clvmd remove node 4
clvmd remove node 5
gfs3 total nodes 3
gfs3 rebuild resource directory
gfs2 total nodes 3
gfs2 rebuild resource directory
clvmd total nodes 3
clvmd rebuild resource directory
gfs0 total nodes 3
gfs0 rebuild resource directory
gfs1 total nodes 3
gfs1 rebuild resource directory
clvmd rebuilt 2 resources
clvmd purge requests
clvmd purged 0 requests
gfs3 rebuilt 6438 resources
gfs3 purge requests
gfs3 purged 0 requests

[root@morph-05 ~]# cat /proc/cluster/dlm_debug
5
gfs0 move flags 0,1,0 ids 24,32,24
gfs0 move use event 32
gfs0 recover event 32
gfs0 remove node 4
gfs0 remove node 5
clvmd move flags 0,1,0 ids 15,32,15
clvmd move use event 32
clvmd recover event 32
clvmd remove node 4
clvmd remove node 5
gfs3 total nodes 3
gfs3 rebuild resource directory
gfs2 total nodes 3
gfs2 rebuild resource directory
gfs0 total nodes 3
clvmd total nodes 3
gfs0 rebuild resource directory
clvmd rebuild resource directory
gfs1 total nodes 3
gfs1 rebuild resource directory
clvmd rebuilt 1 resources
clvmd purge requests
clvmd purged 0 requests
gfs3 rebuilt 6327 resources
gfs3 purge requests
gfs3 purged 0 requests
gfs1 rebuilt 6695 resources
gfs1 purge requests
gfs1 purged 0 requests
gfs0 rebuilt 6705 resources
gfs0 purge requests
gfs0 purged 0 requests
clvmd mark waiting requests
clvmd marked 0 requests
clvmd purge locks of departed nodes
clvmd purged 0 locks
clvmd update remastered resources
clvmd updated 0 resources
clvmd rebuild locks
clvmd rebuilt 0 locks
clvmd recover event 32 done

[root@morph-01 ~]# ps -e -o pid,wchan=WIDE-WCHAN-COLUMN -o cmd PID WIDE-WCHAN-COLUMN CMD 1 - init [3] 2 migration_thread [migration/0] 3 ksoftirqd [ksoftirqd/0] 4 migration_thread [migration/1] 5 ksoftirqd [ksoftirqd/1] 6 migration_thread [migration/2] 7 ksoftirqd [ksoftirqd/2] 8 migration_thread [migration/3] 9 ksoftirqd [ksoftirqd/3] 10 worker_thread [events/0] 11 worker_thread [events/1] 12 worker_thread [events/2] 13 worker_thread [events/3] 14 worker_thread [khelper] 15 worker_thread [kacpid] 47 worker_thread [kblockd/0] 48 worker_thread [kblockd/1] 49 worker_thread [kblockd/2] 50 worker_thread [kblockd/3] 51 hub_thread [khubd] 60 pdflush [pdflush] 61 pdflush [pdflush] 63 worker_thread [aio/0] 64 worker_thread [aio/1] 65 worker_thread [aio/2] 66 worker_thread
[aio/3] 62 kswapd [kswapd0] 139 serio_thread [kseriod] 208 - [scsi_eh_0] 209 - [qla2300_0_dpc] 234 worker_thread [kmirrord/0] 235 worker_thread [kmirrord/1] 236 worker_thread [kmirrord/2] 237 worker_thread [kmirrord/3] 250 kjournald [kjournald] 1071 - udevd 1473 kjournald [kjournald] 1838 - syslogd -m 0 1842 syslog klogd -x 1852 - irqbalance 1862 - portmap 1881 - rpc.statd 1911 - rpc.idmapd 1968 - /usr/sbin/smartd 1977 - /usr/sbin/acpid 1988 - cupsd 2035 - /usr/sbin/sshd 2075 - xinetd -stayalive -pidfile /var/run/xinetd.pid 2106 - sendmail: rejecting connections on daemon MTA: load average: 52 2118 pause sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue 2136 - gpm -m /dev/input/mice -t imps2 2181 - crond 2206 - xfs -droppriv -daemon 2240 - /usr/sbin/atd 2274 - dbus-daemon-1 --system 2303 - cups-config-daemon 2315 - hald 2322 - /sbin/agetty ttyS0 115200 vt100 2323 - /sbin/mingetty tty1 2324 - /sbin/mingetty tty2 2325 - /sbin/mingetty tty3 2326 - /sbin/mingetty tty4 2327 - /sbin/mingetty tty5 2328 - /sbin/mingetty tty6 3708 - ccsd 5781 cluster_kthread [cman_comms] 5783 serviced [cman_serviced] 5782 membership_kthrea [cman_memb] 5784 hello_kthread [cman_hbeat] 5843 rt_sigsuspend fenced 5904 - clvmd 5905 dlm_astd [dlm_astd] 5906 dlm_recvd [dlm_recvd] 5907 dlm_sendd [dlm_sendd] 5908 dlm_recoverd [dlm_recoverd] 6022 dlm_wait_function [dlm_recoverd] 6023 dlm_async [lock_dlm1] 6024 dlm_async [lock_dlm2] 6025 - [gfs_scand] 6026 - [gfs_glockd] 6027 - [gfs_recoverd] 6028 - [gfs_logd] 6029 glock_wait_intern [gfs_quotad] 6030 - [gfs_inoded] 6052 dlm_wait_function [dlm_recoverd] 6053 dlm_async [lock_dlm1] 6054 dlm_async [lock_dlm2] 6064 - [gfs_scand] 6065 - [gfs_glockd] 6066 - [gfs_recoverd] 6067 - [gfs_logd] 6068 glock_wait_intern [gfs_quotad] 6069 - [gfs_inoded] 6091 dlm_wait_function [dlm_recoverd] 6101 dlm_async [lock_dlm1] 6102 dlm_async [lock_dlm2] 6103 - [gfs_scand] 6104 - [gfs_glockd] 6105 - [gfs_recoverd] 6106 - [gfs_logd] 6107 glock_wait_intern [gfs_quotad] 
6108 - [gfs_inoded] 6130 dlm_wait_function [dlm_recoverd] 6140 dlm_async [lock_dlm1] 6141 dlm_async [lock_dlm2] 6142 - [gfs_scand] 6143 - [gfs_glockd] 6144 - [gfs_recoverd] 6145 - [gfs_logd] 6146 glock_wait_intern [gfs_quotad] 6147 - [gfs_inoded] 6233 - sshd: root@notty 6235 wait bash -c PATH=/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/msp/c 6253 pipe_wait /usr/bin/perl -w /tmp/STS/gfs/bin/revolver_load_gen -r /tmp/STS -L HEAVY -m LOCK_DLM 6257 wait sh -c PATH=$PATH:/tmp/STS/bin; /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.6253 - 6258 - /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.6253 -l /tmp/revolver/6253/revolver_l 6259 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.6253 -l /tmp/revolver/6253/revolver_l 6260 - growfiles -i 0 -N 500 -n 4 -b 6261 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.6253 -l /tmp/revolver/6253/revolver_l 6262 wait sh -c iogen -f buffered -m sequential -s read,write,readv,writev -t 1b -T 6000b 6000b: 6263 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.6253 -l /tmp/revolver/6253/revolver_l 6264 wait genesis -n 500 -d 150 -p 4 6265 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.6253 -l /tmp/revolver/6253/revolver_l 6266 wait accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 6267 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.6253 -l /tmp/revolver/6253/revolver_l 6268 wait sh -c iogen -f sync -m sequential -s read,write,readv,writev -t 1b -T 30000 30000:rwsy 6269 pipe_wait iogen -f buffered -m sequential -s read write readv writev -t 1b -T 6000b 6000b:rwbufl 6270 wait_async doio -avk 6271 pipe_wait iogen -f sync -m sequential -s read write readv writev -t 1b -T 30000 30000:rwsynclarg 6272 dlm_lock_sync doio -avk 6273 - genesis -n 500 -d 150 -p 4 6274 - genesis -n 500 -d 150 -p 4 6275 glock_wait_intern genesis -n 500 -d 150 -p 4 6276 - genesis -n 500 -d 150 -p 4 6277 - sshd: root@notty 6278 wait_local accordion -p 4 accrdfile1 
accrdfile2 accrdfile3 accrdfile4 accrdfile5 6279 - growfiles -i 0 -N 500 -n 4 -b 6280 wait_local accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 6281 glock_wait_intern growfiles -i 0 -N 500 -n 4 -b 6282 glock_wait_intern growfiles -i 0 -N 500 -n 4 -b 6283 wait_async accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 6284 wait_local accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 6286 wait bash -c PATH=/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/msp/c 6304 pipe_wait /usr/bin/perl -w /tmp/STS/gfs/bin/revolver_load_gen -r /tmp/STS -L HEAVY -m LOCK_DLM 6308 wait sh -c PATH=$PATH:/tmp/STS/bin; /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.6304 - 6309 - /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.6304 -l /tmp/revolver/6304/revolver_l 6310 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.6304 -l /tmp/revolver/6304/revolver_l 6311 - growfiles -i 0 -N 500 -n 4 -b 6312 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.6304 -l /tmp/revolver/6304/revolver_l 6313 wait sh -c iogen -f buffered -m sequential -s read,write,readv,writev -t 1b -T 6000b 6000b: 6314 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.6304 -l /tmp/revolver/6304/revolver_l 6315 wait genesis -n 500 -d 150 -p 4 6316 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.6304 -l /tmp/revolver/6304/revolver_l 6317 - growfiles -i 0 -N 500 -n 4 -b 6318 glock_wait_intern growfiles -i 0 -N 500 -n 4 -b 6319 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.6304 -l /tmp/revolver/6304/revolver_l 6321 glock_wait_intern growfiles -i 0 -N 500 -n 4 -b 6320 wait sh -c iogen -f sync -m sequential -s read,write,readv,writev -t 1b -T 30000 30000:rwsy 6322 - genesis -n 500 -d 150 -p 4 6323 wait accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 6324 glock_wait_intern genesis -n 500 -d 150 -p 4 6325 glock_wait_intern genesis -n 500 -d 150 -p 4 6326 pipe_wait iogen -f 
buffered -m sequential -s read write readv writev -t 1b -T 6000b 6000b:rwbufl 6327 - genesis -n 500 -d 150 -p 4 6328 dlm_lock_sync doio -avk 6329 wait_async accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 6330 pipe_wait iogen -f sync -m sequential -s read write readv writev -t 1b -T 30000 30000:rwsynclarg 6331 wait_local accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 6332 wait_async doio -avk 6333 wait_local accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 6334 wait_local accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 6335 - sshd: root@notty 6337 wait bash -c PATH=/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/msp/c 6355 pipe_wait /usr/bin/perl -w /tmp/STS/gfs/bin/revolver_load_gen -r /tmp/STS -L HEAVY -m LOCK_DLM 6359 wait sh -c PATH=$PATH:/tmp/STS/bin; /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.6355 - 6360 - /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.6355 -l /tmp/revolver/6355/revolver_l 6361 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.6355 -l /tmp/revolver/6355/revolver_l 6362 - growfiles -i 0 -N 500 -n 4 -b 6363 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.6355 -l /tmp/revolver/6355/revolver_l 6364 wait genesis -n 500 -d 150 -p 4 6365 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.6355 -l /tmp/revolver/6355/revolver_l 6366 wait sh -c iogen -f buffered -m sequential -s read,write,readv,writev -t 1b -T 6000b 6000b: 6367 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.6355 -l /tmp/revolver/6355/revolver_l 6368 glock_wait_intern growfiles -i 0 -N 500 -n 4 -b 6369 wait sh -c iogen -f sync -m sequential -s read,write,readv,writev -t 1b -T 30000 30000:rwsy 6370 dlm_lock_sync genesis -n 500 -d 150 -p 4 6371 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.6355 -l /tmp/revolver/6355/revolver_l 6372 glock_wait_intern growfiles -i 0 -N 500 -n 4 -b 6373 wait accordion -p 4 accrdfile1 
accrdfile2 accrdfile3 accrdfile4 accrdfile5 6374 - genesis -n 500 -d 150 -p 4 6375 - growfiles -i 0 -N 500 -n 4 -b 6376 - genesis -n 500 -d 150 -p 4 6377 glock_wait_intern genesis -n 500 -d 150 -p 4 6378 pipe_wait iogen -f buffered -m sequential -s read write readv writev -t 1b -T 6000b 6000b:rwbufl 6379 wait_local accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 6380 wait_async accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 6381 dlm_lock_sync doio -avk 6382 pipe_wait iogen -f sync -m sequential -s read write readv writev -t 1b -T 30000 30000:rwsynclarg 6383 wait_local accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 6384 dlm_lock_sync doio -avk 6385 wait_local accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 6386 - sshd: root@notty 6388 wait bash -c PATH=/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/msp/c 6406 pipe_wait /usr/bin/perl -w /tmp/STS/gfs/bin/revolver_load_gen -r /tmp/STS -L HEAVY -m LOCK_DLM 6410 wait sh -c PATH=$PATH:/tmp/STS/bin; /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.6406 - 6411 - /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.6406 -l /tmp/revolver/6406/revolver_l 6412 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.6406 -l /tmp/revolver/6406/revolver_l 6413 wait sh -c iogen -f sync -m sequential -s read,write,readv,writev -t 1b -T 30000 30000:rwsy 6414 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.6406 -l /tmp/revolver/6406/revolver_l 6415 wait accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 6416 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.6406 -l /tmp/revolver/6406/revolver_l 6418 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.6406 -l /tmp/revolver/6406/revolver_l 6417 wait genesis -n 500 -d 150 -p 4 6419 pipe_wait /tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.6406 -l /tmp/revolver/6406/revolver_l 6420 wait sh -c iogen -f buffered -m sequential -s 
read,write,readv,writev -t 1b -T 6000b 6000b: 6421 - growfiles -i 0 -N 500 -n 4 -b 6422 wait_local accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 6423 pipe_wait iogen -f sync -m sequential -s read write readv writev -t 1b -T 30000 30000:rwsynclarg 6424 glock_wait_intern genesis -n 500 -d 150 -p 4 6425 dlm_lock_sync doio -avk 6426 dlm_lock_sync genesis -n 500 -d 150 -p 4 6427 - genesis -n 500 -d 150 -p 4 6428 glock_wait_intern genesis -n 500 -d 150 -p 4 6429 pipe_wait iogen -f buffered -m sequential -s read write readv writev -t 1b -T 6000b 6000b:rwbufl 6430 dlm_lock_sync doio -avk 6431 wait_local accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 6432 glock_wait_intern growfiles -i 0 -N 500 -n 4 -b 6433 glock_wait_intern growfiles -i 0 -N 500 -n 4 -b 6434 - growfiles -i 0 -N 500 -n 4 -b 6435 wait_async accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 6436 wait_local accordion -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 6990 - sshd: root@pts/0 6992 wait -bash 7038 - ps -e -o pid,wchan=WIDE-WCHAN-COLUMN -o cmd
[root@morph-01 ~]# cat /proc/cluster/dlm_stats
DLM stats (HZ=1000)
Lock operations: 37236
Unlock operations: 21084
Convert operations: 78110
Completion ASTs: 136412
Blocking ASTs: 4
Lockqueue num waittime ave
WAIT_RSB 26397 170661 6
WAIT_GRANT 8443 5663 0
WAIT_UNLOCK 44 86 1
Total 34884 176410 5

[root@morph-02 ~]# cat /proc/cluster/dlm_stats
DLM stats (HZ=1000)
Lock operations: 521649
Unlock operations: 483282
Convert operations: 1594283
Completion ASTs: 2599179
Blocking ASTs: 99
Lockqueue num waittime ave
WAIT_RSB 343301 1773399 5
WAIT_CONV 11 58 5
WAIT_GRANT 8624 2673 0
WAIT_UNLOCK 190 96 0
Total 352126 1776226 5

[root@morph-05 ~]# cat /proc/cluster/dlm_stats
DLM stats (HZ=1000)
Lock operations: 185748
Unlock operations: 144044
Convert operations: 462750
Completion ASTs: 792510
Blocking ASTs: 51
Lockqueue num waittime ave
WAIT_RSB 134880 980725 7
WAIT_CONV 5 52 10
WAIT_GRANT 8585 16370 1
WAIT_UNLOCK 184 1267 6
Total 143654 998414 6
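For anyone comparing these stats across nodes, here is a rough Python sketch that parses the Lockqueue rows. It assumes the file prints one row per line, and it assumes that the "ave" column is simply waittime divided by num, rounded down. That relationship is inferred from the numbers in this comment, not taken from the DLM source. The function names are made up for illustration.

```python
import re

# One Lockqueue row: a WAIT_* name (or "Total") followed by three integers.
ROW = re.compile(r"^(WAIT_\w+|Total)\s+(\d+)\s+(\d+)\s+(\d+)$")

# Lockqueue rows from morph-01, copied from the dlm_stats output above.
SAMPLE = """\
Lockqueue num waittime ave
WAIT_RSB 26397 170661 6
WAIT_GRANT 8443 5663 0
WAIT_UNLOCK 44 86 1
Total 34884 176410 5
"""


def parse_lockqueue(stats_text):
    """Return {row name: (num, waittime, ave)} for every Lockqueue row."""
    rows = {}
    for line in stats_text.splitlines():
        match = ROW.match(line.strip())
        if match:
            name = match.group(1)
            rows[name] = tuple(int(match.group(i)) for i in (2, 3, 4))
    return rows


def ave_mismatches(rows):
    """List rows whose 'ave' is not waittime // num (the assumed formula)."""
    bad = []
    for name, (num, waittime, ave) in rows.items():
        expected = waittime // num if num else 0
        if ave != expected:
            bad.append(name)
    return bad


if __name__ == "__main__":
    rows = parse_lockqueue(SAMPLE)
    print(rows)
    print("rows not matching waittime // num:", ave_mismatches(rows))
```

The average wait is similar on all three nodes (5 to 7), so by this measure the lock queues themselves do not single out the stuck nodes.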
copied from email:

I don't suppose those nodes are still stuck there? It would be helpful to get the ps output from the other two nodes, as well as /proc/meminfo and /proc/slabinfo from all of them.

There's actually nothing wrong on morph-01. It's morph-02 and morph-05 where dlm recovery has stalled in the "rebuild resource directory" stage. That stage can eat up a lot of memory, and in the past I've seen nodes run out of memory completely there. There are usually other indications that the system memory is gone, though, which is why I'm interested in meminfo/slabinfo.

If it is in fact a memory problem, the solution isn't too clear... With the load you're running multiplied by 4 fs's, the situation seems ripe for running memory dry. It's a really high number of locks in the dlm, combined with recovery, that can lead to this.

/proc/cluster/lock_dlm/drop_count is one crude method we have for keeping the number of dlm/gfs locks down, to try to avoid running out of memory during recovery. By default I think it's set to 50000, and the only way to find a better number is trial and error. Lowering it makes out-of-memory during recovery less likely, but can limit gfs caching and hurt performance.
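The stall described in this email can be read off the dlm_debug output mechanically: a lockspace whose last logged message is still "rebuild resource directory" has not yet logged the "rebuilt N resources" line that follows it. Below is a small Python sketch of that check. It assumes one log entry per line with the lockspace name as the first word, and the caller supplies the lockspace names so that truncated fragments at the start of the buffer are ignored. The function names are hypothetical, and the sample is an excerpt of the morph-02 output.

```python
# Excerpt of the morph-02 dlm_debug output from this bug.
SAMPLE = """\
gfs3 total nodes 3
gfs3 rebuild resource directory
gfs2 total nodes 3
gfs2 rebuild resource directory
clvmd total nodes 3
clvmd rebuild resource directory
clvmd rebuilt 2 resources
clvmd purge requests
clvmd purged 0 requests
gfs3 rebuilt 6438 resources
gfs3 purge requests
gfs3 purged 0 requests
"""


def last_stage(debug_text, lockspaces):
    """Return {lockspace: last message logged for it}."""
    last = {}
    for line in debug_text.splitlines():
        name, _, message = line.strip().partition(" ")
        # Skip fragments and anything that is not a known lockspace.
        if name in lockspaces and message:
            last[name] = message
    return last


def still_rebuilding(debug_text, lockspaces):
    """Lockspaces whose last message is the start of the directory rebuild."""
    stages = last_stage(debug_text, lockspaces)
    return sorted(
        name for name, message in stages.items()
        if message == "rebuild resource directory"
    )


if __name__ == "__main__":
    names = {"clvmd", "gfs0", "gfs1", "gfs2", "gfs3"}
    print(last_stage(SAMPLE, names))
    print("still rebuilding:", still_rebuilding(SAMPLE, names))
```

A lockspace showing up here only means the rebuild had not been logged as finished when the buffer was read. It does not say why, so meminfo/slabinfo are still needed to confirm or rule out memory exhaustion.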
I reproduced this, gathered all the requested info, and put it in: /home/msp/cmarthal/pub/bugs/145683 Memory is low, but this isn't an OOM case. There is plenty of swap as well. There were 3 filesystems and only one is hung; I can continue to read from and write to the other two.
It's back to being gfs/lock_dlm recovery that's stuck, not the dlm. We can't see what happened in lock_dlm because logging from the other running fs's has wiped out the info from the stuck fs.

stuck lock_dlm: comments 1, 15, 8
stuck dlm: comments 5, 11

We eventually decided that comment 8 was a different bug (bz 152451) since it completed after a long i/o delay. Might comments 1 and 15 be in the same category? I don't know. Kdb would probably give a quick and certain answer to what's going on in both cases (stuck lock_dlm and dlm). Collecting data as we've been doing might work after a while, if just the right clue pops out that can be pieced together to say what's happening, but there's no telling how many repetitions that might take.
Moving to need info. After 17 hours of being stuck, a bad block was finally reported, which caused recovery to finish. Thus, this may be bz 152451. I'll test this on the "known good" MSA1000 storage to see if it can be reproduced there.
Blaming this on "cheap" storage; will reopen if it is ever seen on "good" storage.