Bug 143269 - lock_dlm recovery gets stuck because lock cancelation appeared to not be completing
Status: CLOSED CURRENTRELEASE
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: dlm
Version: 4
Platform: i686 Linux
Priority: medium
Severity: high
Assigned To: David Teigland
QA Contact: Cluster QE
Blocks: 144795
Doc Type: Bug Fix
Reported: 2004-12-17 16:25 EST by Corey Marthaler
Modified: 2009-04-16 16:29 EDT
Last Closed: 2005-01-20 11:51:32 EST
Attachments: None
Description Corey Marthaler 2004-12-17 16:25:30 EST
Description of problem:
I've been seeing mount hangs lately while running revolver, after
taking down enough nodes to lose quorum.  I then bring everybody back
up into the cluster, start all services, and even get 1 or 2
filesystems mounted before another attempt ends up hanging.  All the
mount attempts are run sequentially on only one node at a time.  I
have only seen this after a recovery which involved a loss of quorum,
but that doesn't mean I'm certain that losing quorum is necessary to
reproduce it.

strace of the mount attempt (pretty much just states the obvious):
.
.
.
getuid32()                              = 0
geteuid32()                             = 0
lstat64("/etc/mtab", {st_mode=S_IFREG|0644, st_size=256, ...}) = 0
stat64("/sbin/mount.gfs", 0xbffff790)   = -1 ENOENT (No such file or
directory)
rt_sigprocmask(SIG_BLOCK, ~[TRAP SEGV RTMIN], NULL, 8) = 0
mount("/dev/mapper/corey-corey2", "/mnt/corey2", "gfs", 0xc0ed0000, 0


[root@morph-02 root]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    6   M   morph-01
   2    1    6   M   morph-05
   3    1    6   M   morph-06
   4    1    6   M   morph-04
   5    1    6   M   morph-02
   6    1    6   M   morph-03
[root@morph-02 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[6 4 5 3 1 2]

DLM Lock Space:  "clvmd"                             2   3 run       -
[6 4 5 3 1 2]

DLM Lock Space:  "corey0"                            3   4 run       -
[6 4 5]

DLM Lock Space:  "corey1"                            5   6 run       -
[6 4 5]

DLM Lock Space:  "corey2"                            7   8 run       -
[6 4 5 2 3]

GFS Mount Group: "corey0"                            4   5 run       -
[6 4 5]

GFS Mount Group: "corey1"                            6   7 run       -
[6 4 5]

GFS Mount Group: "corey2"                            0   9 join     
S-1,480,6
[]


Version-Release number of selected component (if applicable):
CMAN <CVS> (built Dec 17 2004 10:19:45) installed

How reproducible:
Sometimes

Actual Results:  To quote one danderso, "Hangy"

Expected Results:  Again to quote one danderso, "No Hangy"
Comment 1 Corey Marthaler 2004-12-21 11:15:08 EST
hit this again today:

GFS Mount Group: "corey0"                            0   5 join     
S-1,80,6
[]

On the only node left up I see this message over and over on the syslog:
lock_dlm: cancel num=1,2
Comment 2 David Teigland 2004-12-30 03:21:24 EST
This is reminiscent of a problem I identified some months ago; it may
or may not be the case here.  I'd want the output of
/proc/cluster/lock_dlm/debug from the hung node and, ideally, a kdb
backtrace of the waiting mount process.

Here's how I described the problem in an email on Sept 8:

After lm_mount() returns, lock_dlm assumes that for each NEED_RECOVERY
callback it sends, gfs will in turn call lm_recovery_done().  I'm
seeing a deadlock causing lm_recovery_done() to not get called.  It
happens if lock_dlm does NEED_RECOVERY very quickly after lm_mount(),
while gfs is still in fill_super().

- lm_mount() completes

- lock_dlm begins a recovery operation: it blocks new lock requests
and does a NEED_RECOVERY callback (NOEXP requests are allowed, of course)

- fill_super() calls gfs_jindex_hold(), which makes a normal lock
request; this request is blocked because of the recovery in progress

- lock_dlm recovery won't complete until it gets a lm_recovery_done()

- the parts of gfs responsible for lm_recovery_done() (gfs_recoverd?)
have not yet started


If this is in fact the problem, the solution probably lies in getting
gfs to respond to all recovery callbacks after lm_mount returns.
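
To make the ordering concrete, here is a minimal userspace sketch of
the wait cycle described above.  Nothing in it is real gfs or lock_dlm
code; the mutex, flags and thread bodies are invented stand-ins for the
two sides of the deadlock, and the program hangs by design.

/* deadlock_sketch.c -- illustration only, not gfs/lock_dlm code.
 * Build: gcc -pthread deadlock_sketch.c -o deadlock_sketch
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static int recovery_active = 1;  /* lock_dlm has blocked new lock requests */
static int recovery_done   = 0;  /* would be set by lm_recovery_done()     */

/* "lock_dlm" side: recovery started right after lm_mount() returned */
static void *lockdlm_recovery(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&m);
        while (!recovery_done)               /* NEED_RECOVERY was sent; wait */
                pthread_cond_wait(&cv, &m);  /* for a reply that never comes */
        recovery_active = 0;
        pthread_cond_broadcast(&cv);
        pthread_mutex_unlock(&m);
        return NULL;
}

/* "gfs" side: still inside fill_super(), issuing a normal lock request */
static void *gfs_fill_super(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&m);
        while (recovery_active)              /* gfs_jindex_hold() blocked by */
                pthread_cond_wait(&cv, &m);  /* the recovery above           */
        pthread_mutex_unlock(&m);
        /* only past this point would gfs start whatever eventually calls
         * lm_recovery_done() -- hence the deadlock */
        return NULL;
}

int main(void)
{
        pthread_t a, b;
        pthread_create(&a, NULL, lockdlm_recovery, NULL);
        pthread_create(&b, NULL, gfs_fill_super, NULL);
        puts("both sides are now waiting on each other (Ctrl-C to quit)");
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
}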
Comment 3 Corey Marthaler 2005-01-06 14:57:42 EST
I reproduced this again.  There was nothing in
/proc/cluster/lock_dlm/debug on the node with the hung mount attempt;
however, the two nodes that weren't shot (morph-02 and morph-04) did
have info there.

morph-02:

[root@morph-02 root]# cat /proc/cluster/lock_dlm/debug
a4 sts 0 0
2266 qc 7,1a0e027 -1,5 id 26901fa sts 0 0
2266 qc 7,198e07c -1,5 id 28403e3 sts 0 0
2266 qc 7,1b1de9e -1,5 id 24d0231 sts 0 0
2266 qc 7,198e085 -1,5 id 1010217 sts 0 0
2560 ex plock 0
2563 ex plock 0
2560 en punlock 7,1a0e027
2563 en punlock 7,198e07c
2560 lk 11,1a0e027 id 2570021 0,5 4
2563 lk 11,198e07c id 1021c 0,5 4
2561 ex plock 0
2266 qc 11,1a0e027 0,5 id 2570021 sts 0 0
2266 qc 11,198e07c 0,5 id 1021c sts 0 0
2563 remove 7,198e07c
2563 un 7,198e07c 28403e3 5 0
2561 en punlock 7,1b1de9e
2266 qc 7,198e07c 5,5 id 28403e3 sts -65538 0
2563 lk 11,198e07c id 1021c 5,0 4
2561 lk 11,1b1de9e id 28e03b3 0,5 4
2266 qc 11,198e07c 5,0 id 1021c sts 0 0
2563 ex punlock 0
2266 qc 11,1b1de9e 0,5 id 28e03b3 sts 0 0
2560 remove 7,1a0e027
2560 un 7,1a0e027 26901fa 5 0
2561 remove 7,1b1de9e
2266 qc 7,1a0e027 5,5 id 26901fa sts -65538 0
2561 un 7,1b1de9e 24d0231 5 0
2560 lk 11,1a0e027 id 2570021 5,0 4
2266 qc 7,1b1de9e 5,5 id 24d0231 sts -65538 0
2266 qc 11,1a0e027 5,0 id 2570021 sts 0 0
2561 lk 11,1b1de9e id 28eqc 11,1aedec3 0,5 id 26802eb sts 0 0
2561 req 7,1aedec3 ex 0-7fffffffffffffff lkf 2000 wait 1
2561 lk 7,1aedec3 id 0 -1,5 2000
2514 en punlock 7,198e083
2514 lk 11,198e083 id 50193 0,5 4
2266 qc 11,198e083 0,5 id 50193 sts 0 0
2514 remove 7,198e083
2514 un 7,198e083 10d0137 5 0
2266 qc 7,198e083 5,5 id 10d0137 sts -65538 0
2514 lk 11,198e083 id 50193 5,0 4
2266 qc 11,198e083 5,0 id 50193 sts 0 0
2514 ex punlock 0
2514 en plock 7,198e084
2514 lk 11,198e084 id 1009f 0,5 4
2266 qc 11,198e084 0,5 id 1009f sts 0 0
2514 req 7,198e084 ex 0-7fffffffffffffff lkf 2000 wait 1
2514 lk 7,198e084 id 0 -1,5 2000
2514 lk 11,198e084 id 1009f 5,0 4
2266 qc 7,198e084 -1,5 id 1020283 sts 0 0
2266 qc 11,198e084 5,0 id 1009f sts 0 0
2514 ex plock 0
2561 lk 11,1aedec3 id 26802eb 5,0 4
2266 qc 7,1aedec3 -1,5 id 265037c sts 0 0
2266 qc 11,1aedec3 5,0 id 26802eb sts 0 0
2561 ex plock 0
2514 en punlock 7,198e084
2514 lk 11,198e084 id 1009f 0,5 4
2266 qc 11,198e084 0,5 id 1009f sts 0 0
2266 qc 7,198e07c -1,5 id 2660134 sts 0 0
2514 remove 7,198e084
2514 un 7,198e084 1020283 5 0
2266 qc 7,198e084 5,5 id 1020283 sts -65538 0
2514 lk 11,198e084 id 1009f 5,0 4
2563 ex plock 0
2266 qc 11,198e084 5,0 id 1009f sts 0 0
2514 ex punlock 0
2514 en plock 7,198e086
2514 lk 11,198e086 id 203b1 0,5 4
2266 qc 11,198e086 0,5 id 203b1 sts 0 0
2514 req 7,198e086 ex 0-7fffffffffffffff lkf 2000 wait 1
2514 lk 7,198e086 id 0 -1,5 2000
2514 lk 11,198e086 id 203b1 5,0 4
2266 qc 7,198e086 -1,5 id f500dd sts 0 0
2266 qc 11,198e086 5,0 id 203b1 sts 0 0
2514 ex plock 0
2561 en punlock 7,1aedec3
2561 lk 11,1aedec3 id 26802eb 0,5 4
2266 qc 11,1aedec3 0,5 id 26802eb sts 0 0
2563 en punlock 7,198e07c
2563 lk 11,198e07c id 1021c 0,5 4
2266 qc 11,198e07c 0,5 id 1021c sts 0 0
2563 remove 7,198e07c
2563 un 7,198e07c 2660134 5 0
2515 en punlock 7,198e087
2515 lk 11,198e087 id 4024c 0,5 4
2266 qc 11,198e087 0,5 id 4024c sts 0 0
2515 remove 7,198e087
2515 un 7,198e087 1010171 5 0
2266 qc 7,198e087 5,5 id 1010171 sts -65538 0
2515 lk 11,198e087 id 4024c 5,0 4
2266 qc 11,198e087 5,0 id 4024c sts 0 0
2515 ex punlock 0
2515 en plock 7,198e083
2515 lk 11,198e083 id 50193 0,5 4
2266 qc 11,198e083 0,5 id 50193 sts 0 0
2515 req 7,198e083 ex 0-7fffffffffffffff lkf 2000 wait 1
2515 lk 7,198e083 id 0 -1,5 2000
2515 lk 11,198e083 id 50193 5,0 4
2266 qc 7,198e083 -1,5 id f8038c sts 0 0
2266 qc 11,198e083 5,0 id 50193 sts 0 0
2515 ex plock 0
2266 qc 7,198e07c 5,5 id 2660134 sts -65538 0
2563 lk 11,198e07c id 1021c 5,0 4
2266 qc 11,198e07c 5,0 id 1021c sts 0 0
2563 ex punlock 0
2563 en plock 7,198e07c
2563 lk 11,198e07c id 1021c 0,5 4
2266 qc 11,198e07c 0,5 id 1021c sts 0 0
2563 req 7,198e07c ex 176a66-176cf8 lkf 2000 wait 1
2563 lk 7,198e07c id 0 -1,5 2000
2563 lk 11,198e07c id 1021c 5,0 4
2266 qc 11,198e07c 5,0 id 1021c sts 0 0
2561 remove 7,1aedec3
2561 un 7,1aedec3 265037c 5 0
2266 qc 7,1aedec3 5,5 id 265037c sts -65538 0
2561 lk 11,1aedec3 id 26802eb 5,0 4
2266 qc 11,1aedec3 5,0 id 26802eb sts 0 0
2561 ex punlock 0
2561 lk 5,1aedec3 id 2790383 3,5 805
2266 qc 5,1aedec3 3,5 id 2790383 sts 0 0


morph-04:

root@morph-04 root]# cat /proc/cluster/lock_dlm/debug
1 5 0
2237 qc 7,3b7b3ed 5,5 id 4400a1 sts -65538 0
2495 lk 11,3b7b3ed id 203b5 5,0 4
2237 qc 11,3b7b3ed 5,0 id 203b5 sts 0 0
2495 ex punlock 0
2495 en plock 7,3b7b3ed
2495 lk 11,3b7b3ed id 203b5 0,5 4
2237 qc 11,3b7b3ed 0,5 id 203b5 sts 0 0
2495 req 7,3b7b3ed ex 751592-752f66 lkf 2000 wait 1
2495 lk 7,3b7b3ed id 0 -1,5 2000
2495 lk 11,3b7b3ed id 203b5 5,0 4
2237 qc 7,3b7b3ed -1,5 id 3b009c sts 0 0
2237 qc 11,3b7b3ed 5,0 id 203b5 sts 0 0
2495 ex plock 0
2540 en punlock 7,333bee7
2540 lk 11,333bee7 id 70054 0,5 4
2237 qc 11,333bee7 0,5 id 70054 sts 0 0
2540 remove 7,333bee7
2540 un 7,333bee7 113003c 5 0
2237 qc 7,333bee7 5,5 id 113003c sts -65538 0
2540 lk 11,333bee7 id 70054 5,0 4
2237 qc 11,333bee7 5,0 id 70054 sts 0 0
2540 ex punlock 0
2540 en plock 7,333beca
2540 lk 11,333beca id 201d7 0,5 4
2237 qc 11,333beca 0,5 id 201d7 sts 0 0
2540 req 7,333beca ex 0-7fffffffffffffff lkf 2000 wait 1
2540 lk 7,333beca id 0 -1,5 2000
2540 lk 11,333beca id 201d7 5,0 4
2237 qc 11,333beca 5,0 id 201d7 sts 0 0
2237 qc 7,333beca -1,5 id 13502da sts 0 0
2540 ex plock 0
2495 en punlock 7,3b7b3ed
2495 lk 11,3b7b3ed id 203b5 0,5 4
2237 qc 11,3b7b3ed 0,5 id 203b5 sts 0 0
2495 remove 7,3b7b3ed
2495 un 7,3b7b3ed 3b009c 5 0
2237 qc 7,3b7b3ed 5,5 id 3b009c sts -65538 0
2495 lk 11,3b7b3ed id 203b5 5,0 4
2237 qc 11,3b7b3ed 5,0 id 203b5 sts 0 0
2495 ex punlock 0
2541 en punlock 7,333bef8
2541 lk 11,333bef8 id 40061 0,5 4
2237 qc 11,333bef8 0,5 id 40061 sts 0 0
2541 remove 7,333bef8
2541 un 7,333bef8 12c03f9 5 0
2237 qc 7,333bef8 5,5 id 12c03f9 sts -65538 0
2541 lk 11,333bef8 id 40061 5,0 4
2237 qc 11,333bef8 5,0 id 40061 sts 0 0
2541 ex punlock 0
2541 en plock 7,333bed8
2541 lk 11,333bed8 id 400d4 0,5 4
2237 qc 11,333bed8 0,5 id 400d4 sts 0 0
2541 req 7,333bed8 ex 0-7fffffffffffffff lkf 2000 wait 1
2541 lk 7,333bed8 id 0 -1,5 2000
2541 lk 11,333bed8 id 400d4 5,0 4
2237 qc 11,333bed8 5,0 id 400d4 sts 0 0
2237 qc 7,333bed8 -1,5 id 114017c sts 0 0
2541 ex plock 0
2495 en plock 7,3b7b3ed
2495 lk 11,3b7b3ed id 203b5 0,5 4
2237 qc 11,3b7b3ed 0,5 id 203b5 sts 0 0
2495 req 7,3b7b3ed ex 0-2eda87 lkf 2000 wait 1
2495 lk 7,3b7b3ed id 0 -1,5 2000
2495 lk 11,3b7b3ed id 203b5 5,0 4
2237 qc 7,3b7b3ed -1,5 id 40006e sts 0 0
2237 qc 11,3b7b3ed 5,0 id 203b5 sts 0 0
2495 ex plock 0
2540 en punlock 7,333beca
2540 lk 11,333beca id 201d7 0,5 4
2237 qc 11,333beca 0,5 id 201d7 sts 0 0
2540 remove 7,333beca
2540 un 7,333beca 13502da 5 0
2237 qc 7,333beca 5,5 id 13502da sts -65538 0
2540 lk 11,333beca id 201d7 5,0 4
2237 qc 11,333beca 5,0 id 201d7 sts 0 0
2540 ex punlock 0
2540 en plock 7,333bee7
2540 lk 11,333bee7 id 70054 0,5 4
2237 qc 11,333bee7 0,5 id 70054 sts 0 0
2540 req 7,333bee7 ex 0-7fffffffffffffff lkf 2000 wait 1
2540 lk 7,333bee7 id 0 -1,5 2000
2540 lk 11,333bee7 id 70054 5,0 4
2237 qc 11,333bee7 5,0 id 70054 sts 0 0
2237 qc 7,333bee7 -1,5 id 14101d5 sts 0 0
2540 ex plock 0
2495 en punlock 7,3b7b3ed
2495 lk 11,3b7b3ed id 203b5 0,5 4
2237 qc 11,3b7b3ed 0,5 id 203b5 sts 0 0
2495 remove 7,3b7b3ed
2495 un 7,3b7b3ed 40006e 5 0
2237 qc 7,3b7b3ed 5,5 id 40006e sts -65538 0
2495 lk 11,3b7b3ed id 203b5 5,0 4
2237 qc 11,3b7b3ed 5,0 id 203b5 sts 0 0
2495 ex punlock 0
2541 en punlock 7,333bed8
2541 lk 11,333bed8 id 400d4 0,5 4
2237 qc 11,333bed8 0,5 id 400d4 sts 0 0
2541 remove 7,333bed8
2541 un 7,333bed8 114017c 5 0
2237 qc 7,333bed8 5,5 id 114017c sts -65538 0
2541 lk 11,333bed8 id 400d4 5,0 4
2237 qc 11,333bed8 5,0 id 400d4 sts 0 0
2541 ex punlock 0
2541 en plock 7,333beca
2541 lk 11,333beca id 201d7 0,5 4
2237 qc 11,333beca 0,5 id 201d7 sts 0 0
2541 req 7,333beca ex 0-7fffffffffffffff lkf 2000 wait 1
2541 lk 7,333beca id 0 -1,5 2000
2541 lk 11,333beca id 201d7 5,0 4
2237 qc 11,333beca 5,0 id 201d7 sts 0 0
2237 qc 7,333beca -1,5 id 10f0348 sts 0 0
2541 ex plock 0
2495 en plock 7,3b7b3ed
2495 lk 11,3b7b3ed id 203b5 0,5 4
2237 qc 11,3b7b3ed 0,5 id 203b5 sts 0 0
2495 req 7,3b7b3ed ex 74c4cd-750e13 lkf 2000 wait 1
2495 lk 7,3b7b3ed id 0 -1,5 2000
2495 lk 11,3b7b3ed id 203b5 5,0 4
2237 qc 7,3b7b3ed -1,5 id 4a00d4 sts 0 0
2237 qc 11,3b7b3ed 5,0 id 203b5 sts 0 0
2495 ex plock 0
Comment 4 Corey Marthaler 2005-01-06 15:14:33 EST
morph-02 and morph-04 continually spit lock info to the console and
also print the following message over and over:
lock_dlm: cancel num=1,2
Comment 5 Corey Marthaler 2005-01-06 17:31:35 EST
I noticed that the mount group for the filesystem whose mount is
hung is stuck in the recovery state on the nodes that were not shot
(but lost quorum).

[root@morph-01 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 5 4 3 2]

DLM Lock Space:  "clvmd"                             3   4 run       -
[1 5 4 3 2]

DLM Lock Space:  "gfs0"                              4   5 run       -
[1 5 4]

DLM Lock Space:  "gfs1"                              6   7 run       -
[1 5 4]

DLM Lock Space:  "gfs2"                              8   9 run       -
[1 5]

GFS Mount Group: "gfs0"                              5   6 run       -
[1 5 4]

GFS Mount Group: "gfs1"                              7   8 recover 4 -
[1 5]

GFS Mount Group: "gfs2"                              9  10 run       -
[1 5]




[root@morph-02 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 5 4 3 2]

DLM Lock Space:  "clvmd"                             3   4 run       -
[1 5 4 3 2]

DLM Lock Space:  "gfs0"                              4   5 run       -
[1 5 4]

DLM Lock Space:  "gfs1"                              6   7 run       -
[1 5 4]

DLM Lock Space:  "gfs2"                              8   9 run       -
[1 5]

GFS Mount Group: "gfs0"                              5   6 run       -
[1 5 4]

GFS Mount Group: "gfs1"                              7   8 recover 2 -
[1 5]

GFS Mount Group: "gfs2"                              9  10 run       -
[1 5]


# Node where the mount attempt is hung
[root@morph-05 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 5 4 3 2]

DLM Lock Space:  "clvmd"                             3   3 run       -
[1 5 4 3 2]

DLM Lock Space:  "gfs0"                              4   4 run       -
[1 5 4]

DLM Lock Space:  "gfs1"                              6   6 run       -
[1 5 4]

GFS Mount Group: "gfs0"                              5   5 run       -
[1 5 4]

GFS Mount Group: "gfs1"                              0   7 join     
S-1,80,5
[]


Comment 6 Corey Marthaler 2005-01-06 17:33:55 EST
This could possibly be related to bz133420, although in that bug _all_
services are stuck in recovery.
Comment 7 Corey Marthaler 2005-01-10 19:21:38 EST
Bumping priority, as most recovery testing eventually results in this
issue.
Comment 8 David Teigland 2005-01-10 21:24:46 EST
Do you only see this when using multiple fs's at once?
Comment 9 David Teigland 2005-01-12 05:15:37 EST
Changing the description to make it clear that this isn't related to
mount at all.  The mounts block because lock_dlm recovery became stuck
after a node failed.  There are currently no clues as to what lock_dlm
is stuck on: the relevant info in /proc has been replaced by the
normal activity of the running fs's, and this has not yet been
reproduced on a machine with kdb.
Comment 10 Corey Marthaler 2005-01-12 10:48:24 EST
FWIW, I do only appear to hit this with multiple fs's. 
Comment 11 Corey Marthaler 2005-01-12 16:48:01 EST
I was able to hit this bug on the tank cluster with 6 fs's on the
first iteration and with only one fs after a few iterations using
revolver.

So it looks like this doesn't require multiple filesystems. 
Comment 12 David Teigland 2005-01-14 07:04:58 EST
There is one obvious clue that could explain this -- the repeating
cancel messages.  GFS probably needs the cancel to complete before it
will finish recovery, which in turn corresponds to the lock_dlm mount
group finishing recovery.  A problem with cancel in the dlm is what we
should look into.

I did reproduce this on my 7-node bench cluster using three
fs's after four iterations of the following command:

revolver -f /etc/cluster/cluster.conf -l /root/sistina-test -r
/root/sistina-test -b no -t 2 -x 2
Comment 13 David Teigland 2005-01-14 09:54:19 EST
Thankfully my screen scrollback went back to the start of the
repeated cancels.  It confirmed that the problem I found looking
through the source is in fact what happened:

lock_dlm: cancel num=1,2
lock_dlm: extra completion 1,2 -1,3 id 0

The cancel itself worked and a completion callback was queued to tell
gfs.  However, the queue_complete() routine has some logic to avoid
sending completion callbacks when they aren't needed.  This logic
was throwing out the completion for the cancel since the lock being
canceled never made it to the dlm but was blocked in lock_dlm during
recovery.  The fix will be to simply skip this logic; I expect that
will resolve this bug.
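
For illustration, here is a tiny userspace model of the filtering
behaviour described above and of the proposed fix.  The structure,
fields and function names are invented stand-ins, not the actual
queue_complete() code in lock_dlm; the real change is in the kernel
module.

/* queue_complete_sketch.c -- userspace model only, not the lock_dlm source.
 * Build: gcc -Wall queue_complete_sketch.c -o qc_sketch && ./qc_sketch
 */
#include <stdio.h>
#include <stdbool.h>

struct sketch_lock {
        unsigned int dlm_lkid;  /* 0 => request never reached the dlm, e.g.
                                 * held back in lock_dlm during recovery   */
        bool cancel_done;       /* the cancel itself has completed         */
};

/* Old filter: completions judged "not needed" were dropped.  A canceled,
 * recovery-blocked lock (no dlm lkid) tripped this test, so gfs never saw
 * the cancel finish and mount-group recovery stalled, repeating the
 * "lock_dlm: cancel num=1,2" / "lock_dlm: extra completion" messages.   */
static bool queue_complete_old(const struct sketch_lock *lk)
{
        if (lk->dlm_lkid == 0) {
                printf("extra completion dropped\n");
                return false;
        }
        printf("completion queued for gfs\n");
        return true;
}

/* Fixed behaviour as described above: skip the filter and always queue
 * the completion so gfs can finish recovery.                            */
static bool queue_complete_fixed(const struct sketch_lock *lk)
{
        (void)lk;
        printf("completion queued for gfs\n");
        return true;
}

int main(void)
{
        struct sketch_lock lk = { .dlm_lkid = 0, .cancel_done = true };

        printf("old:   "); queue_complete_old(&lk);
        printf("fixed: "); queue_complete_fixed(&lk);
        return 0;
}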
Comment 14 Corey Marthaler 2005-01-18 18:31:17 EST
Fix verified with an unprecedented 3 hours of running revolver. :)
Comment 15 Corey Marthaler 2005-01-19 16:12:05 EST
Reopening.

This is no longer the recovery issue it used to be (one that shows
up all the time), but there must still be a corner case out there
which causes this to happen.

I reproduced this while running revolver on the morph cluster. Three
of the five nodes were taken down (so quorum was lost) and then
brought back up. Once again, the mounting of the filesystems hung.
The hung mount attempts were on morph-04 and morph-01.

morph-02 and morph-03 were the nodes left up:

[root@morph-02 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[2 5 1 4 3]

DLM Lock Space:  "clvmd"                             3   3 run       -
[2 5 1 4 3]

DLM Lock Space:  "gfs0"                              4   4 update   
U-4,1,1
[2 5 1]

DLM Lock Space:  "gfs1"                              6   6 run       -
[2 5]

GFS Mount Group: "gfs0"                              5   5 recover 2 -
[2 5]

GFS Mount Group: "gfs1"                              7   7 run       -
[2 5]


[root@morph-03 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[2 5 1 4 3]

DLM Lock Space:  "clvmd"                             3   3 run       -
[2 5 1 4 3]

DLM Lock Space:  "gfs0"                              4   4 update   
U-4,1,1
[2 5 1]

DLM Lock Space:  "gfs1"                              6   6 run       -
[2 5]

GFS Mount Group: "gfs0"                              5   5 recover 4 -
[2 5]

GFS Mount Group: "gfs1"                              7   7 run       -
[2 5]



morph-01, morph-04, and morph-05 were the nodes shot:

[root@morph-01 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[5 2 1 4 3]

DLM Lock Space:  "clvmd"                             3   3 run       -
[5 2 1 4 3]

DLM Lock Space:  "gfs0"                              4   4 join     
S-6,20,3
[5 2 1]


[root@morph-04 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[2 5 1 4 3]

DLM Lock Space:  "clvmd"                             3   3 run       -
[2 5 1 4 3]

DLM Lock Space:  "gfs1"                              6   4 join     
S-6,20,3
[2 5 4]


[root@morph-05 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[5 1 4 2 3]

DLM Lock Space:  "clvmd"                             3   3 run       -
[5 1 4 2 3]


Comment 16 Corey Marthaler 2005-01-19 16:17:42 EST
Again, the "lock_dlm cancel" messages showed up.

morph-02:

CMAN: quorum lost, blocking activity
CMAN: quorum regained, resuming activity
GFS: fsid=morph-cluster:gfs1.2: jid=4: Trying to acquire journal lock...
GFS: fsid=morph-cluster:gfs0.2: jid=4: Trying to acquire journal lock...
GFS: fsid=morph-cluster:gfs0.2: jid=4: Looking at journal...
GFS: fsid=morph-cluster:gfs0.2: jid=4: Acquiring the transaction lock...
GFS: fsid=morph-cluster:gfs1.2: jid=4: Looking at journal...
GFS: fsid=morph-cluster:gfs1.2: jid=4: Acquiring the transaction lock...
GFS: fsid=morph-cluster:gfs0.2: jid=4: Replaying journal...
GFS: fsid=morph-cluster:gfs1.2: jid=4: Replaying journal...
GFS: fsid=morph-cluster:gfs1.2: jid=4: Replayed 182 of 182 blocks
GFS: fsid=morph-cluster:gfs1.2: jid=4: replays = 182, skips = 0, sames = 0
GFS: fsid=morph-cluster:gfs1.2: jid=4: Journal replayed in 3s
GFS: fsid=morph-cluster:gfs1.2: jid=4: Done
GFS: fsid=morph-cluster:gfs1.2: jid=3: Trying to acquire journal lock...
GFS: fsid=morph-cluster:gfs1.2: jid=3: Looking at journal...
GFS: fsid=morph-cluster:gfs1.2: jid=3: Done
GFS: fsid=morph-cluster:gfs1.2: jid=1: Trying to acquire journal lock...
GFS: fsid=morph-cluster:gfs1.2: jid=1: Busy
GFS: fsid=morph-cluster:gfs0.2: jid=4: Replayed 2727 of 2870 blocks
GFS: fsid=morph-cluster:gfs0.2: jid=4: replays = 2727, skips = 65,
sames = 78
GFS: fsid=morph-cluster:gfs0.2: jid=4: Journal replayed in 17s
GFS: fsid=morph-cluster:gfs0.2: jid=4: Done
GFS: fsid=morph-cluster:gfs0.2: jid=3: Trying to acquire journal lock...
GFS: fsid=morph-cluster:gfs0.2: jid=3: Busy
GFS: fsid=morph-cluster:gfs0.2: jid=1: Trying to acquire journal lock...
GFS: fsid=morph-cluster:gfs0.2: jid=1: Looking at journal...
GFS: fsid=morph-cluster:gfs0.2: jid=1: Acquiring the transaction lock...
lock_dlm: cancel 1,2 flags 400
lock_dlm: cancel 1,2 complete
GFS: fsid=morph-cluster:gfs0.2: jid=1: Replaying journal...
GFS: fsid=morph-cluster:gfs0.2: jid=1: Replayed 1024 of 1025 blocks
GFS: fsid=morph-cluster:gfs0.2: jid=1: replays = 1024, skips = 0,
sames = 1
Comment 17 David Teigland 2005-01-19 23:13:52 EST
This bug was originally for broken cancels in lock_dlm.  The cancel
messages above show cancels happening correctly, so the reopened
issue should probably be filed as a new bug or added to another bug.

The problem here appears to be stuck gfs/lock_dlm recovery on
morph-02, which is in recover state 2 (morph-03 is in recover state 4,
which is complete).  To classify this further we'd need output
from /proc/cluster/lock_dlm/debug (especially on morph-02) and
possibly info on the dlm/lock_dlm kernel threads.
Comment 18 Corey Marthaler 2005-01-20 11:51:32 EST
Closing this bug and opening a new bug for the reopen issue per Dave's
request.
