Red Hat Bugzilla – Bug 159727
clvmd startup deadlock issue
Last modified: 2010-01-11 23:03:30 EST
Description of problem:
Same symptoms as bz 155729 for both DLM and GULM.
I ran revolver all weekend on the 4 node tank cluster (tank-01, 03, 04, 05)
without allowing quorum to be lost (only shooting one node at a time) and never
saw any issues.
This morning I restarted revolver so that quorum gets lost (three nodes shot
each time) and after 40 iterations, I saw the hang on all three nodes shot. Two
of the nodes were stuck starting clvmd and one was stuck doing a vgchange. CMAN
on the node left up reported that everyone was a part of the cluster. I then
killed one of the hung nodes and that allowed the other two hung nodes to get
past the deadlock and continue. The killed node then also came back up without
Also hit this last night on a three node gulm cluster. One slave was shot and
got stuck coming back up while doing a vgchange. Apparently you do not need to
lose quorum in order for this to happen.
OK, I've spotted this in the lab now and am testing a fix.
Apologies to everyone - it is a clvmd bug. Still, at least we got rid of some
CMAN & DLM bugs in the process !
Created attachment 115349 [details]
Don't defer closing of old FDs
That patch should fix the problem. It should be applied to the RPM after the
I'm seeing this again lately. On a three node cluster (link-01, link-02,
link-08), link-01 was shot by revolver; it joined the cluster but hung when
attempting to activate the VGs:
Starting ccsd: ip_tables: (C) 2000-2002 Netfilter core team
[ OK ]
Starting cman:CMAN 2.6.9-37.0 (built Jul 5 2005 12:20:39) installed
CMAN: quorum regained, resuming activity
DLM 2.6.9-35.0 (built Jul 5 2005 12:29:45) installed
[ OK ]
Starting fence domain:[ OK ]
Starting clvmd: [ OK ]
from another node in the cluster:
[root@link-02 ~]# cat /proc/cluster/nodes
Node Votes Exp Sts Name
1 1 3 M link-01
2 1 3 M link-08
3 1 3 M link-02
[root@link-02 ~]# cat /proc/cluster/services
Service Name GID LID State Code
Fence Domain: "default" 1 2 run -
[3 2 1]
DLM Lock Space: "clvmd" 3 4 run -
[3 2 1]
DLM Lock Space: "gfs0" 4 5 run -
GFS Mount Group: "gfs0" 5 6 run -
I'll try and gather more info.
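For anyone comparing node state across machines, here is a minimal sketch that parses a pasted `/proc/cluster/nodes` listing like the one above and reports any node that is not a member. The column layout is taken only from the paste in this comment, and the reading of `M` in the Sts column as "member" is an assumption; the function names are hypothetical and not part of any cluster tool.

```python
def parse_cluster_nodes(text):
    """Parse a whitespace-separated /proc/cluster/nodes paste into dicts.

    Assumes five columns (Node, Votes, Exp, Sts, Name), as in the listing
    quoted in this bug; other CMAN versions may differ.
    """
    nodes = []
    for line in text.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a 5-column data row.
        if len(fields) != 5 or not fields[0].isdigit():
            continue
        node_id, votes, expected, status, name = fields
        nodes.append({
            "id": int(node_id),
            "votes": int(votes),
            "expected": int(expected),
            "status": status,
            "name": name,
        })
    return nodes


def non_members(nodes):
    """Return names of nodes whose status is not 'M' (assumed: member)."""
    return [n["name"] for n in nodes if n["status"] != "M"]


SAMPLE = """\
Node Votes Exp Sts Name
1 1 3 M link-01
2 1 3 M link-08
3 1 3 M link-02
"""

if __name__ == "__main__":
    print(non_members(parse_cluster_nodes(SAMPLE)))
```

With the listing from this comment the result is an empty list, which matches the observation that membership looked healthy while link-01 was hung.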
Interestingly I had this quite often on my Fedora Xen cluster, but it went away
when I upgraded the RPMs to:
Though, to be honest, I'm not sure what the difference was (and I don't have the
old lvm2-cluster package version, sorry)
Although I still have no new info for you :( (as any debugging attempt causes
this not to appear) I'd like to still bring attention to this issue as we are
still seeing it regularly with the init scripts turned on.
I'm trying to figure out if this is in any way peculiar to any hardware type.
I see link-01/08 are dual-proc AMD86 boxes. As I only have UP x86 boxes it may
not be surprising that I can't reproduce this. (I do have a single dual x86 box
but it's currently too hot here to run it for any length of time).
Have you seen this on other machines ?
Just in case (I'm off to the UKUUG Linux conference very soon) I've dropped a
debugging version of clvmd in /root of link-08. You'll need to run it
clvmd -d 2>/root/clvmd.log &
to capture the output.
If you can make it happen with this running, the log files (from all machines)
should show what is going on.
The good news is that I've been able to reproduce this on my SMP machine with 3
SMP Xen VMs. The bad news is that it seems to be some strange pthread interaction.
Anyway, I'm on the case.
OK, try this:
Checking in clvmd.c;
/cvs/lvm2/LVM2/daemons/clvmd/clvmd.c,v <-- clvmd.c
new revision: 1.25; previous revision: 1.24
clvmd is a little less cavalier in its signalling of subthreads: if a thread is
known to be waiting nicely then there's no need to signal it; just notifying the
condition variable will do.
You may have to wait for agk to include this patch in the lvm2-cluster package.
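
For readers who want to see the general pattern the check-in message refers to, here is a minimal sketch of waking a worker thread purely through a condition variable. It is not the clvmd patch (clvmd is C and uses pthreads); it is a Python illustration, and the `Worker` class and its fields are hypothetical names invented for this example.

```python
import threading


class Worker:
    """Hypothetical worker thread that sleeps on a condition variable."""

    def __init__(self):
        self.cond = threading.Condition()
        self.pending = []    # work queued by other threads
        self.handled = []    # work the worker has processed
        self.done = False
        self.thread = threading.Thread(target=self._run)
        self.thread.start()

    def _run(self):
        with self.cond:
            while True:
                # Wait "nicely": re-check the predicate after every wakeup.
                while not self.pending and not self.done:
                    self.cond.wait()
                while self.pending:
                    self.handled.append(self.pending.pop(0))
                if self.done:
                    return

    def submit(self, item):
        with self.cond:
            self.pending.append(item)
            # The worker is either blocked in wait() or will re-check the
            # predicate before waiting, so notifying the condition is enough.
            self.cond.notify()

    def stop(self):
        with self.cond:
            self.done = True
            self.cond.notify()
        self.thread.join()


if __name__ == "__main__":
    w = Worker()
    for i in range(3):
        w.submit(i)
    w.stop()
    print(w.handled)
```

Because the predicate is changed and checked under the same lock, a wakeup cannot be lost between the worker's check and its wait, so no out-of-band signal is needed to get its attention.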