Description of problem:
Same symptoms as bz 155729 for both DLM and GULM.

I ran revolver all weekend on the 4 node tank cluster (tank-01, 03, 04, 05) without allowing quorum to be lost (only shooting one node at a time) and never saw any issues. This morning I restarted revolver so that quorum gets lost (three nodes shot each time) and after 40 iterations I saw the hang on all three nodes that were shot. Two of the nodes were stuck starting clvmd and one was stuck doing a vgchange. CMAN on the node left up reported that everyone was part of the cluster. I then killed one of the hung nodes, which allowed the other two hung nodes to get past the deadlock and continue. The killed node then also came back up without problems.

##############################################################

Also hit this last night on a three node gulm cluster. One slave was shot and got stuck coming back up while doing a vgchange. Apparently you do not need to lose quorum in order for this to happen.
OK, I've spotted this in the lab now and am testing a fix. Apologies to everyone - it is a clvmd bug. Still, at least we got rid of some CMAN & DLM bugs in the process!
Created attachment 115349 [details] Don't defer closing of old FDs
That patch should fix the problem. It should be applied to the RPM after the current patch.
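A minimal sketch of the idea behind that patch, assuming a simplified client structure (the struct and function names below are illustrative, not the real clvmd code): instead of remembering a superseded descriptor and closing it on a later pass of the main loop, the old descriptor is closed as soon as it is replaced, so a stale FD can never be picked up again in the meantime.

/* Illustrative only -- not the actual clvmd data structures. */
#include <unistd.h>

struct client_conn {
        int fd;          /* descriptor currently in use for this client */
};

/* Replace a client's descriptor with a new one.
 * Deferred behaviour (the bug): stash the old FD and close it later,
 * leaving a window in which the stale FD can still be selected on.
 * Immediate behaviour (the fix): close the old descriptor right away. */
static void replace_client_fd(struct client_conn *c, int new_fd)
{
        if (c->fd >= 0 && c->fd != new_fd)
                close(c->fd);   /* don't defer: close the old FD now */
        c->fd = new_fd;
}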
I'm seeing this again lately. On a three node cluster (link-01, link-02, link-08) link-01 was shot by revolver; it joined the cluster but hangs when attempting to activate the VGs:

link-01:
Starting ccsd: ip_tables: (C) 2000-2002 Netfilter core team        [  OK  ]
Starting cman: CMAN 2.6.9-37.0 (built Jul 5 2005 12:20:39) installed
CMAN: quorum regained, resuming activity
DLM 2.6.9-35.0 (built Jul 5 2005 12:29:45) installed                [  OK  ]
Starting lock_gulmd: [WARNING]
Starting fence domain: [  OK  ]
Starting clvmd: [  OK  ]
Activating VGs: [HANG]

From another node in the cluster:

[root@link-02 ~]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    3   M   link-01
   2    1    3   M   link-08
   3    1    3   M   link-02

[root@link-02 ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[3 2 1]

DLM Lock Space:  "clvmd"                             3   4 run       -
[3 2 1]

DLM Lock Space:  "gfs0"                              4   5 run       -
[3 2]

GFS Mount Group: "gfs0"                              5   6 run       -
[3 2]

I'll try and gather more info.
Interestingly I had this quite often on my Fedora Xen cluster, but it went away when I upgraded the RPMs to:

lvm2-2.01.12-1.0
lvm2-cluster-2.01.09-5.0
dlm-1.0-0.pre21.10
dlm-kernel-xenU-2.6.11.5-20050601.152643.FC4.2
cman-1.0-0.pre33.15
cman-kernel-xenU-2.6.11.5-20050601.152643.FC4.2

from:

cman-kernel-xenU-2.6.11.4-20050517.141233.FC4.3
dlm-kernel-xenU-2.6.11.3-20050425.154843.FC4.16

Though, to be honest, I'm not sure what the difference was (and I don't have the old lvm2-cluster package version, sorry).
Although I still have no new info for you :( (as any debugging attempt causes this not to appear), I'd like to bring attention to this issue again, as we are still seeing it regularly with the init scripts turned on.
I'm trying to figure out if this is in any way peculiar to any hardware type. I see link-01/08 are dual-proc AMD86 boxes. As I only have UP x86 boxes, it may not be surprising that I can't reproduce this. (I do have a single dual x86 box but it's currently too hot here to run it for any length of time.) Have you seen this on other machines?
Just in case (I'm off to the UKUUG Linux conference very soon) I've dropped a debugging version of clvmd in /root of link-08. You'll need to run it as

  clvmd -d 2>/root/clvmd.log &

to capture the output. If you can make it happen with this running, the log files (from all machines) should show what is going on.
The good news is that I've been able to reproduce this on my SMP machine with 3 SMP Xen VMs. The bad news is that it seems to be some strange pthread interaction. Anyway, I'm on the case.
OK, try this:

Checking in clvmd.c;
/cvs/lvm2/LVM2/daemons/clvmd/clvmd.c,v  <--  clvmd.c
new revision: 1.25; previous revision: 1.24
done

clvmd is now a little less cavalier in its signalling of sub-threads: if a thread is known to be waiting nicely, there's no need to signal it; just notifying the condition variable will do. You may have to wait for agk to include this patch in the lvm2-cluster package.
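To illustrate the distinction being made there, here is a hedged, self-contained sketch (the struct, field and function names are made up, not the real clvmd thread code): a worker that is already blocked in pthread_cond_wait() only needs the shared flag set and the condition variable signalled under the mutex; sending it an asynchronous signal (e.g. via pthread_kill) on top of that is unnecessary and is the kind of extra wake-up the checkin removes.

/* Illustrative sketch only -- not the actual clvmd code. */
#include <pthread.h>
#include <stdbool.h>

struct worker {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        bool            work_ready;   /* protected by lock */
};

/* Worker side: block "nicely" until told there is work to do. */
static void worker_wait(struct worker *w)
{
        pthread_mutex_lock(&w->lock);
        while (!w->work_ready)
                pthread_cond_wait(&w->cond, &w->lock);
        w->work_ready = false;
        pthread_mutex_unlock(&w->lock);
}

/* Dispatcher side: notifying the condition variable is enough;
 * no pthread_kill() is needed for a thread parked in worker_wait(). */
static void worker_wake(struct worker *w)
{
        pthread_mutex_lock(&w->lock);
        w->work_ready = true;
        pthread_cond_signal(&w->cond);
        pthread_mutex_unlock(&w->lock);
}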