Red Hat Bugzilla – Bug 470417
clvmd deadlocks during startup during revolver runs
Last modified: 2016-04-26 10:05:27 EDT
Description of problem:
Running revolver with ckpt-fixed openais results in an apparent hang of clvmd during startup.
Version-Release number of selected component (if applicable):
Nov 6 19:51:48 bench-02 kernel: Lock_DLM (built Oct 14 2008 15:12:40) installed
How reproducible:
Just run revolver until it locks up.
Steps to Reproduce:
1. Set up a 3-node revolver run with plock load.
2. Wait for 5.x or more iterations, until revolver fails with a "deadlock on node X" message.
3. Make sure you're connected to the terminal server output of the three nodes so you can watch the startup process. You will see the following:
Loading modules... DLM (built Oct 27 2008 22:03:27) installed
GFS2 (built Oct 27 2008 22:04:01) installed
Mounting configfs... done
Starting ccsd... done
Starting cman... done
Starting daemons... done
Starting fencing... done
[ OK ]
Starting system message bus: [ OK ]
Starting clvmd: dlm: Using TCP for communications
dlm: connecting to 2
dlm: got connection from 2
dlm: got connection from 3
[ OK ]
Notice that the last step is clvmd starting, after which I would expect to see something about the VG being activated. Instead clvmd deadlocks, causing revolver to fail, and the node either never comes up or gets fenced as a result of its failure to start. If the hang eventually clears, the node will continue and operate normally.
[root@bench-02 ~]# cman_tool nodes
Node Sts Inc Joined Name
1 M 836 2008-11-06 19:58:43 bench-01
2 M 820 2008-11-06 19:51:31 bench-02
3 M 840 2008-11-06 19:58:43 bench-03
[root@bench-02 ~]# cman_tool status
Config Version: 1
Cluster Name: bench-123
Cluster Id: 50595
Cluster Member: Yes
Cluster Generation: 840
Membership state: Cluster-Member
Expected votes: 3
Total votes: 3
Active subsystems: 8
Ports Bound: 0 11
Node name: bench-02
Node ID: 2
Multicast addresses: 220.127.116.11
Node addresses: 10.15.84.22
[root@bench-02 ~]# group_tool info
type level name id state
fence 0 default 00010001 none
[1 2 3]
dlm 1 clvmd 00020001 none
[1 2 3]
dlm 1 bench-1230 00040001 none
dlm 1 bench-1231 00060001 none
dlm 1 bench-1232 00080001 none
gfs 2 bench-1230 00030001 none
gfs 2 bench-1231 00050001 none
gfs 2 bench-1232 00070001 none
(above from node 2 in the cluster)
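If it helps, the same state can be gathered from all three nodes in one go; a minimal sketch, assuming root ssh access between the bench nodes:

for n in bench-01 bench-02 bench-03; do
    echo "=== $n ==="
    ssh $n 'cman_tool nodes; cman_tool status; group_tool info'
done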
Note: I missed this in the log, but the node also printed the following during startup, before the messages shown above:
Setting clock (utc): Thu Nov 6 19:58:05 CST 2008 [ OK ]
Starting udev: [ OK ]
Loading default keymap (us): [ OK ]
Setting hostname bench-01: [ OK ]
Setting up Logical Volume Management: connect() failed on local socket: Connection refused
WARNING: Falling back to local file-based locking.
Volume Groups with the clustered attribute will be inaccessible.
2 logical volume(s) in volume group "VolGroup00" now active
Locking inactive: ignoring clustered volume group bench-123
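The "Falling back to local file-based locking" warning appears because clvmd was not yet reachable when the boot scripts activated the local VG. For reference, the relevant lvm.conf settings look roughly like this (values are illustrative, not copied from the bench nodes):

# global section of /etc/lvm/lvm.conf (illustrative)
global {
    # 3 = built-in clustered locking via clvmd
    locking_type = 3
    # allow boot-time activation of non-clustered VGs to fall back to
    # file-based locking when clvmd is not running yet -- this is what
    # produces the warning above
    fallback_to_local_locking = 1
}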
Then after about 10-15 minutes I got the following messages:
dlm: closing connection to node 3
Activating VGs: 2 logical volume(s) in volume group "VolGroup00" now active
3 logical volume(s) in volume group "bench-123" now active
[ OK ]
and booting continued normally, but the hang still caused revolver to fail.
Can you enable clvmd debugging and attach the output from a couple of nodes?
clvmd -d on startup will send logging information to stderr.
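For example, to capture the debug output to a file for attachment (the path is just a suggestion):

# run clvmd in debug mode and keep its stderr
clvmd -d 2> /tmp/clvmd-debug.log &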
Yes, expect info in a day or so; I'm a little busy trying to wrap up before I head out on vacation. Also, you're welcome to use my cluster and reproduction test case while I am out.
I can reproduce this on your nodes, fortunately (though unsurprisingly). Adding instrumentation stops it happening, though!
However, I have identified a potential startup race with the LVM thread and am testing a fix.
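As a general illustration of this kind of race (this is not the actual clvmd patch, and names such as lvm_thread_fn are placeholders), the usual fix is to make the main thread wait on a condition variable until the worker thread signals that its initialisation is complete:

/* sketch.c -- illustrative only, not the clvmd code */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  init_done = PTHREAD_COND_INITIALIZER;
static int initialised = 0;

/* stand-in for clvmd's LVM thread */
static void *lvm_thread_fn(void *arg)
{
    /* ... expensive LVM initialisation happens here ... */

    pthread_mutex_lock(&init_lock);
    initialised = 1;
    pthread_cond_signal(&init_done);
    pthread_mutex_unlock(&init_lock);

    /* ... main work loop ... */
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, lvm_thread_fn, NULL);

    /* Without this wait, the main thread can race ahead and hand
     * requests to a half-initialised worker -- the shape of the
     * startup race described above. */
    pthread_mutex_lock(&init_lock);
    while (!initialised)
        pthread_cond_wait(&init_done, &init_lock);
    pthread_mutex_unlock(&init_lock);

    printf("worker initialised; safe to proceed\n");
    pthread_join(tid, NULL);
    return 0;
}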
I've checked in the fix I have. It doesn't seem to fully fix the problem, but it does make it MUCH harder to reproduce!
Checking in WHATS_NEW;
/cvs/lvm2/LVM2/WHATS_NEW,v <-- WHATS_NEW
new revision: 1.999; previous revision: 1.998
Checking in daemons/clvmd/clvmd.c;
/cvs/lvm2/LVM2/daemons/clvmd/clvmd.c,v <-- clvmd.c
new revision: 1.52; previous revision: 1.51
Fix in lvm2-cluster-2_02_40-7_el5.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.