Bug 470417 - clvmd deadlocks during startup during revolver runs
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: lvm2-cluster
Version: 5.3
Hardware: All
OS: Linux
Priority: high
Severity: medium
Target Milestone: rc
Assigned To: Christine Caulfield
QA Contact: Cluster QE
Reported: 2008-11-06 21:25 EST by Steven Dake
Modified: 2016-04-26 10:05 EDT (History)
14 users

Fixed In Version: lvm2-cluster-2.02.40-7.el5
Doc Type: Bug Fix
Last Closed: 2009-01-20 15:55:38 EST

Attachments: None
Description Steven Dake 2008-11-06 21:25:12 EST
Description of problem:
Running revolver with ckpt-fixed openais results in an apparent hang of clvmd during startup.

Version-Release number of selected component (if applicable):
lvm2-cluster-2.02.40-6.el5
Nov  6 19:51:48 bench-02 kernel: Lock_DLM (built Oct 14 2008 15:12:40) installed


How reproducible:
Run revolver until a node locks up.

Steps to Reproduce:
1. Set up a 3-node revolver run with plock load.
2. Wait for 5+ iterations, until revolver fails with a "deadlock on node X" message.
3. Make sure you're connected to the terminal-server output of the three nodes so you can watch the startup process. You will see the following:

  Starting cluster: 
   Loading modules... DLM (built Oct 27 2008 22:03:27) installed
GFS2 (built Oct 27 2008 22:04:01) installed
done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing... done
[  OK  ]
Starting system message bus: [  OK  ]
Starting clvmd: dlm: Using TCP for communications
dlm: connecting to 2
dlm: got connection from 2
dlm: got connection from 3
[  OK  ]

<DEADLOCKS HERE>

Notice that the last step is clvmd starting, after which I would expect to see a message about the VGs being activated.
Actual results:
The deadlock causes revolver to fail; the node never comes up, or is fenced as a result of its failure to start.

Expected results:
The node continues booting and operates normally.

Additional info:
[root@bench-02 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    836   2008-11-06 19:58:43  bench-01
   2   M    820   2008-11-06 19:51:31  bench-02
   3   M    840   2008-11-06 19:58:43  bench-03
[root@bench-02 ~]# cman_tool status
Version: 6.1.0
Config Version: 1
Cluster Name: bench-123
Cluster Id: 50595
Cluster Member: Yes
Cluster Generation: 840
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Quorum: 2  
Active subsystems: 8
Flags: Dirty 
Ports Bound: 0 11  
Node name: bench-02
Node ID: 2
Multicast addresses: 239.192.197.105 
Node addresses: 10.15.84.22 

[root@bench-02 ~]# group_tool info
type             level name        id       state       
fence            0     default     00010001 none        
[1 2 3]
dlm              1     clvmd       00020001 none        
[1 2 3]
dlm              1     bench-1230  00040001 none        
[2]
dlm              1     bench-1231  00060001 none        
[2]
dlm              1     bench-1232  00080001 none        
[2]
gfs              2     bench-1230  00030001 none        
[2]
gfs              2     bench-1231  00050001 none        
[2]
gfs              2     bench-1232  00070001 none        
[2]

(above from node 2 in the cluster)
Comment 1 Steven Dake 2008-11-06 21:58:54 EST
Note: I missed this in the log, but it also printed the following during startup, before the messages above:

Setting clock  (utc): Thu Nov  6 19:58:05 CST 2008 [  OK  ]
Starting udev: [  OK  ]
Loading default keymap (us): [  OK  ]
Setting hostname bench-01:  [  OK  ]
Setting up Logical Volume Management:   connect() failed on local socket: Connection refused
  WARNING: Falling back to local file-based locking.
  Volume Groups with the clustered attribute will be inaccessible.
  2 logical volume(s) in volume group "VolGroup00" now active
  Locking inactive: ignoring clustered volume group bench-123


Then after about 10-15 minutes got the following message:

dlm: closing connection to node 3
Activating VGs:   2 logical volume(s) in volume group "VolGroup00" now active
  3 logical volume(s) in volume group "bench-123" now active
[  OK  ]

and booting continued normally, but revolver still recorded a failure.
Comment 3 Christine Caulfield 2008-11-11 11:31:52 EST
Can you enable clvmd debugging and attach the output from a couple of nodes, please?

clvmd -d on startup will send logging information to stderr.
Comment 4 Steven Dake 2008-11-11 12:17:35 EST
Yes, expect info in a day or so; I'm a little busy trying to wrap up before I head out on vacation. Also, you're welcome to use my cluster and reproduction test case while I am out.

Regards
-steve
Comment 6 Christine Caulfield 2008-11-19 10:55:15 EST
I can reproduce this on your nodes, fortunately (though unsurprisingly). Adding instrumentation stops it happening, though!

However, I have identified a potential startup race with the LVM thread and am testing a fix.
Comment 7 Christine Caulfield 2008-11-21 08:58:33 EST
I've checked in the fix I have. It doesn't seem to fully fix the problem, but it does make it MUCH harder to reproduce!

Checking in WHATS_NEW;                                             
/cvs/lvm2/LVM2/WHATS_NEW,v  <--  WHATS_NEW
new revision: 1.999; previous revision: 1.998
done
Checking in daemons/clvmd/clvmd.c;
/cvs/lvm2/LVM2/daemons/clvmd/clvmd.c,v  <--  clvmd.c
new revision: 1.52; previous revision: 1.51
done
Comment 9 Milan Broz 2008-11-26 08:17:12 EST
Fix in lvm2-cluster-2.02.40-7.el5.
Comment 12 errata-xmlrpc 2009-01-20 15:55:38 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0100.html
