Bug 470417 - clvmd deadlocks during startup during revolver runs
Summary: clvmd deadlocks during startup during revolver runs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: lvm2-cluster
Version: 5.3
Hardware: All
OS: Linux
Priority: high
Severity: medium
Target Milestone: rc
Assignee: Christine Caulfield
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2008-11-07 02:25 UTC by Steven Dake
Modified: 2016-04-26 14:05 UTC (History)
CC List: 14 users

Fixed In Version: lvm2-cluster-2.02.40-7.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-01-20 20:55:38 UTC
Target Upstream Version:
Embargoed:


Attachments


Links:
Red Hat Product Errata RHBA-2009:0100 (normal, SHIPPED_LIVE): lvm2-cluster bug-fix and enhancement update. Last updated 2009-01-20 16:04:34 UTC.

Description Steven Dake 2008-11-07 02:25:12 UTC
Description of problem:
Running revolver with ckpt-fixed openais results in an apparent hang of clvmd during startup.

Version-Release number of selected component (if applicable):
lvm2-cluster-2.02.40-6.el5
Nov  6 19:51:48 bench-02 kernel: Lock_DLM (built Oct 14 2008 15:12:40) installed


How reproducible:
Just run revolver until clvmd hangs during a node's startup.

Steps to Reproduce:
1. Set up a 3-node revolver run with plock load.
2. Wait 5+ iterations, until revolver fails with a "deadlock on node X" message.
3. Make sure you're connected to the terminal server output of the three nodes so you can see the startup process. You will find the following:

  Starting cluster: 
   Loading modules... DLM (built Oct 27 2008 22:03:27) installed
GFS2 (built Oct 27 2008 22:04:01) installed
done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing... done
[  OK  ]
Starting system message bus: [  OK  ]
Starting clvmd: dlm: Using TCP for communications
dlm: connecting to 2
dlm: got connection from 2
dlm: got connection from 3
[  OK  ]

<DEADLOCKS HERE>

Notice that the last step is clvmd starting, after which I would expect to see a message about the volume group being activated.
Actual results:
clvmd deadlocks, causing revolver to fail; the node never comes up, or is fenced as a result of its failure to finish starting.

Expected results:
The node continues booting and operates normally.

Additional info:
[root@bench-02 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    836   2008-11-06 19:58:43  bench-01
   2   M    820   2008-11-06 19:51:31  bench-02
   3   M    840   2008-11-06 19:58:43  bench-03
[root@bench-02 ~]# cman_tool status
Version: 6.1.0
Config Version: 1
Cluster Name: bench-123
Cluster Id: 50595
Cluster Member: Yes
Cluster Generation: 840
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Quorum: 2  
Active subsystems: 8
Flags: Dirty 
Ports Bound: 0 11  
Node name: bench-02
Node ID: 2
Multicast addresses: 239.192.197.105 
Node addresses: 10.15.84.22 

[root@bench-02 ~]# group_tool info
type             level name        id       state       
fence            0     default     00010001 none        
[1 2 3]
dlm              1     clvmd       00020001 none        
[1 2 3]
dlm              1     bench-1230  00040001 none        
[2]
dlm              1     bench-1231  00060001 none        
[2]
dlm              1     bench-1232  00080001 none        
[2]
gfs              2     bench-1230  00030001 none        
[2]
gfs              2     bench-1231  00050001 none        
[2]
gfs              2     bench-1232  00070001 none        
[2]

(above from node 2 in the cluster)

Comment 1 Steven Dake 2008-11-07 02:58:54 UTC
Note: I missed this in the log, but it also printed the following during startup, before the messages shown above:

Setting clock  (utc): Thu Nov  6 19:58:05 CST 2008 [  OK  ]
Starting udev: [  OK  ]
Loading default keymap (us): [  OK  ]
Setting hostname bench-01:  [  OK  ]
Setting up Logical Volume Management:   connect() failed on local socket: Connection refused
  WARNING: Falling back to local file-based locking.
  Volume Groups with the clustered attribute will be inaccessible.
  2 logical volume(s) in volume group "VolGroup00" now active
  Locking inactive: ignoring clustered volume group bench-123
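
The "connect() failed on local socket" message here just means that the boot-time LVM activation could not reach clvmd, which has not yet been started at that point in the boot sequence, so LVM fell back to file-based locking and skipped the clustered volume group bench-123. For reference, the usual cluster-node locking settings look something like this (illustrative values only, assuming a stock /etc/lvm/lvm.conf configured for clvmd):

[root@bench-02 ~]# grep -E 'locking_type|fallback_to_local_locking' /etc/lvm/lvm.conf
    locking_type = 3
    fallback_to_local_locking = 1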


Then, after about 10-15 minutes, the following message appeared:

dlm: closing connection to node 3
Activating VGs:   2 logical volume(s) in volume group "VolGroup00" now active
  3 logical volume(s) in volume group "bench-123" now active
[  OK  ]

and booting continued normally, but the delay still caused revolver to fail.

Comment 3 Christine Caulfield 2008-11-11 16:31:52 UTC
Can you enable clvmd debugging and attach the output from a couple of nodes, please?

clvmd -d on startup will send logging information to stderr.
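
For example, something like this on each node should capture it to a file (the log path is arbitrary, and this assumes the stock clvmd init script is in use):

service clvmd stop
clvmd -d 2> /tmp/clvmd-debug.log &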

Comment 4 Steven Dake 2008-11-11 17:17:35 UTC
Yes, expect info in a day or so; I'm a little busy trying to wrap up before I head out for vacation. Also, you're welcome to use my cluster and reproduction test case while I am out.

Regards
-steve

Comment 6 Christine Caulfield 2008-11-19 15:55:15 UTC
I can reproduce this on your nodes, fortunately (though unsurprisingly). Adding instrumentation stops it happening, though!

However, I have identified a potential startup race with the LVM thread and am testing a fix.

Comment 7 Christine Caulfield 2008-11-21 13:58:33 UTC
I've checked in the fix I have. It doesn't seem to fully fix the problem, but it does make it MUCH harder to reproduce!

Checking in WHATS_NEW;                                             
/cvs/lvm2/LVM2/WHATS_NEW,v  <--  WHATS_NEW
new revision: 1.999; previous revision: 1.998
done
Checking in daemons/clvmd/clvmd.c;
/cvs/lvm2/LVM2/daemons/clvmd/clvmd.c,v  <--  clvmd.c
new revision: 1.52; previous revision: 1.51
done

Comment 9 Milan Broz 2008-11-26 13:17:12 UTC
Fixed in lvm2-cluster-2.02.40-7.el5.
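
To double-check that a rebuilt node actually picked up the fixed package, something like this works (the output line is illustrative):

[root@bench-02 ~]# rpm -q lvm2-cluster
lvm2-cluster-2.02.40-7.el5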

Comment 12 errata-xmlrpc 2009-01-20 20:55:38 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0100.html

