Bug 476697

Summary: clvmd deadlocks during startup during revolver runs
Product: [Retired] Red Hat Cluster Suite
Reporter: Milan Broz <mbroz>
Component: lvm2-cluster
Assignee: Christine Caulfield <ccaulfie>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Docs Contact:
Priority: high
Version: 4
CC: agk, ccaulfie, dwysocha, edamato, jbrassow, mbroz, prockai, pvrabec
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-05-18 21:11:47 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Milan Broz 2008-12-16 16:02:42 UTC
+++ This bug was initially created as a clone of Bug #470417 +++
rhel4 clone 

Description of problem:
Running revolver with ckpt-fixed openais results in an apparent hang of clvmd during startup.

Version-Release number of selected component (if applicable):
lvm2-cluster-2.02.40-6.el5
Nov  6 19:51:48 bench-02 kernel: Lock_DLM (built Oct 14 2008 15:12:40) installed


How reproducible:
just run revolver until it locks

Steps to Reproduce:
1. Set up a 3-node revolver run with plock load.
2. Wait 5.X+ iterations, until revolver fails with a "deadlock on node X" message.
3. Make sure you're connected to the terminal server output of the three nodes so you can see the startup process.  You will find the following:

  Starting cluster: 
   Loading modules... DLM (built Oct 27 2008 22:03:27) installed
GFS2 (built Oct 27 2008 22:04:01) installed
done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing... done
[  OK  ]
Starting system message bus: [  OK  ]
Starting clvmd: dlm: Using TCP for communications
dlm: connecting to 2
dlm: got connection from 2
dlm: got connection from 3
[  OK  ]

<DEADLOCKS HERE>

Notice that the last step is clvmd starting, after which I would expect to see something about a VG being activated.

Actual results:
clvmd deadlocks, causing revolver to fail; the node never comes up, and it is not fenced as a result of its failure to start.

Expected results:
node will continue and operate normally.

Additional info:
[root@bench-02 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    836   2008-11-06 19:58:43  bench-01
   2   M    820   2008-11-06 19:51:31  bench-02
   3   M    840   2008-11-06 19:58:43  bench-03
[root@bench-02 ~]# cman_tool status
Version: 6.1.0
Config Version: 1
Cluster Name: bench-123
Cluster Id: 50595
Cluster Member: Yes
Cluster Generation: 840
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Quorum: 2  
Active subsystems: 8
Flags: Dirty 
Ports Bound: 0 11  
Node name: bench-02
Node ID: 2
Multicast addresses: 239.192.197.105 
Node addresses: 10.15.84.22 

[root@bench-02 ~]# group_tool info
type             level name        id       state       
fence            0     default     00010001 none        
[1 2 3]
dlm              1     clvmd       00020001 none        
[1 2 3]
dlm              1     bench-1230  00040001 none        
[2]
dlm              1     bench-1231  00060001 none        
[2]
dlm              1     bench-1232  00080001 none        
[2]
gfs              2     bench-1230  00030001 none        
[2]
gfs              2     bench-1231  00050001 none        
[2]
gfs              2     bench-1232  00070001 none        
[2]

(above from node 2 in the cluster)
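
For anyone comparing this state across nodes, the `group_tool info` output above can be turned into records with a small script. This is only a minimal sketch: the function name `parse_group_tool_info` is hypothetical, and it assumes the layout shown above (five whitespace-separated columns per group, followed by a bracketed line of member node IDs). Other versions of group_tool may print additional or differently shaped lines.

```python
import re


def parse_group_tool_info(text):
    """Parse `group_tool info` output (layout assumed from this report)."""
    groups = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("type"):
            continue  # skip blank lines and the column header
        members = re.fullmatch(r"\[([\d\s]*)\]", line)
        if members:
            # A bracketed line lists the member node IDs of the previous group.
            if groups:
                groups[-1]["members"] = [int(n) for n in members.group(1).split()]
            continue
        fields = line.split()
        if len(fields) == 5:  # assumption: type, level, name, id, state
            gtype, level, name, gid, state = fields
            groups.append({"type": gtype, "level": int(level), "name": name,
                           "id": gid, "state": state, "members": []})
    return groups


# Sample copied from the first lines of the output in this report.
SAMPLE = """\
type             level name        id       state
fence            0     default     00010001 none
[1 2 3]
dlm              1     clvmd       00020001 none
[1 2 3]
dlm              1     bench-1230  00040001 none
[2]
"""

if __name__ == "__main__":
    for g in parse_group_tool_info(SAMPLE):
        print(g["type"], g["name"], g["state"], g["members"])
```

In the output above, the clvmd lockspace already lists all three nodes while the GFS lockspaces list only node 2; a parser like this makes that kind of difference easy to spot when collecting state from several nodes.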


--- Additional comment from ccaulfie on 2008-11-21 08:58:33 EDT ---

I've checked in the fix I have. It doesn't seem to fully fix the problem, but it does make it MUCH harder to reproduce!

Checking in WHATS_NEW;                                             
/cvs/lvm2/LVM2/WHATS_NEW,v  <--  WHATS_NEW
new revision: 1.999; previous revision: 1.998
done
Checking in daemons/clvmd/clvmd.c;
/cvs/lvm2/LVM2/daemons/clvmd/clvmd.c,v  <--  clvmd.c
new revision: 1.52; previous revision: 1.51
done

Comment 1 Milan Broz 2008-12-16 17:51:36 UTC
In CVS - lvm2-cluster-2.02.42-1.el4

Comment 4 errata-xmlrpc 2009-05-18 21:11:47 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1047.html