Bug 476697 - clvmd deadlocks during startup during revolver runs
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: lvm2-cluster
Hardware/OS: All Linux
Priority: high  Severity: medium
Assigned To: Christine Caulfield
Cluster QE
Depends On:
Reported: 2008-12-16 11:02 EST by Milan Broz
Modified: 2013-02-28 23:07 EST
CC: 8 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2009-05-18 17:11:47 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments: None
Description Milan Broz 2008-12-16 11:02:42 EST
+++ This bug was initially created as a clone of Bug #470417 +++
rhel4 clone 

Description of problem:
Running revolver with ckpt-fixed openais results in an apparent hang of clvmd during startup.

Version-Release number of selected component (if applicable):
Nov  6 19:51:48 bench-02 kernel: Lock_DLM (built Oct 14 2008 15:12:40) installed

How reproducible:
Just run revolver until it locks up.

Steps to Reproduce:
1. Set up a 3-node revolver run with plock load.
2. Wait until iteration 5.X+, when revolver fails with a "deadlock on node X" message.
3. Make sure you're connected to the terminal server output of the three nodes so you can see the startup process. You will find the following:

  Starting cluster: 
   Loading modules... DLM (built Oct 27 2008 22:03:27) installed
GFS2 (built Oct 27 2008 22:04:01) installed
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing... done
[  OK  ]
Starting system message bus: [  OK  ]
Starting clvmd: dlm: Using TCP for communications
dlm: connecting to 2
dlm: got connection from 2
dlm: got connection from 3
[  OK  ]
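Step 3 above can be automated instead of watching the console by hand. The sketch below is an assumption, not part of the original report: it watches a captured serial-console log for a VG-activation line within a timeout, and reports a hang if the line never appears. The log path, timeout, and the exact "Activating VG" marker string are all hypothetical.

```shell
# Hedged sketch: poll a captured console log for the line that should
# follow "Starting clvmd" (VG activation). Returns 0 if activation was
# seen, 1 if the timeout expired (i.e. clvmd appears hung).
wait_for_vg_activation() {
    log=$1
    timeout=${2:-120}     # seconds to wait before declaring a hang
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # Marker string is an assumption; adjust to the real message.
        if grep -q 'Activating VG' "$log" 2>/dev/null; then
            echo "clvmd startup completed"
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "clvmd appears hung: no VG activation after ${timeout}s"
    return 1
}
```

Usage would be something like `wait_for_vg_activation /tmp/bench-02-console.log 120` run against whatever file the terminal server writes for each node.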


Notice that the last step is clvmd starting, after which I would expect to see something about the VGs being activated.
Actual results:
clvmd deadlocks, causing revolver to fail; the node never comes up, or is fenced as a result of its failure to start.

Expected results:
The node continues booting and operates normally.

Additional info:
[root@bench-02 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    836   2008-11-06 19:58:43  bench-01
   2   M    820   2008-11-06 19:51:31  bench-02
   3   M    840   2008-11-06 19:58:43  bench-03
[root@bench-02 ~]# cman_tool status
Version: 6.1.0
Config Version: 1
Cluster Name: bench-123
Cluster Id: 50595
Cluster Member: Yes
Cluster Generation: 840
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Quorum: 2  
Active subsystems: 8
Flags: Dirty 
Ports Bound: 0 11  
Node name: bench-02
Node ID: 2
Multicast addresses: 
Node addresses: 

[root@bench-02 ~]# group_tool info
type             level name        id       state       
fence            0     default     00010001 none        
[1 2 3]
dlm              1     clvmd       00020001 none        
[1 2 3]
dlm              1     bench-1230  00040001 none        
dlm              1     bench-1231  00060001 none        
dlm              1     bench-1232  00080001 none        
gfs              2     bench-1230  00030001 none        
gfs              2     bench-1231  00050001 none        
gfs              2     bench-1232  00070001 none        

(above from node 2 in the cluster)
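The `cman_tool nodes` table above shows all three nodes with status `M` (member). As a hypothetical convenience, that check can be scripted; the helper below is a sketch assuming the column layout shown above (status in the second column) and is not part of the original report.

```shell
# Hedged sketch: read `cman_tool nodes` output from stdin and exit 0
# only if every listed node has status "M" (cluster member).
all_members() {
    # Skip the header row; flag any data row whose Sts column is not "M".
    awk 'NR > 1 && NF >= 2 && $2 != "M" { bad = 1 } END { exit bad }'
}
```

It can be fed either live (`cman_tool nodes | all_members`) or from a captured copy of the output when triaging logs after a failed revolver iteration.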

--- Additional comment from ccaulfie@redhat.com on 2008-11-21 08:58:33 EDT ---

I've checked in the fix I have. It doesn't seem to fully fix the problem, but it does make it MUCH harder to reproduce!

Checking in WHATS_NEW;                                             
/cvs/lvm2/LVM2/WHATS_NEW,v  <--  WHATS_NEW
new revision: 1.999; previous revision: 1.998
Checking in daemons/clvmd/clvmd.c;
/cvs/lvm2/LVM2/daemons/clvmd/clvmd.c,v  <--  clvmd.c
new revision: 1.52; previous revision: 1.51
Comment 1 Milan Broz 2008-12-16 12:51:36 EST
In CVS - lvm2-cluster-2.02.42-1.el4
Comment 4 errata-xmlrpc 2009-05-18 17:11:47 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

