Bug 207197 - Cman will hang initializing the daemons if started on all nodes simultaneously
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: Christine Caulfield
QA Contact: Cluster QE
Depends On:
Blocks:
Reported: 2006-09-19 17:22 EDT by Josef Bacik
Modified: 2009-04-16 18:29 EDT
CC List: 4 users

See Also:
Fixed In Version: 5.0.0
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2006-11-28 16:31:56 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Josef Bacik 2006-09-19 17:22:58 EDT
Description of problem:


Version-Release number of selected component (if applicable):
This isn't specifically CMAN, but rather the daemons that CMAN starts.  I have a
two-node cluster; if I start the nodes up with the cluster services stopped and
then run 'service cman start' on both nodes at the same time, they both hang at:
[root@rh5cluster1 ~]# service cman start
Starting cluster: 
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... 

fenced is started, so I'm assuming it's stuck at the dlm_controld stage;
dlm_controld isn't showing up in ps.  The only messages I get are these:

Sep 19 15:44:50 rh5cluster1 ccsd[2718]: Initial status:: Quorate 
Sep 19 15:49:18 rh5cluster1 kernel: DLM (built Aug 30 2006 18:19:57) installed
Sep 19 15:49:18 rh5cluster1 kernel: GFS2 (built Aug 30 2006 18:20:27) installed
Sep 19 15:49:18 rh5cluster1 kernel: Lock_DLM (built Aug 30 2006 18:20:36) installed
Sep 19 15:49:18 rh5cluster1 kernel: dlm: no local IP address has been set
Sep 19 15:49:18 rh5cluster1 kernel: dlm: cannot start dlm lowcomms -22

and

Sep 19 15:44:50 rh5cluster2 openais[2715]: [TOTEM] entering OPERATIONAL state.
Sep 19 15:44:50 rh5cluster2 openais[2715]: [CLM  ] got nodejoin message 10.10.1.12
Sep 19 15:44:50 rh5cluster2 openais[2715]: [CLM  ] got nodejoin message 10.10.1.13
Sep 19 15:44:50 rh5cluster2 kernel: DLM (built Aug 30 2006 18:19:57) installed
Sep 19 15:44:50 rh5cluster2 kernel: GFS2 (built Aug 30 2006 18:20:27) installed
Sep 19 15:44:50 rh5cluster2 kernel: Lock_DLM (built Aug 30 2006 18:20:36) installed

I will look into it more tomorrow.

How reproducible:
Every time

Steps to Reproduce:
1. Bring up the nodes without the cluster services enabled.
2. Open an ssh session to both nodes and type 'service cman start', but don't hit enter.
3. Hit enter in both terminals as close together as possible.
  
Actual results:
The script hangs at "Starting daemons..."

Expected results:
It shouldn't hang

Additional info:
I'm using the newest packages in brew, Cman 2.0.16.
Comment 1 Christine Caulfield 2006-09-21 08:40:33 EDT
Sep 19 15:49:18 rh5cluster1 kernel: dlm: no local IP address has been set
Sep 19 15:49:18 rh5cluster1 kernel: dlm: cannot start dlm lowcomms -22

Obviously those are the key messages. If dlm_controld isn't running then that
might explain why the DLM hasn't been configured - perhaps it crashed?

A debug log from dlm_controld would be really helpful here if you can get one.
Comment 2 Christine Caulfield 2006-09-21 08:58:43 EDT
Oh, and it's also worth checking whether configfs is mounted. The times I've seen
this message, I've found that configfs hadn't been mounted for some reason.
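
For illustration, a minimal sketch of that kind of configfs check, assuming a
scan of /proc/mounts; this is not code from the cman or dlm_controld sources.

/*
 * Illustrative sketch only: report whether configfs is mounted by
 * scanning /proc/mounts, i.e. the kind of check suggested above.
 */
#include <stdio.h>
#include <string.h>

static int configfs_mounted(void)
{
        FILE *f = fopen("/proc/mounts", "r");
        char dev[64], path[256], type[64];
        int found = 0;

        if (!f)
                return 0;

        /* Each line is "<device> <mountpoint> <fstype> <options> 0 0" */
        while (fscanf(f, "%63s %255s %63s %*[^\n]", dev, path, type) == 3) {
                if (strcmp(type, "configfs") == 0) {
                        found = 1;
                        break;
                }
        }
        fclose(f);
        return found;
}

int main(void)
{
        printf("configfs %s mounted\n", configfs_mounted() ? "is" : "is NOT");
        return 0;
}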
Comment 3 Christine Caulfield 2006-09-26 10:18:48 EDT
I saw something that is possibly similar to #1 today, where a node was added to
the DLM members list before dlm_groupd knew its IP address. DLM kicked out the
error:
dlm: Initiating association with node 13
dlm: no address for nodeid 13

Is it possible there's a race here - the cman event callback arriving after
dlm_controld has decided to add the new node?
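
For illustration, a self-contained sketch of the suspected race, assuming
dlm_controld keeps a cached node list that is refreshed only from the cman
event callback; node_cache, addr_for_nodeid() and lockspace_node_joined()
are hypothetical names, not the real dlm_controld symbols.

/*
 * Hypothetical sketch of the race described above (not the real code):
 * the cache is refreshed only from the cman event callback, so a node
 * that joins a lockspace before the callback fires has no address yet
 * and the DLM reports "no address for nodeid".
 */
#include <stdio.h>

#define MAX_NODES 16

struct node_entry {
        int  nodeid;
        char addr[64];
};

/* Cache of cman nodes; refreshed by the cman event callback. */
static struct node_entry node_cache[MAX_NODES];
static int num_nodes;

/* Hypothetical lookup used when a node joins a lockspace. */
static const char *addr_for_nodeid(int nodeid)
{
        for (int i = 0; i < num_nodes; i++)
                if (node_cache[i].nodeid == nodeid)
                        return node_cache[i].addr;
        return NULL;
}

/* Called when a node is added to a lockspace. */
static void lockspace_node_joined(int nodeid)
{
        const char *addr = addr_for_nodeid(nodeid);

        if (!addr) {
                /* The join event arrived before the cman callback
                 * refreshed node_cache - the failure mode seen above. */
                printf("dlm: no address for nodeid %d\n", nodeid);
                return;
        }
        printf("configuring nodeid %d at %s\n", nodeid, addr);
}

int main(void)
{
        /* Node 13 joins before the callback has added it to the cache. */
        lockspace_node_joined(13);
        return 0;
}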
Comment 4 Kiersten (Kerri) Anderson 2006-10-03 12:54:42 EDT
Devel ACK for RHEL 5.0.0 Beta 2
Comment 5 RHEL Product and Program Management 2006-10-03 13:05:33 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux release.  Product Management has requested further review
of this request by Red Hat Engineering.  This request is not yet committed for
inclusion in release.
Comment 6 Christine Caulfield 2006-10-05 03:54:04 EDT
This is slightly hacky and I can't seem to reproduce it any more. but it should
fix the problem.

Basically, if a lockspace contains a node that dlm_controld doesn't know about,
it re-reads the cman nodes list.
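
For illustration only, a rough sketch of that retry logic; the actual change is
in the dlm_controld files checked in below, and addr_for_nodeid() /
refresh_cman_node_list() are hypothetical stand-ins, not the real functions.

/*
 * Illustrative sketch of the fix described above (not the actual commit):
 * if a lockspace member isn't in the cached cman node list, re-read the
 * list from cman and retry the lookup.
 */
#include <stddef.h>

const char *addr_for_nodeid(int nodeid);   /* look nodeid up in the cached list */
void refresh_cman_node_list(void);         /* re-read the node list from cman   */

const char *resolve_node_addr(int nodeid)
{
        const char *addr = addr_for_nodeid(nodeid);

        if (!addr) {
                /* Unknown node in the lockspace: the cache is stale, so
                 * re-read the cman nodes list and try again. */
                refresh_cman_node_list();
                addr = addr_for_nodeid(nodeid);
        }
        return addr;   /* may still be NULL if cman doesn't know the node either */
}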

Checking in action.c;
/cvs/cluster/cluster/group/dlm_controld/action.c,v  <--  action.c
new revision: 1.7; previous revision: 1.6
done
Checking in dlm_daemon.h;
/cvs/cluster/cluster/group/dlm_controld/dlm_daemon.h,v  <--  dlm_daemon.h
new revision: 1.4; previous revision: 1.3
done
Checking in member_cman.c;
/cvs/cluster/cluster/group/dlm_controld/member_cman.c,v  <--  member_cman.c
new revision: 1.3; previous revision: 1.2
done
Comment 11 Nate Straz 2007-12-13 12:22:18 EST
Moving all RHCS ver 5 bugs to RHEL 5 so we can remove RHCS v5 which never existed.
