Bug 207197

Summary: Cman will hang initializing the daemons if started on all nodes simultaneously
Product: Red Hat Enterprise Linux 5
Component: cman
Version: 5.0
Hardware: All
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Reporter: Josef Bacik <jbacik>
Assignee: Christine Caulfield <ccaulfie>
QA Contact: Cluster QE <mspqa-list>
CC: cfeist, cluster-maint, rkenna, teigland
Fixed In Version: 5.0.0
Doc Type: Bug Fix
Last Closed: 2006-11-28 21:31:56 UTC

Description Josef Bacik 2006-09-19 21:22:58 UTC
Description of problem:
This isn't specifically CMAN, but rather the daemons that CMAN starts.  I have
a two-node cluster; if I bring the nodes up with the cluster services stopped
and then run "service cman start" on both nodes at the same time, they both
hang at:
[root@rh5cluster1 ~]# service cman start
Starting cluster: 
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... 

fenced is started, so I'm assuming it's hanging at the dlm_controld stage;
dlm_controld isn't showing up in ps.  The only messages I get are these:

Sep 19 15:44:50 rh5cluster1 ccsd[2718]: Initial status:: Quorate 
Sep 19 15:49:18 rh5cluster1 kernel: DLM (built Aug 30 2006 18:19:57) installed
Sep 19 15:49:18 rh5cluster1 kernel: GFS2 (built Aug 30 2006 18:20:27) installed
Sep 19 15:49:18 rh5cluster1 kernel: Lock_DLM (built Aug 30 2006 18:20:36) installed
Sep 19 15:49:18 rh5cluster1 kernel: dlm: no local IP address has been set
Sep 19 15:49:18 rh5cluster1 kernel: dlm: cannot start dlm lowcomms -22

and

Sep 19 15:44:50 rh5cluster2 openais[2715]: [TOTEM] entering OPERATIONAL state.
Sep 19 15:44:50 rh5cluster2 openais[2715]: [CLM  ] got nodejoin message 10.10.1.12
Sep 19 15:44:50 rh5cluster2 openais[2715]: [CLM  ] got nodejoin message 10.10.1.13
Sep 19 15:44:50 rh5cluster2 kernel: DLM (built Aug 30 2006 18:19:57) installed
Sep 19 15:44:50 rh5cluster2 kernel: GFS2 (built Aug 30 2006 18:20:27) installed
Sep 19 15:44:50 rh5cluster2 kernel: Lock_DLM (built Aug 30 2006 18:20:36) installed

I will look into it more tomorrow.
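
For reference, -22 is -EINVAL.  A minimal sketch of the failure chain, using
hypothetical names (this is not the actual fs/dlm lowcomms code): the kernel
refuses to start lowcomms because userspace never configured a local address.

/* Illustrative sketch only -- names are hypothetical, not the real
 * fs/dlm code.  It shows why "no local IP address has been set" is
 * followed by "cannot start dlm lowcomms -22": dlm_controld never
 * configured a local address, so lowcomms returns -EINVAL (-22). */
#include <stdio.h>
#include <errno.h>

static int local_addr_count;    /* bumped when userspace sets an address */

static int lowcomms_start(void)
{
        if (local_addr_count == 0) {
                printf("dlm: no local IP address has been set\n");
                return -EINVAL; /* -22, as in the log above */
        }
        /* ... bind sockets and start communications ... */
        return 0;
}

int main(void)
{
        int error = lowcomms_start();

        if (error)
                printf("dlm: cannot start dlm lowcomms %d\n", error);
        return 0;
}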

How reproducible:
Every time

Steps to Reproduce:
1. Bring up the nodes without the cluster services enabled.
2. Open an ssh session to both nodes and type 'service cman start', but don't
   hit enter.
3. Hit enter in both terminals as close together as possible.
  
Actual results:
The init script hangs at "Starting daemons..."

Expected results:
It shouldn't hang.

Additional info:
I'm using the newest packages in brew, cman-2.0.16.

Comment 1 Christine Caulfield 2006-09-21 12:40:33 UTC
Sep 19 15:49:18 rh5cluster1 kernel: dlm: no local IP address has been set
Sep 19 15:49:18 rh5cluster1 kernel: dlm: cannot start dlm lowcomms -22

Obviously those are the key messages.  If dlm_controld isn't running, that
would explain why the DLM hasn't been configured - perhaps it crashed?

A debug log from dlm_controld would be really helpful here if you can get one.

Comment 2 Christine Caulfield 2006-09-21 12:58:43 UTC
Oh, and it's also worth checking whether configfs is mounted.  The times I
have seen this message, I have found that configfs hadn't mounted for some
reason.
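
A minimal sketch of one way to do that check from user space, scanning
/proc/mounts for a configfs entry (illustrative only, not dlm_controld's
actual code):

/* Illustrative: report whether configfs appears in /proc/mounts.
 * A line looks like "none /sys/kernel/config configfs rw 0 0". */
#include <stdio.h>
#include <string.h>

static int configfs_mounted(void)
{
        FILE *f = fopen("/proc/mounts", "r");
        char line[512];
        int found = 0;

        if (!f)
                return 0;
        while (fgets(line, sizeof(line), f))
                if (strstr(line, " configfs "))
                        found = 1;
        fclose(f);
        return found;
}

int main(void)
{
        printf("configfs %s mounted\n", configfs_mounted() ? "is" : "is NOT");
        return 0;
}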

Comment 3 Christine Caulfield 2006-09-26 14:18:48 UTC
I saw something possibly similar to comment #1 today, where a node was added
to the DLM members list before dlm_controld knew its IP address.  The DLM
kicked out the error:
dlm: Initiating association with node 13
dlm: no address for nodeid 13

Is it possible there's a race here - the cman event callback arriving after
dlm_controld has decided to add the new node?
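
To illustrate the suspected interleaving (all names here are hypothetical,
not the actual dlm_controld code): the lockspace gains a member before the
membership callback has populated the cached node list, so the address
lookup for that nodeid misses.

/* Hypothetical illustration of the race, not real dlm_controld code. */
#include <stdio.h>

#define MAX_NODES 16

static int known_nodeids[MAX_NODES];
static int num_known;

/* Normally run from the cman event callback. */
static void cman_event_callback(int nodeid)
{
        known_nodeids[num_known++] = nodeid;
}

static int have_address_for(int nodeid)
{
        int i;

        for (i = 0; i < num_known; i++)
                if (known_nodeids[i] == nodeid)
                        return 1;
        return 0;
}

int main(void)
{
        int new_member = 13;    /* lockspace says node 13 joined... */

        /* ...but cman_event_callback(13) hasn't fired yet, so: */
        if (!have_address_for(new_member))
                printf("dlm: no address for nodeid %d\n", new_member);

        cman_event_callback(new_member);    /* the callback arrives too late */
        return 0;
}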

Comment 4 Kiersten (Kerri) Anderson 2006-10-03 16:54:42 UTC
Devel ACK for RHEL 5.0.0 Beta 2

Comment 5 RHEL Program Management 2006-10-03 17:05:33 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux release.  Product Management has requested further review
of this request by Red Hat Engineering.  This request is not yet committed for
inclusion in release.

Comment 6 Christine Caulfield 2006-10-05 07:54:04 UTC
This is slightly hacky, and I can't seem to reproduce it any more, but it
should fix the problem.

Basically, if a lockspace contains a node that dlm_controld doesn't know
about, it re-reads the cman nodes list.
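
In outline, the fix is a retry: if the lookup misses, re-read the node list
from cman and look again.  A sketch with hypothetical names (the real change
is in the files checked in below):

/* Sketch of the fix's logic; names are hypothetical, not the actual
 * dlm_controld code checked in below. */
#include <stdio.h>

#define MAX_NODES 16

static int known_nodeids[MAX_NODES];
static int num_known;

/* Stand-in for re-reading the member list from cman. */
static void update_cluster_nodes(void)
{
        static const int cman_view[] = { 12, 13 }; /* what cman reports now */
        int i;

        num_known = 0;
        for (i = 0; i < 2; i++)
                known_nodeids[num_known++] = cman_view[i];
}

static int node_known(int nodeid)
{
        int i;

        for (i = 0; i < num_known; i++)
                if (known_nodeids[i] == nodeid)
                        return 1;
        return 0;
}

static int set_node_config(int nodeid)
{
        if (!node_known(nodeid)) {
                /* The fix: unknown lockspace member -> re-read cman's list. */
                update_cluster_nodes();
                if (!node_known(nodeid)) {
                        printf("no address for nodeid %d\n", nodeid);
                        return -1;
                }
        }
        printf("configured nodeid %d\n", nodeid);
        return 0;
}

int main(void)
{
        return set_node_config(13) ? 1 : 0;
}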

Checking in action.c;
/cvs/cluster/cluster/group/dlm_controld/action.c,v  <--  action.c
new revision: 1.7; previous revision: 1.6
done
Checking in dlm_daemon.h;
/cvs/cluster/cluster/group/dlm_controld/dlm_daemon.h,v  <--  dlm_daemon.h
new revision: 1.4; previous revision: 1.3
done
Checking in member_cman.c;
/cvs/cluster/cluster/group/dlm_controld/member_cman.c,v  <--  member_cman.c
new revision: 1.3; previous revision: 1.2
done


Comment 11 Nate Straz 2007-12-13 17:22:18 UTC
Moving all RHCS version 5 bugs to RHEL 5 so we can remove the RHCS v5 product,
which never existed.