Bug 207197 - Cman will hang initializing the daemons if started on all nodes simultaneously
Summary: Cman will hang initializing the daemons if started on all nodes simultaneously
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Christine Caulfield
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2006-09-19 21:22 UTC by Josef Bacik
Modified: 2009-04-16 22:29 UTC
CC: 4 users

Fixed In Version: 5.0.0
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-11-28 21:31:56 UTC
Target Upstream Version:
Embargoed:



Description Josef Bacik 2006-09-19 21:22:58 UTC
Description of problem:
This isn't specifically cman, but rather the daemons that cman starts.  I have a
two-node cluster; if I bring the nodes up with the cluster services stopped and
then run "service cman start" on both nodes at the same time, both will hang at:
[root@rh5cluster1 ~]# service cman start
Starting cluster: 
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... 

fenced is started, so I'm assuming it's hanging at the dlm_controld stage, which
isn't showing up in ps.  The only messages I get are these:

Sep 19 15:44:50 rh5cluster1 ccsd[2718]: Initial status:: Quorate 
Sep 19 15:49:18 rh5cluster1 kernel: DLM (built Aug 30 2006 18:19:57) installed
Sep 19 15:49:18 rh5cluster1 kernel: GFS2 (built Aug 30 2006 18:20:27) installed
Sep 19 15:49:18 rh5cluster1 kernel: Lock_DLM (built Aug 30 2006 18:20:36) installed
Sep 19 15:49:18 rh5cluster1 kernel: dlm: no local IP address has been set
Sep 19 15:49:18 rh5cluster1 kernel: dlm: cannot start dlm lowcomms -22

and

Sep 19 15:44:50 rh5cluster2 openais[2715]: [TOTEM] entering OPERATIONAL state.
Sep 19 15:44:50 rh5cluster2 openais[2715]: [CLM  ] got nodejoin message 10.10.1.12
Sep 19 15:44:50 rh5cluster2 openais[2715]: [CLM  ] got nodejoin message 10.10.1.13
Sep 19 15:44:50 rh5cluster2 kernel: DLM (built Aug 30 2006 18:19:57) installed
Sep 19 15:44:50 rh5cluster2 kernel: GFS2 (built Aug 30 2006 18:20:27) installed
Sep 19 15:44:50 rh5cluster2 kernel: Lock_DLM (built Aug 30 2006 18:20:36) installed

I will look into it more tomorrow.

How reproducible:
Every time

Steps to Reproduce:
1. Bring up the nodes without the cluster services enabled.
2. Open an ssh session to both nodes and type "service cman start", but don't hit Enter.
3. Hit Enter in both terminals as close together as possible.
  
Actual results:
The init script hangs at "Starting daemons".

Expected results:
The script should start the daemons and complete on both nodes without hanging.

Version-Release number of selected component (if applicable):
The newest packages in brew (cman-2.0.16).

Comment 1 Christine Caulfield 2006-09-21 12:40:33 UTC
Sep 19 15:49:18 rh5cluster1 kernel: dlm: no local IP address has been set
Sep 19 15:49:18 rh5cluster1 kernel: dlm: cannot start dlm lowcomms -22

Those are obviously the key messages.  If dlm_controld isn't running, that would
explain why the DLM hasn't been configured - perhaps it crashed?

A debug log from dlm_controld would be really helpful here if you can get one.
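
For context: "no local IP address has been set" means nothing ever wrote this
node's comms address into the kernel DLM's configfs tree, which is
dlm_controld's job before any lockspace starts.  A rough sketch of the kind of
configfs write involved (the paths follow the kernel DLM configfs layout; the
helper name and error handling are illustrative, not the actual dlm_controld
code):

/* illustrative sketch, not the real dlm_controld source */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

static int set_local_comm(int nodeid, const char *ip)
{
	char path[256];
	struct sockaddr_storage addr;
	struct sockaddr_in *sin = (struct sockaddr_in *)&addr;
	FILE *f;

	memset(&addr, 0, sizeof(addr));
	sin->sin_family = AF_INET;
	inet_pton(AF_INET, ip, &sin->sin_addr);

	/* the real code creates comms/<nodeid> first; assume it exists here */
	snprintf(path, sizeof(path),
	         "/sys/kernel/config/dlm/cluster/comms/%d/addr", nodeid);
	f = fopen(path, "w");
	if (!f)
		return -1;
	/* the kernel expects the raw sockaddr_storage blob */
	fwrite(&addr, sizeof(addr), 1, f);
	fclose(f);

	snprintf(path, sizeof(path),
	         "/sys/kernel/config/dlm/cluster/comms/%d/local", nodeid);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "1\n");	/* mark this comm as the local node */
	fclose(f);
	return 0;
}

If that write never happens - because dlm_controld crashed or stalled - the
kernel reports exactly the "cannot start dlm lowcomms -22" seen above.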

Comment 2 Christine Caulfield 2006-09-21 12:58:43 UTC
Oh, and it's also worth checking whether configfs is mounted.  The times when I
have seen this message, configfs hadn't been mounted for some reason.
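
A trivial way to verify that programmatically (a sketch only - it scans
/proc/mounts for a configfs entry; dlm_controld's actual check may differ):

#include <stdio.h>
#include <string.h>

static int configfs_mounted(void)
{
	char line[512];
	FILE *f = fopen("/proc/mounts", "r");
	int found = 0;

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f)) {
		/* the third field in /proc/mounts is the fs type */
		if (strstr(line, " configfs ")) {
			found = 1;
			break;
		}
	}
	fclose(f);
	return found;
}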

Comment 3 Christine Caulfield 2006-09-26 14:18:48 UTC
I saw something that is possibly similar to #1 today, where a node was added to
the DLM members list before dlm_controld knew its IP address.  The DLM kicked
out the error:

dlm: Initiating association with node 13
dlm: no address for nodeid 13

Is it possible there's a race here? The cman event callback arriving after
dlm_controld has decided to add the new node?
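
To make the suspected ordering concrete, here is roughly how dlm_controld's
node cache interacts with libcman (cman_get_nodes and the types come from
libcman; the cache and lookup logic are simplified and illustrative):

#include <libcman.h>

#define MAX_NODES 256
static cman_node_t cman_nodes[MAX_NODES];
static int cman_node_count;

/* refresh the cached node list from cman */
static void update_cman_nodes(cman_handle_t ch)
{
	cman_node_count = 0;
	cman_get_nodes(ch, MAX_NODES, &cman_node_count, cman_nodes);
}

/* look a nodeid up in the cache */
static cman_node_t *find_node(int nodeid)
{
	int i;

	for (i = 0; i < cman_node_count; i++)
		if (cman_nodes[i].cn_nodeid == nodeid)
			return &cman_nodes[i];
	return NULL;
}

If a lockspace member arrives before the cman statechange callback has
triggered update_cman_nodes(), find_node() returns NULL, no address gets
written to configfs, and the kernel logs "no address for nodeid 13".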

Comment 4 Kiersten (Kerri) Anderson 2006-10-03 16:54:42 UTC
Devel ACK for RHEL 5.0.0 Beta 2

Comment 5 RHEL Program Management 2006-10-03 17:05:33 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux release.  Product Management has requested further review
of this request by Red Hat Engineering.  This request is not yet committed for
inclusion in release.

Comment 6 Christine Caulfield 2006-10-05 07:54:04 UTC
This is slightly hacky and I can't seem to reproduce the problem any more, but
it should fix it.

Basically, if a lockspace contains a node that dlm_controld doesn't know about,
then it re-reads the cman node list.
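
In terms of the sketch in comment #3, the change amounts to retrying the lookup
after a refresh (a minimal sketch of the approach only; the real change is in
the member_cman.c revision below):

static cman_node_t *find_node_refresh(cman_handle_t ch, int nodeid)
{
	cman_node_t *node = find_node(nodeid);

	if (!node) {
		/* a lockspace references a node we haven't seen yet:
		   re-read the cman node list instead of failing */
		update_cman_nodes(ch);
		node = find_node(nodeid);
	}
	return node;
}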

Checking in action.c;
/cvs/cluster/cluster/group/dlm_controld/action.c,v  <--  action.c
new revision: 1.7; previous revision: 1.6
done
Checking in dlm_daemon.h;
/cvs/cluster/cluster/group/dlm_controld/dlm_daemon.h,v  <--  dlm_daemon.h
new revision: 1.4; previous revision: 1.3
done
Checking in member_cman.c;
/cvs/cluster/cluster/group/dlm_controld/member_cman.c,v  <--  member_cman.c
new revision: 1.3; previous revision: 1.2
done


Comment 11 Nate Straz 2007-12-13 17:22:18 UTC
Moving all RHCS version 5 bugs to RHEL 5 so we can remove the RHCS v5 product, which never existed.

