Bug 126991 - nodes form two clusters instead of one when joining in parallel
Summary: nodes form two clusters instead of one when joining in parallel
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: gfs
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Christine Caulfield
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2004-06-30 07:15 UTC by David Teigland
Modified: 2010-01-12 02:53 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-02-22 06:23:21 UTC
Embargoed:



Description David Teigland 2004-06-30 07:15:32 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7)
Gecko/20040626 Firefox/0.9.1

Description of problem:
For the first time in quite a while I had one of my four nodes form
its own cluster while the other three formed another.  I was just
running my usual cluster startup script.  There was no immediate sign
that something was wrong until SM error messages started appearing.


Version-Release number of selected component (if applicable):


How reproducible:
Couldn't Reproduce

Steps to Reproduce:
This occurs very rarely on my 4-node cluster.
I have all 4 nodes run "cman_tool join -c delta" in parallel.


Actual Results:  one node has formed cluster "delta" on its own and
the other three nodes have formed cluster "delta" together

Expected Results:  all four nodes form a single cluster

Additional info:

Comment 1 Christine Caulfield 2004-07-01 10:00:34 UTC
It looks like the delay calculated on receipt of a NEWCLUSTER message
could occasionally be higher than the joinwait time. That would cause
a node to wait too long before trying again, by which time the other
nodes would have given up and formed a new cluster.

I've fixed this exposure and also made the joinwait timeout slightly
longer.
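
The timing problem described above can be illustrated with a toy model.
This sketch is not the cman-kernel change (that code is C and is not
shown in this report). It only shows one way to keep a retry delay
strictly below joinwait, so a waiting node tries again before its peers
give up. The function name cap_retry_delay and the whole-second units
are hypothetical.

```shell
#!/bin/bash
# Toy model only: cap_retry_delay and its arguments are hypothetical
# names, not identifiers from cman-kernel.
# Usage: cap_retry_delay RAW_DELAY JOINWAIT   (whole seconds)
cap_retry_delay() {
    local raw=$1
    local joinwait=$2
    local limit=$((joinwait - 1))

    # Never return a negative delay, even for a tiny joinwait.
    if [ "$limit" -lt 0 ]; then
        limit=0
    fi

    # A raw delay at or above joinwait is pulled back under it.
    if [ "$raw" -gt "$limit" ]; then
        echo "$limit"
    else
        echo "$raw"
    fi
}
```

With an assumed joinwait of 10 seconds, "cap_retry_delay 15 10" prints 9,
while a delay that is already short enough (for example 3) passes through
unchanged.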

Checking in config.c;
/cvs/cluster/cluster/cman-kernel/src/config.c,v  <--  config.c
new revision: 1.2; previous revision: 1.1
done
Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v  <--  membership.c
new revision: 1.3; previous revision: 1.2
done


Comment 2 Christine Caulfield 2004-07-22 12:19:09 UTC
This obviously needs more work...

Comment 3 Christine Caulfield 2004-09-08 07:45:43 UTC
This should work better. Based on ideas from Dave.

Checking in src/config.c;
/cvs/cluster/cluster/cman-kernel/src/config.c,v  <--  config.c
new revision: 1.3; previous revision: 1.2
done
Checking in src/config.h;
/cvs/cluster/cluster/cman-kernel/src/config.h,v  <--  config.h
new revision: 1.2; previous revision: 1.1
done
Checking in src/membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v  <--  membership.c
new revision: 1.15; previous revision: 1.14
done
Checking in src/proc.c;
/cvs/cluster/cluster/cman-kernel/src/proc.c,v  <--  proc.c
new revision: 1.3; previous revision: 1.2
done


Comment 4 Lazar Obradovic 2004-09-08 14:37:19 UTC
It's better but still not perfect (and, we expect perfect, don't we?)

I have a 7-node cluster. Before this last update, I used to have 4-5
"clusters" formed in parallel, each with a node or two in it.
After the update, there are 2-3 "clusters", which still isn't good.

As a workaround, I have placed "sleep $((RANDOM / 1000 ))s" into my
startup script, which somewhat helps the parallel startup situation,
but slows down the boot process.
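
For reference, bash's $RANDOM ranges from 0 to 32767, so the expression
in the workaround above sleeps for anywhere between 0 and 32 seconds.
A minimal sketch of the same idea with an explicit upper bound follows,
which may limit the boot delay. The helper name jitter_seconds and the
10-second cap are assumptions for illustration, and this only staggers
the joins; it does not address the underlying race.

```shell
#!/bin/bash
# jitter_seconds MAX: print a random whole number of seconds in 0..MAX.
# Relies on bash's $RANDOM (0..32767), the same source the workaround uses.
# jitter_seconds is a hypothetical helper, not part of cman_tool.
jitter_seconds() {
    local max=$1
    echo $((RANDOM % (max + 1)))
}

# Hypothetical use in a node's startup script (the 10 s cap is arbitrary):
#   sleep "$(jitter_seconds 10)"
#   cman_tool join -c delta
```

A smaller cap shortens the worst-case boot delay but also makes it more
likely that two nodes pick the same delay and join at the same moment.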


Comment 5 Christine Caulfield 2004-09-13 07:25:49 UTC
There was a missing condition in that original check-in that made it
little better than the original. I've corrected this now and I think
it should be fixed.

I'll wait for Lazar to confirm before changing the status of this bug
report though.

Comment 6 Christine Caulfield 2004-09-17 09:49:41 UTC
No response from Lazar, but he's not said it's still broken :)

It seems OK to me on my 12 node cluster now, so setting it to MODIFIED
for the moment.

Comment 7 Christine Caulfield 2004-09-20 15:02:43 UTC
For info, Lazar said (on IRC) that he hasn't seen this bug since the
last fix was applied.

Comment 8 Kiersten (Kerri) Anderson 2004-11-16 19:02:21 UTC
Updating version to the right level in the defects.  Sorry for the storm.

Comment 9 David Teigland 2005-02-22 06:23:21 UTC
not seen this in a long time

