Bug 126991 - nodes form two clusters instead of one when joining in parallel
Status: CLOSED CURRENTRELEASE
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: gfs
Version: 4
Hardware: All Linux
Priority: medium, Severity: medium
Assigned To: Christine Caulfield
QA Contact: Cluster QE
Reported: 2004-06-30 03:15 EDT by David Teigland
Modified: 2010-01-11 21:53 EST (History)
Doc Type: Bug Fix
Last Closed: 2005-02-22 01:23:21 EST


Description David Teigland 2004-06-30 03:15:32 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7)
Gecko/20040626 Firefox/0.9.1

Description of problem:
For the first time in quite a while I had one of my four nodes form
its own cluster while the other three formed another.  I was just
running my usual cluster startup script.  There was no immediate sign
that something was wrong until SM error messages started appearing.


Version-Release number of selected component (if applicable):


How reproducible:
Couldn't Reproduce

Steps to Reproduce:
This occurs very rarely on my 4-node cluster.
I have all 4 nodes run "cman_tool join -c delta" in parallel.


Actual Results:  one node has formed cluster "delta" on its own and
the other three nodes have formed cluster "delta" together

Expected Results:  all four nodes form a single cluster

Additional info:
Comment 1 Christine Caulfield 2004-07-01 06:00:34 EDT
It looks like the delay calculated on receipt of a NEWCLUSTER message
could occasionally be higher than the joinwait time which would cause
a node to wait too long before trying again, thus the other nodes
would have given up and formed a new cluster.

I've fixed this exposure and also increased the joinwait timeout to be
slightly longer too.
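
To illustrate the exposure described above, here is a minimal sketch of
one way a retry delay could be kept below the joinwait window. It is not
the cman-kernel code, which this report does not quote: the function
name, the margin parameter and all numeric values are assumptions made
up for the example.

```python
# Hypothetical sketch: names and values are illustrative assumptions,
# not taken from cman-kernel's config.c or membership.c.

def bounded_retry_delay(raw_delay: float, joinwait: float, margin: float = 1.0) -> float:
    """Cap a retry delay so that it expires before the joinwait window ends.

    raw_delay: delay computed on receipt of a NEWCLUSTER message (seconds)
    joinwait:  how long the other nodes keep waiting before giving up (seconds)
    margin:    assumed safety gap kept between the two (seconds)
    """
    if joinwait <= margin:
        raise ValueError("joinwait must be larger than margin")
    # Never wait a negative time, and never wait past joinwait - margin.
    return max(0.0, min(raw_delay, joinwait - margin))


if __name__ == "__main__":
    joinwait = 10.0  # assumed value, for illustration only
    for raw in (3.0, 9.5, 14.0):
        print(f"raw delay {raw:5.1f}s -> bounded {bounded_retry_delay(raw, joinwait):5.1f}s")
```

With an uncapped delay of 14 seconds and a joinwait of 10 seconds, the
waiting node would retry only after its peers had already given up,
which matches the split described in this comment.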

Checking in config.c;
/cvs/cluster/cluster/cman-kernel/src/config.c,v  <--  config.c
new revision: 1.2; previous revision: 1.1
done
Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v  <--  membership.c
new revision: 1.3; previous revision: 1.2
done
Comment 2 Christine Caulfield 2004-07-22 08:19:09 EDT
This obviously needs more work...
Comment 3 Christine Caulfield 2004-09-08 03:45:43 EDT
This should work better. Based on ideas from Dave.

Checking in src/config.c;
/cvs/cluster/cluster/cman-kernel/src/config.c,v  <--  config.c
new revision: 1.3; previous revision: 1.2
done
Checking in src/config.h;
/cvs/cluster/cluster/cman-kernel/src/config.h,v  <--  config.h
new revision: 1.2; previous revision: 1.1
done
Checking in src/membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v  <--  membership.c
new revision: 1.15; previous revision: 1.14
done
Checking in src/proc.c;
/cvs/cluster/cluster/cman-kernel/src/proc.c,v  <--  proc.c
new revision: 1.3; previous revision: 1.2
done
Comment 4 Lazar Obradovic 2004-09-08 10:37:19 EDT
It's better but still not perfect (and, we expect perfect, don't we?)

I have a 7-node cluster. Before this last update, I used to have 4-5
"clusters" formed in parallel, each with a node or two in it.
After the update, there are 2-3 "clusters", which still isn't good.

As a workaround, I have placed "sleep $((RANDOM / 1000 ))s" into my
startup script, which somewhat helps the parallel startup situation,
but slows down the boot process.
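
For reference, the arithmetic behind that workaround can be sketched as
follows. In bash, $RANDOM expands to an integer between 0 and 32767, so
the integer division by 1000 yields a sleep of 0 to 32 whole seconds.
The Python below only mirrors that calculation; the function name is
made up for the example, and it is not part of any startup script.

```python
import random

BASH_RANDOM_MAX = 32767  # bash's $RANDOM yields an integer in 0..32767


def startup_jitter_seconds(rng: random.Random) -> int:
    """Mirror `$((RANDOM / 1000))`: integer division of a 0..32767 draw."""
    return rng.randint(0, BASH_RANDOM_MAX) // 1000


# Draw many samples to show the range of delays a node could pick.
rng = random.Random()
samples = [startup_jitter_seconds(rng) for _ in range(10000)]

if __name__ == "__main__":
    print("shortest delay seen:", min(samples), "seconds")
    print("longest delay seen: ", max(samples), "seconds")
```

Each node therefore waits up to about half a minute before joining,
which spreads out the joins at the cost of a slower boot.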
Comment 5 Christine Caulfield 2004-09-13 03:25:49 EDT
There was a missing condition in that original check-in that made it
little better than the original. I've corrected this now and I think
it should be fixed.

I'll wait for Lazar to confirm before changing the status of this bug
report though.
Comment 6 Christine Caulfield 2004-09-17 05:49:41 EDT
No response from Lazar, but he's not said it's still broken :)

It seems OK to me on my 12 node cluster now, so setting it to MODIFIED
for the moment.
Comment 7 Christine Caulfield 2004-09-20 11:02:43 EDT
For info, Lazar said (on IRC) that he hasn't seen this bug since the
last fix was applied.
Comment 8 Kiersten (Kerri) Anderson 2004-11-16 14:02:21 EST
Updating version to the right level in the defects.  Sorry for the storm.
Comment 9 David Teigland 2005-02-22 01:23:21 EST
Not seen this in a long time.
