126991 – nodes form two clusters instead of one when joining in parallel

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	gfs
Sub Component:
Version:	4
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Christine Caulfield
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2004-06-30 07:15 UTC by David Teigland
Modified:	2010-01-12 02:53 UTC (History)
CC List:	0 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2005-02-22 06:23:21 UTC
Embargoed:

Description David Teigland 2004-06-30 07:15:32 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7)
Gecko/20040626 Firefox/0.9.1

Description of problem:
For the first time in quite a while I had one of my four nodes form
its own cluster while the other three formed another.  I was just
running my usual cluster startup script.  There was no immediate sign
that something was wrong until SM error messages started appearing.


Version-Release number of selected component (if applicable):


How reproducible:
Couldn't Reproduce

Steps to Reproduce:
This occurs very rarely on my 4 node cluster.
I have all 4 nodes run "cman_tool join -c delta" in parallel
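
For anyone trying to reproduce this, the parallel join can be sketched roughly as below. This is only an illustration, not the reporter's actual startup script: the function name parallel_join, the REMOTE variable, and the node names are all hypothetical, and it assumes each node is reachable over ssh.

```bash
#!/bin/bash
# Hypothetical sketch: start "cman_tool join -c <cluster>" on every node at once.
# REMOTE is an assumed knob for the remote-execution command (defaults to ssh).
parallel_join() {
    local cluster=$1; shift
    local remote=${REMOTE:-ssh}
    local node
    for node in "$@"; do
        # Background each join so that all nodes start at about the same time.
        $remote "$node" cman_tool join -c "$cluster" &
    done
    # Block until every backgrounded join command has returned.
    wait
}

# Example (node names are placeholders):
#   parallel_join delta node1 node2 node3 node4
```

Setting REMOTE=echo prints the commands instead of running them, which is handy for checking the script before pointing it at real nodes.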


Actual Results:  one node has formed cluster "delta" on its own and
the other three nodes have formed cluster "delta" together

Expected Results:  all four nodes form a single cluster

Additional info:

Comment 1 Christine Caulfield 2004-07-01 10:00:34 UTC

It looks like the delay calculated on receipt of a NEWCLUSTER message
could occasionally be higher than the joinwait time. That would cause
a node to wait too long before trying again, by which point the other
nodes would have given up and formed a new cluster.

I've fixed this exposure and also increased the joinwait timeout
slightly.

Checking in config.c;
/cvs/cluster/cluster/cman-kernel/src/config.c,v  <--  config.c
new revision: 1.2; previous revision: 1.1
done
Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v  <--  membership.c
new revision: 1.3; previous revision: 1.2
done
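
To illustrate the invariant described in this comment (the retry delay must stay below joinwait), here is a minimal sketch in shell. It is not the code checked in to membership.c; the function name cap_retry_delay and the choice of capping at joinwait minus one are assumptions made purely for illustration, with both values taken to be non-negative integers in the same unit.

```bash
#!/bin/bash
# Hypothetical sketch: keep a retry delay strictly below the join wait window.
# Arguments: <delay> <joinwait>, both non-negative integers in the same unit.
cap_retry_delay() {
    local delay=$1 joinwait=$2
    if (( joinwait <= 1 )); then
        # No room for any delay below the window; retry immediately.
        echo 0
    elif (( delay >= joinwait )); then
        # The computed delay would outlast the window, so cap it.
        echo $(( joinwait - 1 ))
    else
        echo "$delay"
    fi
}

# Example: cap_retry_delay 15 10   prints 9
#          cap_retry_delay 5 10    prints 5
```

With a cap like this, a node that backs off after seeing another node's NEWCLUSTER message always retries before its peers' wait expires.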

Comment 2 Christine Caulfield 2004-07-22 12:19:09 UTC

This obviously needs more work...

Comment 3 Christine Caulfield 2004-09-08 07:45:43 UTC

This should work better. It is based on ideas from Dave.

Checking in src/config.c;
/cvs/cluster/cluster/cman-kernel/src/config.c,v  <--  config.c
new revision: 1.3; previous revision: 1.2
done
Checking in src/config.h;
/cvs/cluster/cluster/cman-kernel/src/config.h,v  <--  config.h
new revision: 1.2; previous revision: 1.1
done
Checking in src/membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v  <--  membership.c
new revision: 1.15; previous revision: 1.14
done
Checking in src/proc.c;
/cvs/cluster/cluster/cman-kernel/src/proc.c,v  <--  proc.c
new revision: 1.3; previous revision: 1.2
done

Comment 4 Lazar Obradovic 2004-09-08 14:37:19 UTC

It's better, but still not perfect (and we expect perfect, don't we?).

I have a 7-node cluster. Before this last update, I used to get 4-5
"clusters" formed in parallel, each with a node or two in it.
After the update, there are 2-3 "clusters", which still isn't good.

As a workaround, I have placed "sleep $((RANDOM / 1000 ))s" into my
startup script, which somewhat helps the parallel startup situation
but slows down the boot process.
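
Since bash's RANDOM ranges from 0 to 32767, the expression above sleeps for anywhere between 0 and 32 seconds. A variant with an explicit upper bound might look like the sketch below; the helper name join_jitter and the 10-second bound in the example are assumptions for illustration, not something taken from the startup script in question.

```bash
#!/bin/bash
# Hypothetical helper: print a random delay in whole seconds, 0..max inclusive.
# max defaults to 32, matching the range of $((RANDOM / 1000)).
join_jitter() {
    local max=${1:-32}
    # RANDOM is 0..32767 in bash; the modulo keeps the result within 0..max.
    echo $(( RANDOM % (max + 1) ))
}

# Example use in a startup script (not executed here):
#   sleep "$(join_jitter 10)"
#   cman_tool join -c delta
```

A smaller bound shortens the worst-case boot delay at the cost of making simultaneous joins more likely, so it is a trade-off rather than a fix.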

Comment 5 Christine Caulfield 2004-09-13 07:25:49 UTC

A condition was missing from that check-in, which made it little
better than the original code. I've corrected this now and I think
it should be fixed.

I'll wait for Lazar to confirm before changing the status of this bug
report though.

Comment 6 Christine Caulfield 2004-09-17 09:49:41 UTC

No response from Lazar, but he's not said it's still broken :)

It seems OK to me on my 12 node cluster now, so setting it to MODIFIED
for the moment.

Comment 7 Christine Caulfield 2004-09-20 15:02:43 UTC

For info, Lazar said (on IRC) that he hasn't seen this bug since the
last fix was applied.

Comment 8 Kiersten (Kerri) Anderson 2004-11-16 19:02:21 UTC

Updating version to the right level in the defects.  Sorry for the storm.

Comment 9 David Teigland 2005-02-22 06:23:21 UTC

not seen this in a long time
