Bug 473102

Summary: Nodes GATHER but don't form a configuration
Product: Red Hat Enterprise Linux 6
Component: corosync
Version: 6.0
Hardware: All
OS: Linux
Status: CLOSED WONTFIX
Severity: medium
Priority: medium
Target Milestone: rc
Target Release: ---
Reporter: Nate Straz <nstraz>
Assignee: Steven Dake <sdake>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint, edamato
Doc Type: Bug Fix
Last Closed: 2009-06-16 04:18:19 UTC
Attachments:
Logs from all 28 nodes and revolver (flags: none)

Description Nate Straz 2008-11-26 15:20:17 UTC
Created attachment 324732
Logs from all 28 nodes and revolver

Description of problem:

While running recovery tests on a large cluster (28 nodes), the membership fell apart: nodes formed their own rings and would not re-form the 28-node cluster.  In /var/log/messages I see:

openais[2719]: [TOTEM] entering GATHER state from 11.

This message repeats about 20 times, and then a configuration with just one node forms.
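
To check how often each node logs this message, something like the following could be run against each node's log. It is only a sketch: the regular expression is derived from the single line quoted above, and the function name count_gather is made up for illustration. Any syslog prefix (timestamp, hostname) in front of "openais[" is ignored.

```python
import re
from collections import Counter

# Pattern derived from the one message quoted in this report (an assumption
# about the general form; other TOTEM messages are not matched).
GATHER_RE = re.compile(
    r"openais\[(\d+)\]: \[TOTEM\] entering GATHER state from (\d+)\."
)


def count_gather(lines):
    """Count GATHER entries, keyed by the state number after 'from'."""
    counts = Counter()
    for line in lines:
        match = GATHER_RE.search(line)
        if match:
            counts[int(match.group(2))] += 1
    return counts


def count_gather_in_file(path):
    """Convenience wrapper (hypothetical helper) for a log file on disk."""
    with open(path, errors="replace") as handle:
        return count_gather(handle)
```

For example, count_gather_in_file("/var/log/messages") on one node returns a mapping such as {11: N}, where N is the number of times that node entered GATHER from state 11.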

I'm using the following parameters in cluster.conf:
  <totem token="30000" consensus="29000" join="5000" send_join="80"/>
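
For readers comparing settings across nodes, the totem line above can be read programmatically. This is a minimal sketch using Python's standard xml.etree module. It assumes the values are in milliseconds, as the openais totem options are documented, and the helper name totem_timeouts is hypothetical. It makes no claim about whether these values are appropriate for a 28-node cluster.

```python
import xml.etree.ElementTree as ET

# The element exactly as quoted in this report.
TOTEM_XML = '<totem token="30000" consensus="29000" join="5000" send_join="80"/>'


def totem_timeouts(xml_text):
    """Return the totem attributes as a dict of integers (assumed ms)."""
    element = ET.fromstring(xml_text)
    return {name: int(value) for name, value in element.attrib.items()}


timeouts = totem_timeouts(TOTEM_XML)
for name, ms in timeouts.items():
    # Print each value in milliseconds and, for readability, in seconds.
    print(f"{name}: {ms} ms ({ms / 1000:g} s)")
```

To read a full cluster.conf instead, ET.parse(path).getroot().find("totem") should locate the same element, assuming <totem> is a direct child of the root <cluster> element.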

The attached logs are from two revolver scenarios.  In scenario 1.3, revolver shot one node less than quorum with "reboot -fin"; recovery completed and the scenario passed.  In scenario 1.4, one node more than quorum was shot and I hit the problem described above.

Version-Release number of selected component (if applicable):
cman-2.0.97-1.el5
openais-0.80.3-21.el5


How reproducible:
I've run into this scenario many times before, but it probably takes a few tries to hit it.

Actual results:


Expected results:


Additional info:

Comment 1 Nate Straz 2008-12-01 18:15:07 UTC
Putting this on the 5.4 radar so we can support large configurations.

Comment 4 RHEL Program Management 2009-06-16 04:18:19 UTC
Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request.