Bug 119057

Summary: can't reconstruct cluster from cluster.xml using shutil -s
Product: [Retired] Red Hat Cluster Suite
Component: clumanager
Version: 3
Hardware: i686
OS: Linux
Reporter: Need Real Name <cjk>
Assignee: Lon Hohberger <lhh>
CC: cluster-maint
Status: CLOSED WORKSFORME
Severity: high
Priority: high
Doc Type: Bug Fix
Last Closed: 2004-06-23 15:02:30 UTC

Description Need Real Name 2004-03-24 15:58:13 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; H010818)

Description of problem:
After accidentally destroying the quorum data on a 4-node cluster, I
cannot re-establish the cluster by issuing 'shutil -s /etc/cluster.xml'.

Version-Release number of selected component (if applicable):
clumanager-1.2.9-1

How reproducible:
Always

Steps to Reproduce (consolidated as a script after the list):
1. Shut down clumanager on all nodes
2. kill the quorum partitions using 
         'dd if=/dev/zero of=/dev/sda1'
         'dd if=/dev/zero of=/dev/sda2'
3. issue 'shutil -i'
4. issue 'shutil -s /etc/cluster.xml'
5. start clumanager using 'service clumanager start'
6. Verify settings by issuing 'shutil -p /cluster/cluster.xml'
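
For reference, the whole sequence as one script (a sketch only; the
quorum partition device paths are the ones from this report, and the
dd commands destroy the shared cluster state):

   service clumanager stop          # step 1, on every node
   dd if=/dev/zero of=/dev/sda1     # step 2, wipe primary quorum partition
   dd if=/dev/zero of=/dev/sda2     # step 2, wipe shadow quorum partition
   shutil -i                        # step 3, re-initialize shared state
   shutil -s /etc/cluster.xml       # step 4, store the saved configuration
   service clumanager start         # step 5
   shutil -p /cluster/cluster.xml   # step 6, verify the stored settings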
    

Actual Results:  Cluster does not obtain quorum

Expected Results:  Cluster should obtain quorum 

Additional info:

This is documented in the cluster manager documentation as a working 
solution for backing up cluster configurations. I have followed the 
directions and this does not seem to be working.

Comment 1 Lon Hohberger 2004-03-24 17:11:46 UTC
This works for me.

What does cluquorumd say (run: cluquorumd -f -d)?
After zeroing out the raw partitions, mine says this:

Header CRC32 mismatch; Exp: 0x00000000 Got: 0x190a55ad
diskLseekRawReadChecksum: bad check sum, part = 1 offset = 0 len = 512
Header CRC32 mismatch; Exp: 0x00000000 Got: 0x190a55ad
diskLseekRawReadChecksum: bad check sum, part = 0 offset = 0 len = 512
Header CRC32 mismatch; Exp: 0x00000000 Got: 0x190a55ad
diskLseekRawReadChecksum: bad check sum, part = 1 offset = 0 len = 512
diskRawReadShadow: checksums bad on both partitions
[3566] emerg: Not Starting Cluster Manager: Shared State Header
Unreadable: Success (run: shutil -i)

After initializing the partitions and storing the configuration:

[4382] warning: STONITH: No drivers configured for host 'blue'!
[4382] warning: STONITH: Data integrity may be compromised!
[4382] warning: STONITH: No drivers configured for host 'cyan'!
[4382] warning: STONITH: Data integrity may be compromised!
[4382] debug: Disk: Disk Interval 2, TKO 7 (14 sec)
[4382] info: DISK: My status: UP  Partner status: UP
[4382] debug: Starting disk status thread
[4382] debug: Cluster I/F: eth0 [10.140.1.187]
[4382] crit: Connection to membership daemon failed!
[4382] debug: Telling disk quorum thread to exit
[4382] crit: Unclean exit: Status -1

Don't worry about it complaining about the membership daemon: it can't
connect to it because it didn't spawn it (it never spawns the other
daemons in debug mode).

If the quorum daemon can't validate the shared partitions, it won't
start at all.

You can also verify the existence of the cluster configuration in the
raw partitions with:

   shutil -d /cluster/config.xml

And verify the header:

   shutil -p /cluster/header


Comment 2 Need Real Name 2004-03-24 18:11:23 UTC
After zeroing out the Q partitions, I get similar output to what you
have above. I then run shutil -i and shutil -s /etc/cluster.xml, and
run cluquorumd -f -d again, with output similar to yours, with the
exception of this:

   debug: IP tie-breaker in use, not starting disk thread.
   [2605] debug: Cluster I/F: eth0 [11.22.33.44]

where 11.22.33.44 is my IP address on eth0.

When I check the quorum status with shutil -d /cluster/config.xml and 
shutil -p /cluster/header, the output indicates that it is in fact 
correct.

I'll go back and turn off the IP tie-breaker and see what happens.

By the way, thanks for getting on this so fast.

Corey

Comment 3 Lon Hohberger 2004-03-24 19:11:12 UTC
Not starting the disk thread is expected behavior in that situation.

If you're running later development/test/beta packages (1.2.10 and
later), you'll want to see cluforce(8) when using the IP tie-breaker.
The behavior has been altered so that, while the IP tie-breaker vote
is in use, you cannot form a cluster quorum without a majority of
physical members unless an operator intervenes. 
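
For example (assuming cluforce takes no arguments; check cluforce(8)
for the exact invocation), the operator intervention would be along
the lines of running, on one online member:

   cluforce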

There was a timing inconsistency in how quickly members converged on
their own views of membership (causing them to take longer to
converge), but that does not directly explain the problem as reported.


Comment 4 Need Real Name 2004-03-25 01:02:43 UTC
I got it working by taking two of the four nodes out of the cluster
and changing the tie-breaker to disk-based. Then I was able to get
quorum and add the other two machines back into the cluster. I then
changed the tie-breaker back to IP-based. Things seem to be working
now, but this is clearly not right. I'll try to reproduce the problem
in the morning.


Comment 5 Lon Hohberger 2004-03-25 13:55:37 UTC
Did you distribute the old /etc/cluster.xml (the one you were
restoring) to all of the members prior to starting up cluster manager?

(I'm not sure if the documentation says that; if it does not, it should.)
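
For example (the member hostnames below are placeholders), copying the
restored configuration from one member to the rest before startup
might look like:

   # Distribute the restored configuration to the other members first;
   # member2..member4 stand in for your actual member hostnames.
   for node in member2 member3 member4; do
       scp /etc/cluster.xml root@$node:/etc/cluster.xml
   done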

Side note - you need to bring 2 members online in 1.2.9 to form a
quorum with an IP tie-breaker in a 4-member cluster.  You can never
form a new quorum with only one member when 4 members exist in the
cluster configuration.  Without an IP tie-breaker, you need 3 members.

In the U2 beta package, you need either (a) 3 online members or (b)
operator intervention + 2 online members.


Comment 6 Need Real Name 2004-03-26 17:52:26 UTC
Yes I copied the cluster.xml file from the nodes that remained to the 
ones that I rebuilt.

If I am reading your comment about the U2 beta, your saying that in a 
3 node cluster I can't bring down one node and still have quorum? 

Is there any thought of handling things like TruCluster and VMS where 
the quorum disk gets a vote so that on a two node cluster with 
quorum, one node can be dropped and still maintain system operations?


Comment 7 Lon Hohberger 2004-03-26 18:39:44 UTC
It works that way now.  The disk and IP tie-breakers are "third votes"
in 2-node clusters when one member is online:

(a) 1 of 2 online: Half + No T/B: Minority.  No Quorum
(b) 1 of 2 online: Half + Disk T/B: Quorum
(c) 1 of 2 online: Half + IP T/B: Quorum
(d) 2 of 2 online: Majority.  Quorum.

3-member clusters do not need a tie-breaker, as there can never be
exactly half of the members online:

(a) 1 of 3 online: Minority.  No quorum.
(b) 2 of 3 online: Majority.  Quorum.
(c) 3 of 3 online: Majority.  Quorum.

Similarly, in 4-node clusters:

(a) 1 of 4 online: Minority:  No Quorum.
(b) 2 of 4 online: Half + No T/B:  No Quorum
(c) 2 of 4 online: Half + IP T/B:  Quorum.
(d) 3 of 4 online: Majority:  Quorum.
(e) 4 of 4 online: Majority:  Quorum.

Note that the U2 beta package alters example (c) above: (c) only
applies if the cluster previously had a majority and then dropped to
half of its members. A new quorum will not form with only half of the
members and the IP tie-breaker without administrator intervention.
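
The vote arithmetic above can be sketched as a small shell function
(an illustration only, not clumanager code; the U2-beta restriction is
noted in a comment):

   # Illustration of the quorum arithmetic described above.
   # Usage: quorum ONLINE TOTAL HAS_TIEBREAKER(0|1)
   quorum() {
       online=$1; total=$2; tb=$3
       if [ $((online * 2)) -gt "$total" ]; then
           echo "Majority: Quorum"
       elif [ $((online * 2)) -eq "$total" ] && [ "$tb" -eq 1 ]; then
           # In the U2 beta, this case keeps an existing quorum but
           # forms a new one only with operator intervention.
           echo "Half + T/B: Quorum"
       else
           echo "No Quorum"
       fi
   }
   quorum 2 4 1   # 4-node case (c): prints "Half + T/B: Quorum"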


I'm still puzzled why you could not regain a quorum in your cluster. 
How many of your 4 members did you bring online after restoring the
configuration?


Comment 8 Lon Hohberger 2007-12-21 15:10:25 UTC
Fixing product name. Clumanager on RHEL3 was part of RHCS3, not RHEL3.