Bug 119057
Summary: | can't reconstruct cluster from cluster.xml using shutil -s | ||
---|---|---|---|
Product: | [Retired] Red Hat Cluster Suite | Reporter: | Need Real Name <cjk> |
Component: | clumanager | Assignee: | Lon Hohberger <lhh> |
Status: | CLOSED WORKSFORME | QA Contact: | |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 3 | CC: | cluster-maint |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | i686 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2004-06-23 15:02:30 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Need Real Name
2004-03-24 15:58:13 UTC
This works for me. What does cluquorumd say (run: cluquorumd -f -d)? After zeroing out the raw partitions, mine says this:

    Header CRC32 mismatch; Exp: 0x00000000 Got: 0x190a55ad
    diskLseekRawReadChecksum: bad check sum, part = 1 offset = 0 len = 512
    Header CRC32 mismatch; Exp: 0x00000000 Got: 0x190a55ad
    diskLseekRawReadChecksum: bad check sum, part = 0 offset = 0 len = 512
    Header CRC32 mismatch; Exp: 0x00000000 Got: 0x190a55ad
    diskLseekRawReadChecksum: bad check sum, part = 1 offset = 0 len = 512
    diskRawReadShadow: checksums bad on both partitions
    [3566] emerg: Not Starting Cluster Manager: Shared State Header Unreadable: Success

After initializing the partitions (run: shutil -i) and storing the configuration:

    [4382] warning: STONITH: No drivers configured for host 'blue'!
    [4382] warning: STONITH: Data integrity may be compromised!
    [4382] warning: STONITH: No drivers configured for host 'cyan'!
    [4382] warning: STONITH: Data integrity may be compromised!
    [4382] debug: Disk: Disk Interval 2, TKO 7 (14 sec)
    [4382] info: DISK: My status: UP Partner status: UP
    [4382] debug: Starting disk status thread
    [4382] debug: Cluster I/F: eth0 [10.140.1.187]
    [4382] crit: Connection to membership daemon failed!
    [4382] debug: Telling disk quorum thread to exit
    [4382] crit: Unclean exit: Status -1

Don't worry about it complaining about the membership daemon; it can't connect to it because it didn't spawn it (it never spawns other daemons in debug mode). If the quorum daemon can't validate the shared partitions, it won't start at all.

You can also verify the existence of the cluster configuration in the raw partitions with:

    shutil -d /cluster/config.xml

And verify the header with:

    shutil -p /cluster/header

After zeroing out the quorum partitions, I get similar output to what you have above. I then run shutil -i, then shutil -s /etc/cluster.xml, and run cluquorumd -f -d again, with output similar to yours, with the exception of this:

    debug: IP tie-breaker in use, not starting disk thread.
    [2605] debug: Cluster I/F: eth0 [11.22.33.44]

where 11.22.33.44 is my IP address on eth0. When I check the quorum status with shutil -d /cluster/config.xml and shutil -p /cluster/header, the output indicates that it is in fact correct. I'll go back and turn off the IP tie-breaker and see what happens. By the way, thanks for getting on this so fast. Corey

Not starting the disk thread is expected behavior in that situation. If you're running later development/test/beta packages (1.2.10 and later), you'll want to see cluforce(8) when using the IP tie-breaker. The behavior has been altered such that, while using the IP tie-breaker vote, you cannot form a cluster quorum without a majority of physical members unless an operator intervenes.

There was a timing inconsistency with respect to how quickly members converged on their own views of membership (causing them to take longer to converge), but that does not directly explain the problem as reported.

I got it working by taking two of the four nodes out of the cluster and changing the tie-breaker to disk-based. Then I was able to get quorum and add the other two machines back into the cluster. I then changed the tie-breaker back to IP-based. Things seem to be working now, but this is clearly not right. I'll try to repeat the problem in the morning.

Did you distribute the old /etc/cluster.xml (the one you were restoring) to all of the members prior to starting up the cluster manager? (I'm not sure whether the documentation says that; if it does not, it should.)

Side note: you need to bring 2 members online in 1.2.9 to form a quorum with an IP tie-breaker in a 4-member cluster. You can never form a new quorum with only one member when 4 members exist in the cluster configuration. Without an IP tie-breaker, you need 3 members. In the U2 beta package, you need either (a) 3 online members or (b) operator intervention + 2 online members.
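Pulling the commands mentioned in this thread into one place, the restore-and-verify sequence looks roughly like the sketch below. The hostnames (blue, cyan, taken from the STONITH log above) and the loop for distributing the configuration are illustrative, not from the thread; the shutil steps should run on one member against the shared raw partitions.

```shell
# 1. Distribute the configuration to every member first (per the thread,
#    this should happen before starting the cluster manager anywhere).
#    Hostnames here are illustrative.
for host in blue cyan; do
    scp /etc/cluster.xml ${host}:/etc/cluster.xml
done

# 2. Initialize the shared raw partitions, then store the configuration.
shutil -i
shutil -s /etc/cluster.xml

# 3. Verify the stored header and configuration.
shutil -p /cluster/header
shutil -d /cluster/config.xml

# 4. Run the quorum daemon in the foreground with debugging to confirm it
#    can validate the shared partitions (in debug mode it will not spawn
#    the other daemons, so a membership-daemon connection error is expected).
cluquorumd -f -d
```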
Yes, I copied the cluster.xml file from the nodes that remained to the ones that I rebuilt. If I am reading your comment about the U2 beta correctly, you're saying that in a 3-node cluster I can't bring down one node and still have quorum? Is there any thought of handling things like TruCluster and VMS, where the quorum disk gets a vote, so that in a two-node cluster with quorum, one node can be dropped and still maintain system operations?

It works that way now. The disk and IP tie-breakers are "third votes" in 2-node clusters when one member is online:

(a) 1 of 2 online: Half + no T/B: Minority. No quorum.
(b) 1 of 2 online: Half + disk T/B: Quorum.
(c) 1 of 2 online: Half + IP T/B: Quorum.
(d) 2 of 2 online: Majority. Quorum.

3-member clusters do not need a tie-breaker, as there can never be exactly half of the members online:

(a) 1 of 3 online: Minority. No quorum.
(b) 2 of 3 online: Majority. Quorum.
(c) 3 of 3 online: Majority. Quorum.

Similarly, in 4-node clusters:

(a) 1 of 4 online: Minority. No quorum.
(b) 2 of 4 online: Half + no T/B: No quorum.
(c) 2 of 4 online: Half + IP T/B: Quorum.
(d) 3 of 4 online: Majority. Quorum.
(e) 4 of 4 online: Majority. Quorum.

Note that the U2 beta package alters example (c) above: (c) only applies if the cluster previously had a majority and then dropped to half its members. A new quorum will not form with only half of the members plus the IP tie-breaker without administrator intervention.

I'm still puzzled why you could not regain a quorum in your cluster. How many of your 4 members did you bring online after restoring the configuration?

Fixing product name: clumanager on RHEL3 was part of RHCS3, not RHEL3.
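The vote arithmetic in the tables above can be sketched as a small shell function. This is a hypothetical illustration of the rules as stated in this thread (pre-U2 behavior, where half the members plus a tie-breaker vote suffices), not clumanager source; the function name and output strings are made up.

```shell
# has_quorum TOTAL ONLINE TIEBREAKER
#   TIEBREAKER is one of: none, disk, ip
has_quorum() {
    total=$1; online=$2; tiebreaker=$3
    if [ $((online * 2)) -gt "$total" ]; then
        echo "quorum"        # strict majority of members online
    elif [ $((online * 2)) -eq "$total" ] && [ "$tiebreaker" != "none" ]; then
        echo "quorum"        # exactly half the members + tie-breaker vote
    else
        echo "no quorum"     # minority, or half with no tie-breaker
    fi
}

has_quorum 2 1 disk   # -> quorum     (2-node case b)
has_quorum 4 2 none   # -> no quorum  (4-node case b)
has_quorum 4 2 ip     # -> quorum     (4-node case c, pre-U2)
has_quorum 4 3 none   # -> quorum     (4-node case d)
```

Odd-sized clusters never hit the tie-breaker branch, since twice the online count can never equal an odd total — which is exactly why 3-member clusters need no tie-breaker.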