Bug 173462
| Summary: | quorum partition can not be found | | |
|---|---|---|---|
| Product: | [Retired] Red Hat Cluster Suite | Reporter: | dhe |
| Component: | clumanager | Assignee: | Lon Hohberger <lhh> |
| Status: | CLOSED NOTABUG | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3 | CC: | cluster-maint, eguan, tao |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | i686 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2005-12-06 16:18:17 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 174689 | | |
Description (dhe, 2005-11-17 08:45:41 UTC)
Issue: After running for a year and rebooting a node, the cluster displayed abnormal messages:

```
[root@gserver4 root]# clustat
Cluster Status - servers                            23:38:09
Cluster Quorum Status Unknown (Connection refused)

  Member             Status
  ------------------ ----------
  server             Unknown
  server_2           Unknown

No Quorum - Service States Unknown
----------------------------------
[root@server root]# service clumanager status
clumembd is stopped
cluquorumd is stopped
clulockd is stopped
clusvcmgrd is stopped
```

Then I started clumanager manually:

```
[root@server root]# service clumanager start
Starting Red Hat Cluster Manager...
Starting Quorum Daemon:                                    [  OK  ]
[root@server root]# service clumanager status
clumembd (pid 13043) is running...
cluquorumd (pid 13041) is running...
clulockd (pid 13049) is running...
clusvcmgrd is stopped
Note: Service manager is not running because this member is not
participating in the cluster quorum.
```

Then, the really weird stuff:

```
[root@server root]# clustat
Cluster Status - server                             23:39:51
Incarnation #0
(This member is not part of the cluster quorum)

  Member             Status
  ------------------ ----------
  server             Inactive
  server_2           Inactive   <-- You are here

No Quorum - Service States Unknown
```

---

**Lon Hohberger:**

The quorum partitions and participation in the cluster quorum are different things. If the quorum partitions were not accessible, cluquorumd would not start successfully, and clulockd could not run.

If it had been running for a year, chances are good that it's time to upgrade clumanager. It looks like you've got 1.2.22, which is actually a fine version, but I'd upgrade to fix the IP tiebreaker bug. I don't think this is your problem, but it's best to eliminate all possible causes.

Bottom line: it looks like the nodes are not communicating properly with one another. Check things like network cables, switch ports, iptables rules, etc. Oddly, it looks like a /quorum/ convergence problem, not a membership convergence problem.
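As an aside, the firewall check Lon suggests can be sketched with iptables itself. This is a sketch only: the interface name (eth1), multicast address (225.0.0.11), and peer address (10.53.80.37) are taken from the clumembd output in this report, and where the accept rules belong depends on the local ruleset.

```shell
# List INPUT rules with packet counters; a DROP/REJECT rule whose
# counter climbs on the cluster interface is the usual culprit.
iptables -L INPUT -n -v

# Sketch: explicitly accept the cluster heartbeat traffic
# (multicast address and peer IP as reported by clumembd; adapt).
iptables -I INPUT -i eth1 -d 225.0.0.11 -j ACCEPT
iptables -I INPUT -i eth1 -s 10.53.80.37 -j ACCEPT
```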
Try running "clumembd -fd" on both nodes with the cluster stopped and see if they converge. You will see something like "Membership View #2:0x00000003" if they converge, and 0x00000001 or 0x00000002 if not. I'd like to see that output.

---

**dhe:**

```
[root@gserver4 root]# clumembd -fd
[3161] debug: Starting up
[3161] debug: Setting configuration parameters.
[3161] debug: Overriding interval to be 750000
[3161] debug: Transmit thread set to ON
[3161] debug: Overriding TKO count to be 20
[3161] debug: Broadcast hearbeating set to OFF
[3161] debug: Multicast hearbeat ON
[3161] debug: Multicast address is 225.0.0.11
[3161] debug: I am member #1
[3161] debug: Interface IP is 127.0.0.1
[3161] debug: Interface IP is 10.0.0.2
[3161] debug: Interface IP is 10.53.80.38
[3161] debug: Setting up multicast 225.0.0.11 on 10.53.80.38
[3161] debug: Interface IP is 172.21.80.38
[3161] debug: Cluster I/F: eth1 [10.53.80.38]
[3161] debug: clumembd_start_watchdog: set duration to 14.
[3161] debug: Waiting for requests.
[3161] debug: Transmit thread: pulsar
[3161] notice: Member gserver3 UP
[3161] debug: Connect: Member #0 (10.53.80.37) [IPv4]
[3161] debug: MB: New connect: fd8
[3161] debug: MB: Received VF_MESSAGE, fd8
[3161] debug: VF_JOIN_VIEW from member #0! Key: 0x27456381 #2
[3161] debug: VF: Voting YES
[3161] debug: MB: Received VF_MESSAGE, fd8
[3161] debug: VF: Received VF_VIEW_FORMED, fd8
[3161] debug: VF: Commit Key 0x27456381 #2 from member #0
[3161] info: Membership View #2:0x00000003
############################################################
# Output was stopped here; entered CTRL-C to exit manually. #
############################################################
[3161] debug: clumembd_sw_watchdog_stop: successfully stopped wat
[root@gserver4 /]# clumembd -fd
[19281] debug: Starting up
[19281] debug: Setting configuration parameters.
```
```
[19281] debug: Overriding interval to be 750000
[19281] debug: Transmit thread set to ON
[19281] debug: Overriding TKO count to be 20
[19281] debug: Broadcast hearbeating set to OFF
[19281] debug: Multicast hearbeat ON
[19281] debug: Multicast address is 225.0.0.11
[19281] debug: I am member #1
[19281] debug: Interface IP is 127.0.0.1
[19281] debug: Interface IP is 10.0.0.2
[19281] debug: Interface IP is 10.53.80.38
[19281] debug: Setting up multicast 225.0.0.11 on 10.53.80.38
[19281] debug: Interface IP is 172.21.80.38
[19281] debug: Cluster I/F: eth1 [10.53.80.38]
[19281] debug: clumembd_start_watchdog: set duration to 14.
[19281] debug: Waiting for requests.
[19281] debug: Transmit thread: pulsar
[19281] notice: Member gserver4 UP
[1
```

---

**Lon Hohberger:**

Excellent; so the nodes can see each other and are communicating properly. If they could not, you would never see 0x00000003.

Starting clumanager and immediately running clustat will show the cluster as inquorate with both nodes inactive, and that is expected behavior. Could that have been the problem? Or did this persist for a long time afterward (requiring a restart)?

---

**dhe:**

Yes, it persisted for a long time.

---

Lon, I'm not sure why this BZ got opened out of process. It appears that they had iptables rules blocking the traffic, which meant the node couldn't finish joining. I'm closing the BZ.
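As a footnote, the membership view mask Lon refers to is a bitmask of member IDs, so convergence can be checked mechanically from captured clumembd output. A minimal sketch, with the sample log line taken from the output above; the sed pattern is an assumption about the log format:

```shell
# Sample "clumembd -fd" line showing the committed membership view.
line='[3161] info: Membership View #2:0x00000003'

# Extract the hex mask; bit N set means member #N is visible.
mask=$(( $(echo "$line" | sed 's/.*View #[0-9]*:\(0x[0-9a-fA-F]*\).*/\1/') ))

# For a two-node cluster, 0x00000003 (both bits set) means the nodes
# converged; 0x00000001 or 0x00000002 means each only sees itself.
if [ $(( mask & 3 )) -eq 3 ]; then
    echo "converged: both members visible"
else
    echo "not converged (mask=$mask)"
fi
```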