From Bugzilla Helper: User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20020823 Netscape/7.0

Description of problem:
Cluster configuration - 2-member cluster with an IP-based tiebreaker; soft quorum enabled by default for both members. Formation of soft quorum in the one-member-up case does not work if the network was down when the member came up and the network is later re-enabled.

Version-Release number of selected component (if applicable):

How reproducible: Always

Steps to Reproduce:
(1) Shut down member 2 completely.
(2) Shut down member 1.
(3) Unplug the network cable from member 1.
(4) Bring up member 1. In this case quorum is not formed (since the tiebreaker IP is not reachable).
(5) Plug the network cable back into member 1.
(6) A manual cluforce works, or if member 2 is brought up, quorum formation happens.

Actual Results: No quorum formed after step 5.

Expected Results: After step 5, the expectation was that quorum should be formed, because the soft quorum flag is enabled by default in cludb. However, no quorum was formed.

Additional info:
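For readers unfamiliar with the setup above: in a two-member cluster, a single live member holds exactly half the votes, and a reachable tiebreaker IP is what tips the balance. The following is a minimal Python sketch of that arithmetic, not clumanager's actual code; the function name, the vote counting, and the exact role of allow_soft here are assumptions for illustration only.

```python
def has_quorum(members_up: int, total_members: int,
               tb_online: bool, allow_soft: bool) -> bool:
    """Illustrative quorum decision for a small cluster with an IP tiebreaker.

    Hypothetical sketch; not clumanager source.
    """
    if 2 * members_up > total_members:
        # A real majority of members is quorate outright.
        return True
    if 2 * members_up == total_members:
        # Split vote (e.g. 1 of 2 members up): the tiebreaker IP decides.
        # Assumed semantics: a lone node only gets to use the tiebreaker
        # when the soft-quorum flag is enabled.
        return tb_online and allow_soft
    return False

# The reported bug, in these terms: the node boots with tb_online False
# (cable unplugged), so no quorum forms -- and replugging the cable did
# not cause the tiebreaker state to be re-evaluated.
```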
While this _is_ a bug, it's not a generally supported use case.
Created attachment 103178 [details]
Patch fixing several problems with net tiebreaker

This patch fixes the behavior described in this bug.
The above patch breaks the allow_soft option; a new patch is coming shortly.
Created attachment 103180 [details]
Corrected patch

Same as the above patch, except it enforces hard quorum when allow_soft is not set.
Patch tested with both allow_soft and non-allow-soft. Should work.
We tried the same experiments after applying this patch. We have 4 services to run on the 2-member cluster:
- svc1 and svc2 preferred on host1
- svc3 and svc4 preferred on host2

Following are the results:

TestCase 1: bring up host1 without network, while host2 is down. host1 comes up, but the cluster on it doesn't form a quorum. We put the network cable back, and the cluster on host1 forms a quorum. It then starts all the services (svc1, svc2, svc3, svc4) on itself, because host2 is down. This is correct; TestCase 1 PASSED.

TestCase 2: bring up host1 without network, while host2 is up. host1 comes up, but the cluster on it doesn't form a quorum. By this time, host2 is running all the services (svc1, svc2, svc3, svc4) because host1's cluster is not participating in the quorum. We put the network cable back, and the cluster on host1 forms a quorum (I'm assuming that host1 formed this quorum with host2 included in it). It then *surprisingly* starts all the services on itself (svc1, svc2, svc3, svc4), while only two of them, viz. svc1 and svc2, were expected to fail over to host1. This TestCase FAILED.

Did it happen because host1 formed a new quorum (with a higher VIEW-ID/incarnation number?) which excluded host2? We had to reboot host2. When host2 came up, it took its preferred services, viz. svc3 and svc4, back from host1.
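The expected placement in the test report above follows the usual preferred-member rule: a service runs on its preferred host when that host is a cluster member, and falls back to another member otherwise. A hedged Python sketch of that rule (the function name and tie-breaking are invented for illustration; this is not clumanager's failover-domain code):

```python
from typing import Optional, Set

def place_service(preferred: str, members_up: Set[str]) -> Optional[str]:
    """Pick the member a service should run on.

    Illustrative only: prefer the service's preferred member when it is
    up; otherwise fall back to any available member (lowest name, just
    to be deterministic); None if no members are up.
    """
    if preferred in members_up:
        return preferred
    return min(members_up) if members_up else None

# TestCase 2's expectation, expressed with this rule: once host1 rejoins,
# svc1/svc2 (preferred on host1) should move there, while svc3/svc4
# (preferred on host2) should stay on host2.
```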
I'm reasonably confident I know what's wrong with the patch, thanks for the feedback. Stay tuned.
Created attachment 103300 [details]
Implements async IP-tie-vote + fixes timing

There was a timing problem where the IP-tie-vote was getting declared 'online' before membership had a chance to converge, which caused the described problem. This patch should fix that problem.
Created attachment 103303 [details]
Implements async IP-tie-vote + fixes timing

There was a timing problem where the IP-tie-vote was getting declared 'online' before membership had a chance to converge, which caused the described problem. This patch should fix that problem. (Previous patch was generated against the wrong tree. I'm on top of things.)
I ran two cases of 'Test 2' (one with node 0 starting disconnected, one with node 1 starting disconnected) - services with single-node failover domains relocated properly in both cases.
TestCase 2 PASSED. But unfortunately TestCase 1 FAILED, so the actual problem described in this bug still exists.

Steps to Reproduce:
(1) Shut down node 2 completely.
(2) Shut down node 1.
(3) Unplug the network cable from node 1.
(4) Bring up node 1. In this case quorum is not formed (since the tiebreaker IP is not reachable).
(5) Plug the network cable back into node 1.
*** The quorum is still not formed.

Sorry for the trouble :)
It works for me...

Sep 1 11:07:15 magenta kernel: e100: eth0 NIC Link is Down
Sep 1 11:07:32 magenta clumanager: [17937]: <notice> Starting Red Hat Cluster Manager...
Sep 1 11:07:32 magenta kernel: ip_tables: (C) 2000-2002 Netfilter core team
Sep 1 11:07:32 magenta cluquorumd[17954]: <info> IPv4-TB: 192.168.0.254
Sep 1 11:07:32 magenta cluquorumd[17954]: <info> IPv4-TB: Interval 2 On:8 Off:2
Sep 1 11:07:32 magenta cluquorumd[17954]: <warning> Allowing soft quorum.
Sep 1 11:07:32 magenta cluquorumd[17954]: <info> STONITH: wti_nps at 192.168.0.15, port yellow controls yellow.lab.test.com
Sep 1 11:07:32 magenta cluquorumd[17954]: <info> STONITH: wti_nps at 192.168.0.15, port magenta controls magenta.lab.test.com
Sep 1 11:07:32 magenta clumanager: cluquorumd startup succeeded
Sep 1 11:07:43 magenta clumembd[17956]: <notice> Member magenta.lab.test.com UP
Sep 1 11:07:53 magenta kernel: e100: eth0 NIC Link is Up 100 Mbps Full duplex
Sep 1 11:08:25 magenta cluquorumd[17954]: <notice> IPv4 TB @ 192.168.0.254 Online
Sep 1 11:08:27 magenta cluquorumd[17954]: <notice> Quorum Formed; Starting Service Manager

You'll note it took a while (in this case, 32 seconds after plugging back in). ...
Sorry, my mistake. I had built the rpm from the wrong source. I tested both cases (TestCase 1 and TestCase 2) and it works!

Another observation I wanted to share with you:
1) Both node1 and node2 are up, and the network cable is unplugged from node1. The cluster on node1 reboots the host, and when it comes up, it has a cluster without quorum. Plugging the cable back in results in node1's cluster forming the quorum.
2) node2 is down, and node1 is up with all the services running on itself. Pull out the network cable, and the cluster on this node dissolves the quorum but doesn't reboot the host. Plugging the cable back in results in this node forming the quorum.

The discrepancy is that in case (1) the host that loses its network reboots, while in case (2) it just dissolves the quorum. Nothing is harmful about this behaviour. Is this intentional (because of design)?
Yes, it's intentional. If a node is quorate and becomes inquorate, the default behavior is to reboot. Unless:

(1) There was no change in actual membership resulting in the loss of quorum. The IP-tiebreaker vote is not an actual member of the cluster, so when you unplug the cable:

Case 1 above: "I lost one member and communication with the cluster quorum. Panic...". This is because it's assumed the other member which was lost is still quorate and will take over services. Because you have no power switches configured, the only thing the node can do is reboot.

Case 2 above: "I lost communication with the quorum, but there was no change in membership. Stop everything and wait for more members.". In this case, failover won't occur because there were no other members to take the services over.

(2) Power switches are configured. If power switches are configured, a node never reboots itself because of leaving the quorum. If it needs to be rebooted, someone else (who is still in the quorum) will do it.
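The rules above can be condensed into a small decision function. This is a hedged Python sketch of the described behavior (the parameter names and return strings are invented; the real logic lives in clumanager's quorum daemon):

```python
def on_quorum_lost(membership_changed: bool, have_power_switches: bool) -> str:
    """What a node does when it drops out of the quorum.

    Sketch of the behavior described in the comment above; not real code.
    """
    if have_power_switches:
        # A surviving quorate node will fence this one if needed,
        # so it never reboots itself just for leaving the quorum.
        return "stop services and wait"
    if membership_changed:
        # A member vanished along with quorum: assume the other side is
        # still quorate and will take over; reboot to get out of the way.
        return "reboot"
    # No membership change (only the tiebreaker IP went away):
    # stop everything and wait for more members.
    return "stop services and wait"
```

In these terms, the reporter's case (1) is membership_changed=True with no power switches (reboot), and case (2) is membership_changed=False (dissolve quorum and wait).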
1.2.18pre1 patch (unsupported; test only, etc.):
http://people.redhat.com/lhh/clumanager-1.2.16-1.2.18pre1.patch

This includes the fix for this bug and a few others.
Marking Verified. Haven't seen this in house, and the original reporter is satisfied with the fix. Will go out with RHEL3-U4, clumanager-1.2.22-2.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2004-491.html