From Bugzilla Helper: User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20020823 Netscape/7.0

Description of problem:
Cluster configuration - 2-member cluster with an IP-based tiebreaker; soft quorum enabled by default for both members. Formation of soft quorum in the one-member-up case does not work if the network was down when the member came up and the network is later re-enabled.

Version-Release number of selected component (if applicable):

How reproducible: Always

Steps to Reproduce:
(1) Shut down member 2 completely.
(2) Shut down member 1.
(3) Unplug the network cable from member 1.
(4) Bring up member 1. In this case quorum is not formed (since the tiebreaker IP is not reachable).
(5) Plug the network cable back into member 1.
(6) A manual cluforce works, or if member 2 is brought up, quorum formation happens.

Actual Results: No quorum formed after step 5.

Expected Results: After step 5, the expectation was that quorum should be formed, because the soft quorum flag is enabled by default in cludb. However, no quorum was formed.

Additional info:
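For readers unfamiliar with the setup above: in a two-member cluster, a single live member holds exactly half the votes, and a reachable tiebreaker IP is what tips the balance. The following is a minimal Python sketch of that arithmetic, not clumanager's actual code; the function name, the vote counting, and the exact role of allow_soft here are assumptions for illustration only.

```python
def has_quorum(members_up: int, total_members: int,
               tb_online: bool, allow_soft: bool) -> bool:
    """Illustrative quorum decision for a small cluster with an IP tiebreaker.

    Hypothetical sketch; not clumanager source.
    """
    if 2 * members_up > total_members:
        # A real majority of members is quorate outright.
        return True
    if 2 * members_up == total_members:
        # Split vote (e.g. 1 of 2 members up): the tiebreaker IP decides.
        # Assumed semantics: a lone node only gets to use the tiebreaker
        # when the soft-quorum flag is enabled.
        return tb_online and allow_soft
    return False

# The reported bug, in these terms: the node boots with tb_online False
# (cable unplugged), so no quorum forms -- and replugging the cable did
# not cause the tiebreaker state to be re-evaluated.
```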
While this _is_ a bug, it's not a generally supported use case.
Created attachment 103178 [details]
Patch fixing several problems with net tiebreaker

This patch fixes the behavior described in this bug.
The above patch breaks the allow_soft option; a new patch is coming shortly.
Created attachment 103180 [details]
Corrected patch

Same as the above patch, except it enforces hard quorum when allow_soft is not set.
Patch tested with both allow_soft and non-allow-soft. Should work.
We tried the same experiments after applying this patch. We have 4 services to run on the 2-member cluster:
- svc1 and svc2 preferred on host1
- svc3 and svc4 preferred on host2

Following are the results:

TestCase 1: bring up host1 without network, while host2 is down. host1 comes up, but the cluster on it doesn't form a quorum. We put the network cable back, and the cluster on host1 forms a quorum. It then starts all the services (svc1, svc2, svc3, svc4) on itself, because host2 is down. This is correct; TestCase 1 PASSED.

TestCase 2: bring up host1 without network, while host2 is up. host1 comes up, but the cluster on it doesn't form a quorum. By this time, host2 is running all the services (svc1, svc2, svc3, svc4) because host1's cluster is not participating in the quorum. We put the network cable back, and the cluster on host1 forms a quorum (I'm assuming that host1 formed this quorum with host2 included in it). It then *surprisingly* starts all the services on itself (svc1, svc2, svc3, svc4), while only two of them, viz. svc1 and svc2, were expected to fail over to host1. This TestCase FAILED.

Did it happen because host1 formed a new quorum (with a higher VIEW-ID/incarnation number?) which excluded host2? We had to reboot host2. When host2 came up, it took its preferred services, viz. svc3 and svc4, back from host1.
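The expected placement in the test report above follows the usual preferred-member rule: a service runs on its preferred host when that host is a cluster member, and falls back to another member otherwise. A hedged Python sketch of that rule (the function name and tie-breaking are invented for illustration; this is not clumanager's failover-domain code):

```python
from typing import Optional, Set

def place_service(preferred: str, members_up: Set[str]) -> Optional[str]:
    """Pick the member a service should run on.

    Illustrative only: prefer the service's preferred member when it is
    up; otherwise fall back to any available member (lowest name, just
    to be deterministic); None if no members are up.
    """
    if preferred in members_up:
        return preferred
    return min(members_up) if members_up else None

# TestCase 2's expectation, expressed with this rule: once host1 rejoins,
# svc1/svc2 (preferred on host1) should move there, while svc3/svc4
# (preferred on host2) should stay on host2.
```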
I'm reasonably confident I know what's wrong with the patch, thanks for the feedback. Stay tuned.
Created attachment 103300 [details]
Implements async IP-tie-vote + fixes timing

There was a timing problem where the IP-tie-vote was getting declared 'online' before membership had a chance to converge, which caused the described problem. This patch should fix that problem.
Created attachment 103303 [details]
Implements async IP-tie-vote + fixes timing

There was a timing problem where the IP-tie-vote was getting declared 'online' before membership had a chance to converge, which caused the described problem. This patch should fix that problem. (Previous patch was generated against the wrong tree. I'm on top of things.)
I ran two cases of 'Test 2' (one with node 0 starting disconnected, one with node 1 starting disconnected) - services with single-node failover domains relocated properly in both cases.
TestCase 2 PASSED. But unfortunately TestCase 1 FAILED, so the actual problem described in this bug still exists.

Steps to Reproduce:
(1) Shut down node 2 completely.
(2) Shut down node 1.
(3) Unplug the network cable from node 1.
(4) Bring up node 1. In this case quorum is not formed (since the tiebreaker IP is not reachable).
(5) Plug the network cable back into node 1.
*** The quorum is still not formed.

Sorry for the trouble :)
It works for me...

Sep 1 11:07:15 magenta kernel: e100: eth0 NIC Link is Down
Sep 1 11:07:32 magenta clumanager: [17937]: <notice> Starting Red Hat Cluster Manager...
Sep 1 11:07:32 magenta kernel: ip_tables: (C) 2000-2002 Netfilter core team
Sep 1 11:07:32 magenta cluquorumd[17954]: <info> IPv4-TB: 192.168.0.254
Sep 1 11:07:32 magenta cluquorumd[17954]: <info> IPv4-TB: Interval 2 On:8 Off:2
Sep 1 11:07:32 magenta cluquorumd[17954]: <warning> Allowing soft quorum.
Sep 1 11:07:32 magenta cluquorumd[17954]: <info> STONITH: wti_nps at 192.168.0.15, port yellow controls yellow.lab.test.com
Sep 1 11:07:32 magenta cluquorumd[17954]: <info> STONITH: wti_nps at 192.168.0.15, port magenta controls magenta.lab.test.com
Sep 1 11:07:32 magenta clumanager: cluquorumd startup succeeded
Sep 1 11:07:43 magenta clumembd[17956]: <notice> Member magenta.lab.test.com UP
Sep 1 11:07:53 magenta kernel: e100: eth0 NIC Link is Up 100 Mbps Full duplex
Sep 1 11:08:25 magenta cluquorumd[17954]: <notice> IPv4 TB @ 192.168.0.254 Online
Sep 1 11:08:27 magenta cluquorumd[17954]: <notice> Quorum Formed; Starting Service Manager

You'll note it took a while (in this case, 32 seconds after plugging back in). ...
Sorry, my mistake. I had built the rpm from the wrong source. I tested both cases (TestCase 1 and TestCase 2) and it works!

Another observation I wanted to share with you:
1) Both node1 and node2 are up, and the network cable is unplugged from node1. The cluster on node1 reboots the host, and when it comes up, it has a cluster without quorum. Plugging the cable back in results in node1's cluster forming the quorum.
2) node2 is down, and node1 is up with all the services running on itself. Pull out the network cable, and the cluster on this node dissolves the quorum but doesn't reboot the host. Plugging the cable back in results in this node forming the quorum.

The discrepancy is that in case (1) the host that loses its network reboots, while in case (2) it just dissolves the quorum. Nothing is harmful about this behaviour. Is this intentional (because of design)?
Yes, it's intentional. If a node is quorate and becomes inquorate, the default behavior is to reboot. Unless:

(1) There was no change in actual membership resulting in the loss of quorum. The IP-tiebreaker vote is not an actual member of the cluster, so when you unplug the cable:

Case 1 above: "I lost one member and communication with the cluster quorum. Panic...". This is because it's assumed the other member which was lost is still quorate and will take over services. Because you have no power switches configured, the only thing the node can do is reboot.

Case 2 above: "I lost communication with the quorum, but there was no change in membership. Stop everything and wait for more members.". In this case, failover won't occur because there were no other members to take the services over.

(2) Power switches are configured. If power switches are configured, a node never reboots itself because of leaving the quorum. If it needs to be rebooted, someone else (who is still in the quorum) will do it.
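The rules above can be condensed into a small decision function. This is a hedged Python sketch of the described behavior (the parameter names and return strings are invented; the real logic lives in clumanager's quorum daemon):

```python
def on_quorum_lost(membership_changed: bool, have_power_switches: bool) -> str:
    """What a node does when it drops out of the quorum.

    Sketch of the behavior described in the comment above; not real code.
    """
    if have_power_switches:
        # A surviving quorate node will fence this one if needed,
        # so it never reboots itself just for leaving the quorum.
        return "stop services and wait"
    if membership_changed:
        # A member vanished along with quorum: assume the other side is
        # still quorate and will take over; reboot to get out of the way.
        return "reboot"
    # No membership change (only the tiebreaker IP went away):
    # stop everything and wait for more members.
    return "stop services and wait"
```

In these terms, the reporter's case (1) is membership_changed=True with no power switches (reboot), and case (2) is membership_changed=False (dissolve quorum and wait).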
1.2.18pre1 patch (unsupported; test only, etc.):
http://people.redhat.com/lhh/clumanager-1.2.16-1.2.18pre1.patch

This includes the fix for this bug and a few others.
Marking Verified. Haven't seen this in house, and the original reporter is satisfied with the fix. Will go out with RHEL3-U4, clumanager-1.2.22-2.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2004-491.html