Bug 639961 - Failover of service fails if non power fencing is used with qdisk
Summary: Failover of service fails if non power fencing is used with qdisk
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: rgmanager
Version: 5.5
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Assignee: Lon Hohberger
QA Contact: Cluster QE
URL:
Whiteboard:
Duplicates: 585210
Depends On:
Blocks:
 
Reported: 2010-10-04 12:35 UTC by Marc Grimme
Modified: 2011-11-29 12:44 UTC (History)
CC List: 7 users

Fixed In Version: rgmanager-2.0.52-8.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 641995 (view as bug list)
Environment:
Last Closed: 2011-01-13 23:27:18 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
The cluster configuration used to reproduce the problem (1.88 KB, text/plain)
2010-10-04 12:35 UTC, Marc Grimme
Proposed fix (2.58 KB, patch)
2010-10-08 21:18 UTC, Lon Hohberger


Links
System ID: Red Hat Product Errata RHBA-2011:0134
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: rgmanager bug fix and enhancement update
Last Updated: 2011-01-12 19:20:47 UTC

Description Marc Grimme 2010-10-04 12:35:20 UTC
Created attachment 451390 [details]
The cluster configuration used to reproduce the problem

Description of problem:
When a cluster is configured with a service consisting of an IP address (for example; it could also be any service that does not depend on storage) and scsi_fencing is used (for example; it could also be any fence agent that does not power off the faulty node), there are cases where the IP cannot be switched when storage problems occur.

This always happens if one node loses access to storage. Qdisk will detect the problem and perform an emergency shutdown of the cluster node by leaving the cluster. Recovery then starts, but the IP that was running on the node with the disk problems still resides there and therefore cannot be switched.

As a result, the recovery fails.

Version-Release number of selected component (if applicable):
Cluster on RHEL 5/6; possibly RHEL 4 as well.

How reproducible:
Use the attached cluster.conf and separate the node hosting the service from shared storage.

Steps to Reproduce:
1. Power on a two-node cluster with shared storage, qdisk, and (most importantly) non-power fencing
2. Configure a service with an IP resource
3. Separate the node where the resource is active from the storage
  
Actual results:
The IP fails to be switched because it is still running on the failing node.

Expected results:
The IP should be switched successfully.


Additional info:

Comment 1 Lon Hohberger 2010-10-06 17:36:17 UTC
Reproduced by simply killing qdiskd with -STOP.

Comment 2 Lon Hohberger 2010-10-08 21:18:46 UTC
Created attachment 452437 [details]
Proposed fix

Comment 6 Lon Hohberger 2010-10-27 21:20:04 UTC
Awaiting review.

https://www.redhat.com/archives/cluster-devel/2010-October/msg00049.html

Comment 7 Lon Hohberger 2010-10-28 21:20:38 UTC
This was fixed in RHEL6 and STABLE31 branches months ago, as it turns out.  Here is the backported fix:

https://www.redhat.com/archives/cluster-devel/2010-October/msg00052.html

On RHEL5, releasing the lockspace hangs during shutdown; the fix for that is here:

https://www.redhat.com/archives/cluster-devel/2010-October/msg00053.html

Comment 8 Lon Hohberger 2010-10-28 21:24:42 UTC
(In reply to comment #7)

> On RHEL5, releasing the lockspace hangs during shutdown; the fix for that is
> here:

That is, when rgmanager tries to release the lockspace after CMAN has exited during an unclean shutdown, it hangs in write().  So, the only way to have rgmanager exit is to skip the lockspace release during emergency shutdowns.
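
For illustration, a minimal C sketch of the kind of change described here; this is not the actual patch, and the emergency flag and the "rgmanager" lockspace name are assumptions:

/* Sketch only: skip the DLM lockspace release when exiting because the
 * cluster is already gone, since the release would otherwise hang in
 * write().  The 'emergency' flag and the lockspace name are assumptions. */
#include <stdio.h>
#include <libdlm.h>

void shutdown_locking(dlm_lshandle_t ls, int emergency)
{
	if (emergency) {
		/* CMAN has already exited: releasing the lockspace would
		 * block forever, so leave cleanup to process exit. */
		fprintf(stderr, "Emergency shutdown: skipping lockspace release\n");
		return;
	}

	/* Clean shutdown: release the lockspace normally. */
	if (dlm_release_lockspace("rgmanager", ls, 1) != 0)
		perror("dlm_release_lockspace");
}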

Comment 9 Lon Hohberger 2010-10-28 21:27:23 UTC
Detailed analysis:

If cman dies because it receives a kill packet (of doom)
from other hosts, rgmanager does not notice.  This can
happen if, for example, you are using qdiskd and it hangs
on I/O to the quorum disk due to frequent trespasses or
other SAN interruptions.  The other instance of qdiskd
will ask CMAN to evict the hung node, causing it to be
ejected from the cluster and fenced.

Data is safe (which is the top priority).  If power-cycle
fencing is in use, there is no issue at all; the node
reboots and service failover occurs fairly quickly.

However, problems can arise if, in the same hung-I/O
situation:

 * storage-level fencing is in use

 * rgmanager has one or more IP addresses in use
   as part of cluster services.

This is because more recent versions of the IP resource
agent actually ping the IP address prior to bringing it
online for use by services.  This prevents accidental
take-over of IP addresses in use by other hosts on the
network due to an administrator mistake when setting up
the cluster.

Unfortunately, this behavior also prevents service
failover if the presumed-dead host is still online.
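
As an illustration only (the real check lives in the IP resource agent, which is a shell script; this C sketch merely mirrors the idea, and the ping invocation and names are assumptions):

/* Sketch of the "ping before takeover" idea described above.  Returns 1
 * if some host already answers on 'addr', 0 otherwise. */
#include <stdio.h>
#include <stdlib.h>

static int address_in_use(const char *addr)
{
	char cmd[256];

	/* One probe with a short timeout; exit status 0 means a reply. */
	snprintf(cmd, sizeof(cmd), "ping -c 1 -W 1 %s >/dev/null 2>&1", addr);
	return system(cmd) == 0;
}

int main(void)
{
	const char *service_ip = "192.168.122.95";  /* example address */

	if (address_in_use(service_ip)) {
		/* The presumed-dead host still answers, so the address is
		 * not brought online and failover fails. */
		fprintf(stderr, "%s still in use, refusing to start\n", service_ip);
		return 1;
	}
	printf("%s is free, bringing it online\n", service_ip);
	return 0;
}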

This patch causes rgmanager to use poll() instead of
select() when dealing with the baseline CMAN connection
it uses for receiving membership changes and so forth.

If the socket is closed by CMAN (either by CMAN's death
or some other reason), rgmanager can now detect and act
upon that: it treats the closure as an emergency cluster
shutdown request, and will halt all services and exit as
quickly as possible.
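
A minimal C sketch of that poll()-based detection; cman_fd, emergency_shutdown() and dispatch_cman_event() are illustrative names, not rgmanager's real identifiers:

/* Sketch only: watch the CMAN socket with poll() and treat a hang-up or
 * error on it as an emergency cluster shutdown request. */
#include <poll.h>
#include <stdio.h>

static void emergency_shutdown(void)       /* assumed: stop services, exit */
{
	fprintf(stderr, "CMAN connection lost; emergency shutdown\n");
}

static void dispatch_cman_event(int fd)    /* assumed: handle membership data */
{
	(void)fd;
}

static void wait_for_cman_event(int cman_fd)
{
	struct pollfd pfd = { .fd = cman_fd, .events = POLLIN };
	int n = poll(&pfd, 1, 1000 /* ms */);

	if (n < 0) {
		perror("poll");
		return;
	}
	if (n == 0)
		return;                    /* timeout, nothing happened */

	if (pfd.revents & (POLLHUP | POLLERR | POLLNVAL)) {
		/* CMAN closed the socket (e.g. it was killed): unlike the
		 * old select() path, this is detected explicitly and the
		 * node stops all services and exits as fast as possible. */
		emergency_shutdown();
		return;
	}

	if (pfd.revents & POLLIN)
		dispatch_cman_event(cman_fd);
}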

Unfortunately, there is a race between this emergency
action and recovery on the surviving host.  It is not
possible for rgmanager to guarantee that all services will
halt after the node has been fenced from shared storage
(but before the other host attempts to start the
service(s)).

Furthermore, a hung 'stop' request caused by loss of
access to shared storage may very well cause rgmanager
to hang forever, preventing some services (or parts)
from ever actually being killed.

A main use case for storage-level fencing over power-
cycling is the ability to perform a post-mortem RCA of
what caused the node to die in the first place.  This
implies that having rgmanager kill the host would be an
incorrect resolution.

Comment 10 Lon Hohberger 2010-10-28 21:42:27 UTC
Test results... Dying host:

Oct 28 17:38:31 rhel5-1 openais[1914]: [SERV ] AIS Executive exiting (reason: CMAN kill requested, exiting). 
Oct 28 17:38:32 rhel5-1 dlm_controld[1963]: cluster is down, exiting
Oct 28 17:38:32 rhel5-1 gfs_controld[1969]: groupd_dispatch error -1 errno 0
Oct 28 17:38:32 rhel5-1 gfs_controld[1969]: groupd connection died
Oct 28 17:38:32 rhel5-1 gfs_controld[1969]: cluster is down, exiting
Oct 28 17:38:32 rhel5-1 clurgmgrd[2829]: <warning> #67: Shutting down uncleanly 
Oct 28 17:38:32 rhel5-1 fenced[1957]: cluster is down, exiting
Oct 28 17:38:32 rhel5-1 kernel: dlm: closing connection to node 2
Oct 28 17:38:32 rhel5-1 kernel: dlm: closing connection to node 1
Oct 28 17:38:32 rhel5-1 avahi-daemon[2457]: Withdrawing address record for 192.168.122.95 on eth0.
Oct 28 17:38:42 rhel5-1 clurgmgrd[2829]: <notice> Shutdown complete, exiting

...

Surviving host (note: empty1 has an IP address in it; it is not entirely empty):

Oct 28 17:38:31 rhel5-2 qdiskd[2042]: <notice> Writing eviction notice for node 1
Oct 28 17:38:32 rhel5-2 qdiskd[2042]: <notice> Node 1 evicted
Oct 28 17:38:52 rhel5-2 openais[2013]: [TOTEM] The token was lost in the OPERATIONAL state.
...
Oct 28 17:38:54 rhel5-2 openais[2013]: [TOTEM] Sending initial ORF token
Oct 28 17:38:54 rhel5-2 fenced[2056]: rhel5-1.lhh.pvt not a cluster member after 0 sec post_fail_delay
Oct 28 17:38:54 rhel5-2 kernel: dlm: closing connection to node 1
Oct 28 17:38:54 rhel5-2 fenced[2056]: fencing node "rhel5-1.lhh.pvt"
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ] CLM CONFIGURATION CHANGE
Oct 28 17:38:54 rhel5-2 fenced[2056]: fence "rhel5-1.lhh.pvt" success
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ] New Configuration:
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ]  r(0) ip(192.168.122.91)
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ] Members Left:
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ]  r(0) ip(192.168.122.90)
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ] Members Joined:
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ] CLM CONFIGURATION CHANGE
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ] New Configuration:
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ]  r(0) ip(192.168.122.91)
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ] Members Left:
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ] Members Joined:
Oct 28 17:38:54 rhel5-2 openais[2013]: [SYNC ] This node is within the primary component and will provide service.
Oct 28 17:38:54 rhel5-2 openais[2013]: [TOTEM] entering OPERATIONAL state.
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ] got nodejoin message 192.168.122.91
Oct 28 17:38:54 rhel5-2 openais[2013]: [CPG  ] got joinlist message from node 2
Oct 28 17:38:59 rhel5-2 clurgmgrd[2863]: <notice> Taking over service service:empty1 from down member rhel5-1.lhh.pvt
Oct 28 17:39:01 rhel5-2 avahi-daemon[2546]: Registering new address record for 192.168.122.95 on eth0.
Oct 28 17:39:02 rhel5-2 clurgmgrd[2863]: <notice> Service service:empty1 started

Comment 13 errata-xmlrpc 2011-01-13 23:27:18 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0134.html

Comment 14 Lon Hohberger 2011-01-25 15:10:12 UTC
*** Bug 585210 has been marked as a duplicate of this bug. ***

Comment 15 rauch 2011-11-29 12:44:15 UTC
It looks like this bug is still present in the rgmanager package rgmanager-2.0.52-9.el5_6.1.x86_64.

The IP address switchover does not work as described in the description of this bug.

