Bug 639961 - Failover of service fails if non power fencing is used with qdisk
Summary: Failover of service fails if non power fencing is used with qdisk
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: rgmanager
Version: 5.5
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Assignee: Lon Hohberger
QA Contact: Cluster QE
URL:
Whiteboard:
Duplicates: 585210
Depends On:
Blocks:
 
Reported: 2010-10-04 12:35 UTC by Marc Grimme
Modified: 2011-11-29 12:44 UTC (History)
CC List: 7 users

Fixed In Version: rgmanager-2.0.52-8.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 641995 (view as bug list)
Environment:
Last Closed: 2011-01-13 23:27:18 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
The cluster configuration used to reproduce the problem (1.88 KB, text/plain)
2010-10-04 12:35 UTC, Marc Grimme
Proposed fix (2.58 KB, patch)
2010-10-08 21:18 UTC, Lon Hohberger


Links
System ID: Red Hat Product Errata RHBA-2011:0134
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: rgmanager bug fix and enhancement update
Last Updated: 2011-01-12 19:20:47 UTC

Description Marc Grimme 2010-10-04 12:35:20 UTC
Created attachment 451390 [details]
The cluster configuration used to reproduce the problem

Description of problem:
When a cluster is configured with a service consisting of an IP address (for example; it could also be any service that does not depend on storage) and scsi_fencing is used (for example; it could also be any fence agent that does not power off the faulty node), there are cases where the IP cannot be switched when storage problems occur.

This always happens if one node loses access to storage. Qdisk will detect the problem and perform an emergency shutdown of the cluster node by leaving the cluster. Recovery then starts, but the IP that was running on the node with the disk problems still resides there and therefore cannot be switched.

As a result, the recovery fails.

Version-Release number of selected component (if applicable):
Cluster on RHEL 5/6; possibly RHEL 4 as well.

How reproducible:
Use the attached cluster.conf and separate the node hosting the service from shared storage.

Steps to Reproduce:
1. Power on a two-node cluster with shared storage, qdisk, and (most importantly) non-power fencing
2. Configure a service with an IP resource
3. Separate the node where the resource is active from the storage
  
Actual results:
The IP fails to be switched because it is still running on the failing node.

Expected results:
The IP should be switched successfully.


Additional info:

Comment 1 Lon Hohberger 2010-10-06 17:36:17 UTC
Reproduced by simply killing qdiskd with -STOP.

Comment 2 Lon Hohberger 2010-10-08 21:18:46 UTC
Created attachment 452437 [details]
Proposed fix

Comment 6 Lon Hohberger 2010-10-27 21:20:04 UTC
Awaiting review.

https://www.redhat.com/archives/cluster-devel/2010-October/msg00049.html

Comment 7 Lon Hohberger 2010-10-28 21:20:38 UTC
This was fixed in RHEL6 and STABLE31 branches months ago, as it turns out.  Here is the backported fix:

https://www.redhat.com/archives/cluster-devel/2010-October/msg00052.html

On RHEL5, releasing the lockspace hangs during shutdown; the fix for that is here:

https://www.redhat.com/archives/cluster-devel/2010-October/msg00053.html

Comment 8 Lon Hohberger 2010-10-28 21:24:42 UTC
(In reply to comment #7)

> On RHEL5, releasing the lockspace hangs during shutdown; the fix for that is
> here:

That is, when rgmanager tries to release the lockspace after CMAN has exited during an unclean shutdown, it hangs in write().  So, the only way to have rgmanager exit is to skip the lockspace release during emergency shutdowns.
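
For illustration, a minimal C sketch of the kind of change described here; this is not the actual patch, and the emergency flag and the "rgmanager" lockspace name are assumptions:

/* Sketch only: skip the DLM lockspace release when exiting because the
 * cluster is already gone, since the release would otherwise hang in
 * write().  The 'emergency' flag and the lockspace name are assumptions. */
#include <stdio.h>
#include <libdlm.h>

void shutdown_locking(dlm_lshandle_t ls, int emergency)
{
	if (emergency) {
		/* CMAN has already exited: releasing the lockspace would
		 * block forever, so leave cleanup to process exit. */
		fprintf(stderr, "Emergency shutdown: skipping lockspace release\n");
		return;
	}

	/* Clean shutdown: release the lockspace normally. */
	if (dlm_release_lockspace("rgmanager", ls, 1) != 0)
		perror("dlm_release_lockspace");
}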

Comment 9 Lon Hohberger 2010-10-28 21:27:23 UTC
Detailed analysis:

If cman dies because it receives a kill packet (of doom)
from other hosts, rgmanager does not notice.  This can
happen if, for example, you are using qdiskd and it hangs
on I/O to the quorum disk due to frequent trespasses or
other SAN interruptions.  The other instance of qdiskd
will ask CMAN to evict the hung node, causing it to be
ejected from the cluster and fenced.

Data is safe (which is the top priority).  If power-cycle
fencing is in use, there is no issue at all; the node
reboots and service failover occurs fairly quickly.

However, problems can arise if, in the same hung-I/O
situation:

 * storage-level fencing is in use

 * rgmanager has one or more IP addresses in use
   as part of cluster services.

This is because more recent versions of the IP resource
agent actually ping the IP address prior to bringing it
online for use by services.  This prevents accidental
take-over of IP addresses in use by other hosts on the
network due to an administrator mistake when setting up
the cluster.

Unfortunately, this behavior also prevents service
failover if the presumed-dead host is still online.
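
As an illustration only (the real check lives in the IP resource agent, which is a shell script; this C sketch merely mirrors the idea, and the ping invocation and names are assumptions):

/* Sketch of the "ping before takeover" idea described above.  Returns 1
 * if some host already answers on 'addr', 0 otherwise. */
#include <stdio.h>
#include <stdlib.h>

static int address_in_use(const char *addr)
{
	char cmd[256];

	/* One probe with a short timeout; exit status 0 means a reply. */
	snprintf(cmd, sizeof(cmd), "ping -c 1 -W 1 %s >/dev/null 2>&1", addr);
	return system(cmd) == 0;
}

int main(void)
{
	const char *service_ip = "192.168.122.95";  /* example address */

	if (address_in_use(service_ip)) {
		/* The presumed-dead host still answers, so the address is
		 * not brought online and failover fails. */
		fprintf(stderr, "%s still in use, refusing to start\n", service_ip);
		return 1;
	}
	printf("%s is free, bringing it online\n", service_ip);
	return 0;
}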

This patch causes rgmanager to use poll() instead of
select() when dealing with the baseline CMAN connection
it uses for receiving membership changes and so forth.

If the socket is closed by CMAN (either by CMAN's death
or some other reason), rgmanager can now detect and act
upon that: it treats the closure as an emergency cluster
shutdown request, and will halt all services and exit as
quickly as possible.
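
A minimal C sketch of that poll()-based detection; cman_fd, emergency_shutdown() and dispatch_cman_event() are illustrative names, not rgmanager's real identifiers:

/* Sketch only: watch the CMAN socket with poll() and treat a hang-up or
 * error on it as an emergency cluster shutdown request. */
#include <poll.h>
#include <stdio.h>

static void emergency_shutdown(void)       /* assumed: stop services, exit */
{
	fprintf(stderr, "CMAN connection lost; emergency shutdown\n");
}

static void dispatch_cman_event(int fd)    /* assumed: handle membership data */
{
	(void)fd;
}

static void wait_for_cman_event(int cman_fd)
{
	struct pollfd pfd = { .fd = cman_fd, .events = POLLIN };
	int n = poll(&pfd, 1, 1000 /* ms */);

	if (n < 0) {
		perror("poll");
		return;
	}
	if (n == 0)
		return;                    /* timeout, nothing happened */

	if (pfd.revents & (POLLHUP | POLLERR | POLLNVAL)) {
		/* CMAN closed the socket (e.g. it was killed): unlike the
		 * old select() path, this is detected explicitly and the
		 * node stops all services and exits as fast as possible. */
		emergency_shutdown();
		return;
	}

	if (pfd.revents & POLLIN)
		dispatch_cman_event(cman_fd);
}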

Unfortunately, there is a race between this emergency
action and recovery on the surviving host.  It is not
possible for rgmanager to guarantee that all services will
halt after the node has been fenced from shared storage
(but before the other host attempts to start the
service(s)).

Furthermore, a hung 'stop' request caused by loss of
access to shared storage may very well cause rgmanager
to hang forever, preventing some services (or parts)
from ever actually being killed.

A main use case for storage-level fencing over power-
cycling is the ability to perform a post-mortem RCA of
what caused the node to die in the first place.  This
implies that having rgmanager kill the host would be an
incorrect resolution.

Comment 10 Lon Hohberger 2010-10-28 21:42:27 UTC
Test results... Dying host:

Oct 28 17:38:31 rhel5-1 openais[1914]: [SERV ] AIS Executive exiting (reason: CMAN kill requested, exiting). 
Oct 28 17:38:32 rhel5-1 dlm_controld[1963]: cluster is down, exiting
Oct 28 17:38:32 rhel5-1 gfs_controld[1969]: groupd_dispatch error -1 errno 0
Oct 28 17:38:32 rhel5-1 gfs_controld[1969]: groupd connection died
Oct 28 17:38:32 rhel5-1 gfs_controld[1969]: cluster is down, exiting
Oct 28 17:38:32 rhel5-1 clurgmgrd[2829]: <warning> #67: Shutting down uncleanly 
Oct 28 17:38:32 rhel5-1 fenced[1957]: cluster is down, exiting
Oct 28 17:38:32 rhel5-1 kernel: dlm: closing connection to node 2
Oct 28 17:38:32 rhel5-1 kernel: dlm: closing connection to node 1
Oct 28 17:38:32 rhel5-1 avahi-daemon[2457]: Withdrawing address record for 192.168.122.95 on eth0.
Oct 28 17:38:42 rhel5-1 clurgmgrd[2829]: <notice> Shutdown complete, exiting

...

Surviving host (note: empty1 has an IP address in it; it is not entirely empty):

Oct 28 17:38:31 rhel5-2 qdiskd[2042]: <notice> Writing eviction notice for node 1
Oct 28 17:38:32 rhel5-2 qdiskd[2042]: <notice> Node 1 evicted
Oct 28 17:38:52 rhel5-2 openais[2013]: [TOTEM] The token was lost in the OPERATIONAL state.
...
Oct 28 17:38:54 rhel5-2 openais[2013]: [TOTEM] Sending initial ORF token
Oct 28 17:38:54 rhel5-2 fenced[2056]: rhel5-1.lhh.pvt not a cluster member after 0 sec post_fail_delay
Oct 28 17:38:54 rhel5-2 kernel: dlm: closing connection to node 1
Oct 28 17:38:54 rhel5-2 fenced[2056]: fencing node "rhel5-1.lhh.pvt"
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ] CLM CONFIGURATION CHANGE
Oct 28 17:38:54 rhel5-2 fenced[2056]: fence "rhel5-1.lhh.pvt" success
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ] New Configuration:
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ]  r(0) ip(192.168.122.91)
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ] Members Left:
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ]  r(0) ip(192.168.122.90)
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ] Members Joined:
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ] CLM CONFIGURATION CHANGE
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ] New Configuration:
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ]  r(0) ip(192.168.122.91)
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ] Members Left:
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ] Members Joined:
Oct 28 17:38:54 rhel5-2 openais[2013]: [SYNC ] This node is within the primary component and will provide service.
Oct 28 17:38:54 rhel5-2 openais[2013]: [TOTEM] entering OPERATIONAL state.
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ] got nodejoin message 192.168.122.91
Oct 28 17:38:54 rhel5-2 openais[2013]: [CPG  ] got joinlist message from node 2
Oct 28 17:38:59 rhel5-2 clurgmgrd[2863]: <notice> Taking over service service:empty1 from down member rhel5-1.lhh.pvt
Oct 28 17:39:01 rhel5-2 avahi-daemon[2546]: Registering new address record for 192.168.122.95 on eth0.
Oct 28 17:39:02 rhel5-2 clurgmgrd[2863]: <notice> Service service:empty1 started

Comment 13 errata-xmlrpc 2011-01-13 23:27:18 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0134.html

Comment 14 Lon Hohberger 2011-01-25 15:10:12 UTC
*** Bug 585210 has been marked as a duplicate of this bug. ***

Comment 15 rauch 2011-11-29 12:44:15 UTC
It looks like this bug is still present in the rgmanager package rgmanager-2.0.52-9.el5_6.1.x86_64.

The IP address switchover does not work as described in the description of this bug.

