Created attachment 451390 [details]
The cluster configuration used to reproduce the problem

Description of problem:
When a cluster is configured with a service consisting of an IP resource (for example; it could be any service independent of storage) and scsi fencing (for example; it could be any fence agent that does not power off the failed node), there are cases where the IP cannot be switched when storage problems occur. This always happens when one node loses access to storage: qdisk detects the problem and performs an emergency shutdown of the cluster node by having it leave the cluster. Recovery then starts, but the IP that was running on the node with disk problems still resides there and therefore cannot be switched. As a result, recovery fails.

Version-Release number of selected component (if applicable):
Cluster RHEL5/6, possibly also RHEL4

How reproducible:
Use the attached cluster.conf and separate the node hosting the service from shared storage.

Steps to Reproduce:
1. Power on a two-node cluster with shared storage, qdisk and, most importantly, non-power fencing
2. Configure a service with an IP resource
3. Separate the node where the resource is active from the storage

Actual results:
The IP fails to switch, as it is still running on the failed node.

Expected results:
The IP is switched successfully.

Additional info:
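For reference, a minimal sketch of the kind of configuration this describes (hypothetical node names and attribute values; the actual attachment 451390 may differ): an IP-only service, fence_scsi as the non-power-off fence agent, and a quorum disk.

```xml
<?xml version="1.0"?>
<cluster name="testcluster" config_version="1">
  <clusternodes>
    <clusternode name="node1" nodeid="1">
      <fence><method name="1"><device name="scsi" node="node1"/></method></fence>
    </clusternode>
    <clusternode name="node2" nodeid="2">
      <fence><method name="1"><device name="scsi" node="node2"/></method></fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <!-- fence_scsi revokes SCSI reservations; the node stays powered on -->
    <fencedevice name="scsi" agent="fence_scsi"/>
  </fencedevices>
  <quorumd interval="1" tko="10" label="qdisk"/>
  <rm>
    <!-- a service independent of storage: just an IP address -->
    <service name="ipsvc" autostart="1">
      <ip address="192.168.122.95" monitor_link="1"/>
    </service>
  </rm>
</cluster>
```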
Reproduced by simply freezing qdiskd with SIGSTOP (kill -STOP).
Created attachment 452437 [details] Proposed fix
Awaiting review. https://www.redhat.com/archives/cluster-devel/2010-October/msg00049.html
This was fixed in RHEL6 and STABLE31 branches months ago, as it turns out. Here is the backported fix: https://www.redhat.com/archives/cluster-devel/2010-October/msg00052.html On RHEL5, releasing the lockspace hangs during shutdown; the fix for that is here: https://www.redhat.com/archives/cluster-devel/2010-October/msg00053.html
(In reply to comment #7)
> On RHEL5, releasing the lockspace hangs during shutdown; the fix for that is
> here:

That is, if the lockspace is released after CMAN has exited during an unclean shutdown, rgmanager will hang in write() while trying to release it. So, the only way to have rgmanager exit is to skip the lockspace release during emergency shutdowns.
Detailed analysis:

If cman dies because it receives a kill packet (of doom) from other hosts, rgmanager does not notice. This can happen if, for example, you are using qdiskd and it hangs on I/O to the quorum disk due to frequent trespasses or other SAN interruptions. The other instance of qdiskd will ask CMAN to evict the hung node, causing it to be ejected from the cluster and fenced. Data is safe (which is the top priority).

If power-cycle fencing is in use, there is no issue at all; the node reboots and service failover occurs fairly quickly. However, problems can arise in the same hung-I/O situation if:

* storage-level fencing is in use, and
* rgmanager has one or more IP addresses in use as part of cluster services.

This is because more recent versions of the IP resource agent actually ping the IP address prior to bringing it online for use by services. This prevents accidental take-over of IP addresses in use by other hosts on the network due to an administrator mistake when setting up the cluster. Unfortunately, this behavior also prevents service failover if the presumed-dead host is still online.

This patch causes rgmanager to use poll() instead of select() when dealing with the baseline CMAN connection it uses for receiving membership changes and so forth. If the socket is closed by CMAN (either because of CMAN's death or for some other reason), rgmanager can now detect the closure and treats it as an emergency cluster shutdown request: it halts all services and exits as quickly as possible.

Unfortunately, there is a race between this emergency action and recovery on the surviving host. It is not possible for rgmanager to guarantee that all services will halt after the node has been fenced from shared storage (but before the other host attempts to start the service(s)).
Furthermore, a hung 'stop' request caused by loss of access to shared storage may very well cause rgmanager to hang forever, preventing some services (or parts of them) from ever actually being stopped. A main use case for storage-level fencing over power-cycling is the ability to perform a post-mortem root-cause analysis of what caused the node to die in the first place. This implies that having rgmanager kill the host would be an incorrect resolution.
Test results...

Dying host:

Oct 28 17:38:31 rhel5-1 openais[1914]: [SERV ] AIS Executive exiting (reason: CMAN kill requested, exiting).
Oct 28 17:38:32 rhel5-1 dlm_controld[1963]: cluster is down, exiting
Oct 28 17:38:32 rhel5-1 gfs_controld[1969]: groupd_dispatch error -1 errno 0
Oct 28 17:38:32 rhel5-1 gfs_controld[1969]: groupd connection died
Oct 28 17:38:32 rhel5-1 gfs_controld[1969]: cluster is down, exiting
Oct 28 17:38:32 rhel5-1 clurgmgrd[2829]: <warning> #67: Shutting down uncleanly
Oct 28 17:38:32 rhel5-1 fenced[1957]: cluster is down, exiting
Oct 28 17:38:32 rhel5-1 kernel: dlm: closing connection to node 2
Oct 28 17:38:32 rhel5-1 kernel: dlm: closing connection to node 1
Oct 28 17:38:32 rhel5-1 avahi-daemon[2457]: Withdrawing address record for 192.168.122.95 on eth0.
Oct 28 17:38:42 rhel5-1 clurgmgrd[2829]: <notice> Shutdown complete, exiting
...

Surviving host (note: empty1 has an IP address in it; it is not entirely empty):

Oct 28 17:38:31 rhel5-2 qdiskd[2042]: <notice> Writing eviction notice for node 1
Oct 28 17:38:32 rhel5-2 qdiskd[2042]: <notice> Node 1 evicted
Oct 28 17:38:52 rhel5-2 openais[2013]: [TOTEM] The token was lost in the OPERATIONAL state.
...
Oct 28 17:38:54 rhel5-2 openais[2013]: [TOTEM] Sending initial ORF token
Oct 28 17:38:54 rhel5-2 fenced[2056]: rhel5-1.lhh.pvt not a cluster member after 0 sec post_fail_delay
Oct 28 17:38:54 rhel5-2 kernel: dlm: closing connection to node 1
Oct 28 17:38:54 rhel5-2 fenced[2056]: fencing node "rhel5-1.lhh.pvt"
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ] CLM CONFIGURATION CHANGE
Oct 28 17:38:54 rhel5-2 fenced[2056]: fence "rhel5-1.lhh.pvt" success
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ] New Configuration:
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ] r(0) ip(192.168.122.91)
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ] Members Left:
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ] r(0) ip(192.168.122.90)
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ] Members Joined:
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ] CLM CONFIGURATION CHANGE
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ] New Configuration:
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ] r(0) ip(192.168.122.91)
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ] Members Left:
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ] Members Joined:
Oct 28 17:38:54 rhel5-2 openais[2013]: [SYNC ] This node is within the primary component and will provide service.
Oct 28 17:38:54 rhel5-2 openais[2013]: [TOTEM] entering OPERATIONAL state.
Oct 28 17:38:54 rhel5-2 openais[2013]: [CLM  ] got nodejoin message 192.168.122.91
Oct 28 17:38:54 rhel5-2 openais[2013]: [CPG  ] got joinlist message from node 2
Oct 28 17:38:59 rhel5-2 clurgmgrd[2863]: <notice> Taking over service service:empty1 from down member rhel5-1.lhh.pvt
Oct 28 17:39:01 rhel5-2 avahi-daemon[2546]: Registering new address record for 192.168.122.95 on eth0.
Oct 28 17:39:02 rhel5-2 clurgmgrd[2863]: <notice> Service service:empty1 started
http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=848f084f51e9890e41071e8a58b3878cedce0dd7 http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=38ca868f6ee9c7ebb4059b7a4983734935c80ca4
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0134.html
*** Bug 585210 has been marked as a duplicate of this bug. ***
It looks like this bug is still present in the rgmanager package rgmanager-2.0.52-9.el5_6.1.x86_64. The IP-address switch does not work as described in the description of this bug.