Red Hat Bugzilla – Bug 548133
rgmanager - Failed changing service status
Last modified: 2012-06-01 06:14:09 EDT
Description of problem:
When stopping cluster services (rgmanager, cman) on a node that is running an rgmanager service, it is possible to hit a timing issue that causes the service to fail to start on the other node with the message:
Failed changing service status
rgmanager is unable to update the service status while view formation is blocked by cman membership transitions.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Disable rgmanager (/etc/init.d/rgmanager stop) on one node of a multi-node cluster
2. Immediately disable cman on the same node (/etc/init.d/cman stop)
The surviving node is unable to take over the services because it cannot receive the acknowledgement from cman/ais on the failed node. This leads to timeouts in view formation, and the service terminates with an error.
Actual results:
Dec 15 14:49:48 ahost02 clurgmgrd: <notice> Member 1 shutting down
Dec 15 14:49:54 ahost02 clurgmgrd: <notice> Starting stopped service service:my-service
Dec 15 14:50:32 ahost02 clurgmgrd: <err> #75: Failed changing service status
Dec 15 14:50:32 ahost02 clurgmgrd: <notice> Stopping service service:my-service
Expected results:
Surviving nodes take over services as normal.
This problem may be worked around by inserting a delay between rgmanager shutting down and cman leaving the cluster (for example, putting a sleep into the stop() method of /etc/init.d/cman before the call to cman_tool leave). The sleep needs to be long enough to allow for the service start-up time plus two times the totem token timeout, so that the failing node has already been kicked out of the cluster and can no longer cause timeouts at view formation time.
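As a sketch of that delay calculation (the token value and service start-up time below are illustrative assumptions, not measurements from this bug):

```shell
#!/bin/sh
# Illustrative calculation of the sleep needed before "cman_tool leave"
# in the stop() method of /etc/init.d/cman. The values below are
# assumptions chosen for the example.
TOKEN_MS=30000   # /cluster/totem/@token, in milliseconds
START_SECS=10    # assumed worst-case service start-up time, in seconds

# Delay = service start time + 2 x totem token timeout
DELAY=$(( START_SECS + 2 * TOKEN_MS / 1000 ))
echo "$DELAY"    # seconds to sleep before calling cman_tool leave
```

With these example values the sleep works out to 70 seconds; a real deployment would substitute its own token timeout and an upper bound on service start time.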
Created attachment 378809 [details]
Created attachment 378810 [details]
The original patch was missing a free() call, which would have caused rgmanager to leak a small amount of memory (roughly the length of the totem/@token string) on each configuration change, adding up over a long period of time.
Created attachment 378811 [details]
Easy way to cause this to happen:
- create a two node cluster (quorum disk doesn't matter)
- set /cluster/totem/@token to a larger-than-default value, like 30000 (30 seconds)
- create a service. It does not have to have any resources.
- start both cluster nodes and rgmanager on both nodes
- determine which node is running the service using 'clustat'
- run 'service rgmanager stop; umount -at gfs; umount -at gfs2; service cman stop'
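The XPath /cluster/totem/@token in the steps above refers to /etc/cluster/cluster.conf. A minimal illustrative fragment (cluster name, version, and the omitted sections are placeholders, not values from this bug):

```xml
<!-- Illustrative fragment of /etc/cluster/cluster.conf -->
<cluster name="example" config_version="1">
  <totem token="30000"/>
  <!-- clusternodes, fencedevices, and rm sections omitted -->
</cluster>
```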
If CMAN stops correctly, rgmanager on the remaining node will hang and eventually produce an error in the system logs that it could not change the service state.
With the above patch, rgmanager waits, the node transition is recognized by openais, and then the service state is correctly updated.
This issue can largely be worked around by:
- service rgmanager stop
- wait for all services to complete failover
- service cman stop
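The "wait for failover" step could in principle be automated by polling clustat. A sketch, under the assumption that `clustat -x` prints XML in which each running service appears as a `<group>` element with an `owner="<node>"` attribute (the attribute name and the crude grep are assumptions, not verified against the clustat in this release):

```shell
#!/bin/sh
# Count services owned by a given node in clustat-style XML read
# from stdin. "grep -c" prints 0 when there is no match; "|| true"
# keeps the function's exit status at 0 in that case.
count_owned() {
    grep -c "owner=\"$1\"" || true
}

# Illustrative snippet standing in for real "clustat -x" output:
SNIPPET='<group name="service:my-service" owner="ahost02"/>'

# The real workaround would loop along these lines:
#   service rgmanager stop
#   while clustat -x | count_owned "$(hostname)" | grep -qv '^0$'; do
#       sleep 5
#   done
#   service cman stop
printf '%s\n' "$SNIPPET" | count_owned ahost01
```

Here the example node ahost01 owns no service in the snippet, so the count printed is 0 and the loop in the comment would fall through to stopping cman.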
Any news about when the errata will be released?
We are affected by this issue.
Next RHEL update.
Red Hat Support can assist if you require a fix earlier.
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~
RHEL 5.5 Beta has been released! There should be a fix present in this
release that addresses your request. Please test and report back results
here, by March 3rd 2010 (2010-03-03) or sooner.
Upon successful verification of this request, post your results and update
the Verified field in Bugzilla with the appropriate value.
If you encounter any issues while testing, please describe them and set
this bug into NEED_INFO. If you encounter new defects or have additional
patch(es) to request for inclusion, please clone this bug for each request
and escalate through your support representative.
Event posted on 02-17-2010 04:05pm JST by tumeya
HP verified the 5.5 beta.
HP verified (comment #14)
We've tested the rgmanager package in the 5.5 beta release and it's working fine.
An advisory has been issued which should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.