Bug 548133 - rgmanager - Failed changing service status
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: rgmanager
Version: 5.4
Hardware: All Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Lon Hohberger
QA Contact: Cluster QE
Depends On:
Blocks:
Reported: 2009-12-16 13:03 EST by Bryn M. Reeves
Modified: 2012-06-01 06:14 EDT
CC List: 6 users

See Also:
Fixed In Version: rgmanager-2.0.52-1.33.el5
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 566732 569953
Environment:
Last Closed: 2010-03-30 04:48:04 EDT

Attachments
Fix (5.14 KB, patch)
2009-12-16 13:14 EST, Lon Hohberger
Fixed fix (5.14 KB, patch)
2009-12-16 13:20 EST, Lon Hohberger
Fixed fix (5.15 KB, patch)
2009-12-16 13:24 EST, Lon Hohberger


External Trackers
Tracker | ID | Priority | Status | Summary | Last Updated
Red Hat Knowledge Base (Solution) | 26853 | None | None | None | 2012-06-01 06:14:09 EDT

Description Bryn M. Reeves 2009-12-16 13:03:05 EST
Description of problem:
When stopping cluster services (rgmanager, cman) on a node that is running an rgmanager service, it is possible to hit a timing issue that causes the service to fail to start on the other node with the message:

    Failed changing service status

rgmanager is unable to update view formation status during cman membership transitions.

Version-Release number of selected component (if applicable):
rgmanager-2.0.*

How reproducible:
100%

Steps to Reproduce:
1. Disable rgmanager (/etc/init.d/rgmanager stop) on one node of a multi-node cluster
2. Immediately disable cman on the same node (/etc/init.d/cman stop)
  
Actual results:
The surviving node is unable to take over the services because it cannot receive the acknowledgement from cman/ais on the failed node. This leads to timeouts in view formation, and the service terminates with an error.

Dec 15 14:49:48 ahost02 clurgmgrd[8702]: <notice> Member 1 shutting down
Dec 15 14:49:54 ahost02 clurgmgrd[8702]: <notice> Starting stopped service service:my-service
...
Dec 15 14:50:32 ahost02 clurgmgrd[8702]: <err> #75: Failed changing service status
Dec 15 14:50:32 ahost02 clurgmgrd[8702]: <notice> Stopping service service:my-service
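
To check whether a node has hit this failure, the log can be searched for the error line shown above. The following is only a minimal sketch: the function name find_status_failures is hypothetical, and the pattern is derived solely from the sample log lines in this report, so other rgmanager versions may format the message differently.

```shell
#!/bin/sh
# find_status_failures (hypothetical helper): read syslog text on stdin
# and print only the clurgmgrd <err> lines that carry the message
# quoted in this report.
find_status_failures() {
    grep -E 'clurgmgrd\[[0-9]+\]: <err> .*Failed changing service status'
}
```

It might be used as `find_status_failures < /var/log/messages`, assuming syslog writes to that path on the affected node.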

Expected results:
Surviving nodes take over services as normal.

Additional info:
This problem may be worked around by inserting a delay between rgmanager shutting down and cman leaving the cluster, i.e. by putting a sleep into the stop() method of /etc/init.d/cman before the call to cman_tool leave. The sleep needs to be long enough to cover the service start-up time plus twice the totem token timeout, so that the failing node has been removed from the cluster and can no longer cause timeouts at view formation time.
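
The length of that sleep can be worked out as in the sketch below. The helper name cman_leave_delay and the example values are assumptions made for illustration; the only facts taken from this report are the formula (service start-up time plus two token timeouts) and that totem/@token is given in milliseconds.

```shell
#!/bin/sh
# cman_leave_delay SERVICE_START_SECS TOKEN_MS   (hypothetical helper)
# Prints the suggested delay in seconds: the service start-up time plus
# twice the totem token timeout. The token timeout is configured in
# milliseconds, so it is rounded up to whole seconds first.
cman_leave_delay() {
    start_secs=$1
    token_ms=$2
    token_secs=$(( (token_ms + 999) / 1000 ))
    echo $(( start_secs + 2 * token_secs ))
}

# Example with assumed values: a service that takes 20 seconds to start
# and a 30000 ms token timeout give 20 + 2 * 30 = 80 seconds.
cman_leave_delay 20 30000
```

In stop() this would amount to something like `sleep "$(cman_leave_delay 20 30000)"` placed before the call to cman_tool leave, with the two arguments replaced by values measured on the cluster in question.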
Comment 1 Lon Hohberger 2009-12-16 13:14:38 EST
Created attachment 378809 [details]
Fix
Comment 3 Lon Hohberger 2009-12-16 13:20:55 EST
Created attachment 378810 [details]
Fixed fix

The original patch was missing a free() call, which would have caused rgmanager to leak memory over a long period of time (e.g. the length of the totem/@token value on each configuration change).
Comment 4 Lon Hohberger 2009-12-16 13:24:06 EST
Created attachment 378811 [details]
Fixed fix
Comment 5 Lon Hohberger 2009-12-16 13:29:01 EST
Easy way to cause this to happen:

- create a two node cluster (quorum disk doesn't matter)
- set /cluster/totem/@token to a larger-than-default value, like 30000 (30 seconds)
- create a service.  It does not have to have any resources.
- start both cluster nodes and rgmanager on both nodes
- determine which node is running the service using 'clustat'
- run 'service rgmanager stop; umount -at gfs; umount -at gfs2; service cman stop'

If CMAN stops correctly, rgmanager on the remaining node will hang and eventually produce an error in the system logs that it could not change the service state.

With the above patch, rgmanager waits, the node transition is recognized by openais, and then the service state is correctly updated.
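
The final step of the reproducer can be wrapped in a small script such as the sketch below. The commands are the ones listed above; the DRY_RUN switch and the function names run and reproduce are hypothetical additions so that the sequence can be reviewed before it is run on a real cluster node.

```shell
#!/bin/sh
# DRY_RUN (hypothetical switch) defaults to 1, in which case commands
# are only printed. Set DRY_RUN=0 to execute them on the node that
# currently owns the service.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Same sequence as the one-liner in the steps above. As with ';' in the
# original, each command runs regardless of whether the previous one
# succeeded.
reproduce() {
    run service rgmanager stop
    run umount -at gfs
    run umount -at gfs2
    run service cman stop
}

reproduce
```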
Comment 7 Lon Hohberger 2009-12-16 13:33:23 EST
This issue can largely be worked around by:

- service rgmanager stop
- wait for all services to complete failover
- service cman stop
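
The "wait" step can be scripted with a generic polling loop, sketched below. wait_until is a hypothetical helper, and how to decide that failover has finished is deliberately left open: failover_done stands for a site-specific check (for example, one that inspects clustat output on a surviving node) and is not defined here.

```shell
#!/bin/sh
# wait_until TIMEOUT_SECS INTERVAL_SECS COMMAND [ARGS...]
# Re-runs COMMAND every INTERVAL_SECS until it succeeds (returns 0)
# or until roughly TIMEOUT_SECS have elapsed (returns 1).
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    while ! "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 0
}

# Intended use on the node being shut down (commented out on purpose;
# failover_done is a hypothetical, site-specific check):
#   service rgmanager stop
#   wait_until 300 5 failover_done || echo "failover not confirmed" >&2
#   service cman stop
```

The timeout keeps the shutdown from blocking forever if the check never succeeds; the 300 and 5 second values are placeholders, not recommendations.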
Comment 10 Alfredo Moralejo 2010-01-07 04:48:57 EST
Any news about when the errata will be released?

We are affected by this issue.

Best regards,

Alfredo
Comment 11 Lon Hohberger 2010-01-07 11:50:38 EST
Next RHEL update.

Red Hat Support can assist if you require a fix earlier.
Comment 12 Chris Ward 2010-02-11 05:32:08 EST
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this 
release that addresses your request. Please test and report back results 
here, by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update 
the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set 
this bug into NEED_INFO. If you encounter new defects or have additional 
patch(es) to request for inclusion, please clone this bug per each request
and escalate through your support representative.
Comment 14 Issue Tracker 2010-02-17 02:05:38 EST
Event posted on 02-17-2010 04:05pm JST by tumeya

HP verified the 5.5 beta. 


This event sent from IssueTracker by tumeya 
 issue 388773
Comment 15 Dean Jansa 2010-03-03 17:08:01 EST
HP verified (comment #14)
Comment 18 Alfredo Moralejo 2010-03-23 09:37:55 EDT
We've tested the rgmanager package in the 5.5 beta release and it's working fine.
Comment 19 errata-xmlrpc 2010-03-30 04:48:04 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0280.html
