Description of problem:
rgmanager tries to restart a failed service whose init script returns exit code 1 on start. Since the restart fails, the cluster should then relocate the service to another node (recovery=restart), but the relocation does not work:

Sep 18 10:23:14 jerry clurgmgrd[3879]: <err> #58: Failed opening connection to member #2

Version-Release number of selected component (if applicable):
EL5.2 x86_64 + latest updates

How reproducible:
* 2-node cluster
* the following failover domain:

<failoverdomains>
  <failoverdomain name="tom" nofailback="1" ordered="1" restricted="0">
    <failoverdomainnode name="tom" priority="1"/>
    <failoverdomainnode name="jerry" priority="2"/>
  </failoverdomain>
</failoverdomains>

* the following service:

<service autostart="1" domain="tom" exclusive="0" name="samba" recovery="restart">
  <script file="/etc/init.d/smb" name="samba"/>
</service>

* after the service was running, I changed /etc/init.d/smb on one member so that calling "/etc/init.d/smb start" exits with return value 1

Steps to Reproduce:
1. Start the cluster and services (service samba is running on the first node, in my case tom).
2. Change the init script smb on the first member (tom) so that start always fails (I added "exit 1" to the "start" branch of the case statement).
3. Stop service smb.

Actual results:
* rgmanager recognizes the failed service and tries to restart it (recovery=restart)
* the restart fails and rgmanager tries to relocate the service
* the relocate fails with these messages:

Sep 18 10:23:11 jerry clurgmgrd: [3879]: <err> script:samba: status of /etc/init.d/smb failed (returned 3)
Sep 18 10:23:11 jerry clurgmgrd[3879]: <notice> status on script "samba" returned 1 (generic error)
Sep 18 10:23:11 jerry clurgmgrd[3879]: <notice> Stopping service service:samba
Sep 18 10:23:12 jerry clurgmgrd[3879]: <notice> Service service:samba is recovering
Sep 18 10:23:12 jerry clurgmgrd[3879]: <notice> Recovering failed service service:samba
Sep 18 10:23:12 jerry clurgmgrd: [3879]: <err> script:samba: start of /etc/init.d/smb failed (returned 1)
Sep 18 10:23:12 jerry clurgmgrd[3879]: <notice> start on script "samba" returned 1 (generic error)
Sep 18 10:23:12 jerry clurgmgrd[3879]: <warning> #68: Failed to start service:samba; return value: 1
Sep 18 10:23:12 jerry clurgmgrd[3879]: <notice> Stopping service service:samba
Sep 18 10:23:12 jerry clurgmgrd[3879]: <notice> Service service:samba is recovering
Sep 18 10:23:12 jerry clurgmgrd[3879]: <warning> #71: Relocating failed service service:samba
Sep 18 10:23:14 jerry clurgmgrd[3879]: <err> #58: Failed opening connection to member #2
Sep 18 10:23:14 jerry clurgmgrd[3879]: <notice> Service service:samba is stopped

Expected results:
* the restart fails, so the cluster does a relocate
* the relocate succeeds

Additional info:
An ordinary relocate (clusvcadm -r samba) works.
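
For anyone trying to reproduce this, the init script change described in step 2 can be sketched roughly as below. This is only an illustration, not the real /etc/init.d/smb: the function name smb_init is hypothetical, the real script has more actions and real start/stop logic, and the reporter used "exit 1" in the script where this sketch uses "return 1" so that it can be sourced and called as a function. The status branch always reports 3 (the LSB code for "program is not running") to model the state after the service has been stopped, which matches the "returned 3" in the log above.

```shell
#!/bin/sh
# Hypothetical stand-in for the modified /etc/init.d/smb from the report.
# Assumption: only the "start" branch was changed to fail unconditionally.
smb_init() {
    case "$1" in
        start)
            # Forced failure: every start attempt reports a generic error.
            # The reporter's script used "exit 1" at this point.
            return 1
            ;;
        stop)
            # Stopping is left untouched and succeeds.
            return 0
            ;;
        status)
            # Simplification: always report "not running" (LSB status code 3),
            # i.e. the state after the service was stopped by hand.
            return 3
            ;;
        *)
            echo "Usage: smb_init {start|stop|status}" >&2
            return 2
            ;;
    esac
}
```

With a script behaving like this on one member, rgmanager's status check notices the stopped service, the local restart fails with return value 1, and the relocation path is exercised.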
Oops, a little mistake: the messages in comment #1 are from member jerry in a second test (I tested it in both directions with the same results).

Messages from the first test (tom -> jerry), taken from tom:

Sep 18 10:07:09 tom clurgmgrd: [5316]: <err> script:samba: status of /etc/init.d/smb failed (returned 3)
Sep 18 10:07:09 tom clurgmgrd[5316]: <notice> Stopping service service:samba
Sep 18 10:07:09 tom clurgmgrd[5316]: <notice> Service service:samba is recovering
Sep 18 10:07:09 tom clurgmgrd[5316]: <notice> Recovering failed service service:samba
Sep 18 10:07:10 tom clurgmgrd: [5316]: <err> script:samba: start of /etc/init.d/smb failed (returned 1)
Sep 18 10:07:10 tom clurgmgrd[5316]: <notice> start on script "samba" returned 1 (generic error)
Sep 18 10:07:10 tom clurgmgrd[5316]: <warning> #68: Failed to start service:samba; return value: 1
Sep 18 10:07:10 tom clurgmgrd[5316]: <notice> Stopping service service:samba
Sep 18 10:07:10 tom clurgmgrd[5316]: <notice> Service service:samba is recovering
Sep 18 10:07:10 tom clurgmgrd[5316]: <warning> #71: Relocating failed service service:samba
Sep 18 10:07:12 tom clurgmgrd[5316]: <err> #58: Failed opening connection to member #1
Sep 18 10:07:12 tom clurgmgrd[5316]: <notice> Service service:samba is stopped
There's an event processing bug which causes rgmanager to sleep with a lock - this causes timeouts if multiple events come in simultaneously, which sounds like what you're hitting. I can patch 5.2 rgmanager with the 5.3 patch if you would like to test it. The patch looks like this: http://git.fedorahosted.org/git/?p=cluster.git;a=blobdiff;f=rgmanager/src/daemons/rg_event.c;h=d2c7cd331c71aeef70f7d8ec9505da6fd81af08b;hp=08137de3772aeb51914be593c03d144fcc910474;hb=50dc172c12f728ebb5916e2059b01404d94dd066;hpb=1cc68885904a1e393e2dcd6d788ae5099ef7124d
That was fast :) Yes, I'd like to test it. Is it possible to get an updated rpm?
http://people.redhat.com/lhh/rgmanager-2.0.38-2.2.462679.src.rpm
http://people.redhat.com/lhh/rgmanager-2.0.38-2.2.462679.x86_64.rpm

Let me know if you need a different architecture.
Thanks - architecture ok. Tested it -> works well; bug seems to be fixed. Thank you! BTW: I assume official rpms will be available with EL5.3? When will EL5.3 be released?
Correct; it will be fixed in 5.3. I don't have current information for GA date of 5.3, but beta should be pretty soon (month or so, I think).
RHEL4 has this bug too, it's: https://bugzilla.redhat.com/show_bug.cgi?id=461956
5.3 is coming, I promise >:)
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0101.html
*** Bug 481133 has been marked as a duplicate of this bug. ***
This bug has the same cause as bug #461956, but shows a different symptom.