Created attachment 480852 [details]
cluster.conf for recreation

Description of problem:
A two-node cluster, with each node running an exclusive service: node rh51 is running the service demoProd (exclusive=1) and node rh52 is running the service demoDevel (exclusive=2). The service demoProd fails a status check, so rgmanager tries to restart it on node rh51 (itself). The service demoProd is stopped successfully, but rh51 then fails to start it, and the service is put into the "stopped" state. The cluster.conf is attached to the ticket.

Version-Release number of selected component (if applicable):
rgmanager-2.0.52-9.el5

How reproducible:
Every time

Steps to Reproduce:
1) Started these services on their respective nodes:

$ clustat
 Service Name           Owner (Last)                State
 service:demoDevel      rh5node2.examplerh.com      started
 service:demoProd       rh5node1.examplerh.com      started

2) Edit the script for demoProd so that, once the service fails a status check, starting the service will also fail:

$ cp ~/cservices/services/demo/demo1 ~/cservices/services/demo/demo1.org
$ emacs ~/cservices/services/demo/demo1
$ diff ~/cservices/services/demo/demo1.org ~/cservices/services/demo/demo1
--- /root/cservices/services/demo/demo1.org	2011-02-24 12:25:13.000000000 -0500
+++ /root/cservices/services/demo/demo1	2011-02-24 12:26:14.000000000 -0500
@@ -48,7 +48,8 @@ startService() {
         logger "Delaying startup of $SCRIPTNAME script $DELAY_STARTUP_SECONDS seconds."
         sleep $DELAY_STARTUP_SECONDS;
     fi
-    exit $result;
+    #exit $result;
+    exit 1;
 }

 statusService() {

3) Then I removed the lockfile for the demoProd service running on rh51, so that the status check will fail:

$ rm /var/lock/subsys/demo1

This resulted in the following error messages:

Feb 24 12:28:40 rh5node1 root: REMOVING THE LOCKFILE for demo1 after changing exit value of start()
Feb 24 12:28:57 rh5node1 clurgmgrd: [6859]: <info> Executing /root/cservices/services/demo/demo1 status
Feb 24 12:28:58 rh5node1 logger: demo1 is not running
Feb 24 12:28:58 rh5node1 clurgmgrd: [6859]: <err> script:demoProdScript: status of /root/cservices/services/demo/demo1 failed (returned 1)
Feb 24 12:28:58 rh5node1 clurgmgrd[6859]: <notice> status on script "demoProdScript" returned 1 (generic error)
Feb 24 12:28:58 rh5node1 clurgmgrd[6859]: <notice> Stopping service service:demoProd
Feb 24 12:28:58 rh5node1 clurgmgrd: [6859]: <info> Executing /root/cservices/services/demo/demo1 stop
Feb 24 12:28:58 rh5node1 logger: demo1 is not running.
Feb 24 12:28:58 rh5node1 clurgmgrd[6859]: <notice> Service service:demoProd is recovering
Feb 24 12:28:58 rh5node1 clurgmgrd[6859]: <notice> Starting service:demoProd on [ 1 ]
Feb 24 12:28:58 rh5node1 clurgmgrd[6859]: <notice> Recovering failed service service:demoProd
Feb 24 12:28:58 rh5node1 clurgmgrd: [6859]: <info> Executing /root/cservices/services/demo/demo1 start
Feb 24 12:28:58 rh5node1 logger: demo1 is not running.
Feb 24 12:28:58 rh5node1 logger: STARTING demo1
Feb 24 12:28:58 rh5node1 clurgmgrd: [6859]: <err> script:demoProdScript: start of /root/cservices/services/demo/demo1 failed (returned 1)
Feb 24 12:28:58 rh5node1 clurgmgrd[6859]: <notice> start on script "demoProdScript" returned 1 (generic error)
Feb 24 12:28:58 rh5node1 clurgmgrd[6859]: <warning> #68: Failed to start service:demoProd; return value: 1
Feb 24 12:28:58 rh5node1 clurgmgrd[6859]: <notice> Stopping service service:demoProd
Feb 24 12:28:58 rh5node1 clurgmgrd: [6859]: <info> Executing /root/cservices/services/demo/demo1 stop
Feb 24 12:28:58 rh5node1 logger: STOPPING demo1
Feb 24 12:28:59 rh5node1 clurgmgrd[6859]: <notice> Service service:demoProd is stopped
Feb 24 12:28:59 rh5node1 clurgmgrd[6859]: <info> Start failed; node reports: 1 failures, 0 exclusive, 0 dependency errors

4) The service then went into the stopped state and was not relocated:

$ clustat
 Service Name           Owner (Last)                  State
 service:demoDevel      rh5node2.examplerh.com        started
 service:demoProd       (rh5node1.examplerh.com)      stopped

The service demoProd is now stopped on rh51 while demoDevel is still running on rh52. When the service is evaluated for recovery, rgmanager only sees one empty node; the remaining nodes are running exclusive services. The desired behavior would be for it to compare the exclusive priorities of the services running on the other nodes to see whether one of those nodes is a better place to relocate to.

Actual results:
A two-node cluster, with each node running an exclusive service: node rh51 is running the service demoProd (exclusive=1) and node rh52 is running the service demoDevel (exclusive=2). The service demoProd fails a status check. The node rh51 notices that it is the only empty node and restarts the service on rh51.

Expected results:
A two-node cluster, with each node running an exclusive service: node rh51 is running the service demoProd (exclusive=1) and node rh52 is running the service demoDevel (exclusive=2). The service demoProd fails a status check. The service demoProd should then be relocated to node rh52. The service demoDevel should be stopped, since it has a lower exclusive priority (a higher exclusive value) than demoProd. The service demoProd should then be started on node rh52.

Additional info:
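For reference, a minimal sketch of a demo1-style script consistent with the reproduction above. Only the lockfile path and the SCRIPTNAME/DELAY_STARTUP_SECONDS variables appear in this report; everything else is assumed for illustration, with the injected start failure from step 2 already applied:

#!/bin/bash
# Sketch of a demo1-style init script (reconstructed; only the lockfile
# path and the variables below come from the report, the rest is assumed).
SCRIPTNAME=demo1
LOCKFILE=/var/lock/subsys/demo1
DELAY_STARTUP_SECONDS=0

startService() {
    if [ "$DELAY_STARTUP_SECONDS" -gt 0 ]; then
        logger "Delaying startup of $SCRIPTNAME script $DELAY_STARTUP_SECONDS seconds."
        sleep $DELAY_STARTUP_SECONDS
    fi
    logger "STARTING $SCRIPTNAME"
    touch "$LOCKFILE"
    # Step 2 of the reproduction replaces "exit $result" with "exit 1",
    # so every start attempt fails once the fault is injected.
    exit 1
}

statusService() {
    # rgmanager polls this periodically; removing the lockfile (step 3)
    # makes it fail, which triggers the recovery path shown in the logs.
    if [ -f "$LOCKFILE" ]; then
        logger "$SCRIPTNAME is running"
        exit 0
    fi
    logger "$SCRIPTNAME is not running"
    exit 1
}

stopService() {
    # Stopping an already-stopped service logs "is not running." as seen
    # in the log excerpt above.
    if [ ! -f "$LOCKFILE" ]; then
        logger "$SCRIPTNAME is not running."
        exit 0
    fi
    logger "STOPPING $SCRIPTNAME"
    rm -f "$LOCKFILE"
    exit 0
}

case "$1" in
    start)  startService ;;
    stop)   stopService ;;
    status) statusService ;;
    *)      echo "Usage: $0 {start|stop|status}"; exit 2 ;;
esac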
Created attachment 480876 [details]
Drop-in replacement default_event_script.sl

In my testing, this solves the issue. I used "master" (prio 1) and "slave" (prio 2) exclusive services, with "master" exhibiting, after starting, a failure which prevented it from restarting. This caused it to preempt the "slave" service on node 2. After it started there, an event is now queued which triggers all non-priority-1 exclusive services to try to restart. The net effect is that "master" and "slave" switched places:

Feb 24 17:41:07 snap rgmanager[4671]: Starting service:master on [ 1 2 ]
Feb 24 17:41:07 snap rgmanager[4671]: Starting stopped service service:master
Feb 24 17:41:07 snap rgmanager[12630]: Executing /usr/sbin/cluster-script.sh start
Feb 24 17:41:07 snap rgmanager[4671]: Service service:master started
Feb 24 17:41:07 snap rgmanager[4671]: Service service:master is now running on member 1
Feb 24 17:41:14 snap rgmanager[4671]: Starting service:slave on [ 2 ]
Feb 24 17:41:14 snap rgmanager[4671]: Service service:slave is now running on member 2
Feb 24 17:41:46 snap rgmanager[12673]: Executing /usr/sbin/cluster-script.sh status
Feb 24 17:42:16 snap rgmanager[12721]: Executing /usr/sbin/cluster-script.sh status
Feb 24 17:42:16 snap rgmanager[12744]: script:test1: status of /usr/sbin/cluster-script.sh failed (returned 1)
Feb 24 17:42:16 snap rgmanager[4671]: status on script "test1" returned 1 (generic error)
Feb 24 17:42:16 snap rgmanager[4671]: Stopping service service:master
Feb 24 17:42:16 snap rgmanager[12775]: Executing /usr/sbin/cluster-script.sh stop
Feb 24 17:42:16 snap rgmanager[4671]: Service service:master is recovering
Feb 24 17:42:16 snap rgmanager[4671]: Starting service:master on [ 1 ]
Feb 24 17:42:17 snap rgmanager[4671]: Recovering failed service service:master
Feb 24 17:42:17 snap rgmanager[12815]: Executing /usr/sbin/cluster-script.sh start
Feb 24 17:42:17 snap rgmanager[12838]: script:test1: start of /usr/sbin/cluster-script.sh failed (returned 1)
Feb 24 17:42:17 snap rgmanager[4671]: start on script "test1" returned 1 (generic error)
Feb 24 17:42:17 snap rgmanager[4671]: #68: Failed to start service:master; return value: 1
Feb 24 17:42:17 snap rgmanager[4671]: Stopping service service:master
Feb 24 17:42:17 snap rgmanager[12870]: Executing /usr/sbin/cluster-script.sh stop
Feb 24 17:42:17 snap rgmanager[4671]: Service service:master is stopped
Feb 24 17:42:17 snap rgmanager[4671]: Start failed; node reports: 1 failures, 0 exclusive, 0 dependency errors
Feb 24 17:42:17 snap rgmanager[4671]: STOPPING service service:slave because service:master is a higher priority.
Feb 24 17:42:18 snap rgmanager[4671]: Starting service:master on 2
Feb 24 17:42:18 snap rgmanager[4671]: Service service:master is now running on member 2
Feb 24 17:42:18 snap rgmanager[4671]: Restarting stopped exclusive priority 2 service service:slave
Feb 24 17:42:18 snap rgmanager[4671]: Starting service:slave on [ 1 ]
Feb 24 17:42:18 snap rgmanager[4671]: Starting stopped service service:slave
Feb 24 17:42:18 snap rgmanager[4671]: Service service:slave started
Feb 24 17:42:18 snap rgmanager[4671]: Service service:slave is now running on member
I don't know why the last line got cut off, but it should have said:

Feb 24 17:42:18 snap rgmanager[4671]: Service service:slave is now running on member 1
It's important to note that part of the reason default_event_script.sl exists is to enable users to write their own custom failure/recovery policies.
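For anyone who wants to try the replacement (or a custom policy), a rough installation sketch follows. The path below is where rgmanager normally installs the default script, but verify it against your rgmanager package; also note the event script is only consulted when central event processing is enabled in cluster.conf (<rm central_processing="1" ...>, as in the verification config later in this report):

# Back up the stock script, then install the replacement -- repeat on
# every cluster node (paths assumed; verify on your installation).
cp -a /usr/share/cluster/default_event_script.sl \
      /usr/share/cluster/default_event_script.sl.orig
cp default_event_script.sl /usr/share/cluster/default_event_script.sl
# Restart rgmanager so the new script is loaded.
service rgmanager restart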
Created attachment 486712 [details]
Patch for RHEL5 branch
Verified in version rgmanager-2.0.52-19.el5, kernel 2.6.18-265.el5.

<rm central_processing="1" log_facility="local6" log_level="7">
    <failoverdomains/>
    <service name="master" exclusive="1" max_restarts="1" recovery="relocate" autostart="1">
        <script file="/tmp/master.sh" name="master_script"/>
    </service>
    <service name="slave" exclusive="2" max_restarts="1" recovery="relocate" autostart="1">
        <script file="/tmp/slave.sh" name="slave_script"/>
    </service>
</rm>

Jun 13 05:35:30 a1 clurgmgrd: [14189]: <info> Executing /tmp/master.sh status
Jun 13 05:36:10 a1 clurgmgrd: [14189]: <info> Executing /tmp/master.sh status
Jun 13 05:36:50 a1 clurgmgrd: [14189]: <info> Executing /tmp/master.sh status
Jun 13 05:36:51 a1 clurgmgrd: [14189]: <err> script:master_script: status of /tmp/master.sh failed (returned 1)
Jun 13 05:36:51 a1 clurgmgrd[14189]: <notice> status on script "master_script" returned 1 (generic error)
Jun 13 05:36:51 a1 clurgmgrd[14189]: <notice> Stopping service service:master
Jun 13 05:36:51 a1 clurgmgrd: [14189]: <info> Executing /tmp/master.sh stop
Jun 13 05:36:52 a1 clurgmgrd[14189]: <notice> Service service:master is recovering
Jun 13 05:36:52 a1 clurgmgrd[14189]: <debug> Executing default service event handler
Jun 13 05:36:52 a1 clurgmgrd[14189]: <debug> Recovering Service: service:master Last owner: 1 Policy: relocate RTE: 1
Jun 13 05:36:52 a1 clurgmgrd[14189]: <debug> Evaluating service:master state=recovering owner=-1
Jun 13 05:36:52 a1 clurgmgrd[14189]: <debug> Nodes - Empty: [ 1 ] w/Services: [ ] w/Excl: [ 2 ]
Jun 13 05:36:52 a1 clurgmgrd[14189]: <notice> Starting service:master on [ 1 ]
Jun 13 05:36:52 a1 clurgmgrd[14189]: <debug> Sent remote-start request to 1
Jun 13 05:36:52 a1 clurgmgrd[14189]: <notice> Recovering failed service service:master
Jun 13 05:36:52 a1 clurgmgrd: [14189]: <info> Executing /tmp/master.sh start
Jun 13 05:36:53 a1 clurgmgrd: [14189]: <err> script:master_script: start of /tmp/master.sh failed (returned 1)
Jun 13 05:36:53 a1 clurgmgrd[14189]: <notice> start on script "master_script" returned 1 (generic error)
Jun 13 05:36:53 a1 clurgmgrd[14189]: <warning> #68: Failed to start service:master; return value: 1
Jun 13 05:36:53 a1 clurgmgrd[14189]: <notice> Stopping service service:master
Jun 13 05:36:53 a1 clurgmgrd: [14189]: <info> Executing /tmp/master.sh stop
Jun 13 05:36:53 a1 clurgmgrd[14189]: <notice> Service service:master is stopped
Jun 13 05:36:53 a1 clurgmgrd[14189]: <info> Start failed; node reports: 1 failures, 0 exclusive, 0 dependency errors
Jun 13 05:36:53 a1 clurgmgrd[14189]: <warning> STOPPING service service:slave because service:master is a higher priority.
Jun 13 05:36:53 a1 clurgmgrd[14189]: <notice> Starting service:master on 2
Jun 13 05:36:53 a1 clurgmgrd[14189]: <debug> Sent remote-start request to 2
Jun 13 05:36:53 a1 clurgmgrd[14189]: <notice> Service service:master is now running on member 2
Jun 13 05:36:53 a1 clurgmgrd[14189]: <debug> Executing default service event handler
Jun 13 05:36:53 a1 clurgmgrd[14189]: <info> Restarting stopped exclusive priority 2 service service:slave
Jun 13 05:36:53 a1 clurgmgrd[14189]: <debug> Evaluating service:slave state=stopped owner=-1
Jun 13 05:36:53 a1 clurgmgrd[14189]: <debug> Nodes - Empty: [ 1 ] w/Services: [ ] w/Excl: [ 2 ]
Jun 13 05:36:53 a1 clurgmgrd[14189]: <notice> Starting service:slave on [ 1 ]
Jun 13 05:36:53 a1 clurgmgrd[14189]: <debug> Sent remote-start request to 1
Jun 13 05:36:53 a1 clurgmgrd[14189]: <notice> Starting stopped service service:slave
Jun 13 05:36:53 a1 clurgmgrd: [14189]: <info> Executing /tmp/slave.sh start
Jun 13 05:36:53 a1 clurgmgrd[14189]: <notice> Service service:slave started
Jun 13 05:36:53 a1 clurgmgrd[14189]: <notice> Service service:slave is now running on member 1
Jun 13 05:36:53 a1 clurgmgrd[14189]: <debug> Executing default service event handler
Jun 13 05:36:53 a1 last message repeated 2 times
Jun 13 05:36:53 a1 clurgmgrd[14189]: <debug> 5 events processed
Jun 13 05:37:30 a1 clurgmgrd: [14189]: <info> Executing /tmp/slave.sh status
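For completeness, the swap can be confirmed the same way the original report checks service placement, with clustat. A sketch, using the service names from this verification run and assuming the installed clustat supports limiting output to a single service with -s:

$ clustat -s master
$ clustat -s slave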
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1000.html
Created attachment 691632 [details]
Notes about patch behavior (should have been added a long time ago)