Bug 680256 - Service with highest exclusive prio should be relocated to another node with lower exclusive prio
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: rgmanager
Version: 5.5
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: rc
Assignee: Lon Hohberger
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-02-24 20:05 UTC by Shane Bradley
Modified: 2018-11-14 16:01 UTC (History)
4 users

Fixed In Version: rgmanager-2.0.52-16.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 690191 692627 (view as bug list)
Environment:
Last Closed: 2011-07-21 10:47:39 UTC
Target Upstream Version:
Embargoed:


Attachments
cluster.conf for recreation (1.33 KB, application/octet-stream)
2011-02-24 20:05 UTC, Shane Bradley
Drop-in replacement default_event_script.sl (16.30 KB, text/plain)
2011-02-24 22:52 UTC, Lon Hohberger
Patch for RHEL5 branch (4.61 KB, patch)
2011-03-22 01:17 UTC, Lon Hohberger
Notes about patch behavior (should have been added a long time ago) (2.04 KB, text/plain)
2013-02-01 15:57 UTC, Lon Hohberger


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2011:1000 0 normal SHIPPED_LIVE Low: rgmanager security, bug fix, and enhancement update 2011-07-21 10:43:18 UTC

Description Shane Bradley 2011-02-24 20:05:16 UTC
Created attachment 480852 [details]
cluster.conf for recreation

Description of problem:

This is a two-node cluster, with each node running an exclusive service. Node rh51 is running the service demoProd (exclusive=1) and node rh52 is running the service demoDevel (exclusive=2). The service demoProd fails a status check, so rgmanager tries to restart it on node rh51 (itself). The service demoProd is stopped successfully, but node rh51 then fails to start it again, and the service is moved to the "stopped" state.

The cluster.conf used for recreation is attached to this ticket.

Version-Release number of selected component (if applicable):
rgmanager-2.0.52-9.el5

How reproducible:
Every time

Steps to Reproduce:
1) Started these services on their respective nodes:
   $ clustat
   Service Name      Owner (Last)              State
   service:demoDevel rh5node2.examplerh.com    started       
   service:demoProd  rh5node1.examplerh.com    started 


2) Edit the demoProd script so that, once the service fails a status check, any subsequent attempt to start the service will also fail.
   $ cp ~/cservices/services/demo/demo1 ~/cservices/services/demo/demo1.org
   $ emacs ~/cservices/services/demo/demo1

   $ diff ~/cservices/services/demo/demo1.org ~/cservices/services/demo/demo1
   --- /root/cservices/services/demo/demo1.org     2011-02-24 12:25:13.000000000 -0500
   +++ /root/cservices/services/demo/demo1 2011-02-24 12:26:14.000000000 -0500
   @@ -48,7 +48,8 @@ startService() {
            logger "Delaying startup of $SCRIPTNAME script $DELAY_STARTUP_SECONDS seconds."
            sleep $DELAY_STARTUP_SECONDS;
        fi
   -    exit $result;
   +    #exit $result;
   +    exit 1;
   }
 
   statusService() {

3) Then I removed the lockfile for the demoProd service running on rh51 so that the status check will fail:
   $ rm /var/lock/subsys/demo1

   This resulted in the following error messages:
   Feb 24 12:28:40 rh5node1 root: REMOVING THE LOCKFILE for demo1 after changing exit value of start()
   Feb 24 12:28:57 rh5node1 clurgmgrd: [6859]: <info> Executing /root/cservices/services/demo/demo1 status 
   Feb 24 12:28:58 rh5node1 logger: demo1 is not running
   Feb 24 12:28:58 rh5node1 clurgmgrd: [6859]: <err> script:demoProdScript: 
     status of /root/cservices/services/demo/demo1 failed (returned 1) 
   Feb 24 12:28:58 rh5node1 clurgmgrd[6859]: <notice> status on script "demoProdScript" returned 1 (generic error) 
   Feb 24 12:28:58 rh5node1 clurgmgrd[6859]: <notice> Stopping service service:demoProd 
   Feb 24 12:28:58 rh5node1 clurgmgrd: [6859]: <info> Executing /root/cservices/services/demo/demo1 stop 
   Feb 24 12:28:58 rh5node1 logger: demo1 is not running.
   Feb 24 12:28:58 rh5node1 clurgmgrd[6859]: <notice> Service service:demoProd is recovering 
   Feb 24 12:28:58 rh5node1 clurgmgrd[6859]: <notice> Starting service:demoProd on [ 1 ] 
   Feb 24 12:28:58 rh5node1 clurgmgrd[6859]: <notice> Recovering failed service service:demoProd 
   Feb 24 12:28:58 rh5node1 clurgmgrd: [6859]: <info> Executing /root/cservices/services/demo/demo1 start 
   Feb 24 12:28:58 rh5node1 logger: demo1 is not running.
   Feb 24 12:28:58 rh5node1 logger: STARTING demo1
   Feb 24 12:28:58 rh5node1 clurgmgrd: [6859]: <err> script:demoProdScript: 
     start of /root/cservices/services/demo/demo1 failed (returned 1) 
   Feb 24 12:28:58 rh5node1 clurgmgrd[6859]: <notice> start on script "demoProdScript" returned 1 (generic error) 
   Feb 24 12:28:58 rh5node1 clurgmgrd[6859]: <warning> #68: Failed to start service:demoProd; return value: 1 
   Feb 24 12:28:58 rh5node1 clurgmgrd[6859]: <notice> Stopping service service:demoProd 
   Feb 24 12:28:58 rh5node1 clurgmgrd: [6859]: <info> Executing /root/cservices/services/demo/demo1 stop 
   Feb 24 12:28:58 rh5node1 logger: STOPPING demo1
   Feb 24 12:28:59 rh5node1 clurgmgrd[6859]: <notice> Service service:demoProd is stopped 
   Feb 24 12:28:59 rh5node1 clurgmgrd[6859]: <info> Start failed; 
     node reports: 1 failures, 0 exclusive, 0 dependency errors

4) The service then went into a stopped state and was not migrated.
   $ clustat
   Service Name      Owner (Last)              State
   service:demoDevel rh5node2.examplerh.com    started       
   service:demoProd  (rh5node1.examplerh.com)  stopped 

   The service demoProd is now stopped on rh51 and demoDevel is still running on rh52.

When the service is evaluated, the event handler only sees one empty node (the node it just failed on); the rest are occupied by exclusive services. The desired behavior would be to compare the exclusive priorities of the services running on the other nodes, to see whether one of them is a better place to relocate to.
  
Actual results:
This is a two-node cluster, with each node running an exclusive service. Node rh51 is running the service demoProd (exclusive=1) and node rh52 is running the service demoDevel (exclusive=2). The service demoProd fails a status check. Node rh51 sees that it is the only empty node and restarts the service on rh51, where it just failed.

Expected results:
This is a two-node cluster, with each node running an exclusive service. Node rh51 is running the service demoProd (exclusive=1) and node rh52 is running the service demoDevel (exclusive=2). The service demoProd fails a status check. The service demoProd should then be relocated to node rh52: the service demoDevel should be stopped, since it has a lower exclusive priority (a higher numeric exclusive value) than demoProd, and demoProd should then be started on node rh52.
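The expected placement decision can be sketched as follows. This is a hypothetical Python illustration of the policy, not rgmanager's actual S-Lang event script; the Node/Service structures and the choose_target function are invented for illustration:

```python
# Hypothetical sketch of the desired recovery policy for exclusive
# services (not the real rgmanager event script).
from dataclasses import dataclass
from typing import Optional

@dataclass
class Service:
    name: str
    exclusive: int          # lower value = higher exclusive priority

@dataclass
class Node:
    name: str
    running: Optional[Service] = None   # exclusive service hosted here, if any

def choose_target(failed_svc, nodes, tried_nodes):
    """Pick a node for a failed exclusive service. Prefer an empty node
    we have not already failed on; otherwise preempt a node whose
    exclusive service has lower priority (a higher numeric value)."""
    for node in nodes:
        if node.running is None and node.name not in tried_nodes:
            return node, None           # free node: start here, preempt nothing
    for node in nodes:
        svc = node.running
        if svc is not None and svc.exclusive > failed_svc.exclusive:
            return node, svc            # stop svc first, then start here
    return None, None                   # no candidate: service stays stopped
```

In the bug's scenario, rh51 is "empty" but has already failed to start demoProd, so the second branch should select rh52 and stop demoDevel first, rather than leaving demoProd stopped.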

Additional info:

Comment 1 Lon Hohberger 2011-02-24 22:52:01 UTC
Created attachment 480876 [details]
Drop-in replacement default_event_script.sl

In my testing, this solves the issue.

I used master (prio 1) and slave (prio 2) exclusive services; after starting, "master" exhibited a failure which prevented it from restarting in place. This caused it to preempt the "slave" service on node 2.

After the preempting service starts, an event is now queued which triggers all non-priority-1 exclusive services to try to restart.
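That re-queue step can be sketched roughly as below. This is a hypothetical Python sketch of the behavior described, not the actual S-Lang event script; the Service structure and field names are assumptions:

```python
# Hypothetical sketch: after a higher-priority service preempts a node,
# re-queue every stopped exclusive service with priority > 1 so it can
# look for a newly freed node (such as the one the failed service vacated).
from dataclasses import dataclass

@dataclass
class Service:
    name: str
    exclusive: int   # exclusive priority: lower value = higher priority
    state: str       # e.g. "started", "stopped"

def services_to_restart(services):
    """Return the stopped, non-priority-1 exclusive services to retry."""
    return [s for s in services if s.exclusive > 1 and s.state == "stopped"]
```

With master restarted on node 2 and slave stopped by the preemption, only slave is re-queued, and it lands on the node master vacated.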

The net effect is that "master" and "slave" switched places:

Feb 24 17:41:07 snap rgmanager[4671]: Starting service:master on [ 1 2 ]
Feb 24 17:41:07 snap rgmanager[4671]: Starting stopped service service:master
Feb 24 17:41:07 snap rgmanager[12630]: Executing /usr/sbin/cluster-script.sh start
Feb 24 17:41:07 snap rgmanager[4671]: Service service:master started
Feb 24 17:41:07 snap rgmanager[4671]: Service service:master is now running on member 1
Feb 24 17:41:14 snap rgmanager[4671]: Starting service:slave on [ 2 ]
Feb 24 17:41:14 snap rgmanager[4671]: Service service:slave is now running on member 2
Feb 24 17:41:46 snap rgmanager[12673]: Executing /usr/sbin/cluster-script.sh status
Feb 24 17:42:16 snap rgmanager[12721]: Executing /usr/sbin/cluster-script.sh status
Feb 24 17:42:16 snap rgmanager[12744]: script:test1: status of /usr/sbin/cluster-script.sh failed (returned 1)
Feb 24 17:42:16 snap rgmanager[4671]: status on script "test1" returned 1 (generic error)
Feb 24 17:42:16 snap rgmanager[4671]: Stopping service service:master
Feb 24 17:42:16 snap rgmanager[12775]: Executing /usr/sbin/cluster-script.sh stop
Feb 24 17:42:16 snap rgmanager[4671]: Service service:master is recovering
Feb 24 17:42:16 snap rgmanager[4671]: Starting service:master on [ 1 ]
Feb 24 17:42:17 snap rgmanager[4671]: Recovering failed service service:master
Feb 24 17:42:17 snap rgmanager[12815]: Executing /usr/sbin/cluster-script.sh start
Feb 24 17:42:17 snap rgmanager[12838]: script:test1: start of /usr/sbin/cluster-script.sh failed (returned 1)
Feb 24 17:42:17 snap rgmanager[4671]: start on script "test1" returned 1 (generic error)
Feb 24 17:42:17 snap rgmanager[4671]: #68: Failed to start service:master; return value: 1
Feb 24 17:42:17 snap rgmanager[4671]: Stopping service service:master
Feb 24 17:42:17 snap rgmanager[12870]: Executing /usr/sbin/cluster-script.sh stop
Feb 24 17:42:17 snap rgmanager[4671]: Service service:master is stopped
Feb 24 17:42:17 snap rgmanager[4671]: Start failed; node reports: 1 failures, 0 exclusive, 0 dependency errors
Feb 24 17:42:17 snap rgmanager[4671]: STOPPING service service:slave because service:master is a higher priority.
Feb 24 17:42:18 snap rgmanager[4671]: Starting service:master on 2
Feb 24 17:42:18 snap rgmanager[4671]: Service service:master is now running on member 2
Feb 24 17:42:18 snap rgmanager[4671]: Restarting stopped exclusive priority 2 service service:slave
Feb 24 17:42:18 snap rgmanager[4671]: Starting service:slave on [ 1 ]
Feb 24 17:42:18 snap rgmanager[4671]: Starting stopped service service:slave
Feb 24 17:42:18 snap rgmanager[4671]: Service service:slave started
Feb 24 17:42:18 snap rgmanager[4671]: Service service:slave is now running on member

Comment 2 Lon Hohberger 2011-02-24 22:53:48 UTC
I don't know why the last line got cut off, but it should have said:

Feb 24 17:42:18 snap rgmanager[4671]: Service service:slave is now running on member 1

Comment 4 Lon Hohberger 2011-03-01 15:48:25 UTC
It's important to note that part of the reason default_event_script.sl exists is to enable users to write their own custom failure/recovery policies.

Comment 5 Lon Hohberger 2011-03-22 01:17:27 UTC
Created attachment 486712 [details]
Patch for RHEL5 branch

Comment 7 Martin Juricek 2011-06-13 10:49:57 UTC
Verified in version rgmanager-2.0.52-19.el5, kernel 2.6.18-265.el5




<rm central_processing="1" log_facility="local6" log_level="7">
  <failoverdomains/>
  <service name="master" exclusive="1" max_restarts="1" recovery="relocate" autostart="1">
	<script file="/tmp/master.sh" name="master_script"/>
  </service>    
  <service name="slave" exclusive="2" max_restarts="1" recovery="relocate" autostart="1">
	<script file="/tmp/slave.sh" name="slave_script"/>
  </service>    
</rm>


 
Jun 13 05:35:30 a1 clurgmgrd: [14189]: <info> Executing /tmp/master.sh status 
Jun 13 05:36:10 a1 clurgmgrd: [14189]: <info> Executing /tmp/master.sh status 
Jun 13 05:36:50 a1 clurgmgrd: [14189]: <info> Executing /tmp/master.sh status 
Jun 13 05:36:51 a1 clurgmgrd: [14189]: <err> script:master_script: status of /tmp/master.sh failed (returned 1) 
Jun 13 05:36:51 a1 clurgmgrd[14189]: <notice> status on script "master_script" returned 1 (generic error) 
Jun 13 05:36:51 a1 clurgmgrd[14189]: <notice> Stopping service service:master 
Jun 13 05:36:51 a1 clurgmgrd: [14189]: <info> Executing /tmp/master.sh stop 
Jun 13 05:36:52 a1 clurgmgrd[14189]: <notice> Service service:master is recovering 
Jun 13 05:36:52 a1 clurgmgrd[14189]: <debug> Executing default service event handler 
Jun 13 05:36:52 a1 clurgmgrd[14189]: <debug> Recovering Service: service:master Last owner: 1 Policy: relocate RTE: 1 
Jun 13 05:36:52 a1 clurgmgrd[14189]: <debug> Evaluating service:master state=recovering owner=-1 
Jun 13 05:36:52 a1 clurgmgrd[14189]: <debug> Nodes - Empty: [ 1 ] w/Services: [ ] w/Excl: [ 2 ] 
Jun 13 05:36:52 a1 clurgmgrd[14189]: <notice> Starting service:master on [ 1 ] 
Jun 13 05:36:52 a1 clurgmgrd[14189]: <debug> Sent remote-start request to 1 
Jun 13 05:36:52 a1 clurgmgrd[14189]: <notice> Recovering failed service service:master 
Jun 13 05:36:52 a1 clurgmgrd: [14189]: <info> Executing /tmp/master.sh start 
Jun 13 05:36:53 a1 clurgmgrd: [14189]: <err> script:master_script: start of /tmp/master.sh failed (returned 1) 
Jun 13 05:36:53 a1 clurgmgrd[14189]: <notice> start on script "master_script" returned 1 (generic error) 
Jun 13 05:36:53 a1 clurgmgrd[14189]: <warning> #68: Failed to start service:master; return value: 1 
Jun 13 05:36:53 a1 clurgmgrd[14189]: <notice> Stopping service service:master 
Jun 13 05:36:53 a1 clurgmgrd: [14189]: <info> Executing /tmp/master.sh stop 
Jun 13 05:36:53 a1 clurgmgrd[14189]: <notice> Service service:master is stopped 
Jun 13 05:36:53 a1 clurgmgrd[14189]: <info> Start failed; node reports: 1 failures, 0 exclusive, 0 dependency errors 
Jun 13 05:36:53 a1 clurgmgrd[14189]: <warning> STOPPING service service:slave because service:master is a higher priority. 
Jun 13 05:36:53 a1 clurgmgrd[14189]: <notice> Starting service:master on 2 
Jun 13 05:36:53 a1 clurgmgrd[14189]: <debug> Sent remote-start request to 2 
Jun 13 05:36:53 a1 clurgmgrd[14189]: <notice> Service service:master is now running on member 2 
Jun 13 05:36:53 a1 clurgmgrd[14189]: <debug> Executing default service event handler 
Jun 13 05:36:53 a1 clurgmgrd[14189]: <info> Restarting stopped exclusive priority 2 service service:slave 
Jun 13 05:36:53 a1 clurgmgrd[14189]: <debug> Evaluating service:slave state=stopped owner=-1 
Jun 13 05:36:53 a1 clurgmgrd[14189]: <debug> Nodes - Empty: [ 1 ] w/Services: [ ] w/Excl: [ 2 ] 
Jun 13 05:36:53 a1 clurgmgrd[14189]: <notice> Starting service:slave on [ 1 ] 
Jun 13 05:36:53 a1 clurgmgrd[14189]: <debug> Sent remote-start request to 1 
Jun 13 05:36:53 a1 clurgmgrd[14189]: <notice> Starting stopped service service:slave 
Jun 13 05:36:53 a1 clurgmgrd: [14189]: <info> Executing /tmp/slave.sh start 
Jun 13 05:36:53 a1 clurgmgrd[14189]: <notice> Service service:slave started 
Jun 13 05:36:53 a1 clurgmgrd[14189]: <notice> Service service:slave is now running on member 1 
Jun 13 05:36:53 a1 clurgmgrd[14189]: <debug> Executing default service event handler 
Jun 13 05:36:53 a1 last message repeated 2 times
Jun 13 05:36:53 a1 clurgmgrd[14189]: <debug> 5 events processed 
Jun 13 05:37:30 a1 clurgmgrd: [14189]: <info> Executing /tmp/slave.sh status

Comment 8 errata-xmlrpc 2011-07-21 10:47:39 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1000.html

Comment 9 Lon Hohberger 2013-02-01 15:57:07 UTC
Created attachment 691632 [details]
Notes about patch behavior (should have been added a long time ago)

