Description of problem:
When multiple services or virtual machines are under cluster control, it would be useful to fail them over in a user-defined order when a node goes down. That way the more critical services are failed over first, while more resources are still available. The use case is best described with virtual machines, but it extends to services as well. Say you have a two-node cluster with different virtual machines running on each node. When one node goes down, you want to be sure that the most important virtual machines are started first. The others should only be failed over if enough resources are still available.

Version-Release number of selected component (if applicable):
n.a.

How reproducible:
With multiple VMs, the order in which they are failed over after a failure is not predictable.

Steps to Reproduce:
1.
2.
3.

Actual results:
The failover order is not consistently predictable.

Expected results:

Additional info:
I have a customer requesting this feature, and I'm aware of the RIND implementation for rgmanager, so I think it can "easily" be implemented by extending the current failover policy. My idea is to give every service/virtual machine a priority attribute and, on a NODE_DOWN event, fail the services over in that order (lowest priority value first). Since I have to implement it anyway, I would like to make this feature officially available and discuss whether you think this is a good approach.
Marc.
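To make the use case concrete, here is a minimal Python sketch of the behaviour requested above. It is not rgmanager code. The memory figures, the service names, and the rule "start each service that still fits, in priority order" are assumptions made purely for illustration.

```python
def plan_failover(services, free_memory_mb):
    """Decide which services to start on the surviving node.

    services: list of (name, priority, memory_mb) tuples (hypothetical data).
    Services are visited lowest priority value first; each one is started
    only if it still fits into the remaining memory.
    """
    started, skipped = [], []
    for name, _priority, memory_mb in sorted(services, key=lambda s: s[1]):
        if memory_mb <= free_memory_mb:
            free_memory_mb -= memory_mb
            started.append(name)
        else:
            skipped.append(name)
    return started, skipped


# Hypothetical VMs that were running on the failed node.
vms = [("batch", 3, 4096), ("database", 1, 8192), ("web", 2, 4096)]
started, skipped = plan_failover(vms, free_memory_mb=12288)
print(started, skipped)
```

With these made-up numbers the two most important VMs fit and the least important one is left stopped, which is the outcome the request is after.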
Lon, as you are on this bug what do you think? Regards Marc.
In the simplest case, we could add an attribute to the service and sort based on this attribute before running through the service list in the event handler(s).
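As a rough illustration of that suggestion (in Python rather than in the S-Lang used by the rgmanager event scripts), the sketch below sorts a service list on a priority attribute before a handler would walk it. The service names and the dictionary layout are hypothetical.

```python
from operator import itemgetter

# Hypothetical service list in cluster.conf order; "priority" is the proposed attribute.
services = [
    {"name": "web", "priority": 2},
    {"name": "db", "priority": 1},
    {"name": "cache", "priority": 2},
]


def order_for_failover(service_list):
    """Return the services sorted by ascending priority.

    sorted() is stable, so services with equal priority keep the
    order in which they were listed.
    """
    return sorted(service_list, key=itemgetter("priority"))


ordered_names = [s["name"] for s in order_for_failover(services)]
print(ordered_names)
```

Using a stable sort means that configurations without distinct priorities behave as they did before the sort was introduced.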
Yes that was my idea as well. I'll add the patch when I'm done. Ok?
Sure. :)
So this is my first try. I've tested it with two nodes and different services, with and without priorities. When I trigger a NODE_EVENT, the services get failed over in an ordered manner. If no priority is specified, the service list stays in the same order as before. If no priority is specified, 0 is assumed. The services are ordered with the lowest priority value first, which means a service/vm without a priority will always get the "highest" priority.

The relevant parts of the cluster.conf look as follows:

  <rm central_processing="1" log_facility="local4" log_level="8">
    <events>
      <event name="node" class="node">
        notice("Event node triggered!");
        evalfile("/usr/local/cluster/priority_services.sl");
      </event>
    </events>
    <failoverdomains>
      <failoverdomain name="all">
        <failoverdomainnode name="axqa03-1" priority="1"/>
        <failoverdomainnode name="axqa03-1" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <service name="test1" domain="all" autostart="0" priority="5">
      <script name="/usr/local/test/test1.sh"/>
    </service>
    <service name="test2" domain="all" autostart="0" priority="4">
      <script name="/usr/local/test/test2.sh"/>
    </service>
    <service name="test3" domain="all" autostart="0" priority="3">
      <script name="/usr/local/test/test3.sh"/>
    </service>
    <service name="test4" domain="all" autostart="0" priority="2">
      <script name="/usr/local/test/test4.sh"/>
    </service>
    <service name="test5" domain="all" autostart="0" priority="1">
      <script name="/usr/local/test/test5.sh"/>
    </service>
    <vm name="axqad101_2" path="/etc/xen" domain="all" autostart="0"/>
    <resources/>
  </rm>

Patches follow.
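The ordering rule described in this comment (missing priority counts as 0, lowest value first) can be sketched in Python as follows. This is only an illustration of the rule, not the attached S-Lang implementation. The fragment is a trimmed-down copy of the configuration, and the function name failover_order is made up for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed-down fragment modelled on the <rm> section of this comment.
RM_FRAGMENT = """
<rm central_processing="1">
  <service name="test1" domain="all" autostart="0" priority="5"/>
  <service name="test2" domain="all" autostart="0" priority="4"/>
  <service name="test5" domain="all" autostart="0" priority="1"/>
  <vm name="axqad101_2" path="/etc/xen" domain="all" autostart="0"/>
</rm>
"""


def failover_order(rm_xml):
    """Return "tag:name" strings in the order they would be failed over."""
    root = ET.fromstring(rm_xml)
    # Direct children only, kept in document order.
    groups = [el for el in root if el.tag in ("service", "vm")]
    # A missing priority attribute is treated as 0; list.sort() is stable.
    groups.sort(key=lambda el: int(el.get("priority", "0")))
    return [f"{el.tag}:{el.get('name')}" for el in groups]


order = failover_order(RM_FRAGMENT)
print(order)
```

Because the vm entry carries no priority attribute, it sorts ahead of every service, matching the remark that unprioritised entries effectively get the "highest" priority.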
Created attachment 337979
Add priority attribute to the service definition

This patch adds the priority attribute to /usr/share/cluster/service.sh in order to make it available to rgmanager and the service_property slang function.
Created attachment 337980
Optional patch for default_event_script.sl

This patch is optional. If you want to make this concept available to the default behavior, you can apply it to default_event_script.sl.
Created attachment 337981
priority_service.sl is the standalone implementation

This file can be used as a standalone implementation of this concept, as described in the previous comment.
If you use this implementation in default_event_script.sl, the relevant part of cluster.conf might look as follows:

  <rm central_processing="1" log_facility="local4" log_level="8">
    <failoverdomains>
      <failoverdomain name="all">
        <failoverdomainnode name="axqa03-1" priority="1"/>
        <failoverdomainnode name="axqa03-1" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <service name="test1" domain="all" autostart="0" priority="5">
      <script name="/usr/local/test/test1.sh"/>
    </service>
    <service name="test2" domain="all" autostart="0" priority="4">
      <script name="/usr/local/test/test2.sh"/>
    </service>
    <service name="test3" domain="all" autostart="0" priority="3">
      <script name="/usr/local/test/test3.sh"/>
    </service>
    <service name="test4" domain="all" autostart="0" priority="2">
      <script name="/usr/local/test/test4.sh"/>
    </service>
    <service name="test5" domain="all" autostart="0" priority="1">
      <script name="/usr/local/test/test5.sh"/>
    </service>
    <vm name="axqad101_2" path="/etc/xen" domain="all" autostart="0"/>
    <resources/>
  </rm>

What do you think?
Marc.
Patch nuked freeze/unfreeze:

  @@ -296,15 +434,7 @@
   		ret = service_stop(service_name);
  -	} else if (user_request == USER_FREEZE) {
  -
  -		ret = service_freeze(service_name);
  -
  -	} else if (user_request == USER_UNFREEZE) {
  -
  -		ret = service_unfreeze(service_name);
  -
  -	}
  +	}
   	%
   	% todo - migrate

Aside from that, I'd say it's pretty good. I'll try to test it today.
I haven't tested this yet due to other priorities. My apologies. I will test it as soon as I return from vacation.
I:
* added back the bits relating to FREEZE/UNFREEZE
* changed the description of the 'priority' field to the following, to note that it only has an effect with central_processing turned on:

    Priority for the service. In a failover scenario, this indicates the ordering of the service (1 is processed first, 2 is processed second, etc.). This overrides the order presented in cluster.conf. This option only has an effect if central processing within rgmanager is turned on.

* changed <content type="string"... to <content type="integer"... in the new service attribute

Testing went fine. Note that administrators can achieve the same goal by sorting the services the way they want in cluster.conf directly.
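The two points of that description (integer-valued priorities, and no effect unless central processing is on) can be sketched in Python like this. The helper names are hypothetical, and falling back to the cluster.conf order when central processing is off is an assumption based on the wording of the description, not on the patch itself.

```python
def parse_priority(raw):
    """Convert a priority attribute to int; non-integer values raise ValueError."""
    return int(raw)


def processing_order(configured, central_processing):
    """Return service names in processing order.

    configured: list of (name, raw_priority) pairs in cluster.conf order.
    With central processing off, the configured order is returned unchanged
    (an assumption for this sketch); otherwise the lowest value goes first.
    """
    if not central_processing:
        return [name for name, _ in configured]
    ranked = sorted(configured, key=lambda pair: parse_priority(pair[1]))
    return [name for name, _ in ranked]


configured = [("test1", "2"), ("test2", "1"), ("test3", "3")]
with_central = processing_order(configured, central_processing=True)
without_central = processing_order(configured, central_processing=False)
print(with_central, without_central)
```

Sorting on the parsed integer rather than on the raw string also avoids "10" sorting before "2", which is one practical reason to declare the attribute as an integer.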
Created attachment 341796
Patch against current rhel5 branch
http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=4d9b91ea4c230c9e10d0e510a68b3e3898132de7
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1339.html
*** Bug 714671 has been marked as a duplicate of this bug. ***