Red Hat Bugzilla – Bug 1345876
Restarting a resource in a resource group on a remote node restarts other services instead
Last modified: 2016-11-03 14:59:52 EDT
Description of problem:
When requesting a restart of a resource that is part of a resource group running on a remote node, a different resource in the group is restarted instead.

Version-Release number of selected component (if applicable):
pacemaker-1.1.13-10.el7_2.2 and pacemaker-1.1.15-2.el7

How reproducible:
Always

Steps to Reproduce:
1. Configure a remote node
2. Configure 4 arbitrary resources in a resource group
3. Start all resources on the remote node
4. Restart one of the resources

Actual results:
A random, different resource than the one requested is restarted:

# pcs resource
 Resource Group: group
     database   (ocf::heartbeat:pgsql):         Started rhel7c.usersys.redhat.com
     appserver  (ocf::heartbeat:tomcat):        Started rhel7c.usersys.redhat.com
     webserver  (ocf::heartbeat:apache):        Started rhel7c.usersys.redhat.com
     mailserver (ocf::heartbeat:postfix):       Started rhel7c.usersys.redhat.com
 vm-rhel7c  (ocf::heartbeat:VirtualDomain):     Started rhel7pm1.usersys.redhat.com

# date && pcs resource restart webserver
Mon 13 Jun 12:12:47 BST 2016
webserver successfully restarted

# journalctl -f
Jun 13 12:12:49 rhel7pm1.usersys.redhat.com crmd[19581]: notice: Result of stop operation for mailserver on rhel7c.usersys.redhat.com: ok | call=81 key=mailserver_stop_0 confirmed=true rc=0 cib-update=37
Jun 13 12:12:52 rhel7pm1.usersys.redhat.com crmd[19581]: notice: Result of start operation for mailserver on rhel7c.usersys.redhat.com: ok | call=82 key=mailserver_start_0 confirmed=true rc=0 cib-update=38

webserver != mailserver

Expected results:
The requested resource (rather than a different resource) is restarted.

Additional info:
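The reproduction steps can be sketched with pcs roughly as follows. This is a minimal sketch, not the reporter's exact commands: the hypervisor URI, domain config path, and the location constraint are assumptions; resource and node names are taken from the output above.

```shell
# Assumed sketch: requires an existing pacemaker/corosync cluster and a
# libvirt guest reachable as rhel7c.usersys.redhat.com.

# Make rhel7c a pacemaker_remote node managed via a VirtualDomain resource
pcs resource create vm-rhel7c ocf:heartbeat:VirtualDomain \
    hypervisor="qemu:///system" config="/etc/libvirt/qemu/rhel7c.xml" \
    meta remote-node=rhel7c.usersys.redhat.com

# Create four resources in one group (order of creation = order in group)
pcs resource create database   ocf:heartbeat:pgsql   --group group
pcs resource create appserver  ocf:heartbeat:tomcat  --group group
pcs resource create webserver  ocf:heartbeat:apache  --group group
pcs resource create mailserver ocf:heartbeat:postfix --group group

# Run the group on the remote node, then restart one member
pcs constraint location group prefers rhel7c.usersys.redhat.com
pcs resource restart webserver
```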
This is expected behavior. Restarting a resource does not (and should not) ignore any dependencies. So if resource B is ordered after A (whether by an order constraint, or by being later in a group), and we restart A, then B must stop beforehand and start afterward. Of course, the target resource should restart as well. Let me know if it did not. You can ignore dependencies if desired, by setting them as unmanaged before doing the restart. There are many situations where a restart can affect other resources. A restart is simply a normal stop followed by a normal start, so anything that might normally change after a stop can happen during a restart. Besides constraints as mentioned above, factors such as stickiness and placement strategy could come into play.
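The workaround mentioned above (ignoring dependencies by unmanaging them before the restart) can be sketched like this, using the resource names from this report:

```shell
# Sketch: restart webserver without also restarting the group member
# ordered after it. Resource names come from this report; adjust to
# your configuration.

# Take the dependent member out of cluster management
pcs resource unmanage mailserver

# Restart only the target resource
pcs resource restart webserver

# Return the dependent member to cluster management
pcs resource manage mailserver
```

Note that while a resource is unmanaged, the cluster will not stop, start, or recover it, so return it to managed mode as soon as the restart completes.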
Sorry if that wasn't clear: there are no configured dependencies.
A group implies ordering and colocation between its members, so restarting mailserver is expected when restarting webserver. However, webserver itself does not get restarted, which is a bug.

This is what happens when the command is run:
* target-role for webserver is set to Stopped. The command is supposed to wait at this point until webserver is stopped, but it does not.
* Because mailserver is listed after webserver in the group, the cluster schedules stops for mailserver then webserver.
* The request to stop mailserver is initiated and executed.
* Before the request to stop webserver can be initiated, target-role for webserver is cleared (because the command did not wait as it was supposed to).
* The cluster cancels the stop for webserver since it is no longer needed, and starts mailserver.

I'll investigate why the command is not waiting when it is supposed to.
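The target-role mechanism in the sequence above can be reproduced by hand; this is a sketch of what the restart command is effectively doing internally, using the resource names from this report:

```shell
# 1. Set target-role=Stopped on the member being restarted.
#    The cluster now schedules stops for mailserver, then webserver
#    (group members after the target must stop first).
pcs resource meta webserver target-role=Stopped

# 2. The restart command should wait here until webserver is actually
#    stopped; the bug described above is that it does not.

# 3. Clear target-role so the resource is allowed to start again.
#    If webserver has not stopped yet, its pending stop is cancelled,
#    and only mailserver ends up being stopped and started.
pcs resource meta webserver target-role=
```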
This is fixed upstream as of commit f5afdc1. What was happening is that the group was being looked at as a whole -- so long as any member was started, the group was considered started, so crm_mon wasn't waiting for the individually stopped member to start again. Now, crm_mon expands groups into their individual resources, so starts/stops are monitored individually.
Thank you, Ken. Looking at the patch, do I understand correctly that the problem is unrelated to the resource group preferring a remote node?
Correct. The problem could occur when restarting a member of any group when there is at least one other member before it in the group, and it would be almost certain to occur if there is also at least one member after it.
Tested cluster configuration: 2 nodes with one resource group containing 3 ocf:heartbeat:Dummy resources (dummy1, dummy2, dummy3; in this particular order).

Before the fix (pacemaker-1.1.15-2.el7): restarting the first resource in the group (dummy1) always stopped resources in reverse order (3, 2, 1) and then proceeded to start them up again (1, 2, 3). Restarting dummy2 sometimes did the correct procedure (stop 3, stop 2, start 2, start 3), other times restarted only the third resource, and in some cases did nothing at all. Attempting to restart the last resource in the group (dummy3) likewise sometimes did the right thing (stop and start dummy3) and other times did nothing at all.

After the fix (pacemaker-1.1.15-3.el7): the cluster always stops the resources that need stopping and then starts them in the correct order.

Marking as verified in pacemaker-1.1.15-3.el7.
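The verification setup described above can be recreated with pcs roughly as follows (a sketch; the Dummy resource names come from this comment, while the group name `dummygroup` is an assumption):

```shell
# Sketch of the verification setup: one group of three Dummy resources,
# created in order so the group ordering is dummy1 -> dummy2 -> dummy3.
pcs resource create dummy1 ocf:heartbeat:Dummy --group dummygroup
pcs resource create dummy2 ocf:heartbeat:Dummy --group dummygroup
pcs resource create dummy3 ocf:heartbeat:Dummy --group dummygroup

# Restart each member and watch the stop/start ordering in the logs:
pcs resource restart dummy1   # expected: stop 3, 2, 1; start 1, 2, 3
pcs resource restart dummy2   # expected: stop 3, 2; start 2, 3
pcs resource restart dummy3   # expected: stop 3; start 3
```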
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2016-2578.html