Bug 1345876 - Restarting a resource in a resource group on a remote node restarts other services instead
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.2
Hardware: All Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 7.3
Assigned To: Ken Gaillot
QA Contact: cluster-qe@redhat.com
Reported: 2016-06-13 07:18 EDT by Julio Entrena Perez
Modified: 2016-11-03 14:59 EDT
CC: 5 users

Fixed In Version: pacemaker-1.1.15-3.el7
Doc Type: No Doc Update
Doc Text:
Documentation for this is included with BZ#1337688
Last Closed: 2016-11-03 14:59:52 EDT
Type: Bug




External Trackers:
  Red Hat Knowledge Base (Solution) 2372061 (last updated 2016-06-13 11:53 EDT)
  Red Hat Product Errata RHSA-2016:2578, SHIPPED_LIVE: "Moderate: pacemaker security, bug fix, and enhancement update" (last updated 2016-11-03 08:07:24 EDT)

Description Julio Entrena Perez 2016-06-13 07:18:54 EDT
Description of problem:
When a restart is requested for a resource that is part of a resource group running on a remote node, a different resource in the group is restarted instead.

Version-Release number of selected component (if applicable):
pacemaker-1.1.13-10.el7_2.2
and
pacemaker-1.1.15-2.el7

How reproducible:
Always

Steps to Reproduce:
1. Configure a remote node
2. Configure 4 random resources in a resource group
3. Start all resources on the remote node
4. Restart one of the resources
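
One possible pcs sequence for the steps above (a sketch only: the node name rhel7c.example.com, the resource names, and the use of ocf:pacemaker:remote and Dummy agents are assumptions for illustration, not taken from this report; adjust for your cluster and pcs version):

```shell
# 1. Configure a remote node (hypothetical hostname)
pcs resource create remote-rhel7c ocf:pacemaker:remote server=rhel7c.example.com

# 2. Configure 4 resources in a resource group (Dummy agents as placeholders)
pcs resource create database ocf:heartbeat:Dummy --group group
pcs resource create appserver ocf:heartbeat:Dummy --group group
pcs resource create webserver ocf:heartbeat:Dummy --group group
pcs resource create mailserver ocf:heartbeat:Dummy --group group

# 3. Prefer the remote node so the group starts there
pcs constraint location group prefers remote-rhel7c

# 4. Restart one member of the group
pcs resource restart webserver
```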

Actual results:
A resource other than the one requested is restarted:

# pcs resource
 Resource Group: group
     database	(ocf::heartbeat:pgsql):	Started rhel7c.usersys.redhat.com
     appserver	(ocf::heartbeat:tomcat):	Started rhel7c.usersys.redhat.com
     webserver	(ocf::heartbeat:apache):	Started rhel7c.usersys.redhat.com
     mailserver	(ocf::heartbeat:postfix):	Started rhel7c.usersys.redhat.com
 vm-rhel7c	(ocf::heartbeat:VirtualDomain):	Started rhel7pm1.usersys.redhat.com

# date && pcs resource restart webserver
Mon 13 Jun 12:12:47 BST 2016
webserver successfully restarted

# journalctl -f
Jun 13 12:12:49 rhel7pm1.usersys.redhat.com crmd[19581]:   notice: Result of stop operation for mailserver on rhel7c.usersys.redhat.com: ok | call=81 key=mailserver_stop_0 confirmed=true rc=0 cib-update=37
Jun 13 12:12:52 rhel7pm1.usersys.redhat.com crmd[19581]:   notice: Result of start operation for mailserver on rhel7c.usersys.redhat.com: ok | call=82 key=mailserver_start_0 confirmed=true rc=0 cib-update=38

webserver != mailserver

Expected results:
The requested resource (rather than a different resource) is restarted.

Additional info:
Comment 1 Ken Gaillot 2016-06-13 10:41:23 EDT
This is expected behavior. Restarting a resource does not (and should not) ignore any dependencies. So if resource B is ordered after A (whether by an order constraint, or by being later in a group), and we restart A, then B must stop beforehand and start afterward.

Of course, the target resource should restart as well. Let me know if it did not.

If desired, you can ignore dependencies by setting them to unmanaged before doing the restart.
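
A sketch of that workaround, using the resource names from this report (run against a live cluster; verify the subcommands against the pcs version you have installed):

```shell
# Make the dependent member unmanaged so the cluster leaves it alone,
# restart only the target, then return the member to cluster management.
pcs resource unmanage mailserver
pcs resource restart webserver
pcs resource manage mailserver
```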

There are many situations where a restart can affect other resources. A restart is simply a normal stop followed by a normal start, so anything that might normally change after a stop can happen during a restart. Besides constraints as mentioned above, factors such as stickiness and placement strategy could come into play.
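
The ordering described above can be illustrated with a small pure-bash simulation (hypothetical, not pacemaker code): restarting a group member implies stopping every later member first, then starting them again afterward.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of restart ordering in a group: when a member is
# restarted, every member ordered after it stops first (in reverse order)
# and starts again afterward (in forward order).
restart_sequence() {
    local target=$1; shift
    local members=("$@")
    local i idx=0
    for i in "${!members[@]}"; do
        if [[ ${members[$i]} == "$target" ]]; then
            idx=$i
        fi
    done
    # Stop the target and everything after it, last member first
    for ((i=${#members[@]}-1; i>=idx; i--)); do
        echo "stop ${members[$i]}"
    done
    # Start them again in group order
    for ((i=idx; i<${#members[@]}; i++)); do
        echo "start ${members[$i]}"
    done
}

restart_sequence webserver database appserver webserver mailserver
```

For the group in this report, restarting webserver therefore legitimately stops and starts mailserver as well; the bug is only that webserver itself was skipped.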
Comment 2 Julio Entrena Perez 2016-06-13 10:50:01 EDT
Sorry if that wasn't clear: there are no configured dependencies.
Comment 4 Ken Gaillot 2016-06-13 16:31:05 EDT
A group implies ordering and colocation between its members, so restarting mailserver is expected when restarting webserver.

However, webserver itself does not get restarted, which is a bug. This is what happens when the command is run:

* target-role for webserver is set to Stopped. The command is supposed to wait at this point until webserver is stopped, but it does not.

* Because mailserver is listed after webserver in the group, the cluster schedules stops for mailserver then webserver.

* The request to stop mailserver is initiated and executed.

* Before the request to stop webserver can be initiated, target-role for webserver is cleared (because the command did not wait as it was supposed to).

* The cluster cancels the stop for webserver since it is no longer needed, and starts mailserver.

I'll investigate why the command is not waiting when it is supposed to.
Comment 5 Ken Gaillot 2016-06-16 11:46:33 EDT
This is fixed upstream as of commit f5afdc1. What was happening is that the group was being looked at as a whole -- so long as any member was started, the group was considered started, so crm_mon wasn't waiting for the individually stopped member to start again. Now, crm_mon expands groups into their individual resources, so starts/stops are monitored individually.
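
The before/after logic can be sketched in pure bash (an illustration only, not the actual crm_mon source): the old check treated the group as started if any member was started, so a single stopped member went unnoticed; the fix checks each member individually.

```shell
#!/usr/bin/env bash
# Hypothetical illustration of the crm_mon change described above.
# Resource states: webserver is the individually stopped member.
declare -A state=([database]=Started [appserver]=Started
                  [webserver]=Stopped [mailserver]=Started)

# Old behaviour: the group counts as started if ANY member is started,
# so crm_mon stops waiting even though webserver is still stopped.
group_started_old() {
    local r
    for r in "$@"; do
        [[ ${state[$r]} == Started ]] && return 0
    done
    return 1
}

# Fixed behaviour: expand the group and require EVERY member started,
# so the wait continues until webserver comes back.
group_started_new() {
    local r
    for r in "$@"; do
        [[ ${state[$r]} == Started ]] || return 1
    done
    return 0
}

group_started_old database appserver webserver mailserver && echo "old check: group started"
group_started_new database appserver webserver mailserver || echo "new check: still waiting"
```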
Comment 6 Julio Entrena Perez 2016-06-16 11:58:35 EDT
Thank you, Ken. Looking at the patch, I understand that the problem is unrelated to the resource group preferring a remote node?
Comment 7 Ken Gaillot 2016-06-16 12:09:55 EDT
Correct. The problem could occur when restarting a member of any group when there is at least one other member before it in the group, and it would be almost certain to occur if there is also at least one member after it.
Comment 9 Patrik Hagara 2016-08-30 07:24:44 EDT
Tested cluster configuration: 2 nodes with one resource group containing 3 ocf:heartbeat:Dummy resources (dummy1, dummy2, dummy3; in this particular order).

Before the fix (pacemaker-1.1.15-2.el7), restarting the first resource in the group (dummy1) always stopped the resources in reverse order (3, 2, 1) and then started them again in order (1, 2, 3). Restarting dummy2 sometimes followed the correct procedure (stop 3, stop 2, start 2, start 3), other times restarted only the third resource, and in some cases did nothing at all. Attempting to restart the last resource in the group likewise sometimes did the right thing (stop and start dummy3) and other times did nothing at all.

After the fix (pacemaker-1.1.15-3.el7), the cluster always stops the resources that need stopping and then starts them in correct order.

Marking as verified in pacemaker-1.1.15-3.el7.
Comment 11 errata-xmlrpc 2016-11-03 14:59:52 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2578.html
