Red Hat Bugzilla – Bug 1345876
Restarting a resource in a resource group on a remote node restarts other services instead
Last modified: 2016-11-03 14:59:52 EDT
Description of problem:
When requesting a restart of a resource that is part of a resource group running on a remote node, a different resource in the group is restarted instead.

Version-Release number of selected component (if applicable):
pacemaker-1.1.13-10.el7_2.2 and pacemaker-1.1.15-2.el7

How reproducible:
Always

Steps to Reproduce:
1. Configure a remote node
2. Configure 4 arbitrary resources in a resource group
3. Start all resources on the remote node
4. Restart one of the resources

Actual results:
A random, different resource than the one requested is restarted:

# pcs resource
 Resource Group: group
     database   (ocf::heartbeat:pgsql):         Started rhel7c.usersys.redhat.com
     appserver  (ocf::heartbeat:tomcat):        Started rhel7c.usersys.redhat.com
     webserver  (ocf::heartbeat:apache):        Started rhel7c.usersys.redhat.com
     mailserver (ocf::heartbeat:postfix):       Started rhel7c.usersys.redhat.com
 vm-rhel7c  (ocf::heartbeat:VirtualDomain):     Started rhel7pm1.usersys.redhat.com

# date && pcs resource restart webserver
Mon 13 Jun 12:12:47 BST 2016
webserver successfully restarted

# journalctl -f
Jun 13 12:12:49 rhel7pm1.usersys.redhat.com crmd[19581]: notice: Result of stop operation for mailserver on rhel7c.usersys.redhat.com: ok | call=81 key=mailserver_stop_0 confirmed=true rc=0 cib-update=37
Jun 13 12:12:52 rhel7pm1.usersys.redhat.com crmd[19581]: notice: Result of start operation for mailserver on rhel7c.usersys.redhat.com: ok | call=82 key=mailserver_start_0 confirmed=true rc=0 cib-update=38

webserver != mailserver

Expected results:
The requested resource (rather than a different resource) is restarted.

Additional info:
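The reproduction steps can be sketched with pcs roughly as follows. This is a minimal sketch, not the reporter's exact commands: the hypervisor URI, domain config path, and the location constraint are assumptions; resource and node names are taken from the output above.

```shell
# Assumed sketch: requires an existing pacemaker/corosync cluster and a
# libvirt guest reachable as rhel7c.usersys.redhat.com.

# Make rhel7c a pacemaker_remote node managed via a VirtualDomain resource
pcs resource create vm-rhel7c ocf:heartbeat:VirtualDomain \
    hypervisor="qemu:///system" config="/etc/libvirt/qemu/rhel7c.xml" \
    meta remote-node=rhel7c.usersys.redhat.com

# Create four resources in one group (order of creation = order in group)
pcs resource create database   ocf:heartbeat:pgsql   --group group
pcs resource create appserver  ocf:heartbeat:tomcat  --group group
pcs resource create webserver  ocf:heartbeat:apache  --group group
pcs resource create mailserver ocf:heartbeat:postfix --group group

# Run the group on the remote node, then restart one member
pcs constraint location group prefers rhel7c.usersys.redhat.com
pcs resource restart webserver
```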
This is expected behavior. Restarting a resource does not (and should not) ignore any dependencies. So if resource B is ordered after A (whether by an order constraint, or by being later in a group), and we restart A, then B must stop beforehand and start afterward. Of course, the target resource should restart as well. Let me know if it did not. You can ignore dependencies if desired, by setting them as unmanaged before doing the restart. There are many situations where a restart can affect other resources. A restart is simply a normal stop followed by a normal start, so anything that might normally change after a stop can happen during a restart. Besides constraints as mentioned above, factors such as stickiness and placement strategy could come into play.
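The workaround mentioned above (ignoring dependencies by unmanaging them before the restart) can be sketched like this, using the resource names from this report:

```shell
# Sketch: restart webserver without also restarting the group member
# ordered after it. Resource names come from this report; adjust to
# your configuration.

# Take the dependent member out of cluster management
pcs resource unmanage mailserver

# Restart only the target resource
pcs resource restart webserver

# Return the dependent member to cluster management
pcs resource manage mailserver
```

Note that while a resource is unmanaged, the cluster will not stop, start, or recover it, so return it to managed mode as soon as the restart completes.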
Sorry if that wasn't clear: there are no configured dependencies.
A group implies ordering and colocation between its members, so restarting mailserver is expected when restarting webserver. However, webserver itself does not get restarted, which is a bug.

This is what happens when the command is run:
* target-role for webserver is set to Stopped. The command is supposed to wait at this point until webserver is stopped, but it does not.
* Because mailserver is listed after webserver in the group, the cluster schedules stops for mailserver then webserver.
* The request to stop mailserver is initiated and executed.
* Before the request to stop webserver can be initiated, target-role for webserver is cleared (because the command did not wait as it was supposed to).
* The cluster cancels the stop for webserver since it is no longer needed, and starts mailserver.

I'll investigate why the command is not waiting when it is supposed to.
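The target-role mechanism in the sequence above can be reproduced by hand; this is a sketch of what the restart command is effectively doing internally, using the resource names from this report:

```shell
# 1. Set target-role=Stopped on the member being restarted.
#    The cluster now schedules stops for mailserver, then webserver
#    (group members after the target must stop first).
pcs resource meta webserver target-role=Stopped

# 2. The restart command should wait here until webserver is actually
#    stopped; the bug described above is that it does not.

# 3. Clear target-role so the resource is allowed to start again.
#    If webserver has not stopped yet, its pending stop is cancelled,
#    and only mailserver ends up being stopped and started.
pcs resource meta webserver target-role=
```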
This is fixed upstream as of commit f5afdc1. What was happening is that the group was being looked at as a whole -- so long as any member was started, the group was considered started, so crm_mon wasn't waiting for the individually stopped member to start again. Now, crm_mon expands groups into their individual resources, so starts/stops are monitored individually.
Thank you, Ken. Looking at the patch, do I understand correctly that the problem is unrelated to the resource group preferring a remote node?
Correct. The problem could occur when restarting a member of any group when there is at least one other member before it in the group, and it would be almost certain to occur if there is also at least one member after it.
Tested cluster configuration: 2 nodes with one resource group containing 3 ocf:heartbeat:Dummy resources (dummy1, dummy2, dummy3; in this particular order).

Before the fix (pacemaker-1.1.15-2.el7): restarting the first resource in the group (dummy1) always stopped resources in reverse order (3, 2, 1) and then proceeded to start them up again (1, 2, 3). Restarting dummy2 sometimes did the correct procedure (stop 3, stop 2, start 2, start 3), other times restarted only the third resource, and in some cases did nothing at all. Attempting to restart the last resource in the group (dummy3) likewise sometimes did the right thing (stop and start dummy3) and other times did nothing at all.

After the fix (pacemaker-1.1.15-3.el7): the cluster always stops the resources that need stopping and then starts them in the correct order.

Marking as verified in pacemaker-1.1.15-3.el7.
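The verification setup described above can be recreated with pcs roughly as follows (a sketch; the Dummy resource names come from this comment, while the group name `dummygroup` is an assumption):

```shell
# Sketch of the verification setup: one group of three Dummy resources,
# created in order so the group ordering is dummy1 -> dummy2 -> dummy3.
pcs resource create dummy1 ocf:heartbeat:Dummy --group dummygroup
pcs resource create dummy2 ocf:heartbeat:Dummy --group dummygroup
pcs resource create dummy3 ocf:heartbeat:Dummy --group dummygroup

# Restart each member and watch the stop/start ordering in the logs:
pcs resource restart dummy1   # expected: stop 3, 2, 1; start 1, 2, 3
pcs resource restart dummy2   # expected: stop 3, 2; start 2, 3
pcs resource restart dummy3   # expected: stop 3; start 3
```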
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2016-2578.html