Red Hat Bugzilla – Bug 1327469
pengine wants to start services that should not be started
Last modified: 2016-11-03 14:59:12 EDT
Created attachment 1147493 [details]
cib.xml.live from a live system

Description of problem:
In TripleO we need to stop a number of Pacemaker-managed resources, make some changes (such as configuration updates), and then start the resources again. We have a situation that is not fully understood, and before changing things around we'd like to understand exactly what is happening.

In short, we stop a dummy resource called "openstack-core" (which has lots of dependent resources) and then run crm_resource --wait. The problem is that crm_resource --wait does not return (we kill it after 30 minutes). We want to understand why it does not return within 30 minutes:

pcs resource disable <foo>
check_resource <foo> stopped <timeout>

The function check_resource, which uses crm_resource --wait, is defined here (lines 5-36):
https://github.com/openstack/tripleo-heat-templates/blob/master/extraconfig/tasks/pacemaker_common_functions.sh

So what seems to be happening is that we call "pcs resource disable openstack-core" and it terminates successfully, but crm_resource --wait never exits. Run with -VVV, it keeps printing this:

notice: LogActions: Start openstack-cinder-volume (overcloud-controller-2 - blocked)
notice: LogActions: Start openstack-heat-engine:0 (overcloud-controller-0)
notice: LogActions: Start openstack-heat-engine:1 (overcloud-controller-1)
notice: LogActions: Start openstack-heat-engine:2 (overcloud-controller-2)
notice: LogActions: Start openstack-heat-api-cloudwatch:0 (overcloud-controller-0)
notice: LogActions: Start openstack-heat-api-cloudwatch:1 (overcloud-controller-1)
notice: LogActions: Start openstack-heat-api-cloudwatch:2 (overcloud-controller-2)

It's not clear why these resources are trying to start. The graph of the ordering constraints is here:
http://file.rdu.redhat.com/~mbaldess/lp1569444/newton-jiri.pdf

It seems to me that even though openstack-core and its children are stopped successfully, the cluster is trying to start the services above, and hence crm_resource --wait is not exiting.

Version-Release number of selected component (if applicable):
pacemaker-1.1.13-10.el7_2.2.x86_64

How reproducible:
We know that if we add an ordering constraint on openstack-core to openstack-ceilometer-notification-clone (dropping the openstack-heat-api-clone one), plus one to openstack-sahara-engine-clone and one to openstack-aodh-listener-clone, we cannot reproduce this issue anymore. Otherwise it is reproducible.

Full sosreports are here:
http://file.rdu.redhat.com/~mbaldess/lp1569444/
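Spelled out, the stop-and-wait pattern boils down to roughly the following (a minimal sketch only; the 30-minute bound reflects how we kill the command, and the error handling is much simplified compared to the real check_resource wrapper):

# Disable the resource, then wait for the cluster to settle.
# coreutils timeout bounds the wait at 30 minutes, mirroring the
# behaviour described above where we kill crm_resource ourselves.
pcs resource disable openstack-core
if ! timeout 1800 crm_resource --wait; then
    echo "cluster did not settle within 30 minutes" >&2
    exit 1
fi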
Ken: A link to sos reports is included at the bottom of the description. If you download:

http://file.rdu.redhat.com/~mbaldess/lp1569444/sosreport-overcloud-controller-0-20160412162711/sos_commands/cluster/crm_report/overcloud-controller-0/cib.xml.live

and run it as:

CIB_file=./cib.xml.live crm_resource --wait -VVV

you'll see the LogActions logs that Michele mentions.
In the same situation, an empty graph is now produced, allowing the command to complete. https://github.com/ClusterLabs/pacemaker/commit/6951b7e
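For reference, crm_resource --wait only returns once the transition graph computed from the current CIB is empty. To see what the policy engine schedules for the attached CIB without waiting, something like the following should work (a sketch; crm_simulate is the standard tool for this, and the file name matches the attachment):

# Run the policy engine against the saved CIB and simulate the
# transition; the output's "Transition Summary" should list the
# spurious Start actions before the fix, and nothing runnable after.
crm_simulate --simulate --xml-file=./cib.xml.live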
QA: The simplest way to verify this is to take the cib.xml.live attached to this bz and run the command given in Comment 4. Before the fix, it will show multiple "LogActions: Start" lines without "blocked". After the fix, it will show only one line, marked "blocked".
Essentially, the PE was failing to correctly mark parts of the action graph as unrunnable. This could happen whenever a clone resource depended on another clone that in turn depended on something that was disabled. The result was a graph that looked like there was a bunch of work to do, but in reality none of it would ever be attempted. As a result of the fix, the graph now reflects reality: there is nothing to be done, and the cluster has reached a steady state.

As for the implications for people using 'crm_resource --wait': prior to the fix, anyone hitting this condition was basically getting a call to 'sleep 3600' that also returned an error. So it's hard to imagine anyone having come to rely on that behaviour. People running it by hand would surely get bored and cancel the command, and any scripts that hadn't already timed out at a higher level would likely fail because the command returned an error.
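A minimal configuration matching that description might look as follows (a sketch only; the resource names here are made up for illustration, whereas the actual report involves the OpenStack resource chain):

# Hypothetical three-level chain: clone -> clone -> plain resource.
pcs resource create core ocf:pacemaker:Dummy
pcs resource create svc-a ocf:pacemaker:Dummy clone
pcs resource create svc-b ocf:pacemaker:Dummy clone

# svc-a-clone must start after core; svc-b-clone after svc-a-clone.
pcs constraint order start core then svc-a-clone
pcs constraint order start svc-a-clone then svc-b-clone

# Once core is disabled and everything has stopped, there is nothing
# left to do; the buggy PE nevertheless kept scheduling (never to be
# attempted) starts for svc-b, so this wait would hang until timeout.
pcs resource disable core
crm_resource --wait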
Confirmed fixed in pacemaker-1.1.15-9.el7.x86_64

Before the fix:

> [root@virt-247 ~]# rpm -q pacemaker
> pacemaker-1.1.13-10.el7_2.3.x86_64
> [root@virt-247 ~]# CIB_file=./cib.xml.live crm_resource --wait -VVV
> notice: LogActions: Start openstack-cinder-volume (overcloud-controller-2 - blocked)
> notice: LogActions: Start openstack-heat-engine:0 (overcloud-controller-0)
> notice: LogActions: Start openstack-heat-engine:1 (overcloud-controller-1)
> notice: LogActions: Start openstack-heat-engine:2 (overcloud-controller-2)
> notice: LogActions: Start openstack-heat-api-cloudwatch:0 (overcloud-controller-0)
> notice: LogActions: Start openstack-heat-api-cloudwatch:1 (overcloud-controller-1)
> notice: LogActions: Start openstack-heat-api-cloudwatch:2 (overcloud-controller-2)
> notice: LogActions: Start openstack-cinder-volume (overcloud-controller-2 - blocked)
> notice: LogActions: Start openstack-heat-engine:0 (overcloud-controller-0)
> notice: LogActions: Start openstack-heat-engine:1 (overcloud-controller-1)
> notice: LogActions: Start openstack-heat-engine:2 (overcloud-controller-2)
> notice: LogActions: Start openstack-heat-api-cloudwatch:0 (overcloud-controller-0)
> notice: LogActions: Start openstack-heat-api-cloudwatch:1 (overcloud-controller-1)
> notice: LogActions: Start openstack-heat-api-cloudwatch:2 (overcloud-controller-2)
> ^C

After the fix:

> [root@virt-138 ~]# rpm -q pacemaker
> pacemaker-1.1.15-9.el7.x86_64
> [root@virt-138 ~]# CIB_file=./cib.xml.live crm_resource --wait -VVV
> notice: LogActions: Start openstack-cinder-volume (overcloud-controller-2 - blocked)
> [root@virt-138 ~]# echo $?
> 0

crm_resource --wait now correctly and cleanly terminates when there's nothing to be done. Marking as verified.
Doc text: maybe change "no work correctly" to "could fail to return".
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2016-2578.html