Bug 1327469
| Summary: | pengine wants to start services that should not be started |
|---|---|
| Product: | Red Hat Enterprise Linux 7 |
| Reporter: | Michele Baldessari <michele> |
| Component: | pacemaker |
| Assignee: | Ken Gaillot <kgaillot> |
| Status: | CLOSED ERRATA |
| QA Contact: | cluster-qe <cluster-qe> |
| Severity: | urgent |
| Docs Contact: | Milan Navratil <mnavrati> |
| Priority: | urgent |
| Version: | 7.2 |
| CC: | abeekhof, cfeist, cluster-maint, djansa, fdinitto, jruemker, jstransk, mnavrati, phagara |
| Target Milestone: | rc |
| Keywords: | ZStream |
| Target Release: | 7.3 |
| Hardware: | All |
| OS: | Linux |
| Fixed In Version: | pacemaker-1.1.15-3.el7 |
| Doc Type: | Bug Fix |
| Last Closed: | 2016-11-03 18:59:12 UTC |
| Type: | Bug |
| Bug Blocks: | 1349493 (view as bug list) |

Doc Text:

The "crm_resource --wait" command and a "pcs" command with the "--wait" option now work correctly.

Previously, *Pacemaker* sometimes scheduled actions that depended on an unrunnable action on a cloned resource. As a consequence, log files became unnecessarily verbose, and the "crm_resource --wait" command never returned because of the scheduled actions. There was no significant effect on the cluster itself, as the cluster did not proceed beyond the unrunnable action. Now, *Pacemaker* no longer schedules actions that depend on an unrunnable clone action. As a result, log files are cleaner, and running "crm_resource --wait" or a "pcs" command with the "--wait" option returns as expected once the cluster stabilizes.
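As a rough illustration of the symptom described in the Doc Text (a sketch, not taken from the original report; the saved-CIB file name and the timeout value are illustrative), the scheduler can be exercised against a saved CIB, and a live wait can be bounded with coreutils timeout:

    # Run the scheduler against a saved CIB instead of the live cluster;
    # -VVV prints the "LogActions" lines referenced in the comments below.
    CIB_file=./cib.xml.live crm_resource --wait -VVV

    # On a live cluster, a script could bound the wait so an affected
    # (pre-fix) node cannot hang it forever; 600 seconds is an arbitrary limit.
    timeout 600 crm_resource --wait || echo "cluster did not settle before the limit"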
Description
Michele Baldessari
2016-04-15 08:09:47 UTC
Ken: A link to sos reports is included at the bottom of the description. If you download
http://file.rdu.redhat.com/~mbaldess/lp1569444/sosreport-overcloud-controller-0-20160412162711/sos_commands/cluster/crm_report/overcloud-controller-0/cib.xml.live
and run it as:

    CIB_file=./cib.xml.live crm_resource --wait -VVV

you'll see the LogActions logs that Michele mentions. In the same situation, an empty graph is now produced, allowing the command to complete.

https://github.com/ClusterLabs/pacemaker/commit/6951b7e

QA: The simplest way to verify this is to take the cib.xml.live attached to this bz and run the command given in Comment 4. Before the fix, it will show multiple "LogActions: Start" lines without "blocked". After the fix, it will show one with "blocked".

Essentially, the PE was failing to correctly mark parts of the action graph as unrunnable. This could happen whenever a clone resource depended on another clone that in turn depended on something that was disabled. The result was a graph that looked like there was a bunch of work to do, but in reality none of it would ever be attempted. With the fix, the graph now reflects reality: nothing to be done, the cluster has reached a steady state.

As for implications for people using 'crm_resource --wait': prior to the fix, anyone hitting this condition was effectively getting a call to 'sleep 3600' that also returned an error, so it's hard to imagine anyone having come to rely on that behaviour. People running it by hand would get bored and cancel the command, and any scripts that hadn't already timed out at a higher level would likely break because it returned an error.

Confirmed fixed in pacemaker-1.1.15-9.el7.x86_64.

Before the fix:

> [root@virt-247 ~]# rpm -q pacemaker
> pacemaker-1.1.13-10.el7_2.3.x86_64
> [root@virt-247 ~]# CIB_file=./cib.xml.live crm_resource --wait -VVV
> notice: LogActions: Start openstack-cinder-volume (overcloud-controller-2 - blocked)
> notice: LogActions: Start openstack-heat-engine:0 (overcloud-controller-0)
> notice: LogActions: Start openstack-heat-engine:1 (overcloud-controller-1)
> notice: LogActions: Start openstack-heat-engine:2 (overcloud-controller-2)
> notice: LogActions: Start openstack-heat-api-cloudwatch:0 (overcloud-controller-0)
> notice: LogActions: Start openstack-heat-api-cloudwatch:1 (overcloud-controller-1)
> notice: LogActions: Start openstack-heat-api-cloudwatch:2 (overcloud-controller-2)
> notice: LogActions: Start openstack-cinder-volume (overcloud-controller-2 - blocked)
> notice: LogActions: Start openstack-heat-engine:0 (overcloud-controller-0)
> notice: LogActions: Start openstack-heat-engine:1 (overcloud-controller-1)
> notice: LogActions: Start openstack-heat-engine:2 (overcloud-controller-2)
> notice: LogActions: Start openstack-heat-api-cloudwatch:0 (overcloud-controller-0)
> notice: LogActions: Start openstack-heat-api-cloudwatch:1 (overcloud-controller-1)
> notice: LogActions: Start openstack-heat-api-cloudwatch:2 (overcloud-controller-2)
> ^C

After the fix:

> [root@virt-138 ~]# rpm -q pacemaker
> pacemaker-1.1.15-9.el7.x86_64
> [root@virt-138 ~]# CIB_file=./cib.xml.live crm_resource --wait -VVV
> notice: LogActions: Start openstack-cinder-volume (overcloud-controller-2 - blocked)
> [root@virt-138 ~]# echo $?
> 0

crm_resource --wait now correctly and cleanly terminates when there's nothing to be done. Marking as verified.
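For context, here is a minimal configuration sketch of the dependency pattern described above (a clone ordered after another clone that is ordered after a disabled resource). This is not the reporter's actual configuration: the resource names and the use of ocf:heartbeat:Dummy are hypothetical, and the pcs syntax assumed is the RHEL 7 form.

    # Hypothetical resources matching the shape of the problem:
    # one plain resource that gets disabled, and two clones ordered after it.
    pcs resource create base-svc ocf:heartbeat:Dummy
    pcs resource create mid-svc ocf:heartbeat:Dummy --clone
    pcs resource create top-svc ocf:heartbeat:Dummy --clone

    # Mandatory ordering: top depends on mid, mid depends on base.
    pcs constraint order base-svc then mid-svc-clone
    pcs constraint order mid-svc-clone then top-svc-clone

    # Disabling the base resource makes the clone starts unrunnable.
    pcs resource disable base-svc

    # Before the fix, the scheduler still logged "Start" actions for the clones
    # and this never returned; after the fix it returns once the cluster is stable.
    crm_resource --wait -VVV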
doc text: maybe "now work correctly" -> "could fail to return"

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2578.html