Red Hat Bugzilla – Bug 1327469
pengine wants to start services that should not be started
Last modified: 2016-11-03 14:59:12 EDT
Created attachment 1147493 [details]
cib.xml.live from a live system

Description of problem:
In TripleO we need to stop a number of Pacemaker-managed resources, make some changes (such as configuration updates), and then start the resources again. We have a situation that is not fully understood, and before changing things around we'd like to understand exactly what is happening.

In short, we stop a dummy resource called "openstack-core" (which has lots of dependent resources) and then run crm_resource --wait. The problem is that crm_resource --wait does not return (we kill it after 30 minutes). We want to understand why it does not return within 30 minutes:

pcs resource disable <foo>
check_resource <foo> stopped <timeout>

The function check_resource, which uses crm_resource --wait, is defined here (lines 5-36):
https://github.com/openstack/tripleo-heat-templates/blob/master/extraconfig/tasks/pacemaker_common_functions.sh

So what seems to be happening is that we call "pcs resource disable openstack-core" and it terminates successfully, but crm_resource --wait never exits. Run with -VVV, it keeps printing this:

notice: LogActions: Start openstack-cinder-volume (overcloud-controller-2 - blocked)
notice: LogActions: Start openstack-heat-engine:0 (overcloud-controller-0)
notice: LogActions: Start openstack-heat-engine:1 (overcloud-controller-1)
notice: LogActions: Start openstack-heat-engine:2 (overcloud-controller-2)
notice: LogActions: Start openstack-heat-api-cloudwatch:0 (overcloud-controller-0)
notice: LogActions: Start openstack-heat-api-cloudwatch:1 (overcloud-controller-1)
notice: LogActions: Start openstack-heat-api-cloudwatch:2 (overcloud-controller-2)

It's not clear why these resources are trying to start. The graph of the ordering constraints is here:
http://file.rdu.redhat.com/~mbaldess/lp1569444/newton-jiri.pdf

It seems to me that even though openstack-core and its children are stopped successfully, the cluster is trying to start the services above, and hence crm_resource --wait is not exiting.

Version-Release number of selected component (if applicable):
pacemaker-1.1.13-10.el7_2.2.x86_64

How reproducible:
We know that if we add an ordering constraint on openstack-core to openstack-ceilometer-notification-clone (dropping the openstack-heat-api-clone one), plus one to openstack-sahara-engine-clone and one to openstack-aodh-listener-clone, we cannot reproduce this issue anymore. Otherwise it is reproducible.

Full sosreports are here:
http://file.rdu.redhat.com/~mbaldess/lp1569444/
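Spelled out, the stop-and-wait pattern boils down to roughly the following (a minimal sketch only; the 30-minute bound reflects how we kill the command, and the error handling is much simplified compared to the real check_resource wrapper):

# Disable the resource, then wait for the cluster to settle.
# coreutils timeout bounds the wait at 30 minutes, mirroring the
# behaviour described above where we kill crm_resource ourselves.
pcs resource disable openstack-core
if ! timeout 1800 crm_resource --wait; then
    echo "cluster did not settle within 30 minutes" >&2
    exit 1
fi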
Ken: A link to sos reports is included at the bottom of the description. If you download:

http://file.rdu.redhat.com/~mbaldess/lp1569444/sosreport-overcloud-controller-0-20160412162711/sos_commands/cluster/crm_report/overcloud-controller-0/cib.xml.live

and run it as:

CIB_file=./cib.xml.live crm_resource --wait -VVV

you'll see the LogActions logs that Michele mentions.
In the same situation, an empty graph is now produced, allowing the command to complete. https://github.com/ClusterLabs/pacemaker/commit/6951b7e
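For reference, crm_resource --wait only returns once the transition graph computed from the current CIB is empty. To see what the policy engine schedules for the attached CIB without waiting, something like the following should work (a sketch; crm_simulate is the standard tool for this, and the file name matches the attachment):

# Run the policy engine against the saved CIB and simulate the
# transition; the output's "Transition Summary" should list the
# spurious Start actions before the fix, and nothing runnable after.
crm_simulate --simulate --xml-file=./cib.xml.live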
QA: The simplest way to verify this is to take the cib.xml.live attached to this bz and run the command given in Comment 4. Before the fix, it will show multiple "LogActions: Start" lines without "blocked". After the fix, it will show only one line, marked "blocked".
Essentially, the PE was failing to correctly mark parts of the action graph as unrunnable. This could happen whenever a clone resource depended on another clone that in turn depended on something that was disabled. The result was a graph that looked like there was a bunch of work to do, but in reality none of it would ever be attempted. As a result of the fix, the graph now reflects reality: there is nothing to be done, and the cluster has reached a steady state.

As for the implications for people using 'crm_resource --wait': prior to the fix, anyone hitting this condition was basically getting a call to 'sleep 3600' that also returned an error. So it's hard to imagine anyone having come to rely on that behaviour. People running it by hand would surely get bored and cancel the command, and any scripts that hadn't already timed out at a higher level would likely fail because the command returned an error.
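A minimal configuration matching that description might look as follows (a sketch only; the resource names here are made up for illustration, whereas the actual report involves the OpenStack resource chain):

# Hypothetical three-level chain: clone -> clone -> plain resource.
pcs resource create core ocf:pacemaker:Dummy
pcs resource create svc-a ocf:pacemaker:Dummy clone
pcs resource create svc-b ocf:pacemaker:Dummy clone

# svc-a-clone must start after core; svc-b-clone after svc-a-clone.
pcs constraint order start core then svc-a-clone
pcs constraint order start svc-a-clone then svc-b-clone

# Once core is disabled and everything has stopped, there is nothing
# left to do; the buggy PE nevertheless kept scheduling (never to be
# attempted) starts for svc-b, so this wait would hang until timeout.
pcs resource disable core
crm_resource --wait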
Confirmed fixed in pacemaker-1.1.15-9.el7.x86_64

Before the fix:

> [root@virt-247 ~]# rpm -q pacemaker
> pacemaker-1.1.13-10.el7_2.3.x86_64
> [root@virt-247 ~]# CIB_file=./cib.xml.live crm_resource --wait -VVV
> notice: LogActions: Start openstack-cinder-volume (overcloud-controller-2 - blocked)
> notice: LogActions: Start openstack-heat-engine:0 (overcloud-controller-0)
> notice: LogActions: Start openstack-heat-engine:1 (overcloud-controller-1)
> notice: LogActions: Start openstack-heat-engine:2 (overcloud-controller-2)
> notice: LogActions: Start openstack-heat-api-cloudwatch:0 (overcloud-controller-0)
> notice: LogActions: Start openstack-heat-api-cloudwatch:1 (overcloud-controller-1)
> notice: LogActions: Start openstack-heat-api-cloudwatch:2 (overcloud-controller-2)
> notice: LogActions: Start openstack-cinder-volume (overcloud-controller-2 - blocked)
> notice: LogActions: Start openstack-heat-engine:0 (overcloud-controller-0)
> notice: LogActions: Start openstack-heat-engine:1 (overcloud-controller-1)
> notice: LogActions: Start openstack-heat-engine:2 (overcloud-controller-2)
> notice: LogActions: Start openstack-heat-api-cloudwatch:0 (overcloud-controller-0)
> notice: LogActions: Start openstack-heat-api-cloudwatch:1 (overcloud-controller-1)
> notice: LogActions: Start openstack-heat-api-cloudwatch:2 (overcloud-controller-2)
> ^C

After the fix:

> [root@virt-138 ~]# rpm -q pacemaker
> pacemaker-1.1.15-9.el7.x86_64
> [root@virt-138 ~]# CIB_file=./cib.xml.live crm_resource --wait -VVV
> notice: LogActions: Start openstack-cinder-volume (overcloud-controller-2 - blocked)
> [root@virt-138 ~]# echo $?
> 0

crm_resource --wait now correctly and cleanly terminates when there's nothing to be done. Marking as verified.
Doc text: maybe change "no work correctly" to "could fail to return".
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2016-2578.html