Bug 1327469
| Summary: | pengine wants to start services that should not be started |
|---|---|
| Product: | Red Hat Enterprise Linux 7 |
| Reporter: | Michele Baldessari <michele> |
| Component: | pacemaker |
| Assignee: | Ken Gaillot <kgaillot> |
| Status: | CLOSED ERRATA |
| QA Contact: | cluster-qe <cluster-qe> |
| Severity: | urgent |
| Docs Contact: | Milan Navratil <mnavrati> |
| Priority: | urgent |
| Version: | 7.2 |
| CC: | abeekhof, cfeist, cluster-maint, djansa, fdinitto, jruemker, jstransk, mnavrati, phagara |
| Target Milestone: | rc |
| Keywords: | ZStream |
| Target Release: | 7.3 |
| Hardware: | All |
| OS: | Linux |
| Fixed In Version: | pacemaker-1.1.15-3.el7 |
| Doc Type: | Bug Fix |
| Last Closed: | 2016-11-03 18:59:12 UTC |
| Type: | Bug |
| Bug Blocks: | 1349493 (view as bug list) |

Doc Text:

The "crm_resource --wait" command and a "pcs" command with the "--wait" option now work correctly.

Previously, *Pacemaker* sometimes scheduled actions that depended on an unrunnable action on a cloned resource. As a consequence, log files became unnecessarily verbose, and the "crm_resource --wait" command never returned because of the scheduled actions. There was no significant effect on the cluster itself, as the cluster did not proceed beyond the unrunnable action. Now, *Pacemaker* no longer schedules actions that depend on an unrunnable clone action. As a result, log files are cleaner, and running "crm_resource --wait" or a "pcs" command with the "--wait" option returns as expected once the cluster stabilizes.
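As a rough illustration of the symptom described in the Doc Text (a sketch, not taken from the original report; the saved-CIB file name and the timeout value are illustrative), the scheduler can be exercised against a saved CIB, and a live wait can be bounded with coreutils timeout:

    # Run the scheduler against a saved CIB instead of the live cluster;
    # -VVV prints the "LogActions" lines referenced in the comments below.
    CIB_file=./cib.xml.live crm_resource --wait -VVV

    # On a live cluster, a script could bound the wait so an affected
    # (pre-fix) node cannot hang it forever; 600 seconds is an arbitrary limit.
    timeout 600 crm_resource --wait || echo "cluster did not settle before the limit"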
Description
Michele Baldessari
2016-04-15 08:09:47 UTC
Ken: A link to sos reports is included at the bottom of the description. If you download
http://file.rdu.redhat.com/~mbaldess/lp1569444/sosreport-overcloud-controller-0-20160412162711/sos_commands/cluster/crm_report/overcloud-controller-0/cib.xml.live
and run it as:

    CIB_file=./cib.xml.live crm_resource --wait -VVV

you'll see the LogActions logs that Michele mentions. In the same situation, an empty graph is now produced, allowing the command to complete.

https://github.com/ClusterLabs/pacemaker/commit/6951b7e

QA: The simplest way to verify this is to take the cib.xml.live attached to this bz and run the command given in Comment 4. Before the fix, it will show multiple "LogActions: Start" lines without "blocked". After the fix, it will show one with "blocked".

Essentially, the PE was failing to correctly mark parts of the action graph as unrunnable. This could happen whenever a clone resource depended on another clone that in turn depended on something that was disabled. The result was a graph that looked like there was a bunch of work to do, but in reality none of it would ever be attempted. With the fix, the graph now reflects reality: nothing to be done, the cluster has reached a steady state.

As for implications for people using 'crm_resource --wait': prior to the fix, anyone hitting this condition was effectively getting a call to 'sleep 3600' that also returned an error, so it's hard to imagine anyone having come to rely on that behaviour. People running it by hand would get bored and cancel the command, and any scripts that hadn't already timed out at a higher level would likely break because it returned an error.

Confirmed fixed in pacemaker-1.1.15-9.el7.x86_64.

Before the fix:

> [root@virt-247 ~]# rpm -q pacemaker
> pacemaker-1.1.13-10.el7_2.3.x86_64
> [root@virt-247 ~]# CIB_file=./cib.xml.live crm_resource --wait -VVV
> notice: LogActions: Start openstack-cinder-volume (overcloud-controller-2 - blocked)
> notice: LogActions: Start openstack-heat-engine:0 (overcloud-controller-0)
> notice: LogActions: Start openstack-heat-engine:1 (overcloud-controller-1)
> notice: LogActions: Start openstack-heat-engine:2 (overcloud-controller-2)
> notice: LogActions: Start openstack-heat-api-cloudwatch:0 (overcloud-controller-0)
> notice: LogActions: Start openstack-heat-api-cloudwatch:1 (overcloud-controller-1)
> notice: LogActions: Start openstack-heat-api-cloudwatch:2 (overcloud-controller-2)
> notice: LogActions: Start openstack-cinder-volume (overcloud-controller-2 - blocked)
> notice: LogActions: Start openstack-heat-engine:0 (overcloud-controller-0)
> notice: LogActions: Start openstack-heat-engine:1 (overcloud-controller-1)
> notice: LogActions: Start openstack-heat-engine:2 (overcloud-controller-2)
> notice: LogActions: Start openstack-heat-api-cloudwatch:0 (overcloud-controller-0)
> notice: LogActions: Start openstack-heat-api-cloudwatch:1 (overcloud-controller-1)
> notice: LogActions: Start openstack-heat-api-cloudwatch:2 (overcloud-controller-2)
> ^C

After the fix:

> [root@virt-138 ~]# rpm -q pacemaker
> pacemaker-1.1.15-9.el7.x86_64
> [root@virt-138 ~]# CIB_file=./cib.xml.live crm_resource --wait -VVV
> notice: LogActions: Start openstack-cinder-volume (overcloud-controller-2 - blocked)
> [root@virt-138 ~]# echo $?
> 0

crm_resource --wait now correctly and cleanly terminates when there's nothing to be done. Marking as verified.
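For context, here is a minimal configuration sketch of the dependency pattern described above (a clone ordered after another clone that is ordered after a disabled resource). This is not the reporter's actual configuration: the resource names and the use of ocf:heartbeat:Dummy are hypothetical, and the pcs syntax assumed is the RHEL 7 form.

    # Hypothetical resources matching the shape of the problem:
    # one plain resource that gets disabled, and two clones ordered after it.
    pcs resource create base-svc ocf:heartbeat:Dummy
    pcs resource create mid-svc ocf:heartbeat:Dummy --clone
    pcs resource create top-svc ocf:heartbeat:Dummy --clone

    # Mandatory ordering: top depends on mid, mid depends on base.
    pcs constraint order base-svc then mid-svc-clone
    pcs constraint order mid-svc-clone then top-svc-clone

    # Disabling the base resource makes the clone starts unrunnable.
    pcs resource disable base-svc

    # Before the fix, the scheduler still logged "Start" actions for the clones
    # and this never returned; after the fix it returns once the cluster is stable.
    crm_resource --wait -VVV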
doc text: maybe "now work correctly" -> "could fail to return"

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2578.html