The "crm_resource --wait" command and a "pcs" command with the "--wait" option now work correctly
Previously, *Pacemaker* sometimes scheduled actions that depended on an unrunnable action on a cloned resource. As a consequence, log files became unnecessarily verbose, and the "crm_resource --wait" command never returned due to the scheduled actions. There was no significant effect on the cluster itself, as the cluster did not proceed beyond the unrunnable action. Now, *Pacemaker* no longer schedules actions that depend on an unrunnable clone action. As a result, log files are cleaner, and running "crm_resource --wait" or a "pcs" command with the "--wait" option returns as expected when the cluster stabilizes.
Description
Michele Baldessari 2016-04-15 08:09:47 UTC
Created attachment 1147493
cib.xml.live from a live system
Description of problem:
In TripleO we need to stop a set of Pacemaker-managed resources, do some
work (such as configuration changes), and then start the resources again. We
have a situation that is not fully understood, and before changing anything
we'd like to know exactly what is happening. In short, we stop a dummy
resource called "openstack-core" (which has many dependent resources) and then
run crm_resource --wait.
The problem is that crm_resource --wait does not return (we kill it
after 30 minutes). We want to understand why it does not return within 30 minutes:
pcs resource disable <foo>
check_resource <foo> stopped <timeout>
The function check_resource, which uses crm_resource --wait, is defined here (Line 5-36):
https://github.com/openstack/tripleo-heat-templates/blob/master/extraconfig/tasks/pacemaker_common_functions.sh
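The core pattern of that shell helper, poll for the desired state until a timeout expires, can be sketched in Python. This is a simplified stand-in, not the actual check_resource code: the real script shells out to crm_resource --wait, and the function name and parameters here are made up for illustration.

```python
import subprocess
import time

def wait_until_stable(check_cmd, timeout_s, poll_interval_s=5):
    """Poll a command until it succeeds or the timeout expires.

    `check_cmd` is any command whose exit status 0 means "the resource
    reached the desired state". Returns False on timeout, mirroring the
    30-minute kill described in this report.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if subprocess.run(check_cmd, capture_output=True).returncode == 0:
            return True
        time.sleep(poll_interval_s)
    return False
```

With this shape, a hung crm_resource --wait maps to a check that never succeeds, so the caller always needs an outer timeout like the one above.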
So what seems to be happening is that we call "pcs resource disable openstack-core",
and it terminates successfully, but crm_resource --wait never exits.
Running crm_resource --wait -VVV keeps printing this:
notice: LogActions: Start openstack-cinder-volume (overcloud-controller-2 - blocked)
notice: LogActions: Start openstack-heat-engine:0 (overcloud-controller-0)
notice: LogActions: Start openstack-heat-engine:1 (overcloud-controller-1)
notice: LogActions: Start openstack-heat-engine:2 (overcloud-controller-2)
notice: LogActions: Start openstack-heat-api-cloudwatch:0 (overcloud-controller-0)
notice: LogActions: Start openstack-heat-api-cloudwatch:1 (overcloud-controller-1)
notice: LogActions: Start openstack-heat-api-cloudwatch:2 (overcloud-controller-2)
It's not clear why these resources are trying to start.
The graph of the ordering constraints is here:
http://file.rdu.redhat.com/~mbaldess/lp1569444/newton-jiri.pdf
It seems to me that even though openstack-core and its children are stopped
successfully, the cluster is still trying to start the services above, and
hence crm_resource --wait never exits.
Version-Release number of selected component (if applicable):
pacemaker-1.1.13-10.el7_2.2.x86_64
How reproducible:
We know that if we add an ordering constraint on openstack-core to openstack-ceilometer-notification-clone (removing the openstack-heat-api-clone one),
plus one on openstack-sahara-engine-clone and one on openstack-aodh-listener-clone, we can no longer reproduce this issue. Otherwise it is reproducible.
Full sosreports are here:
http://file.rdu.redhat.com/~mbaldess/lp1569444/
QA: The simplest way to verify this is to take the cib.xml.live attached to this bz and run the command given in Comment 4. Before the fix, it will show multiple "LogActions: Start" lines without "blocked". After the fix, it will show one with "blocked".
Essentially, the PE (policy engine) was failing to correctly mark parts of the action graph as unrunnable. This could happen whenever a clone resource depended on another clone that in turn depended on something that was disabled.
The result was a graph that looked like there was a bunch of work to do, but in reality none of it would ever be attempted. With the fix, the graph now reflects reality: there is nothing to be done, and the cluster has reached a steady state.
As for the implications for people using 'crm_resource --wait': prior to the fix, anyone hitting this condition was essentially getting a call to 'sleep 3600' that also returned an error, so it's hard to imagine anyone coming to rely on that behaviour. People running it by hand would surely get bored and cancel the command, and any script that hadn't already timed out at a higher level would likely fail because the command returned an error.
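The fixed behaviour, that any action ordered after an unrunnable action is itself unrunnable, transitively, can be modelled with a small fixed-point loop. This is an illustrative sketch only, not Pacemaker's actual policy-engine code, and the action names used are hypothetical.

```python
def mark_unrunnable(deps, unrunnable):
    """Propagate 'unrunnable' through an action graph.

    `deps` maps each action to the list of actions it depends on.
    `unrunnable` is the initial set of blocked actions. Any action with
    an unrunnable dependency becomes unrunnable too, repeated until no
    more actions change (a fixed point).
    """
    unrunnable = set(unrunnable)
    changed = True
    while changed:
        changed = False
        for action, requires in deps.items():
            if action not in unrunnable and unrunnable & set(requires):
                unrunnable.add(action)
                changed = True
    return unrunnable

# Hypothetical graph matching the bug's shape: a clone depends on another
# clone, which depends on a disabled resource.
deps = {
    "start heat-engine-clone": ["start core-clone"],
    "start core-clone": ["start disabled-resource"],
}
blocked = mark_unrunnable(deps, {"start disabled-resource"})
```

Before the fix, the scheduler effectively stopped propagating after one hop, so "start heat-engine-clone" stayed in the graph as pending work and crm_resource --wait never saw a stable cluster.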
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://rhn.redhat.com/errata/RHSA-2016-2578.html