Bug 1928762
| Summary: | pacemaker-2.0.5-6 ends up in “error: Shutdown Escalation just popped in state S_TRANSITION_ENGINE!” during a shutdown |
|---|---|
| Product: | Red Hat Enterprise Linux 8 |
| Component: | pacemaker |
| Version: | 8.4 |
| Status: | CLOSED ERRATA |
| Severity: | urgent |
| Priority: | urgent |
| Target Milestone: | rc |
| Target Release: | 8.4 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Reporter: | Damien Ciabrini <dciabrin> |
| Assignee: | Ken Gaillot <kgaillot> |
| QA Contact: | cluster-qe <cluster-qe> |
| Docs Contact: | |
| CC: | cfeist, cluster-maint, dhill, lmiccini, michele, msmazova, pkomarov, pkundal, sathlang, sbradley |
| Keywords: | Triaged, ZStream |
| Flags: | pm-rhel: mirror+ |
| Whiteboard: | |
| Fixed In Version: | pacemaker-2.0.5-8.el8 |
| Doc Type: | Bug Fix |
| Story Points: | --- |
| Clone Of: | |
| Clones: | 1971650 1972368 1972369 (view as bug list) |
| Environment: | |
| Last Closed: | 2021-05-18 15:26:45 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |
| Bug Depends On: | |
| Bug Blocks: | 1971650, 1972368, 1972369 |

Doc Text:

Cause: If a guest node or bundle instance with active resources had to move at the same time as its remote connection, cancellation of active resource monitors would be ordered before the connection move but routed through the connection's destination node.

Consequence: The cancellation would always fail because the connection was not yet established on the destination node, and the cluster could not make further progress as long as both moves were required.

Fix: Monitor cancellations are now routed through the connection's original node.

Result: The moves succeed.
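To check whether a given node already carries the fix listed in "Fixed In Version", one option is to compare the installed package build against pacemaker-2.0.5-8.el8. A minimal sketch, not taken from this report (rpmdev-vercmp is part of rpmdevtools):

```
# Show the installed pacemaker build; the fix is in pacemaker-2.0.5-8.el8 or later
rpm -q pacemaker

# Compare the installed version-release explicitly against the fixed build
rpmdev-vercmp "$(rpm -q --queryformat '%{VERSION}-%{RELEASE}' pacemaker)" 2.0.5-8.el8
```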
Comment 2, Ken Gaillot:

This is almost certainly unrelated, but there are lots of instances of this:
Feb 15 00:36:28 controller-2 pacemaker-controld[42561]: warning: Cannot execute '/usr/lib/ocf/resource.d/ovn/ovndb-servers': No such file or directory
Feb 15 00:36:28 controller-2 pacemaker-controld[42561]: error: Failed to retrieve meta-data for ocf:ovn:ovndb-servers
Other resource actions succeed fine, so the file must actually be there. The controller (running as hacluster) runs meta-data actions, while the executor (running as root) runs other resource actions, so I'm guessing hacluster lacks access to one of the parent directories (via permissions or SELinux). If that's the case, whatever package creates the directory should be updated.

The only impact would be that the cluster wouldn't know which parameters, if any, are reloadable or sensitive, which shouldn't have any effect on the problem here, but it's worth fixing. There's a long-term Pacemaker to-do item to make meta-data actions go through the executor, which would avoid this issue, but I don't know when that will happen.
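A permissions/SELinux theory like this can be checked with standard tools. A minimal sketch, not taken from this report, assuming the agent path from the log lines above (the follow-up below notes the real cause: the agent only exists inside the container, not on the host):

```
# Show owner/mode for every component of the path, and flag any missing component
namei -l /usr/lib/ocf/resource.d/ovn/ovndb-servers

# Try to read the agent as hacluster, the user pacemaker-controld runs as
sudo -u hacluster head -n 1 /usr/lib/ocf/resource.d/ovn/ovndb-servers

# Look for recent SELinux denials against the controller daemon
ausearch -m avc -c pacemaker-controld --start recent
```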
Comment 3, Michele Baldessari:

(In reply to Ken Gaillot from comment #2)
> This is almost certainly unrelated, but there are lots of instances of this:
>
> Feb 15 00:36:28 controller-2 pacemaker-controld[42561]: warning: Cannot
> execute '/usr/lib/ocf/resource.d/ovn/ovndb-servers': No such file or
> directory
> Feb 15 00:36:28 controller-2 pacemaker-controld[42561]: error: Failed to
> retrieve meta-data for ocf:ovn:ovndb-servers
>
> Other resource actions succeed fine, so the file must actually be there.

Right, the file is there inside the ovn-dbs container but not on the host. The reason is that this OVN OCF resource agent is maintained by other folks and ends up in the ovn package, not in the resource-agents RPM. That is actually why I had filed https://bugzilla.redhat.com/show_bug.cgi?id=1850506 at the time (but, as you mention, it is not a trivial amount of work). The messages are just a bit annoying or scary if you do not know they are benign.

Ken Gaillot:

(In reply to Michele Baldessari from comment #3)
> Right, the file is there inside the ovn-dbs container but not on the host.
> That is actually why I had filed
> https://bugzilla.redhat.com/show_bug.cgi?id=1850506 at the time (but, as you
> mention, it is not a trivial amount of work).

Ah right, I forgot about that. Never mind then. One of these days I'll get to that ...

Comment 5, Ken Gaillot:

This is unrelated to the crm_resource regression in -6; don't worry about the agents.

The cancel action failure is a scheduling bug, and the apparent cause of the shutdown hang as well, since the scheduler just keeps incorrectly rescheduling the cancel action each time it fails.

The problem starts at 01:40:51, when an instance of ovn-dbs-bundle needs to be moved from controller-2 to controller-0. The first thing that needs to happen is cancelling the recurring monitor for the ovndb_servers instance inside it, so that ovndb_servers can be stopped, and so on. The problem is that the scheduler schedules the cancel action on controller-0 instead of controller-2. Since controller-0 doesn't yet have the connection to the bundle, it immediately returns an error.

At this point I haven't determined whether it's a regression. Has this same procedure worked on earlier builds?

I'll try to get a fix ASAP -- if we go beyond the end of this week, we'll need to get an exception. Can OpenStack QA ack this?

Michele Baldessari:

(In reply to Ken Gaillot from comment #5)
> At this point I haven't determined whether it's a regression. Has this same
> procedure worked on earlier builds?

Thanks Ken. It's hard to say, to be honest; I need to discuss this with Damien tomorrow. Previously we were quite often affected by the systemd stop of pacemaker_remote, so it is a bit difficult to get hard data.

> I'll try to get a fix ASAP -- if we go beyond the end of this week, we'll
> need to get an exception. Can OpenStack QA ack this?

Let me ask our QE for this in the meantime. Thanks!

Fix merged upstream as of commit 8f5b73c0.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:1782

*** Bug 1969419 has been marked as a duplicate of this bug. ***

*** Bug 1971650 has been marked as a duplicate of this bug. ***

*** Bug 1973660 has been marked as a duplicate of this bug. ***
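The scheduling decision described in comment #5 can typically be replayed offline from the saved scheduler inputs referenced in the description below. A minimal sketch, not taken from this report, assuming the pe-input files under /var/lib/pacemaker/pengine:

```
# Decompress one of the scheduler inputs captured while the cancel kept failing
bzcat /var/lib/pacemaker/pengine/pe-input-833.bz2 > /tmp/pe-input-833.xml

# Replay the transition offline; the printed action list shows on which node
# each action (including the ovndb_servers monitor cancellation) is scheduled
crm_simulate --simulate --xml-file /tmp/pe-input-833.xml
```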
Description of problem:

Context: 9-node cluster + 2 remotes where we run OpenStack services.

We're performing a rolling OpenStack update (pacemaker gets upgraded from pacemaker-2.0.4-6.el8_3.1.x86_64 to pacemaker-2.0.5-6.el8.x86_64), where we shut down various pacemaker nodes one after the other (i.e. only one pacemaker node is shut down at a time). Ultimately, all pacemaker nodes are expected to be stopped/restarted. Every node is stopped with “pcs cluster stop” and restarted with “pcs cluster start”.

At some point, the current DC is stopped with “pcs cluster stop” (via Ansible):

Feb 15 03:13:11 controller-2.redhat.local ansible-pacemaker_cluster[291519]: Invoked with state=offline check_and_fail=False timeout=300 force=True node=None

Then it looks like no election takes place to move the DC role to another running node; the current DC seems to be stuck for 20 minutes in a “stop” sequence, and it looks like it's not even trying to stop the resources that it's hosting.

Probably worth noting is that, at the time, there was a resource in the cluster that kept erroring out when the cluster tried to cancel a running operation, e.g.:

```
[root@controller-2 ~]# bzdiff /var/lib/pacemaker/pengine/pe-input-833.bz2 /var/lib/pacemaker/pengine/pe-input-834.bz2
1c1
< <cib crm_feature_set="3.4.1" validate-with="pacemaker-3.4" epoch="231" num_updates="3472" admin_epoch="0" cib-last-written="Mon Feb 15 03:13:10 2021" update-origin="controller-2" update-client="cibadmin" update-user="root" have-quorum="1" dc-uuid="3" execution-date="1613359951">
---
> <cib crm_feature_set="3.4.1" validate-with="pacemaker-3.4" epoch="231" num_updates="3473" admin_epoch="0" cib-last-written="Mon Feb 15 03:13:10 2021" update-origin="controller-2" update-client="cibadmin" update-user="root" have-quorum="1" dc-uuid="3" execution-date="1613359951">
789c789
< <nvpair id="status-2-fail-count-ovndb_servers.cancel_30000" name="fail-count-ovndb_servers#cancel_30000" value="8792"/>
---
> <nvpair id="status-2-fail-count-ovndb_servers.cancel_30000" name="fail-count-ovndb_servers#cancel_30000" value="8793"/>
```

The same transition went on and on until the node got fenced around 03:33:

```
[root@controller-2 pengine]# bzdiff pe-input-919.bz2 pe-input-920.bz2
1c1
< <cib crm_feature_set="3.4.1" validate-with="pacemaker-3.4" epoch="231" num_updates="3599" admin_epoch="0" cib-last-written="Mon Feb 15 03:13:10 2021" update-origin="controller-2" update-client="cibadmin" update-user="root" have-quorum="1" dc-uuid="3" execution-date="1613359992">
---
> <cib crm_feature_set="3.4.1" validate-with="pacemaker-3.4" epoch="231" num_updates="3600" admin_epoch="0" cib-last-written="Mon Feb 15 03:13:10 2021" update-origin="controller-2" update-client="cibadmin" update-user="root" have-quorum="1" dc-uuid="3" execution-date="1613359992">
789c789
< <nvpair id="status-2-fail-count-ovndb_servers.cancel_30000" name="fail-count-ovndb_servers#cancel_30000" value="8878"/>
---
> <nvpair id="status-2-fail-count-ovndb_servers.cancel_30000" name="fail-count-ovndb_servers#cancel_30000" value="8879"/>

[root@controller-2 pengine]# date -d @1613359992
Mon Feb 15 03:33:12 UTC 2021
```

Version-Release number of selected component (if applicable):
pacemaker-2.0.5-6.el8.x86_64

How reproducible:
It seems we're able to reproduce this moderately often in CI. As usual, it is not a 100% thing, but we've observed it a few times over the last week.

Steps to Reproduce:
1. Perform a rolling restart of a running OpenStack pacemaker cluster.

Actual results:
One node (possibly always the DC? we're unsure) gets stuck in a loop for 20 minutes while stopping, and eventually fails and gets fenced.

Expected results:
All nodes should be able to stop and restart.

Additional info:
Sosreports + tar files of /var/lib/pacemaker from all nodes are here: http://file.rdu.redhat.com/~mbaldess/pacemaker-minor-update-issue/
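The fail-count attribute seen incrementing in the bzdiff output above can also be inspected on a live cluster. A minimal sketch, not taken from this report, assuming the resource and node names used here:

```
# Query the per-node fail count that the CIB diffs show climbing
crm_failcount --query --resource ovndb_servers --node controller-2

# Equivalent view through pcs
pcs resource failcount show ovndb_servers
```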