Bug 1928762
| Summary: | pacemaker-2.0.5-6 ends up in “error: Shutdown Escalation just popped in state S_TRANSITION_ENGINE!” during a shutdown |
|---|---|
| Product: | Red Hat Enterprise Linux 8 |
| Component: | pacemaker |
| Version: | 8.4 |
| Status: | CLOSED ERRATA |
| Severity: | urgent |
| Priority: | urgent |
| Target Milestone: | rc |
| Target Release: | 8.4 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Reporter: | Damien Ciabrini <dciabrin> |
| Assignee: | Ken Gaillot <kgaillot> |
| QA Contact: | cluster-qe <cluster-qe> |
| Docs Contact: | |
| CC: | cfeist, cluster-maint, dhill, lmiccini, michele, msmazova, pkomarov, pkundal, sathlang, sbradley |
| Keywords: | Triaged, ZStream |
| Flags: | pm-rhel: mirror+ |
| Whiteboard: | |
| Fixed In Version: | pacemaker-2.0.5-8.el8 |
| Doc Type: | Bug Fix |
| Story Points: | --- |
| Clone Of: | |
| Clones: | 1971650 1972368 1972369 (view as bug list) |
| Environment: | |
| Last Closed: | 2021-05-18 15:26:45 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |
| Bug Depends On: | |
| Bug Blocks: | 1971650, 1972368, 1972369 |

Doc Text:

Cause: If a guest node or bundle instance with active resources had to move at the same time as its remote connection, cancellation of active resource monitors would be ordered before the connection move but routed through the connection's destination node.

Consequence: The cancellation would always fail because the connection was not yet established on the destination node, and the cluster could not make further progress as long as both moves were required.

Fix: Monitor cancellations are now routed through the connection's original node.

Result: The moves succeed.
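To check whether a given node already carries the fix listed in "Fixed In Version", one option is to compare the installed package build against pacemaker-2.0.5-8.el8. A minimal sketch, not taken from this report (rpmdev-vercmp is part of rpmdevtools):

```
# Show the installed pacemaker build; the fix is in pacemaker-2.0.5-8.el8 or later
rpm -q pacemaker

# Compare the installed version-release explicitly against the fixed build
rpmdev-vercmp "$(rpm -q --queryformat '%{VERSION}-%{RELEASE}' pacemaker)" 2.0.5-8.el8
```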
Comment 2, Ken Gaillot:

This is almost certainly unrelated, but there are lots of instances of this:
Feb 15 00:36:28 controller-2 pacemaker-controld[42561]: warning: Cannot execute '/usr/lib/ocf/resource.d/ovn/ovndb-servers': No such file or directory
Feb 15 00:36:28 controller-2 pacemaker-controld[42561]: error: Failed to retrieve meta-data for ocf:ovn:ovndb-servers
Other resource actions succeed fine, so the file must actually be there. The controller (running as hacluster) runs meta-data actions, while the executor (running as root) runs other resource actions, so I'm guessing hacluster lacks access to one of the parent directories (via permissions or SELinux). If that's the case, whatever package creates the directory should be updated.

The only impact would be that the cluster wouldn't know which parameters, if any, are reloadable or sensitive, which shouldn't have any effect on the problem here, but it's worth fixing. There's a long-term Pacemaker to-do item to make meta-data actions go through the executor, which would avoid this issue, but I don't know when that will happen.
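A permissions/SELinux theory like this can be checked with standard tools. A minimal sketch, not taken from this report, assuming the agent path from the log lines above (the follow-up below notes the real cause: the agent only exists inside the container, not on the host):

```
# Show owner/mode for every component of the path, and flag any missing component
namei -l /usr/lib/ocf/resource.d/ovn/ovndb-servers

# Try to read the agent as hacluster, the user pacemaker-controld runs as
sudo -u hacluster head -n 1 /usr/lib/ocf/resource.d/ovn/ovndb-servers

# Look for recent SELinux denials against the controller daemon
ausearch -m avc -c pacemaker-controld --start recent
```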
Comment 3, Michele Baldessari:

(In reply to Ken Gaillot from comment #2)
> This is almost certainly unrelated, but there are lots of instances of this:
>
> Feb 15 00:36:28 controller-2 pacemaker-controld[42561]: warning: Cannot
> execute '/usr/lib/ocf/resource.d/ovn/ovndb-servers': No such file or
> directory
> Feb 15 00:36:28 controller-2 pacemaker-controld[42561]: error: Failed to
> retrieve meta-data for ocf:ovn:ovndb-servers
>
> Other resource actions succeed fine, so the file must actually be there.

Right, the file is there inside the ovn-dbs container but not on the host. The reason is that this OVN OCF resource agent is maintained by other folks and ends up in the ovn package, not in the resource-agents RPM. That is actually why I had filed https://bugzilla.redhat.com/show_bug.cgi?id=1850506 at the time (but, as you mention, it is not a trivial amount of work). The messages are just a bit annoying or scary if you do not know they are benign.

Ken Gaillot:

(In reply to Michele Baldessari from comment #3)
> Right, the file is there inside the ovn-dbs container but not on the host.
> That is actually why I had filed
> https://bugzilla.redhat.com/show_bug.cgi?id=1850506 at the time (but, as you
> mention, it is not a trivial amount of work).

Ah right, I forgot about that. Never mind then. One of these days I'll get to that ...

Comment 5, Ken Gaillot:

This is unrelated to the crm_resource regression in -6; don't worry about the agents.

The cancel action failure is a scheduling bug, and the apparent cause of the shutdown hang as well, since the scheduler just keeps incorrectly rescheduling the cancel action each time it fails.

The problem starts at 01:40:51, when an instance of ovn-dbs-bundle needs to be moved from controller-2 to controller-0. The first thing that needs to happen is cancelling the recurring monitor for the ovndb_servers instance inside it, so that ovndb_servers can be stopped, and so on. The problem is that the scheduler schedules the cancel action on controller-0 instead of controller-2. Since controller-0 doesn't yet have the connection to the bundle, it immediately returns an error.

At this point I haven't determined whether it's a regression. Has this same procedure worked on earlier builds?

I'll try to get a fix ASAP -- if we go beyond the end of this week, we'll need to get an exception. Can OpenStack QA ack this?

Michele Baldessari:

(In reply to Ken Gaillot from comment #5)
> At this point I haven't determined whether it's a regression. Has this same
> procedure worked on earlier builds?

Thanks Ken. It's hard to say, to be honest; I need to discuss this with Damien tomorrow. Previously we were quite often affected by the systemd stop of pacemaker_remote, so it is a bit difficult to get hard data.

> I'll try to get a fix ASAP -- if we go beyond the end of this week, we'll
> need to get an exception. Can OpenStack QA ack this?

Let me ask our QE for this in the meantime. Thanks!

Fix merged upstream as of commit 8f5b73c0.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:1782

*** Bug 1969419 has been marked as a duplicate of this bug. ***

*** Bug 1971650 has been marked as a duplicate of this bug. ***

*** Bug 1973660 has been marked as a duplicate of this bug. ***
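The scheduling decision described in comment #5 can typically be replayed offline from the saved scheduler inputs referenced in the description below. A minimal sketch, not taken from this report, assuming the pe-input files under /var/lib/pacemaker/pengine:

```
# Decompress one of the scheduler inputs captured while the cancel kept failing
bzcat /var/lib/pacemaker/pengine/pe-input-833.bz2 > /tmp/pe-input-833.xml

# Replay the transition offline; the printed action list shows on which node
# each action (including the ovndb_servers monitor cancellation) is scheduled
crm_simulate --simulate --xml-file /tmp/pe-input-833.xml
```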
Description of problem:

Context: 9-node cluster + 2 remotes where we run OpenStack services.

We're performing a rolling OpenStack update (pacemaker gets upgraded from pacemaker-2.0.4-6.el8_3.1.x86_64 to pacemaker-2.0.5-6.el8.x86_64), where we shut down various pacemaker nodes one after the other (i.e. only one pacemaker node is shut down at a time). Ultimately, all pacemaker nodes are expected to be stopped/restarted. Every node is stopped with “pcs cluster stop” and restarted with “pcs cluster start”.

At some point, the current DC is stopped with “pcs cluster stop” (via Ansible):

Feb 15 03:13:11 controller-2.redhat.local ansible-pacemaker_cluster[291519]: Invoked with state=offline check_and_fail=False timeout=300 force=True node=None

Then it looks like no election takes place to move the DC role to another running node; the current DC seems to be stuck for 20 minutes in a “stop” sequence, and it looks like it's not even trying to stop the resources that it's hosting.

Probably worth noting is that, at the time, there was a resource in the cluster that kept erroring out when the cluster tried to cancel a running operation, e.g.:

```
[root@controller-2 ~]# bzdiff /var/lib/pacemaker/pengine/pe-input-833.bz2 /var/lib/pacemaker/pengine/pe-input-834.bz2
1c1
< <cib crm_feature_set="3.4.1" validate-with="pacemaker-3.4" epoch="231" num_updates="3472" admin_epoch="0" cib-last-written="Mon Feb 15 03:13:10 2021" update-origin="controller-2" update-client="cibadmin" update-user="root" have-quorum="1" dc-uuid="3" execution-date="1613359951">
---
> <cib crm_feature_set="3.4.1" validate-with="pacemaker-3.4" epoch="231" num_updates="3473" admin_epoch="0" cib-last-written="Mon Feb 15 03:13:10 2021" update-origin="controller-2" update-client="cibadmin" update-user="root" have-quorum="1" dc-uuid="3" execution-date="1613359951">
789c789
< <nvpair id="status-2-fail-count-ovndb_servers.cancel_30000" name="fail-count-ovndb_servers#cancel_30000" value="8792"/>
---
> <nvpair id="status-2-fail-count-ovndb_servers.cancel_30000" name="fail-count-ovndb_servers#cancel_30000" value="8793"/>
```

The same transition went on and on until the node got fenced around 03:33:

```
[root@controller-2 pengine]# bzdiff pe-input-919.bz2 pe-input-920.bz2
1c1
< <cib crm_feature_set="3.4.1" validate-with="pacemaker-3.4" epoch="231" num_updates="3599" admin_epoch="0" cib-last-written="Mon Feb 15 03:13:10 2021" update-origin="controller-2" update-client="cibadmin" update-user="root" have-quorum="1" dc-uuid="3" execution-date="1613359992">
---
> <cib crm_feature_set="3.4.1" validate-with="pacemaker-3.4" epoch="231" num_updates="3600" admin_epoch="0" cib-last-written="Mon Feb 15 03:13:10 2021" update-origin="controller-2" update-client="cibadmin" update-user="root" have-quorum="1" dc-uuid="3" execution-date="1613359992">
789c789
< <nvpair id="status-2-fail-count-ovndb_servers.cancel_30000" name="fail-count-ovndb_servers#cancel_30000" value="8878"/>
---
> <nvpair id="status-2-fail-count-ovndb_servers.cancel_30000" name="fail-count-ovndb_servers#cancel_30000" value="8879"/>

[root@controller-2 pengine]# date -d @1613359992
Mon Feb 15 03:33:12 UTC 2021
```

Version-Release number of selected component (if applicable):
pacemaker-2.0.5-6.el8.x86_64

How reproducible:
It seems we're able to reproduce this moderately often in CI. As usual, it is not a 100% thing, but we've observed it a few times over the last week.

Steps to Reproduce:
1. Perform a rolling restart of a running OpenStack pacemaker cluster.

Actual results:
One node (possibly always the DC? we're unsure) gets stuck in a loop for 20 minutes while stopping, and eventually fails and gets fenced.

Expected results:
All nodes should be able to stop and restart.

Additional info:
Sosreports + tar files of /var/lib/pacemaker from all nodes are here: http://file.rdu.redhat.com/~mbaldess/pacemaker-minor-update-issue/
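The fail-count attribute seen incrementing in the bzdiff output above can also be inspected on a live cluster. A minimal sketch, not taken from this report, assuming the resource and node names used here:

```
# Query the per-node fail count that the CIB diffs show climbing
crm_failcount --query --resource ovndb_servers --node controller-2

# Equivalent view through pcs
pcs resource failcount show ovndb_servers
```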