Bug 1835717
| Summary: | pacemaker never promotes a bundle until another transition unblocks it | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Michele Baldessari <michele> | |
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> | |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> | |
| Severity: | high | Docs Contact: | ||
| Priority: | high | |||
| Version: | 8.2 | CC: | cfeist, cluster-maint, kgaillot, lmiccini, msmazova, pkomarov | |
| Target Milestone: | rc | Keywords: | ZStream | |
| Target Release: | 8.4 | Flags: | pm-rhel: mirror+ | |
| Hardware: | All | |||
| OS: | All | |||
| Whiteboard: | ||||
| Fixed In Version: | pacemaker-2.0.5-1.el8 | Doc Type: | Bug Fix | |
| Doc Text: |
Cause: When selecting promotable clone instances for promotion on guest nodes, Pacemaker considered whether the guest node itself could run resources, but not whether the guest resource creating it was runnable.
Consequence: An unrunnable guest could be chosen for promotion, unnecessarily leaving some instances unpromoted until the next natural transition.
Fix: Pacemaker now considers whether a guest node's guest resource is runnable when selecting nodes for promotion.
Result: All instances that can be promoted will be.
|
Story Points: | --- | |
| Clone Of: | ||||
| : | 1935240 1935241 (view as bug list) | Environment: | ||
| Last Closed: | 2021-05-18 15:26:41 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | 1885645 | |||
| Bug Blocks: | 1935240, 1935241 | |||
This is definitely a pacemaker scheduler bug.
Fixed upstream as of commit 8c9ee257.
A workaround until the fix is available would be waiting a few seconds after the ban, then changing any configuration value (such as a dummy node attribute) to trigger a new transition. For testing purposes I use a bash alias that does:
attrd_updater -N "$(hostname)" -n trigger-transition -v "$(date)"
(My test nodes are named the same as their hostname; you can use any node name.)
Verified:
(undercloud) [stack@undercloud-0 ~]$ ansible controller -b -mshell -a'pcs cluster status'|grep version
[WARNING]: Found both group and host with same name: undercloud
* Current DC: controller-1 (version 2.0.5-2.el8-31aa4f5515) - partition with quorum
* Current DC: controller-1 (version 2.0.5-2.el8-31aa4f5515) - partition with quorum
* Current DC: controller-1 (version 2.0.5-2.el8-31aa4f5515) - partition with quorum
* Container bundle set: ovn-dbs-bundle [cluster.common.tag/rhosp16-openstack-ovn-northd:pcmklatest]:
* ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master controller-2
* ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Slave controller-0
* ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave controller-1
[root@controller-2 ~]# pcs resource ban ovn-dbs-bundle controller-2
* Container bundle set: ovn-dbs-bundle [cluster.common.tag/rhosp16-openstack-ovn-northd:pcmklatest]:
* ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Stopped controller-2
* ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Slave controller-0
[..]
* Container bundle set: ovn-dbs-bundle [cluster.common.tag/rhosp16-openstack-ovn-northd:pcmklatest]:
* ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Stopped
* ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Master controller-0
* ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave controller-1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHEA-2021:1782
Description of problem:
So this morning we were testing a few things and we hit this rather odd behavior. We start off from a good condition:
A) Starting point
* Container bundle set: redis-bundle [cluster.common.tag/rhosp16-openstack-redis:pcmklatest]:
* redis-bundle-0 (ocf::heartbeat:redis): Master controller-0
* redis-bundle-1 (ocf::heartbeat:redis): Slave controller-1
* redis-bundle-2 (ocf::heartbeat:redis): Slave controller-2
* Container bundle set: ovn-dbs-bundle [cluster.common.tag/rhosp16-openstack-ovn-northd:pcmklatest]:
* ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master controller-0
* ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Slave controller-1
* ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave controller-2
B) Now we just want to move the ovn-dbs-bundle master away from controller-0, and we do so by banning the whole bundle (pcs resource ban ovn-dbs-bundle controller-0). But what happens is the following:
* Container bundle set: redis-bundle [cluster.common.tag/rhosp16-openstack-redis:pcmklatest]:
* redis-bundle-0 (ocf::heartbeat:redis): Master controller-0
* redis-bundle-1 (ocf::heartbeat:redis): Slave controller-1
* redis-bundle-2 (ocf::heartbeat:redis): Slave controller-2
* Container bundle set: ovn-dbs-bundle [cluster.common.tag/rhosp16-openstack-ovn-northd:pcmklatest]:
* ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Stopped
* ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Slave controller-1
* ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave controller-2
Somehow the promotion just never takes place. The interesting thing is that pacemaker clearly thinks that the promotion should happen:
* Start ovn-dbs-bundle-0 ( controller-0 ) due to unrunnable ovn-dbs-bundle-podman-0 start (blocked)
* Start ovndb_servers:0 ( ovn-dbs-bundle-0 ) due to unrunnable ovn-dbs-bundle-podman-0 start (blocked)
* Promote ovndb_servers:1 ( Slave -> Master ovn-dbs-bundle-1 )
The problem is that the promotion never seems to be triggered at all.
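The stuck state in (B) shows up as a Promote action that stays listed in the transition summary. A minimal sketch of a filter that makes such a line easy to spot is given here; the function name pending_promotions is hypothetical and not part of pacemaker or pcs, and it only assumes the summary line layout quoted in this report.

```shell
# Hypothetical helper (not part of pacemaker): read a transition summary on
# stdin and print only the lines that describe a Promote action.
pending_promotions() {
    # Match lines such as " * Promote ovndb_servers:1 ( Slave -> Master ... )".
    # "|| true" keeps the exit status at 0 when no Promote line is present.
    grep -E '^[[:space:]]*\*[[:space:]]+Promote[[:space:]]' || true
}
```

On a cluster node the summary could be piped in from crm_simulate, for example `crm_simulate -L -S | pending_promotions`; the exact flags used by the reporter are not given in this report, so treat that invocation as an assumption. If the same Promote line is still printed well after the ban, the cluster is likely in the state described in (B).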
Now the reason that we suspect that this is a pacemaker bug is the following:
C) When the cluster is in state (B), triggering another unrelated transition somehow unsticks the seemingly blocked ovndb_servers promotion. So when we run 'pcs resource ban redis-bundle controller-0' we suddenly get what we wanted at (B):
* Container bundle set: redis-bundle [cluster.common.tag/rhosp16-openstack-redis:pcmklatest]:
* redis-bundle-0 (ocf::heartbeat:redis): Stopped
* redis-bundle-1 (ocf::heartbeat:redis): Master controller-1
* redis-bundle-2 (ocf::heartbeat:redis): Slave controller-2
* Container bundle set: ovn-dbs-bundle [cluster.common.tag/rhosp16-openstack-ovn-northd:pcmklatest]:
* ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Stopped
* ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Master controller-1
* ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave controller-2
And now crm_simulate only shows the expected blocked bundles (due to the bans):
Transition Summary:
* Start redis-bundle-0 ( controller-0 ) due to unrunnable redis-bundle-podman-0 start (blocked)
* Start redis:0 ( redis-bundle-0 ) due to unrunnable redis-bundle-podman-0 start (blocked)
* Start ovn-dbs-bundle-0 ( controller-0 ) due to unrunnable ovn-dbs-bundle-podman-0 start (blocked)
* Start ovndb_servers:0 ( ovn-dbs-bundle-0 ) due to unrunnable ovn-dbs-bundle-podman-0 start (blocked)
I.e. it is as if in (B) it knows it wants to promote ovndb_servers but somehow never ends up doing it (we also waited for a longer time, to rule out some other issue).
Version-Release number of selected component (if applicable):
pacemaker-2.0.3-5.el8.x86_64
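Rather than banning a second bundle as in (C), the unrelated transition can be triggered with the dummy node attribute update suggested in the comments. The following is a minimal sketch of that workaround as a shell function. The function name trigger_transition and the DRY_RUN variable are hypothetical and exist only for illustration; the attrd_updater invocation itself is the one quoted in this report.

```shell
# Hypothetical wrapper around the workaround quoted in this report: update a
# throwaway node attribute so that pacemaker computes a new transition.
# Usage: trigger_transition [node-name] [value]
trigger_transition() {
    node="${1:-$(hostname)}"   # defaults to this host's name, as in the report
    value="${2:-$(date)}"      # any value that differs from the previous one
    if [ -n "${DRY_RUN:-}" ]; then
        # DRY_RUN is an illustrative assumption: print the command instead of
        # running it, so the helper can be tried on a machine without pacemaker.
        printf 'attrd_updater -N %s -n trigger-transition -v %s\n' "$node" "$value"
    else
        attrd_updater -N "$node" -n trigger-transition -v "$value"
    fi
}
```

As the workaround notes, wait a few seconds after the ban before calling it, for example `sleep 5; trigger_transition controller-1`. The five second delay is an arbitrary choice for illustration, not a value from this report.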