Bug 2003176 - [OSP16.2] ovn-dbs pacemaker update_tasks can race with pacemaker update_tasks
Summary: [OSP16.2] ovn-dbs pacemaker update_tasks can race with pacemaker update_tasks
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.2 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z2
Target Release: 16.2 (Train on RHEL 8.4)
Assignee: Sofer Athlan-Guyot
QA Contact: Jason Grosso
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-09-10 14:14 UTC by Michele Baldessari
Modified: 2022-03-23 22:30 UTC

Fixed In Version: openstack-tripleo-heat-templates-11.6.1-2.20210917074218.245da68.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-23 22:29:32 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Launchpad | 1943254 | 0 | None | None | None | 2021-09-10 14:15:37 UTC
OpenStack gerrit | 809273 | 0 | None | MERGED | Guarantee that ovn-dbs-pcmk update_tasks are run when the cluster is up | 2022-02-17 15:43:41 UTC
Red Hat Issue Tracker | OSP-9696 | 0 | None | None | None | 2021-11-15 12:51:09 UTC
Red Hat Issue Tracker | UPG-3356 | 0 | None | None | None | 2021-09-18 15:06:04 UTC
Red Hat Product Errata | RHSA-2022:0995 | 0 | None | None | None | 2022-03-23 22:30:03 UTC

Description Michele Baldessari 2021-09-10 14:14:40 UTC
Description of problem:
As noticed by Ladislav:

With particular role compositions, the code that joins the update_tasks might
order things differently than on a typical 3ctrl control plane, so the ovn-dbs
tasks at step1 (which require the cluster to be up) run after the pacemaker
task at step1 that stops the cluster.

So we can observe something like the following:
2021-09-10 10:05:13.370339 | 001c2891-506d-f833-ff5a-000000000954 | TASK | Change the bundle operation timeout
2021-09-10 10:05:14.136798 | 001c2891-506d-f833-ff5a-000000000954 | CHANGED | Change the bundle operation timeout | ovn-db-01
2021-09-10 10:05:14.137982 | 001c2891-506d-f833-ff5a-000000000954 | TIMING | Change the bundle operation timeout | ovn-db-01 | 0:00:54.808754 | 0.77s
2021-09-10 10:05:14.146853 | 001c2891-506d-f833-ff5a-000000000956 | TASK | Acquire the cluster shutdown lock to stop pacemaker cluster
2021-09-10 10:05:14.508085 | 001c2891-506d-f833-ff5a-000000000956 | CHANGED | Acquire the cluster shutdown lock to stop pacemaker cluster | ovn-db-01
2021-09-10 10:05:14.509257 | 001c2891-506d-f833-ff5a-000000000956 | TIMING | Acquire the cluster shutdown lock to stop pacemaker cluster | ovn-db-01 | 0:00:55.180032 | 0.36s
2021-09-10 10:05:14.518668 | 001c2891-506d-f833-ff5a-000000000957 | TASK | Stop pacemaker cluster
2021-09-10 10:05:18.559627 | 001c2891-506d-f833-ff5a-000000000957 | CHANGED | Stop pacemaker cluster | ovn-db-01
2021-09-10 10:05:18.560561 | 001c2891-506d-f833-ff5a-000000000957 | TIMING | Stop pacemaker cluster | ovn-db-01 | 0:00:59.231336 | 4.04s
2021-09-10 10:05:18.569161 | 001c2891-506d-f833-ff5a-000000000958 | TASK | Start pacemaker cluster
2021-09-10 10:05:18.627924 | 001c2891-506d-f833-ff5a-000000000958 | SKIPPED | Start pacemaker cluster | ovn-db-01
2021-09-10 10:05:18.628678 | 001c2891-506d-f833-ff5a-000000000958 | TIMING | Start pacemaker cluster | ovn-db-01 | 0:00:59.299453 | 0.06s
2021-09-10 10:05:18.637292 | 001c2891-506d-f833-ff5a-000000000959 | TASK | Release the cluster shutdown lock
2021-09-10 10:05:18.694945 | 001c2891-506d-f833-ff5a-000000000959 | SKIPPED | Release the cluster shutdown lock | ovn-db-01
2021-09-10 10:05:18.695717 | 001c2891-506d-f833-ff5a-000000000959 | TIMING | Release the cluster shutdown lock | ovn-db-01 | 0:00:59.366493 | 0.06s
2021-09-10 10:05:18.704368 | 001c2891-506d-f833-ff5a-00000000095a | TASK | Clear ovndb cluster pacemaker error
2021-09-10 10:05:19.368816 | 001c2891-506d-f833-ff5a-00000000095a | FATAL | Clear ovndb cluster pacemaker error | ovn-db-01 | error={"changed": true, "cmd": "pcs resource cleanup ovn-dbs-bundle", "delta": "0:00:00.399084", "end": "2021-09-10 10:05:20.044985", "msg": "non-zero return code", "rc": 1, "start": "2021-09-10 10:05:19.645901", "stderr": "Error: Unable to forget failed operations of resource: ovn-dbs-bundle\nError connecting to the CIB manager: Transport endpoint is not connected\nError performing operation: Transport endpoint is not connected", "stderr_lines": ["Error: Unable to forget failed operations of resource: ovn-dbs-bundle", "Error connecting to the CIB manager: Transport endpoint is not connected", "Error performing operation: Transport endpoint is not connected"], "stdout": "", "stdout_lines": []}

We cannot call pcs resource cleanup at step1; we must call it at step0 so that
the cluster is guaranteed to be up, no matter how heat/ansible decide to order
the update_tasks.
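
The ordering problem can be illustrated with a toy model. This is only a sketch, not the actual tripleo-heat-templates code: the task names, the `run` and `outcomes` helpers, and the step at which the cluster is started again (4 here) are assumptions made for illustration. The model runs tasks step by step and, within a step, in the order in which the services were joined.

```python
from itertools import permutations

# Hypothetical task lists: (step, task_name). The restart step (4) is an assumption.
PACEMAKER = [(1, "stop_cluster"), (4, "start_cluster")]
OVN_DBS_STEP1 = [(1, "pcs_resource_cleanup")]  # placement that can fail
OVN_DBS_STEP0 = [(0, "pcs_resource_cleanup")]  # placement proposed in this report


def run(services, max_step=5):
    """Run tasks step by step; within a step, keep the service join order."""
    cluster_up = True
    for step in range(max_step + 1):
        for tasks in services:
            for task_step, name in tasks:
                if task_step != step:
                    continue
                if name == "stop_cluster":
                    cluster_up = False
                elif name == "start_cluster":
                    cluster_up = True
                elif name == "pcs_resource_cleanup" and not cluster_up:
                    return f"FATAL at step {step}: cluster is down"
    return "ok"


def outcomes(ovn_dbs_tasks):
    """Return the result of every possible join order of the two services."""
    named = {"pacemaker": PACEMAKER, "ovn-dbs": ovn_dbs_tasks}
    return {
        " -> ".join(order): run([named[n] for n in order])
        for order in permutations(named)
    }


if __name__ == "__main__":
    print("cleanup at step1:", outcomes(OVN_DBS_STEP1))
    print("cleanup at step0:", outcomes(OVN_DBS_STEP0))
```

In this model the step1 placement fails only when the pacemaker tasks are joined before the ovn-dbs tasks, whereas the step0 placement succeeds for either join order.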

Comment 13 errata-xmlrpc 2022-03-23 22:29:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenStack Platform 16.2 (openstack-tripleo-heat-templates) security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0995

