Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2003176

Summary: [OSP16.2] ovn-dbs pacemaker update_tasks can race with pacemaker update_tasks
Product: Red Hat OpenStack Reporter: Michele Baldessari <michele>
Component: openstack-tripleo-heat-templatesAssignee: Sofer Athlan-Guyot <sathlang>
Status: CLOSED ERRATA QA Contact: Jason Grosso <jgrosso>
Severity: high Docs Contact:
Priority: high    
Version: 16.2 (Train)CC: jfrancoa, jgrosso, jpretori, lbezdick, ljozsa, mburns
Target Milestone: z2Keywords: Triaged
Target Release: 16.2 (Train on RHEL 8.4)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-11.6.1-2.20210917074218.245da68.el8ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-03-23 22:29:32 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Michele Baldessari 2021-09-10 14:14:40 UTC
Description of problem:
As noticed by Ladislav:

On particular role compositions, the code joining the update_tasks might order
things differently then on a typical 3ctrl control plane and the ovn-dbs tasks
at step1 (which require the cluster to be up) will happen after the pacemaker
task at step1 which stops the cluster.

So we can observe something like the following:
2021-09-10 10:05:13.370339 | 001c2891-506d-f833-ff5a-000000000954 | TASK | Change the bundle operation timeout
2021-09-10 10:05:14.136798 | 001c2891-506d-f833-ff5a-000000000954 | CHANGED | Change the bundle operation timeout | ovn-db-01
2021-09-10 10:05:14.137982 | 001c2891-506d-f833-ff5a-000000000954 | TIMING | Change the bundle operation timeout | ovn-db-01 | 0:00:54.808754 | 0.77s
2021-09-10 10:05:14.146853 | 001c2891-506d-f833-ff5a-000000000956 | TASK | Acquire the cluster shutdown lock to stop pacemaker cluster
2021-09-10 10:05:14.508085 | 001c2891-506d-f833-ff5a-000000000956 | CHANGED | Acquire the cluster shutdown lock to stop pacemaker cluster | ovn-db-01
2021-09-10 10:05:14.509257 | 001c2891-506d-f833-ff5a-000000000956 | TIMING | Acquire the cluster shutdown lock to stop pacemaker cluster | ovn-db-01 | 0:00:55.180032 | 0.36s
2021-09-10 10:05:14.518668 | 001c2891-506d-f833-ff5a-000000000957 | TASK | Stop pacemaker cluster
2021-09-10 10:05:18.559627 | 001c2891-506d-f833-ff5a-000000000957 | CHANGED | Stop pacemaker cluster | ovn-db-01

2021-09-10 10:05:18.560561 | 001c2891-506d-f833-ff5a-000000000957 | TIMING | Stop pacemaker cluster | ovn-db-01 | 0:00:59.231336 | 4.04s
2021-09-10 10:05:18.569161 | 001c2891-506d-f833-ff5a-000000000958 | TASK | Start pacemaker cluster
2021-09-10 10:05:18.627924 | 001c2891-506d-f833-ff5a-000000000958 | SKIPPED | Start pacemaker cluster | ovn-db-01
2021-09-10 10:05:18.628678 | 001c2891-506d-f833-ff5a-000000000958 | TIMING | Start pacemaker cluster | ovn-db-01 | 0:00:59.299453 | 0.06s
2021-09-10 10:05:18.637292 | 001c2891-506d-f833-ff5a-000000000959 | TASK | Release the cluster shutdown lock
2021-09-10 10:05:18.694945 | 001c2891-506d-f833-ff5a-000000000959 | SKIPPED | Release the cluster shutdown lock | ovn-db-01
2021-09-10 10:05:18.695717 | 001c2891-506d-f833-ff5a-000000000959 | TIMING | Release the cluster shutdown lock | ovn-db-01 | 0:00:59.366493 | 0.06s
2021-09-10 10:05:18.704368 | 001c2891-506d-f833-ff5a-00000000095a | TASK | Clear ovndb cluster pacemaker error
2021-09-10 10:05:19.368816 | 001c2891-506d-f833-ff5a-00000000095a | FATAL | Clear ovndb cluster pacemaker error | ovn-db-01 | error={"changed": true, "cmd": "pcs resource cleanup ovn-dbs-bundle", "delta": "0:00:00.399084", "end": "2021-09-10 10:05:20
.044985", "msg": "non-zero return code", "rc": 1, "start": "2021-09-10 10:05:19.645901", "stderr": "Error: Unable to forget failed operations of resource: ovn-dbs-bundle\nError connecting to the CIB manager: Transport endpoint is not connected\nError perf
orming operation: Transport endpoint is not connected", "stderr_lines": ["Error: Unable to forget failed operations of resource: ovn-dbs-bundle", "Error connecting to the CIB manager: Transport endpoint is not connected", "Error performing operation: Tran
sport endpoint is not connected"], "stdout": "", "stdout_lines": []}

We cannot call pcs resource cleanup at step1, we must call it at step0 so we're
guaranteed that the cluster is up, no matter how heat/ansible decide to order
the update_tasks.

Comment 13 errata-xmlrpc 2022-03-23 22:29:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenStack Platform 16.2 (openstack-tripleo-heat-templates) security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0995