Bug 1979010 - Minor update to 13z16 failed with "Unable to find constraint"
Summary: Minor update to 13z16 failed with "Unable to find constraint"
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: documentation
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Alex McLeod
QA Contact: Sofer Athlan-Guyot
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-04 08:57 UTC by Ravi Singh
Modified: 2021-11-29 18:53 UTC (History)
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-11-29 18:53:23 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-5759 0 None None None 2021-11-15 13:09:11 UTC
Red Hat Issue Tracker UPG-3145 0 None None None 2021-09-07 07:40:16 UTC
Red Hat Knowledge Base (Solution) 6175042 0 None None None 2021-09-09 13:48:54 UTC

Description Ravi Singh 2021-07-04 08:57:37 UTC
Description of problem:
Minor update of controllers to 13z16 fails with

~~~
2021-07-03 18:17:02,560 p=14463 u=mistral |  fatal: [X-X-X]: FAILED! => {"changed": true, "cmd": ["pcs", "constraint", "remove", "cli-ban-rabbitmq-bundle-on-X-X-X"], "delta": "0:00:00.432549", "end": "2021-07-03 18:17:02.513329", "msg": "non-zero return code", "rc": 1, "start": "2021-07-03 18:17:02.080780", "stderr": "Error: Unable to find constraint - 'cli-ban-rabbitmq-bundle-on-X-X-X'", "stderr_lines": ["Error: Unable to find constraint - 'cli-ban-rabbitmq-bundle-on-X-X-X'"], "stdout": "", "stdout_lines": []}
~~~

Version-Release number of selected component (if applicable):
OSP13z16

How reproducible:
100%

Steps to Reproduce:
1. Run the minor update of the controllers; it fails at some point.
2. Fix the issue and re-run the update.
3. The update now fails with the error shown in the description.

Actual results:
Minor update fails.

Expected results:
The minor update should complete successfully.

Additional info:

Comment 3 Sofer Athlan-Guyot 2021-07-05 14:44:26 UTC
Hi,

So, in order to know exactly where you stand, I'd need the output of the following:

# on any node controlled by pacemaker:
pcs status
pcs constraint ref rabbitmq-bundle | awk '/cli-ban/{print $0}'

# On every node where rabbitmq runs, save the output of the following
# docker run command.
image_name=$(sudo cibadmin -Q | xmllint --xpath 'string(//bundle[@id="rabbitmq-bundle"]/docker/@image)' -)
docker run --rm ${image_name} rpm -qi rabbitmq-server

Also, a simple status for every controller that runs rabbitmq: was it fully
updated or not?

In the TL;DR section below I describe steps that can unblock this update,
but with that output I should be able to guide you more precisely.

> 3. Later on we realised that, as per the minor update doc, the customer should update the bootstrap node first and then the rest.
>    The customer again tried to update only the ct0 node, but it failed again with the same error, this time for the cli-ban constraint on node ct0.

This should only be needed if you have Octavia installed and are starting
the update from z13 or earlier; I guess that's your current setup.

TL;DR: The complex procedure for upgrading rabbitmq is fragile when the
update is interrupted at the wrong time and/or the order in which the
controllers are updated changes.

The code was added because rabbitmq is upgraded from 3.6 to 3.7. This
upgrade cannot be a rolling upgrade, as rabbitmq 3.7 cannot talk to
rabbitmq 3.6, so we have to make sure the other rabbitmq instances are not
running when we update the first node.
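
For reference, a "ban" here is just a pcs location constraint. A minimal
illustration of the mechanics (the node name ctl-1 is only an example; this
is not the update code itself):

# banning rabbitmq on a node creates a constraint named
# cli-ban-rabbitmq-bundle-on-<node>, which is what the update later removes:
sudo pcs resource ban rabbitmq-bundle ctl-1
sudo pcs constraint ref rabbitmq-bundle | awk '/cli-ban/{print $0}'
# unbanning ("clearing") removes that constraint again:
sudo pcs resource clear rabbitmq-bundle ctl-1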

The ideal sequence is:

 - update controller
   1 on every node we check whether:
     a. rabbitmq is 3.6 (so we need to upgrade)
     b. no ban exists (so I'm the first node and I need to ban the other ones)
   2 when I'm the first one:
     a. I ban the other two
     b. rabbitmq gets updated to 3.7 and can start, as the other two rabbitmq instances are not running (it would otherwise fail to start because it cannot talk to 3.6)
   3 when I'm not the first one (bans already exist):
     a. I update rabbitmq
     b. I unban myself

For 3.a we only check whether a ban exists, not whether a ban exists on the current node.
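
To make that concrete, here is a minimal sketch (not the actual update code)
of the difference between the two checks; it assumes the pacemaker node name
matches the short hostname:

# check actually done: does any rabbitmq ban exist at all?
sudo pcs constraint ref rabbitmq-bundle | grep -q cli-ban && echo "some ban exists"
# stricter check: does a ban exist for this very node?
sudo pcs constraint ref rabbitmq-bundle | grep -q "cli-ban-rabbitmq-bundle-on-$(hostname -s)" && echo "ban on this node"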

As you've shown there are ways for this to fail if the update was
interrupted.

1. One or more nodes are fully updated and we have a leftover ban:

  Given this sequence and the code, I can see why it fails:

   - we start an update; it fails for some reason, but bans were created
     and one was removed while another persisted

   - we re-run the update; it sees that some ban exists, so it thinks
     it's in 3.b and tries to unban itself, while the ban is actually for
     another node.

  So, given that one node was successfully updated to 13.z16, you can safely
  add the missing ban on the other node(s) so that the update procedure
  succeeds.
  
  Let's say we have ctl-0, ctl-1, ctl-2, with ctl-0 completely updated
  and running rabbitmq 3.7.
  
     ssh ctl-0:
     
      # here's one way to check the rabbitmq version:
      image_name=$(sudo cibadmin -Q | xmllint --xpath 'string(//bundle[@id="rabbitmq-bundle"]/docker/@image)' -)
      docker run --rm ${image_name} rpm -qi rabbitmq-server | awk 'match($0, /^Version[^:]*: [0-9]\.([0-9]+)\.[0-9]+/, line){print line[1]}' 
  
  If you get "7" as output, then rabbitmq on that node is already updated.
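
  An equivalent check that does not rely on gawk's three-argument match()
  (assuming the rabbitmq-server package name used above):

      docker run --rm ${image_name} rpm -q --queryformat '%{VERSION}\n' rabbitmq-server
      # expect 3.7.x on an updated node, 3.6.x otherwise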
  
      # now you make sure that the other two nodes have a ban:
      pcs resource ban rabbitmq-bundle ctl-1
      pcs resource ban rabbitmq-bundle ctl-2
  
  and re-run the update for those two nodes.
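
  For the record, "re-run the update" here means the normal OSP13 minor
  update run from the undercloud, roughly like the following (the --nodes
  option and the node names are assumptions; use the exact syntax from the
  OSP13 update guide for your deployment):

      source ~/stackrc
      openstack overcloud update run --nodes ctl-1
      openstack overcloud update run --nodes ctl-2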

2. No rabbitmq was successfully updated yet:

  If no rabbitmq has been updated yet but some bans already exist as
  leftovers of a previous run, you can safely remove them and re-trigger
  the update.
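
  Before removing the bans, it is worth double-checking that this case
  really applies, i.e. that no controller is already running the 3.7
  image. The same version check as in case 1 can be used on each node:

        image_name=$(sudo cibadmin -Q | xmllint --xpath 'string(//bundle[@id="rabbitmq-bundle"]/docker/@image)' -)
        docker run --rm ${image_name} rpm -q rabbitmq-server   # expect a 3.6.x version on every node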
  
        # from any pacemaker controlled node, this command will give you the bans:
        pcs constraint ref rabbitmq-bundle | awk '/cli-ban/{print $0}'
        # then you remove them using:
        pcs constraint remove cli-ban-rabbitmq-bundle-on-<name of the controller>
        # then you wait for the cluster to settle:
        pcs status | grep rabbitmq
  
  All of the above can be run from any pacemaker-controlled node. The
  rabbitmq resource should then re-form correctly (as all rabbitmq
  instances are 3.6) and you can restart the update.

Comment 7 Sofer Athlan-Guyot 2021-07-06 08:03:29 UTC
Hi,

A simple procedure to restart the update after a failure unrelated to the rabbitmq upgrade is:

 - clean up any bans on the overcloud:

ssh <any node under pacemaker control>
for i in $(sudo pcs constraint ref rabbitmq-bundle | awk '/cli-ban/{print $0}'); do sudo pcs constraint remove $i ; done

 - restart the update in any way you want and on any node
 - clean up any leftover bans after the controller update

ssh <any node under pacemaker control>
for i in $(sudo pcs constraint ref rabbitmq-bundle | awk '/cli-ban/{print $0}'); do sudo pcs constraint remove $i ; done
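
A quick way to confirm that the cleanup worked, from the same node:

sudo pcs constraint ref rabbitmq-bundle | awk '/cli-ban/{print $0}'   # should print nothing
sudo pcs status | grep rabbitmq                                       # all rabbitmq instances should be Started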

To test this, I triggered a failure at step 5 of the update process, i.e. after the first rabbitmq was updated and the bans were created.

Then I re-triggered the update and it went to completion without issue.

During that update we have:

| step | action                | ctl-0                        | ctl-1                                                        | ctl-2                                                        |
|------+-----------------------+------------------------------+--------------------------------------------------------------+--------------------------------------------------------------|
|    1 | random failure        | rabbitmq updated and running | banned                                                       | banned                                                       |
|    2 | ban cleanup           | rabbitmq updated and running | rabbitmq not updated, trying to start and eventually failing | rabbitmq not updated, trying to start and eventually failing |
|    3 | restart of the update | rabbitmq updated and running | failed                                                       | failed                                                       |
|    4 | ctl-0 re-updated      | rabbitmq updated and running | failed                                                       | failed                                                       |
|    5 | ctl-1 updated         | gets banned at step 5        | upgraded and running                                         | banned                                                       |
|    6 | ctl-2 updated         | still banned                 | upgraded and running                                         | upgraded and running                                         |
|    7 | final cleanup         | upgraded and running         | upgraded and running                                         | upgraded and running                                         |

This procedure should work whatever the starting state of the cluster. If no bans exist on the overcloud, nothing needs to be done.

Comment 18 Alex McLeod 2021-11-24 11:13:24 UTC
Sure thing Jesse, can I get a +1 or comments? Am I right in thinking this applies only to OSP13 updates?

Patch: https://gitlab.cee.redhat.com/rhci-documentation/docs-Red_Hat_Enterprise_Linux_OpenStack_Platform/-/merge_requests/8241

Comment 19 Alex McLeod 2021-11-29 18:53:23 UTC
Merged + published for 13.

