Bug 1976927 - [update] OSP13 update may appear to fail while it's eventually successful.
Summary: [update] OSP13 update may appear to fail while it's eventually successful.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-tripleoclient
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: async
Target Release: 13.0 (Queens)
Assignee: Sofer Athlan-Guyot
QA Contact: Jason Grosso
Docs Contact: Vlada Grosu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-28 15:03 UTC by Sofer Athlan-Guyot
Modified: 2022-08-02 12:36 UTC
CC List: 12 users

Fixed In Version: python-tripleoclient-9.3.1-11.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-23 07:48:41 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Issue Tracker OSP-5502 (last updated 2022-08-02 12:36:26 UTC)
Red Hat Issue Tracker UPG-3116 (last updated 2021-08-12 08:14:05 UTC)
Red Hat Knowledge Base (Solution) 6187721 (last updated 2021-10-07 11:13:06 UTC)
Red Hat Product Errata RHBA-2021:3650 (last updated 2021-09-23 07:48:43 UTC)

Description Sofer Athlan-Guyot 2021-06-28 15:03:43 UTC
Description of problem:

Hi,

When doing an update run of a role, it may happen that the command returns a failure even though the playbook actually ran fine to completion.

To check, inspect /var/log/mistral/package_update.log after the failure.

You should see no failure there, and the output should keep progressing.
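For illustration, here is a minimal Python sketch of that check; the log path comes from this report, while the helper name and the 60-second freshness window are made up for the example:

  # Illustrative sketch only: confirm the mistral-driven ansible run is
  # still writing to package_update.log after the client exited non-zero.
  import os
  import time

  LOG = "/var/log/mistral/package_update.log"

  def still_progressing(window_seconds=60):
      """True if the log was modified within the last window_seconds."""
      age = time.time() - os.path.getmtime(LOG)
      return age < window_seconds

  if __name__ == "__main__":
      if still_progressing():
          print("output still progressing; the reported failure is likely spurious")
      else:
          print("no recent output; treat the failure as real")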

This can happen on any version of OSP13.

The root cause is log rotation for zaqar, which triggers a restart of the service on the undercloud and breaks the connection between mistral and zaqar. This makes python-tripleoclient fail while the underlying ansible run triggered by mistral is still working fine.
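As a rough diagnostic sketch (the /var/log/zaqar path and the .log.1 rotation naming are assumptions about the undercloud layout, not taken from this report), one could correlate the failure time with the last zaqar log rotation:

  # Illustrative only: find when zaqar's logs were last rotated, to
  # compare against the timestamp of the client failure.
  import datetime
  import glob
  import os

  def last_zaqar_rotation(pattern="/var/log/zaqar/*.log.1"):
      """Return the mtime of the newest rotated zaqar log, or None."""
      rotated = glob.glob(pattern)
      if not rotated:
          return None
      return datetime.datetime.fromtimestamp(
          os.path.getmtime(max(rotated, key=os.path.getmtime)))

  print("last zaqar log rotation:", last_zaqar_rotation())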

I'm logging this mostly for reference, and in case a new version of OSP13 ever happens.

Comment 3 Vlada Grosu 2021-07-15 11:59:54 UTC
Hi folks,

I am also going to add a note about this potential issue in the "Known issues that might block an update" section of the OSP 13 "Keeping Red Hat OpenStack Platform Updated" doc [1].

How likely is this issue to happen during an update? Does it happen in any particular situations (specific configurations or deployments)?


Many thanks,
Vlada


[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/keeping_red_hat_openstack_platform_updated/index#known_issues_that_might_block_an_update

Comment 7 Vlada Grosu 2021-07-19 13:16:18 UTC
Here is the published update:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/keeping_red_hat_openstack_platform_updated/index#known_issues_that_might_block_an_update

Many thanks,
Vlada

Comment 13 Sofer Athlan-Guyot 2021-08-23 15:56:03 UTC
Hi,

So we have tried to reproduce this with 10 computes, but it didn't happen. After a round of very slow gdb sessions, we found a likely contender for this issue.

https://opendev.org/openstack/python-tripleoclient/src/branch/stable/queens/tripleoclient/plugin.py#L108-L109

This call to clean up the queue is "unprotected": if the queue is already destroyed, the code fails and makes the ooo client return 1 even though the ansible run was successful.
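To make that concrete, here is a minimal sketch of the failure mode and a defensive variant. This is not the actual plugin.py code, and the client.queue(...).delete() call is only modeled loosely on the zaqar client API:

  # Sketch only, assuming the client deletes its zaqar queue at the end
  # of a run and any exception bubbles up as a non-zero exit code.

  def cleanup_unprotected(client, queue_name):
      # If zaqar restarted mid-run (e.g. after log rotation), this can
      # raise a websocket error even though the update itself succeeded.
      client.queue(queue_name).delete()

  def cleanup_protected(client, queue_name):
      # Defensive variant: cleanup is best-effort, so a failure here
      # must not turn a successful update into a reported failure.
      try:
          client.queue(queue_name).delete()
      except Exception:
          # Queue already gone or connection dropped; either way the
          # update result is unaffected, so swallow the error.
          pass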

For an unrelated reason, this code was removed in OSP14 in this review: https://review.opendev.org/c/openstack/python-tripleoclient/+/555193, which shows that it is safe to skip deleting the queue.

This would match the error we see on your undercloud:

  16 queue_delete and 1 failed with "WebSocket connection closed: connection was closed uncleanly (peer dropped the TCP connection without previous WebSocket closing handshake)"

That one failure could be the false error.

So the current advice would be to apply the patch https://review.opendev.org/c/openstack/python-tripleoclient/+/555193 (it should work on OSP13) and verify that the error no longer occurs.

Let us know if you need more guidance to test this patch.

Comment 27 errata-xmlrpc 2021-09-23 07:48:41 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 13 bug fix and enhancement advisory), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3650

