Description of problem:

Hi, when doing an update run of a role, it may happen that the command returns a failure even though the playbook ran fine to completion. To verify, check /var/log/mistral/package_update.log after the failure: you should see no failure there and the output still continuing. This can happen on any version of OSP 13.

The root cause is log rotation of zaqar, which triggers a restart of the service on the undercloud and breaks the connection between mistral and zaqar. This makes python-tripleoclient fail while the underlying ansible run triggered by mistral keeps working fine.

I'm logging this mostly for reference and in case a new version of OSP 13 ever happens.
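For reference, here is a minimal sketch of how one could confirm that the underlying playbook actually completed despite the CLI failure. This is not part of any shipped tooling: the log path comes from this report, and the "PLAY RECAP" / "failed=" markers are standard ansible output; everything else is illustrative.

```python
import re

LOG = "/var/log/mistral/package_update.log"

def ansible_run_completed(log_path=LOG):
    """Return True if the log shows a play recap with no failed hosts.

    Hypothetical helper: it only assumes standard ansible log output.
    """
    with open(log_path) as f:
        text = f.read()
    # "PLAY RECAP" marks the end of an ansible run; failed=0 on every
    # host means the playbook itself succeeded even if the CLI did not.
    if "PLAY RECAP" not in text:
        return False
    return all(int(n) == 0 for n in re.findall(r"failed=(\d+)", text))

if __name__ == "__main__":
    print("playbook completed cleanly:", ansible_run_completed())
```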
Hi folks,

I am also going to add a note about this potential issue in the "Known issues that might block an update" section of the OSP 13 Keeping Red Hat OpenStack Platform Updated doc [1].

How likely is this issue to happen during an update? Does it happen in any particular situations (specific configurations or deployments)?

Many thanks,
Vlada

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/keeping_red_hat_openstack_platform_updated/index#known_issues_that_might_block_an_update
(In reply to Vlada Grosu from comment #3)
> I am also going to add a note about this potential issue in the Known
> issues that might block an update section of the OSP 13 Keeping Red Hat
> OpenStack Platform Updated doc [1].

Here is the published update:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/keeping_red_hat_openstack_platform_updated/index#known_issues_that_might_block_an_update

Many thanks,
Vlada
Hi, so we have tried to reproduce this with 10 computes, but it didn't happen. After a round of very slow gdb sessions, we found a likely contender for this issue:

https://opendev.org/openstack/python-tripleoclient/src/branch/stable/queens/tripleoclient/plugin.py#L108-L109

This call to clean up the queue is "unprotected", so if the queue is already destroyed, the code fails and makes the ooo client return 1 even though the ansible run was successful.

For an unrelated reason, this code was removed in OSP 14 in this review: https://review.opendev.org/c/openstack/python-tripleoclient/+/555193, which means deleting the queue is not critical.

This would match the error we see on your undercloud: 16 queue_delete calls, of which 1 failed with "WebSocket connection closed: connection was closed uncleanly (peer dropped the TCP connection without previous WebSocket closing handshake)". That 1 failure could be the false error.

So the current advice would be to apply the patch https://review.opendev.org/c/openstack/python-tripleoclient/+/555193 (it should work on OSP 13) and verify that the error no longer occurs. Let us know if you need more guidance to test this patch.
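To illustrate what "unprotected" means here, a hypothetical sketch follows. The function and client names are illustrative stand-ins, not the actual plugin.py code; it contrasts the failure mode with the kind of guard that would avoid it (the OSP 14 review simply removes the deletion entirely).

```python
# Hypothetical sketch: zaqar_client and queue_name are illustrative
# stand-ins, not the real tripleoclient objects.

def delete_queue_unprotected(zaqar_client, queue_name):
    # If zaqar was restarted (e.g. by log rotation) the websocket is
    # already gone, this raises, and the CLI exits 1 even though the
    # ansible run driven by mistral finished successfully.
    zaqar_client.queue(queue_name).delete()


def delete_queue_guarded(zaqar_client, queue_name):
    # Best-effort cleanup: the queue is transient, so failing to delete
    # it is harmless and should not fail the whole command. OSP 14
    # dropped the deletion entirely, which is why backporting that
    # review is the suggested fix.
    try:
        zaqar_client.queue(queue_name).delete()
    except Exception:
        pass
```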
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 13 bug fix and enhancement advisory), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3650