Bug 2001629 - container-status validation incorrectly fails when some containers exited with code 143
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-validations
Version: 16.2 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: z2
Target Release: 16.2 (Train on RHEL 8.4)
Assignee: Gaël Chamoulaud
QA Contact: nlevinki
URL:
Whiteboard:
Depends On:
Blocks: 2040667
 
Reported: 2021-09-06 15:24 UTC by Luigi Toscano
Modified: 2022-03-23 22:12 UTC (History)

Fixed In Version: openstack-tripleo-validations-11.6.1-2.20220114124841.fef7cf7.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2040667 (view as bug list)
Environment:
Last Closed: 2022-03-23 22:11:37 UTC
Target Upstream Version: stable/train
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
OpenStack gerrit | 824693 | 0 | None | MERGED | [Train-Only] Instruct container-status validation to accept 137, 142 and 143 exit code status | 2022-02-10 10:56:45 UTC
Red Hat Issue Tracker | OSP-8228 | 0 | None | None | None | 2021-11-15 12:51:56 UTC
Red Hat Issue Tracker | VALFRWK-603 | 0 | None | None | None | 2021-09-06 17:06:05 UTC
Red Hat Product Errata | RHBA-2022:1001 | 0 | None | None | None | 2022-03-23 22:12:12 UTC

Description Luigi Toscano 2021-09-06 15:24:04 UTC
Description of problem:
The container-status validation currently checks whether all the containers exited with status 0.

In a Fast Forward Upgrade scenario from OSP13 to OSP16.2 (and probably to OSP16.1 too), container-status sometimes fails because it finds specific containers, namely neutron-haproxy-qrouter-*, which exited with code 143. The validation reports an error like:

"Failed container detected: neutron-haproxy-qrouter-<foo> Exited (143) 40 minutes ago."

The logs of the failed jobs do not contain many details. The usual content of the log for such a container (/var/log/containers/stdouts/neutron-haproxy-qrouter-<foo>.log) is more or less:

2021-09-04T10:09:07.741719971+00:00 stderr F [WARNING] 246/100848 (987071) : Exiting Master process...
2021-09-04T10:09:07.742251294+00:00 stderr F [ALERT] 246/100848 (987071) : Current worker 987075 exited with code 143
2021-09-04T10:09:07.742251294+00:00 stderr F [WARNING] 246/100848 (987071) : All workers exited. Exiting... (143)

According to networking experts (thanks Slaweq!), each of those containers "is sidecar container to run haproxy for metadata service for neutron router", and it looks like this error message should just be considered a "normal" termination, according to:
https://www.mail-archive.com/haproxy@formilux.org/msg30473.html

(Not sure whether the fix was backported to haproxy 1.8, the version available on OSP16.x, but maybe not.)
 
So the validation should probably consider 143 a valid termination value, maybe always, maybe just in some specific cases. At least for that specific container, it is not an incorrect value.
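
To illustrate why 143 can be read as a normal termination: by common shell convention, an exit code above 128 means "terminated by signal (code - 128)", so 143 corresponds to signal 15 (SIGTERM). The sketch below decodes exit codes under that convention. The function name describe_exit_code is hypothetical and is not part of tripleo-validations; the signal names are those reported by Python's signal module on Linux.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe an exit code using the shell's 128 + signal-number convention.

    Hypothetical helper for illustration only.
    """
    if code == 0:
        return "clean exit"
    if code > 128:
        signum = code - 128
        try:
            # signal.Signals maps a signal number to its symbolic name.
            return f"terminated by {signal.Signals(signum).name}"
        except ValueError:
            return f"terminated by unknown signal {signum}"
    return f"application error {code}"


if __name__ == "__main__":
    for code in (0, 1, 143):
        print(code, describe_exit_code(code))
```

Under this reading, a container that exits with 143 after being asked to stop did what it was told, which is different from a container that failed on its own.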

Version-Release number of selected component (if applicable):
openstack-tripleo-validations-11.6.1-2.20210612074808.8644a02.el8ost.1.noarch
python3-validations-libs-1.1.1-2.20210607091343.04e84c8.el8ost.1.noarch
validations-common-1.1.2-2.20210611010116.el8ost.2.noarch

How reproducible:
Almost always on FFU jobs.


Steps to Reproduce:
openstack tripleo validator run --stack qe-Cloud-0 --validation container-status

Comment 3 Gaël Chamoulaud 2021-09-07 07:05:24 UTC
Systemd has been instructed to accept exit codes 137, 142 and 143.

tripleo-ansible patches:
- https://wimp.usersys.redhat.com/?change_id_input=I8f19a80016a67ccad0371c5d108516aec640f031

python-paunch patches:
- https://wimp.usersys.redhat.com/?change_id_input=Iffcfc8bd18a999ae6921a4131d40241df40050f1

The container-status validation should also accept those exit codes as valid termination statuses.
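
A minimal sketch of the requested behaviour, assuming the validation receives one "name status" line per container in the form quoted in the description ("neutron-haproxy-qrouter-<foo> Exited (143) 40 minutes ago"). This is not the actual tripleo-validations implementation; the names ACCEPTED_EXIT_CODES, STATUS_RE and failed_containers are hypothetical.

```python
import re

# Assumption: 0 plus the three exit codes named in this comment are acceptable.
ACCEPTED_EXIT_CODES = frozenset({0, 137, 142, 143})

# Matches lines such as "some-container Exited (143) 40 minutes ago".
STATUS_RE = re.compile(r"^(?P<name>\S+)\s+Exited \((?P<code>\d+)\)")


def failed_containers(status_lines, accepted=ACCEPTED_EXIT_CODES):
    """Return (name, exit_code) pairs for containers whose exit code is not accepted."""
    failed = []
    for line in status_lines:
        match = STATUS_RE.match(line.strip())
        if match is None:
            continue  # Not an exited container (for example "Up 2 hours").
        code = int(match.group("code"))
        if code not in accepted:
            failed.append((match.group("name"), code))
    return failed


if __name__ == "__main__":
    sample = [
        "neutron-haproxy-qrouter-abc Exited (143) 40 minutes ago",
        "example_service Up 2 hours",
        "example_job Exited (1) 5 minutes ago",
    ]
    for name, code in failed_containers(sample):
        print(f"Failed container detected: {name} Exited ({code})")
```

Keeping the accepted codes in a single set makes it easy to narrow the exception later, for example to specific container name patterns, if accepting 143 everywhere turns out to be too permissive.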

Comment 8 Jiri Podivin 2022-02-11 14:05:33 UTC
Manually tested on 2022-02-11 using RHOS 16.2.2, puddle RHOS-16.2-RHEL-8-20220210.n.1

Comment 13 errata-xmlrpc 2022-03-23 22:11:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 16.2.2), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1001

