Description of problem:

The container-status validation currently checks whether all the containers exited with status 0. In a Fast Forward Upgrade scenario from OSP13 to OSP16.2 (and probably to OSP16.1 too), container-status sometimes fails because it finds specific containers, namely neutron-haproxy-qrouter-*, which exited with status 143.

The validation reports an error like:

"Failed container detected: neutron-haproxy-qrouter-<foo> Exited (143) 40 minutes ago."

The logs of the failed jobs do not contain many details. The usual content of the log for such a container (/var/log/containers/stdouts/neutron-haproxy-qrouter-<foo>.log) is more or less:

2021-09-04T10:09:07.741719971+00:00 stderr F [WARNING] 246/100848 (987071) : Exiting Master process...
2021-09-04T10:09:07.742251294+00:00 stderr F [ALERT] 246/100848 (987071) : Current worker 987075 exited with code 143
2021-09-04T10:09:07.742251294+00:00 stderr F [WARNING] 246/100848 (987071) : All workers exited. Exiting... (143)

According to networking experts (thanks Slaweq!), each of those containers "is sidecar container to run haproxy for metadata service for neutron router", and it looks like this error message should just be considered a "normal" termination according to:
https://www.mail-archive.com/haproxy@formilux.org/msg30473.html
(not sure whether the fix was backported to haproxy 1.8, the version available on OSP16.x, but maybe not).

So the validation should probably consider 143 a valid termination value, maybe always, maybe just in some specific cases. It is not an incorrect value, at least for that specific container.

Version-Release number of selected component (if applicable):
openstack-tripleo-validations-11.6.1-2.20210612074808.8644a02.el8ost.1.noarch
python3-validations-libs-1.1.1-2.20210607091343.04e84c8.el8ost.1.noarch
validations-common-1.1.2-2.20210611010116.el8ost.2.noarch

How reproducible:
Almost always on FFU jobs.
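For context on why 143 can be read as a normal termination: by the usual shell and container-runtime convention, an exit status above 128 means the process was stopped by signal number (status - 128), so 143 corresponds to SIGTERM (15). The snippet below is only an illustrative sketch of that decoding; describe_exit_code is a hypothetical helper, not part of the validation, and it assumes Linux signal numbering.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe an exit status (hypothetical helper, for illustration only)."""
    if code == 0:
        return "clean exit"
    if code > 128:
        # Convention: 128 + N means "terminated by signal N".
        try:
            return f"terminated by {signal.Signals(code - 128).name}"
        except ValueError:
            # No known signal with that number on this platform.
            pass
    return f"exited with error code {code}"


if __name__ == "__main__":
    for code in (0, 1, 143):
        print(code, "->", describe_exit_code(code))
```

Under that convention, the haproxy workers reporting "exited with code 143" were most likely just stopped with SIGTERM when the sidecar container was torn down.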
Steps to Reproduce: openstack tripleo validator run --stack qe-Cloud-0 --validation container-status
Systemd has been instructed to accept exit codes 137, 142 and 143.

tripleo-ansible patches:
- https://wimp.usersys.redhat.com/?change_id_input=I8f19a80016a67ccad0371c5d108516aec640f031

python-paunch patches:
- https://wimp.usersys.redhat.com/?change_id_input=Iffcfc8bd18a999ae6921a4131d40241df40050f1

The container-status validation should also accept those exit codes as valid termination statuses.
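To make the requested behaviour concrete, here is a rough sketch of the check in Python. It is not the actual validation code (which is an Ansible role); the names ACCEPTED_EXIT_CODES and failed_containers are hypothetical, and the input is assumed to be (name, status) pairs whose status text looks like the "Exited (143) 40 minutes ago" string quoted in the description.

```python
import re

# Assumption: 0 plus the exit codes that systemd was instructed to accept.
ACCEPTED_EXIT_CODES = {0, 137, 142, 143}

# Matches status strings such as "Exited (143) 40 minutes ago".
_EXITED_RE = re.compile(r"Exited \((\d+)\)")


def failed_containers(containers):
    """Return the (name, status) pairs whose exit code is not accepted.

    Containers whose status does not match "Exited (N)" (for example
    running ones) are not reported.
    """
    failed = []
    for name, status in containers:
        match = _EXITED_RE.search(status)
        if match and int(match.group(1)) not in ACCEPTED_EXIT_CODES:
            failed.append((name, status))
    return failed


if __name__ == "__main__":
    sample = [
        ("neutron-haproxy-qrouter-foo", "Exited (143) 40 minutes ago"),
        ("some-init-container", "Exited (0) 2 hours ago"),
        ("some-broken-container", "Exited (1) 5 minutes ago"),
        ("some-running-container", "Up 3 hours"),
    ]
    for name, status in failed_containers(sample):
        print(f"Failed container detected: {name} {status}")
```

With the sample data above, only the container that exited with status 1 is reported. Whether the accepted set should apply to every container or only to specific ones such as neutron-haproxy-qrouter-* is left open here, as in the original description.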
Manually tested on 11.02.2022 using RHOS 16.2.2, puddle RHOS-16.2-RHEL-8-20220210.n.1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Release of components for Red Hat OpenStack Platform 16.2.2), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:1001