2001629 – container-status validation incorrectly fails when some containers exited with code 143

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-tripleo-validations
Sub Component:
Version:	16.2 (Train)
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	z2
Target Release:	16.2 (Train on RHEL 8.4)
Assignee:	Gaël Chamoulaud
QA Contact:	nlevinki
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2040667

Reported:	2021-09-06 15:24 UTC by Luigi Toscano
Modified:	2022-03-23 22:12 UTC
CC List:	7 users
Fixed In Version:	openstack-tripleo-validations-11.6.1-2.20220114124841.fef7cf7.el8ost
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	2040667
Environment:
Last Closed:	2022-03-23 22:11:37 UTC
Target Upstream Version:	stable/train
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
OpenStack gerrit	824693	None	MERGED	[Train-Only] Instruct container-status validation to accept 137, 142 and 143 exit code status	2022-02-10 10:56:45 UTC
Red Hat Issue Tracker	OSP-8228	None	None	None	2021-11-15 12:51:56 UTC
Red Hat Issue Tracker	VALFRWK-603	None	None	None	2021-09-06 17:06:05 UTC
Red Hat Product Errata	RHBA-2022:1001	None	None	None	2022-03-23 22:12:12 UTC

Description Luigi Toscano 2021-09-06 15:24:04 UTC

Description of problem:
The container-status validation currently checks whether all the containers exited with status 0.

In a Fast Forward Upgrade scenario from OSP13 to OSP16.2 (and probably to OSP16.1 too), container-status sometimes fails because it finds specific containers, namely neutron-haproxy-qrouter-*, which exited with code 143. The validation reports an error like:

"Failed container detected: neutron-haproxy-qrouter-<foo> Exited (143) 40 minutes ago."

The logs of the failed jobs do not contain many details. The usual content of the log for such a container (/var/log/containers/stdouts/neutron-haproxy-qrouter-<foo>.log) is more or less:

2021-09-04T10:09:07.741719971+00:00 stderr F [WARNING] 246/100848 (987071) : Exiting Master process...
2021-09-04T10:09:07.742251294+00:00 stderr F [ALERT] 246/100848 (987071) : Current worker 987075 exited with code 143
2021-09-04T10:09:07.742251294+00:00 stderr F [WARNING] 246/100848 (987071) : All workers exited. Exiting... (143)

According to networking experts (thanks Slaweq!), each of those containers "is sidecar container to run haproxy for metadata service for neutron router", and it looks like this error message should just be considered a "normal" termination, according to:
https://www.mail-archive.com/haproxy@formilux.org/msg30473.html

(I am not sure whether the fix was backported to haproxy 1.8, the version available on OSP16.x; possibly not.)
 
So the validation should probably consider 143 a valid termination value, maybe always, maybe only in some specific cases. At least for that specific container, it is not an incorrect value.
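
For context, an exit code above 128 conventionally encodes 128 plus the number of the signal that terminated the process, so 143 corresponds to SIGTERM (15), i.e. a requested shutdown rather than a crash. A minimal Python sketch of that decoding follows; the helper name `describe_exit_code` is hypothetical and is not part of tripleo-validations, and the signal names assume a Linux host:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Return a human-readable description of a container exit code.

    Hypothetical helper for illustration only; it relies on the common
    "128 + signal number" convention for processes killed by a signal.
    """
    if code == 0:
        return "clean exit"
    if code > 128:
        try:
            # signal.Signals raises ValueError for unknown signal numbers.
            return f"terminated by {signal.Signals(code - 128).name}"
        except ValueError:
            return f"exit code {code} (no matching signal)"
    return f"error exit code {code}"


if __name__ == "__main__":
    for code in (0, 1, 137, 142, 143):
        print(code, "->", describe_exit_code(code))
```

Under the same convention, 137 maps to SIGKILL (9) and 142 to SIGALRM (14).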

Version-Release number of selected component (if applicable):
openstack-tripleo-validations-11.6.1-2.20210612074808.8644a02.el8ost.1.noarch
python3-validations-libs-1.1.1-2.20210607091343.04e84c8.el8ost.1.noarch
validations-common-1.1.2-2.20210611010116.el8ost.2.noarch

How reproducible:
Almost always on FFU jobs.


Steps to Reproduce:
openstack tripleo validator run --stack qe-Cloud-0 --validation container-status

Comment 3 Gaël Chamoulaud 2021-09-07 07:05:24 UTC

Systemd has been instructed to accept exit codes 137, 142 and 143.

tripleo-ansible patches:
- https://wimp.usersys.redhat.com/?change_id_input=I8f19a80016a67ccad0371c5d108516aec640f031

python-paunch patches:
- https://wimp.usersys.redhat.com/?change_id_input=Iffcfc8bd18a999ae6921a4131d40241df40050f1

The container-status validation should also accept those exit codes as valid termination statuses.
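
The requested behaviour can be sketched in Python as below. This is only an illustration of the logic, not the actual change: the real validation is implemented as Ansible content in tripleo-validations. The function name `failed_containers`, the input mapping and the container names are all hypothetical; the status strings follow the "Exited (143) 40 minutes ago" format quoted in the description.

```python
import re

# Exit codes treated as valid terminations, as listed in this comment.
ACCEPTED_EXIT_CODES = {0, 137, 142, 143}

# Matches status strings such as "Exited (143) 40 minutes ago".
_EXITED_RE = re.compile(r"^Exited \((\d+)\)")


def failed_containers(statuses):
    """Return names of containers that exited with an unaccepted code.

    `statuses` maps a container name to its status string. Containers
    that are still running (no "Exited (N)" prefix) are ignored.
    """
    failed = []
    for name, status in statuses.items():
        match = _EXITED_RE.match(status)
        if match and int(match.group(1)) not in ACCEPTED_EXIT_CODES:
            failed.append(name)
    return failed


if __name__ == "__main__":
    # Hypothetical sample data for illustration.
    sample = {
        "neutron-haproxy-qrouter-example": "Exited (143) 40 minutes ago",
        "some-init-container": "Exited (0) 2 hours ago",
        "broken-container": "Exited (1) 5 minutes ago",
        "running-container": "Up 3 hours",
    }
    print(failed_containers(sample))
```

With this sample input, only the container that exited with code 1 is reported as failed.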

Comment 8 Jiri Podivin 2022-02-11 14:05:33 UTC

Manually tested on 11.02.2022 using RHOS 16.2.2, puddle RHOS-16.2-RHEL-8-20220210.n.1

Comment 13 errata-xmlrpc 2022-03-23 22:11:37 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 16.2.2), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1001