Bug 1794044 - Container healthchecks are unreliable (at least ports)
Summary: Container healthchecks are unreliable (at least ports)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: z7
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Cédric Jeanneret
QA Contact: David Rosenfeld
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-01-22 14:36 UTC by Cédric Jeanneret
Modified: 2021-12-09 20:17 UTC (History)

Fixed In Version: openstack-tripleo-common-11.4.1-1.20210412113429.75bd92a.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-12-09 20:17:24 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Launchpad | 1860556 | 0 | None | None | None | 2020-01-22 14:36:03 UTC
Launchpad | 1921714 | 0 | None | None | None | 2021-04-08 09:18:42 UTC
OpenStack gerrit | 703819 | 0 | None | MERGED | Make healthchecks more strict | 2020-11-16 14:03:02 UTC
OpenStack gerrit | 785326 | 0 | None | NEW | healthcheck_port: drop lsof in favor of awk/find | 2021-04-08 09:18:42 UTC
Red Hat Issue Tracker | OSP-2016 | 0 | None | None | None | 2021-11-18 11:28:44 UTC
Red Hat Product Errata | RHBA-2021:3762 | 0 | None | None | None | 2021-12-09 20:17:49 UTC

Description Cédric Jeanneret 2020-01-22 14:36:03 UTC
Hello there,

After some checks and digging into the healthchecks, it appears most of them are unreliable because they lack strict error checking, such as "set -o pipefail" and related options.
We probably want to add the following options:
set -eo pipefail
in tripleo-common/healthcheck/common.sh
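
To illustrate why the missing option matters, here is a minimal sketch of how a failure early in a pipeline is hidden unless pipefail is set. The function names are hypothetical and this is not the content of common.sh; it only assumes bash.

```shell
#!/bin/bash
# Hypothetical helpers for illustration only; not taken from common.sh.

# Without pipefail, the pipeline status is that of the last command (cat),
# so the failure of `false` goes unnoticed.
check_without_pipefail() {
    (
        set +o pipefail
        status=0
        false | cat || status=$?
        echo "$status"
    )
}

# With pipefail, the pipeline reports the rightmost non-zero status.
check_with_pipefail() {
    (
        set -o pipefail
        status=0
        false | cat || status=$?
        echo "$status"
    )
}

check_without_pipefail   # prints 0
check_with_pipefail      # prints 1
```

A healthcheck built on a pipeline such as "producer | grep ..." would therefore only notice a failing producer in the second form.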

------

While digging into that issue, I also found out that "grep -q -E ..." apparently doesn't return the correct exit code when it matches piped content, at least in some cases (see below).

For instance, in the nova_conductor container:

(ss -ntuap; sudo -u nova ss -ntuap) | sort -u | /usr/bin/grep -Eq ":(5672).*,pid=($(pgrep -d '|' nova-conductor))" ; echo $?
141

But if we do it without -q:
(ss -ntuap; sudo -u nova ss -ntuap) | sort -u | /usr/bin/grep -E ":(5672).*,pid=($(pgrep -d '|' nova-conductor))" ; echo $?
tcp ESTAB 0 0 192.168.24.1:54136 192.168.24.1:5672 users:(("nova-conductor",pid=25,fd=9))
tcp ESTAB 0 0 192.168.24.1:54138 192.168.24.1:5672 users:(("nova-conductor",pid=26,fd=9))
tcp ESTAB 0 0 192.168.24.1:54140 192.168.24.1:5672 users:(("nova-conductor",pid=28,fd=9))
tcp ESTAB 0 0 192.168.24.1:54142 192.168.24.1:5672 users:(("nova-conductor",pid=24,fd=9))
tcp ESTAB 0 0 192.168.24.1:54144 192.168.24.1:5672 users:(("nova-conductor",pid=23,fd=9))
tcp ESTAB 0 0 192.168.24.1:54146 192.168.24.1:5672 users:(("nova-conductor",pid=27,fd=9))
tcp ESTAB 0 0 192.168.24.1:54148 192.168.24.1:5672 users:(("nova-conductor",pid=29,fd=9))
tcp ESTAB 0 0 192.168.24.1:54150 192.168.24.1:5672 users:(("nova-conductor",pid=22,fd=9))
tcp ESTAB 0 0 192.168.24.1:57270 192.168.24.1:5672 users:(("nova-conductor",pid=25,fd=10))
tcp ESTAB 0 0 192.168.24.1:57310 192.168.24.1:5672 users:(("nova-conductor",pid=26,fd=10))
tcp ESTAB 0 0 192.168.24.1:57320 192.168.24.1:5672 users:(("nova-conductor",pid=28,fd=10))
tcp ESTAB 0 0 192.168.24.1:57324 192.168.24.1:5672 users:(("nova-conductor",pid=24,fd=10))
tcp ESTAB 0 0 192.168.24.1:57326 192.168.24.1:5672 users:(("nova-conductor",pid=23,fd=10))
tcp ESTAB 0 0 192.168.24.1:57364 192.168.24.1:5672 users:(("nova-conductor",pid=22,fd=10))
tcp ESTAB 8 0 192.168.24.1:57328 192.168.24.1:5672 users:(("nova-conductor",pid=27,fd=10))
tcp ESTAB 8 0 192.168.24.1:57360 192.168.24.1:5672 users:(("nova-conductor",pid=29,fd=10))
0
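
For reference, 141 is 128 + 13, which is how bash reports a command terminated by SIGPIPE. The report does not establish which process received the signal or why. The sketch below only shows one known way such a status can arise, under the assumption that pipefail is enabled: grep -q exits at its first match while the producer is still writing. Here `yes` is a stand-in for that producer, and the function name is hypothetical.

```shell
#!/bin/bash
# Illustration only: `yes` stands in for a producer that is still writing
# when grep -q exits after its first match.

sigpipe_demo() {
    (
        set -o pipefail
        status=0
        yes 2>/dev/null | grep -q y || status=$?
        # grep itself matched, but with pipefail the pipeline status comes
        # from `yes`, which failed once its reader went away.
        echo "$status"
    )
}

sigpipe_demo   # typically prints 141 (128 + SIGPIPE)
```

If SIGPIPE is ignored in the calling environment, `yes` fails with a write error instead and the printed status is non-zero but not 141.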

This unreliable behaviour was detected in a rhel-8 OSP-16 container, while on the rhel-8 host, it was working as expected. There's probably something fishy with the container env at some point, but to be honest, I didn't dig further.

A solution for that last issue is to drop the -q and redirect STDOUT to /dev/null:

(ss -ntuap; sudo -u nova ss -ntuap) | sort -u | /usr/bin/grep -E ":(5672).*,pid=($(pgrep -d '|' nova-conductor))" >/dev/null; echo $?
0

This still returns a non-zero exit code when nothing is matched, as you can see here:
(ss -ntuap; sudo -u nova ss -ntuap) | sort -u | /usr/bin/grep -E ":(15672).*,pid=($(pgrep -d '|' nova-conductor))" >/dev/null; echo $?
1
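
The workaround above can be sketched as a small helper. This is a hypothetical function, not the code shipped in tripleo-common; it reads `ss`-style lines on stdin so it can be tried without a running service, and the PID list in the pattern is made up for the example.

```shell
#!/bin/bash
# Hypothetical helper, not the code shipped in tripleo-common: it reads
# `ss`-style lines on stdin and succeeds if any line matches the pattern.
# Because -q is not used, grep consumes all of its input, so upstream
# commands are not cut off early.
port_matches() {
    local pattern=$1
    grep -E "$pattern" >/dev/null
}

# Sample line copied from the output above.
sample='tcp ESTAB 0 0 192.168.24.1:54136 192.168.24.1:5672 users:(("nova-conductor",pid=25,fd=9))'

printf '%s\n' "$sample" | port_matches ':(5672).*,pid=(25|26)'  && echo "5672: match"
printf '%s\n' "$sample" | port_matches ':(15672).*,pid=(25|26)' || echo "15672: no match"
```

The helper returns grep's own exit code: 0 on a match, 1 when nothing matches.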

Special mention: I'm pretty sure healthchecks based on "lsof" are also broken, given the number of "permission denied" errors in its output.

Comment 1 Cédric Jeanneret 2020-01-22 16:40:29 UTC
Good news: the only healthcheck using lsof (libvirtd) seems to work as expected!

Comment 4 Cédric Jeanneret 2020-01-31 15:07:38 UTC
Moving to z2 - we won't be able to provide the right code correction in time for that one.

Comment 6 Cédric Jeanneret 2020-09-11 08:04:54 UTC
Back to ON_DEV - the patch was reverted and needs some more work.

Comment 7 Cédric Jeanneret 2021-04-08 09:18:42 UTC
I'm currently working on another patch, and it seems to make the port check more robust and reliable. Let's use it in order to improve 16.1 healthchecks!

Comment 26 errata-xmlrpc 2021-12-09 20:17:24 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.7 (Train) bug fix and enhancement advisory), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3762

