Hello there,

After some checks and digging into the healthchecks, it appears most of them are unreliable due to the lack of strict error checking, such as "set -o pipefail" and other options.

We probably want to add the following options:

    set -eo pipefail

in tripleo-common/healthcheck/common.sh

------

While digging into that issue, I also found out that, apparently, "grep -q -E ..." doesn't return the correct exit code when it matches piped content, at least in some cases (see below).

For instance, in the nova_conductor container::

    (ss -ntuap; sudo -u nova ss -ntuap) | sort -u | /usr/bin/grep -Eq ":(5672).*,pid=($(pgrep -d '|' nova-conductor))" ; echo $?
    141

But if we do it without -q:

    (ss -ntuap; sudo -u nova ss -ntuap) | sort -u | /usr/bin/grep -E ":(5672).*,pid=($(pgrep -d '|' nova-conductor))" ; echo $?
    tcp ESTAB 0 0 192.168.24.1:54136 192.168.24.1:5672 users:(("nova-conductor",pid=25,fd=9))
    tcp ESTAB 0 0 192.168.24.1:54138 192.168.24.1:5672 users:(("nova-conductor",pid=26,fd=9))
    tcp ESTAB 0 0 192.168.24.1:54140 192.168.24.1:5672 users:(("nova-conductor",pid=28,fd=9))
    tcp ESTAB 0 0 192.168.24.1:54142 192.168.24.1:5672 users:(("nova-conductor",pid=24,fd=9))
    tcp ESTAB 0 0 192.168.24.1:54144 192.168.24.1:5672 users:(("nova-conductor",pid=23,fd=9))
    tcp ESTAB 0 0 192.168.24.1:54146 192.168.24.1:5672 users:(("nova-conductor",pid=27,fd=9))
    tcp ESTAB 0 0 192.168.24.1:54148 192.168.24.1:5672 users:(("nova-conductor",pid=29,fd=9))
    tcp ESTAB 0 0 192.168.24.1:54150 192.168.24.1:5672 users:(("nova-conductor",pid=22,fd=9))
    tcp ESTAB 0 0 192.168.24.1:57270 192.168.24.1:5672 users:(("nova-conductor",pid=25,fd=10))
    tcp ESTAB 0 0 192.168.24.1:57310 192.168.24.1:5672 users:(("nova-conductor",pid=26,fd=10))
    tcp ESTAB 0 0 192.168.24.1:57320 192.168.24.1:5672 users:(("nova-conductor",pid=28,fd=10))
    tcp ESTAB 0 0 192.168.24.1:57324 192.168.24.1:5672 users:(("nova-conductor",pid=24,fd=10))
    tcp ESTAB 0 0 192.168.24.1:57326 192.168.24.1:5672 users:(("nova-conductor",pid=23,fd=10))
    tcp ESTAB 0 0 192.168.24.1:57364 192.168.24.1:5672 users:(("nova-conductor",pid=22,fd=10))
    tcp ESTAB 8 0 192.168.24.1:57328 192.168.24.1:5672 users:(("nova-conductor",pid=27,fd=10))
    tcp ESTAB 8 0 192.168.24.1:57360 192.168.24.1:5672 users:(("nova-conductor",pid=29,fd=10))
    0

This unreliable behaviour was detected in a rhel-8 OSP-16 container, while on the rhel-8 host it was working as expected. There's probably something fishy with the container env at some point, but to be honest, I didn't dig further.

A solution for that last issue is to drop the -q and redirect STDOUT to /dev/null:

    (ss -ntuap; sudo -u nova ss -ntuap) | sort -u | /usr/bin/grep -E ":(5672).*,pid=($(pgrep -d '|' nova-conductor))" >/dev/null; echo $?
    0

It will still return 1 if nothing is matched, as you can see here:

    (ss -ntuap; sudo -u nova ss -ntuap) | sort -u | /usr/bin/grep -E ":(15672).*,pid=($(pgrep -d '|' nova-conductor))" >/dev/null; echo $?
    1

Special mention: I'm pretty sure healthchecks based on "lsof" are also broken, seeing the amount of "permission denied" in its output.
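A possible explanation for the 141 above, not confirmed in this report: 141 is 128 + 13, i.e. a process killed by SIGPIPE. With "pipefail" active, a pipeline reports the status of the last command that failed, so if "grep -q" exits on its first match while an upstream command is still writing, the upstream status can replace grep's 0. The bash sketch below reproduces that interaction with sample data and shows one way to avoid it by capturing the output before matching. The sample lines and the port_in_use helper are hypothetical and are not taken from common.sh.

```shell
#!/usr/bin/env bash
# Sketch only: "sample" and "port_in_use" are illustrative assumptions.

# Two socket lines in the style of the "ss -ntuap" output above.
sample='tcp ESTAB 0 0 192.168.24.1:54136 192.168.24.1:5672 users:(("nova-conductor",pid=25,fd=9))
tcp ESTAB 0 0 192.168.24.1:54138 192.168.24.1:5672 users:(("nova-conductor",pid=26,fd=9))'

# 1. Without pipefail, the pipeline status is grep's own exit code.
set +o pipefail
status_plain=0
yes "$sample" 2>/dev/null | grep -Eq ':(5672).*,pid=(25|26)' || status_plain=$?

# 2. With pipefail, grep -q still matches and exits early, but the endless
#    producer then fails on its next write, and that non-zero status
#    (usually 141 = 128 + SIGPIPE) becomes the pipeline status.
set -o pipefail
status_pipefail=0
yes "$sample" 2>/dev/null | grep -Eq ':(5672).*,pid=(25|26)' || status_pipefail=$?

# 3. Matching against already captured text: no producer is still writing
#    when grep exits, so the status is grep's (0 = match, 1 = no match).
port_in_use() {
    local port=$1 pids=$2 sockets=$3
    grep -Eq ":(${port}).*,pid=(${pids})" <<<"$sockets"
}

status_match=0
port_in_use 5672 '25|26' "$sample" || status_match=$?
status_nomatch=0
port_in_use 15672 '25|26' "$sample" || status_nomatch=$?

echo "plain=$status_plain pipefail=$status_pipefail match=$status_match nomatch=$status_nomatch"
```

In a real healthcheck the third argument would be the captured output of the ss pipeline, for example sockets="$( (ss -ntuap; sudo -u nova ss -ntuap) | sort -u )". The "cmd || status=$?" form is used so the sketch keeps working if "set -e" is enabled, as proposed above.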
Good news: the only healthcheck using lsof (libvirtd) seems to work as expected!
Moving to z2 - we won't be able to provide the right code correction in time for that one.
Back to ON_DEV - the patch was reverted and needs some more work.
I'm currently working on another patch, and it seems to make the port check more robust and reliable. Let's use it in order to improve 16.1 healthchecks!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.7 (Train) bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3762