Bug 1902679

Summary: healthcheck_gnocchi_statsd fails while active gnocchi-statsd process is running and listening on correct port
Product: Red Hat OpenStack Reporter: Alex Stupnikov <astupnik>
Component: openstack-tripleo-common Assignee: Martin Magr <mmagr>
Status: CLOSED ERRATA QA Contact: Leonid Natapov <lnatapov>
Severity: medium Docs Contact:
Priority: medium    
Version: 16.1 (Train) CC: cylopez, lmadsen, mburns, michal.vasko, mrunge, slinaber
Target Milestone: z1 Keywords: Triaged, ZStream
Target Release: 16.2 (Train on RHEL 8.4)   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: openstack-tripleo-common-11.5.1-2.20210213010022.36ad9a1.el8ost.1 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-09-15 07:10:23 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
Embargoed:

Description Alex Stupnikov 2020-11-30 11:53:10 UTC
Description of problem:

At this point Red Hat still supports legacy telemetry services in RHOSP 16.1.

healthcheck_gnocchi_statsd fails, but the corresponding check command returns exit code 0 when executed manually. The logs [1] show that the following command fails inside the gnocchi_statsd container:
ss -lnp | grep -qE ":8125.*,pid=7,"

However, I get exit code 0 when I run the same command manually. I kindly ask engineering to help isolate the problem.

[1]
Oct 28 07:26:22 controller-1 systemd[1]: Starting gnocchi_statsd healthcheck...
Oct 28 07:26:22 controller-1 healthcheck_gnocchi_statsd[981161]: ++ : 10
Oct 28 07:26:22 controller-1 healthcheck_gnocchi_statsd[981161]: ++ : curl-healthcheck
Oct 28 07:26:22 controller-1 healthcheck_gnocchi_statsd[981161]: ++ : '\n%{http_code}' '%{remote_ip}:%{remote_port}' '%{time_total}' 'seconds\n'
Oct 28 07:26:22 controller-1 healthcheck_gnocchi_statsd[981161]: ++ : /dev/null
Oct 28 07:26:22 controller-1 healthcheck_gnocchi_statsd[981161]: + process=gnocchi-statsd
Oct 28 07:26:22 controller-1 healthcheck_gnocchi_statsd[981161]: ++ get_config_val /etc/gnocchi/gnocchi.conf statsd port 8125
Oct 28 07:26:22 controller-1 healthcheck_gnocchi_statsd[981161]: ++ crudini --get /etc/gnocchi/gnocchi.conf statsd port
Oct 28 07:26:22 controller-1 podman[981161]: 2020-10-28 07:26:22.883372853 +0000 UTC m=+0.236581960 container exec c3383a200100259e5a77d018105a1798272b4648f973d8ab688ba5021dfe7e8b (image=redhat_osp_rhel_8-gnocchi-statsd:16.1-51, name=gnocchi_statsd)
Oct 28 07:26:22 controller-1 healthcheck_gnocchi_statsd[981161]: ++ echo 8125
Oct 28 07:26:22 controller-1 healthcheck_gnocchi_statsd[981161]: + bind_port=8125
Oct 28 07:26:22 controller-1 healthcheck_gnocchi_statsd[981161]: + healthcheck_listen gnocchi-statsd 8125
Oct 28 07:26:22 controller-1 healthcheck_gnocchi_statsd[981161]: + process=gnocchi-statsd
Oct 28 07:26:22 controller-1 healthcheck_gnocchi_statsd[981161]: + shift 1
Oct 28 07:26:22 controller-1 healthcheck_gnocchi_statsd[981161]: + args=8125
Oct 28 07:26:22 controller-1 healthcheck_gnocchi_statsd[981161]: + ports=8125
Oct 28 07:26:22 controller-1 healthcheck_gnocchi_statsd[981161]: ++ pgrep -d '|' -f gnocchi-statsd
Oct 28 07:26:22 controller-1 healthcheck_gnocchi_statsd[981161]: + pids=7
Oct 28 07:26:22 controller-1 healthcheck_gnocchi_statsd[981161]: + ss -lnp
Oct 28 07:26:22 controller-1 healthcheck_gnocchi_statsd[981161]: + grep -qE ':(8125).*,pid=(7),'
Oct 28 07:26:22 controller-1 healthcheck_gnocchi_statsd[981161]: + echo 'There is no gnocchi-statsd process listening on ports 8125 in the container.'
Oct 28 07:26:22 controller-1 healthcheck_gnocchi_statsd[981161]: There is no gnocchi-statsd process listening on ports 8125 in the container.
Oct 28 07:26:22 controller-1 healthcheck_gnocchi_statsd[981161]: + exit 1
Oct 28 07:26:22 controller-1 healthcheck_gnocchi_statsd[981161]: Error: non zero exit code: 1: OCI runtime error
Oct 28 07:26:22 controller-1 systemd[1]: tripleo_gnocchi_statsd_healthcheck.service: Main process exited, code=exited, status=1/FAILURE
Oct 28 07:26:22 controller-1 systemd[1]: tripleo_gnocchi_statsd_healthcheck.service: Failed with result 'exit-code'.
Oct 28 07:26:22 controller-1 systemd[1]: Failed to start gnocchi_statsd healthcheck.

Comment 4 Martin Magr 2020-12-15 09:23:25 UTC
This happens because the healthcheck is executed as the root user:

[root@controller-0 ~]# ps -ef | grep gnocchi                                                                                                                                                                                           
root      339870       1  0 16:21 ?        00:00:00 /usr/bin/podman exec --user root gnocchi_metricd /openstack/healthcheck                                                                                                            
root      339874       1  1 16:21 ?        00:00:00 /usr/bin/podman exec --user root gnocchi_statsd /openstack/healthcheck                                                                                                             
root      339876       1  1 16:21 ?        00:00:00 /usr/bin/podman exec --user root gnocchi_api /openstack/healthcheck      
<snip>

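As shown below, the output of ss differs depending on whether it is run as the process owner (gnocchi) or as root. As root, the users:(...) column containing the pid is missing, so the healthcheck's grep for ",pid=<pid>," cannot match: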

[root@controller-0 ~]# podman exec -it gnocchi_statsd bash
()[gnocchi@controller-0 /]$ ss -lnp | grep 8125
udp                UNCONN              0                    0                                                                                           0.0.0.0:8125                 0.0.0.0:*          users:(("gnocchi-statsd",pid=6,fd=8))   
()[gnocchi@controller-0 /]$ exit
exit
[root@controller-0 ~]# podman exec -uroot -it gnocchi_statsd bash
()[root@controller-0 /]# ss -lnp | grep 8125
udp                UNCONN              0                    0                                                                                           0.0.0.0:8125                                            0.0.0.0:*
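
The mismatch can be reproduced offline with the two ss lines above. The following is a minimal sketch, not the actual healthcheck code: the function name matches_listen_pattern is hypothetical, the sample lines are the output above with whitespace condensed, and the pattern shape is the one visible in the trace from the description.

```shell
#!/bin/sh
# Sample lines taken from the ss output above (whitespace condensed).
as_gnocchi='udp UNCONN 0 0 0.0.0.0:8125 0.0.0.0:* users:(("gnocchi-statsd",pid=6,fd=8))'
as_root='udp UNCONN 0 0 0.0.0.0:8125 0.0.0.0:*'

# Hypothetical helper: apply the ':(ports).*,pid=(pids),' pattern to one line.
# Returns 0 when the line shows one of the pids listening on one of the ports.
matches_listen_pattern() {
    line=$1
    ports=$2
    pids=$3
    printf '%s\n' "$line" | grep -qE ":(${ports}).*,pid=(${pids}),"
}

for who in gnocchi root; do
    if [ "$who" = gnocchi ]; then sample=$as_gnocchi; else sample=$as_root; fi
    if matches_listen_pattern "$sample" 8125 6; then
        echo "ss output as ${who}: pattern matches"
    else
        echo "ss output as ${who}: pattern does not match"
    fi
done
```

Only the line that carries the users:(...) column can satisfy the pattern, which is consistent with the check passing when run manually as gnocchi and failing when run by the healthcheck as root.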

Unfortunately, the sudo usage introduced in patch [1] was not applied to healthcheck_listen, so we need to fix that as well.


[1] https://github.com/openstack/tripleo-common/commit/d03401438c22e59d4f51cedfd0af6d7d48328d45
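
One way such a fix could look is sketched below. This is an illustration under assumptions, not the code that was merged: it assumes sudo is available inside the container and that running ss as the process owner restores the users:(...) column, as the output above suggests. The function names owner_of_process and healthcheck_listen_as_owner are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the user owning the oldest process whose
# command line matches $1.
owner_of_process() {
    opid=$(pgrep -o -f "$1") || return 1
    # 'user=' suppresses the header; tr strips any column padding.
    ps -o user= -p "$opid" | tr -d ' '
}

# Hypothetical variant of healthcheck_listen: same pattern as in the trace,
# but ss runs as the process owner instead of the calling user.
# Usage: healthcheck_listen_as_owner <process> <port> [<port>...]
healthcheck_listen_as_owner() {
    process=$1
    shift
    # Join the remaining arguments into an alternation, e.g. "8125|8126".
    ports=$(printf '%s' "$*" | tr ' ' '|')
    pids=$(pgrep -d '|' -f "$process") || return 1
    puser=$(owner_of_process "$process") || return 1
    sudo -u "$puser" ss -lnp | grep -qE ":(${ports}).*,pid=(${pids}),"
}
```

A caller would invoke it the same way the trace shows for the existing function, for example `healthcheck_listen_as_owner gnocchi-statsd 8125`, and treat a non-zero exit status as a failed check.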

Comment 8 errata-xmlrpc 2021-09-15 07:10:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483