Red Hat Bugzilla – Bug 1469434
Health checks for containerized services
Last modified: 2018-11-05 03:32:13 EST
Description of problem: We need to provide Sensu checks that give OpenStack operators the status of containerized OpenStack services. In OSP10 and OSP11 we shipped systemctl checks as part of the opstools-ansible project; those checks monitored OpenStack services and visualized the results in Uchiwa. Since we introduced containerized OpenStack services in OSP12, we have to adjust those checks to monitor the containerized services.
The alternative to the systemd checks will be check-docker-health, implemented as the solution to rhbz#1452229, together with a set of healthcheck scripts for each container. Let this BZ track the healthcheck framework work and serve as a tracker for additional BZs for the healthcheck scripts.
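For illustration only (this is not the actual check-docker-health implementation; the output format and exit codes are assumptions following the common Sensu/Nagios convention), a check that surfaces container health status to operators could look roughly like this:

#!/bin/bash
# Hypothetical sketch -- not the shipped check-docker-health script.
# Exit 2 (CRITICAL) if any running container reports an unhealthy status,
# exit 0 (OK) otherwise, following the usual Sensu/Nagios convention.
unhealthy=$(docker ps | awk '/unhealthy/ {print $NF}')
if [ -n "$unhealthy" ]; then
    echo "CRITICAL: unhealthy containers: $unhealthy"
    exit 2
fi
echo "OK: all containers report healthy"
exit 0

A Sensu client subscription could then run such a check on each overcloud node, so the result shows up in Uchiwa the same way the existing systemctl checks do.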
Relevant spec: https://review.openstack.org/#/c/465633/
All patches have been merged upstream.
First tests failed; adding a fix patch.
Two more fixes are required for healthchecks: https://review.openstack.org/497468 https://review.openstack.org/497792
The healthchecks report false negatives when used with internal TLS. I've reported the bug upstream at https://bugs.launchpad.net/tripleo/+bug/1713689.
Created an upstream bug for false negatives in RabbitMQ-connected services (https://bugs.launchpad.net/tripleo/+bug/1714077).
Removing abandoned patches from external tracker.
Hi, in which puddle can I test this feature?
FailedQA because several containers are still reported as unhealthy. The containers are: nova_vnc_proxy, ceilometer_agent_central and aodh_evaluator. The healthcheck should be less strict in the case of those containers.
ceilometer_agent_central and aodh_evaluator use Redis for coordination, so we need to change their health checks accordingly.
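For illustration, a less strict check for these two services could look for an established connection to the Redis coordination backend instead of the RabbitMQ ports. The process name and Redis port below are assumptions for the sketch; the real per-container healthcheck scripts are the ones tracked upstream:

#!/bin/bash
# Illustrative sketch only -- not the actual openstack-tripleo-common healthcheck.
# Assumed values: the poller process name and the default Redis port 6379.
process="ceilometer-poll"
redis_port=6379
# Pass if the agent holds an established connection to Redis rather than RabbitMQ.
if ss -ntp state established "( dport = :${redis_port} )" | grep -q "${process}"; then
    exit 0
fi
echo "There is no ${process} process with an open Redis connection (${redis_port}) in the container"
exit 1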
Added upstream bug for this.
*** Bug 1505543 has been marked as a duplicate of this bug. ***
(In reply to Leonid Natapov from comment #15)
> FailedQA because several containers are still reported as unhealthy.
>
> The containers are: nova_vnc_proxy, ceilometer_agent_central and aodh_evaluator.
>
> The healthcheck should be less strict in the case of those containers.

Closed https://bugzilla.redhat.com/show_bug.cgi?id=1505543 as a duplicate of this bug. After a successful overcloud deployment, the container health check shows an unhealthy state for nova_vnc_proxy / ceilometer_agent_central / aodh_evaluator. It seems that there are no open ports for the services in the containers, the ports are wrong, or the health check is wrong.

[root@controller-0 ~]# docker ps | grep unhealthy
c5c2c95f0a05  192.168.24.1:8787/rhosp12/openstack-nova-novncproxy-docker:20171017.1     "kolla_start"  3 days ago  Up 3 days (unhealthy)  nova_vnc_proxy
318cbdf9accd  192.168.24.1:8787/rhosp12/openstack-ceilometer-central-docker:20171017.1  "kolla_start"  3 days ago  Up 3 days (unhealthy)  ceilometer_agent_central
5dd0f0496c2b  192.168.24.1:8787/rhosp12/openstack-aodh-evaluator-docker:20171017.1      "kolla_start"  3 days ago  Up 3 days (unhealthy)  aodh_evaluator

[root@controller-0 ~]# for i in `docker ps | awk '/unhealthy/ {print $NF}'`; do echo $i; docker inspect $i | jq ".[].State.Health.Log[].Output" | head -n1; done
nova_vnc_proxy
"There is no nova-novncproxy process with opened RabbitMQ ports (5671,5672) running in the container\n"
ceilometer_agent_central
"There is no ceilometer-poll process with opened RabbitMQ ports (5671,5672) running in the container\n"
aodh_evaluator
"There is no aodh-evaluator process with opened RabbitMQ ports (5671,5672) running in the container\n"
Omri, the upstream patch at https://review.openstack.org/#/c/513471/ should fix the last 3 remaining issues with the false-negative healthchecks.
Hi folks, nova_migration_target also has unhealthy status:

sudo docker inspect nova_migration_target | jq ".[].State.Health.Log[].Output" | head -n1
"There is no nova-compute process with opened RabbitMQ ports (5671,5672) running in the container\n"
Yes, it is caused by reusing the nova-compute image for the nova-migration-target container. A patch for this is going to land today.
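Roughly, a dedicated check for nova_migration_target would verify the migration SSH daemon instead of a nova-compute RabbitMQ connection. The actual change is the upstream patch; the port used below (2022) is an assumption for illustration:

#!/bin/bash
# Rough illustration only -- the real fix is the upstream patch mentioned above.
# nova_migration_target runs the migration SSH daemon, not nova-compute, so the
# check should look for a listening sshd rather than RabbitMQ connections.
if ss -ntlp "( sport = :2022 )" | grep -q sshd; then
    exit 0
fi
echo "There is no sshd listening on the migration SSH port (2022) in the container"
exit 1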
FailedQA
environment: OSP12, puddle 20171102.1

641108f139fd  192.168.24.1:8787/rhosp12/openstack-nova-compute-docker:20171102.1  "kolla_start"  About an hour ago  Up About an hour (unhealthy)  nova_migration_target

[heat-admin@compute-0 ~]$ sudo docker inspect nova_migration_target | jq ".[].State.Health.Log[].Output" | head -n1
"There is no nova-compute process with opened RabbitMQ ports (5671,5672) running in the container\n"
Do we also need health checks for all the ironic services here? https://review.openstack.org/#/c/506125
The stable/pike backport of the nova_migration_target healthcheck has been merged upstream.
This bug is a Test-Blocker and Automation-Blocker for the Container DFG. In the 'container sanity' test plan we check the health status of each container; when one of the containers returns an (unhealthy) status, our automation results reflect this failure.
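For context, the pass criterion of such a sanity check boils down to something like the sketch below (the actual automation differs; this only illustrates the assertion):

# Sketch of the pass criterion: succeed only when no running container
# is reported as unhealthy by docker ps.
test -z "$(docker ps | awk '/unhealthy/ {print $NF}')"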
Tested with openstack-tripleo-common-7.6.3-10.el7ost.noarch.

ea5db9c2d692  192.168.24.1:8787/rhosp12/openstack-nova-compute:2018-01-24.2  "kolla_start"  39 hours ago  Up 39 hours (healthy)  nova_compute
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0253