Bug 1469434 - Health checks for containerized services [NEEDINFO]
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: z1
Target Release: 12.0 (Pike)
Assigned To: Martin Magr
QA Contact: Leonid Natapov
Keywords: AutomationBlocker, TestBlocker, Triaged, ZStream
Duplicates: 1505543
Depends On:
Blocks:
 
Reported: 2017-07-11 05:26 EDT by Leonid Natapov
Modified: 2018-02-14 07:08 EST

See Also:
Fixed In Version: openstack-tripleo-common-7.6.3-10.el7ost
Doc Type: Known Issue
Doc Text:
When using the Docker CLI to report the state of running containers, the nova_migration_target container might be incorrectly reported as "unhealthy". This is caused by an issue with the health check itself and does not accurately reflect the state of the running container.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-01-30 16:24:32 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
jdanjou: needinfo? (bschmaus)




External Trackers
Tracker | ID | Priority | Status | Summary | Last Updated
Launchpad | 1710629 | None | None | None | 2017-08-14 09:53 EDT
Launchpad | 1713689 | None | None | None | 2017-08-29 06:58 EDT
Launchpad | 1714077 | None | None | None | 2017-08-30 15:39 EDT
Launchpad | 1714951 | None | None | None | 2017-09-04 08:59 EDT
Launchpad | 1714952 | None | None | None | 2017-09-04 10:13 EDT
Launchpad | 1724922 | None | None | None | 2017-10-19 14:34 EDT
Launchpad | 1730649 | None | None | None | 2017-11-07 06:34 EST
OpenStack gerrit | 483081 | None | master: MERGED | tripleo-common: healthchecks: start to implement container healthchecks (Id9fc19dd386e395317093d1723b836ae2807fdf0) | 2017-11-28 13:04 EST
OpenStack gerrit | 483104 | None | master: MERGED | tripleo-common: healthchecks: implement service-specific checks (I9882d5047c960499dae279eda5d9cbacc01ac5d3) | 2017-11-28 13:03 EST
OpenStack gerrit | 488419 | None | master: MERGED | tripleo-common: Add health checks during kolla build (I98300eb6716992bbeb3544013a397a27a3427cf0) | 2017-11-28 13:03 EST
OpenStack gerrit | 490892 | None | master: MERGED | tripleo-common: Add health check command for ironic-pxe image (If5b77481330fa697f1bab16696acb70075052d4f) | 2017-11-28 13:03 EST
OpenStack gerrit | 493528 | None | master: MERGED | tripleo-common: Use correct path in healthcheck scripts (Id7d169cfd34bf2a45e9bff0e389e77189cb774d7) | 2017-11-28 13:03 EST
OpenStack gerrit | 497468 | None | master: MERGED | tripleo-common: Add clustercheck healthcheck (I46be211323442d309d7bb000bede02b0e8ccb512) | 2017-11-28 13:03 EST
OpenStack gerrit | 497792 | None | master: MERGED | tripleo-common: Fix port health check false negatives (I01389ce87f81486fd887d71e6816df76276011e0) | 2017-11-28 13:03 EST
OpenStack gerrit | 499880 | None | stable/pike: MERGED | tripleo-common: Fix port health check false negatives (I01389ce87f81486fd887d71e6816df76276011e0) | 2017-11-28 13:03 EST
OpenStack gerrit | 500495 | None | master: MERGED | tripleo-common: Fix the path to HEALTHCHECK_SCRIPTS in healthcheck/ironic-api (I7e8156ac0a44beef1eae6500f049d159bbeb5b4b... | 2017-11-28 13:03 EST
OpenStack gerrit | 501226 | None | stable/pike: MERGED | tripleo-common: Add clustercheck healthcheck (I46be211323442d309d7bb000bede02b0e8ccb512) | 2017-11-28 13:03 EST
OpenStack gerrit | 502913 | None | stable/pike: MERGED | tripleo-common: Fix the path to HEALTHCHECK_SCRIPTS in healthcheck/ironic-api (I7e8156ac0a44beef1eae6500f049d159bbeb5b4b... | 2017-11-28 13:03 EST
OpenStack gerrit | 504937 | None | stable/pike: MERGED | tripleo-common: Add health check command for ironic-pxe image (If5b77481330fa697f1bab16696acb70075052d4f) | 2017-11-28 13:03 EST
OpenStack gerrit | 513471 | None | master: MERGED | tripleo-common: Fix nova_vnc_proxy, ceilometer_agent_central and aodh_evaluator health check (Ic6b51ad6be0d230375bdfb13e... | 2017-11-28 13:03 EST
OpenStack gerrit | 515612 | None | stable/pike: MERGED | tripleo-common: Fix nova_vnc_proxy, ceilometer_agent_central and aodh_evaluator health check (Ic6b51ad6be0d230375bdfb13e... | 2017-11-28 13:03 EST
OpenStack gerrit | 518298 | None | master: MERGED | tripleo-common: Healthcheck for nova_migration_target container (Id48e32b43d2e0cd4319905131f1c0bee9774e5f0) | 2017-11-28 13:03 EST
OpenStack gerrit | 518709 | None | stable/pike: MERGED | tripleo-common: Healthcheck for nova_migration_target container (Id48e32b43d2e0cd4319905131f1c0bee9774e5f0) | 2017-11-28 13:03 EST

Description Leonid Natapov 2017-07-11 05:26:35 EDT
Description of problem:


We need to provide Sensu checks that give OpenStack operators the status of containerized OpenStack services.

In OSP10 and OSP11 we shipped systemctl checks as part of the opstools-ansible project. Those checks monitored OpenStack services and visualized the results in Uchiwa. Since OSP12 introduces containerized OpenStack services, we have to adjust those checks to monitor the containerized services.
Comment 1 Martin Magr 2017-07-11 06:49:26 EDT
The alternative to the systemd checks will be check-docker-health, implemented as the solution to rhbz#1452229, together with a set of healthcheck scripts for each container.

Let this BZ track the healthcheck framework work and serve as a tracker for additional BZs for healthcheck scripts.
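
For illustration only, here is a minimal Python sketch of how a Sensu-style check could turn Docker's health state into a check result. It is not the check-docker-health implementation. The function name and the status-to-exit-code mapping are assumptions made for this example; the exit codes follow the usual Sensu/Nagios convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), and the input is the JSON text printed by `docker inspect <container>`.

```python
import json

# Conventional Sensu/Nagios check exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def sensu_status(inspect_json: str) -> int:
    """Map the health state found in `docker inspect` JSON to an exit code.

    Hypothetical helper for illustration; the mapping below is an assumption.
    """
    containers = json.loads(inspect_json)
    if not containers:
        return UNKNOWN  # nothing was inspected
    health = containers[0].get("State", {}).get("Health")
    if health is None:
        return UNKNOWN  # the container has no health check configured
    mapping = {"healthy": OK, "starting": WARNING, "unhealthy": CRITICAL}
    return mapping.get(health.get("Status"), UNKNOWN)
```

A wrapper script would pass the returned value to `sys.exit()` so that Sensu can pick it up as the check result.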
Comment 2 Martin Magr 2017-07-11 06:51:21 EDT
Relevant spec: https://review.openstack.org/#/c/465633/
Comment 3 Martin Magr 2017-08-09 09:11:15 EDT
All patches have been merged upstream.
Comment 5 Martin Magr 2017-08-14 08:53:19 EDT
The first tests failed; adding a fix patch.
Comment 6 Martin Magr 2017-08-25 07:57:03 EDT
Two more fixes are required for healthchecks:

https://review.openstack.org/497468
https://review.openstack.org/497792
Comment 7 Martin André 2017-08-29 06:58:37 EDT
The healthchecks report false negatives when used with internal TLS. I've reported the bug upstream at https://bugs.launchpad.net/tripleo/+bug/1713689.
Comment 8 Martin Magr 2017-08-30 15:39:11 EDT
Created an upstream bug for false negatives in RabbitMQ-connected services (https://bugs.launchpad.net/tripleo/+bug/1714077).
Comment 9 Martin Magr 2017-09-04 10:16:52 EDT
Removing abandoned patches from external tracker.
Comment 11 Artem Hrechanychenko 2017-09-27 16:28:30 EDT
Hi,
In which puddle can I test this feature?
Comment 15 Leonid Natapov 2017-10-18 12:52:20 EDT
FailedQA because several containers are still reported as unhealthy.

The containers are nova_vnc_proxy, ceilometer_agent_central, and aodh_evaluator.

The healthcheck should be less strict for those containers.
Comment 16 Martin Magr 2017-10-19 13:43:57 EDT
ceilometer_agent_central and aodh_evaluator use Redis for coordination, so we need to change their health checks.
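
As a rough illustration of why a port-based check can report a false negative here, the sketch below models such a check in Python. It is not the tripleo-common healthcheck script. The helper name and the `(process, remote_port)` input format are hypothetical, and 6379 is only the Redis default port, assumed for this example; the RabbitMQ ports are the ones named in the health check output quoted in this bug.

```python
from typing import Iterable, Set, Tuple

RABBITMQ_PORTS = {5671, 5672}  # ports named in the health check output in this bug
REDIS_PORTS = {6379}           # Redis default port; an assumption for this sketch


def has_connection(connections: Iterable[Tuple[str, int]],
                   process: str, ports: Set[int]) -> bool:
    """Return True if `process` holds a connection to any port in `ports`.

    `connections` is a hypothetical list of (process name, remote port)
    pairs, however they were gathered inside the container.
    """
    return any(name == process and port in ports for name, port in connections)


# A service that only talks to Redis fails a RabbitMQ-only check
# even though it is working as intended.
conns = [("aodh-evaluator", 6379)]
rabbit_ok = has_connection(conns, "aodh-evaluator", RABBITMQ_PORTS)
redis_ok = has_connection(conns, "aodh-evaluator", REDIS_PORTS)
```

With this input, `rabbit_ok` is False and `redis_ok` is True, which is the kind of mismatch the comment above describes.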
Comment 17 Martin Magr 2017-10-19 14:34:23 EDT
Added upstream bug for this.
Comment 18 Omri Hochman 2017-10-23 17:02:05 EDT
*** Bug 1505543 has been marked as a duplicate of this bug. ***
Comment 19 Omri Hochman 2017-10-23 17:03:42 EDT
(In reply to Leonid Natapov from comment #15)
> FailedQA because several containers still reported as unhealthy.
> 
> The container are: nova_vnc_proxy,ceilometer_agent_central and aodh_evaluator
> 
> The healthcheck should be less strict in case of those containers.


Closed https://bugzilla.redhat.com/show_bug.cgi?id=1505543 as a duplicate of this bug.

After a successful overcloud deployment, the container health check shows an unhealthy state for nova_vnc_proxy, ceilometer_agent_central, and aodh_evaluator. It seems that either there are no open ports for the services in the containers, the ports are wrong, or the health check is wrong.
 

[root@controller-0 ~]# docker ps | grep unhealthy
c5c2c95f0a05        192.168.24.1:8787/rhosp12/openstack-nova-novncproxy-docker:20171017.1           "kolla_start"            3 days ago          Up 3 days (unhealthy)                       nova_vnc_proxy
318cbdf9accd        192.168.24.1:8787/rhosp12/openstack-ceilometer-central-docker:20171017.1        "kolla_start"            3 days ago          Up 3 days (unhealthy)                       ceilometer_agent_central
5dd0f0496c2b        192.168.24.1:8787/rhosp12/openstack-aodh-evaluator-docker:20171017.1            "kolla_start"            3 days ago          Up 3 days (unhealthy)                       aodh_evaluator


[root@controller-0 ~]# for i in `docker ps|awk '/unhealthy/ {print $NF}'`; do echo $i; docker inspect $i| jq ".[].State.Health.Log[].Output"|head -n1;done
nova_vnc_proxy
"There is no nova-novncproxy process with opened RabbitMQ ports (5671,5672) running in the container\n"
ceilometer_agent_central
"There is no ceilometer-poll process with opened RabbitMQ ports (5671,5672) running in the container\n"
aodh_evaluator
"There is no aodh-evaluator process with opened RabbitMQ ports (5671,5672) running in the container\n"
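
The jq expression used above reads `.State.Health.Log[].Output` from each inspected container. A minimal Python sketch of the same extraction follows, for cases where jq is not available. The function name is hypothetical; the input is the JSON text printed by `docker inspect <container>`.

```python
import json
from typing import Optional


def first_health_output(inspect_json: str) -> Optional[str]:
    """Return the first health check log output, or None if there is none."""
    for container in json.loads(inspect_json):
        health = container.get("State", {}).get("Health") or {}
        log = health.get("Log") or []
        if log:
            # Strip the trailing newline that the check output carries.
            return log[0].get("Output", "").strip()
    return None
```

The text could be obtained with, for example, `subprocess.run(["docker", "inspect", name], capture_output=True, text=True).stdout`.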
Comment 20 Martin André 2017-10-24 03:10:53 EDT
Omri, the upstream patch at https://review.openstack.org/#/c/513471/ should fix the last 3 remaining issues with false-negative healthchecks.
Comment 21 Artem Hrechanychenko 2017-10-25 04:59:19 EDT
Hi folks,
nova_migration_target also has an unhealthy status:

sudo docker inspect nova_migration_target | jq ".[].State.Health.Log[].Output"|head -n1
"There is no nova-compute process with opened RabbitMQ ports (5671,5672) running in the container\n"
Comment 22 Martin Magr 2017-10-27 05:41:39 EDT
Yes, it is caused by reusing the nova-compute image for the nova_migration_target container. A patch for this is going to land today.
Comment 24 Artem Hrechanychenko 2017-11-06 07:04:27 EST
FailedQA

environment:
OSP12 puddle: 20171102.1 
641108f139fd        192.168.24.1:8787/rhosp12/openstack-nova-compute-docker:20171102.1         "kolla_start"       About an hour ago   Up About an hour (unhealthy)                       nova_migration_target

[heat-admin@compute-0 ~]$ sudo docker inspect nova_migration_target | jq ".[].State.Health.Log[].Output"|head -n1
"There is no nova-compute process with opened RabbitMQ ports (5671,5672) running in the container\n"
Comment 25 Derek Higgins 2017-11-06 09:59:11 EST
Do we also need health checks for all the ironic services here?
https://review.openstack.org/#/c/506125
Comment 27 Martin Magr 2017-11-13 05:49:45 EST
The stable/pike backport of the nova_migration_target healthcheck has been merged upstream.
Comment 34 Omri Hochman 2017-11-30 09:48:11 EST
This bug is a Test-Blocker and Automation-Blocker for Container:DFG.
In the 'container sanity' test plan we check the health status of each container; when one of the containers returns an (unhealthy) status, our automation results reflect this failure.
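
A sanity check of the kind described here could be sketched as follows. This is not the actual Container:DFG automation; the function name and the `name|status` input format are assumptions, chosen so the input can be produced with `docker ps --format '{{.Names}}|{{.Status}}'`.

```python
from typing import List


def unhealthy_containers(ps_output: str) -> List[str]:
    """Return the names of containers whose status contains '(unhealthy)'.

    Expects one container per line in the form "<name>|<status>".
    """
    names = []
    for line in ps_output.splitlines():
        name, _, status = line.partition("|")
        if "(unhealthy)" in status:
            names.append(name)
    return names
```

An automated test would then fail whenever the returned list is not empty and report the listed names.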
Comment 44 Leonid Natapov 2018-01-28 00:57:23 EST
Tested with openstack-tripleo-common-7.6.3-10.el7ost.noarch.



ea5db9c2d692        192.168.24.1:8787/rhosp12/openstack-nova-compute:2018-01-24.2         "kolla_start"       39 hours ago        Up 39 hours (healthy)                       nova_compute
Comment 51 errata-xmlrpc 2018-01-30 16:24:32 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0253
