Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1469434

Summary: Health checks for containerized services
Product: Red Hat OpenStack
Reporter: Leonid Natapov <lnatapov>
Component: openstack-tripleo-common
Assignee: Martin Magr <mmagr>
Status: CLOSED ERRATA
QA Contact: Leonid Natapov <lnatapov>
Severity: high
Docs Contact:
Priority: urgent
Version: 12.0 (Pike)
CC: agurenko, ahrechan, apannu, astupnik, athomas, bschmaus, dcadzow, derekh, imain, jbadiapa, jdanjou, jjoyce, jschluet, lars, m.andre, mariel, mburns, mcornea, mmagr, mrunge, ohochman, oidgar, rhel-osp-director-maint, rmccabe, sasha, sclewis, slinaber, ssmolyak, tvignaud
Target Milestone: z1
Keywords: AutomationBlocker, TestBlocker, Triaged, ZStream
Target Release: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-tripleo-common-7.6.3-10.el7ost
Doc Type: Known Issue
Doc Text:
When using the Docker CLI to report the state of running containers, the nova_migration_target container might be incorrectly reported as "unhealthy". This is caused by an issue with the health check itself and does not reflect the actual state of the running container.
Story Points: ---
Last Closed: 2018-01-30 21:24:32 UTC
Type: Bug

Description Leonid Natapov 2017-07-11 09:26:35 UTC
Description of problem:


We need to provide Sensu checks that give OpenStack operators the status of containerized OpenStack services.

In OSP10 and OSP11 we shipped systemctl checks as part of the opstools-ansible project. Those checks monitored OpenStack services and visualized the results in Uchiwa. Since we introduced containerized OpenStack services in OSP12, we have to adjust those checks to monitor the containerized services.
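
As a rough illustration of the kind of per-container check being discussed (this is a sketch, not the scripts that later shipped in openstack-tripleo-common; the function name, arguments, and socket format are assumptions), such a check might verify that a named service process holds a TCP connection on the expected ports:

```shell
#!/bin/sh
# Sketch of a per-container health check: report healthy only if PROCESS
# has a connection on one of the comma-separated PORTS. All names here are
# illustrative, not the actual shipped healthcheck scripts.

# healthcheck_port PROCESS PORTS [SOCKET_DUMP]
#   SOCKET_DUMP defaults to live `ss -ntp` output, but can be passed
#   explicitly, which makes the function testable outside a container.
healthcheck_port() {
    process=$1
    ports=$(printf '%s' "$2" | tr ',' '|')
    sockets=${3:-$(ss -ntp)}
    # ss -ntp prints the owning process as users:(("name",pid=...,fd=...))
    if printf '%s\n' "$sockets" | grep -qE ":($ports)[^0-9].*\"$process\""; then
        echo healthy
    else
        echo unhealthy
    fi
}
```

Docker's HEALTHCHECK mechanism then only needs the script's exit status, so a real check would `exit 0`/`exit 1` on the two branches instead of printing.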

Comment 1 Martin Magr 2017-07-11 10:49:26 UTC
An alternative to the systemd checks will be check-docker-health, implemented as the solution to rhbz#1452229, with a set of healthcheck scripts for each container.

Let this BZ serve to track the healthcheck framework work and as a tracker for additional BZs for the healthcheck scripts.

Comment 2 Martin Magr 2017-07-11 10:51:21 UTC
Relevant spec: https://review.openstack.org/#/c/465633/

Comment 3 Martin Magr 2017-08-09 13:11:15 UTC
All patches have been merged upstream.

Comment 5 Martin Magr 2017-08-14 12:53:19 UTC
First tests failed; adding a fix patch.

Comment 6 Martin Magr 2017-08-25 11:57:03 UTC
Two more fixes are required for healthchecks:

https://review.openstack.org/497468
https://review.openstack.org/497792

Comment 7 Martin André 2017-08-29 10:58:37 UTC
The healthchecks report false negatives when used with internal TLS. I've reported the bug upstream at https://bugs.launchpad.net/tripleo/+bug/1713689.

Comment 8 Martin Magr 2017-08-30 19:39:11 UTC
Created an upstream bug for false negatives in RabbitMQ-connected services (https://bugs.launchpad.net/tripleo/+bug/1714077).

Comment 9 Martin Magr 2017-09-04 14:16:52 UTC
Removing abandoned patches from external tracker.

Comment 11 Artem Hrechanychenko 2017-09-27 20:28:30 UTC
Hi, 
in which puddle can I test this feature?

Comment 15 Leonid Natapov 2017-10-18 16:52:20 UTC
FailedQA because several containers are still reported as unhealthy.

The containers are nova_vnc_proxy, ceilometer_agent_central, and aodh_evaluator.

The healthcheck should be less strict for those containers.

Comment 16 Martin Magr 2017-10-19 17:43:57 UTC
ceilometer_agent_central and aodh_evaluator are using Redis for coordination, so we need to change their health check.
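An adjusted check for these coordination services might look for an established Redis connection rather than the RabbitMQ ports 5671/5672. The following is a hypothetical sketch only: the function name, the process name, and the assumption of the default Redis port 6379 are all illustrative, not the actual fix.

```shell
#!/bin/sh
# Hypothetical coordination-aware check: healthy if PROCESS holds a
# connection to the Redis port (assumed default 6379). SOCKET_DUMP
# defaults to live `ss -ntp` output; passing it explicitly allows testing.
redis_coordination_check() {
    process=$1
    sockets=${2:-$(ss -ntp)}
    printf '%s\n' "$sockets" | grep -q ":6379[^0-9].*\"$process\"" \
        && echo healthy || echo unhealthy
}
```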

Comment 17 Martin Magr 2017-10-19 18:34:23 UTC
Added upstream bug for this.

Comment 18 Omri Hochman 2017-10-23 21:02:05 UTC
*** Bug 1505543 has been marked as a duplicate of this bug. ***

Comment 19 Omri Hochman 2017-10-23 21:03:42 UTC
(In reply to Leonid Natapov from comment #15)
> FailedQA because several containers still reported as unhealthy.
> 
> The container are: nova_vnc_proxy,ceilometer_agent_central and aodh_evaluator
> 
> The healthcheck should be less strict in case of those containers.


Closed https://bugzilla.redhat.com/show_bug.cgi?id=1505543 as a duplicate of this bug.

After a successful overcloud deployment, the container health check shows an unhealthy state for nova_vnc_proxy / ceilometer_agent_central / aodh_evaluator. It seems that there are no open ports for the services in the containers, or the ports are wrong, or the health check is wrong.
 

[root@controller-0 ~]# docker ps | grep unhealthy
c5c2c95f0a05        192.168.24.1:8787/rhosp12/openstack-nova-novncproxy-docker:20171017.1           "kolla_start"            3 days ago          Up 3 days (unhealthy)                       nova_vnc_proxy
318cbdf9accd        192.168.24.1:8787/rhosp12/openstack-ceilometer-central-docker:20171017.1        "kolla_start"            3 days ago          Up 3 days (unhealthy)                       ceilometer_agent_central
5dd0f0496c2b        192.168.24.1:8787/rhosp12/openstack-aodh-evaluator-docker:20171017.1            "kolla_start"            3 days ago          Up 3 days (unhealthy)                       aodh_evaluator


[root@controller-0 ~]# for i in `docker ps|awk '/unhealthy/ {print $NF}'`; do echo $i; docker inspect $i| jq ".[].State.Health.Log[].Output"|head -n1;done
nova_vnc_proxy
"There is no nova-novncproxy process with opened RabbitMQ ports (5671,5672) running in the container\n"
ceilometer_agent_central
"There is no ceilometer-poll process with opened RabbitMQ ports (5671,5672) running in the container\n"
aodh_evaluator
"There is no aodh-evaluator process with opened RabbitMQ ports (5671,5672) running in the container\n"

Comment 20 Martin André 2017-10-24 07:10:53 UTC
Omri, the upstream patch at https://review.openstack.org/#/c/513471/ should fix the three remaining issues with the false-negative healthchecks.

Comment 21 Artem Hrechanychenko 2017-10-25 08:59:19 UTC
Hi folks,
nova_migration_target also has an unhealthy status:

sudo docker inspect nova_migration_target | jq ".[].State.Health.Log[].Output"|head -n1
"There is no nova-compute process with opened RabbitMQ ports (5671,5672) running in the container\n"

Comment 22 Martin Magr 2017-10-27 09:41:39 UTC
Yes, it is caused by reusing the nova-compute image for the nova_migration_target container. The patch for this is going to land today.

Comment 24 Artem Hrechanychenko 2017-11-06 12:04:27 UTC
FailedQA

environment:
OSP12 puddle: 20171102.1 
641108f139fd        192.168.24.1:8787/rhosp12/openstack-nova-compute-docker:20171102.1         "kolla_start"       About an hour ago   Up About an hour (unhealthy)                       nova_migration_target

[heat-admin@compute-0 ~]$ sudo docker inspect nova_migration_target | jq ".[].State.Health.Log[].Output"|head -n1
"There is no nova-compute process with opened RabbitMQ ports (5671,5672) running in the container\n"

Comment 25 Derek Higgins 2017-11-06 14:59:11 UTC
Do we also need health checks for all the ironic services here?
https://review.openstack.org/#/c/506125

Comment 27 Martin Magr 2017-11-13 10:49:45 UTC
The stable/pike backport of the nova_migration_target healthcheck has been merged upstream.

Comment 34 Omri Hochman 2017-11-30 14:48:11 UTC
This bug is a test blocker and automation blocker for the Container DFG.
In the 'container sanity' test plan we check the health status of each container; when one of the containers returns an (unhealthy) status, our automation results reflect this failure.
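
Such a sanity gate can be sketched as a small shell helper (an assumed form, not the DFG's actual automation; the `docker ps --format` usage is standard Docker CLI):

```shell
#!/bin/sh
# Count containers whose status line contains "(unhealthy)".
# Expects the output of: docker ps --format '{{.Names}} {{.Status}}'
# Note: grep -c still prints 0 when nothing matches, but exits non-zero.
count_unhealthy() {
    grep -c '(unhealthy)'
}

# Example gate on a controller node (illustrative):
#   if [ "$(docker ps --format '{{.Names}} {{.Status}}' | count_unhealthy)" -gt 0 ]; then
#       echo "FAIL: unhealthy container(s) detected" >&2
#       exit 1
#   fi
```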

Comment 44 Leonid Natapov 2018-01-28 05:57:23 UTC
Tested with openstack-tripleo-common-7.6.3-10.el7ost.noarch.



ea5db9c2d692        192.168.24.1:8787/rhosp12/openstack-nova-compute:2018-01-24.2         "kolla_start"       39 hours ago        Up 39 hours (healthy)                       nova_compute

Comment 51 errata-xmlrpc 2018-01-30 21:24:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0253