Bug 1693196 - [OSP16][Undercloud][healthcheck] failed healthcheck for ceilometer_agent_compute
Summary: [OSP16][Undercloud][healthcheck] failed healthcheck for ceilometer_agent_compute
Keywords:
Status: CLOSED DUPLICATE of bug 1910939
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: z8
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Martin Magr
QA Contact: Leonid Natapov
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-27 10:28 UTC by Artem Hrechanychenko
Modified: 2024-10-01 16:14 UTC
CC List: 11 users

Fixed In Version: openstack-tripleo-common-11.4.1-1.20210407183435.el8ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-24 11:23:22 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 648027 0 None MERGED Silent file descriptor checks 2021-02-09 22:21:37 UTC
OpenStack gerrit 692802 0 None MERGED Fix ceilometer_agent_compute health check 2021-02-09 22:21:37 UTC
OpenStack gerrit 693837 0 None MERGED Fix ceilometer_agent_compute health check 2021-02-09 22:21:37 UTC
OpenStack gerrit 757089 0 None MERGED Fix ceilometer_agent_compute healthcheck 2021-02-09 22:21:38 UTC
OpenStack gerrit 762124 0 None MERGED Fix ceilometer_agent_compute healthcheck 2021-02-09 22:21:38 UTC
Red Hat Bugzilla 1689671 0 medium CLOSED Undercloud: neutron containers healthcheck failed 2021-02-22 00:41:40 UTC
Red Hat Issue Tracker OSP-4080 0 None None None 2021-11-12 17:52:11 UTC
Red Hat Knowledge Base (Solution) 6097391 0 None None None 2021-07-23 15:15:30 UTC

Description Artem Hrechanychenko 2019-03-27 10:28:00 UTC
Description of problem:
After Overcloud installation, the healthcheck for the ceilometer_agent_compute container on a Compute node fails:

[heat-admin@compute-0 ~]$ sudo systemctl status tripleo_ceilometer_agent_compute_healthcheck.service 
● tripleo_ceilometer_agent_compute_healthcheck.service - ceilometer_agent_compute healthcheck
   Loaded: loaded (/etc/systemd/system/tripleo_ceilometer_agent_compute_healthcheck.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2019-03-27 10:21:44 UTC; 1min 13s ago
  Process: 136470 ExecStart=/usr/bin/podman exec ceilometer_agent_compute /openstack/healthcheck (code=exited, status=1/FAILURE)
 Main PID: 136470 (code=exited, status=1/FAILURE)

Mar 27 10:21:44 compute-0 systemd[1]: Starting ceilometer_agent_compute healthcheck...
Mar 27 10:21:44 compute-0 podman[136470]: There is no ceilometer-poll process with opened RabbitMQ ports (5671,5672) running in the container
Mar 27 10:21:44 compute-0 podman[136470]: exit status 1
Mar 27 10:21:44 compute-0 systemd[1]: tripleo_ceilometer_agent_compute_healthcheck.service: Main process exited, code=exited, status=1/FAILURE
Mar 27 10:21:44 compute-0 systemd[1]: tripleo_ceilometer_agent_compute_healthcheck.service: Failed with result 'exit-code'.
Mar 27 10:21:44 compute-0 systemd[1]: Failed to start ceilometer_agent_compute healthcheck.


The container itself is running:
f7624d8ba4f0  192.168.24.1:8787/rhosp15/openstack-ceilometer-compute:20190325.1          kolla_start  13 hours ago  Up 13 hours ago         ceilometer_agent_compute

[heat-admin@compute-0 ~]$ sudo podman logs ceilometer_agent_compute
+ sudo -E kolla_set_configs
INFO:__main__:Loading config file at /var/lib/kolla/config_files/config.json
INFO:__main__:Validating config file
INFO:__main__:Kolla config strategy set to: COPY_ALWAYS
INFO:__main__:Copying service configuration files
INFO:__main__:Deleting /etc/ceilometer/ceilometer.conf
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/ceilometer/ceilometer.conf to /etc/ceilometer/ceilometer.conf
INFO:__main__:Writing out command to execute
++ cat /run_command
+ CMD='/usr/bin/ceilometer-polling --polling-namespaces compute --logfile /var/log/ceilometer/compute.log'
+ ARGS=
+ [[ ! -n '' ]]
+ . kolla_extend_start
++ CEILOMETER_LOG_DIR=/var/log/kolla/ceilometer
++ [[ ! -d /var/log/kolla/ceilometer ]]
++ mkdir -p /var/log/kolla/ceilometer
+++ stat -c %U:%G /var/log/kolla/ceilometer
++ [[ root:kolla != \c\e\i\l\o\m\e\t\e\r\:\k\o\l\l\a ]]
++ chown ceilometer:kolla /var/log/kolla/ceilometer
+++ stat -c %a /var/log/kolla/ceilometer
++ [[ 2755 != \7\5\5 ]]
++ chmod 755 /var/log/kolla/ceilometer
++ . /usr/local/bin/kolla_ceilometer_extend_start
+ echo 'Running command: '\''/usr/bin/ceilometer-polling --polling-namespaces compute --logfile /var/log/ceilometer/compute.log'\'''
Running command: '/usr/bin/ceilometer-polling --polling-namespaces compute --logfile /var/log/ceilometer/compute.log'
+ exec /usr/bin/ceilometer-polling --polling-namespaces compute --logfile /var/log/ceilometer/compute.log

Version-Release number of selected component (if applicable):
OSP15 compose RHOS_TRUNK-15.0-RHEL-8-20190326.n.0

container image openstack-ceilometer-compute:20190325.1 

How reproducible:
Always

Steps to Reproduce:
1. Deploy the OSP15 undercloud.
2. Deploy the OSP15 overcloud.
3. Check the healthcheck status for the ceilometer_agent_compute container on an overcloud Compute node.

Actual results:
There is no ceilometer-poll process with opened RabbitMQ ports (5671,5672) running in the container

Expected results:
The healthcheck service exits with exit code 0.

Additional info:
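For illustration (not part of the original report): the failure message indicates the healthcheck looks for a ceilometer-polling process holding an open connection to the given RabbitMQ ports, so the same condition can be checked by hand from the Compute host. A minimal sketch, assuming pgrep and ss are available there:

# Find the polling agent's PID as seen from the host.
pid=$(pgrep -f ceilometer-polling | head -n1)

# List established TCP connections owned by that PID and look for the
# RabbitMQ ports the healthcheck expects (5671/5672).
sudo ss -ntp | grep "pid=${pid}" | grep -E ':(5671|5672)'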

Comment 2 Cédric Jeanneret 2019-03-27 12:17:06 UTC
Hello!

pretty sure this one is linked to https://bugzilla.redhat.com/show_bug.cgi?id=1689671
The following patch will probably solve this issue: https://review.openstack.org/648027

I'm taking this BZ.

Cheers,

C.

Comment 6 Nataf Sharabi 2019-06-11 09:12:38 UTC
Hi,

I've installed OSP15 core_puddle=RHOS_TRUNK-15.0-RHEL-8-20190604.n.2

undercloud:1,controller:1,compute:1



(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.12
Warning: Permanently added '192.168.24.12' (ECDSA) to the list of known hosts.
Last login: Mon Jun 10 11:34:33 2019 from 192.168.24.254
[heat-admin@compute-0 ~]$ 
[heat-admin@compute-0 ~]$ sudo systemctl status tripleo_ceilometer_agent_compute_healthcheck.service
● tripleo_ceilometer_agent_compute_healthcheck.service - ceilometer_agent_compute healthcheck
   Loaded: loaded (/etc/systemd/system/tripleo_ceilometer_agent_compute_healthcheck.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2019-06-11 08:56:21 UTC; 50s ago
  Process: 252186 ExecStart=/usr/bin/podman exec ceilometer_agent_compute /openstack/healthcheck 5672 (code=exited, status=1/FAILURE)
 Main PID: 252186 (code=exited, status=1/FAILURE)

Jun 11 08:56:21 compute-0 systemd[1]: Starting ceilometer_agent_compute healthcheck...
Jun 11 08:56:21 compute-0 podman[252186]: There is no ceilometer-polling process with opened RabbitMQ ports (5672) running in the container
Jun 11 08:56:21 compute-0 podman[252186]: exit status 1
Jun 11 08:56:21 compute-0 systemd[1]: tripleo_ceilometer_agent_compute_healthcheck.service: Main process exited, code=exited, status=1/FAILURE
Jun 11 08:56:21 compute-0 systemd[1]: tripleo_ceilometer_agent_compute_healthcheck.service: Failed with result 'exit-code'.
Jun 11 08:56:21 compute-0 systemd[1]: Failed to start ceilometer_agent_compute healthcheck.

[heat-admin@compute-0 ~]$ sudo podman logs ceilometer_agent_compute
#The logs are empty

[heat-admin@compute-0 ~]$ sudo podman ps
CONTAINER ID  IMAGE                                                                      COMMAND               CREATED       STATUS           PORTS  NAMES
2a8e9ba993d3  192.168.24.1:8787/rhosp15/openstack-nova-compute:20190604.1                dumb-init --singl...  22 hours ago  Up 22 hours ago         nova_compute
7533c83f18b7  192.168.24.1:8787/rhosp15/openstack-neutron-metadata-agent-ovn:20190604.1  dumb-init --singl...  22 hours ago  Up 22 hours ago         ovn_metadata_agent
8abed1793fad  192.168.24.1:8787/rhosp15/openstack-ovn-controller:20190604.1              dumb-init --singl...  22 hours ago  Up 22 hours ago         ovn_controller
79f3ac5bb7b0  192.168.24.1:8787/rhosp15/openstack-nova-compute:20190604.1                dumb-init --singl...  22 hours ago  Up 22 hours ago         nova_migration_target
cdbff47e1aa0  192.168.24.1:8787/rhosp15/openstack-cron:20190604.1                        dumb-init --singl...  22 hours ago  Up 22 hours ago         logrotate_crond
47c9bef560e4  192.168.24.1:8787/rhosp15/openstack-ceilometer-compute:20190604.1          dumb-init --singl...  22 hours ago  Up 22 hours ago         ceilometer_agent_compute
be547c98832f  192.168.24.1:8787/rhosp15/openstack-iscsid:20190604.1                      dumb-init --singl...  22 hours ago  Up 22 hours ago         iscsid
fb30bb4e95ce  192.168.24.1:8787/rhosp15/openstack-nova-libvirt:20190604.1                dumb-init --singl...  22 hours ago  Up 22 hours ago         nova_libvirt
739e51d60b33  192.168.24.1:8787/rhosp15/openstack-nova-libvirt:20190604.1                dumb-init --singl...  22 hours ago  Up 22 hours ago         nova_virtlogd

It seems that the problem hasn't been resolved.

Nataf
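
As a side note (not from the original comment), a quick way to tell whether the deployed container actually carries the updated healthcheck script is to read it straight out of the container, as comment 22 does later:

[heat-admin@compute-0 ~]$ sudo podman exec ceilometer_agent_compute cat /openstack/healthcheck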

Comment 7 Artem Hrechanychenko 2019-06-13 15:11:45 UTC
Confirming that I got the same issue.

Comment 8 Cédric Jeanneret 2019-06-28 07:16:21 UTC
OK, will put it back on the bench and work out a solution then :).

Comment 9 Cédric Jeanneret 2019-07-05 07:36:46 UTC
Setting the right DFG(s) - they should take care of the ceilometer healthchecks.

Comment 21 Lon Hohberger 2020-03-06 11:38:23 UTC
According to our records, this should be resolved by openstack-tripleo-common-10.8.3-0.20200113210450.0e559fc.el8ost.  This build is available now.

Comment 22 Martin Magr 2020-09-24 10:20:46 UTC
This is still an issue on OSP16. From the output below we can see that the HC script has been fixed, but the correct default value is still being overridden, probably during deploy. Further investigation is required.

[root@compute-0 ~]# systemctl status tripleo_ceilometer_agent_compute_healthcheck.service
● tripleo_ceilometer_agent_compute_healthcheck.service - ceilometer_agent_compute healthcheck
   Loaded: loaded (/etc/systemd/system/tripleo_ceilometer_agent_compute_healthcheck.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2020-09-24 10:13:12 UTC; 53s ago
  Process: 309837 ExecStart=/usr/bin/podman exec --user root ceilometer_agent_compute /openstack/healthcheck 5672 (code=exited, status=1/FAILURE)
 Main PID: 309837 (code=exited, status=1/FAILURE)

Sep 24 10:13:11 compute-0 systemd[1]: Starting ceilometer_agent_compute healthcheck...
Sep 24 10:13:12 compute-0 podman[309837]: 2020-09-24 10:13:12.092599591 +0000 UTC m=+0.322381794 container exec f04e88e773d3d4941877dbb20acbfd0ea6971b4f3e68bfde157bc72487271186 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osb>
Sep 24 10:13:12 compute-0 healthcheck_ceilometer_agent_compute[309837]: There is no ceilometer-polling process with opened Redis ports (5672) running in the container
Sep 24 10:13:12 compute-0 healthcheck_ceilometer_agent_compute[309837]: Error: non zero exit code: 1: OCI runtime error
Sep 24 10:13:12 compute-0 systemd[1]: tripleo_ceilometer_agent_compute_healthcheck.service: Main process exited, code=exited, status=1/FAILURE
Sep 24 10:13:12 compute-0 systemd[1]: tripleo_ceilometer_agent_compute_healthcheck.service: Failed with result 'exit-code'.
Sep 24 10:13:12 compute-0 systemd[1]: Failed to start ceilometer_agent_compute healthcheck.
[root@compute-0 ~]# 
[root@compute-0 ~]# podman exec -it ceilometer_agent_compute bash
()[root@compute-0 /]# cat /openstack/healthcheck 
#!/bin/bash

. ${HEALTHCHECK_SCRIPTS:-/usr/share/openstack-tripleo-common/healthcheck}/common.sh

process='ceilometer-polling'
args="${@:-6379}"

if healthcheck_port $process $args; then
    exit 0
else
    ports=${args// /,}
    echo "There is no $process process with opened Redis ports ($ports) running in the container"
    exit 1
fi
()[root@compute-0 /]#
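
For illustration only (the merged fix is in the Gerrit changes linked above): aligning the script's default and error message with the RabbitMQ ports that the systemd unit actually passes would look roughly like the sketch below, assuming healthcheck_port keeps its current interface.

#!/bin/bash
# Hedged sketch, not the merged patch.
. ${HEALTHCHECK_SCRIPTS:-/usr/share/openstack-tripleo-common/healthcheck}/common.sh

process='ceilometer-polling'
# Default to the RabbitMQ ports instead of the Redis port (6379), so the
# no-argument case matches what the polling agent actually connects to.
args="${@:-5671 5672}"

if healthcheck_port $process $args; then
    exit 0
else
    ports=${args// /,}
    echo "There is no $process process with opened RabbitMQ ports ($ports) running in the container"
    exit 1
fi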

Targeting to OSP16 since OSP15 is EOL.

Comment 30 Martin Magr 2021-08-24 11:23:22 UTC
The original issue from the description was fixed and the new issue is being handled in bug #1910939. Closing this as a duplicate.

*** This bug has been marked as a duplicate of bug 1910939 ***

