Bug 2149002 - Metrics are missing in collectd healthchecks
Summary: Metrics are missing in collectd healthchecks
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.2 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: high
Target Milestone: z5
Target Release: 16.2 (Train on RHEL 8.4)
Assignee: Martin Magr
QA Contact: Leonid Natapov
Docs Contact: Joanne O'Flynn
URL:
Whiteboard:
Depends On:
Blocks: 2149008
Reported: 2022-11-28 13:49 UTC by Abhishek
Modified: 2023-09-19 04:30 UTC

Fixed In Version: openstack-tripleo-heat-templates-11.6.1-2.20230211104940.370c34a
Doc Type: Enhancement
Doc Text:
Clone Of:
: 2149008 (view as bug list)
Environment:
Last Closed: 2023-04-26 12:17:18 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
OpenStack gerrit | 866066 | 0 | None | MERGED | [TRAIN-ONLY] Collect and parse all health check logs | 2023-01-31 15:20:45 UTC
OpenStack gerrit | 870431 | 0 | None | NEW | [TRAIN-ONLY] Add more cases for container health check | 2023-02-21 14:11:43 UTC
Red Hat Issue Tracker | OSP-20492 | 0 | None | None | None | 2022-11-28 13:54:03 UTC
Red Hat Product Errata | RHBA-2023:1763 | 0 | None | None | None | 2023-04-26 12:17:51 UTC

Comment 1 Abhishek 2022-11-28 13:56:23 UTC
Description of problem:

Health check data for the following containers is missing from the STF/collectd health checks, so the customer cannot create alerts based on it in Prometheus:

ovn_controller
ovn_metadata_agent
neutron ovn
galera
redis
rabbit
haproxy


How reproducible:

1. Deploy RHOSP 16.2.3, STF 1.4


**STF 1.4 Deployments:**
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html/service_telemetry_framework_1.4/assembly-preparing-your-ocp-environment-for-stf_assembly

2. Run the following for a few hours on one of the controllers (to capture the ovn and neutron health checks) and on one of the computes (to capture nova_libvirt):
tail -f /var/log/containers/collectd/healthchecks.log > /tmp/debug.log

3. Inspect /tmp/debug.log from both nodes.
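
For step 3, one way to check the captured log is to search it for each expected container name. This is a minimal sketch under stated assumptions: the helper name `missing_in_log` is hypothetical, and it only matches plain substrings, because the layout of healthchecks.log is not assumed here.

```shell
# Print every given name that never appears in the log file.
# Usage: missing_in_log <logfile> <name>...
# Assumption: a container counts as "reported" if its name occurs anywhere in
# the file; the log lines themselves are not parsed.
missing_in_log() {
    logfile=$1
    shift
    for name in "$@"; do
        # -F: treat the name as a fixed string, -q: no output, exit status only
        grep -Fq -- "$name" "$logfile" || printf '%s\n' "$name"
    done
}
```

For example, `missing_in_log /tmp/debug.log ovn_controller ovn_metadata_agent nova_libvirt` prints only the names that were not found. Because this is a substring search, a name that is a prefix of another container's name can be reported as present when it is not.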

Actual results:
Data is missing 

Expected results:
Data should be visible 


Additional info:
I'm attaching healthchecks.log from the customer environment.

Comment 3 Martin Magr 2022-11-29 20:44:59 UTC
There are indeed some containers that are not reported even though they have health checks associated with them. A fix for that has been submitted upstream.

Unfortunately, there are no health checks for "bundle" containers such as ovn-dbs-bundle-podman, haproxy-bundle-podman, redis-bundle-podman, rabbitmq-bundle-podman or galera-bundle-podman. To see which containers report health, and therefore which ones the customer can create alerts for in Prometheus, run `systemctl list-timers` and look for `tripleo_<container-name>_healthcheck.service`.
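
The check described above can be turned into a list of container names. This is a rough sketch only: the function name `healthcheck_containers` is hypothetical, and it assumes the unit names follow the `tripleo_<container-name>_healthcheck` pattern with a `.timer` or `.service` suffix and that container names contain only letters, digits, underscores and hyphens.

```shell
# Read `systemctl list-timers` style output on stdin and print the container
# name embedded in each tripleo_<container-name>_healthcheck unit, one per line.
healthcheck_containers() {
    grep -oE 'tripleo_[A-Za-z0-9_-]+_healthcheck\.(timer|service)' \
        | sed -E 's/^tripleo_(.+)_healthcheck\.(timer|service)$/\1/' \
        | sort -u
}
```

On a node this could be run as `systemctl list-timers --all --no-pager | healthcheck_containers`. Any container missing from the output has no health check timer, so no health data should be expected for it.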

Comment 10 Leonid Natapov 2023-03-29 07:57:56 UTC
openstack-tripleo-heat-templates-11.6.1-2.20230320130752.f1322eb.el8ost.noarch

Tested according to the test instructions in comment #7.
Verified.

Comment 11 Erin Peterson 2023-04-18 17:59:16 UTC
If you think customers need a description of this bug in addition to the content of the BZ summary field, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text.
 
If this bug does not require an additional Doc Text description, please set the 'requires_doc_text' flag to '-'.

Comment 17 errata-xmlrpc 2023-04-26 12:17:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.2.5 (Train) bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:1763

Comment 18 Red Hat Bugzilla 2023-09-19 04:30:47 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days