2156010 – Wrong alerts are coming on Prometheus dashboard. Services are up and running however wrong alerts are seen and giving wrong details.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-tripleo-heat-templates
Sub Component:
Version:	16.2 (Train)
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	z5
Target Release:	16.2 (Train on RHEL 8.4)
Assignee:	Martin Magr
QA Contact:	Leonid Natapov
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-12-23 12:05 UTC by Ganesh Kadam
Modified:	2023-09-19 04:31 UTC
CC List:	9 users
Fixed In Version:	openstack-tripleo-heat-templates-11.6.1-2.20230211104940.370c34a
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	2158781
Environment:
Last Closed:	2023-04-26 12:17:32 UTC
Target Upstream Version:
Embargoed:

Attachments
Updated collectd_check_health.py (4.63 KB, text/x-python3) 2023-02-09 15:11 UTC, Martin Magr	no flags
Updated openstack-healthcheck.conf (300 bytes, text/plain) 2023-02-09 15:14 UTC, Martin Magr	no flags

Links
System	ID	Priority	Status	Summary	Last Updated
OpenStack gerrit	870431	None	NEW	[TRAIN-ONLY] Add more cases for container health check	2023-04-25 15:40:17 UTC
Red Hat Issue Tracker	OSP-21031	None	None	None	2022-12-23 12:10:51 UTC
Red Hat Product Errata	RHBA-2023:1763	None	None	None	2023-04-26 12:17:44 UTC

Comment 49 Martin Magr 2023-02-09 15:11:52 UTC

Created attachment 1943110
Updated collectd_check_health.py

A working script for parsing healthchecks.log is also available at [1] and, for a limited time, at [2].

[1] https://raw.githubusercontent.com/paramite/tripleo-heat-templates/hc-dead-container/container_config_scripts/monitoring/collectd_check_health.py
[2] https://transfer.sh/XJROOp/collectd_check_health.py

Comment 50 Martin Magr 2023-02-09 15:14:20 UTC

Created attachment 1943111
Updated openstack-healthcheck.conf

Also available at [1] and for a limited time at [2].

[1] https://gist.github.com/paramite/12a43b481ba0003b7c493f3a64f09706
[2] https://transfer.sh/DYGNwJ/openstack-healthcheck.conf

Comment 51 Martin Magr 2023-02-10 10:08:59 UTC

Steps to fix the issue manually:

1. Copy file from comment #49 to /var/lib/container-config-scripts/collectd_check_health.py
2. Ensure correct permissions of the script:
  $ chmod a+rx /var/lib/container-config-scripts/collectd_check_health.py
  $ semanage fcontext -a -s system_u -t container_file_t /var/lib/container-config-scripts/collectd_check_health.py
  $ restorecon -FRv /var/lib/container-config-scripts/collectd_check_health.py
3. Copy file from comment #50 to /etc/rsyslog.d/openstack-healthcheck.conf
4. Restart rsyslog service and ensure /var/log/containers/collectd/healthchecks.log is being filled with logs:
  $ systemctl restart rsyslog
  $ tail -f /var/log/containers/collectd/healthchecks.log
5. Ensure all container health checks are executed at least every 5 minutes:
  $ (crontab -l 2>/dev/null; echo "*/5 * * * * systemctl list-timers | grep tripleo | awk '{print \$NF}' | xargs systemctl start") | crontab -
6. Update your alert queries according to the following template:
  last_over_time(sensubility_container_health_status{host=~"X-.+",process="Y"}[5m]) == 0
  - where X is a node name, e.g. controller or compute or ...
  - where Y is a container name, e.g. nova_compute or nova_libvirt or ovn_controller or ...
7. Wait at least 10s for all changes to propagate to Prometheus, then check your Prometheus UI.
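
The template from step 6 can be filled in mechanically when many node/container pairs need alerts. The following is only a minimal sketch: the helper name build_alert_expr and the example node/container pairs are illustrative assumptions, not part of the fix; only the metric name, the label names and the 5m window come from the template above.

```python
def build_alert_expr(node_prefix: str, container: str, window: str = "5m") -> str:
    """Fill in the step 6 template: X is the node name prefix, Y the container name.

    Hypothetical helper for illustration only; it just formats a string.
    """
    selector = f'host=~"{node_prefix}-.+",process="{container}"'
    return (
        f"last_over_time(sensubility_container_health_status{{{selector}}}[{window}])"
        " == 0"
    )


if __name__ == "__main__":
    # Example pairs are assumptions; substitute your own node and container names.
    for node, container in [("compute", "nova_compute"), ("compute", "nova_libvirt")]:
        print(build_alert_expr(node, container))
```

Each printed line can be pasted into an alerting rule as its expression; the alert fires when the last sample seen within the window is 0.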

Comment 54 Leonid Natapov 2023-03-29 07:56:50 UTC

Tested according to the instructions in comment #30.
Both unhealthy and not-running containers are reported in Prometheus as 0; healthy and running containers are reported as 1.
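
The verified behaviour can be summarised as a small truth table. This is a sketch of the observed mapping only, under the assumption that a container state reduces to two booleans; health_metric is a hypothetical name and this is not the logic of collectd_check_health.py, which parses healthchecks.log.

```python
def health_metric(running: bool, healthy: bool) -> int:
    """Return the value observed in Prometheus for a container state.

    Hypothetical illustration: 1 only when the container is both running
    and healthy, 0 for unhealthy or not-running containers.
    """
    return 1 if running and healthy else 0


if __name__ == "__main__":
    for running in (True, False):
        for healthy in (True, False):
            print(f"running={running} healthy={healthy} -> {health_metric(running, healthy)}")
```

Because a stopped container also yields 0, the alert template with `== 0` covers both the unhealthy and the not-running case.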

Comment 60 errata-xmlrpc 2023-04-26 12:17:32 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.2.5 (Train) bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:1763

Comment 61 Red Hat Bugzilla 2023-09-19 04:31:56 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days
