Bug 2156010
| Summary: | Wrong alerts are coming on Prometheus dashboard. Services are up and running however wrong alerts are seen and giving wrong details. | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Ganesh Kadam <gkadam> |
| Component: | openstack-tripleo-heat-templates | Assignee: | Martin Magr <mmagr> |
| Status: | CLOSED ERRATA | QA Contact: | Leonid Natapov <lnatapov> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 16.2 (Train) | CC: | abhijadh, augol, jschluet, lmadsen, mburns, mmagr, mrunge, parthee, sukar |
| Target Milestone: | z5 | Keywords: | Triaged |
| Target Release: | 16.2 (Train on RHEL 8.4) | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-tripleo-heat-templates-11.6.1-2.20230211104940.370c34a | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 2158781 | Environment: | |
| Last Closed: | 2023-04-26 12:17:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Comment 49
Martin Magr
2023-02-09 15:11:52 UTC
Created attachment 1943111
Updated openstack-healthcheck.conf

Also available at [1] and for a limited time at [2].

[1] https://gist.github.com/paramite/12a43b481ba0003b7c493f3a64f09706
[2] https://transfer.sh/DYGNwJ/openstack-healthcheck.conf

Steps to fix the issue manually:

1. Copy the file from comment #49 to /var/lib/container-config-scripts/collectd_check_health.py

2. Ensure the script has the correct permissions and SELinux context:

$ chmod a+rx /var/lib/container-config-scripts/collectd_check_health.py
$ semanage fcontext -a -s system_u -t container_file_t /var/lib/container-config-scripts/collectd_check_health.py
$ restorecon -FRv /var/lib/container-config-scripts/collectd_check_health.py

3. Copy the file from comment #50 to /etc/rsyslog.d/openstack-healthcheck.conf

4. Restart the rsyslog service and ensure /var/log/containers/collectd/healthchecks.log is being filled with logs:

$ systemctl restart rsyslog
$ tail -f /var/log/containers/collectd/healthchecks.log

5. Ensure all container health checks are executed at least every 5 minutes:

$ (crontab -l 2>/dev/null; echo "*/5 * * * * systemctl list-timers | grep tripleo | awk '{print \$NF}' | xargs systemctl start") | crontab -

6. Update your alert queries according to the following template:

last_over_time(sensubility_container_health_status{host=~"X-.+",process="Y"}[5m]) == 0

- where X is a node name, e.g. controller or compute or ...
- where Y is a container name, e.g. nova_compute or nova_libvirt or ovn_controller or ...

7. Wait at least 10s for all changes to propagate to Prometheus, then check your Prometheus UI.

Tested according to the instructions in comment #30. Both unhealthy and not-running containers are reported in Prometheus as 0; healthy and running containers are reported as 1.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Red Hat OpenStack Platform 16.2.5 (Train) bug fix and enhancement advisory), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:1763

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days
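
The alert query template from step 6 of the manual fix can be filled in mechanically when many node/container pairs need alerts. The following is only a minimal sketch: `build_health_alert_query` is a hypothetical helper name, not part of the fix or of any shipped tooling, and it assumes that node host names follow the `X-<suffix>` pattern implied by the template.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not from the bug report): fills in
# the step 6 template. X is the node-name prefix, Y the container name.
# The optional third argument is the lookback window (defaults to 5m,
# matching the template).
build_health_alert_query() {
    node_prefix=$1
    container=$2
    window=${3:-5m}
    printf 'last_over_time(sensubility_container_health_status{host=~"%s-.+",process="%s"}[%s]) == 0\n' \
        "$node_prefix" "$container" "$window"
}

# Example: one query per container name mentioned in step 6, for compute nodes.
for container in nova_compute nova_libvirt ovn_controller; do
    build_health_alert_query compute "$container"
done
```

Each printed line is a query that matches when the most recent health sample in the window is 0, which per the verification note covers both unhealthy and not-running containers.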