Bug 2223294

Summary: Collectd Sensubility doesn't work on OSP17.1 and RHEL8.
Product: Red Hat OpenStack Reporter: Leonid Natapov <lnatapov>
Component: collectd-sensubility Assignee: Martin Magr <mmagr>
Status: ASSIGNED --- QA Contact: Leonid Natapov <lnatapov>
Severity: high Docs Contact: mgeary <mgeary>
Priority: high    
Version: 17.1 (Wallaby) CC: gregraka, lmadsen, mmagr, mrunge, pgrist
Target Milestone: z1 Keywords: Triaged, ZStream
Target Release: 17.1 Flags: mmagr: needinfo-
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Known Issue
Doc Text:
There is a known issue when performing an in-place upgrade from RHOSP 16.2 to 17.1 GA. The collection agent, `collectd-sensubility`, fails to run on RHEL 8 Compute nodes. + Workaround: On affected nodes, edit the file `/var/lib/container-config-scripts/collectd_check_health.py` and, on line 26, replace `"healthy: .State.Health.Status}"` with `"healthy: .State.Healthcheck.Status}"`.
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Leonid Natapov 2023-07-17 09:49:31 UTC
Collectd Sensubility doesn't work on OSP17.1 and RHEL8.

The latest collectd is collectd-5.12.0-10.el8ost.

This scenario may only happen after FFU in a mixed RHEL environment when the compute node(s) are RHEL8. On a clean OSP17.1 deployment this scenario won't happen.


The error that I get in sensubility.log
---------------------------------------

\\\"/scripts/collectd_check_health.py\\\", line 91, in \\u003cmodule\\u003e\\n    rc, status = fetch_container_health(o.decode())\\n  File \\\"/scripts/collectd_check_health.py\\\", line 74, in fetch_container_health\\n    if len(item['healthy']) \\u003e 0 and item['status'] != 'stopped':\\nTypeError: object of type 'NoneType' has no len()\\n\",\"status\":\"1\"}}}"},"startsAt":"2023-07-14T11:12:27Z"}}]
[DEBUG] Requesting execution of check. [check: check-container-health]
[DEBUG] Executed check script. [output: Traceback (most recent call last):
  File "/scripts/collectd_check_health.py", line 91, in <module>
    rc, status = fetch_container_health(o.decode())
  File "/scripts/collectd_check_health.py", line 74, in fetch_container_health
    if len(item['healthy']) > 0 and item['status'] != 'stopped':
TypeError: object of type 'NoneType' has no len()

The problem is that the healthcheck script uses the podman inspect <container-name> command, whose output has apparently changed.
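The failure mode in the traceback can be reproduced in isolation. The following is only a minimal sketch, not the contents of collectd_check_health.py: it assumes the missing health field reaches the script as JSON null (Python None), and the record and the helper name health_condition are hypothetical. It shows why len() raises the TypeError and how the same condition could be written to tolerate a missing value.

```python
import json

# Hypothetical record: the health field resolved to JSON null because the
# inspect output no longer contains the expected key (assumption).
record = json.loads('{"service": "collectd", "status": "running", "healthy": null}')

# Calling len() on None reproduces the error message seen in sensubility.log.
error = None
try:
    len(record["healthy"])
except TypeError as exc:
    error = str(exc)


def health_condition(item):
    """Condition from line 74 of the traceback, written so that a missing
    (None) 'healthy' value is treated as an empty string.

    Hypothetical helper for illustration only.
    """
    healthy = item.get("healthy") or ""
    return len(healthy) > 0 and item.get("status") != "stopped"
```

With this guard the check no longer crashes on a None value, but it would also report no health status for the container, so correcting the field name as described in the workaround is still needed.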


Workaround:
-----------

In /var/lib/container-config-scripts/collectd_check_health.py, change line 26: s/"healthy: .State.Health.Status}"/"healthy: .State.Healthcheck.Status}"/
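The substitution above could also be applied with a small script. This is a sketch under assumptions: the helper names patch_text and patch_file are hypothetical, the script replaces the string wherever it occurs rather than only on line 26, and it keeps a .bak copy of the original file next to it.

```python
from pathlib import Path

# Strings taken from the workaround above.
OLD = "healthy: .State.Health.Status}"
NEW = "healthy: .State.Healthcheck.Status}"


def patch_text(text):
    """Return text with the old field path replaced by the new one.

    Already patched text is returned unchanged, because NEW does not
    contain OLD as a substring.
    """
    return text.replace(OLD, NEW)


def patch_file(path):
    """Patch the file in place and return True if it was modified.

    A backup of the original content is written to <name>.bak first.
    """
    target = Path(path)
    original = target.read_text()
    patched = patch_text(original)
    if patched == original:
        return False
    target.with_name(target.name + ".bak").write_text(original)
    target.write_text(patched)
    return True
```

On an affected RHEL 8 compute node this would be run with sufficient privileges as patch_file("/var/lib/container-config-scripts/collectd_check_health.py"); a return value of False means the string was not found or the file was already changed.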