OSP17 | Collectd Sensubility | Collectd sensubility does not work in OSP17 because of structural changes introduced in that release. Starting with OSP17, healthchecks are apparently no longer executed via systemd, which means the container healthcheck script does not find the expected data in the log. The healthcheck exec no longer reports "healthy" or "unhealthy" at all; it only adds the status to a log history kept by podman itself.
Yes, unfortunately "healthy" or "unhealthy" is no longer reported in the output, so the healthcheck.stdout log does not contain the information collectd_check_health.py needs to parse container status correctly. Sadly, no similar information is logged anywhere else, so we cannot keep parsing the log. Luckily, a podman.socket service is available in OSP17, so we can query the health of each container with podman-remote from within the collectd container; the problem is that the podman.socket service is neither started nor enabled after deployment. So we need to do the following:
1. get in touch with DFG:DF to figure out the best way to have podman.socket started and enabled by default, or at least when sensubility is enabled
2. add the "podman-remote" and "jq" packages to the collectd container image
3. modify the collectd_check_health.py script to parse the output of "podman-remote --url unix://run/podman/podman.sock inspect <container-name> | jq '.[0]["State"]["Health"]'" instead of the log file
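A rough sketch of what step 3 could look like inside collectd_check_health.py — the `parse_health`/`container_health` names and the trimmed-down inspect payload are illustrative, not the actual script; the real command line is the podman-remote invocation quoted above:

```python
import json
import subprocess


def parse_health(inspect_json: bytes) -> str:
    # podman inspect returns a JSON list with one object per container;
    # the health status lives under State.Health.Status
    data = json.loads(inspect_json)
    return data[0]["State"]["Health"]["Status"]


def container_health(name: str) -> str:
    """Query podman over its unix socket instead of parsing the
    healthcheck log, which no longer exists in OSP17."""
    out = subprocess.check_output(
        ["podman-remote", "--url", "unix://run/podman/podman.sock",
         "inspect", name])
    return parse_health(out)


# Example with a minimal, hypothetical inspect payload:
sample = b'[{"State": {"Health": {"Status": "healthy", "FailingStreak": 0}}}]'
print(parse_health(sample))  # healthy
```

With this approach the `jq` filter from the plan above becomes a plain dictionary lookup, so the script gets the status directly from podman rather than scraping text out of a log file.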
podman has been updated in OSP17, so the tripleo code which created systemd timers/services for health checks has been removed, because newer podman has this built in. Unfortunately, the built-in systemd services log slightly differently, and the important log records are now missing from the podman log. We are going to need to refactor the sensubility container check to use the podman socket instead of parsing the podman log. The good thing is that this will be a cleaner and more flexible solution. We are getting back to what we had available with docker up to OSP13.
Failed QA. Tested with:
RHOS-17.0-RHEL-9-20220811.n.0
openstack-tripleo-heat-templates-14.3.1-0.20220719171716.feca772.el9ost.noarch
openstack-tripleo-common-15.4.1-0.20220705010407.51f6577.el9ost.noarch

sensubility.log shows the following error:
---------------
[DEBUG] Sending AMQP1.0 message [address: sensubility/osp17-telemetry, body: {"labels":{"check":"check-container-health","client":"controller-0.redhat.local","severity":"WARNING"},"annotations":{"command":"/scripts/collectd_check_health.py","duration":0.08489357,"executed":1660640135,"issued":1660640135,"output":"Failed to list containers:\n\ntime=\"2022-08-16T08:55:35Z\" level=error msg=\"stat /root/.config/containers/storage.conf: permission denied\"\n\n","status":1,"ves":"{\"commonEventHeader\":{\"domain\":\"heartbeat\",\"eventType\":\"checkResult\",\"eventId\":\"controller-0.redhat.local-check-container-health\",\"priority\":\"High\",\"reportingEntityId\":\"e9ffba1b-d02f-4184-97b3-a356b55c3ac7\",\"reportingEntityName\":\"controller-0.redhat.local\",\"sourceId\":\"e9ffba1b-d02f-4184-97b3-a356b55c3ac7\",\"sourceName\":\"controller-0.redhat.local-collectd-sensubility\",\"startingEpochMicrosec\":1660640135,\"lastEpochMicrosec\":1660640135},\"heartbeatFields\":{\"additionalFields\":{\"check\":\"check-container-health\",\"command\":\"/scripts/collectd_check_health.py\",\"duration\":\"0.084894\",\"executed\":\"1660640135\",\"issued\":\"1660640135\",\"output\":\"Failed to list containers:\\n\\ntime=\\\"2022-08-16T08:55:35Z\\\" level=error msg=\\\"stat /root/.config/containers/storage.conf: permission denied\\\"\\n\\n\",\"status\":\"1\"}}}"},"startsAt":"2022-08-16T08:55:35Z"}]
[DEBUG] Requesting execution of check. [check: check-container-health]
[DEBUG] Executed check script. [status: 1, output: Failed to list containers: time="2022-08-16T08:55:45Z" level=error msg="stat /root/.config/containers/storage.conf: permission denied"
@mmagr please update the doc_text from Known Issue to the appropriate state, as this has moved to MODIFIED. Please coordinate with QE in case this fails QA and we need to keep the Known Issue documentation text for the release notes.
FailedQA. The fix entirely breaks OSP17 overcloud deployment with STF.
2022-09-01 07:54:03.945 64072 ERROR tripleoclient.v1.overcloud_deploy.DeployOvercloud [-] Exception occured while running the command: ValueError: Failed to deploy: ERROR: resources.CephStorageServiceChain<file:///home/stack/overcloud-deploy/overcloud/tripleo-heat-templates/common/services/cephstorage-role.yaml>.resources.ServiceChain<nested_stack>.resources.3<file:///home/stack/overcloud-deploy/overcloud/tripleo-heat-templates/deployment/metrics/collectd-container-puppet.yaml>.outputs.role_data.value.docker_config.step_2.if: : collectd_init_perm.image.get_attr: The specified reference "RoleParametersValue" (in unknown) is incorrect.
FailedQA. Overcloud deployment passes but sensubility still doesn't work. Moving back to ASSIGNED.
---------------------------------------------------------------------------------------------------
[DEBUG] Sending AMQP1.0 message [body: {"labels":{"check":"check-container-health","client":"compute-0.redhat.local","severity":"WARNING"},"annotations":{"command":"/scripts/collectd_check_health.py","duration":0.705429861,"executed":1662269566,"issued":1662269566,"output":"Failed to list containers:\n\nCannot connect to Podman. Please verify your connection to the Linux system using `podman system connection list`, or try `podman machine init` and `podman machine start` to manage a new Linux VM\nError: unable to connect to Podman socket: Get \"http://d/v4.1.1/libpod/_ping\": dial unix ///run/podman/podman.sock: connect: permission denied\n\n","status":1,"ves":"{\"commonEventHeader\":{\"domain\":\"heartbeat\",\"eventType\":\"checkResult\",\"eventId\":\"compute-0.redhat.local-check-container-health\",\"priority\":\"High\",\"reportingEntityId\":\"8210d7be-c5c3-4a7e-95f6-754afb85d1c2\",\"reportingEntityName\":\"compute-0.redhat.local\",\"sourceId\":\"8210d7be-c5c3-4a7e-95f6-754afb85d1c2\",\"sourceName\":\"compute-0.redhat.local-collectd-sensubility\",\"startingEpochMicrosec\":1662269566,\"lastEpochMicrosec\":1662269566},\"heartbeatFields\":{\"additionalFields\":{\"check\":\"check-container-health\",\"command\":\"/scripts/collectd_check_health.py\",\"duration\":\"0.705430\",\"executed\":\"1662269566\",\"issued\":\"1662269566\",\"output\":\"Failed to list containers:\\n\\nCannot connect to Podman. Please verify your connection to the Linux system using `podman system connection list`, or try `podman machine init` and `podman machine start` to manage a new Linux VM\\nError: unable to connect to Podman socket: Get \\\"http://d/v4.1.1/libpod/_ping\\\": dial unix ///run/podman/podman.sock: connect: permission denied\\n\\n\",\"status\":\"1\"}}}"},"startsAt":"2022-09-04T05:32:46Z"}, address: sensubility/osp17-telemetry]
[DEBUG] Requesting execution of check. [check: check-container-health]
[DEBUG] Executed check script. [command: /scripts/collectd_check_health.py, status: 1, output: Failed to list containers: Cannot connect to Podman. Please verify your connection to the Linux system using `podman system connection list`, or try `podman machine init` and `podman machine start` to manage a new Linux VM Error: unable to connect to Podman socket: Get "http://d/v4.1.1/libpod/_ping": dial unix ///run/podman/podman.sock: connect: permission denied
We will also need to patch the SELinux policies to allow systemd to create the socket in /var/lib/config-data.
Relevant AVCs:
----
time->Thu Sep 15 08:46:29 2022
type=AVC msg=audit(1663231589.213:223510): avc: denied { create } for pid=1 comm="systemd" name="podman.sock" scontext=system_u:system_r:init_t:s0 tcontext=system_u:object_r:container_file_t:s0 tclass=sock_file permissive=1
----
time->Thu Sep 15 08:46:29 2022
type=AVC msg=audit(1663231589.213:223511): avc: denied { write } for pid=1 comm="systemd" name="podman.sock" dev="vda4" ino=143041949 scontext=system_u:system_r:init_t:s0 tcontext=system_u:object_r:container_file_t:s0 tclass=sock_file permissive=1
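For reference, the two AVCs above would translate into a local policy module roughly like the following. This is only a sketch derived from the denials (the module name is made up); the authoritative change is whatever lands in openstack-selinux:

```
# Hypothetical local module; the real fix belongs in openstack-selinux.
module local_podman_sock 1.0;

require {
    type init_t;
    type container_file_t;
    class sock_file { create write };
}

# Let systemd (init_t) create and write podman.sock, which ends up
# labeled container_file_t under /var/lib/config-data.
allow init_t container_file_t:sock_file { create write };
```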
Relevant PR for the openstack-selinux: https://github.com/redhat-openstack/openstack-selinux/pull/101
According to our records, this should be resolved by openstack-tripleo-common-15.4.1-0.20220705010409.51f6577.el9ost. This build is available now.
sensubility.log shows no errors. The Grafana representation of the APIs is correct.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 17.0.1 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:0271