Bug 2091076 - [RHOSP 17.0] collectd sensubility doesn't work in OSP17 as a result of structural changes
Summary: [RHOSP 17.0] collectd sensubility doesn't work in OSP17 as a result of structural changes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 17.0 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: z1
Target Release: 17.0
Assignee: Martin Magr
QA Contact: Leonid Natapov
Docs Contact: mgeary
URL:
Whiteboard:
Depends On:
Blocks: 2124294 2152888
 
Reported: 2022-05-27 14:17 UTC by Leonid Natapov
Modified: 2023-01-25 12:29 UTC
CC List: 15 users

Fixed In Version: openstack-tripleo-heat-templates-14.3.1-0.20221124130331.feca772.el9ost tripleo-ansible-3.3.1-0.20221123230736.fa5422f.el9ost openstack-tripleo-common-15.4.1-0.20220705010407.51f6577.el9ost openstack-selinux-0.8.34-0.20221101160640.a82a63a.el9ost
Doc Type: Bug Fix
Doc Text:
Before this update, unavailability of the Podman log content caused the health check status script to fail. With this update, an update to the health check status script resolves the issue by using the Podman socket instead of the Podman log. As a result, API health checks, provided through sensubility for Service Telemetry Framework, are now operational.
Clone Of:
Cloned To: 2152888
Environment:
Last Closed: 2023-01-25 12:28:50 UTC
Target Upstream Version:
Embargoed:
joflynn: needinfo-




Links
System ID Private Priority Status Summary Last Updated
Github redhat-openstack openstack-selinux pull 101 0 None Merged Allow init_t to create and mange socket in container_file_t 2022-12-16 20:53:41 UTC
OpenStack gerrit 431570 0 None ABANDONED libvirt: avoid generating script with empty path 2022-12-16 20:53:44 UTC
OpenStack gerrit 848597 0 None MERGED Add dependency for container healthcheck script 2022-12-16 20:53:46 UTC
OpenStack gerrit 850929 0 None MERGED Update sensubility's container health check 2022-12-16 20:53:48 UTC
OpenStack gerrit 854139 0 None MERGED Fix collectd-sensubility script output 2022-12-16 20:53:52 UTC
OpenStack gerrit 854140 0 None MERGED Make sure sensubility has proper permission 2022-12-16 20:53:55 UTC
OpenStack gerrit 854814 0 None MERGED Set /run/podman ACL before starting collectd 2022-12-16 20:53:58 UTC
OpenStack gerrit 854991 0 None MERGED Set /run/podman ACL before starting collectd 2022-12-16 20:54:01 UTC
OpenStack gerrit 857855 0 None MERGED Add podman_socket role 2022-12-16 20:54:04 UTC
OpenStack gerrit 857857 0 None MERGED Move podman socket 2022-12-16 20:54:07 UTC
Red Hat Issue Tracker OSP-15422 0 None None None 2022-05-27 14:37:55 UTC
Red Hat Product Errata RHBA-2023:0271 0 None None None 2023-01-25 12:29:35 UTC

Description Leonid Natapov 2022-05-27 14:17:26 UTC
OSP17 | Collectd Sensubility | Collectd sensubility doesn't work as a result of structural changes in OSP17.

Starting with OSP17, health checks are apparently no longer executed via systemd,
which means the container health check script does not parse the expected data out of the log.

The health check exec simply does not report "healthy" or "unhealthy" anymore;
it just adds the status to the log history in podman itself.

Comment 2 Martin Magr 2022-05-27 14:26:42 UTC
Yes, unfortunately the "healthy"/"unhealthy" output is not reported any more, so the healthcheck.stdout log does not contain the information collectd_check_health.py needs to parse the status of containers correctly. Sadly, no similar information is logged at all, so we cannot continue parsing the log. Luckily, a podman.socket service is available in OSP17, so we can query the health of each container using podman-remote from within the collectd container; the problem is that the podman.socket service is not started and enabled after deployment.

So we need to do the following:

1. get in touch with DFG:DF to figure out the best way to have podman.socket started and enabled by default, or whenever sensubility is enabled
2. add the "podman-remote" and "jq" packages to the collectd container image
3. modify the collectd_check_health.py script to parse the output of "podman-remote --url unix://run/podman/podman.sock inspect <container-name> | jq '.[0]["State"]["Health"]'" instead of the log file (a sketch follows below)
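
For illustration, a minimal Python sketch of step 3 (not the actual collectd_check_health.py change; the container names are examples and the socket path assumes the default podman.socket location):

----
#!/usr/bin/env python3
# Sketch: query container health over the podman socket with podman-remote
# and read the health state from the inspect JSON (the json module plays
# the role of jq in the shell one-liner above).
import json
import subprocess

PODMAN_URL = "unix://run/podman/podman.sock"

def container_health(name):
    out = subprocess.check_output(
        ["podman-remote", "--url", PODMAN_URL, "inspect", name])
    # Equivalent to: jq '.[0]["State"]["Health"]'
    return json.loads(out)[0]["State"]["Health"]

for name in ("collectd", "nova_api"):
    print(name, container_health(name).get("Status"))  # "healthy" / "unhealthy"
----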

Comment 6 Martin Magr 2022-06-20 20:00:03 UTC
podman has been updated in OSP17, so the tripleo code that created systemd timers/services for health checks has been removed; newer podman has this built in. Unfortunately, the built-in systemd services log slightly differently, and the important log records are now missing from the podman log. We are going to need to refactor the sensubility container check to use the podman socket instead of parsing the podman log. The good thing is that this will be a cleaner and more flexible solution; we are getting back to what we had available with docker up to OSP13.
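
For illustration only, the same information can also be read straight from the libpod REST API over the socket, without podman-remote; a rough standard-library sketch (API version prefix and field names assumed per the podman 4.x libpod API, default socket path):

----
#!/usr/bin/env python3
# Sketch: talk HTTP to the libpod API over the unix socket.
import http.client
import json
import socket

class UnixHTTPConnection(http.client.HTTPConnection):
    def __init__(self, sock_path):
        super().__init__("localhost")
        self.sock_path = sock_path

    def connect(self):
        # Swap the TCP connection for the podman unix socket.
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.sock_path)

conn = UnixHTTPConnection("/run/podman/podman.sock")
conn.request("GET", "/v4.0.0/libpod/containers/json")
for c in json.loads(conn.getresponse().read()):
    print(c["Names"][0], c["State"])  # container name and run state
----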

Comment 17 Leonid Natapov 2022-08-16 09:14:32 UTC
Failed QA:

Tested with:
RHOS-17.0-RHEL-9-20220811.n.0
openstack-tripleo-heat-templates-14.3.1-0.20220719171716.feca772.el9ost.noarch
openstack-tripleo-common-15.4.1-0.20220705010407.51f6577.el9ost.noarch

sensubility.log shows the following error:
---------------


[DEBUG] Sending AMQP1.0 message [address: sensubility/osp17-telemetry, body: {"labels":{"check":"check-container-health","client":"controller-0.redhat.local","severity":"WARNING"},"annotations":{"command":"/scripts/collectd_check_health.py","duration":0.08489357,"executed":1660640135,"issued":1660640135,"output":"Failed to list containers:\n\ntime=\"2022-08-16T08:55:35Z\" level=error msg=\"stat /root/.config/containers/storage.conf: permission denied\"\n\n","status":1,"ves":"{\"commonEventHeader\":{\"domain\":\"heartbeat\",\"eventType\":\"checkResult\",\"eventId\":\"controller-0.redhat.local-check-container-health\",\"priority\":\"High\",\"reportingEntityId\":\"e9ffba1b-d02f-4184-97b3-a356b55c3ac7\",\"reportingEntityName\":\"controller-0.redhat.local\",\"sourceId\":\"e9ffba1b-d02f-4184-97b3-a356b55c3ac7\",\"sourceName\":\"controller-0.redhat.local-collectd-sensubility\",\"startingEpochMicrosec\":1660640135,\"lastEpochMicrosec\":1660640135},\"heartbeatFields\":{\"additionalFields\":{\"check\":\"check-container-health\",\"command\":\"/scripts/collectd_check_health.py\",\"duration\":\"0.084894\",\"executed\":\"1660640135\",\"issued\":\"1660640135\",\"output\":\"Failed to list containers:\\n\\ntime=\\\"2022-08-16T08:55:35Z\\\" level=error msg=\\\"stat /root/.config/containers/storage.conf: permission denied\\\"\\n\\n\",\"status\":\"1\"}}}"},"startsAt":"2022-08-16T08:55:35Z"}]
[DEBUG] Requesting execution of check. [check: check-container-health]
[DEBUG] Executed check script. [status: 1, output: Failed to list containers:

time="2022-08-16T08:55:45Z" level=error msg="stat /root/.config/containers/storage.conf: permission denied"

Comment 36 Leif Madsen 2022-08-30 19:45:34 UTC
@mmagr please update the doc_text from Known Issue to the appropriate state, as this has moved to MODIFIED. Please coordinate with QE in case this fails QA and we need to keep the Known Issue documentation text for the release notes.

Comment 38 Leonid Natapov 2022-09-01 08:52:13 UTC
FailedQA.

The fix entirely breaks OSP17 overcloud deployment with STF.


2022-09-01 07:54:03.945 64072 ERROR tripleoclient.v1.overcloud_deploy.DeployOvercloud [-] Exception occured while running the command: ValueError: Failed to deploy: ERROR: resources.CephStorageServiceChain<file:///home/stack/overcloud-deploy/overcloud/tripleo-heat-templates/common/services/cephstorage-role.yaml>.resources.ServiceChain<nested_stack>.resources.3<file:///home/stack/overcloud-deploy/overcloud/tripleo-heat-templates/deployment/metrics/collectd-container-puppet.yaml>.outputs.role_data.value.docker_config.step_2.if: : collectd_init_perm.image.get_attr: The specified reference "RoleParametersValue" (in unknown) is incorrect.

Comment 43 Leonid Natapov 2022-09-04 05:46:59 UTC
FailedQA. Overcloud deployment passes but sensubility still doesn't work. Moving back to ASSIGNED.
---------------------------------------------------------------------------------------------------


[DEBUG] Sending AMQP1.0 message [body: {"labels":{"check":"check-container-health","client":"compute-0.redhat.local","severity":"WARNING"},"annotations":{"command":"/scripts/collectd_check_health.py","duration":0.705429861,"executed":1662269566,"issued":1662269566,"output":"Failed to list containers:\n\nCannot connect to Podman. Please verify your connection to the Linux system using `podman system connection list`, or try `podman machine init` and `podman machine start` to manage a new Linux VM\nError: unable to connect to Podman socket: Get \"http://d/v4.1.1/libpod/_ping\": dial unix ///run/podman/podman.sock: connect: permission denied\n\n","status":1,"ves":"{\"commonEventHeader\":{\"domain\":\"heartbeat\",\"eventType\":\"checkResult\",\"eventId\":\"compute-0.redhat.local-check-container-health\",\"priority\":\"High\",\"reportingEntityId\":\"8210d7be-c5c3-4a7e-95f6-754afb85d1c2\",\"reportingEntityName\":\"compute-0.redhat.local\",\"sourceId\":\"8210d7be-c5c3-4a7e-95f6-754afb85d1c2\",\"sourceName\":\"compute-0.redhat.local-collectd-sensubility\",\"startingEpochMicrosec\":1662269566,\"lastEpochMicrosec\":1662269566},\"heartbeatFields\":{\"additionalFields\":{\"check\":\"check-container-health\",\"command\":\"/scripts/collectd_check_health.py\",\"duration\":\"0.705430\",\"executed\":\"1662269566\",\"issued\":\"1662269566\",\"output\":\"Failed to list containers:\\n\\nCannot connect to Podman. Please verify your connection to the Linux system using `podman system connection list`, or try `podman machine init` and `podman machine start` to manage a new Linux VM\\nError: unable to connect to Podman socket: Get \\\"http://d/v4.1.1/libpod/_ping\\\": dial unix ///run/podman/podman.sock: connect: permission denied\\n\\n\",\"status\":\"1\"}}}"},"startsAt":"2022-09-04T05:32:46Z"}, address: sensubility/osp17-telemetry]
[DEBUG] Requesting execution of check. [check: check-container-health]
[DEBUG] Executed check script. [command: /scripts/collectd_check_health.py, status: 1, output: Failed to list containers:

Cannot connect to Podman. Please verify your connection to the Linux system using `podman system connection list`, or try `podman machine init` and `podman machine start` to manage a new Linux VM
Error: unable to connect to Podman socket: Get "http://d/v4.1.1/libpod/_ping": dial unix ///run/podman/podman.sock: connect: permission denied

Comment 55 Martin Magr 2022-09-15 11:18:58 UTC
We will also need to patch the selinux policies to allow systemd to create the socket in /var/lib/config-data.

Comment 56 Martin Magr 2022-09-15 11:37:25 UTC
Relevant AVCs

----
time->Thu Sep 15 08:46:29 2022
type=AVC msg=audit(1663231589.213:223510): avc:  denied  { create } for  pid=1 comm="systemd" name="podman.sock" scontext=system_u:system_r:init_t:s0 tcontext=system_u:object_r:container_file_t:s0 tclass=sock_file permissive=1
----
time->Thu Sep 15 08:46:29 2022
type=AVC msg=audit(1663231589.213:223511): avc:  denied  { write } for  pid=1 comm="systemd" name="podman.sock" dev="vda4" ino=143041949 scontext=system_u:system_r:init_t:s0 tcontext=system_u:object_r:container_file_t:s0 tclass=sock_file permissive=1
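
For reference, the two denials above ({ create } and { write } on a sock_file, source init_t, target container_file_t) translate to a policy rule along these lines; the actual change is the openstack-selinux PR linked in comment 57:

allow init_t container_file_t:sock_file { create write };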

Comment 57 Cédric Jeanneret 2022-09-15 11:42:59 UTC
Relevant PR for the openstack-selinux:
https://github.com/redhat-openstack/openstack-selinux/pull/101

Comment 60 OSP Team 2022-11-25 11:45:07 UTC
According to our records, this should be resolved by openstack-tripleo-common-15.4.1-0.20220705010409.51f6577.el9ost.  This build is available now.

Comment 67 Leonid Natapov 2023-01-17 16:08:15 UTC
sensubility.log shows no errors. The Grafana representation of the APIs is correct.

Comment 73 errata-xmlrpc 2023-01-25 12:28:50 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 17.0.1 bug fix and enhancement advisory), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:0271

