Bug 2091076
| Summary: | [RHOSP 17.0] collectd sensubility doesn't work in OSP17 as a result of structural changes | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Leonid Natapov <lnatapov> |
| Component: | openstack-tripleo-heat-templates | Assignee: | Martin Magr <mmagr> |
| Status: | CLOSED ERRATA | QA Contact: | Leonid Natapov <lnatapov> |
| Severity: | high | Docs Contact: | mgeary <mgeary> |
| Priority: | urgent | | |
| Version: | 17.0 (Wallaby) | CC: | astillma, cjeanner, erpeters, jamsmith, joflynn, jpichon, jschluet, lmadsen, mburns, mmagr, mrunge, pgrist, rheslop, spower, stchen |
| Target Milestone: | z1 | Keywords: | Regression, Triaged |
| Target Release: | 17.0 | Flags: | joflynn: needinfo- |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-tripleo-heat-templates-14.3.1-0.20221124130331.feca772.el9ost tripleo-ansible-3.3.1-0.20221123230736.fa5422f.el9ost openstack-tripleo-common-15.4.1-0.20220705010407.51f6577.el9ost openstack-selinux-0.8.34-0.20221101160640.a82a63a.el9ost | Doc Type: | Bug Fix |
| Doc Text: | Before this update, unavailability of the Podman log content caused the health check status script to fail. With this update, the health check status script uses the Podman socket instead of the Podman log. As a result, API health checks, provided through sensubility for Service Telemetry Framework, are now operational. | Story Points: | --- |
| Clone Of: | | | |
| | 2152888 (view as bug list) | Environment: | |
| Last Closed: | 2023-01-25 12:28:50 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2124294, 2152888 | | |
Description Leonid Natapov 2022-05-27 14:17:26 UTC
Yes, unfortunately an output of "healthy" or "unhealthy" is no longer reported, so the healthcheck.stdout log does not contain the information that collectd_check_health.py needs to parse container status correctly. Sadly, no similar information is logged anywhere else, so we cannot continue parsing the log. Luckily, a podman.socket service is available in OSP17, so we can query the health of each container with podman-remote from within the collectd container; the problem is that the podman.socket service is not started and enabled after deployment. We therefore need to do the following:

1. Get in touch with DFG:DF to figure out the best way to have podman.socket started and enabled by default, or at least when sensubility is enabled.
2. Add the "podman-remote" and "jq" packages to the collectd container image.
3. Modify the collectd_check_health.py script to parse the output of "podman-remote --url unix://run/podman/podman.sock inspect <container-name> | jq '.[0]["State"]["Health"]'" instead of the log file (a minimal sketch of this approach follows the log excerpt below).

Podman has been updated in OSP17, so the TripleO code that created systemd timers/services for health checks has been removed; newer Podman has this built in. Unfortunately, the built-in systemd services log slightly differently, and the log records we relied on are now missing from the Podman log. We are going to need to refactor the sensubility container check to use the Podman socket instead of parsing the Podman log. The good thing is that this will be a cleaner and more flexible solution; we are getting back to what we had available with Docker up to OSP13.

Failed QA:
Tested with:
RHOS-17.0-RHEL-9-20220811.n.0
openstack-tripleo-heat-templates-14.3.1-0.20220719171716.feca772.el9ost.noarch
openstack-tripleo-common-15.4.1-0.20220705010407.51f6577.el9ost.noarch
sensubility.log shows the following error:
---------------
[DEBUG] Sending AMQP1.0 message [address: sensubility/osp17-telemetry, body: {"labels":{"check":"check-container-health","client":"controller-0.redhat.local","severity":"WARNING"},"annotations":{"command":"/scripts/collectd_check_health.py","duration":0.08489357,"executed":1660640135,"issued":1660640135,"output":"Failed to list containers:\n\ntime=\"2022-08-16T08:55:35Z\" level=error msg=\"stat /root/.config/containers/storage.conf: permission denied\"\n\n","status":1,"ves":"{\"commonEventHeader\":{\"domain\":\"heartbeat\",\"eventType\":\"checkResult\",\"eventId\":\"controller-0.redhat.local-check-container-health\",\"priority\":\"High\",\"reportingEntityId\":\"e9ffba1b-d02f-4184-97b3-a356b55c3ac7\",\"reportingEntityName\":\"controller-0.redhat.local\",\"sourceId\":\"e9ffba1b-d02f-4184-97b3-a356b55c3ac7\",\"sourceName\":\"controller-0.redhat.local-collectd-sensubility\",\"startingEpochMicrosec\":1660640135,\"lastEpochMicrosec\":1660640135},\"heartbeatFields\":{\"additionalFields\":{\"check\":\"check-container-health\",\"command\":\"/scripts/collectd_check_health.py\",\"duration\":\"0.084894\",\"executed\":\"1660640135\",\"issued\":\"1660640135\",\"output\":\"Failed to list containers:\\n\\ntime=\\\"2022-08-16T08:55:35Z\\\" level=error msg=\\\"stat /root/.config/containers/storage.conf: permission denied\\\"\\n\\n\",\"status\":\"1\"}}}"},"startsAt":"2022-08-16T08:55:35Z"}]
[DEBUG] Requesting execution of check. [check: check-container-health]
[DEBUG] Executed check script. [status: 1, output: Failed to list containers:
time="2022-08-16T08:55:45Z" level=error msg="stat /root/.config/containers/storage.conf: permission denied"
@mmagr please update the doc_text from Known Issue to the appropriate state, as this has moved to MODIFIED. Please coordinate with QE in case this fails QE and we need to keep the Known Issue documentation text for the release notes.

FailedQA. The fix entirely breaks OSP17 overcloud deployment with STF:

2022-09-01 07:54:03.945 64072 ERROR tripleoclient.v1.overcloud_deploy.DeployOvercloud [-] Exception occured while running the command: ValueError: Failed to deploy: ERROR: resources.CephStorageServiceChain<file:///home/stack/overcloud-deploy/overcloud/tripleo-heat-templates/common/services/cephstorage-role.yaml>.resources.ServiceChain<nested_stack>.resources.3<file:///home/stack/overcloud-deploy/overcloud/tripleo-heat-templates/deployment/metrics/collectd-container-puppet.yaml>.outputs.role_data.value.docker_config.step_2.if: : collectd_init_perm.image.get_attr: The specified reference "RoleParametersValue" (in unknown) is incorrect.

FailedQA. Overcloud deployment passes but sensubility still doesn't work. Moving back to ASSIGNED.
---------------------------------------------------------------------------------------------------
[DEBUG] Sending AMQP1.0 message [body: {"labels":{"check":"check-container-health","client":"compute-0.redhat.local","severity":"WARNING"},"annotations":{"command":"/scripts/collectd_check_health.py","duration":0.705429861,"executed":1662269566,"issued":1662269566,"output":"Failed to list containers:\n\nCannot connect to Podman. Please verify your connection to the Linux system using `podman system connection list`, or try `podman machine init` and `podman machine start` to manage a new Linux VM\nError: unable to connect to Podman socket: Get \"http://d/v4.1.1/libpod/_ping\": dial unix ///run/podman/podman.sock: connect: permission denied\n\n","status":1,"ves":"{\"commonEventHeader\":{\"domain\":\"heartbeat\",\"eventType\":\"checkResult\",\"eventId\":\"compute-0.redhat.local-check-container-health\",\"priority\":\"High\",\"reportingEntityId\":\"8210d7be-c5c3-4a7e-95f6-754afb85d1c2\",\"reportingEntityName\":\"compute-0.redhat.local\",\"sourceId\":\"8210d7be-c5c3-4a7e-95f6-754afb85d1c2\",\"sourceName\":\"compute-0.redhat.local-collectd-sensubility\",\"startingEpochMicrosec\":1662269566,\"lastEpochMicrosec\":1662269566},\"heartbeatFields\":{\"additionalFields\":{\"check\":\"check-container-health\",\"command\":\"/scripts/collectd_check_health.py\",\"duration\":\"0.705430\",\"executed\":\"1662269566\",\"issued\":\"1662269566\",\"output\":\"Failed to list containers:\\n\\nCannot connect to Podman. Please verify your connection to the Linux system using `podman system connection list`, or try `podman machine init` and `podman machine start` to manage a new Linux VM\\nError: unable to connect to Podman socket: Get \\\"http://d/v4.1.1/libpod/_ping\\\": dial unix ///run/podman/podman.sock: connect: permission denied\\n\\n\",\"status\":\"1\"}}}"},"startsAt":"2022-09-04T05:32:46Z"}, address: sensubility/osp17-telemetry]
[DEBUG] Requesting execution of check. [check: check-container-health]
[DEBUG] Executed check script. [command: /scripts/collectd_check_health.py, status: 1, output: Failed to list containers:
Cannot connect to Podman. Please verify your connection to the Linux system using `podman system connection list`, or try `podman machine init` and `podman machine start` to manage a new Linux VM
Error: unable to connect to Podman socket: Get "http://d/v4.1.1/libpod/_ping": dial unix ///run/podman/podman.sock: connect: permission denied
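To illustrate what is failing here, the sketch below repeats the libpod _ping request from the error message above over the same unix socket, to show whether the calling user can reach the Podman API at all. It is only a hypothetical diagnostic, not part of the shipped fix; the socket path and request path are taken verbatim from the log, everything else is an assumption.

```python
#!/usr/bin/env python3
import socket
import sys

# Podman API socket that the check above fails to open with "permission denied".
SOCK_PATH = "/run/podman/podman.sock"

try:
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.settimeout(5)
    s.connect(SOCK_PATH)  # this connect() is what the permission denial blocks
    # Same ping endpoint as in the error message above.
    s.sendall(b"GET /v4.1.1/libpod/_ping HTTP/1.0\r\nHost: d\r\n\r\n")
    reply = s.recv(4096).decode(errors="replace")
    # An "HTTP/1.x 200 OK" status line means the socket is reachable.
    print(reply.splitlines()[0] if reply else "connected, but no response")
    sys.exit(0)
except OSError as exc:
    print(f"cannot reach {SOCK_PATH}: {exc}", file=sys.stderr)
    sys.exit(1)
```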
We will also need to patch the SELinux policies to allow systemd to create the socket in /var/lib/config-data. Relevant AVCs:
----
time->Thu Sep 15 08:46:29 2022
type=AVC msg=audit(1663231589.213:223510): avc: denied { create } for pid=1 comm="systemd" name="podman.sock" scontext=system_u:system_r:init_t:s0 tcontext=system_u:object_r:container_file_t:s0 tclass=sock_file permissive=1
----
time->Thu Sep 15 08:46:29 2022
type=AVC msg=audit(1663231589.213:223511): avc: denied { write } for pid=1 comm="systemd" name="podman.sock" dev="vda4" ino=143041949 scontext=system_u:system_r:init_t:s0 tcontext=system_u:object_r:container_file_t:s0 tclass=sock_file permissive=1
Relevant PR for openstack-selinux: https://github.com/redhat-openstack/openstack-selinux/pull/101

According to our records, this should be resolved by openstack-tripleo-common-15.4.1-0.20220705010409.51f6577.el9ost. This build is available now.

sensubility.log shows no errors, and the Grafana representation of the APIs is correct.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 17.0.1 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:0271