Description of problem:
collectd fails to start with the following error:

[tripleo-admin@n1cs1b1-osp1-comp001 ~]$ sudo systemctl status tripleo_podman_collectd_acl.service -l
× tripleo_podman_collectd_acl.service - ACL setting for /var/lib/tripleo-podman/collectd/podman.sock
     Loaded: loaded (/etc/systemd/system/tripleo_podman_collectd_acl.service; enabled; preset: disabled)
     Active: failed (Result: exit-code) since Thu 2024-08-01 14:14:43 IST; 5h 59min ago
   Main PID: 3676 (code=exited, status=255/EXCEPTION)
        CPU: 131ms

Aug 01 14:14:42 n1cs1b1-osp1-comp001 systemd[1]: Starting ACL setting for /var/lib/tripleo-podman/collectd/podman.sock...
Aug 01 14:14:43 n1cs1b1-osp1-comp001 podman[3676]: 2024-08-01 14:14:43.440984869 +0530 IST m=+0.418289924 system refresh
Aug 01 14:14:43 n1cs1b1-osp1-comp001 podman[3676]: Error: can only create exec sessions on running containers: container state improper
Aug 01 14:14:43 n1cs1b1-osp1-comp001 systemd[1]: tripleo_podman_collectd_acl.service: Main process exited, code=exited, status=255/EXCEPTION
Aug 01 14:14:43 n1cs1b1-osp1-comp001 systemd[1]: tripleo_podman_collectd_acl.service: Failed with result 'exit-code'.
Aug 01 14:14:43 n1cs1b1-osp1-comp001 systemd[1]: Failed to start ACL setting for /var/lib/tripleo-podman/collectd/podman.sock.

This looks like https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=2249626, but the versions mentioned in the errata for that bug are older than the ones installed here.
From reading the customer ticket, restarting the service manually works. Does the service stay up afterwards? If the service does not stay up, we need collectd log files and also collectd config files from a compute node. Can we please also fetch a collectd service file from a compute node?
(In reply to Matthias Runge from comment #3)
> From reading the customer ticket, restarting the service manually works.
> Does the service stay up afterwards?

Yes, it does. However, there is clearly a problem, as the service should start automatically with the rest of the services and containers. The current workaround is to restart it manually, which requires manual intervention on every node in the overcloud.
How often does this happen, and what hardware is the host running on? tripleo_podman_collectd_acl.service depends on tripleo_podman_collectd.service, so this is just a timing issue: the collectd container is not spawned fast enough before the ACL procedure starts.
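To illustrate the timing issue, here is a minimal sketch of polling until the container reports as running before attempting any exec into it. This is not the fix shipped in the advisory. The helper names (wait_until, container_running), the container name "collectd", and the timeout values are assumptions made for the example. The only podman call used is `podman inspect --format '{{.State.Running}}'`.

```python
import subprocess
import time
from typing import Callable


def wait_until(check: Callable[[], bool],
               timeout: float = 60.0,
               interval: float = 1.0,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll check() until it returns True or `timeout` seconds have elapsed.

    Returns True as soon as check() succeeds, False if the deadline passes.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def container_running(name: str = "collectd") -> bool:
    """Return True if podman reports the container as running.

    The container name "collectd" is an assumption for this sketch.
    """
    result = subprocess.run(
        ["podman", "inspect", "--format", "{{.State.Running}}", name],
        capture_output=True,
        text=True,
    )
    # A non-zero exit code means the container does not exist (yet).
    return result.returncode == 0 and result.stdout.strip() == "true"


# Hypothetical usage: only proceed with the ACL step once the container is up.
#   if wait_until(container_running, timeout=120, interval=2):
#       ...  # run the exec-based ACL step here
```

Polling with a deadline keeps the dependent step from failing immediately with "can only create exec sessions on running containers" while still giving up if the container never starts.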
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: RHOSP 17.1.4 (openstack-tripleo-heat-templates) security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:9978