Bug 2304312

Summary: collectd fails to start
Product: Red Hat OpenStack
Reporter: Siggy Sigwald <ssigwald>
Component: openstack-tripleo-heat-templates
Assignee: Martin Magr <mmagr>
Status: CLOSED ERRATA
QA Contact: myadla
Severity: high
Docs Contact:
Priority: medium
Version: 17.1 (Wallaby)
CC: jbadiapa, jveiraca, lars, mariel, mburns, mmagr, mrunge, pgrist, riramos
Target Milestone: z4
Keywords: Triaged, ZStream
Target Release: 17.1
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-14.3.1-17.1.20240919130751.e7c7ce3.el9ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2024-11-21 09:30:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Siggy Sigwald 2024-08-13 12:56:11 UTC
Description of problem:
collectd fails to start with the following error:

[tripleo-admin@n1cs1b1-osp1-comp001 ~]$ sudo systemctl status tripleo_podman_collectd_acl.service -l
× tripleo_podman_collectd_acl.service - ACL setting for /var/lib/tripleo-podman/collectd/podman.sock
     Loaded: loaded (/etc/systemd/system/tripleo_podman_collectd_acl.service; enabled; preset: disabled)
     Active: failed (Result: exit-code) since Thu 2024-08-01 14:14:43 IST; 5h 59min ago
   Main PID: 3676 (code=exited, status=255/EXCEPTION)
        CPU: 131ms

Aug 01 14:14:42 n1cs1b1-osp1-comp001 systemd[1]: Starting ACL setting for /var/lib/tripleo-podman/collectd/podman.sock...
Aug 01 14:14:43 n1cs1b1-osp1-comp001 podman[3676]: 2024-08-01 14:14:43.440984869 +0530 IST m=+0.418289924 system refresh
Aug 01 14:14:43 n1cs1b1-osp1-comp001 podman[3676]: Error: can only create exec sessions on running containers: container state improper
Aug 01 14:14:43 n1cs1b1-osp1-comp001 systemd[1]: tripleo_podman_collectd_acl.service: Main process exited, code=exited, status=255/EXCEPTION
Aug 01 14:14:43 n1cs1b1-osp1-comp001 systemd[1]: tripleo_podman_collectd_acl.service: Failed with result 'exit-code'.
Aug 01 14:14:43 n1cs1b1-osp1-comp001 systemd[1]: Failed to start ACL setting for /var/lib/tripleo-podman/collectd/podman.sock.

This looks like https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=2249626,
but the versions mentioned in that bug's errata are older than what is installed here.

Comment 3 Matthias Runge 2024-08-14 09:05:24 UTC
From reading the customer ticket, restarting the service manually works. Does the service stay up afterwards? 

If the service does not stay up, we need collectd log files and also collectd config files from a compute node. 
Can we please also fetch a collectd service file from a compute node?

Comment 6 Siggy Sigwald 2024-08-15 10:21:53 UTC
(In reply to Matthias Runge from comment #3)
> From reading the customer ticket, restarting the service manually works.
> Does the service stay up afterwards? 
Yes, it does. However, there is clearly a problem: the service should start automatically along with the rest of the services and containers. The current workaround is to restart it by hand, which means manual intervention on every node in the overcloud.
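
For anyone applying the workaround on many nodes, a loop along these lines may save some typing. This is only a sketch under assumptions: it takes the unit to restart to be the failed one from the description, it reuses the tripleo-admin user seen in the shell prompt above, and the function names (restart_cmd, restart_all) and the DRY_RUN variable are made up for illustration. It prints the commands by default rather than running them.

```shell
# Assumed target: the failed unit shown in the description.
UNIT="tripleo_podman_collectd_acl.service"
# User taken from the shell prompt in the description.
SSH_USER="tripleo-admin"

# Print the ssh command that would restart the unit on one host.
restart_cmd() {
    printf 'ssh %s@%s sudo systemctl restart %s\n' "$SSH_USER" "$1" "$UNIT"
}

# Read hostnames on stdin, one per line, skipping blank lines.
# With DRY_RUN=1 (the default) only print the commands; with DRY_RUN=0 run them.
restart_all() {
    while IFS= read -r host; do
        [ -n "$host" ] || continue
        if [ "${DRY_RUN:-1}" = "1" ]; then
            restart_cmd "$host"
        else
            # -n keeps ssh from consuming the host list on stdin.
            ssh -n "$SSH_USER@$host" sudo systemctl restart "$UNIT"
        fi
    done
}
```

Feeding it a file of compute hostnames (restart_all < nodes.txt, where nodes.txt is a hypothetical list) shows what would be run; setting DRY_RUN=0 executes it.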

Comment 8 Martin Magr 2024-08-16 13:58:34 UTC
How often does this happen, and what hardware are the hosts running on? The service tripleo_podman_collectd_acl.service depends on tripleo_podman_collectd.service, so this is just a timing issue: the collectd container is not spawned fast enough before the ACL procedure starts.
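
To illustrate the timing issue, one way to avoid the race is to poll until the container is actually running before attempting the exec. The following is a minimal sketch, not the fix that shipped in the erratum. The helper names (wait_until, container_running) are hypothetical, and the container name has to be supplied by the caller since it is not given in this report.

```shell
# wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-run COMMAND until it succeeds or ATTEMPTS is exhausted.
# Returns 0 on the first success, 1 if every attempt failed.
wait_until() {
    _attempts=$1
    _delay=$2
    shift 2
    _i=1
    while [ "$_i" -le "$_attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Sleep between attempts, but not after the last one.
        if [ "$_i" -lt "$_attempts" ]; then
            sleep "$_delay"
        fi
        _i=$((_i + 1))
    done
    return 1
}

# Hypothetical check: succeeds only when podman reports the container
# named in $1 as running. Errors (e.g. container not found) count as
# "not running".
container_running() {
    [ "$(podman inspect --format '{{.State.Running}}' "$1" 2>/dev/null)" = "true" ]
}
```

With these, something like wait_until 30 2 container_running NAME followed by the ACL step would only proceed once the container is up, and would give up after roughly a minute otherwise. The attempt count and delay here are arbitrary.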

Comment 21 errata-xmlrpc 2024-11-21 09:30:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHOSP 17.1.4 (openstack-tripleo-heat-templates) security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:9978
