Description of problem:

Initially there were no issues with the collectd container: it was able to poll and send metrics to the STF instance hosted on top of OCP 4.3. Both STF and collectd polling worked smoothly, as the service status below shows.

~~~
● tripleo_collectd.service - collectd container
   Loaded: loaded (/etc/systemd/system/tripleo_collectd.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2020-07-24 13:04:52 UTC; 1h 52min ago
 Main PID: 576494 (conmon)
    Tasks: 0 (limit: 26213)
   Memory: 1.2M
   CGroup: /system.slice/tripleo_collectd.service
           ‣ 576494 /usr/bin/conmon --api-version 1 -s -c 2da5d40a5746e0a1e1e2e2b5ad8f2faa5d5d7e69b49efafd162fa10f30a1bc73 -u 2da5d40a>

Jul 24 13:04:51 overcloud-controller-0 systemd[1]: Starting collectd container...
Jul 24 13:04:52 overcloud-controller-0 podman[576435]: 2020-07-24 13:04:52.273189239 +0000 UTC m=+0.864768215 container init 2da5d40a5>
Jul 24 13:04:52 overcloud-controller-0 podman[576435]: 2020-07-24 13:04:52.323458025 +0000 UTC m=+0.915037000 container start 2da5d40a>
Jul 24 13:04:52 overcloud-controller-0 podman[576435]: collectd
Jul 24 13:04:52 overcloud-controller-0 systemd[1]: Started collectd container.
~~~

At that point, however, my deployment command did not include /usr/share/openstack-tripleo-heat-templates/environments/metrics/collect-read-rabbitmq.yaml, so RabbitMQ monitoring had not been requested in the first run.

Once that environment file was introduced to monitor RabbitMQ on the overcloud, the collectd container and its associated service started flapping continuously:

~~~
podman logs collectd
[2020-07-24 16:25:34] plugin_load: plugin "logfile" successfully loaded.
Error: Reading the config file failed! Read the logs for details.
+++
[root@overcloud-controller-0 ~]# systemctl status -l tripleo_collectd.service
● tripleo_collectd.service - collectd container
   Loaded: loaded (/etc/systemd/system/tripleo_collectd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Fri 2020-07-24 16:25:35 UTC; 26min ago
  Process: 394041 ExecStart=/usr/bin/podman start collectd (code=exited, status=0/SUCCESS)
 Main PID: 394053 (code=exited, status=1/FAILURE)

Jul 24 16:38:13 overcloud-controller-0 systemd[1]: collectd container is not active.
Jul 24 16:39:28 overcloud-controller-0 systemd[1]: collectd container is not active.
Jul 24 16:41:13 overcloud-controller-0 systemd[1]: collectd container is not active.
Jul 24 16:42:17 overcloud-controller-0 systemd[1]: collectd container is not active.
Jul 24 16:43:58 overcloud-controller-0 systemd[1]: collectd container is not active.
Jul 24 16:45:09 overcloud-controller-0 systemd[1]: collectd container is not active.
Jul 24 16:46:12 overcloud-controller-0 systemd[1]: collectd container is not active.
Jul 24 16:48:12 overcloud-controller-0 systemd[1]: collectd container is not active.
Jul 24 16:49:47 overcloud-controller-0 systemd[1]: collectd container is not active.
Jul 24 16:51:17 overcloud-controller-0 systemd[1]: collectd container is not active.
+++
[root@overcloud-controller-0 ~]#
   Active: activating (start) since Fri 2020-07-24 16:55:06 UTC; 975ms ago
  Process: 509828 ExecStop=/usr/bin/podman stop -t 10 collectd (code=exited, status=0/SUCCESS)
 Main PID: 515107 (code=exited, status=1/FAILURE); Control PID: 515215 (podman)
    Tasks: 9 (limit: 26213)
   Memory: 35.5M
   CGroup: /system.slice/tripleo_collectd.service
           └─515215 /usr/bin/podman start collectd
+++
● tripleo_collectd.service - collectd container
   Loaded: loaded (/etc/systemd/system/tripleo_collectd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Fri 2020-07-24 16:55:25 UTC; 4s ago
  Process: 509828 ExecStop=/usr/bin/podman stop -t 10 collectd (code=exited, status=0/SUCCESS)
  Process: 517113 ExecStart=/usr/bin/podman start collectd (code=exited, status=0/SUCCESS)
 Main PID: 517213 (code=exited, status=1/FAILURE)

Jul 24 16:55:25 overcloud-controller-0 systemd[1]: tripleo_collectd.service: Main process exited, code=exited, status=1/FAILURE
Jul 24 16:55:25 overcloud-controller-0 systemd[1]: tripleo_collectd.service: Failed with result 'exit-code'.
Jul 24 16:55:25 overcloud-controller-0 systemd[1]: tripleo_collectd.service: Service RestartSec=100ms expired, scheduling restart.
Jul 24 16:55:25 overcloud-controller-0 systemd[1]: tripleo_collectd.service: Scheduled restart job, restart counter is at 33.
Jul 24 16:55:25 overcloud-controller-0 systemd[1]: Stopped collectd container.
Jul 24 16:55:25 overcloud-controller-0 systemd[1]: tripleo_collectd.service: Start request repeated too quickly.
Jul 24 16:55:25 overcloud-controller-0 systemd[1]: tripleo_collectd.service: Failed with result 'exit-code'.
Jul 24 16:55:25 overcloud-controller-0 systemd[1]: Failed to start collectd container.
~~~

After an interval of 60 was added manually to the module configuration, the container came up and stayed stable:

+++
[root@overcloud-controller-0 ~]# cat /var/lib/config-data/puppet-generated/collectd/etc/collectd.d/python-config.conf
# Generated by Puppet
<Plugin "python">
  ModulePath "/usr/lib/python3.6/site-packages"
  LogTraces true
  Interactive false
  Import "collectd_rabbitmq_monitoring"
  # Configuration for collectd_rabbitmq_monitoring
  <Module "collectd_rabbitmq_monitoring">
    host "192.168.24.20"
    password "xI25NcpnpTwGNkxcJmVP81wjy"
    port "5672"
    username "guest"
    interval 60 <<--
  </Module>
</Plugin>
+++

This Python configuration file is populated when the environment file for the RabbitMQ plugin is introduced, so I believe the problem is caused by the configuration generated from /usr/share/openstack-tripleo-heat-templates/environments/metrics/collect-read-rabbitmq.yaml. We should look at introducing the interval parameter via that environment file; other solutions would be welcome as well. A sketch of one possible approach follows.
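For illustration only, here is a minimal sketch of how the interval could be threaded through an environment file. I have not verified the structure of the shipped collect-read-rabbitmq.yaml; the layout below follows puppet-collectd's collectd::plugin::python::modules hieradata convention, the values mirror the generated python-config.conf above, and the interval entry is the proposed addition rather than a confirmed interface:

~~~
# Hypothetical sketch, not the shipped collect-read-rabbitmq.yaml.
# Assumes the module is wired up through puppet-collectd's python-plugin
# hieradata; only the added "interval" entry is the point of the example.
parameter_defaults:
  ExtraConfig:
    collectd::plugin::python::modules:
      collectd_rabbitmq_monitoring:
        config:
          - host: '192.168.24.20'
            port: '5672'
            username: 'guest'
            password: '<redacted>'
            interval: 60   # proposed addition; 60s matched the manual fix above
~~~

Longer term, exposing this as a dedicated Heat parameter (so operators do not need raw ExtraConfig) would arguably be the cleaner fix.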
Version-Release number of selected component (if applicable):
RHOSP 16.0.2 + STF on top of OCP 4.3. However, this looks to be more of a collectd issue.

How reproducible:
Pass the environment file /usr/share/openstack-tripleo-heat-templates/environments/metrics/collect-read-rabbitmq.yaml while running a stack update (see the sketch under Additional info below).

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
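For completeness, a sketch of the kind of stack update that triggers the failure. The bracketed placeholder stands for the deployment's existing environment files, which are not reproduced in this report:

~~~
# Sketch only: substitute the deployment's actual -e files for the placeholder.
openstack overcloud deploy --templates \
  [existing -e environment files] \
  -e /usr/share/openstack-tripleo-heat-templates/environments/metrics/collect-read-rabbitmq.yaml
~~~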
It seems that you are testing https://bugzilla.redhat.com/show_bug.cgi?id=1290256, which has not (yet) been successfully tested by QA, so I am going to close this as a duplicate of that bug. Your feedback is very helpful to us; apparently there is still work to be done.

*** This bug has been marked as a duplicate of bug 1290256 ***