1860915 – [STF] [Collectd] [RHOSP16] Collectd container fails to start when environment file to enable 'overcloud RabbitMQ instance' monitoring is passed during stack update

Keywords:
Status:	CLOSED DUPLICATE of bug 1290256
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	collectd-rabbitmq-monitoring
Sub Component:
Version:	16.0 (Train)
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Matthias Runge
QA Contact:	Amit Ugol
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-07-27 12:42 UTC by Ketan Mehta
Modified:	2020-07-27 13:18 UTC (History)
CC List:	0 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-07-27 13:18:24 UTC
Target Upstream Version:
Embargoed:

Description Ketan Mehta 2020-07-27 12:42:59 UTC

Description of problem:

Initially, the collectd container ran without issues and was able to poll and send metrics to the STF instance hosted on top of OCP 4.3.

Neither STF itself nor collectd polling showed any problems, as the service status shows:

~~~
● tripleo_collectd.service - collectd container
   Loaded: loaded (/etc/systemd/system/tripleo_collectd.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2020-07-24 13:04:52 UTC; 1h 52min ago
 Main PID: 576494 (conmon)
    Tasks: 0 (limit: 26213)
   Memory: 1.2M
   CGroup: /system.slice/tripleo_collectd.service
           ‣ 576494 /usr/bin/conmon --api-version 1 -s -c 2da5d40a5746e0a1e1e2e2b5ad8f2faa5d5d7e69b49efafd162fa10f30a1bc73 -u 2da5d40a>

Jul 24 13:04:51 overcloud-controller-0 systemd[1]: Starting collectd container...
Jul 24 13:04:52 overcloud-controller-0 podman[576435]: 2020-07-24 13:04:52.273189239 +0000 UTC m=+0.864768215 container init 2da5d40a5>
Jul 24 13:04:52 overcloud-controller-0 podman[576435]: 2020-07-24 13:04:52.323458025 +0000 UTC m=+0.915037000 container start 2da5d40a>
Jul 24 13:04:52 overcloud-controller-0 podman[576435]: collectd
Jul 24 13:04:52 overcloud-controller-0 systemd[1]: Started collectd container.
~~~

At this point, my deployment command did not include /usr/share/openstack-tripleo-heat-templates/environments/metrics/collect-read-rabbitmq.yaml, so RabbitMQ monitoring was not requested in the first run.

Once the environment file for monitoring RabbitMQ on the overcloud was added, problems started to appear with the collectd container.

The collectd container and its associated service began flapping continuously:

~~~
podman logs collectd
[2020-07-24 16:25:34] plugin_load: plugin "logfile" successfully loaded.
Error: Reading the config file failed!
Read the logs for details.
+++

[root@overcloud-controller-0 ~]# systemctl status -l tripleo_collectd.service
● tripleo_collectd.service - collectd container
   Loaded: loaded (/etc/systemd/system/tripleo_collectd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Fri 2020-07-24 16:25:35 UTC; 26min ago
  Process: 394041 ExecStart=/usr/bin/podman start collectd (code=exited, status=0/SUCCESS)
 Main PID: 394053 (code=exited, status=1/FAILURE)

Jul 24 16:38:13 overcloud-controller-0 systemd[1]: collectd container is not active.
Jul 24 16:39:28 overcloud-controller-0 systemd[1]: collectd container is not active.
Jul 24 16:41:13 overcloud-controller-0 systemd[1]: collectd container is not active.
Jul 24 16:42:17 overcloud-controller-0 systemd[1]: collectd container is not active.
Jul 24 16:43:58 overcloud-controller-0 systemd[1]: collectd container is not active.
Jul 24 16:45:09 overcloud-controller-0 systemd[1]: collectd container is not active.
Jul 24 16:46:12 overcloud-controller-0 systemd[1]: collectd container is not active.
Jul 24 16:48:12 overcloud-controller-0 systemd[1]: collectd container is not active.
Jul 24 16:49:47 overcloud-controller-0 systemd[1]: collectd container is not active.
Jul 24 16:51:17 overcloud-controller-0 systemd[1]: collectd container is not active.
[root@overcloud-controller-0 ~]# 
  Active: activating (start) since Fri 2020-07-24 16:55:06 UTC; 975ms ago
  Process: 509828 ExecStop=/usr/bin/podman stop -t 10 collectd (code=exited, status=0/SUCCESS)
 Main PID: 515107 (code=exited, status=1/FAILURE); Control PID: 515215 (podman)
    Tasks: 9 (limit: 26213)
   Memory: 35.5M
   CGroup: /system.slice/tripleo_collectd.service
           └─515215 /usr/bin/podman start collectd
+++
● tripleo_collectd.service - collectd container
   Loaded: loaded (/etc/systemd/system/tripleo_collectd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Fri 2020-07-24 16:55:25 UTC; 4s ago
  Process: 509828 ExecStop=/usr/bin/podman stop -t 10 collectd (code=exited, status=0/SUCCESS)
  Process: 517113 ExecStart=/usr/bin/podman start collectd (code=exited, status=0/SUCCESS)
 Main PID: 517213 (code=exited, status=1/FAILURE)

Jul 24 16:55:25 overcloud-controller-0 systemd[1]: tripleo_collectd.service: Main process exited, code=exited, status=1/FAILURE
Jul 24 16:55:25 overcloud-controller-0 systemd[1]: tripleo_collectd.service: Failed with result 'exit-code'.
Jul 24 16:55:25 overcloud-controller-0 systemd[1]: tripleo_collectd.service: Service RestartSec=100ms expired, scheduling restart.
Jul 24 16:55:25 overcloud-controller-0 systemd[1]: tripleo_collectd.service: Scheduled restart job, restart counter is at 33.
Jul 24 16:55:25 overcloud-controller-0 systemd[1]: Stopped collectd container.
Jul 24 16:55:25 overcloud-controller-0 systemd[1]: tripleo_collectd.service: Start request repeated too quickly.
Jul 24 16:55:25 overcloud-controller-0 systemd[1]: tripleo_collectd.service: Failed with result 'exit-code'.
Jul 24 16:55:25 overcloud-controller-0 systemd[1]: Failed to start collectd container.
~~~

After an interval of 60 was added to the module configuration (the line marked with <<-- in the file contents), the container came up and remained stable.

+++
[root@overcloud-controller-0 ~]# cat /var/lib/config-data/puppet-generated/collectd/etc/collectd.d/python-config.conf
# Generated by Puppet
<Plugin "python">
  ModulePath "/usr/lib/python3.6/site-packages"
  LogTraces true
  Interactive false

  Import "collectd_rabbitmq_monitoring"
  # Configuration for collectd_rabbitmq_monitoring
  <Module "collectd_rabbitmq_monitoring">
    host "192.168.24.20"
    password "xI25NcpnpTwGNkxcJmVP81wjy"
    port "5672"
    username "guest"
    interval 60  <<--
  </Module>
</Plugin>
+++

This Python configuration file is generated when the environment file for the RabbitMQ plugin is added, so I think the issue comes from the configuration specified in /usr/share/openstack-tripleo-heat-templates/environments/metrics/collect-read-rabbitmq.yaml.

We should look at introducing the interval parameter via the environment file. If there are other solutions, those would be welcome as well.
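
To make the check above repeatable, here is a minimal Python sketch that scans the text of a generated python-config.conf and lists any <Module> block without an interval line. The helper name modules_missing_interval and the SAMPLE text are hypothetical and are not part of collectd or TripleO. It assumes the file layout shown above. Whether a missing interval is the actual root cause is only my observation from this one deployment.

```python
import re

# Hypothetical helper for illustration; not part of collectd or TripleO.
# Matches each <Module "name"> ... </Module> block in a collectd config.
MODULE_RE = re.compile(
    r'<Module\s+"(?P<name>[^"]+)">(?P<body>.*?)</Module>',
    re.DOTALL,
)


def modules_missing_interval(conf_text):
    """Return the names of <Module> blocks that contain no 'interval' key."""
    missing = []
    for match in MODULE_RE.finditer(conf_text):
        has_interval = False
        for line in match.group("body").splitlines():
            # Ignore comments, then look only at the first token (the key).
            stripped = line.split("#", 1)[0].strip()
            if stripped and stripped.split()[0].lower() == "interval":
                has_interval = True
                break
        if not has_interval:
            missing.append(match.group("name"))
    return missing


# Assumed sample modelled on the file above, with the interval line
# left out and the password omitted.
SAMPLE = '''
<Plugin "python">
  Import "collectd_rabbitmq_monitoring"
  <Module "collectd_rabbitmq_monitoring">
    host "192.168.24.20"
    port "5672"
    username "guest"
  </Module>
</Plugin>
'''

if __name__ == "__main__":
    print(modules_missing_interval(SAMPLE))
```

On the controller, the same function could be fed the contents of /var/lib/config-data/puppet-generated/collectd/etc/collectd.d/python-config.conf read with open().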

Version-Release number of selected component (if applicable):

RHOSP16.0.2 + STF on top of OCP4.3

However, this looks to be more of a collectd issue.

How reproducible:

Pass the environment file "/usr/share/openstack-tripleo-heat-templates/environments/metrics/collect-read-rabbitmq.yaml" when running a stack update.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Matthias Runge 2020-07-27 13:18:24 UTC

It seems that you are testing https://bugzilla.redhat.com/show_bug.cgi?id=1290256, which has not (yet) been successfully tested by QA.
I am going to close this as a duplicate of the above bug. Your feedback is very helpful for us. Apparently, there is still work to be done.

*** This bug has been marked as a duplicate of bug 1290256 ***
