Description of problem:
STF is deployed and statistics from the controller and storage nodes are populating. Statistics from the nova compute nodes are not populating, and when checking the status of connections on the compute nodes the following error is presented:

[heat-admin@dal-osc-novacomputeiha-0 ~]$ sudo podman exec -it metrics_qdr qdstat --bus=192.168.200.199:5666 --connections
Timeout: Connection amqp://192.168.200.199:5666/$management timed out: Opening link 627af914-039a-41a4-81d3-96bf27e09755-$management

Version-Release number of selected component (if applicable):
service-telemetry-operator.v1.5.1680516659

How reproducible:
We've re-run the deploy a couple of times, including combining the ExtraConfig section from the enable-stf.yaml template with the ExtraConfig from compute-instanceha.yaml.

Steps to Reproduce:
1. Deploy OpenShift
2. Deploy STF
3. Check Prometheus/Grafana for statistics and see that they are missing
4. Log in to a compute node and run sudo podman exec -it metrics_qdr cat /etc/qpid-dispatch/qdrouterd.conf to get the listener IP
5. Run sudo podman exec -it metrics_qdr qdstat --bus=<listenerip>:5666 --connections

Actual results:
Statistics from compute nodes are not collected.

Expected results:
All statistics collected, including from compute nodes.

Additional info:
N/A
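For context on the "combining the ExtraConfig" note above: the last environment file that sets ExtraConfig effectively wins (Heat does not deep-merge map parameters across environment files), so the entries from both templates have to end up in a single map. A rough sketch of the shape of that combined file; the key names below are placeholders, not the actual contents of enable-stf.yaml or compute-instanceha.yaml:

    # combined environment file (illustrative only; hiera keys are placeholders)
    parameter_defaults:
      ExtraConfig:
        # entries copied from the ExtraConfig section of enable-stf.yaml
        stf_related_hiera_key: stf_related_value            # placeholder
        # entries copied from the ExtraConfig section of compute-instanceha.yaml
        instanceha_related_hiera_key: instanceha_value      # placeholder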
Few extra details about the OSP environment:

1. This is an OSP 16.2.4 release with STF 1.5 running on OCP 4.12
2. Instance HA is used on the compute nodes
3. The collectd container on the compute nodes does start and loads all the plugins, but it only logs this one error repeatedly:

[root@dal-osc-novacomputeiha-0 ~]# cat /var/log/containers/collectd/collectd.log
[2023-05-10 14:05:59] plugin_load: plugin "amqp1" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "cpu" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "df" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "disk" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "exec" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "hugepages" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "interface" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "load" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "memory" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "python" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "unixsock" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "uptime" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "virt" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "vmem" successfully loaded.
[2023-05-10 14:05:59] UNKNOWN plugin: plugin_get_interval: Unable to determine Interval from context.
[2023-05-10 14:05:59] plugin_load: plugin "libpodstats" successfully loaded.
[2023-05-10 14:05:59] virt plugin: reader virt-0 initialized
[2023-05-10 14:05:59] Initialization complete, entering read-loop.
[2023-05-10 14:05:59] virt plugin: getting the CPU params count failed: Requested operation is not valid: cgroup CPUACCT controller is not mounted
[2023-05-10 14:05:59] virt plugin: getting the CPU params count failed: Requested operation is not valid: cgroup CPUACCT controller is not mounted
[2023-05-10 14:05:59] virt plugin: getting the CPU params count failed: Requested operation is not valid: cgroup CPUACCT controller is not mounted
Troubleshooting done so far:

- Disabling the virt plugin in the collectd config and restarting the collectd container clears the error from the collectd logs. Dashboard graphs in the "Infrastructure Node View" for the compute node where collectd was restarted started populating.
- We then re-enabled the virt plugin and removed only the "pcpu" entry from the ExtraStats list ("pcpu cpu_util vcpupin vcpu memory disk disk_err disk_allocation disk_capacity disk_physical domain_state job_stats_background perf"), as sketched below. This also made the dashboard graphs in the "Infrastructure Node View" for the compute node where collectd was restarted populate.
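For anyone reproducing the second item above, the change amounts to dropping "pcpu" from the virt plugin's ExtraStats line in the rendered collectd configuration on the compute node. A minimal sketch of what that block looks like; the file location and the Connection line are assumptions, only the ExtraStats list comes from the comment above:

    # virt plugin block with "pcpu" removed from ExtraStats (illustrative;
    # other options in the real rendered config are omitted here)
    <Plugin virt>
      Connection "qemu:///system"
      ExtraStats "cpu_util vcpupin vcpu memory disk disk_err disk_allocation disk_capacity disk_physical domain_state job_stats_background perf"
    </Plugin>

If making the change through TripleO rather than editing the rendered file by hand, the equivalent hiera parameter is most likely collectd::plugin::virt::extra_stats from puppet-collectd, though that key name is from memory and should be verified against the deployed module.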
Update - performing the troubleshooting steps above makes the metrics from the compute node show up for only approximately 10 minutes.
Based on the information given in the customer case, I've updated this bugzilla title to be more appropriate. It looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2134557
The router_id is constructed here https://github.com/openstack/puppet-qdr/blob/stable/train/manifests/init.pp#L181
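To check what ID actually got rendered on a given node, something like the following should work (the exact qdstat output layout may vary by dispatch-router version; substitute the listener IP found in the config):

    # show the router {} block that puppet-qdr rendered into the config
    sudo podman exec -it metrics_qdr grep -A 6 '^router {' /etc/qpid-dispatch/qdrouterd.conf

    # or ask the running router for its general stats, which include the Router Id
    sudo podman exec -it metrics_qdr qdstat --bus=<listenerip>:5666 -g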
*** Bug 2209038 has been marked as a duplicate of this bug. ***
We did a bit of testing and might have a workaround that we'll need to fully verify for multi-cloud deployments, but based on initial testing we should be able to override the `qdr::router_id` value and use the short hostname rather than the FQDN. Here is a set of notes from Yadnesh, who did some POC testing and provided a starting hiera value.

> ykulkarn: I was able to get the router id set to the short hostname using ExtraConfig:
>
>     qdr::router_id: "%{::hostname}"
>
> which is rendered as:
>
>     router {
>         mode: edge
>         id: ctrl-2-16-2
>         workerThreads: 4
>         debugDump: /var/log/qdrouterd
>         saslConfigPath: /etc/sasl2
>         saslConfigName: qdrouterd
>     }
>
> I also looked for something like a UUID/instance ID in hiera, but nothing like that exists, so the closest value I could find that is unique for every node is the short hostname.

> lmadsen: I would suppose we'd want to recommend someone do something like:
>
>     qdr::router_id: "%{::hostname}.cloud01"

TODO: fully verify what things look like in the QDR console when making these changes, along with verifying the possible extension of using a unique value on the end to make sure the RouterID is unique in multi-cloud deployment scenarios. Once we verify that, we should have enough information to provide a documentation enhancement.
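For completeness, a minimal sketch of how that workaround would look as a deployment environment file, based purely on the notes above (the filename and the ".cloud01" suffix are illustrative; the suffix is only there to keep router IDs unique across clouds and still needs the multi-cloud verification mentioned in the TODO):

    # stf-qdr-router-id.yaml (filename is illustrative)
    parameter_defaults:
      ExtraConfig:
        # use the short hostname plus a per-cloud suffix instead of the FQDN
        qdr::router_id: "%{::hostname}.cloud01"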
Dropping priority here as the support case is closed with the workaround provided, and RHOSP 17.1 has a rebased version of QDR (the metrics_qdr service container) which doesn't have the same router ID length problem. This issue will require some environment setup, so I might try to work on it as part of some STF 1.5.3 release work, since I'll need an OSP environment at the same time.