Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2208020

Summary: metrics_qdr not sending messages with router_id longer than 62 chars
Product: Service Telemetry Framework
Reporter: Bill Farmer <bill.farmer>
Component: Documentation
Assignee: Leif Madsen <lmadsen>
Status: CLOSED CURRENTRELEASE
QA Contact: Leonid Natapov <lnatapov>
Severity: low
Docs Contact: mgeary <mgeary>
Priority: medium
Version: 1.5
CC: bmclaren, chrisbro, cjanisze, dhill, jbadiapa, jjoyce, jschluet, lhh, lmadsen, mariel, mburns, mgarciac, mlecki, mrunge, shrjoshi
Target Milestone: z3
Keywords: Documentation, Triaged
Target Release: 1.5 (STF)
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
The stf-connectors.yaml file in the Service Telemetry Framework (STF) documentation for configuring the RHOSP overcloud deployment has been updated to include a new `qdr::router_id` parameter. This parameter overrides the default router ID value, which in some cases can be longer than the maximum of 62 characters.
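The fix described above amounts to a hiera override in the stf-connectors.yaml environment file. A minimal sketch of the relevant fragment (the exact surrounding structure of the file is not reproduced here; see the STF documentation for the full template):

    ExtraConfig:
        # Override the default FQDN-derived router ID, which can exceed
        # the 62-character maximum, with the node's short hostname
        qdr::router_id: "%{::hostname}"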
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-12-07 17:17:48 UTC
Type: Bug

Description Bill Farmer 2023-05-17 17:14:21 UTC
Description of problem:
STF is deployed, and statistics from controller and storage nodes are populating. Statistics from nova compute nodes are not populating, and checking the status of connections on the compute nodes presents the following error:

[heat-admin@dal-osc-novacomputeiha-0 ~]$ sudo podman exec -it metrics_qdr qdstat --bus=192.168.200.199:5666 --connections
Timeout: Connection amqp://192.168.200.199:5666/$management timed out: Opening link 627af914-039a-41a4-81d3-96bf27e09755-$management

Version-Release number of selected component (if applicable):
service-telemetry-operator.v1.5.1680516659


How reproducible: We've re-run the deployment a couple of times, including combining the ExtraConfig section from the enable-stf.yaml template with the ExtraConfig section from compute-instanceha.yaml.


Steps to Reproduce:
1. Deploy OpenShift
2. Deploy STF
3. Check Prometheus/Grafana for statistics and see they are missing
4. Log in to a compute node and run sudo podman exec -it metrics_qdr cat /etc/qpid-dispatch/qdrouterd.conf to get the listener IP
5. Run sudo podman exec -it metrics_qdr qdstat --bus=<listenerip>:5666 --connections

Actual results:
Statistics from compute nodes are not collected

Expected results:
All statistics collected including from compute nodes

Additional info:
N/A

Comment 1 mlecki 2023-05-18 14:51:13 UTC
A few extra details about the OSP environment:
1. This is an OSP 16.2.4 release with STF 1.5 running on OCP 4.12
2. Instance HA is used on the compute nodes
3. The collectd container on the compute nodes does start and loads all the plugins, but it repeatedly logs only this one error:
[root@dal-osc-novacomputeiha-0 ~]# cat /var/log/containers/collectd/collectd.log 
[2023-05-10 14:05:59] plugin_load: plugin "amqp1" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "cpu" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "df" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "disk" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "exec" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "hugepages" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "interface" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "load" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "memory" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "python" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "unixsock" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "uptime" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "virt" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "vmem" successfully loaded.
[2023-05-10 14:05:59] UNKNOWN plugin: plugin_get_interval: Unable to determine Interval from context.
[2023-05-10 14:05:59] plugin_load: plugin "libpodstats" successfully loaded.
[2023-05-10 14:05:59] virt plugin: reader virt-0 initialized
[2023-05-10 14:05:59] Initialization complete, entering read-loop.
[2023-05-10 14:05:59] virt plugin: getting the CPU params count failed: Requested operation is not valid: cgroup CPUACCT controller is not mounted
[2023-05-10 14:05:59] virt plugin: getting the CPU params count failed: Requested operation is not valid: cgroup CPUACCT controller is not mounted
[2023-05-10 14:05:59] virt plugin: getting the CPU params count failed: Requested operation is not valid: cgroup CPUACCT controller is not mounted

Comment 2 mlecki 2023-05-18 15:55:59 UTC
Troubleshooting done so far:
- Disabling the virt plugin in the collectd config and restarting the collectd container clears the error from the collectd logs. Dashboard graphs in "Infrastructure Node View" for the compute node where collectd was restarted started populating.
- Re-enabling the virt plugin and removing only "pcpu" from the ExtraStats list ("pcpu cpu_util vcpupin vcpu memory disk disk_err disk_allocation disk_capacity disk_physical domain_state job_stats_background perf") also made the dashboard graphs in "Infrastructure Node View" for that compute node populate.

Comment 3 mlecki 2023-05-18 16:20:30 UTC
Update: the troubleshooting steps above make the metrics show up from the compute node for only approximately 10 minutes.

Comment 5 Matthias Runge 2023-05-22 08:56:52 UTC
Based on the information given in the customer case, I've updated this bugzilla title to be more appropriate. It looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2134557

Comment 7 Matthias Runge 2023-05-22 09:54:48 UTC
The router_id is constructed here https://github.com/openstack/puppet-qdr/blob/stable/train/manifests/init.pp#L181
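The manifest linked above derives the router ID from the node's FQDN, which is what pushes it past the limit. A small self-contained sketch of the failure mode (the 62-character maximum comes from this bug; the helper function and hostnames are illustrative, not part of puppet-qdr or qdrouterd):

```python
# Hypothetical helper illustrating the length constraint described in
# this bug; not part of puppet-qdr or qdrouterd.
MAX_ROUTER_ID_LEN = 62

def router_id_ok(router_id: str) -> bool:
    """Return True if the router ID fits within the 62-character limit."""
    return len(router_id) <= MAX_ROUTER_ID_LEN

# The default behaviour derives the ID from the FQDN, which can exceed the limit:
fqdn_id = "dal-osc-novacomputeiha-0.internalapi.some-long-domain.example.redhat.com"
# The short-hostname override stays well under it:
short_id = "dal-osc-novacomputeiha-0"

print(router_id_ok(fqdn_id))   # False (72 characters, over the limit)
print(router_id_ok(short_id))  # True (24 characters)
```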

Comment 9 Matthias Runge 2023-05-22 12:01:27 UTC
*** Bug 2209038 has been marked as a duplicate of this bug. ***

Comment 13 Leif Madsen 2023-05-31 20:20:43 UTC
We did a bit of testing and might have a workaround that we'll need to fully verify for multi-cloud deployments. Based on initial testing, we should be able to override the `qdr::router_id` value and use the short hostname rather than the FQDN. Here is a set of notes from Yadnesh, who did some POC testing and provided a starting hiera value.


> ykulkarn:

I was able to get the router id set to short hostname using

    ExtraConfig:
        qdr::router_id: "%{::hostname}"

Which is rendered as

    router {
        mode: edge
        id: ctrl-2-16-2
        workerThreads: 4
        debugDump: /var/log/qdrouterd
        saslConfigPath: /etc/sasl2
        saslConfigName: qdrouterd
    }

I also looked for something like UUID/instance ID in hiera but nothing like that exists. So the closest value that I could find which is unique for every node is the short hostname.

> lmadsen:

I would suppose we'd want to recommend someone do something like.... qdr::router_id: "%{::hostname}.cloud01"


TODO: fully verify what things look like in QDR console when making these changes, along with verifying the possible extension of using a unique value on the end to make sure the RouterID is unique in multi-cloud deployment scenarios. Once we verify that, we should have enough information to provide a documentation enhancement.
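The suffix form suggested above (`%{::hostname}.cloud01`) could be expressed as a hiera override along these lines; the ExtraConfig placement and the cloud01 label are assumptions for illustration, and the trailing label would change per cloud:

    ExtraConfig:
        # Short hostname plus a per-cloud suffix keeps the router ID
        # under 62 characters while remaining unique across clouds
        qdr::router_id: "%{::hostname}.cloud01"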

Comment 14 Leif Madsen 2023-08-14 13:24:52 UTC
Dropping the priority here as the support case is closed with the workaround provided, and RHOSP 17.1 has a rebased version of QDR (the metrics_qdr service container) which doesn't have the same router ID length problem. This issue will require some environment setup, so I might try to work on it as part of the STF 1.5.3 release work, since I'll need an OSP environment at the same time.

Comment 15 Leif Madsen 2023-09-07 23:16:15 UTC
I have tested this with my local script and everything works as intended. I have also created https://github.com/infrawatch/documentation/pull/487 for STF documentation to show how to override.

Comment 16 Leif Madsen 2023-09-11 19:19:56 UTC
Changes landed upstream in the STF documentation. Moving to POST.

Comment 17 Leif Madsen 2023-09-26 15:47:31 UTC
I'm moving this to MODIFIED because the documentation changes have landed upstream, and all helper scripts in STF testing are now using the new override method. Here is sample output from my RHOSP 17.1 deployment which shows the new shorter names:

    [stack@undercloud-0 ~]$ oc exec -it deployment/default-interconnect -- qdstat --connections
    2023-09-26 15:45:26.757662 UTC
    default-interconnect-7cc45b86c5-tflmn

    Connections
      id    host                container                             role    dir  security                                authentication  tenant  last dlv      uptime
      =========================================================================================================================================================================
      43    10.128.0.116:51880  bridge-1ce                            edge    in   no-security                             anonymous-user          000:00:00:06  000:01:12:37
      45    10.128.0.117:51414  bridge-101                            edge    in   no-security                             anonymous-user          000:00:00:18  000:01:12:36
      51    10.129.0.241:60896  bridge-343                            edge    in   no-security                             anonymous-user          -             000:01:12:30
      52    10.129.0.240:33282  bridge-343                            edge    in   no-security                             anonymous-user          000:00:00:02  000:01:12:30
      58    10.129.0.242:53050  bridge-139                            edge    in   no-security                             anonymous-user          -             000:01:12:26
      2250  10.129.0.1:46320    compute-0.cloud1                      edge    in   TLSv1/SSLv3(DHE-RSA-AES256-GCM-SHA384)  anonymous-user          000:00:00:03  000:00:35:19
      2251  10.129.0.1:46324    compute-1.cloud1                      edge    in   TLSv1/SSLv3(DHE-RSA-AES256-GCM-SHA384)  anonymous-user          000:00:00:03  000:00:35:18
      2255  10.129.0.1:47522    controller-2.cloud1                   edge    in   TLSv1/SSLv3(DHE-RSA-AES256-GCM-SHA384)  anonymous-user          000:00:00:06  000:00:34:57
      2256  10.129.0.1:47530    controller-1.cloud1                   edge    in   TLSv1/SSLv3(DHE-RSA-AES256-GCM-SHA384)  anonymous-user          000:00:00:02  000:00:34:57
      2257  10.129.0.1:47548    controller-0.cloud1                   edge    in   TLSv1/SSLv3(DHE-RSA-AES256-GCM-SHA384)  anonymous-user          000:00:00:02  000:00:34:52
      2258  127.0.0.1:53048     cfd37f73-015b-4a90-b3de-4db276bf8eb6  normal  in   no-security                             no-auth                 000:00:00:00  000:00:00:00

Comment 18 Leif Madsen 2023-09-26 15:51:18 UTC
Moving this to the Service Telemetry Framework project for better tracking and targeting STF 1.5.3 release.

Comment 19 Leif Madsen 2023-11-06 18:09:18 UTC
Moving to ON_QA as the changes have landed in the upstream documentation, and are ready for verification.

Comment 20 Leif Madsen 2023-12-07 17:17:48 UTC
Published the changes to the documentation for STF 1.5.3. They are currently being pushed through the system and should be available in a couple of hours or less.