Bug 2208020

| Summary: | metrics_qdr not sending messages with router_id longer than 62 chars | | |
|---|---|---|---|
| Product: | Service Telemetry Framework | Reporter: | Bill Farmer <bill.farmer> |
| Component: | Documentation | Assignee: | Leif Madsen <lmadsen> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Leonid Natapov <lnatapov> |
| Severity: | low | Docs Contact: | mgeary <mgeary> |
| Priority: | medium | | |
| Version: | 1.5 | CC: | bmclaren, chrisbro, cjanisze, dhill, jbadiapa, jjoyce, jschluet, lhh, lmadsen, mariel, mburns, mgarciac, mlecki, mrunge, shrjoshi |
| Target Milestone: | z3 | Keywords: | Documentation, Triaged |
| Target Release: | 1.5 (STF) | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | The stf-connectors.yaml documentation in Service Telemetry Framework (STF), used for configuring the RHOSP overcloud deployment, has been updated to include a new `qdr::router_id` parameter. The parameter overrides the default router ID value, which in some cases can be longer than the maximum of 62 characters. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-12-07 17:17:48 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Bill Farmer
2023-05-17 17:14:21 UTC
A few extra details about the OSP environment:

1. This is an OSP 16.2.4 release with STF 1.5 running on OCP 4.12.
2. Instance HA is used on the compute nodes.
3. The collectd container on the compute nodes starts and loads all of its plugins, but it repeatedly logs this one error:

```
[root@dal-osc-novacomputeiha-0 ~]# cat /var/log/containers/collectd/collectd.log
[2023-05-10 14:05:59] plugin_load: plugin "amqp1" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "cpu" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "df" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "disk" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "exec" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "hugepages" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "interface" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "load" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "memory" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "python" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "unixsock" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "uptime" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "virt" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "vmem" successfully loaded.
[2023-05-10 14:05:59] UNKNOWN plugin: plugin_get_interval: Unable to determine Interval from context.
[2023-05-10 14:05:59] plugin_load: plugin "libpodstats" successfully loaded.
[2023-05-10 14:05:59] virt plugin: reader virt-0 initialized
[2023-05-10 14:05:59] Initialization complete, entering read-loop.
[2023-05-10 14:05:59] virt plugin: getting the CPU params count failed: Requested operation is not valid: cgroup CPUACCT controller is not mounted
[2023-05-10 14:05:59] virt plugin: getting the CPU params count failed: Requested operation is not valid: cgroup CPUACCT controller is not mounted
[2023-05-10 14:05:59] virt plugin: getting the CPU params count failed: Requested operation is not valid: cgroup CPUACCT controller is not mounted
```

Troubleshooting done so far:

- Disabling the virt plugin in the collectd config and restarting the collectd container clears the error from the collectd logs. Dashboard graphs in the "Infrastructure Node View" for the compute node where collectd was restarted started populating.
- Re-enabling the virt plugin and removing only the "pcpu" entry from the ExtraStats list ("pcpu cpu_util vcpupin vcpu memory disk disk_err disk_allocation disk_capacity disk_physical domain_state job_stats_background perf") also made the dashboard graphs in the "Infrastructure Node View" work for the compute node where collectd was restarted.

Update: performing the troubleshooting steps above makes the metrics show up from the compute node for only approximately 10 minutes.

Based on the information given in the customer case, I've updated this bugzilla title to be more appropriate. This looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2134557. The router_id is constructed here: https://github.com/openstack/puppet-qdr/blob/stable/train/manifests/init.pp#L181

*** Bug 2209038 has been marked as a duplicate of this bug. ***

We did a bit of testing and might have a workaround that we'll need to fully verify for multi-cloud deployments, but based on initial testing we should be able to override the `qdr::router_id` value and use the short hostname rather than the FQDN. Here is a set of notes from Yadnesh, who did the POC testing, along with a starting hiera value.
> ykulkarn: I was able to get the router id set to the short hostname using ExtraConfig:
>
>     qdr::router_id: "%{::hostname}"
>
> which is rendered as:
>
>     router {
>         mode: edge
>         id: ctrl-2-16-2
>         workerThreads: 4
>         debugDump: /var/log/qdrouterd
>         saslConfigPath: /etc/sasl2
>         saslConfigName: qdrouterd
>     }
>
> I also looked for something like a UUID/instance ID in hiera, but nothing like that exists. So the closest value I could find that is unique for every node is the short hostname.

> lmadsen: I would suppose we'd want to recommend someone do something like:
>
>     qdr::router_id: "%{::hostname}.cloud01"

TODO: fully verify what things look like in the QDR console when making these changes, along with verifying the possible extension of using a unique value on the end to make sure the RouterID is unique in multi-cloud deployment scenarios. Once we verify that, we should have enough information to provide a documentation enhancement.

Dropping priority here, as the support case is closed with the workaround provided, and RHOSP 17.1 has a rebased version of QDR (the metrics_qdr service container) which doesn't have the same router ID length problem.

This issue will require some environment setup, so I might try to work on this as part of some STF 1.5.3 release work, since I'll need an OSP environment at the same time.

I have tested this with my local script and everything works as intended. I have also created https://github.com/infrawatch/documentation/pull/487 for the STF documentation to show how to override. Changes landed upstream in the STF documentation. Moving to POST.

I'm moving this to MODIFIED because the documentation changes have landed upstream, and all helper scripts in STF testing are now using the new override method. Here is sample output from my RHOSP 17.1 deployment, which shows the new shorter names:
```
[stack@undercloud-0 ~]$ oc exec -it deployment/default-interconnect -- qdstat --connections
2023-09-26 15:45:26.757662 UTC
default-interconnect-7cc45b86c5-tflmn

Connections
  id    host                container                             role  dir  security                                authentication  tenant  last dlv      uptime
  =========================================================================================================================================================================
  43    10.128.0.116:51880  bridge-1ce                            edge  in   no-security                             anonymous-user          000:00:00:06  000:01:12:37
  45    10.128.0.117:51414  bridge-101                            edge  in   no-security                             anonymous-user          000:00:00:18  000:01:12:36
  51    10.129.0.241:60896  bridge-343                            edge  in   no-security                             anonymous-user          -             000:01:12:30
  52    10.129.0.240:33282  bridge-343                            edge  in   no-security                             anonymous-user          000:00:00:02  000:01:12:30
  58    10.129.0.242:53050  bridge-139                            edge  in   no-security                             anonymous-user          -             000:01:12:26
  2250  10.129.0.1:46320    compute-0.cloud1                      edge  in   TLSv1/SSLv3(DHE-RSA-AES256-GCM-SHA384)  anonymous-user          000:00:00:03  000:00:35:19
  2251  10.129.0.1:46324    compute-1.cloud1                      edge  in   TLSv1/SSLv3(DHE-RSA-AES256-GCM-SHA384)  anonymous-user          000:00:00:03  000:00:35:18
  2255  10.129.0.1:47522    controller-2.cloud1                   edge  in   TLSv1/SSLv3(DHE-RSA-AES256-GCM-SHA384)  anonymous-user          000:00:00:06  000:00:34:57
  2256  10.129.0.1:47530    controller-1.cloud1                   edge  in   TLSv1/SSLv3(DHE-RSA-AES256-GCM-SHA384)  anonymous-user          000:00:00:02  000:00:34:57
  2257  10.129.0.1:47548    controller-0.cloud1                   edge  in   TLSv1/SSLv3(DHE-RSA-AES256-GCM-SHA384)  anonymous-user          000:00:00:02  000:00:34:52
  2258  127.0.0.1:53048     cfd37f73-015b-4a90-b3de-4db276bf8eb6  normal  in  no-security                           no-auth                 000:00:00:00  000:00:00:00
```
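The override that produces these shorter names can be sketched as a hiera fragment using the standard tripleo `ExtraConfig` mechanism. This is a sketch only: `cloud1` is an example per-cloud suffix, and the exact placement within your stf-connectors.yaml environment file may differ from this layout.

```yaml
# Sketch of the qdr::router_id override discussed in this bug.
# Assumption: standard tripleo parameter_defaults/ExtraConfig hiera keys;
# "cloud1" is an example suffix to keep router IDs unique across clouds.
parameter_defaults:
  ExtraConfig:
    qdr::router_id: "%{::hostname}.cloud1"
```

The `%{::hostname}` hiera interpolation resolves to the node's short hostname, which keeps the resulting router ID well under the 62-character limit even with a per-cloud suffix appended.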
Moving this to the Service Telemetry Framework project for better tracking, and targeting the STF 1.5.3 release.

Moving to ON_QA as the changes have landed in the upstream documentation and are ready for verification.

Published in changes to documentation for STF 1.5.3. Currently being pushed into the system and should be available in a couple of hours or less.
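For reference, the length constraint behind this bug can be illustrated with a short, self-contained check. The hostnames and domain below are hypothetical examples, and the FQDN-derived default shown is an assumption based on the puppet-qdr manifest linked above; only the 62-character limit itself comes from this report.

```python
# Illustration of the router ID length limit described in this bug.
# Hypothetical names -- not values from the affected deployment.
MAX_ROUTER_ID_LEN = 62  # metrics_qdr fails with router IDs longer than this

def router_id_ok(router_id: str) -> bool:
    """Return True if the router ID fits within the limit."""
    return len(router_id) <= MAX_ROUTER_ID_LEN

# A default ID derived from a long overcloud FQDN can exceed the limit...
fqdn_id = "Router.dal-osc-novacomputeiha-0.internalapi.some-long-domain.example.com"
# ...while the short-hostname override with a cloud suffix stays well under it.
short_id = "dal-osc-novacomputeiha-0.cloud1"

print(len(fqdn_id), router_id_ok(fqdn_id))    # 72 False
print(len(short_id), router_id_ok(short_id))  # 31 True
```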