Bug 2208020
| Summary: | metrics_qdr not sending messages with router_id longer than 62 chars | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Bill Farmer <bill.farmer> |
| Component: | documentation | Assignee: | Leif Madsen <lmadsen> |
| Status: | ASSIGNED --- | QA Contact: | Alex Yefimov <ayefimov> |
| Severity: | low | Docs Contact: | mgeary <mgeary> |
| Priority: | medium | | |
| Version: | 16.2 (Train) | CC: | bmclaren, chrisbro, cjanisze, dhill, jbadiapa, jjoyce, jschluet, lhh, lmadsen, mariel, mburns, mgarciac, mlecki, mrunge, shrjoshi |
| Target Milestone: | z2 | Keywords: | Documentation, Triaged |
| Target Release: | 17.1 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Bill Farmer
2023-05-17 17:14:21 UTC
A few extra details about the OSP environment:

1. This is an OSP 16.2.4 release with STF 1.5 running on OCP 4.12.
2. Instance HA is used on the compute nodes.
3. The collectd container on the compute nodes does start and loads all the plugins, but it repeatedly logs only this one error:

```
[root@dal-osc-novacomputeiha-0 ~]# cat /var/log/containers/collectd/collectd.log
[2023-05-10 14:05:59] plugin_load: plugin "amqp1" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "cpu" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "df" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "disk" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "exec" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "hugepages" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "interface" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "load" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "memory" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "python" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "unixsock" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "uptime" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "virt" successfully loaded.
[2023-05-10 14:05:59] plugin_load: plugin "vmem" successfully loaded.
[2023-05-10 14:05:59] UNKNOWN plugin: plugin_get_interval: Unable to determine Interval from context.
[2023-05-10 14:05:59] plugin_load: plugin "libpodstats" successfully loaded.
[2023-05-10 14:05:59] virt plugin: reader virt-0 initialized
[2023-05-10 14:05:59] Initialization complete, entering read-loop.
[2023-05-10 14:05:59] virt plugin: getting the CPU params count failed: Requested operation is not valid: cgroup CPUACCT controller is not mounted
[2023-05-10 14:05:59] virt plugin: getting the CPU params count failed: Requested operation is not valid: cgroup CPUACCT controller is not mounted
[2023-05-10 14:05:59] virt plugin: getting the CPU params count failed: Requested operation is not valid: cgroup CPUACCT controller is not mounted
```

Troubleshooting done so far:

- Disabling the virt plugin in the collectd config and restarting the collectd container clears the error from the collectd logs. Dashboard graphs in "Infrastructure Node View" for the compute node where collectd was restarted started populating.
- We put the virt plugin back on and removed only the "pcpu" ExtraStat from ExtraStats "pcpu cpu_util vcpupin vcpu memory disk disk_err disk_allocation disk_capacity disk_physical domain_state job_stats_background perf". This also made the dashboard graphs in "Infrastructure Node View" work for the compute node where collectd was restarted (see the collectd config sketch after this comment thread).

Update: doing the troubleshooting steps above makes the metrics show up from the compute node for only approximately 10 minutes.

Based on the information given in the customer case, I've updated this bugzilla title to be more appropriate.

It looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2134557

The router_id is constructed here: https://github.com/openstack/puppet-qdr/blob/stable/train/manifests/init.pp#L181

*** Bug 2209038 has been marked as a duplicate of this bug. ***

We did a bit of testing and might have a workaround that we'll need to fully verify for multi-cloud deployments, but based on initial testing we should be able to override the `qdr::router_id` value and use the short hostname rather than the FQDN.
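Going back to the collectd troubleshooting above for a moment: the virt-plugin change described there would correspond roughly to a collectd configuration like the following. This is only an illustrative sketch, not the file the collectd container actually renders; the ExtraStats list is the one from the troubleshooting notes with "pcpu" removed.

```
LoadPlugin virt

<Plugin virt>
  Connection "qemu:///system"
  # Same ExtraStats list as in the troubleshooting notes above, with "pcpu" dropped
  ExtraStats "cpu_util vcpupin vcpu memory disk disk_err disk_allocation disk_capacity disk_physical domain_state job_stats_background perf"
</Plugin>
```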
Here is a set of notes from Yadnesh, who did some POC testing and provided a starting hiera value.

> ykulkarn: I was able to get the router id set to the short hostname using ExtraConfig:
>
>     qdr::router_id: "%{::hostname}"
>
> which is rendered as:
>
>     router {
>         mode: edge
>         id: ctrl-2-16-2
>         workerThreads: 4
>         debugDump: /var/log/qdrouterd
>         saslConfigPath: /etc/sasl2
>         saslConfigName: qdrouterd
>     }
>
> I also looked for something like a UUID/instance ID in hiera, but nothing like that exists, so the closest value I could find that is unique for every node is the short hostname.

> lmadsen: I would suppose we'd want to recommend someone do something like:
>
>     qdr::router_id: "%{::hostname}.cloud01"

TODO: fully verify what things look like in the QDR console when making these changes, along with verifying the possible extension of appending a unique value to make sure the RouterID is unique in multi-cloud deployment scenarios. Once we verify that, we should have enough information to provide a documentation enhancement.

Dropping priority here, as the support case is closed with the workaround provided, and RHOSP 17.1 has a rebased version of QDR (the metrics_qdr service container) which doesn't have the same router ID length problem. This issue will require some environment setup, so I might try to work on it as part of some STF 1.5.3 release work, since I'll need an OSP environment at the same time.
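For reference, a minimal sketch of how the workaround above could be expressed as a TripleO environment file. The file name and the `cloud01` suffix are placeholders, and the exact recommendation is still pending the verification described in the TODO above.

```yaml
# qdr-router-id.yaml (hypothetical file name)
parameter_defaults:
  ExtraConfig:
    # Use the short hostname plus a per-cloud suffix instead of the FQDN,
    # so the router id stays well under the 62-character limit while
    # remaining unique across clouds.
    qdr::router_id: "%{::hostname}.cloud01"
```

An environment file like this would presumably be passed to the deployment with an additional `-e qdr-router-id.yaml` on the `openstack overcloud deploy` command line.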