Bug 2134557
| Summary: | [STF] metrics_qdr pods consume 100% of one cpu core and does not push any metrics to STF | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | jpeyrard |
| Component: | Service Telemetry Framework | Assignee: | Leif Madsen <lmadsen> |
| Status: | CLOSED CANTFIX | QA Contact: | Leonid Natapov <lnatapov> |
| Severity: | high | Docs Contact: | mgeary <mgeary> |
| Priority: | low | ||
| Version: | 16.2 (Train) | CC: | augol, joflynn, kgiusti, lmadsen, mmagr, mrunge, mzheng, pveiga |
| Target Milestone: | z4 | Keywords: | Triaged |
| Target Release: | 16.2 (Train on RHEL 8.4) | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Known Issue | |
| Doc Text: |
You can cause framing errors if you configure an ID value longer than 62 characters for the metrics_qdr service. An example error message is 'failed: amqp:connection:framing-error connection aborted'. When the metrics_qdr service is unstable, no telemetry data flows to Service Telemetry Framework (STF).
Workaround: Do not set the metrics_qdr ID value longer than 62 characters. The default value for the router ID is set to 'Router.fqdn', where 'fqdn' is the fully-qualified domain name of the node.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2023-01-24 20:47:38 UTC | Type: | Bug |
| Bug Depends On: | 2136302 | ||
| Bug Blocks: | |||
|
Description
jpeyrard
2022-10-13 15:30:37 UTC
Here is a way to reproduce the behavior 100% of the time.
Any id that is 62 characters long will reproduce it.
# cat /etc/qpid-dispatch/qdrouterd.conf
router {
mode: edge
id: 0000000000000000000000000000000000000000000000000000000000000
workerThreads: 6
}
listener {
host: 192.168.1.42
port: 5667
}
address {
prefix: unicast
distribution: closest
}
# qdstat --bus=192.168.1.42:5667
Timeout: Connection amqp://192.168.1.42:5667 timed out: Closing connection
top - 16:23:41 up 31 days, 4:12, 4 users, load average: 0.26, 0.56, 0.61
Tasks: 176 total, 1 running, 174 sleeping, 1 stopped, 0 zombie
%Cpu(s): 25.2 us, 0.1 sy, 0.0 ni, 74.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 3731.3 total, 540.1 free, 1238.6 used, 1952.6 buff/cache
MiB Swap: 4060.0 total, 4039.1 free, 20.9 used. 2147.8 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
112210 root 20 0 544268 20576 10276 S 99.7 0.5 0:04.06 qdrouterd
So qdrouterd spins when the "id" in the "router" section is 62 characters long; it does not handle that.
Let me see where this comes from.
Testing multiple releases, I found the following:
qpid-dispatch-1.14.0 has issue
qpid-dispatch-1.17.0 has issue
qpid-dispatch-1.18.0 has issue
qpid-dispatch-1.19.0 works fine
qpid-dispatch "github master" works fine
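For triage, the results above can be captured in a small lookup. This is only a sketch: the TESTED table is copied from the list above, while the function names and the rule for untested versions (anything older than 1.19.0 is treated as affected) are assumptions made for illustration, not something confirmed by the testing.

```python
# Results copied from the list above: True means the CPU spin was reproduced.
TESTED = {
    "1.14.0": True,
    "1.17.0": True,
    "1.18.0": True,
    "1.19.0": False,
}

# First release observed to work in the tests above.
FIRST_GOOD = (1, 19, 0)


def parse_version(version):
    """Turn '1.18.0' into (1, 18, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def likely_affected(version):
    """Return the tested result if there is one, otherwise a guess.

    Assumption: untested releases older than 1.19.0 behave like the
    tested ones (1.14.0, 1.17.0, 1.18.0). This was not verified.
    """
    if version in TESTED:
        return TESTED[version]
    return parse_version(version) < FIRST_GOOD


if __name__ == "__main__":
    for candidate in ("1.14.0", "1.16.0", "1.19.0"):
        print(candidate, "affected" if likely_affected(candidate) else "ok")
```

The "github master" build has no version number, so it is not covered by this helper.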
perf gives us this overview:
Samples: 105K of event 'cycles', 4000 Hz, Event count (approx.): 3943553532 lost: 0/0 drop: 0/0
Overhead Shared Object Symbol
36.19% qdrouterd [.] qd_iterator_octet
18.69% qdrouterd [.] qd_iterator_remove_trailing_separator
7.16% qdrouterd [.] qd_iterator_prefix
4.70% qdrouterd [.] qd_iterator_reset_view.part.6
From gdb, the second thread is looping in the following function:
(gdb) bt
#0 0x0000000000445f53 in qd_iterator_octet (iter=iter@entry=0x7fffe4032d88) at /root/build/qpid-dispatch-1.18.0/src/iterator.c:739
#1 0x00000000004465ad in qd_iterator_prefix (iter=iter@entry=0x7fffe4032d88, prefix=<optimized out>) at /root/build/qpid-dispatch-1.18.0/src/iterator.c:885
#2 0x00000000004469d8 in parse_address_view (iter=0x7fffe4032d88) at /root/build/qpid-dispatch-1.18.0/src/iterator.c:191
#3 view_initialize (iter=0x7fffe4032d88) at /root/build/qpid-dispatch-1.18.0/src/iterator.c:374
#4 qd_iterator_reset_view (iter=iter@entry=0x7fffe4032d88, view=view@entry=ITER_VIEW_ADDRESS_HASH) at /root/build/qpid-dispatch-1.18.0/src/iterator.c:670
#5 0x0000000000446af9 in qd_iterator_reset_view (view=ITER_VIEW_ADDRESS_HASH, iter=0x7fffe4032d88) at /root/build/qpid-dispatch-1.18.0/src/iterator.c:667
#6 qd_iterator_string (text=text@entry=0x7ffff229c960 "amqp:/_edge/", '0' <repeats 61 times>, "/temp.CIQP4TkQap1BE51",
view=view@entry=ITER_VIEW_ADDRESS_HASH) at /root/build/qpid-dispatch-1.18.0/src/iterator.c:597
#7 0x000000000047d4ff in qdr_lookup_terminus_address_CT (core=0x895490, dir=dir@entry=QD_OUTGOING, conn=conn@entry=0x7fffe401f588,
terminus=terminus@entry=0x7fffe4040548, link_route=link_route@entry=0x7ffff229caa5, unavailable=unavailable@entry=0x7ffff229caa6,
core_endpoint=0x7ffff229caa7, fallback=0x7ffff229caa8, accept_dynamic=true, create_if_not_found=true)
at /root/build/qpid-dispatch-1.18.0/src/router_core/modules/address_lookup_client/lookup_client.c:198
#8 0x000000000047ddc0 in qcm_addr_lookup_CT (context=0x867e90, conn=0x7fffe401f588, link=0x7fffe404b448, dir=QD_OUTGOING, source=0x7fffe4040548,
target=0x7fffe4040648) at /root/build/qpid-dispatch-1.18.0/src/router_core/modules/address_lookup_client/lookup_client.c:570
#9 0x0000000000462239 in qdr_link_inbound_first_attach_CT (core=0x895490, action=<optimized out>, discard=<optimized out>)
at /root/build/qpid-dispatch-1.18.0/src/router_core/connections.c:1865
#10 0x0000000000472b4b in router_core_thread (arg=0x895490) at /root/build/qpid-dispatch-1.18.0/src/router_core/router_core_thread.c:236
#11 0x00007ffff77531cf in start_thread (arg=<optimized out>) at pthread_create.c:479
#12 0x00007ffff6a71dd3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Closing this issue as this is a known issue with the version of metrics_qdr deployed with RHOSP 16.2. The workaround is to limit the Router ID to fewer than 62 characters. A fix is implemented in AMQ Interconnect 1.19.0 and is targeted for RHOSP 17.1 (see https://bugzilla.redhat.com/show_bug.cgi?id=2136302 ). For more information, see the RHOSP 16.2 release notes at https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/release_notes/index#known_issues_5
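
To apply the workaround before deploying, the router ID can be checked up front. The following is a minimal sketch, not part of STF or qpid-dispatch: the function names are hypothetical, the parser only handles the simple "key: value" layout shown in the qdrouterd.conf above, and the threshold follows the "fewer than 62 characters" wording of the workaround. The statements in this report differ slightly on the exact boundary, so the limit is a parameter.

```python
import re

# Assumption: IDs of this length or longer are treated as risky, following
# the workaround wording ("fewer than 62 characters"). Adjust as needed.
RISKY_ID_LENGTH = 62


def default_router_id(fqdn):
    """Default ID format described in the Doc Text: 'Router.' + FQDN."""
    return f"Router.{fqdn}"


def find_router_id(conf_text):
    """Return the id from the first 'router { ... }' section, or None."""
    section = re.search(r"\brouter\s*\{(.*?)\}", conf_text, re.DOTALL)
    if section is None:
        return None
    match = re.search(r"^\s*id\s*:\s*(\S+)\s*$", section.group(1), re.MULTILINE)
    return match.group(1) if match else None


def is_risky_id(router_id, limit=RISKY_ID_LENGTH):
    """True when the ID is at or above the length treated as risky."""
    return len(router_id) >= limit


if __name__ == "__main__":
    with open("/etc/qpid-dispatch/qdrouterd.conf") as handle:
        router_id = find_router_id(handle.read())
    if router_id is None:
        print("no router id configured")
    elif is_risky_id(router_id):
        print(f"router id is {len(router_id)} characters long: shorten it")
    else:
        print(f"router id is {len(router_id)} characters long: ok")
```

Because the default ID is 'Router.' followed by the FQDN, the prefix already uses 7 characters, so with this threshold any FQDN of 55 characters or more would be flagged.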