Bug 2134557 - [STF] metrics_qdr pods consume 100% of one cpu core and does not push any metrics to STF
Summary: [STF] metrics_qdr pods consume 100% of one cpu core and does not push any met...
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: Service Telemetry Framework
Version: 16.2 (Train)
Hardware: x86_64
OS: Linux
low
high
Target Milestone: z4
: 16.2 (Train on RHEL 8.4)
Assignee: Leif Madsen
QA Contact: Leonid Natapov
mgeary
URL:
Whiteboard:
Depends On: 2136302
Blocks:
TreeView+ depends on / blocked
 
Reported: 2022-10-13 15:30 UTC by jpeyrard
Modified: 2023-05-23 14:24 UTC (History)
8 users (show)

Fixed In Version:
Doc Type: Known Issue
Doc Text:
You can cause framing errors if you configure an ID value longer than 62 characters for the metrics_qdr service. An example error message is 'failed: amqp:connection:framing-error connection aborted'. When the metrics_qdr service is unstable, no telemetry data flows to Service Telemetry Framework (STF). Workaround: Do not set the metrics_qdr ID value longer than 62 characters. The default value for the router ID is set to 'Router.fqdn', where 'fqdn' is the fully-qualified domain name of the node.
Clone Of:
Environment:
Last Closed: 2023-01-24 20:47:38 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-19383 0 None None None 2022-10-13 15:31:30 UTC
Red Hat Knowledge Base (Solution) 6980503 0 None None None 2022-10-17 20:48:44 UTC

Description jpeyrard 2022-10-13 15:30:37 UTC
Description of problem:

We have no metric on the Openshift Prometheus dashboard for 3xOpenStack controller node.
All OpenStack compute node are able to send metrics to the Openshift Prometheus dashboard but none of the three controller are able to do it.
Looking at the "metrics_qdr" pod running on the three controller node, we see the "qdrouterd" process consume 100% of 1xCPU core in user land.

Version-Release number of selected component (if applicable):

Kernel     : 4.18.0-305.57.1.el8_4.x86_64
metric_qdr : quay.apps.mgmt1.mde1.XXXX.YYYYY.ZZZZ.AAAA.BBBB/openstack/qdrouterd:16.2



How reproducible:

If we reboot one OSP controller, we still see the same behavior.
qdrouterd goes to 100%.



Steps to Reproduce:
N/A

Actual results:

No metrics on prometheus dashboard on the openshift cluster for the 3xcontroller.
Metrics on prometheus dashboard are visible for all compute node of the same openstack cluster.


Expected results:

Metric from the 3xcontroller are available.

Additional info:

Here we can see one process per controller consuming 100% of cpu from sosreport :

$ grep -h sbin/qdrouterd *-sosreport-controller-?-03326725-2022-10-11-*/*/sos_commands/process/ps_auxwww
42465     125572 99.3  0.3 618336 244256 ?       Sl   Oct05 8694:30 /usr/sbin/qdrouterd -c /etc/qpid-dispatch/qdrouterd.conf
42465      24735 99.3  0.3 609024 244804 ?       Sl   Oct05 8698:22 /usr/sbin/qdrouterd -c /etc/qpid-dispatch/qdrouterd.conf
42465     849188 99.4  0.1 1027804 89284 ?       Sl   Oct10 1740:41 /usr/sbin/qdrouterd -c /etc/qpid-dispatch/qdrouterd.conf

We see internal connection from ceilometer or collectd client to qdrouterd process failling with failed: amqp:connection:framing-error connection aborted

If we try a manual connection using "qdstat" command from any of the three controller we get the following result :

[root@controller-0 ~]# sudo podman exec -it metrics_qdr qdstat --bus=192.168.251.16:5666 --connections
Timeout: Connection amqp://192.168.251.16:5666/$management timed out: Opening link 6c78330d-6813-41b2-893d-1db189a43683-$management

And logs of qdrouterd are filled with :

$ grep  amqp:connection:framing-error 0160-sosreport-controller-2-03326725-2022-10-11-iantxij.tar.xz/sosreport-controller-2-03326725-2022-10-11-iantxij/var/log/containers//metrics_qdr/metrics_qdr.log  | tail -3
2022-10-11 16:37:58.289340 +0000 SERVER (info) [C7568] Connection from 192.168.251.18:45416 (to 192.168.251.18:5666) failed: amqp:connection:framing-error connection aborted
2022-10-11 16:39:19.432773 +0000 SERVER (info) [C7572] Connection from 192.168.251.18:59880 (to 192.168.251.18:5666) failed: amqp:connection:framing-error connection aborted
2022-10-11 16:41:00.738640 +0000 SERVER (info) [C7573] Connection from 192.168.251.18:52244 (to 192.168.251.18:5666) failed: amqp:connection:framing-error connection aborted

We see that TCP connection are able to be in ESTABLISHED, so nothing is blocking on TCP/IP side
For example, we have the qdrouterd client side to OCP and ceilometer and collectd connected to qdrouterd

$ grep -E '(qdrouterd|45448|49788|45440|45452)' 0160-sosreport-controller-2-03326725-2022-10-11-iantxij.tar.xz/sosreport-controller-2-03326725-2022-10-11-iantxij/sos_commands/networking/netstat_-W_-neopa | sort -k 4
tcp        0      0 10.142.22.18:50688      10.142.30.130:443       ESTABLISHED 42465      367721150  125572/qdrouterd     off (0.00/0/0)
tcp        0      0 192.168.251.18:45440    192.168.251.18:5666     ESTABLISHED 0          369032911  872490/collectd      off (0.00/0/0)
tcp        0      0 192.168.251.18:45448    192.168.251.18:5666     ESTABLISHED 42409      369040673  872513/collectd-sen  keepalive (1.74/0/0)
tcp        0      0 192.168.251.18:45452    192.168.251.18:5666     ESTABLISHED 42409      369040674  872513/collectd-sen  keepalive (1.74/0/0)
tcp        0      0 192.168.251.18:49788    192.168.251.18:5666     ESTABLISHED 42405      223800939  131496/ceilometer-a  off (0.00/0/0)
tcp        0      0 192.168.251.18:5666     0.0.0.0:*               LISTEN      42465      203775263  125572/qdrouterd     off (0.00/0/0)
tcp        0      0 192.168.251.18:5666     192.168.251.18:45440    ESTABLISHED 42465      369034138  125572/qdrouterd     off (0.00/0/0)
tcp        0      0 192.168.251.18:5666     192.168.251.18:45448    ESTABLISHED 42465      369034141  125572/qdrouterd     off (0.00/0/0)
tcp        0      0 192.168.251.18:5666     192.168.251.18:45452    ESTABLISHED 42465      369034143  125572/qdrouterd     off (0.00/0/0)
tcp        0      0 192.168.251.18:5666     192.168.251.18:49788    ESTABLISHED 42465      223817857  125572/qdrouterd     off (0.00/0/0)

On the openshift side the "qdrouterd" process run in the "default-interconnect" pods in the "service-telemetry" namespace.
We can see "amqp:connection:framing-error", but they are quite difficult to match to the client side error.

$ grep "amqp:connection:framing-error" 0040-service-telemetry.tar.gz/inspect.local.5556256313638590178/namespaces/service-telemetry/pods/default-interconnect-6455cfbcc6-xxjqn/default-interconnect/default-interconnect/logs/current.log
2022-10-03T16:29:49.197534214+02:00 2022-10-03 14:29:49.197424 +0000 SERVER (info) [C6] Connection from 10.128.0.1:56848 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error
2022-10-03T16:29:49.197608534+02:00 2022-10-03 14:29:49.197527 +0000 SERVER (info) [C12] Connection from 10.128.0.1:56860 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error
2022-10-03T16:29:49.197661458+02:00 2022-10-03 14:29:49.197605 +0000 SERVER (info) [C18] Connection from 10.128.0.1:56872 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error
2022-10-03T16:29:49.197720532+02:00 2022-10-03 14:29:49.197666 +0000 SERVER (info) [C20] Connection from 10.128.0.1:56876 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error
2022-10-03T16:29:49.200231695+02:00 2022-10-03 14:29:49.200184 +0000 SERVER (info) [C4] Connection from 10.128.0.1:56846 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error
2022-10-03T16:29:49.201779462+02:00 2022-10-03 14:29:49.201725 +0000 SERVER (info) [C15] Connection from 10.128.0.1:56866 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error
2022-10-03T16:29:49.202030342+02:00 2022-10-03 14:29:49.201986 +0000 SERVER (info) [C11] Connection from 10.128.0.1:56858 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error
2022-10-03T16:29:49.202250444+02:00 2022-10-03 14:29:49.202206 +0000 SERVER (info) [C10] Connection from 10.128.0.1:56856 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error
2022-10-03T16:29:49.204471001+02:00 2022-10-03 14:29:49.204423 +0000 SERVER (info) [C14] Connection from 10.128.0.1:56864 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error
2022-10-03T16:29:49.206614403+02:00 2022-10-03 14:29:49.206586 +0000 SERVER (info) [C3] Connection from 10.128.0.1:56840 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error
2022-10-03T16:29:49.207063106+02:00 2022-10-03 14:29:49.207018 +0000 SERVER (info) [C19] Connection from 10.128.0.1:56874 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error
2022-10-03T16:29:49.207490161+02:00 2022-10-03 14:29:49.207460 +0000 SERVER (info) [C8] Connection from 10.128.0.1:56852 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error
2022-10-03T16:29:49.210344158+02:00 2022-10-03 14:29:49.210314 +0000 SERVER (info) [C16] Connection from 10.128.0.1:56868 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error
2022-10-03T16:29:49.210713052+02:00 2022-10-03 14:29:49.210688 +0000 SERVER (info) [C17] Connection from 10.128.0.1:56870 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error
2022-10-03T16:29:49.212937406+02:00 2022-10-03 14:29:49.212907 +0000 SERVER (info) [C2] Connection from 10.128.0.1:56838 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error
2022-10-03T16:29:49.213396263+02:00 2022-10-03 14:29:49.213358 +0000 SERVER (info) [C5] Connection from 10.128.0.1:56844 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error
2022-10-03T16:29:49.214186660+02:00 2022-10-03 14:29:49.214134 +0000 SERVER (info) [C9] Connection from 10.128.0.1:56854 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error
2022-10-03T16:29:49.215128109+02:00 2022-10-03 14:29:49.215100 +0000 SERVER (info) [C21] Connection from 10.128.0.1:56878 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error
2022-10-03T16:29:49.217424559+02:00 2022-10-03 14:29:49.217384 +0000 SERVER (info) [C13] Connection from 10.128.0.1:56862 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error
2022-10-03T16:29:49.217424559+02:00 2022-10-03 14:29:49.217410 +0000 SERVER (info) [C7] Connection from 10.128.0.1:56850 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error

Comment 3 jpeyrard 2022-10-14 14:25:00 UTC
Here is a way to reproduce the behavior 100% of the time.
Any 62 character length id will reproduce it.


# cat /etc/qpid-dispatch/qdrouterd.conf
router {
    mode: edge
    id: 0000000000000000000000000000000000000000000000000000000000000
    workerThreads: 6
}
listener {
    host: 192.168.1.42
    port: 5667
}
address {
    prefix: unicast
    distribution: closest
}

# qdstat --bus=192.168.1.42:5667
Timeout: Connection amqp://192.168.1.42:5667 timed out: Closing connection

top - 16:23:41 up 31 days,  4:12,  4 users,  load average: 0.26, 0.56, 0.61
Tasks: 176 total,   1 running, 174 sleeping,   1 stopped,   0 zombie
%Cpu(s): 25.2 us,  0.1 sy,  0.0 ni, 74.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   3731.3 total,    540.1 free,   1238.6 used,   1952.6 buff/cache
MiB Swap:   4060.0 total,   4039.1 free,     20.9 used.   2147.8 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                               
 112210 root      20   0  544268  20576  10276 S  99.7   0.5   0:04.06 qdrouterd

So qrouterd will spin with a 62 character in "router" "id" he does not like that.
Let me see from where it came.

Comment 4 jpeyrard 2022-10-14 16:27:38 UTC
Testing multiple release I discover the following

qpid-dispatch-1.14.0 has issue
qpid-dispatch-1.17.0 has issue
qpid-dispatch-1.18.0 has issue
qpid-dispatch-1.19.0 works fine
qpid-dispatch "github master" works fine


perf give us this overview 

Samples: 105K of event 'cycles', 4000 Hz, Event count (approx.): 3943553532 lost: 0/0 drop: 0/0                                                               
Overhead  Shared Object                                     Symbol                                                                                            
  36.19%  qdrouterd                                         [.] qd_iterator_octet
  18.69%  qdrouterd                                         [.] qd_iterator_remove_trailing_separator
   7.16%  qdrouterd                                         [.] qd_iterator_prefix
   4.70%  qdrouterd                                         [.] qd_iterator_reset_view.part.6

From gdb, the second thread is looping in the following function : 

(gdb) bt
#0  0x0000000000445f53 in qd_iterator_octet (iter=iter@entry=0x7fffe4032d88) at /root/build/qpid-dispatch-1.18.0/src/iterator.c:739
#1  0x00000000004465ad in qd_iterator_prefix (iter=iter@entry=0x7fffe4032d88, prefix=<optimized out>) at /root/build/qpid-dispatch-1.18.0/src/iterator.c:885
#2  0x00000000004469d8 in parse_address_view (iter=0x7fffe4032d88) at /root/build/qpid-dispatch-1.18.0/src/iterator.c:191
#3  view_initialize (iter=0x7fffe4032d88) at /root/build/qpid-dispatch-1.18.0/src/iterator.c:374
#4  qd_iterator_reset_view (iter=iter@entry=0x7fffe4032d88, view=view@entry=ITER_VIEW_ADDRESS_HASH) at /root/build/qpid-dispatch-1.18.0/src/iterator.c:670
#5  0x0000000000446af9 in qd_iterator_reset_view (view=ITER_VIEW_ADDRESS_HASH, iter=0x7fffe4032d88) at /root/build/qpid-dispatch-1.18.0/src/iterator.c:667
#6  qd_iterator_string (text=text@entry=0x7ffff229c960 "amqp:/_edge/", '0' <repeats 61 times>, "/temp.CIQP4TkQap1BE51",
    view=view@entry=ITER_VIEW_ADDRESS_HASH) at /root/build/qpid-dispatch-1.18.0/src/iterator.c:597
#7  0x000000000047d4ff in qdr_lookup_terminus_address_CT (core=0x895490, dir=dir@entry=QD_OUTGOING, conn=conn@entry=0x7fffe401f588,
    terminus=terminus@entry=0x7fffe4040548, link_route=link_route@entry=0x7ffff229caa5, unavailable=unavailable@entry=0x7ffff229caa6,
    core_endpoint=0x7ffff229caa7, fallback=0x7ffff229caa8, accept_dynamic=true, create_if_not_found=true)
    at /root/build/qpid-dispatch-1.18.0/src/router_core/modules/address_lookup_client/lookup_client.c:198
#8  0x000000000047ddc0 in qcm_addr_lookup_CT (context=0x867e90, conn=0x7fffe401f588, link=0x7fffe404b448, dir=QD_OUTGOING, source=0x7fffe4040548,
    target=0x7fffe4040648) at /root/build/qpid-dispatch-1.18.0/src/router_core/modules/address_lookup_client/lookup_client.c:570
#9  0x0000000000462239 in qdr_link_inbound_first_attach_CT (core=0x895490, action=<optimized out>, discard=<optimized out>)
    at /root/build/qpid-dispatch-1.18.0/src/router_core/connections.c:1865
#10 0x0000000000472b4b in router_core_thread (arg=0x895490) at /root/build/qpid-dispatch-1.18.0/src/router_core/router_core_thread.c:236
#11 0x00007ffff77531cf in start_thread (arg=<optimized out>) at pthread_create.c:479
#12 0x00007ffff6a71dd3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Comment 12 Leif Madsen 2023-01-24 20:47:38 UTC
Closing this issue as this is a known issue with the version of metrics_qdr deployed with RHOSP 16.2. The workaround to avoid this issue is to limit the Router ID to be less than 62 characters. A fix is implemented in AMQ Interconnect 1.19.0 and is targeting RHOSP 17.1 at https://bugzilla.redhat.com/show_bug.cgi?id=2136302

For more information, see the RHOSP 16.2 release notes at https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/release_notes/index#known_issues_5


Note You need to log in before you can comment on or make changes to this bug.