Description of problem: We have no metric on the Openshift Prometheus dashboard for 3xOpenStack controller node. All OpenStack compute node are able to send metrics to the Openshift Prometheus dashboard but none of the three controller are able to do it. Looking at the "metrics_qdr" pod running on the three controller node, we see the "qdrouterd" process consume 100% of 1xCPU core in user land. Version-Release number of selected component (if applicable): Kernel : 4.18.0-305.57.1.el8_4.x86_64 metric_qdr : quay.apps.mgmt1.mde1.XXXX.YYYYY.ZZZZ.AAAA.BBBB/openstack/qdrouterd:16.2 How reproducible: If we reboot one OSP controller, we still see the same behavior. qdrouterd goes to 100%. Steps to Reproduce: N/A Actual results: No metrics on prometheus dashboard on the openshift cluster for the 3xcontroller. Metrics on prometheus dashboard are visible for all compute node of the same openstack cluster. Expected results: Metric from the 3xcontroller are available. Additional info: Here we can see one process per controller consuming 100% of cpu from sosreport : $ grep -h sbin/qdrouterd *-sosreport-controller-?-03326725-2022-10-11-*/*/sos_commands/process/ps_auxwww 42465 125572 99.3 0.3 618336 244256 ? Sl Oct05 8694:30 /usr/sbin/qdrouterd -c /etc/qpid-dispatch/qdrouterd.conf 42465 24735 99.3 0.3 609024 244804 ? Sl Oct05 8698:22 /usr/sbin/qdrouterd -c /etc/qpid-dispatch/qdrouterd.conf 42465 849188 99.4 0.1 1027804 89284 ? Sl Oct10 1740:41 /usr/sbin/qdrouterd -c /etc/qpid-dispatch/qdrouterd.conf We see internal connection from ceilometer or collectd client to qdrouterd process failling with failed: amqp:connection:framing-error connection aborted If we try a manual connection using "qdstat" command from any of the three controller we get the following result : [root@controller-0 ~]# sudo podman exec -it metrics_qdr qdstat --bus=192.168.251.16:5666 --connections Timeout: Connection amqp://192.168.251.16:5666/$management timed out: Opening link 6c78330d-6813-41b2-893d-1db189a43683-$management And logs of qdrouterd are filled with : $ grep amqp:connection:framing-error 0160-sosreport-controller-2-03326725-2022-10-11-iantxij.tar.xz/sosreport-controller-2-03326725-2022-10-11-iantxij/var/log/containers//metrics_qdr/metrics_qdr.log | tail -3 2022-10-11 16:37:58.289340 +0000 SERVER (info) [C7568] Connection from 192.168.251.18:45416 (to 192.168.251.18:5666) failed: amqp:connection:framing-error connection aborted 2022-10-11 16:39:19.432773 +0000 SERVER (info) [C7572] Connection from 192.168.251.18:59880 (to 192.168.251.18:5666) failed: amqp:connection:framing-error connection aborted 2022-10-11 16:41:00.738640 +0000 SERVER (info) [C7573] Connection from 192.168.251.18:52244 (to 192.168.251.18:5666) failed: amqp:connection:framing-error connection aborted We see that TCP connection are able to be in ESTABLISHED, so nothing is blocking on TCP/IP side For example, we have the qdrouterd client side to OCP and ceilometer and collectd connected to qdrouterd $ grep -E '(qdrouterd|45448|49788|45440|45452)' 0160-sosreport-controller-2-03326725-2022-10-11-iantxij.tar.xz/sosreport-controller-2-03326725-2022-10-11-iantxij/sos_commands/networking/netstat_-W_-neopa | sort -k 4 tcp 0 0 10.142.22.18:50688 10.142.30.130:443 ESTABLISHED 42465 367721150 125572/qdrouterd off (0.00/0/0) tcp 0 0 192.168.251.18:45440 192.168.251.18:5666 ESTABLISHED 0 369032911 872490/collectd off (0.00/0/0) tcp 0 0 192.168.251.18:45448 192.168.251.18:5666 ESTABLISHED 42409 369040673 872513/collectd-sen keepalive (1.74/0/0) tcp 0 0 192.168.251.18:45452 192.168.251.18:5666 ESTABLISHED 42409 369040674 872513/collectd-sen keepalive (1.74/0/0) tcp 0 0 192.168.251.18:49788 192.168.251.18:5666 ESTABLISHED 42405 223800939 131496/ceilometer-a off (0.00/0/0) tcp 0 0 192.168.251.18:5666 0.0.0.0:* LISTEN 42465 203775263 125572/qdrouterd off (0.00/0/0) tcp 0 0 192.168.251.18:5666 192.168.251.18:45440 ESTABLISHED 42465 369034138 125572/qdrouterd off (0.00/0/0) tcp 0 0 192.168.251.18:5666 192.168.251.18:45448 ESTABLISHED 42465 369034141 125572/qdrouterd off (0.00/0/0) tcp 0 0 192.168.251.18:5666 192.168.251.18:45452 ESTABLISHED 42465 369034143 125572/qdrouterd off (0.00/0/0) tcp 0 0 192.168.251.18:5666 192.168.251.18:49788 ESTABLISHED 42465 223817857 125572/qdrouterd off (0.00/0/0) On the openshift side the "qdrouterd" process run in the "default-interconnect" pods in the "service-telemetry" namespace. We can see "amqp:connection:framing-error", but they are quite difficult to match to the client side error. $ grep "amqp:connection:framing-error" 0040-service-telemetry.tar.gz/inspect.local.5556256313638590178/namespaces/service-telemetry/pods/default-interconnect-6455cfbcc6-xxjqn/default-interconnect/default-interconnect/logs/current.log 2022-10-03T16:29:49.197534214+02:00 2022-10-03 14:29:49.197424 +0000 SERVER (info) [C6] Connection from 10.128.0.1:56848 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error 2022-10-03T16:29:49.197608534+02:00 2022-10-03 14:29:49.197527 +0000 SERVER (info) [C12] Connection from 10.128.0.1:56860 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error 2022-10-03T16:29:49.197661458+02:00 2022-10-03 14:29:49.197605 +0000 SERVER (info) [C18] Connection from 10.128.0.1:56872 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error 2022-10-03T16:29:49.197720532+02:00 2022-10-03 14:29:49.197666 +0000 SERVER (info) [C20] Connection from 10.128.0.1:56876 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error 2022-10-03T16:29:49.200231695+02:00 2022-10-03 14:29:49.200184 +0000 SERVER (info) [C4] Connection from 10.128.0.1:56846 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error 2022-10-03T16:29:49.201779462+02:00 2022-10-03 14:29:49.201725 +0000 SERVER (info) [C15] Connection from 10.128.0.1:56866 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error 2022-10-03T16:29:49.202030342+02:00 2022-10-03 14:29:49.201986 +0000 SERVER (info) [C11] Connection from 10.128.0.1:56858 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error 2022-10-03T16:29:49.202250444+02:00 2022-10-03 14:29:49.202206 +0000 SERVER (info) [C10] Connection from 10.128.0.1:56856 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error 2022-10-03T16:29:49.204471001+02:00 2022-10-03 14:29:49.204423 +0000 SERVER (info) [C14] Connection from 10.128.0.1:56864 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error 2022-10-03T16:29:49.206614403+02:00 2022-10-03 14:29:49.206586 +0000 SERVER (info) [C3] Connection from 10.128.0.1:56840 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error 2022-10-03T16:29:49.207063106+02:00 2022-10-03 14:29:49.207018 +0000 SERVER (info) [C19] Connection from 10.128.0.1:56874 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error 2022-10-03T16:29:49.207490161+02:00 2022-10-03 14:29:49.207460 +0000 SERVER (info) [C8] Connection from 10.128.0.1:56852 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error 2022-10-03T16:29:49.210344158+02:00 2022-10-03 14:29:49.210314 +0000 SERVER (info) [C16] Connection from 10.128.0.1:56868 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error 2022-10-03T16:29:49.210713052+02:00 2022-10-03 14:29:49.210688 +0000 SERVER (info) [C17] Connection from 10.128.0.1:56870 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error 2022-10-03T16:29:49.212937406+02:00 2022-10-03 14:29:49.212907 +0000 SERVER (info) [C2] Connection from 10.128.0.1:56838 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error 2022-10-03T16:29:49.213396263+02:00 2022-10-03 14:29:49.213358 +0000 SERVER (info) [C5] Connection from 10.128.0.1:56844 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error 2022-10-03T16:29:49.214186660+02:00 2022-10-03 14:29:49.214134 +0000 SERVER (info) [C9] Connection from 10.128.0.1:56854 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error 2022-10-03T16:29:49.215128109+02:00 2022-10-03 14:29:49.215100 +0000 SERVER (info) [C21] Connection from 10.128.0.1:56878 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error 2022-10-03T16:29:49.217424559+02:00 2022-10-03 14:29:49.217384 +0000 SERVER (info) [C13] Connection from 10.128.0.1:56862 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error 2022-10-03T16:29:49.217424559+02:00 2022-10-03 14:29:49.217410 +0000 SERVER (info) [C7] Connection from 10.128.0.1:56850 (to 0.0.0.0:5671) failed: amqp:connection:framing-error SSL Failure: Unknown error
Here is a way to reproduce the behavior 100% of the time. Any 62 character length id will reproduce it. # cat /etc/qpid-dispatch/qdrouterd.conf router { mode: edge id: 0000000000000000000000000000000000000000000000000000000000000 workerThreads: 6 } listener { host: 192.168.1.42 port: 5667 } address { prefix: unicast distribution: closest } # qdstat --bus=192.168.1.42:5667 Timeout: Connection amqp://192.168.1.42:5667 timed out: Closing connection top - 16:23:41 up 31 days, 4:12, 4 users, load average: 0.26, 0.56, 0.61 Tasks: 176 total, 1 running, 174 sleeping, 1 stopped, 0 zombie %Cpu(s): 25.2 us, 0.1 sy, 0.0 ni, 74.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st MiB Mem : 3731.3 total, 540.1 free, 1238.6 used, 1952.6 buff/cache MiB Swap: 4060.0 total, 4039.1 free, 20.9 used. 2147.8 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 112210 root 20 0 544268 20576 10276 S 99.7 0.5 0:04.06 qdrouterd So qrouterd will spin with a 62 character in "router" "id" he does not like that. Let me see from where it came.
Testing multiple release I discover the following qpid-dispatch-1.14.0 has issue qpid-dispatch-1.17.0 has issue qpid-dispatch-1.18.0 has issue qpid-dispatch-1.19.0 works fine qpid-dispatch "github master" works fine perf give us this overview Samples: 105K of event 'cycles', 4000 Hz, Event count (approx.): 3943553532 lost: 0/0 drop: 0/0 Overhead Shared Object Symbol 36.19% qdrouterd [.] qd_iterator_octet 18.69% qdrouterd [.] qd_iterator_remove_trailing_separator 7.16% qdrouterd [.] qd_iterator_prefix 4.70% qdrouterd [.] qd_iterator_reset_view.part.6 From gdb, the second thread is looping in the following function : (gdb) bt #0 0x0000000000445f53 in qd_iterator_octet (iter=iter@entry=0x7fffe4032d88) at /root/build/qpid-dispatch-1.18.0/src/iterator.c:739 #1 0x00000000004465ad in qd_iterator_prefix (iter=iter@entry=0x7fffe4032d88, prefix=<optimized out>) at /root/build/qpid-dispatch-1.18.0/src/iterator.c:885 #2 0x00000000004469d8 in parse_address_view (iter=0x7fffe4032d88) at /root/build/qpid-dispatch-1.18.0/src/iterator.c:191 #3 view_initialize (iter=0x7fffe4032d88) at /root/build/qpid-dispatch-1.18.0/src/iterator.c:374 #4 qd_iterator_reset_view (iter=iter@entry=0x7fffe4032d88, view=view@entry=ITER_VIEW_ADDRESS_HASH) at /root/build/qpid-dispatch-1.18.0/src/iterator.c:670 #5 0x0000000000446af9 in qd_iterator_reset_view (view=ITER_VIEW_ADDRESS_HASH, iter=0x7fffe4032d88) at /root/build/qpid-dispatch-1.18.0/src/iterator.c:667 #6 qd_iterator_string (text=text@entry=0x7ffff229c960 "amqp:/_edge/", '0' <repeats 61 times>, "/temp.CIQP4TkQap1BE51", view=view@entry=ITER_VIEW_ADDRESS_HASH) at /root/build/qpid-dispatch-1.18.0/src/iterator.c:597 #7 0x000000000047d4ff in qdr_lookup_terminus_address_CT (core=0x895490, dir=dir@entry=QD_OUTGOING, conn=conn@entry=0x7fffe401f588, terminus=terminus@entry=0x7fffe4040548, link_route=link_route@entry=0x7ffff229caa5, unavailable=unavailable@entry=0x7ffff229caa6, core_endpoint=0x7ffff229caa7, fallback=0x7ffff229caa8, accept_dynamic=true, create_if_not_found=true) at /root/build/qpid-dispatch-1.18.0/src/router_core/modules/address_lookup_client/lookup_client.c:198 #8 0x000000000047ddc0 in qcm_addr_lookup_CT (context=0x867e90, conn=0x7fffe401f588, link=0x7fffe404b448, dir=QD_OUTGOING, source=0x7fffe4040548, target=0x7fffe4040648) at /root/build/qpid-dispatch-1.18.0/src/router_core/modules/address_lookup_client/lookup_client.c:570 #9 0x0000000000462239 in qdr_link_inbound_first_attach_CT (core=0x895490, action=<optimized out>, discard=<optimized out>) at /root/build/qpid-dispatch-1.18.0/src/router_core/connections.c:1865 #10 0x0000000000472b4b in router_core_thread (arg=0x895490) at /root/build/qpid-dispatch-1.18.0/src/router_core/router_core_thread.c:236 #11 0x00007ffff77531cf in start_thread (arg=<optimized out>) at pthread_create.c:479 #12 0x00007ffff6a71dd3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Closing this issue as this is a known issue with the version of metrics_qdr deployed with RHOSP 16.2. The workaround to avoid this issue is to limit the Router ID to be less than 62 characters. A fix is implemented in AMQ Interconnect 1.19.0 and is targeting RHOSP 17.1 at https://bugzilla.redhat.com/show_bug.cgi?id=2136302 For more information, see the RHOSP 16.2 release notes at https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/release_notes/index#known_issues_5