Red Hat Bugzilla – Bug 1264509
qdrouterd utilizes 120-150% cpu time
Last modified: 2017-02-23 14:46:01 EST
Description of problem:
After applying the workaround for https://bugzilla.redhat.com/show_bug.cgi?id=1249890, as documented in comment #1 of KCS article
the qdrouterd process on the satellite can, after some time, be seen
consuming 120-150% CPU (according to top).
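A figure above 100% in top normally means the process is keeping more than one core busy, since top by default reports per-process CPU relative to a single core. As a rough cross-check that does not rely on top, the same number can be derived from two samples of /proc/<pid>/stat. This is only a sketch: the helper names (cpu_ticks, cpu_percent, sample) and the 5 second interval are illustrative assumptions, not something taken from this report.

```python
import os
import time


def cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may contain spaces,
    # so only the part after the last ')' is split into fields.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is the process state (field 3 of the file); utime and stime
    # are fields 14 and 15, which are indexes 11 and 12 here.
    return int(fields[11]) + int(fields[12])


def cpu_percent(ticks_before, ticks_after, interval_s, ticks_per_s):
    """CPU usage as a percentage of one core over the sampling interval."""
    return 100.0 * (ticks_after - ticks_before) / (ticks_per_s * interval_s)


def sample(pid, interval_s=5.0):
    """Read /proc/<pid>/stat twice and return the CPU usage in percent."""
    path = "/proc/%d/stat" % pid
    ticks_per_s = os.sysconf("SC_CLK_TCK")
    with open(path) as f:
        before = cpu_ticks(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = cpu_ticks(f.read())
    return cpu_percent(before, after, interval_s, ticks_per_s)
```

Calling sample() with the pid of qdrouterd on the satellite should give a value in the same range as top if the process really is spinning.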
Version-Release number of selected component (if applicable):
How reproducible:
I've reproduced it once from one try.
Steps to Reproduce:
1. Use iptables on capsule to drop all port 5647 traffic to and from the satellite
2. Allow the existing ongoing connections to die off on the capsule side.
I left mine running in this state overnight
Actual results:
Observe qdrouterd on satellite hitting 120-150% cpu time
Expected results:
qdrouterd does not spin on cpu
pmoravec has discussed with tross and identified
as a likely fix for this issue
(In reply to Stuart Auchterlonie from comment #0)
> pmoravec has discussed with tross and identified
> as a likely fix for this issue
The fix for DISPATCH-134 has already been back-ported into the qpid-dispatch-router-0.4-7 packages. This must be a separate issue.
Very trivial reproducer:
1. Enable heartbeats in goferd.
2. Run a few capsule syncs / package installs / whatever, just to generate some traffic on the qdrouterd<->goferd connection.
3. Due to bz1264461, qdrouterd is left with several tens of AMQP connections (the number depends on the # of syncs done; the more the better for the reproducer, 20 is enough).
4. On every such connection, qdrouterd sends "heartbeats" (empty AMQP frames, I think) every second. goferd responds to none of them, as it has abandoned the connections (but didn't close them).
Nothing more is required (i.e. no iptables trick) :-/ Just have sufficiently many AMQPS connections (SSL might play a role, I can verify if so) where the client responds on the TCP level only.
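To check whether enough leftover connections have piled up for the reproducer, one option is to count ESTABLISHED sockets on the router port. The sketch below parses text in the /proc/net/tcp format. It assumes the connections terminate on local port 5647 (the port mentioned in the original description) and that they are IPv4; /proc/net/tcp6 uses the same column layout. The function name count_established is made up for illustration.

```python
ESTABLISHED = "01"  # TCP state code for ESTABLISHED in /proc/net/tcp


def count_established(proc_net_tcp_text, local_port=5647):
    """Count ESTABLISHED sockets whose local port equals local_port."""
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        cols = line.split()
        if len(cols) < 4:
            continue
        # cols[1] is the local "HEXADDR:HEXPORT", cols[3] is the state code.
        port = int(cols[1].rsplit(":", 1)[1], 16)
        if port == local_port and cols[3] == ESTABLISHED:
            count += 1
    return count
```

On a live system the input would be the contents of /proc/net/tcp, for example count_established(open("/proc/net/tcp").read()). According to the steps above, a result of around 20 or more should be enough.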
Just a hint:
How expensive is the gettimeofday function? I noticed this backtrace in gdb several times:
#0 0x00007ffffd0d5ddf in gettimeofday ()
#1 0x00007f2577a94a5e in pn_i_now () at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:159
#2 0x00007f2577a9593a in qdpn_connector_process (c=c@entry=0x7f2558004240) at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:735
#3 0x00007f2577a9f3dc in process_connector (cxtr=0x7f2558004240, qd_server=0x1680e50)
#4 thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:622
#5 0x00007f2577611df5 in start_thread (arg=0x7f25689ed700) at pthread_create.c:308
#6 0x00007f2576b6d1ad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
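Catching the same frame a few times by hand is only a hint. Repeatedly landing in one function tends to say more about how often it is called than about the cost of a single call. A slightly more systematic variant is to collect many backtraces and tally the innermost frame of each. The sketch below does that for saved gdb output; the function name tally_top_frames and the idea of saving the samples to text are assumptions for illustration, not part of the analysis done in this bug.

```python
import re
from collections import Counter

# Matches an innermost gdb frame such as
# "#0 0x00007ffffd0d5ddf in gettimeofday ()" or "#0 main () at x.c:1".
FRAME0 = re.compile(r"^#0\s+(?:0x[0-9a-fA-F]+\s+in\s+)?([^\s(]+)")


def tally_top_frames(gdb_output):
    """Count how often each function appears as frame #0 in gdb output."""
    counts = Counter()
    for line in gdb_output.splitlines():
        m = FRAME0.match(line.strip())
        if m:
            counts[m.group(1)] += 1
    return counts
```

If most samples of the busy threads end in gettimeofday via qdpn_connector_process, that would support the suspicion that the connector is being processed in a tight loop.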
This is _not_ fixed by bz1264518 / in qpid-dispatch-router-0.4-9.
Use script from  and wait 10 seconds - just a very few "abandoned" connections with heartbeats can cause this.
Closing this BZ as it should be fixed in Satellite 6.1.3, due to "Fixed In Version: qpid-dispatch-0.4-10". That package version is in the 6.1.3 errata.