Bug 1264509 - qdrouterd utilizes 120-150% cpu time
Status: CLOSED CURRENTRELEASE
Product: Red Hat Satellite 6
Classification: Red Hat
Component: Infrastructure
Version: 6.1.1
Hardware: Unspecified  OS: Unspecified
Priority: unspecified  Severity: high
Target Milestone: 6.1.4
Target Release: --
Assigned To: Ted Ross
QA Contact: Katello QA List
Keywords: Triaged
Depends On:
Blocks:
Reported: 2015-09-18 12:02 EDT by Stuart Auchterlonie
Modified: 2017-02-23 14:46 EST

See Also:
Fixed In Version: qpid-dispatch-0.4-10
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-10-17 05:01:00 EDT
Type: Bug


Description Stuart Auchterlonie 2015-09-18 12:02:04 EDT
Description of problem:

After applying the workaround for https://bugzilla.redhat.com/show_bug.cgi?id=1249890, as documented in comment #1 of KCS article
https://access.redhat.com/solutions/1554243,
the qdrouterd process on the satellite can, after some time, be seen
consuming 120-150% CPU (according to top).


Version-Release number of selected component (if applicable):

qpid-dispatch-router-0.4-7.el7.x86_64


How reproducible:

Reproduced once, on the first attempt.


Steps to Reproduce:
1. Use iptables on the capsule to drop all port 5647 traffic to and from the satellite.
2. Allow the existing connections to die off on the capsule side.
   (I left mine running in this state overnight.)
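
Step 1 can be sketched as follows. This is only an illustration: the report does not give the exact rules used, so the chain names, the choice of matching on the satellite's address, and the address 192.0.2.10 are all assumptions (the sketch assumes the satellite is the side listening on port 5647). The helper only builds the command lines; it does not run them.

```python
from typing import List


def drop_rules(satellite_ip: str, port: int = 5647) -> List[List[str]]:
    """Build iptables commands (hypothetical helper) that drop TCP traffic
    between this host and the satellite's port, in both directions."""
    return [
        # Outgoing packets from the capsule to satellite:port
        ["iptables", "-A", "OUTPUT", "-p", "tcp",
         "-d", satellite_ip, "--dport", str(port), "-j", "DROP"],
        # Incoming packets from satellite:port to the capsule
        ["iptables", "-A", "INPUT", "-p", "tcp",
         "-s", satellite_ip, "--sport", str(port), "-j", "DROP"],
    ]


# 192.0.2.10 is a documentation address standing in for the satellite.
for rule in drop_rules("192.0.2.10"):
    print(" ".join(rule))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; on a real capsule they would have to be run as root.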

Actual results:

Observe qdrouterd on satellite hitting 120-150% cpu time
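
A figure above 100% in top means the process is using more than one core's worth of time, summed over its threads. As a rough cross-check of top's number, the same figure can be derived from /proc/<pid>/stat on Linux. The sketch below is an assumption-laden illustration, not part of the original report: the function names are made up, and the sampling interval is arbitrary.

```python
import os
import time


def cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may contain spaces,
    # so only parse what follows the last ')'.
    fields = stat_line[stat_line.rfind(")") + 2:].split()
    # fields[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(fields[11]) + int(fields[12])


def cpu_percent(ticks_before: int, ticks_after: int,
                interval_s: float, ticks_per_s: int) -> float:
    """CPU usage over the interval, where 100.0 means one fully busy core."""
    return 100.0 * (ticks_after - ticks_before) / (ticks_per_s * interval_s)


def sample(pid: int, interval_s: float = 5.0) -> float:
    """Sample a live process twice and return its CPU percentage."""
    path = "/proc/%d/stat" % pid
    with open(path) as f:
        before = cpu_ticks(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = cpu_ticks(f.read())
    return cpu_percent(before, after, interval_s, os.sysconf("SC_CLK_TCK"))

# Usage (pid of qdrouterd obtained separately, e.g. with pidof):
#   print(sample(12345))
```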

Expected results:

qdrouterd does not spin on cpu



Additional info:

pmoravec has discussed with tross and identified
https://issues.apache.org/jira/browse/DISPATCH-134

as a likely fix for this issue
Comment 2 Ted Ross 2015-09-18 12:13:46 EDT
(In reply to Stuart Auchterlonie from comment #0)
> 
> pmoravec has discussed with tross and identified
> https://issues.apache.org/jira/browse/DISPATCH-134
> 
> as a likely fix for this issue

The fix for DISPATCH-134 has already been back-ported into the qpid-dispatch-router-0.4-7 packages.  This must be a separate issue.
Comment 3 Pavel Moravec 2015-09-18 16:15:39 EDT
Very trivial reproducer:

1. Have heartbeats enabled in goferd.
2. Run a few capsule syncs / package installs / whatever, just to generate some traffic on the qdrouterd<->goferd connection.
3. Due to bz1264461, qdrouterd is left with several tens of AMQP connections (the number depends on the # of syncs done; the more the better for the reproducer, 20 is enough).
4. On every such connection, qdrouterd sends "heartbeats" (empty AMQP frames, I think) every second. goferd responds to none of them, as it has abandoned the connections (but didn't close them).

Nothing more is required (i.e. no iptables trick) :-/ Just have sufficiently many AMQPS connections (SSL might play a role, I can verify if so) where the client responds on the TCP level only.
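
For reference, the "heartbeats" in step 4 are guessed above to be empty AMQP frames. In AMQP 1.0 a frame with no body consists of the 8-byte header only. The sketch below shows what such a frame looks like on the wire; the function names are made up for illustration, and this is not code from qdrouterd or goferd.

```python
import struct


def empty_frame() -> bytes:
    """Build an AMQP 1.0 frame with no body (header only)."""
    # size = 8 bytes total, doff = 2 (body starts after 2 * 4 bytes),
    # type = 0 (AMQP frame), channel = 0.
    return struct.pack(">IBBH", 8, 2, 0, 0)


def is_empty_frame(data: bytes) -> bool:
    """True if data starts with an AMQP-type frame that carries no body."""
    if len(data) < 8:
        return False
    size, doff, frame_type, _channel = struct.unpack(">IBBH", data[:8])
    # No body means the frame ends exactly where the body would start.
    return frame_type == 0 and doff >= 2 and size == doff * 4


print(empty_frame().hex())
```

With one such frame per second per connection, 20 abandoned connections amount to only 160 bytes per second, so the traffic volume alone would not seem to account for the CPU usage.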
Comment 4 Pavel Moravec 2015-09-18 16:22:09 EDT
Just a hint:

How expensive is the gettimeofday function? I noticed it in gdb several times:


#0  0x00007ffffd0d5ddf in gettimeofday ()
#1  0x00007f2577a94a5e in pn_i_now () at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:159
#2  0x00007f2577a9593a in qdpn_connector_process (c=c@entry=0x7f2558004240) at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:735
#3  0x00007f2577a9f3dc in process_connector (cxtr=0x7f2558004240, qd_server=0x1680e50)
    at /usr/src/debug/qpid-dispatch-0.4/src/server.c:324
#4  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:622
#5  0x00007f2577611df5 in start_thread (arg=0x7f25689ed700) at pthread_create.c:308
#6  0x00007f2576b6d1ad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
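
One way to put a rough number on the cost of a single clock read is a micro-benchmark like the one below. It is a sketch under clear assumptions: it times Python's time.time(), not the C gettimeofday() call from the backtrace, so interpreter overhead is included and the result is only an upper bound. If a single read turns out to be cheap, catching it repeatedly in gdb would more likely mean that it is being called very often than that each call is slow.

```python
import time


def clock_read_cost_ns(calls: int = 200_000) -> float:
    """Average cost in nanoseconds of one wall-clock read via time.time()."""
    start = time.perf_counter_ns()
    for _ in range(calls):
        time.time()
    elapsed = time.perf_counter_ns() - start
    # Includes Python loop and call overhead, so this overestimates
    # the cost of the underlying C library call.
    return elapsed / calls


print("about %.0f ns per clock read" % clock_read_cost_ns())
```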
Comment 5 Pavel Moravec 2015-09-26 08:46:09 EDT
This is _not_ fixed by bz1264518 / in qpid-dispatch-router-0.4-9.

Reproducer:
Use the script from [1] and wait 10 seconds. Just a very few "abandoned" connections with heartbeats are enough to cause this.

[1] https://issues.apache.org/jira/browse/PROTON-1000?focusedCommentId=14909238&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14909238
Comment 6 Pavel Moravec 2015-10-17 05:01:00 EDT
Closing this BZ as it should be fixed in Satellite 6.1.3, per "Fixed In Version: qpid-dispatch-0.4-10". That package version is in the 6.1.3 errata [1].


[1] https://access.redhat.com/errata/RHBA-2015:1911
