Bug 1264509

Summary: qdrouterd utilizes 120-150% cpu time
Product: Red Hat Satellite
Reporter: Stuart Auchterlonie <sauchter>
Component: Infrastructure
Assignee: Ted Ross <tross>
Status: CLOSED CURRENTRELEASE
QA Contact: Katello QA List <katello-qa-list>
Severity: high
Docs Contact:
Priority: unspecified
Version: 6.1.1
CC: bbuckingham, bkearney, bugzilla_rhn, chorn, cwelton, ddevra, pmoravec, tross, tscherf
Target Milestone: Unspecified
Keywords: Triaged
Target Release: Unused
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: qpid-dispatch-0.4-10
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-10-17 09:01:00 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Stuart Auchterlonie 2015-09-18 16:02:04 UTC
Description of problem:

After applying the workaround for https://bugzilla.redhat.com/show_bug.cgi?id=1249890, as documented in comment #1 of KCS article
https://access.redhat.com/solutions/1554243,
the qdrouterd process on the satellite can, after some time, be seen
consuming 120-150% CPU (according to top).
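
For anyone who wants to track the figure over time without watching top, here is a minimal Python sketch that derives a comparable percentage from /proc/<pid>/stat. The function names and the sampling approach are illustrative assumptions and were not part of the original investigation. Because the counters cover all threads of the process, a multi-threaded qdrouterd can exceed 100%.

```python
def cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may contain spaces,
    # so split on the last ')' before indexing the remaining fields.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is the state (field 3 in proc(5)), so utime (field 14)
    # is fields[11] and stime (field 15) is fields[12].
    return int(fields[11]) + int(fields[12])


def cpu_percent(ticks_before, ticks_after, interval_s, ticks_per_s):
    """CPU usage over the interval, where 100.0 means one fully busy core."""
    return 100.0 * (ticks_after - ticks_before) / (ticks_per_s * interval_s)


def read_ticks(pid):
    """Read the current CPU tick count for a process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return cpu_ticks(f.read())
```

To use it, call read_ticks() twice for the qdrouterd PID, a few seconds apart, and pass both values to cpu_percent() together with the interval and os.sysconf("SC_CLK_TCK").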


Version-Release number of selected component (if applicable):

qpid-dispatch-router-0.4-7.el7.x86_64


How reproducible:

Reproduced once, on the first attempt.


Steps to Reproduce:
1. Use iptables on the capsule to drop all port 5647 traffic to and from the satellite.
2. Allow the existing connections to die off on the capsule side.
I left mine running in this state overnight.

Actual results:

qdrouterd on the satellite hits 120-150% CPU time.

Expected results:

qdrouterd does not spin on cpu



Additional info:

pmoravec has discussed with tross and identified
https://issues.apache.org/jira/browse/DISPATCH-134

as a likely fix for this issue

Comment 2 Ted Ross 2015-09-18 16:13:46 UTC
(In reply to Stuart Auchterlonie from comment #0)
> 
> pmoravec has discussed with tross and identified
> https://issues.apache.org/jira/browse/DISPATCH-134
> 
> as a likely fix for this issue

The fix for DISPATCH-134 has already been back-ported into the qpid-dispatch-router-0.4-7 packages.  This must be a separate issue.

Comment 3 Pavel Moravec 2015-09-18 20:15:39 UTC
Very trivial reproducer:

1. Heartbeats are enabled in goferd.
2. Run a few capsule syncs / package installs / whatever, just to generate some traffic on the qdrouterd<->goferd connection.
3. Due to bz1264461, qdrouterd is left with several tens of AMQP connections (the number depends on the number of syncs done; the more the better for the reproducer, 20 is enough).
4. On every such connection, qdrouterd sends "heartbeats" (empty AMQP frames, I think) every second. goferd responds to none of them, because it abandoned the connections (but didn't close them).
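
To make step 4 concrete, here is a small Python sketch of what an empty AMQP 1.0 frame looks like on the wire, going by the framing layout in the AMQP 1.0 specification. It has not been checked against a capture of qdrouterd traffic, and the helper name is hypothetical.

```python
import struct

# AMQP 1.0 frame header: 4-byte size, 1-byte data offset (in 4-byte words),
# 1-byte frame type, 2-byte channel. An empty frame is only this header.
EMPTY_FRAME = struct.pack(">IBBH", 8, 2, 0, 0)


def is_empty_frame(data):
    """Return True if data starts with an AMQP frame that has no body."""
    if len(data) < 8:
        return False
    size, doff, frame_type, _channel = struct.unpack(">IBBH", data[:8])
    # Type 0 is an AMQP frame; an empty body means size equals header length.
    return frame_type == 0 and size == doff * 4
```

On an AMQPS connection these bytes are wrapped in TLS, so they are only visible after decryption or on the router side.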

Nothing more is required (i.e. no iptables trick) :-/ Just have sufficiently many AMQPS connections (SSL might play a role; I can verify whether it does) where the client responds at the TCP level only.
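
One way to check whether enough connections have piled up is to count established TCP connections on the router port. The sketch below parses text in the /proc/net/tcp format. It assumes port 5647 (the port named in the original reproduction steps), and the function name is illustrative.

```python
AMQPS_PORT = 5647        # port from the original steps to reproduce
TCP_ESTABLISHED = 0x01   # state code used in /proc/net/tcp


def count_established(proc_net_tcp_text, port=AMQPS_PORT):
    """Count ESTABLISHED connections whose local port matches `port`."""
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # local_address is HEXIP:HEXPORT; the state is a hex code.
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        state = int(fields[3], 16)
        if local_port == port and state == TCP_ESTABLISHED:
            count += 1
    return count
```

On the satellite this could be fed the contents of /proc/net/tcp (and /proc/net/tcp6 for IPv6 listeners, which uses the same column layout).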

Comment 4 Pavel Moravec 2015-09-18 20:22:09 UTC
Just a hint:

How expensive is the gettimeofday function? I noticed this backtrace in gdb several times:


#0  0x00007ffffd0d5ddf in gettimeofday ()
#1  0x00007f2577a94a5e in pn_i_now () at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:159
#2  0x00007f2577a9593a in qdpn_connector_process (c=c@entry=0x7f2558004240) at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:735
#3  0x00007f2577a9f3dc in process_connector (cxtr=0x7f2558004240, qd_server=0x1680e50)
    at /usr/src/debug/qpid-dispatch-0.4/src/server.c:324
#4  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:622
#5  0x00007f2577611df5 in start_thread (arg=0x7f25689ed700) at pthread_create.c:308
#6  0x00007f2576b6d1ad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Comment 5 Pavel Moravec 2015-09-26 12:46:09 UTC
This is _not_ fixed by bz1264518 / in qpid-dispatch-router-0.4-9.

Reproducer:
Use the script from [1] and wait 10 seconds. Just a very few "abandoned" connections with heartbeats can cause this.

[1] https://issues.apache.org/jira/browse/PROTON-1000?focusedCommentId=14909238&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14909238

Comment 6 Pavel Moravec 2015-10-17 09:01:00 UTC
Closing this BZ as it should be fixed in Satellite 6.1.3, due to "Fixed In Version: qpid-dispatch-0.4-10". That package version is in 6.1.3 errata [1].


[1] https://access.redhat.com/errata/RHBA-2015:1911