1264509 – qdrouterd utilizes 120-150% cpu time

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Satellite
Classification:	Red Hat
Component:	Infrastructure
Sub Component:
Version:	6.1.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	Unspecified
Assignee:	Ted Ross
QA Contact:	Katello QA List
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2015-09-18 16:02 UTC by Stuart Auchterlonie
Modified:	2019-09-12 08:56 UTC
CC List:	9 users
Fixed In Version:	qpid-dispatch-0.4-10
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2015-10-17 09:01:00 UTC
Target Upstream Version:
Embargoed:

Description Stuart Auchterlonie 2015-09-18 16:02:04 UTC

Description of problem:

After applying the workaround for https://bugzilla.redhat.com/show_bug.cgi?id=1249890 (as documented in comment #1 of KCS article
https://access.redhat.com/solutions/1554243), the qdrouterd process on the Satellite can, after some time, be seen
consuming 120-150% CPU (according to top).
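
For anyone who wants to track the figure over time rather than watch top, here is a minimal Python sketch of the same calculation. It assumes a Linux /proc filesystem and uses top's default convention, where 100% means one fully used core (so 120-150% means more than one thread is busy). The function names are illustrative only and are not part of any Satellite tooling.

```python
import os
import time


def cpu_percent(ticks_before, ticks_after, elapsed_seconds, ticks_per_second):
    """CPU usage over an interval, where 100.0 means one fully used core."""
    used_seconds = (ticks_after - ticks_before) / ticks_per_second
    return 100.0 * used_seconds / elapsed_seconds


def read_cpu_ticks(pid):
    """Return utime + stime (in clock ticks) from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # The command name is wrapped in parentheses and may contain spaces,
    # so only split the part after the last closing parenthesis.
    fields = stat.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(fields[11]) + int(fields[12])


def sample(pid, interval=5.0):
    """Sample one process twice and return its CPU usage in percent."""
    hz = os.sysconf("SC_CLK_TCK")
    before = read_cpu_ticks(pid)
    time.sleep(interval)
    after = read_cpu_ticks(pid)
    return cpu_percent(before, after, interval, hz)
```

Calling sample(<pid of qdrouterd>) in a loop and logging the result would show when the process starts to spin.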


Version-Release number of selected component (if applicable):

qpid-dispatch-router-0.4-7.el7.x86_64


How reproducible:

Reproduced once, in a single attempt.


Steps to Reproduce:
1. Use iptables on the capsule to drop all port 5647 traffic to and from the Satellite.
2. Allow the existing connections to die off on the capsule side.
   I left mine running in this state overnight.

Actual results:

Observe qdrouterd on satellite hitting 120-150% cpu time

Expected results:

qdrouterd does not spin on cpu



Additional info:

pmoravec has discussed with tross and identified
https://issues.apache.org/jira/browse/DISPATCH-134

as a likely fix for this issue

Comment 2 Ted Ross 2015-09-18 16:13:46 UTC

(In reply to Stuart Auchterlonie from comment #0)
> 
> pmoravec has discussed with tross and identified
> https://issues.apache.org/jira/browse/DISPATCH-134
> 
> as a likely fix for this issue

The fix for DISPATCH-134 has already been back-ported into the qpid-dispatch-router-0.4-7 packages.  This must be a separate issue.

Comment 3 Pavel Moravec 2015-09-18 20:15:39 UTC

Very trivial reproducer:

1. Enable heartbeats in goferd.
2. Run a few capsule syncs / package installs / whatever, just to generate some traffic on the qdrouterd<->goferd connection.
3. Due to bz1264461, qdrouterd is left with several tens of AMQP connections (the number depends on the number of syncs done; the more the better for the reproducer, 20 is enough).
4. On every such connection, qdrouterd sends "heartbeats" (empty AMQP frames, I think) every second. goferd responds to none of them, as it has abandoned the connections (but didn't close them).

Nothing more is required (i.e. no iptables trick) :-/ Just have sufficiently many AMQPS connections (SSL might play a role; I can verify whether it does) where the client responds at the TCP level only.
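
To check how many such connections have piled up, the following Python sketch counts ESTABLISHED sockets on a given local port in text taken from /proc/net/tcp (or /proc/net/tcp6). Port 5647 is taken from the original description; that the abandoned goferd connections sit on that port is an assumption here, and the function name is made up for illustration.

```python
def count_established(proc_net_tcp_text, local_port):
    """Count ESTABLISHED sockets on local_port in /proc/net/tcp-style text."""
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # local_address looks like "0100007F:160F": hex address, hex port.
        port = int(fields[1].rsplit(":", 1)[1], 16)
        state = fields[3]
        if port == local_port and state == "01":  # 01 = TCP_ESTABLISHED
            count += 1
    return count
```

On the Satellite, passing the contents of /proc/net/tcp and /proc/net/tcp6 with local_port=5647 should give a rough count; a number that keeps growing after syncs would match the behaviour described above.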

Comment 4 Pavel Moravec 2015-09-18 20:22:09 UTC

Just a hint:

How expensive is the gettimeofday function? I noticed this backtrace in gdb several times:


#0  0x00007ffffd0d5ddf in gettimeofday ()
#1  0x00007f2577a94a5e in pn_i_now () at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:159
#2  0x00007f2577a9593a in qdpn_connector_process (c=c@entry=0x7f2558004240) at /usr/src/debug/qpid-dispatch-0.4/src/posix/driver.c:735
#3  0x00007f2577a9f3dc in process_connector (cxtr=0x7f2558004240, qd_server=0x1680e50)
    at /usr/src/debug/qpid-dispatch-0.4/src/server.c:324
#4  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:622
#5  0x00007f2577611df5 in start_thread (arg=0x7f25689ed700) at pthread_create.c:308
#6  0x00007f2576b6d1ad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
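
As a rough way to put a number on the cost of a clock read, here is a small Python micro-benchmark sketch. It is only a stand-in: Python's time.time() is not the same call path as pn_i_now() in the backtrace, and the measurement includes interpreter loop overhead, so the result is an upper bound at best. On Linux x86_64 such clock reads are typically served from the vDSO without a real system call, so a single call is usually cheap; seeing it repeatedly in backtraces may say more about how often the surrounding loop runs than about the call itself. That reading is an inference, not something confirmed in this bug.

```python
import time


def mean_call_cost_ns(clock, calls=200_000):
    """Average wall-clock cost of one clock() call, in nanoseconds.

    Includes the overhead of the Python loop, so treat it as an upper bound.
    """
    start = time.perf_counter_ns()
    for _ in range(calls):
        clock()
    elapsed = time.perf_counter_ns() - start
    return elapsed / calls
```

For example, mean_call_cost_ns(time.time) returns the average per-call cost on the machine it runs on.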

Comment 5 Pavel Moravec 2015-09-26 12:46:09 UTC

This is _not_ fixed by bz1264518 / in qpid-dispatch-router-0.4-9.

Reproducer:
Use the script from [1] and wait 10 seconds. Just a very few "abandoned" connections with heartbeats can cause this.

[1] https://issues.apache.org/jira/browse/PROTON-1000?focusedCommentId=14909238&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14909238

Comment 6 Pavel Moravec 2015-10-17 09:01:00 UTC

Closing this BZ as it should be fixed in Satellite 6.1.3, per "Fixed In Version: qpid-dispatch-0.4-10". That package version is in the 6.1.3 errata [1].


[1] https://access.redhat.com/errata/RHBA-2015:1911
