1295957 – goferd taking 100% CPU after successful reconnect to qdrouterd after a longer time

Red Hat Satellite engineering is moving the tracking of its product development work on Satellite to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "Satellite project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs will be migrated starting at the end of May. If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "Satellite project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/SAT-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1295957 - goferd taking 100% CPU after successful reconnect to qdrouterd after a longer time

Summary: goferd taking 100% CPU after successful reconnect to qdrouterd after a longer...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Satellite
Classification:	Red Hat
Component:	katello-agent
Sub Component:
Version:	6.1.5
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	high with 3 votes
Target Milestone:	Unspecified
Assignee:	Katello Bug Bin
QA Contact:	Sachin Ghai
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1122832 1301103
TreeView+	depends on / blocked

Reported:	2016-01-05 22:14 UTC by Pavel Moravec
Modified:	2019-12-16 05:15 UTC (History)
CC List:	27 users (show)
Fixed In Version:	qpid-proton-0.9-12
Doc Type:	Bug Fix
Doc Text:	Cause: proton reactor didn't clean up internal pipes correctly when a connection was reset Consequence: on reconnect, the reactor thread could get into a loop, causing high cpu Fix: the pipes are now cleaned up properly Result: and the cpu now does not spin on reconnect
Clone Of:
Clones:	1301103 (view as bug list)
Environment:
Last Closed:	2016-01-28 07:46:59 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Apache JIRA	PROTON-1090	None	None	None	2016-01-07 07:46:38 UTC
Red Hat Knowledge Base (Solution)	2113371	None	None	None	2016-01-06 09:04:05 UTC
Red Hat Product Errata	RHBA-2016:0077	normal	SHIPPED_LIVE	Satellite Tools 6.1 bug fix update	2016-01-28 12:45:28 UTC

Description Pavel Moravec 2016-01-05 22:14:48 UTC

Description of problem:
Having qdrouterd shut down for some time (such that goferd retries several times to reconnect) and starting it again, goferd connects to it properly. BUT there is a chance it _will_ start consume 100% CPU after a while (within 1 minute since successful reconnect).

The bug is supposed to be in proton reactor - see below for analysis.


Version-Release number of selected component (if applicable):
qpid-proton-c-0.9-11.el7.x86_64
python-gofer-2.6.8-1.el7sat.noarch
python-gofer-proton-2.6.8-1.el7sat.noarch
qpid-proton-debuginfo-0.9-7.el7.x86_64
gofer-2.6.8-1.el7sat.noarch
python-qpid-proton-0.9-11.el7.x86_64


How reproducible:
??? (100% within several minutes when one knows the tricks)


Steps to Reproduce:
1. Follow reproducer from https://bugzilla.redhat.com/show_bug.cgi?id=1272596 but stop the qdrouterd bouncing after some time - or increase the delay when qdrouterd is running substantially
2. Monitor CPU usage of goferd - notice that the high CPU usage (once hit) is put to normal values after qdrouterd restart (i.e. qdrouterd restart is a workaround) - thats why stop the qdrouterd bouncing


Actual results:
- goferd taking 100% CPU
- it has 1 established TCP connection to qdrouterd (i.e. not the original BZ where no connection was opened at the end)


Expected results:
- no high CPU usage


Additional info:
The only busy thread is Thread 21 with usual backtrace:

/usr/lib64/python2.7/site-packages/proton/__init__.py, line 3632, in wrap
/usr/lib64/python2.7/site-packages/proton/__init__.py, line 3736, in dispatch
/usr/lib64/python2.7/site-packages/proton/reactor.py, line 142, in process
/usr/lib64/python2.7/site-packages/proton/utils.py, line 252, in wait
/usr/lib64/python2.7/site-packages/proton/utils.py, line 144, in receive
/usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/consumer.py, line 106, in get
/usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/reliability.py, line 44, in _fn
/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py, line 614, in get
/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py, line 39, in _fn
/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py, line 648, in next
/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py, line 39, in _fn
/usr/lib/python2.7/site-packages/gofer/messaging/consumer.py, line 88, in _read
/usr/lib/python2.7/site-packages/gofer/messaging/consumer.py, line 58, in run
/usr/lib/python2.7/site-packages/gofer/common.py, line 267, in _fn
/usr/lib64/python2.7/threading.py, line 811, in __bootstrap_inner
/usr/lib64/python2.7/threading.py, line 784, in __bootstrap

Adding a logging event to /usr/lib64/python2.7/site-packages/proton/reactor.py, line 142, it is already inside the busy loop

Adding a logging event to /usr/lib64/python2.7/site-packages/proton/utils.py, line 252, just one such event is logged every 10 seconds (hint, hint, why 10 seconds? be aware I have in /usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/connection.py "MAX_DELAY = 1" so not _this_ parameter).

Therefore, the busy loop is:
/usr/lib64/python2.7/site-packages/proton/utils.py:

                deadline = time.time() + timeout
                while not condition() and not self.disconnected:
                    self.container.process()
                    if deadline < time.time():
                        txt = "Connection %s timed out" % self.url
                        if msg: txt += ": " + msg
                        raise Timeout(txt)


I can try preparing proton reactor standalone reproducer, if/once confirmed this is reactor bug.

Comment 2 Pavel Moravec 2016-01-06 08:40:36 UTC

(In reply to Pavel Moravec from comment #0)
> Adding a logging event to
> /usr/lib64/python2.7/site-packages/proton/utils.py, line 252, just one such
> event is logged every 10 seconds (hint, hint, why 10 seconds? be aware I
> have in
> /usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/connection.
> py "MAX_DELAY = 1" so not _this_ parameter).

this is easy:

/usr/lib/python2.7/site-packages/gofer/messaging/consumer.py: line 106:          message, document = self._reader.next(10)

Comment 3 Pavel Moravec 2016-01-06 09:01:03 UTC

Another hint (rather stressing what is hidden in the text):

once goferd / proton reactor is in the busy loop, restarting qdrouterd breaks it - after the restart, goferd starts to behave normally.

(it feels to me like proton reactor did not update some information about lost connection/link (being restored - _this_ isnt updated), busy waiting to reconnect/reattach event that however had already happened - and qdrouterd restart generates new such event that clears the busy loop)

Comment 6 Pavel Moravec 2016-01-07 07:46:39 UTC

It has been confirmed as proton reactor bug, thanks kgiusti++ .

See upstream https://issues.apache.org/jira/browse/PROTON-1090

Comment 19 Sachin Ghai 2016-01-27 10:48:03 UTC

Verified with satellite 6.1.6 compose 7 (async errata).

I was able to reproduce the reported issue with:

]# rpm -qa | grep qpid-proton
qpid-proton-c-0.9-11.el7.x86_64
python-qpid-proton-0.9-11.el7.x86_64

---
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                           
28593 root      20   0 1535020  39796   9772 S 100.0  0.1   2:18.76 python 

root     28593     1  5 04:14 ?        00:00:40 python /usr/bin/goferd --foreground
---


Later, I upgraded the rhel7 client with latest compose7 and with following packages:
[root@apollo ~]# rpm -qa | grep qpid
python-qpid-proton-0.9-12.el7sat.x86_64
qpid-proton-c-0.9-12.el7sat.x86_64

And this time I don't see any process like python/goferd consuming more that 100% CPU on re-connection to qdrouterd. here is the result after fix:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND 
17107 root      20   0 1534464  39516   9796 S   0.3  0.1   0:00.92 python   

root     17107     1  0 04:53 ?        00:00:01 python /usr/bin/goferd --foreground

Will test the fix with rhel6 and rhel5 clients and update here

Comment 20 Sachin Ghai 2016-01-27 10:51:25 UTC

One more observation, while installing packages, I see a sudden increase in CPU usage by 40-50%. Is this expected ?

Comment 21 Sachin Ghai 2016-01-27 12:05:23 UTC

Here is the result on installing a single package on rhel6. CPU usage increases by 70%

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND 
 8566 root      20   0 2074m 153m  13m S 70.5  0.6   0:07.85 python  

root      8566     1  7 13:01 ?        00:00:10 python /usr/bin/goferd

Comment 22 Sachin Ghai 2016-01-27 12:09:12 UTC

Ok, I verified with rhel5.11 and rhel67 and rhel72 clients. I don't see python/goferd consuming more that 100% CPU on re-connection to qdrouterd. However, I do see sudden rise in CPU usage when we install packages on client from sat6 server. 

@Mike: could you please confirm if its expected ?

Comment 24 Mike McCune 2016-01-27 15:02:01 UTC

it is expected that the gofer pytyhon process uses CPU while installing packages as this process is what is interacting with the yum api (not the yum CLI) to download and interact with RPM.

Comment 25 Sachin Ghai 2016-01-27 15:04:18 UTC

Thanks Mike. Moving this bz to verified as per comments 20, 21, 23 and 24.

Comment 27 errata-xmlrpc 2016-01-28 07:46:59 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:0077

Note You need to log in before you can comment on or make changes to this bug.

ahuchcha
bkearney
cdonnell
chrobert
cwelton
david.pinel
ddevra
fjayalat
gabicr
hasingh
hklein
jentrena
kgiusti
kshravag
mcressma
mmccune
mshimura
nmiao
peter.vreman
pmoravec
pmutha
rmahique
sghai
smeyer
sreber
thamilto
wharris