Bug 1295957 - goferd taking 100% CPU after successful reconnect to qdrouterd after a longer time
Product: Red Hat Satellite 6
Classification: Red Hat
Component: katello-agent
Hardware: All
OS: Linux
Priority: urgent
Severity: high (3 votes)
Version: 6.1.6
Target Milestone: --
Assigned To: Katello Bug Bin
QA Contact: Sachin Ghai
Keywords: Triaged
Depends On:
Blocks: 1122832 1301103
Reported: 2016-01-05 17:14 EST by Pavel Moravec
Modified: 2017-02-23 14:39 EST
CC List: 27 users

See Also:
Fixed In Version: qpid-proton-0.9-12
Doc Type: Bug Fix
Doc Text:
Cause: the proton reactor did not clean up internal pipes correctly when a connection was reset. Consequence: on reconnect, the reactor thread could enter a busy loop, causing high CPU usage. Fix: the pipes are now cleaned up properly. Result: the CPU no longer spins on reconnect.
Story Points: ---
Clone Of:
: 1301103 (view as bug list)
Last Closed: 2016-01-28 02:46:59 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments

External Trackers
Tracker ID Priority Status Summary Last Updated
Apache JIRA PROTON-1090 None None None 2016-01-07 02:46 EST
Red Hat Knowledge Base (Solution) 2113371 None None None 2016-01-06 04:04 EST
Red Hat Product Errata RHBA-2016:0077 normal SHIPPED_LIVE Satellite Tools 6.1 bug fix update 2016-01-28 07:45:28 EST

Description Pavel Moravec 2016-01-05 17:14:48 EST
Description of problem:
Having qdrouterd shut down for some time (long enough that goferd retries several times to reconnect) and then starting it again, goferd reconnects properly. BUT there is a chance it _will_ start consuming 100% CPU after a while (within one minute of the successful reconnect).

The bug is suspected to be in the proton reactor - see the analysis below.

Version-Release number of selected component (if applicable):

How reproducible:
??? (100% within several minutes when one knows the tricks)

Steps to Reproduce:
1. Follow the reproducer from https://bugzilla.redhat.com/show_bug.cgi?id=1272596, but stop bouncing qdrouterd after some time - or substantially increase the delay during which qdrouterd is running.
2. Monitor the CPU usage of goferd - note that the high CPU usage (once hit) returns to normal after a qdrouterd restart (i.e. restarting qdrouterd is a workaround); that is why the qdrouterd bouncing must be stopped.

Actual results:
- goferd taking 100% CPU
- it has 1 established TCP connection to qdrouterd (i.e. this is not the original BZ, where no connection was open at the end)

Expected results:
- no high CPU usage

Additional info:
The only busy thread is Thread 21, with this typical backtrace:

/usr/lib64/python2.7/site-packages/proton/__init__.py, line 3632, in wrap
/usr/lib64/python2.7/site-packages/proton/__init__.py, line 3736, in dispatch
/usr/lib64/python2.7/site-packages/proton/reactor.py, line 142, in process
/usr/lib64/python2.7/site-packages/proton/utils.py, line 252, in wait
/usr/lib64/python2.7/site-packages/proton/utils.py, line 144, in receive
/usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/consumer.py, line 106, in get
/usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/reliability.py, line 44, in _fn
/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py, line 614, in get
/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py, line 39, in _fn
/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py, line 648, in next
/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py, line 39, in _fn
/usr/lib/python2.7/site-packages/gofer/messaging/consumer.py, line 88, in _read
/usr/lib/python2.7/site-packages/gofer/messaging/consumer.py, line 58, in run
/usr/lib/python2.7/site-packages/gofer/common.py, line 267, in _fn
/usr/lib64/python2.7/threading.py, line 811, in __bootstrap_inner
/usr/lib64/python2.7/threading.py, line 784, in __bootstrap

Adding a logging statement at /usr/lib64/python2.7/site-packages/proton/reactor.py, line 142 shows that this point is already inside the busy loop.

Adding a logging statement at /usr/lib64/python2.7/site-packages/proton/utils.py, line 252, only one such event is logged every 10 seconds. (Hint: why 10 seconds? Note that I have "MAX_DELAY = 1" in /usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/connection.py, so it is not _this_ parameter.)

Therefore, the busy loop is:

                deadline = time.time() + timeout
                while not condition() and not self.disconnected:
                    if deadline < time.time():
                        txt = "Connection %s timed out" % self.url
                        if msg: txt += ": " + msg
                        raise Timeout(txt)

I can try preparing a standalone proton reactor reproducer, if/once it is confirmed that this is a reactor bug.
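For illustration, the spin can be reproduced with a minimal standalone sketch (this is not the actual proton code; `SpinningReactor` and the simplified `wait()` below are invented stand-ins). The wait loop in proton's utils.py drives the reactor on each iteration; if that processing step returns immediately instead of blocking on I/O - consistent with the reactor polling a stale internal pipe left over from the reset connection - the loop degenerates into a pure busy spin until the deadline expires:

```python
import time

class SpinningReactor:
    """Invented stand-in for the reactor after a connection reset:
    process() returns immediately (a stale pipe always looks readable),
    so there is nothing to block on."""
    def __init__(self):
        self.iterations = 0

    def process(self):
        self.iterations += 1  # no blocking poll/select here -> busy loop

def wait(reactor, condition, timeout):
    # Same shape as the loop quoted above (simplified): spin until the
    # condition holds or the deadline passes.
    deadline = time.time() + timeout
    while not condition():
        reactor.process()
        if deadline < time.time():
            raise RuntimeError("Connection timed out")

reactor = SpinningReactor()
start = time.time()
try:
    wait(reactor, lambda: False, timeout=0.1)  # condition never becomes true
except RuntimeError:
    pass
elapsed = time.time() - start

# Even in a tenth of a second the loop runs many thousands of times
# instead of sleeping between events - that is the 100% CPU symptom.
print("iterations: %d in %.2f s" % (reactor.iterations, elapsed))
```

This also suggests why restarting qdrouterd clears the symptom: a restart produces fresh socket events that let the real reactor block and make progress again, which is consistent with the observation in comment 3.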
Comment 2 Pavel Moravec 2016-01-06 03:40:36 EST
(In reply to Pavel Moravec from comment #0)
> Adding a logging event to
> /usr/lib64/python2.7/site-packages/proton/utils.py, line 252, just one such
> event is logged every 10 seconds (hint, hint, why 10 seconds? be aware I
> have in
> /usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/connection.
> py "MAX_DELAY = 1" so not _this_ parameter).

this is easy:

/usr/lib/python2.7/site-packages/gofer/messaging/consumer.py: line 106:          message, document = self._reader.next(10)
Comment 3 Pavel Moravec 2016-01-06 04:01:03 EST
Another hint (rather, stressing what is hidden in the text above):

once goferd / proton reactor is in the busy loop, restarting qdrouterd breaks it - after the restart, goferd starts to behave normally.

(It feels to me like the proton reactor did not update some state about the lost connection/link being restored (_this_ is not updated), so it busy-waits for a reconnect/reattach event that had in fact already happened - and a qdrouterd restart generates a new such event that clears the busy loop.)
Comment 6 Pavel Moravec 2016-01-07 02:46:39 EST
It has been confirmed as a proton reactor bug, thanks kgiusti++.

See upstream https://issues.apache.org/jira/browse/PROTON-1090
Comment 19 Sachin Ghai 2016-01-27 05:48:03 EST
Verified with satellite 6.1.6 compose 7 (async errata).

I was able to reproduce the reported issue with:

]# rpm -qa | grep qpid-proton

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                           
28593 root      20   0 1535020  39796   9772 S 100.0  0.1   2:18.76 python 

root     28593     1  5 04:14 ?        00:00:40 python /usr/bin/goferd --foreground

Later, I upgraded the rhel7 client to the latest compose 7 with the following packages:
[root@apollo ~]# rpm -qa | grep qpid

And this time I don't see any process like python/goferd consuming more than 100% CPU on reconnection to qdrouterd. Here is the result after the fix:

17107 root      20   0 1534464  39516   9796 S   0.3  0.1   0:00.92 python   

root     17107     1  0 04:53 ?        00:00:01 python /usr/bin/goferd --foreground

Will test the fix with rhel6 and rhel5 clients and update here.
Comment 20 Sachin Ghai 2016-01-27 05:51:25 EST
One more observation: while installing packages, I see a sudden increase in CPU usage of 40-50%. Is this expected?
Comment 21 Sachin Ghai 2016-01-27 07:05:23 EST
Here is the result of installing a single package on rhel6. CPU usage increases by 70%:

 8566 root      20   0 2074m 153m  13m S 70.5  0.6   0:07.85 python  

root      8566     1  7 13:01 ?        00:00:10 python /usr/bin/goferd
Comment 22 Sachin Ghai 2016-01-27 07:09:12 EST
OK, I verified with rhel5.11, rhel6.7 and rhel7.2 clients. I don't see python/goferd consuming more than 100% CPU on reconnection to qdrouterd. However, I do see a sudden rise in CPU usage when we install packages on a client from the sat6 server.

@Mike: could you please confirm whether this is expected?
Comment 24 Mike McCune 2016-01-27 10:02:01 EST
It is expected that the gofer python process uses CPU while installing packages, as this process is what interacts with the yum API (not the yum CLI) to download and interact with RPMs.
Comment 25 Sachin Ghai 2016-01-27 10:04:18 EST
Thanks Mike. Moving this bz to verified as per comments 20, 21, 23 and 24.
Comment 27 errata-xmlrpc 2016-01-28 02:46:59 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

