Bug 1295957 - goferd taking 100% CPU after successful reconnect to qdrouterd after a longer time
Product: Red Hat Satellite 6
Classification: Red Hat
Component: katello-agent
Hardware: All
OS: Linux
Priority: urgent
Severity: high (3 votes)
Version: 6.1.6
Target Milestone: --
Assigned To: Katello Bug Bin
QA Contact: Sachin Ghai
Keywords: Triaged
Depends On:
Blocks: 1122832 1301103
Reported: 2016-01-05 17:14 EST by Pavel Moravec
Modified: 2017-02-23 14:39 EST
CC List: 27 users

See Also:
Fixed In Version: qpid-proton-0.9-12
Doc Type: Bug Fix
Doc Text:
Cause: the proton reactor did not clean up internal pipes correctly when a connection was reset. Consequence: on reconnect, the reactor thread could enter a busy loop, causing high CPU usage. Fix: the pipes are now cleaned up properly. Result: the CPU no longer spins on reconnect.
Story Points: ---
Clone Of:
: 1301103 (view as bug list)
Last Closed: 2016-01-28 02:46:59 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments

External Trackers
Tracker ID Priority Status Summary Last Updated
Apache JIRA PROTON-1090 None None None 2016-01-07 02:46 EST
Red Hat Knowledge Base (Solution) 2113371 None None None 2016-01-06 04:04 EST
Red Hat Product Errata RHBA-2016:0077 normal SHIPPED_LIVE Satellite Tools 6.1 bug fix update 2016-01-28 07:45:28 EST

Description Pavel Moravec 2016-01-05 17:14:48 EST
Description of problem:
Having qdrouterd shut down for some time (long enough that goferd retries several times to reconnect) and then starting it again, goferd reconnects properly. BUT there is a chance it _will_ start consuming 100% CPU after a while (within one minute of the successful reconnect).

The bug is suspected to be in the proton reactor - see the analysis below.

Version-Release number of selected component (if applicable):

How reproducible:
??? (100% within several minutes when one knows the tricks)

Steps to Reproduce:
1. Follow the reproducer from https://bugzilla.redhat.com/show_bug.cgi?id=1272596, but stop bouncing qdrouterd after some time - or substantially increase the delay during which qdrouterd is running.
2. Monitor the CPU usage of goferd - note that the high CPU usage (once hit) returns to normal after a qdrouterd restart (i.e. restarting qdrouterd is a workaround); that is why the qdrouterd bouncing must be stopped.

Actual results:
- goferd taking 100% CPU
- it has 1 established TCP connection to qdrouterd (i.e. this is not the original BZ, where no connection was open at the end)

Expected results:
- no high CPU usage

Additional info:
The only busy thread is Thread 21, with this typical backtrace:

/usr/lib64/python2.7/site-packages/proton/__init__.py, line 3632, in wrap
/usr/lib64/python2.7/site-packages/proton/__init__.py, line 3736, in dispatch
/usr/lib64/python2.7/site-packages/proton/reactor.py, line 142, in process
/usr/lib64/python2.7/site-packages/proton/utils.py, line 252, in wait
/usr/lib64/python2.7/site-packages/proton/utils.py, line 144, in receive
/usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/consumer.py, line 106, in get
/usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/reliability.py, line 44, in _fn
/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py, line 614, in get
/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py, line 39, in _fn
/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py, line 648, in next
/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py, line 39, in _fn
/usr/lib/python2.7/site-packages/gofer/messaging/consumer.py, line 88, in _read
/usr/lib/python2.7/site-packages/gofer/messaging/consumer.py, line 58, in run
/usr/lib/python2.7/site-packages/gofer/common.py, line 267, in _fn
/usr/lib64/python2.7/threading.py, line 811, in __bootstrap_inner
/usr/lib64/python2.7/threading.py, line 784, in __bootstrap

Adding a logging statement at /usr/lib64/python2.7/site-packages/proton/reactor.py, line 142 shows that this point is already inside the busy loop.

Adding a logging statement at /usr/lib64/python2.7/site-packages/proton/utils.py, line 252, only one such event is logged every 10 seconds. (Hint: why 10 seconds? Note that I have "MAX_DELAY = 1" in /usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/connection.py, so it is not _this_ parameter.)

Therefore, the busy loop is:

                deadline = time.time() + timeout
                while not condition() and not self.disconnected:
                    if deadline < time.time():
                        txt = "Connection %s timed out" % self.url
                        if msg: txt += ": " + msg
                        raise Timeout(txt)

I can try preparing a standalone proton reactor reproducer, if/once it is confirmed that this is a reactor bug.
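For illustration, the spin can be reproduced with a minimal standalone sketch (this is not the actual proton code; `SpinningReactor` and the simplified `wait()` below are invented stand-ins). The wait loop in proton's utils.py drives the reactor on each iteration; if that processing step returns immediately instead of blocking on I/O - consistent with the reactor polling a stale internal pipe left over from the reset connection - the loop degenerates into a pure busy spin until the deadline expires:

```python
import time

class SpinningReactor:
    """Invented stand-in for the reactor after a connection reset:
    process() returns immediately (a stale pipe always looks readable),
    so there is nothing to block on."""
    def __init__(self):
        self.iterations = 0

    def process(self):
        self.iterations += 1  # no blocking poll/select here -> busy loop

def wait(reactor, condition, timeout):
    # Same shape as the loop quoted above (simplified): spin until the
    # condition holds or the deadline passes.
    deadline = time.time() + timeout
    while not condition():
        reactor.process()
        if deadline < time.time():
            raise RuntimeError("Connection timed out")

reactor = SpinningReactor()
start = time.time()
try:
    wait(reactor, lambda: False, timeout=0.1)  # condition never becomes true
except RuntimeError:
    pass
elapsed = time.time() - start

# Even in a tenth of a second the loop runs many thousands of times
# instead of sleeping between events - that is the 100% CPU symptom.
print("iterations: %d in %.2f s" % (reactor.iterations, elapsed))
```

This also suggests why restarting qdrouterd clears the symptom: a restart produces fresh socket events that let the real reactor block and make progress again, which is consistent with the observation in comment 3.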
Comment 2 Pavel Moravec 2016-01-06 03:40:36 EST
(In reply to Pavel Moravec from comment #0)
> Adding a logging event to
> /usr/lib64/python2.7/site-packages/proton/utils.py, line 252, just one such
> event is logged every 10 seconds (hint, hint, why 10 seconds? be aware I
> have in
> /usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/connection.
> py "MAX_DELAY = 1" so not _this_ parameter).

this is easy:

/usr/lib/python2.7/site-packages/gofer/messaging/consumer.py: line 106:          message, document = self._reader.next(10)
Comment 3 Pavel Moravec 2016-01-06 04:01:03 EST
Another hint (rather, stressing what is hidden in the text above):

once goferd / proton reactor is in the busy loop, restarting qdrouterd breaks it - after the restart, goferd starts to behave normally.

(It feels to me like the proton reactor did not update some state about the lost connection/link being restored (_this_ is not updated), so it busy-waits for a reconnect/reattach event that had in fact already happened - and a qdrouterd restart generates a new such event that clears the busy loop.)
Comment 6 Pavel Moravec 2016-01-07 02:46:39 EST
It has been confirmed as a proton reactor bug, thanks kgiusti++.

See upstream https://issues.apache.org/jira/browse/PROTON-1090
Comment 19 Sachin Ghai 2016-01-27 05:48:03 EST
Verified with satellite 6.1.6 compose 7 (async errata).

I was able to reproduce the reported issue with:

]# rpm -qa | grep qpid-proton

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                           
28593 root      20   0 1535020  39796   9772 S 100.0  0.1   2:18.76 python 

root     28593     1  5 04:14 ?        00:00:40 python /usr/bin/goferd --foreground

Later, I upgraded the rhel7 client to the latest compose 7 with the following packages:
[root@apollo ~]# rpm -qa | grep qpid

And this time I don't see any process like python/goferd consuming more than 100% CPU on reconnection to qdrouterd. Here is the result after the fix:

17107 root      20   0 1534464  39516   9796 S   0.3  0.1   0:00.92 python   

root     17107     1  0 04:53 ?        00:00:01 python /usr/bin/goferd --foreground

Will test the fix with rhel6 and rhel5 clients and update here.
Comment 20 Sachin Ghai 2016-01-27 05:51:25 EST
One more observation: while installing packages, I see a sudden increase in CPU usage of 40-50%. Is this expected?
Comment 21 Sachin Ghai 2016-01-27 07:05:23 EST
Here is the result of installing a single package on rhel6. CPU usage increases by 70%:

 8566 root      20   0 2074m 153m  13m S 70.5  0.6   0:07.85 python  

root      8566     1  7 13:01 ?        00:00:10 python /usr/bin/goferd
Comment 22 Sachin Ghai 2016-01-27 07:09:12 EST
OK, I verified with rhel5.11, rhel6.7 and rhel7.2 clients. I don't see python/goferd consuming more than 100% CPU on reconnection to qdrouterd. However, I do see a sudden rise in CPU usage when we install packages on a client from the sat6 server.

@Mike: could you please confirm whether this is expected?
Comment 24 Mike McCune 2016-01-27 10:02:01 EST
It is expected that the gofer python process uses CPU while installing packages, as this process is what interacts with the yum API (not the yum CLI) to download and interact with RPMs.
Comment 25 Sachin Ghai 2016-01-27 10:04:18 EST
Thanks Mike. Moving this bz to verified as per comments 20, 21, 23 and 24.
Comment 27 errata-xmlrpc 2016-01-28 02:46:59 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

