Red Hat Bugzilla – Bug 1272596
goferd taking 100% CPU after reconnection to qdrouterd failed
Last modified: 2017-02-23 14:44:19 EST
Description of problem:
There is a race condition causing the goferd process to take 100% CPU while not attempting to reconnect to qdrouterd. In a nutshell, goferd is busy-waiting for an event indicating that the remote end is no longer active, in order to gracefully close a timed-out connection. The bug lies in the proton reactor API or in proton itself - see details below.

Version-Release number of selected component (if applicable):
python-qpid-proton-0.9-7.el7.x86_64
python-gofer-qpid-2.6.6-2.el7sat.noarch
qpid-proton-debuginfo-0.9-7.el7.x86_64
qpid-proton-c-0.9-7.el7.x86_64
python-gofer-proton-2.6.6-2.el7sat.noarch
gofer-2.6.6-2.el7sat.noarch
python-gofer-2.6.6-2.el7sat.noarch

How reproducible:
100% after some time

Steps to Reproduce:
1. Just to speed up the reproducer, shorten the reconnect delays of goferd on a content host. Update in /usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/connection.py:

DELAY = 1
MAX_DELAY = 10

2. Restart the goferd service on the content host.

3. On the Satellite, let qdrouterd flap by running a simple script:

while true; do
  date
  service qdrouterd stop
  sleep 10; sleep $(($((RANDOM))%20))
  date
  service qdrouterd start
  sleep 10; sleep $(($((RANDOM))%20))
done

4. Monitor the CPU usage of goferd on the content host, and the TCP connections from it to the Satellite's port 5647 (i.e. run tcpdump).

Actual results:
After several minutes:
- CPU usage starts to grow
- no TCP connection attempt is being made
- tcpdump won't show any activity; all connections are closed or in CLOSE_WAIT state
- the situation does not change even when qdrouterd is left running all the time (once the CPU usage growth has been hit)

Expected results:
- no CPU usage growth
- new TCP connection attempts made all the time, per the connection retries specified in connection.py

Additional info:
- It does not depend on using SSL or not, i.e. when I disabled SSL per https://access.redhat.com/solutions/1594323 , I was able to reproduce as well.

- After adding some debug statements: goferd is busy-waiting in /usr/lib64/python2.7/site-packages/proton/utils.py:

    def wait(self, condition, timeout=False, msg=None):
        """Call process until condition() is true"""
        if timeout is False:
            timeout = self.timeout
        if timeout is None:
            while not condition() and not self.disconnected:
                self.container.process()

After some time, the proton reactor enters the while loop and never finishes it (checked by adding debug statements outside and inside the loop).

- The backtrace of the relevant thread often looks like the one below:

(gdb) bt
#0 0x00007fff8ad5e7d2 in clock_gettime ()
#1 0x00007ff890681dfd in clock_gettime () from /lib64/libc.so.6
#2 0x00007ff83514365e in pn_i_now () at /usr/src/debug/qpid-proton-0.9/proton-c/src/platform.c:31
#3 0x00007ff83513b039 in pn_reactor_mark (reactor=reactor@entry=0x7ff85403fb10) at /usr/src/debug/qpid-proton-0.9/proton-c/src/reactor/reactor.c:58
#4 0x00007ff83513b8e5 in pn_reactor_process (reactor=reactor@entry=0x7ff85403fb10) at /usr/src/debug/qpid-proton-0.9/proton-c/src/reactor/reactor.c:387
#5 0x00007ff835399c33 in _wrap_pn_reactor_process (self=<optimized out>, args=<optimized out>) at /usr/src/debug/qpid-proton-0.9/proton-c/bindings/python/cprotonPYTHON_wrap.c:23694
#6 0x00007ff89133eb74 in call_function (oparg=<optimized out>, pp_stack=0x7ff833fd2f50) at /usr/src/debug/Python-2.7.5/Python/ceval.c:4098
#7 PyEval_EvalFrameEx ( f=f@entry=Frame 0x7f0d0c6748e0, for file /usr/lib64/python2.7/site-packages/proton/reactor.py, line 143, in process (self=<Container(_impl=<SwigPyObject at remote 0x7f0d0c6c38d0>, _attrs={'ssl': <SSLConfig(client=<SSLDomain(_domain=<SwigPyObject at remote 0x7f0d0c6c3cf0>) at remote 0x7f0d0c6cd910>, server=<SSLDomain(_domain=<SwigPyObject at remote 0x7f0d0c6c3210>) at remote 0x7f0d0c6d2390>) at remote 0x7f0d0c6cd8d0>, 'errors': [], 'subclass': <type at remote 0x7f0d0c1eb010>, 'container_id': '6c7286d5-cdb1-47ad-a74c-832733df3fcb', 'trigger': None}) at remote 0x7f0d0c6cd590>), throwflag=throwflag@entry=0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:2740
#8 0x00007f0d51550930 in fast_function (nk=<optimized out>, na=1, n=1, pp_stack=0x7f0cf3fd30b0, func=<function at remote 0x7f0d0c12ac80>) at /usr/src/debug/Python-2.7.5/Python/ceval.c:4184
#9 call_function (oparg=<optimized out>, pp_stack=0x7f0cf3fd30b0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:4119
#10 PyEval_EvalFrameEx ( f=f@entry=Frame 0x7f0d0c3826a0, for file /usr/lib64/python2.7/site-packages/proton/utils.py, line 237, in wait (self=<BlockingConnection(url=<Url(_url=<SwigPyObject at remote 0x7f0d0c6c32a0>) at remote 0x7f0d0c6d2410>, container=<Container(_impl=<SwigPyObject at remote 0x7f0d0c6c38d0>, _attrs={'ssl': <SSLConfig(client=<SSLDomain(_domain=<SwigPyObject at remote 0x7f0d0c6c3cf0>) at remote 0x7f0d0c6cd910>, server=<SSLDomain(_domain=<SwigPyObject at remote 0x7f0d0c6c3210>) at remote 0x7f0d0c6d2390>) at remote 0x7f0d0c6cd8d0>, 'errors': [], 'subclass': <type at remote 0x7f0d0c1eb010>, 'container_id': '6c7286d5-cdb1-47ad-a74c-832733df3fcb', 'trigger': None}) at remote 0x7f0d0c6cd590>, disconnected=False, timeout=None, conn=<Connection(_impl=<SwigPyObject at remote 0x7f0d0c6c3ba0>, _attrs={'_session_policy': <SessionPerConnection(_default_session=<Session(_impl=<SwigPyObject at remote 0x7f0d0c6c3ae0>, _attrs={'context': <...>, 'condition': None}) at remote 0x7f0d0c6d27d0>) at remote 0x7f0d0c6d2690>, 'desired_capabilities': No...(truncated), throwflag=throwflag@entry=0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:2740
#11 0x00007f0d5155218d in PyEval_EvalCodeEx (co=<optimized out>, globals=<optimized out>, locals=locals@entry=0x0, args=<optimized out>, argcount=argcount@entry=2, kws=0x7f0d0c367c08, kwcount=1, defs=0x7f0d0c09a338, defcount=2, closure=closure@entry=0x0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:3330
#12 0x00007f0d5155083f in fast_function (nk=<optimized out>, na=2, n=4, pp_stack=0x7f0cf3fd32b0, func=<function at remote 0x7f0d0c130230>) at /usr/src/debug/Python-2.7.5/Python/ceval.c:4194
#13 call_function (oparg=<optimized out>, pp_stack=0x7f0cf3fd32b0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:4119
#14 PyEval_EvalFrameEx ( f=f@entry=Frame 0x7f0d0c367a70, for file /usr/lib64/python2.7/site-packages/proton/utils.py, line 220, in close (self=<BlockingConnection(url=<Url(_url=<SwigPyObject at remote 0x7f0d0c6c32a0>) at remote 0x7f0d0c6d2410>, container=<Container(_impl=<SwigPyObject at remote 0x7f0d0c6c38d0>, _attrs={'ssl': <SSLConfig(client=<SSLDomain(_domain=<SwigPyObject at remote 0x7f0d0c6c3cf0>) at remote 0x7f0d0c6cd910>, server=<SSLDomain(_domain=<SwigPyObject at remote 0x7f0d0c6c3210>) at remote 0x7f0d0c6d2390>) at remote 0x7f0d0c6cd8d0>, 'errors': [], 'subclass': <type at remote 0x7f0d0c1eb010>, 'container_id': '6c7286d5-cdb1-47ad-a74c-832733df3fcb', 'trigger': None}) at remote 0x7f0d0c6cd590>, disconnected=False, timeout=None, conn=<Connection(_impl=<SwigPyObject at remote 0x7f0d0c6c3ba0>, _attrs={'_session_policy': <SessionPerConnection(_default_session=<Session(_impl=<SwigPyObject at remote 0x7f0d0c6c3ae0>, _attrs={'context': <...>, 'condition': None}) at remote 0x7f0d0c6d27d0>) at remote 0x7f0d0c6d2690>, 'desired_capabilities': N...(truncated), throwflag=throwflag@entry=0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:2740
#15 0x00007f0d5155218d in PyEval_EvalCodeEx (co=<optimized out>, globals=<optimized out>, locals=locals@entry=0x0, args=<optimized out>, argcount=argcount@entry=1, kws=0x7f0d0c377748, kwcount=0, defs=0x0, defcount=0, closure=closure@entry=0x0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:3330
#16 0x00007f0d5155083f in fast_function (nk=<optimized out>, na=1, n=1, pp_stack=0x7f0cf3fd34b0, func=<function at remote 0x7f0d0c1300c8>) at /usr/src/debug/Python-2.7.5/Python/ceval.c:4194
#17 call_function (oparg=<optimized out>, pp_stack=0x7f0cf3fd34b0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:4119
#18 PyEval_EvalFrameEx ( f=f@entry=Frame 0x7f0d0c3775b0, for file /usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/connection.py, line 156, in close (self=<Connection(url='proton+amqp://pmoravec-sat61-rhel7.gsslab.brq.redhat.com:5647', _impl=None) at remote 0x7f0d0c316690>, connection=<BlockingConnection(url=<Url(_url=<SwigPyObject at remote 0x7f0d0c6c32a0>) at remote 0x7f0d0c6d2410>, container=<Container(_impl=<SwigPyObject at remote 0x7f0d0c6c38d0>, _attrs={'ssl': <SSLConfig(client=<SSLDomain(_domain=<SwigPyObject at remote 0x7f0d0c6c3cf0>) at remote 0x7f0d0c6cd910>, server=<SSLDomain(_domain=<SwigPyObject at remote 0x7f0d0c6c3210>) at remote 0x7f0d0c6d2390>) at remote 0x7f0d0c6cd8d0>, 'errors': [], 'subclass': <type at remote 0x7f0d0c1eb010>, 'container_id': '6c7286d5-cdb1-47ad-a74c-832733df3fcb', 'trigger': None}) at remote 0x7f0d0c6cd590>, disconnected=False, timeout=None, conn=<Connection(_impl=<SwigPyObject at remote 0x7f0d0c6c3ba0>, _attrs={'_session_policy': <SessionPerConnection(_default_session=<Session(_impl=<SwigPyO...(truncated), throwflag=throwflag@entry=0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:2740

- A tcpdump of one such event (on non-SSL traffic) is attached.
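For illustration only, here is a minimal standalone sketch (it does not use proton; all names are hypothetical) of the control flow shown by the backtrace: pn_reactor_process() returns immediately when the connection is already dead, so a wait() call with timeout=None and a condition that can never become true degenerates into a tight spin at 100% CPU, whereas a deadline-bounded variant fails fast instead.

#!/usr/bin/python
# Minimal sketch of the busy-wait described above; it only mimics the control
# flow of proton's BlockingConnection.wait(), it is not proton code.
import time


class FakeContainer(object):
    """Stands in for the reactor container; process() has no work to do."""

    def process(self):
        # Like pn_reactor_process() on a dead connection: returns immediately,
        # so the caller's while-loop turns into a tight spin.
        return True


def wait(condition, disconnected, timeout=None):
    container = FakeContainer()
    if timeout is None:
        # Mirrors the loop quoted from proton/utils.py: with no timeout and a
        # condition that never becomes true, this never exits and burns CPU.
        while not condition() and not disconnected():
            container.process()
    else:
        # Bounded variant: give up once the deadline passes.
        deadline = time.time() + timeout
        while not condition() and not disconnected():
            if time.time() > deadline:
                raise RuntimeError("timed out waiting for remote close")
            container.process()


# With timeout=None this call would spin forever at 100% CPU; with a
# 2-second timeout it raises instead.
try:
    wait(condition=lambda: False, disconnected=lambda: False, timeout=2)
except RuntimeError as e:
    print("gave up: %s" % e)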
Created attachment 1083808 [details]
tcpdump during reproducer test

See latest 10 frames there - self-explanatory what circumstances trigger the bug.
(In reply to Pavel Moravec from comment #1)
> Created attachment 1083808 [details]
> tcpdump during reproducer test
>
> See latest 10 frames there - self-explanatory what circumstances trigger the
> bug.

Hi Gordon,
would you be so kind as to check this bug? (Let me know if you prefer a
standalone reproducer without the Satellite bits; it should not be complex.)
(In reply to Pavel Moravec from comment #2)
> (In reply to Pavel Moravec from comment #1)
> > Created attachment 1083808 [details]
> > tcpdump during reproducer test
> >
> > See latest 10 frames there - self-explanatory what circumstances trigger the
> > bug.
>
> Hi Gordon,
> would you be so kind as to check this bug? (Let me know if you prefer a
> standalone reproducer without the Satellite bits; it should not be complex.)

Is it possible that this is caused by
https://issues.apache.org/jira/browse/PROTON-1023 (now fixed upstream)? The
protocol trace shows that the last attach attempt is failing due to the
router not being able to route it, and depending on exactly how that gets
processed it could result in an AttributeError being raised instead of the
expected LinkDetached or LinkException.

If not, a standalone reproducer would indeed be very useful.
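To illustrate the point about exception types (a hedged sketch, not goferd's actual retry code; the address and helper function are hypothetical): a reconnect path that only handles proton's documented exceptions never reaches its cleanup/retry logic if the failed attach surfaces as an AttributeError, which matches the "stuck, no new TCP connection attempts" symptom above.

# Illustrative sketch only (not goferd's implementation).
from proton import ConnectionException
from proton.utils import BlockingConnection


def attach_once(url):
    conn = None
    try:
        conn = BlockingConnection(url, ssl_domain=None, heartbeat=2)
        return conn, conn.create_receiver("some_address")
    except ConnectionException:
        # Expected failure path: proton-level exceptions (ConnectionException
        # here; LinkDetached / LinkException would be handled similarly) land
        # in a handler like this, so the connection gets cleaned up and the
        # caller can schedule a reconnect.
        if conn is not None:
            conn.close()
        raise
    # An AttributeError raised inside create_receiver() (the PROTON-1023
    # symptom) bypasses this handler entirely, so no cleanup or reconnect
    # scheduling ever happens.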
(In reply to Gordon Sim from comment #4)
> (In reply to Pavel Moravec from comment #2)
> > (In reply to Pavel Moravec from comment #1)
> > > Created attachment 1083808 [details]
> > > tcpdump during reproducer test
> > >
> > > See latest 10 frames there - self-explanatory what circumstances trigger the
> > > bug.
> >
> > Hi Gordon,
> > would you be so kind as to check this bug? (Let me know if you prefer a
> > standalone reproducer without the Satellite bits; it should not be complex.)
>
> Is it possible that this is caused by
> https://issues.apache.org/jira/browse/PROTON-1023 (now fixed upstream)? The
> protocol trace shows that the last attach attempt is failing due to the
> router not being able to route it, and depending on exactly how that gets
> processed it could result in an AttributeError being raised instead of the
> expected LinkDetached or LinkException.
>
> If not, a standalone reproducer would indeed be very useful.

Applying that patch helped but did not fully prevent the bug. I will prepare
a standalone reproducer; it should be a matter of a few (working) hours.
(In reply to Gordon Sim from comment #4)
> (In reply to Pavel Moravec from comment #2)
> > (In reply to Pavel Moravec from comment #1)
> > > Created attachment 1083808 [details]
> > > tcpdump during reproducer test
> > >
> > > See latest 10 frames there - self-explanatory what circumstances trigger the
> > > bug.
> >
> > Hi Gordon,
> > would you be so kind as to check this bug? (Let me know if you prefer a
> > standalone reproducer without the Satellite bits; it should not be complex.)
>
> Is it possible that this is caused by
> https://issues.apache.org/jira/browse/PROTON-1023 (now fixed upstream)? The
> protocol trace shows that the last attach attempt is failing due to the
> router not being able to route it, and depending on exactly how that gets
> processed it could result in an AttributeError being raised instead of the
> expected LinkDetached or LinkException.
>
> If not, a standalone reproducer would indeed be very useful.

Here is a reproducer script. Interestingly, when un-commenting either
consumer close line, I was unable to reproduce (or I was not patient enough).
(In parallel, bounce qdrouterd every 5 seconds - see the sketch after this
comment.)

#!/usr/bin/python
from time import sleep
from uuid import uuid4
from proton import ConnectionException
from proton.utils import BlockingConnection
import traceback
import random

while True:
    sleep(random.uniform(0.3,3))
    try:
        conn = BlockingConnection("proton+amqp://localhost:5672", ssl_domain=None, heartbeat=2)
        rec = conn.create_receiver("another_address", name=str(uuid4()), dynamic=False, options=None)
        print "sleeping.."
        sleep(random.uniform(0.3,3))
        rec2 = conn.create_receiver("some_address", name=str(uuid4()), dynamic=False, options=None)
    except ConnectionException:
        try:
#            rec.close()
#            rec2.close()
            if conn:
                conn.close()
        except Exception, e:
            print e
            print(traceback.format_exc())
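For completeness, the parallel "bounce qdrouterd every 5 seconds" step can be driven by a small helper like the following (a sketch only, using the same service stop/start commands as the flapping script in the description; run it alongside the reproducer):

#!/usr/bin/python
# Sketch: restart qdrouterd roughly every 5 seconds, mirroring the
# "service qdrouterd stop/start" shell loop from the description.
import subprocess
import time

while True:
    subprocess.call(["service", "qdrouterd", "stop"])
    time.sleep(5)
    subprocess.call(["service", "qdrouterd", "start"])
    time.sleep(5)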
(In reply to Pavel Moravec from comment #5)
> (In reply to Gordon Sim from comment #4)
> > (In reply to Pavel Moravec from comment #2)
> > > (In reply to Pavel Moravec from comment #1)
> > > > Created attachment 1083808 [details]
> > > > tcpdump during reproducer test
> > > >
> > > > See latest 10 frames there - self-explanatory what circumstances trigger the
> > > > bug.
> > >
> > > Hi Gordon,
> > > would you be so kind as to check this bug? (Let me know if you prefer a
> > > standalone reproducer without the Satellite bits; it should not be complex.)
> >
> > Is it possible that this is caused by
> > https://issues.apache.org/jira/browse/PROTON-1023 (now fixed upstream)? The
> > protocol trace shows that the last attach attempt is failing due to the
> > router not being able to route it, and depending on exactly how that gets
> > processed it could result in an AttributeError being raised instead of the
> > expected LinkDetached or LinkException.
> >
> > If not, a standalone reproducer would indeed be very useful.
>
> Applying that patch helped but did not fully prevent the bug. I will prepare
> a standalone reproducer; it should be a matter of a few (working) hours.

I've committed a further fix for a slightly different case now:
https://issues.apache.org/jira/browse/PROTON-1024

This was what I was hitting when running your standalone reproducer. With
that fix in place I haven't yet been able to reproduce.
(In reply to Gordon Sim from comment #7)
> > Applying that patch helped but did not fully prevent the bug. I will prepare
> > a standalone reproducer; it should be a matter of a few (working) hours.
>
> I've committed a further fix for a slightly different case now:
> https://issues.apache.org/jira/browse/PROTON-1024
>
> This was what I was hitting when running your standalone reproducer. With
> that fix in place I haven't yet been able to reproduce.

Testing both the standalone reproducer and goferd with both fixes
(PROTON-1023 and PROTON-1024) applied no longer triggers this bug. Good
catch!

Anyway, when running the standalone reproducer, regardless of whether the
patches are applied, I spotted
https://issues.apache.org/jira/browse/PROTON-1025. This CLOSE_WAIT leak is
_not_ observed in the goferd reproducer, therefore I do not plan to raise a
Satellite bugzilla for it - it is just reported to proton upstream. I.e. for
Satellite purposes, a backport of PROTON-1023 and PROTON-1024 is the
proper/sufficient fix.
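As a side note, the CLOSE_WAIT build-up from PROTON-1025 can be spotted on the reproducer host with a quick check like this (a sketch only; it assumes a Linux host with the ss utility and the standalone reproducer talking to localhost:5672):

#!/usr/bin/python
# Sketch: count reproducer connections stuck in CLOSE_WAIT (the
# PROTON-1025 symptom). Assumes `ss` is available and port 5672 is used.
import subprocess

out = subprocess.check_output(["ss", "-tn", "state", "close-wait"])
leaked = [line for line in out.splitlines() if ":5672" in line]
print("connections in CLOSE_WAIT towards port 5672: %d" % len(leaked))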
VERIFIED:

# rpm -qa | grep foreman
foreman-compute-1.7.2.48-1.el7sat.noarch
ruby193-rubygem-foreman-redhat_access-0.2.4-1.el7sat.noarch
rubygem-hammer_cli_foreman_discovery-0.0.1.10-1.el7sat.noarch
ruby193-rubygem-foreman_bootdisk-4.0.2.13-1.el7sat.noarch
ruby193-rubygem-foreman_docker-1.2.0.24-1.el7sat.noarch
rubygem-hammer_cli_foreman_docker-0.0.3.10-1.el7sat.noarch
foreman-debug-1.7.2.48-1.el7sat.noarch
foreman-postgresql-1.7.2.48-1.el7sat.noarch
ruby193-rubygem-foreman_openscap-0.3.2.10-1.el7sat.noarch
ruby193-rubygem-foreman_hooks-0.3.7-2.el7sat.noarch
foreman-gce-1.7.2.48-1.el7sat.noarch
ruby193-rubygem-foreman_discovery-2.0.0.22-1.el7sat.noarch
rubygem-hammer_cli_foreman_tasks-0.0.3.5-1.el7sat.noarch
puppet-foreman_scap_client-0.3.3-9.el7sat.noarch
foreman-selinux-1.7.2.17-1.el7sat.noarch
foreman-vmware-1.7.2.48-1.el7sat.noarch
ruby193-rubygem-foreman_gutterball-0.0.1.9-1.el7sat.noarch
foreman-proxy-1.7.2.6-1.el7sat.noarch
ibm-x3655-04.ovirt.rhts.eng.bos.redhat.com-foreman-proxy-client-1.0-1.noarch
foreman-discovery-image-3.0.5-3.el7sat.noarch
foreman-1.7.2.48-1.el7sat.noarch
ruby193-rubygem-foreman-tasks-0.6.15.7-1.el7sat.noarch
rubygem-hammer_cli_foreman_bootdisk-0.1.2.7-1.el7sat.noarch
foreman-libvirt-1.7.2.48-1.el7sat.noarch
ibm-x3655-04.ovirt.rhts.eng.bos.redhat.com-foreman-client-1.0-1.noarch
ibm-x3655-04.ovirt.rhts.eng.bos.redhat.com-foreman-proxy-1.0-1.noarch
foreman-ovirt-1.7.2.48-1.el7sat.noarch
rubygem-hammer_cli_foreman-0.1.4.14-1.el7sat.noarch

1. Just to speed up the reproducer, shorten the reconnect delays of goferd on a content host. Update in /usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/connection.py:

DELAY = 1
MAX_DELAY = 10

2. Restart the goferd service on the content host.

3. On the Satellite, let qdrouterd flap by running a simple script:

while true; do
  date
  service qdrouterd stop
  sleep 10; sleep $(($((RANDOM))%20))
  date
  service qdrouterd start
  sleep 10; sleep $(($((RANDOM))%20))
done

4. Monitor the CPU usage of goferd on the content host:

# while true; do date; ps aux | grep python | grep goferd; sleep 5; done
Tue Dec 8 07:26:30 EST 2015
root 30877 0.5 1.1 1936004 43172 ? Ssl 07:21 0:01 python /usr/bin/goferd --foreground
Tue Dec 8 07:26:35 EST 2015
root 30877 0.5 1.1 1936004 43172 ? Ssl 07:21 0:01 python /usr/bin/goferd --foreground
Tue Dec 8 07:26:40 EST 2015
root 30877 0.5 1.1 1936004 43172 ? Ssl 07:21 0:01 python /usr/bin/goferd --foreground
Tue Dec 8 07:26:45 EST 2015
root 30877 0.5 1.1 1936004 43172 ? Ssl 07:21 0:01 python /usr/bin/goferd --foreground
Tue Dec 8 07:26:50 EST 2015
root 30877 0.5 1.1 1936004 43292 ? Ssl 07:21 0:01 python /usr/bin/goferd --foreground
Tue Dec 8 07:26:55 EST 2015
root 30877 0.5 1.1 1936004 43396 ? Ssl 07:21 0:01 python /usr/bin/goferd --foreground
Tue Dec 8 07:27:00 EST 2015
root 30877 0.5 1.1 1936004 43396 ? Ssl 07:21 0:01 python /usr/bin/goferd --foreground
Tue Dec 8 07:27:05 EST 2015
root 30877 0.5 1.1 1936004 43396 ? Ssl 07:21 0:01 python /usr/bin/goferd --foreground
Tue Dec 8 07:27:10 EST 2015
root 30877 0.5 1.1 1936004 43396 ? Ssl 07:21 0:01 python /usr/bin/goferd --foreground
Tue Dec 8 07:27:15 EST 2015
root 30877 0.5 1.1 1936004 43396 ? Ssl 07:21 0:01 python /usr/bin/goferd --foreground
Tue Dec 8 07:27:20 EST 2015
root 30877 0.5 1.1 1936004 43396 ? Ssl 07:21 0:01 python /usr/bin/goferd --foreground
Tue Dec 8 07:27:25 EST 2015
root 30877 0.5 1.1 1936004 43396 ? Ssl 07:21 0:01 python /usr/bin/goferd --foreground
Tue Dec 8 07:27:30 EST 2015
root 30877 0.5 1.1 1936004 43516 ? Ssl 07:21 0:01 python /usr/bin/goferd --foreground
Tue Dec 8 07:27:36 EST 2015
root 30877 0.5 1.1 1936004 43732 ? Ssl 07:21 0:02 python /usr/bin/goferd --foreground
Tue Dec 8 07:27:41 EST 2015
root 30877 0.5 1.1 1936004 43872 ? Ssl 07:21 0:02 python /usr/bin/goferd --foreground
Tue Dec 8 07:27:46 EST 2015
root 30877 0.5 1.1 1936004 43936 ? Ssl 07:21 0:02 python /usr/bin/goferd --foreground
Tue Dec 8 07:27:51 EST 2015
root 30877 0.5 1.1 1936004 43996 ? Ssl 07:21 0:02 python /usr/bin/goferd --foreground
Tue Dec 8 07:27:56 EST 2015
root 30877 0.5 1.1 1936004 44076 ? Ssl 07:21 0:02 python /usr/bin/goferd --foreground
Tue Dec 8 07:28:01 EST 2015
root 30877 0.5 1.1 1936004 44076 ? Ssl 07:21 0:02 python /usr/bin/goferd --foreground
Tue Dec 8 07:28:06 EST 2015
root 30877 0.5 1.1 1936004 44076 ? Ssl 07:21 0:02 python /usr/bin/goferd --foreground
Tue Dec 8 07:28:11 EST 2015
root 30877 0.5 1.1 1936004 44308 ? Ssl 07:21 0:02 python /usr/bin/goferd --foreground

So, as per the expected result (no CPU usage growth): the CPU usage is constant.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2015:2621
*** Bug 1276176 has been marked as a duplicate of this bug. ***