Bug 1272596
| Summary: | goferd taking 100% CPU after reconnection to qdrouterd failed | | |
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | Pavel Moravec <pmoravec> |
| Component: | katello-agent | Assignee: | Jeff Ortel <jortel> |
| Status: | CLOSED ERRATA | QA Contact: | Tazim Kolhar <tkolhar> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.1.1 | CC: | aupadhye, bbuckingham, chrobert, cwelton, ddevra, gsim, iboverma, jswensso, kshravag, me, mmccune, mshimura, mtaru, nmiao, nstiasni, peter.vreman, sauchter, sreber, tkolhar |
| Target Milestone: | Unspecified | Keywords: | Triaged |
| Target Release: | Unused | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | qpid-proton-0.9-10 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-12-15 09:32:09 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1122832 | | |
Description
Pavel Moravec
2015-10-16 20:34:45 UTC
Created attachment 1083808 [details]
tcpdump during reproducer test
See the last 10 frames there; they are self-explanatory as to what circumstances trigger the bug.
Pavel Moravec (comment 2):
Hi Gordon, would you be so kind as to check this bug? (Let me know if you prefer a standalone reproducer without the Satellite bits; it should not be complex.)

Gordon Sim (comment 4):
Is it possible that this is caused by https://issues.apache.org/jira/browse/PROTON-1023 (now fixed upstream)? The protocol trace shows that the last attach attempt is failing because the router is unable to route it, and depending on exactly how that gets processed, it could result in an AttributeError being raised instead of the expected LinkDetached or LinkException. If not, a standalone reproducer would indeed be very useful.

Pavel Moravec (comment 5):
Applying that patch helped but did not fully prevent the bug. I will prepare a standalone reproducer; it should be a matter of very few (working) hours.

Pavel Moravec:
Here is a reproducer script. Interestingly, with either consumer close line un-commented, I was unable to reproduce (or I was not patient enough). Run it while bouncing qdrouterd every 5 seconds in parallel:

```python
#!/usr/bin/python
from time import sleep
from uuid import uuid4
from proton import ConnectionException
from proton.utils import BlockingConnection
import traceback
import random

while True:
    sleep(random.uniform(0.3, 3))
    conn = None
    try:
        conn = BlockingConnection("proton+amqp://localhost:5672",
                                  ssl_domain=None, heartbeat=2)
        rec = conn.create_receiver("another_address", name=str(uuid4()),
                                   dynamic=False, options=None)
        print "sleeping.."
        sleep(random.uniform(0.3, 3))
        rec2 = conn.create_receiver("some_address", name=str(uuid4()),
                                    dynamic=False, options=None)
    except ConnectionException:
        try:
            # rec.close()
            # rec2.close()
            if conn:
                conn.close()
        except Exception, e:
            print e
            print(traceback.format_exc())
```

Gordon Sim (comment 7):
I've committed a further fix for a slightly different case now: https://issues.apache.org/jira/browse/PROTON-1024. This was what I was hitting when running your standalone reproducer. With that fix in place I haven't yet been able to reproduce.

Pavel Moravec:
Testing both the standalone reproducer and goferd with the fixes for PROTON-1023 and PROTON-1024 no longer triggers this bug. Good catch! Anyway, when running the standalone reproducer (regardless of whether the patches were applied) I spotted https://issues.apache.org/jira/browse/PROTON-1025. That close_wait leak is _not_ observed with the goferd reproducer, so I do not plan to raise a Satellite bugzilla for it; it is only reported to proton upstream. I.e., for Satellite purposes, a backport of PROTON-1023 and PROTON-1024 is the proper and sufficient fix.
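For illustration, the failure mode discussed above (an AttributeError surfacing where a LinkDetached was expected) matters because reconnect logic typically backs off only on the error types it anticipates; anything else gets retried with no delay, which spins the CPU. The sketch below is self-contained and hypothetical: `LinkDetached` here is a dummy stand-in, not the proton class, and `supervise` is not gofer's actual code.

```python
class LinkDetached(Exception):
    """Stand-in for proton's LinkDetached: the error the client expects."""


def supervise(connect, rounds=5):
    """Retry `connect`, backing off only on anticipated errors.

    Returns the list of naps taken, so the caller can see whether the
    loop ever slept before retrying.
    """
    naps = []
    for _ in range(rounds):
        try:
            connect()
            return naps
        except LinkDetached:
            naps.append(1)  # expected failure: sleep before retrying
        except Exception:
            pass            # unexpected failure: retried with no delay
    return naps


def fail_expected():
    raise LinkDetached("router cannot route the attach")


def fail_unexpected():
    # The PROTON-1023/1024 symptom: the wrong exception type entirely.
    raise AttributeError("raised where LinkDetached was expected")


print(len(supervise(fail_expected)))    # 5 naps: polite backoff
print(len(supervise(fail_unexpected)))  # 0 naps: hot spin, 100% CPU
```

With the proton fixes in place the expected LinkDetached/LinkException is raised again, so the first branch (with its delay) is the one that runs.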
VERIFIED:

```
# rpm -qa | grep foreman
foreman-compute-1.7.2.48-1.el7sat.noarch
ruby193-rubygem-foreman-redhat_access-0.2.4-1.el7sat.noarch
rubygem-hammer_cli_foreman_discovery-0.0.1.10-1.el7sat.noarch
ruby193-rubygem-foreman_bootdisk-4.0.2.13-1.el7sat.noarch
ruby193-rubygem-foreman_docker-1.2.0.24-1.el7sat.noarch
rubygem-hammer_cli_foreman_docker-0.0.3.10-1.el7sat.noarch
foreman-debug-1.7.2.48-1.el7sat.noarch
foreman-postgresql-1.7.2.48-1.el7sat.noarch
ruby193-rubygem-foreman_openscap-0.3.2.10-1.el7sat.noarch
ruby193-rubygem-foreman_hooks-0.3.7-2.el7sat.noarch
foreman-gce-1.7.2.48-1.el7sat.noarch
ruby193-rubygem-foreman_discovery-2.0.0.22-1.el7sat.noarch
rubygem-hammer_cli_foreman_tasks-0.0.3.5-1.el7sat.noarch
puppet-foreman_scap_client-0.3.3-9.el7sat.noarch
foreman-selinux-1.7.2.17-1.el7sat.noarch
foreman-vmware-1.7.2.48-1.el7sat.noarch
ruby193-rubygem-foreman_gutterball-0.0.1.9-1.el7sat.noarch
foreman-proxy-1.7.2.6-1.el7sat.noarch
ibm-x3655-04.ovirt.rhts.eng.bos.redhat.com-foreman-proxy-client-1.0-1.noarch
foreman-discovery-image-3.0.5-3.el7sat.noarch
foreman-1.7.2.48-1.el7sat.noarch
ruby193-rubygem-foreman-tasks-0.6.15.7-1.el7sat.noarch
rubygem-hammer_cli_foreman_bootdisk-0.1.2.7-1.el7sat.noarch
foreman-libvirt-1.7.2.48-1.el7sat.noarch
ibm-x3655-04.ovirt.rhts.eng.bos.redhat.com-foreman-client-1.0-1.noarch
ibm-x3655-04.ovirt.rhts.eng.bos.redhat.com-foreman-proxy-1.0-1.noarch
foreman-ovirt-1.7.2.48-1.el7sat.noarch
rubygem-hammer_cli_foreman-0.1.4.14-1.el7sat.noarch
```

1. To speed up the reproducer, shorten goferd's reconnect delays on a content host. Update /usr/lib/python2.7/site-packages/gofer/messaging/adapter/proton/connection.py:

```python
DELAY = 1
MAX_DELAY = 10
```

2. Restart the goferd service on the content host.

3. On the Satellite, keep qdrouterd flapping by running a simple script:

```shell
while true; do
    date
    service qdrouterd stop
    sleep 10; sleep $(($((RANDOM))%20))
    date
    service qdrouterd start
    sleep 10; sleep $(($((RANDOM))%20))
done
```

4.
Monitor CPU usage of goferd on the content host:

```shell
# while true; do date; ps aux | grep python | grep goferd; sleep 5; done
```

```
Tue Dec  8 07:26:30 EST 2015
root 30877 0.5 1.1 1936004 43172 ? Ssl 07:21 0:01 python /usr/bin/goferd --foreground
Tue Dec  8 07:26:35 EST 2015
root 30877 0.5 1.1 1936004 43172 ? Ssl 07:21 0:01 python /usr/bin/goferd --foreground
Tue Dec  8 07:26:40 EST 2015
root 30877 0.5 1.1 1936004 43172 ? Ssl 07:21 0:01 python /usr/bin/goferd --foreground
Tue Dec  8 07:26:45 EST 2015
root 30877 0.5 1.1 1936004 43172 ? Ssl 07:21 0:01 python /usr/bin/goferd --foreground
Tue Dec  8 07:26:50 EST 2015
root 30877 0.5 1.1 1936004 43292 ? Ssl 07:21 0:01 python /usr/bin/goferd --foreground
Tue Dec  8 07:26:55 EST 2015
root 30877 0.5 1.1 1936004 43396 ? Ssl 07:21 0:01 python /usr/bin/goferd --foreground
Tue Dec  8 07:27:00 EST 2015
root 30877 0.5 1.1 1936004 43396 ? Ssl 07:21 0:01 python /usr/bin/goferd --foreground
Tue Dec  8 07:27:05 EST 2015
root 30877 0.5 1.1 1936004 43396 ? Ssl 07:21 0:01 python /usr/bin/goferd --foreground
Tue Dec  8 07:27:10 EST 2015
root 30877 0.5 1.1 1936004 43396 ? Ssl 07:21 0:01 python /usr/bin/goferd --foreground
Tue Dec  8 07:27:15 EST 2015
root 30877 0.5 1.1 1936004 43396 ? Ssl 07:21 0:01 python /usr/bin/goferd --foreground
Tue Dec  8 07:27:20 EST 2015
root 30877 0.5 1.1 1936004 43396 ? Ssl 07:21 0:01 python /usr/bin/goferd --foreground
Tue Dec  8 07:27:25 EST 2015
root 30877 0.5 1.1 1936004 43396 ? Ssl 07:21 0:01 python /usr/bin/goferd --foreground
Tue Dec  8 07:27:30 EST 2015
root 30877 0.5 1.1 1936004 43516 ? Ssl 07:21 0:01 python /usr/bin/goferd --foreground
Tue Dec  8 07:27:36 EST 2015
root 30877 0.5 1.1 1936004 43732 ? Ssl 07:21 0:02 python /usr/bin/goferd --foreground
Tue Dec  8 07:27:41 EST 2015
root 30877 0.5 1.1 1936004 43872 ? Ssl 07:21 0:02 python /usr/bin/goferd --foreground
Tue Dec  8 07:27:46 EST 2015
root 30877 0.5 1.1 1936004 43936 ? Ssl 07:21 0:02 python /usr/bin/goferd --foreground
Tue Dec  8 07:27:51 EST 2015
root 30877 0.5 1.1 1936004 43996 ? Ssl 07:21 0:02 python /usr/bin/goferd --foreground
Tue Dec  8 07:27:56 EST 2015
root 30877 0.5 1.1 1936004 44076 ? Ssl 07:21 0:02 python /usr/bin/goferd --foreground
Tue Dec  8 07:28:01 EST 2015
root 30877 0.5 1.1 1936004 44076 ? Ssl 07:21 0:02 python /usr/bin/goferd --foreground
Tue Dec  8 07:28:06 EST 2015
root 30877 0.5 1.1 1936004 44076 ? Ssl 07:21 0:02 python /usr/bin/goferd --foreground
Tue Dec  8 07:28:11 EST 2015
root 30877 0.5 1.1 1936004 44308 ? Ssl 07:21 0:02 python /usr/bin/goferd --foreground
```

So, as per the expected result: CPU usage does not grow; it stays constant.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2015:2621

*** Bug 1276176 has been marked as a duplicate of this bug. ***
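A note on the DELAY / MAX_DELAY tuning in step 1 of the verification: those names suggest a capped reconnect backoff. The sketch below assumes a simple doubling policy capped at MAX_DELAY; gofer's actual schedule may differ, and `backoff_delays` is a hypothetical helper, not gofer's API.

```python
def backoff_delays(initial, maximum, attempts):
    """Capped-exponential reconnect delays: double each retry, up to `maximum`."""
    delays, d = [], initial
    for _ in range(attempts):
        delays.append(d)
        d = min(d * 2, maximum)
    return delays


# With the tuned values from step 1 (DELAY = 1, MAX_DELAY = 10) the
# schedule caps quickly, which is why the tuned goferd cycles through
# reconnect attempts fast enough for a quick reproduction:
print(backoff_delays(1, 10, 6))  # [1, 2, 4, 8, 10, 10]
```

With the stock (larger) values, each reconnect attempt would be spaced further apart and the flapping-qdrouterd loop would need to run far longer to hit the race.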