Description of problem:

Under unknown circumstances, qdrouterd starts to consume a lot of CPU, stops responding to clients, and logs "HELLO peer expired" (if there is an external Capsule). This lasts a randomly long time (longest period seen so far: 80 seconds). Taking a coredump during that time shows backtraces like:

management/schema.py:263, in validate
management/schema.py:263, in validate
management/schema.py:443, in validate
management/schema.py:582, in validate_entity
management/schema.py:605, in validate_all
management/qdrouter.py:56, in validate_full
management/agent.py:360, in add
management/agent.py:370, in _add_implementation
management/agent.py:423, in refresh_from_c
management/agent.py:581, in receive

(the path /usr/lib/qpid-dispatch/python/qpid_dispatch_internal/ is skipped for readability)

I.e. the router is busy validating attribute_maps in the management agent cache. For Sat6, >90% of the cache consists of ConnectionEntity objects. It was pre-agreed with qpid engineering that no longer adding these objects to the cache won't harm anything (other than "qdstat -c" showing empty content), and it was verified that doing so in a primitive way really helps. Please remove adding the connection objects to the cache (within the router C code, ideally).

Version-Release number of selected component (if applicable):
qpid-dispatch-router 0.4-20

How reproducible:
n.a. (at the moment)

Steps to Reproduce:
???
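The backtrace suggests why the cost explodes: the whole entity cache gets re-validated on every management update, so the work grows with the number of cached ConnectionEntity objects. A toy model of that pattern (illustrative only, not the qpid-dispatch code; all names here are made up) shows how N connection adds turn into O(N^2) validation work:

```python
# Toy model of a management cache that re-validates every cached
# entity each time a new entity is added. This is NOT the
# qpid-dispatch implementation, just a sketch of the cost pattern.
class CacheModel(object):
    def __init__(self):
        self.entities = []
        self.validations = 0  # count of per-entity validations performed

    def validate_all(self):
        # stand-in for schema.py validate_all() walking attribute_maps
        for _ in self.entities:
            self.validations += 1

    def add(self, entity):
        self.entities.append(entity)
        self.validate_all()  # full re-validation on every single add

cache = CacheModel()
for i in range(1000):          # e.g. 1000 goferd client connections
    cache.add("connection-%d" % i)

# 1 + 2 + ... + 1000 = 500500 entity validations for only 1000 adds,
# which is why a large ConnectionEntity population stalls the router.
print(cache.validations)
```

This also explains why simply not caching connection objects (the requested C-side fix) removes the stall: the per-update validation loop then only walks the small set of internal entities.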
Actual results:
see above observations

Expected results:
no such busy & non-responding qdrouterd

Additional info:
An indirect way of detecting the issue / the fix: add to /usr/lib/qpid-dispatch/python/qpid_dispatch_internal/management/schema.py, around line 605:

    if check_singleton:
        check_singleton = set()
    if check_unique:
        check_unique = set()
    print "TEST: attribute_maps size: %s START" % len(attribute_maps)
    for e in attribute_maps:
        print "TEST: attribute class: %s" % e.__class__.__name__
        self.validate_entity(e, check_required=check_required,
                             add_default=add_default,
                             check_unique=check_unique,
                             check_singleton=check_singleton)
    print "TEST: attribute_maps size: %s DONE" % len(attribute_maps)

and restart qdrouterd. Usually, the ConnectionEntity attribute class will be printed to /var/log/messages often (the more connections from goferd / any qdrouterd client you have, the more logs). If you manage to reproduce the issue, there will be a time lag between adjacent "START" and "DONE" logs.
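Checking for that START/DONE lag by eye is tedious on a busy /var/log/messages. A small sketch that automates it, assuming the default syslog timestamp format ("Nov 24 22:08:36") and the exact "TEST: attribute_maps size: ... START/DONE" marker strings from the patch above; adjust the parsing if your syslog format differs:

```python
#!/usr/bin/python
# Sketch: report the worst START -> DONE lag of the debug prints above.
# Assumes classic syslog timestamps (no year) and the exact TEST markers.
import re
import sys
from datetime import datetime

MARKER = re.compile(
    r'^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}).*'
    r'TEST: attribute_maps size: \d+ (START|DONE)')

def max_lag(lines, year=2016):
    """Return the largest START->DONE gap in seconds seen in `lines`."""
    worst, start = 0.0, None
    for line in lines:
        m = MARKER.match(line)
        if not m:
            continue
        # syslog lines carry no year, so one must be supplied
        ts = datetime.strptime('%d %s' % (year, m.group(1)),
                               '%Y %b %d %H:%M:%S')
        if m.group(2) == 'START':
            start = ts
        elif start is not None:
            worst = max(worst, (ts - start).total_seconds())
            start = None
    return worst

if __name__ == '__main__' and len(sys.argv) > 1:
    with open(sys.argv[1]) as f:   # e.g. /var/log/messages
        print(max_lag(f))
```

On a healthy router the reported lag stays near zero; while the issue is reproduced it should approach the multi-second stalls described above.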
Reproducer in Satellite: have >1000 Content Hosts with goferd running and restart goferd with some frequency (say, every goferd once per 10 minutes).

Reproducer outside Satellite:
1) Add queue pulp.agent to qpidd.
2) Take the script below and run it many times, like:

    for i in $(seq 1 20); do python just_consume.py 10000.0 100 & sleep 1; done
    for i in $(seq 1 20); do python just_consume.py 10000.0 200 & sleep 1; done
    python just_consume.py 1.0 100 &

   (it shall create many consumers of the pulp.agent queue, some with seldom re-connects and some with frequent reconnects)
3) Wait until logs like the below appear in /var/log/messages (here an external Capsule is required):

    Nov 24 22:08:36 pmoravec-sat62-rhel7 qdrouterd: Thu Nov 24 22:08:36 2016 ROUTER_HELLO (info) HELLO peer expired: pmoravec-caps62-rhel7.gsslab.brq.redhat.com

4) Optionally to 3), take a tcpdump on port 5647 during the test, apply the filter "tcp.srcport eq 5647 and tcp.flags.push == 1" in wireshark, and find the biggest time lag in the (usually frequent) responses - the best is to set View -> Time display format -> Seconds since previous displayed packet, then sort by the Time column and find the max - that is the max delay in a response by qdrouterd to its clients.

The 4) was seen at customers as 80s or even 180s; the above reproducer can get 30-40s (still more than enough).
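The manual wireshark step in 4) can be scripted once the displayed packets' timestamps are exported, e.g. via something like `tshark -r dump.pcap -Y 'tcp.srcport == 5647 && tcp.flags.push == 1' -T fields -e frame.time_epoch` (tshark invocation from memory; verify the filter and field name against your tshark version). A minimal sketch that then computes the equivalent of "Seconds since previous displayed packet, sorted, take the max":

```python
#!/usr/bin/python
# Sketch: given one epoch timestamp per line (as exported by tshark
# above), report the largest gap between consecutive qdrouterd
# responses - the same number found manually in wireshark in step 4).
import sys

def max_gap(timestamps):
    """Largest gap, in seconds, between consecutive timestamps."""
    ts = sorted(float(t) for t in timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    return max(gaps) if gaps else 0.0

if __name__ == '__main__' and len(sys.argv) > 1:
    with open(sys.argv[1]) as f:   # file of epoch timestamps
        print("max response delay: %.3fs" % max_gap(f.read().split()))
```

With the reproducer above, a gap of tens of seconds here corresponds to the stall the clients experience.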
5) Optionally, add the debugs per comment #c0 and find where START is followed by DONE after a long time - meantime the router might not respond at all.

Script itself:

#!/usr/bin/python
from time import sleep
from uuid import uuid4
from proton import ConnectionException, Timeout
from proton import SSLDomain, SSLException
#from proton import Message
from proton.utils import BlockingConnection
import threading
import traceback
import os, sys
import random

SSL = True
ROUTER_ADDRESS_NOSSL = "proton+amqp://pmoravec-sat62-rhel7.gsslab.brq.redhat.com:5648"
ROUTER_ADDRESS_SSL = "amqps://pmoravec-sat62-rhel7.gsslab.brq.redhat.com:5647"
ROUTER_ADDRESS = ROUTER_ADDRESS_SSL if SSL else ROUTER_ADDRESS_NOSSL
#ROUTER_ADDRESS = "proton+amqp://toledo-capsule.gsslab.brq.redhat.com:5648"
ADDRESS = "pulp.agent"
#ADDRESS = "queue"
HEARTBEAT = 30
SLEEP = float(sys.argv[1])
THREADS = int(sys.argv[2])
START_DELAYS = 1 if SLEEP/THREADS else 1

class ReceiverThread(threading.Thread):
    def __init__(self, _id, address=ADDRESS, domain=None):
        super(ReceiverThread, self).__init__()
        self._id = _id
        self.address = address
        print self.address
        self.domain = domain
        self.running = True
        self.conn = None

    def connect(self):
        try:
            self.conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=self.domain, heartbeat=HEARTBEAT)
            self.conn.create_receiver(self.address, name=str(uuid4()), dynamic=False, options=None)
        except Exception:
            self.conn = None

    def run(self):
        while self.running:
            while self.conn == None:
                self.connect()
            sleep(random.uniform(SLEEP/2, 3*SLEEP/2))
            try:
                print "%s: reconnecting.." % self.address
                self.conn.close()
            except Exception, e:
                print e
                pass
            self.conn = None

    def stop(self):
        self.running = False

ca_certificate = '/etc/rhsm/ca/katello-server-ca.pem'
client_certificate = '/etc/pki/consumer/bundle.pem'
client_key = None
domain = SSLDomain(SSLDomain.MODE_CLIENT)
domain.set_trusted_ca_db(ca_certificate)
domain.set_credentials(client_certificate, client_key or client_certificate, None)
domain.set_peer_authentication(SSLDomain.VERIFY_PEER)

random.seed()
threads = []
for i in range(THREADS):
    threads.append(ReceiverThread(i, ADDRESS, domain if SSL else None))
    threads[i].start()
    sleep(START_DELAYS)

while True:
    sleep(10)

_in = raw_input("Press Enter to exit:")
for i in range(THREADS):
    threads[i].stop()
for i in range(THREADS):
    threads[i].join()
/me verifies 0.4-21 fixes this.

With 0.4-20, on a Satellite with a Capsule (such that the qdrouterd processes are inter-connected), mimicking the goferd connections showed 1) HELLO peer timeouts, 2) high delays in communication in tcpdump, 3) after adding the debugs, a cache size roughly corresponding to the number of client connections.

With 0.4-21, on the same setup, no timeouts happened, no high delays in tcpdump, and after adding the debugs the cache size stayed constantly below 100, tracking just some internal objects.

(in neither of the above tests was the python code workaround applied)
(In reply to Pavel Moravec from comment #5)
> /me verifies 0.4-21 fixes this.
>
> With 0.4-20, on Sat with Capsule (such that qdrouterd processes are
> inter-connected), mimicking the goferd connections showed 1) HELLO peer
> timeouts, 2) high delays in communication in tcpdump, 3) after adding the
> debugs, cache size roughly corresponded to the number of client connections.
>
> With 0.4-21, on the same setup, no timeouts happened, no high delays in
> tcpdump, after adding the debugs the cache size was constantly below 100,
> tracking just some internal objects.
>
> (in neither of above tests, the python code workaround was applied)

Do we have builds of 0.4-21 that are safe to hand off to customers to work around this issue yet?
** HOTFIX AVAILABLE **

This set of hotfix packages is currently undergoing testing by Red Hat Quality Assurance but is available for users who want early access. This hotfix includes fixes for bugs 1398377 and 1398536.

Instructions for hotfixing:
1) Download to your Satellite: http://people.redhat.com/~mmccune/hotfix/HOTFIX-1398377-1398536.tar.gz
2) Expand the archive: tar xvf HOTFIX-1398377-1398536.tar.gz
3) cd to the directory matching the version of the base OS you are using
4) Upgrade the RPMs: rpm -Uvh qpid-dispatch-router-0.4-21.el7sat.x86_64.rpm libqpid-dispatch-0.4-21.el7sat.x86_64.rpm (optionally include qpid-dispatch-tools if you get dependency errors and already have the older version of that package installed)
5) katello-service restart

Then proceed with operations.
Verified in Satellite 6.2.4 async based on #5 and the no-break automation test results.

# rpm -qa | grep qpid-dispatch
libqpid-dispatch-0.4-21.el6sat.x86_64
qpid-dispatch-debuginfo-0.4-21.el6sat.x86_64
qpid-dispatch-router-0.4-21.el6sat.x86_64
qpid-dispatch-tools-0.4-21.el6sat.x86_64

# rpm -qa | grep qpid-dispatch
qpid-dispatch-debuginfo-0.4-21.el7sat.x86_64
qpid-dispatch-tools-0.4-21.el7sat.x86_64
libqpid-dispatch-0.4-21.el7sat.x86_64
qpid-dispatch-router-0.4-21.el7sat.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:2855