Description of problem:

Under unknown circumstances, qdrouterd starts to consume a lot of CPU, stops responding to clients, and logs "HELLO peer expired" (if there is an external Capsule). This lasts a randomly long time (longest period seen so far: 80 seconds). Taking a coredump during that time shows backtraces like:

management/schema.py:263, in validate
management/schema.py:263, in validate
management/schema.py:443, in validate
management/schema.py:582, in validate_entity
management/schema.py:605, in validate_all
management/qdrouter.py:56, in validate_full
management/agent.py:360, in add
management/agent.py:370, in _add_implementation
management/agent.py:423, in refresh_from_c
management/agent.py:581, in receive

(the path /usr/lib/qpid-dispatch/python/qpid_dispatch_internal/ is skipped for readability)

I.e. the router is busy validating attribute_maps in the management agent cache. For Sat6, >90% of the cache consists of ConnectionEntity objects. It was pre-agreed with qpid engineering that no longer adding these objects to the cache won't harm anything (other than "qdstat -c" showing empty content), and it was verified that doing so in a primitive way really helps. Please remove adding the connection objects to the cache (within the router C code, ideally).

Version-Release number of selected component (if applicable):
qpid-dispatch-router 0.4-20

How reproducible:
n.a. (at the moment)

Steps to Reproduce:
???
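The backtrace suggests why the cost explodes: the whole entity cache gets re-validated on every management update, so the work grows with the number of cached ConnectionEntity objects. A toy model of that pattern (illustrative only, not the qpid-dispatch code; all names here are made up) shows how N connection adds turn into O(N^2) validation work:

```python
# Toy model of a management cache that re-validates every cached
# entity each time a new entity is added. This is NOT the
# qpid-dispatch implementation, just a sketch of the cost pattern.
class CacheModel(object):
    def __init__(self):
        self.entities = []
        self.validations = 0  # count of per-entity validations performed

    def validate_all(self):
        # stand-in for schema.py validate_all() walking attribute_maps
        for _ in self.entities:
            self.validations += 1

    def add(self, entity):
        self.entities.append(entity)
        self.validate_all()  # full re-validation on every single add

cache = CacheModel()
for i in range(1000):          # e.g. 1000 goferd client connections
    cache.add("connection-%d" % i)

# 1 + 2 + ... + 1000 = 500500 entity validations for only 1000 adds,
# which is why a large ConnectionEntity population stalls the router.
print(cache.validations)
```

This also explains why simply not caching connection objects (the requested C-side fix) removes the stall: the per-update validation loop then only walks the small set of internal entities.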
Actual results:
see above observations

Expected results:
no such busy & non-responding qdrouterd

Additional info:
An indirect way of detecting the issue / the fix: add to /usr/lib/qpid-dispatch/python/qpid_dispatch_internal/management/schema.py, around line 605:

    if check_singleton:
        check_singleton = set()
    if check_unique:
        check_unique = set()
    print "TEST: attribute_maps size: %s START" % len(attribute_maps)
    for e in attribute_maps:
        print "TEST: attribute class: %s" % e.__class__.__name__
        self.validate_entity(e, check_required=check_required,
                             add_default=add_default,
                             check_unique=check_unique,
                             check_singleton=check_singleton)
    print "TEST: attribute_maps size: %s DONE" % len(attribute_maps)

and restart qdrouterd. Usually, the ConnectionEntity attribute class will be printed to /var/log/messages often (the more connections from goferd / any qdrouterd client you have, the more logs). If you manage to reproduce the issue, there will be a time lag between adjacent "START" and "DONE" logs.
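Checking for that START/DONE lag by eye is tedious on a busy /var/log/messages. A small sketch that automates it, assuming the default syslog timestamp format ("Nov 24 22:08:36") and the exact "TEST: attribute_maps size: ... START/DONE" marker strings from the patch above; adjust the parsing if your syslog format differs:

```python
#!/usr/bin/python
# Sketch: report the worst START -> DONE lag of the debug prints above.
# Assumes classic syslog timestamps (no year) and the exact TEST markers.
import re
import sys
from datetime import datetime

MARKER = re.compile(
    r'^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}).*'
    r'TEST: attribute_maps size: \d+ (START|DONE)')

def max_lag(lines, year=2016):
    """Return the largest START->DONE gap in seconds seen in `lines`."""
    worst, start = 0.0, None
    for line in lines:
        m = MARKER.match(line)
        if not m:
            continue
        # syslog lines carry no year, so one must be supplied
        ts = datetime.strptime('%d %s' % (year, m.group(1)),
                               '%Y %b %d %H:%M:%S')
        if m.group(2) == 'START':
            start = ts
        elif start is not None:
            worst = max(worst, (ts - start).total_seconds())
            start = None
    return worst

if __name__ == '__main__' and len(sys.argv) > 1:
    with open(sys.argv[1]) as f:   # e.g. /var/log/messages
        print(max_lag(f))
```

On a healthy router the reported lag stays near zero; while the issue is reproduced it should approach the multi-second stalls described above.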
Reproducer in Satellite: have >1000 Content Hosts with goferd running and restart goferd with some frequency (say, every goferd once per 10 minutes).

Reproducer outside Satellite:
1) Add queue pulp.agent to qpidd.
2) Take the script below and run it many times, like:

    for i in $(seq 1 20); do python just_consume.py 10000.0 100 & sleep 1; done
    for i in $(seq 1 20); do python just_consume.py 10000.0 200 & sleep 1; done
    python just_consume.py 1.0 100 &

   (it shall create many consumers of the pulp.agent queue, some with seldom re-connects and some with frequent reconnects)
3) Wait until logs like the below appear in /var/log/messages (here an external Capsule is required):

    Nov 24 22:08:36 pmoravec-sat62-rhel7 qdrouterd: Thu Nov 24 22:08:36 2016 ROUTER_HELLO (info) HELLO peer expired: pmoravec-caps62-rhel7.gsslab.brq.redhat.com

4) Optionally to 3), take a tcpdump on port 5647 during the test, apply the filter "tcp.srcport eq 5647 and tcp.flags.push == 1" in wireshark, and find the biggest time lag in the (usually frequent) responses - the best is to set View -> Time display format -> Seconds since previous displayed packet, then sort by the Time column and find the max - that is the max delay in a response by qdrouterd to its clients.

The 4) was seen at customers as 80s or even 180s; the above reproducer can get 30-40s (still more than enough).
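The manual wireshark step in 4) can be scripted once the displayed packets' timestamps are exported, e.g. via something like `tshark -r dump.pcap -Y 'tcp.srcport == 5647 && tcp.flags.push == 1' -T fields -e frame.time_epoch` (tshark invocation from memory; verify the filter and field name against your tshark version). A minimal sketch that then computes the equivalent of "Seconds since previous displayed packet, sorted, take the max":

```python
#!/usr/bin/python
# Sketch: given one epoch timestamp per line (as exported by tshark
# above), report the largest gap between consecutive qdrouterd
# responses - the same number found manually in wireshark in step 4).
import sys

def max_gap(timestamps):
    """Largest gap, in seconds, between consecutive timestamps."""
    ts = sorted(float(t) for t in timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    return max(gaps) if gaps else 0.0

if __name__ == '__main__' and len(sys.argv) > 1:
    with open(sys.argv[1]) as f:   # file of epoch timestamps
        print("max response delay: %.3fs" % max_gap(f.read().split()))
```

With the reproducer above, a gap of tens of seconds here corresponds to the stall the clients experience.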
5) Optionally, add the debugs per comment #c0 and find where START is followed by DONE after a long time - meantime the router might not respond at all.

Script itself:

#!/usr/bin/python
from time import sleep
from uuid import uuid4
from proton import ConnectionException, Timeout
from proton import SSLDomain, SSLException
#from proton import Message
from proton.utils import BlockingConnection
import threading
import traceback
import os, sys
import random

SSL = True
ROUTER_ADDRESS_NOSSL = "proton+amqp://pmoravec-sat62-rhel7.gsslab.brq.redhat.com:5648"
ROUTER_ADDRESS_SSL = "amqps://pmoravec-sat62-rhel7.gsslab.brq.redhat.com:5647"
ROUTER_ADDRESS = ROUTER_ADDRESS_SSL if SSL else ROUTER_ADDRESS_NOSSL
#ROUTER_ADDRESS = "proton+amqp://toledo-capsule.gsslab.brq.redhat.com:5648"
ADDRESS = "pulp.agent"
#ADDRESS = "queue"
HEARTBEAT = 30
SLEEP = float(sys.argv[1])
THREADS = int(sys.argv[2])
START_DELAYS = 1 if SLEEP/THREADS else 1

class ReceiverThread(threading.Thread):
    def __init__(self, _id, address=ADDRESS, domain=None):
        super(ReceiverThread, self).__init__()
        self._id = _id
        self.address = address
        print self.address
        self.domain = domain
        self.running = True
        self.conn = None

    def connect(self):
        try:
            self.conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=self.domain, heartbeat=HEARTBEAT)
            self.conn.create_receiver(self.address, name=str(uuid4()), dynamic=False, options=None)
        except Exception:
            self.conn = None

    def run(self):
        while self.running:
            while self.conn == None:
                self.connect()
            sleep(random.uniform(SLEEP/2, 3*SLEEP/2))
            try:
                print "%s: reconnecting.." % self.address
                self.conn.close()
            except Exception, e:
                print e
                pass
            self.conn = None

    def stop(self):
        self.running = False

ca_certificate = '/etc/rhsm/ca/katello-server-ca.pem'
client_certificate = '/etc/pki/consumer/bundle.pem'
client_key = None
domain = SSLDomain(SSLDomain.MODE_CLIENT)
domain.set_trusted_ca_db(ca_certificate)
domain.set_credentials(client_certificate, client_key or client_certificate, None)
domain.set_peer_authentication(SSLDomain.VERIFY_PEER)

random.seed()
threads = []
for i in range(THREADS):
    threads.append(ReceiverThread(i, ADDRESS, domain if SSL else None))
    threads[i].start()
    sleep(START_DELAYS)

while True:
    sleep(10)

_in = raw_input("Press Enter to exit:")
for i in range(THREADS):
    threads[i].stop()
for i in range(THREADS):
    threads[i].join()
/me verifies 0.4-21 fixes this.

With 0.4-20, on a Satellite with a Capsule (such that the qdrouterd processes are inter-connected), mimicking the goferd connections showed 1) HELLO peer timeouts, 2) high delays in communication in tcpdump, 3) after adding the debugs, a cache size roughly corresponding to the number of client connections.

With 0.4-21, on the same setup, no timeouts happened, no high delays in tcpdump, and after adding the debugs the cache size stayed constantly below 100, tracking just some internal objects.

(in neither of the above tests was the python code workaround applied)
(In reply to Pavel Moravec from comment #5)
> /me verifies 0.4-21 fixes this.
>
> With 0.4-20, on Sat with Capsule (such that qdrouterd processes are
> inter-connected), mimicking the goferd connections showed 1) HELLO peer
> timeouts, 2) high delays in communication in tcpdump, 3) after adding the
> debugs, cache size roughly corresponded to the number of client connections.
>
> With 0.4-21, on the same setup, no timeouts happened, no high delays in
> tcpdump, after adding the debugs the cache size was constantly below 100,
> tracking just some internal objects.
>
> (in neither of above tests, the python code workaround was applied)

Do we have builds of 0.4-21 that are safe to hand off to customers to work around this issue yet?
** HOTFIX AVAILABLE **

This set of hotfix packages is currently undergoing testing by Red Hat Quality Assurance but is available for users who want early access. This hotfix includes fixes for bugs 1398377 and 1398536.

Instructions for hotfixing:
1) Download to your Satellite: http://people.redhat.com/~mmccune/hotfix/HOTFIX-1398377-1398536.tar.gz
2) Expand the archive: tar xvf HOTFIX-1398377-1398536.tar.gz
3) cd to the directory matching the version of the base OS you are using
4) Upgrade the RPMs: rpm -Uvh qpid-dispatch-router-0.4-21.el7sat.x86_64.rpm libqpid-dispatch-0.4-21.el7sat.x86_64.rpm (optionally include qpid-dispatch-tools if you get dependency errors and already have the older version of that package installed)
5) katello-service restart

Then proceed with operations.
Verified in Satellite 6.2.4 async based on #5 and the no-break automation test results.

# rpm -qa | grep qpid-dispatch
libqpid-dispatch-0.4-21.el6sat.x86_64
qpid-dispatch-debuginfo-0.4-21.el6sat.x86_64
qpid-dispatch-router-0.4-21.el6sat.x86_64
qpid-dispatch-tools-0.4-21.el6sat.x86_64

# rpm -qa | grep qpid-dispatch
qpid-dispatch-debuginfo-0.4-21.el7sat.x86_64
qpid-dispatch-tools-0.4-21.el7sat.x86_64
libqpid-dispatch-0.4-21.el7sat.x86_64
qpid-dispatch-router-0.4-21.el7sat.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:2855