Bug 1553116
Summary: any qdrouterd 0.4 segfaults under load of clients (dis)connecting

Product: Red Hat Satellite
Component: Qpid
Version: 6.2.14
Hardware: x86_64
OS: Linux
Status: CLOSED EOL
Severity: urgent
Priority: urgent
Target Milestone: Unspecified
Target Release: Unused
Fixed In Version: qpid-dispatch-0.4-31
Reporter: Pavel Moravec <pmoravec>
Assignee: Mike Cressman <mcressma>
QA Contact: Radovan Drazny <rdrazny>
CC: andrew.schofield, bkearney, cdonnell, cduryee, gmurthy, kabbott, ktordeur, mcressma, mlele, mmccune, mschwabe, pdwyer, pmoravec, rbeyel, rdrazny, rjerrido, smutkule
Keywords: FieldEngineering, PrioBumpField, PrioBumpPM, Triaged
Doc Type: If docs needed, set a value
Clone(s): 1553189 (view as bug list)
Bug Blocks: 1553189
Type: Bug
Last Closed: 2019-11-05 19:19:36 UTC
Attachments: reproducer configs and script (1405823); qpid-dispatch-0.4-31.tar (1470793)
Description (Pavel Moravec, 2018-03-08 10:51:06 UTC)
Created attachment 1405823 [details]
reproducer configs and script
To reproduce:
- unpack the attachment and cd into the directory
- ulimit -c unlimited  # to get coredumps, if interested
- run start_reproducer.sh
  - the very first run customizes the configs for your hostname
  - it starts and provisions qpidd + a "hub" router + 10 Capsule routers
- then run the clients on 1, or better 2, systems (per the script output)
  - I recommend using different machine(s) for the clients, as they generate a load average above 10 :-/
- optionally, check "qpid-stat -q": each queue should have (close to) 100 consumers for each system running the clients (i.e. if I use 2 systems running the clients, each queue will have 190-200 consumers). If the number of consumers is higher, you have already hit the link-leak symptom.
- wait up to 10 minutes; a segfault will occur (in either router)
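Since the check above boils down to "consumer count per queue should stay near 100 per client machine, anything well above that means the links are already leaking", a small helper can automate it. This is a hypothetical sketch: the two-column "queue consumers" layout and the `leaked_queues` name are assumptions for illustration, not the real `qpid-stat -q` output format.

```python
def leaked_queues(stat_lines, expected, slack=5):
    """Return queue names whose consumer count exceeds expected + slack.

    stat_lines is assumed to be simplified qpid-stat-like text with a queue
    name in column 1 and a consumer count in column 2; lines that do not
    match that shape (headers, separators) are skipped.
    """
    leaked = []
    for line in stat_lines:
        parts = line.split()
        if len(parts) < 2 or not parts[1].isdigit():
            continue  # header or malformed line
        name, consumers = parts[0], int(parts[1])
        if consumers > expected + slack:
            leaked.append(name)
    return leaked


# Example: 2 client machines * 100 consumers each -> expected = 200.
sample = [
    "queue              consumers",
    "pulp.bz1553116.0   198",
    "pulp.bz1553116.1   342",
]
print(leaked_queues(sample, expected=200))  # -> ['pulp.bz1553116.1']
```

A queue at 198 consumers is within the expected range for two client machines; 342 is flagged as a leak.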
Improved reproducer that leads to a segfault even on 0.4-30 (where the original reproducer does not crash the router):

1) Run the same setup as before, just add some messages to the tested queues at the end:

    for i in $(seq 0 19); do qpid-config bind amq.fanout pulp.bz1553116.${i}; done
    qpid-send -a amq.fanout -m10 --content-string="hello"

2) As a client script mimicking goferd, use this one:

    import random
    import threading
    import traceback
    import sys

    from proton import Timeout
    from proton import ConnectionException
    from proton.utils import BlockingConnection
    from proton import SSLDomain
    from time import sleep
    from uuid import uuid4

    ROUTER_ADDRESS = "proton+amqp://pmoravec-sat62-rhel7.gsslab.brq2.redhat.com:%s"
    ADDRESS = "pulp.bz1553116"
    HEARTBEAT = 5
    SLEEP_MIN = 1.0
    SLEEP_MAX = 3.0
    THREADS = 20
    ADDRESSES = 20

    class ReceiverThread(threading.Thread):
        def __init__(self, port, address=ADDRESS, domain=None):
            super(ReceiverThread, self).__init__()
            self.port = port
            self.address = address
            self.domain = domain
            self.running = True

        def subscribe(self):
            self.recv = self.conn.create_receiver('%s' % (self.address), name=str(uuid4()), dynamic=False, options=None)

        def connect(self):
            self.conn = BlockingConnection(ROUTER_ADDRESS % (self.port), ssl_domain=self.domain, heartbeat=HEARTBEAT)
            self.subscribe()

        def reconnect(self):
            try:
                self.recv.close()
            except:
                pass
            try:
                self.conn.close()
            except:
                pass
            self.connect()

        def run(self):
            self.connect()
            while self.running:
                try:
                    msg = None
                    timeout = random.uniform(SLEEP_MIN, SLEEP_MAX)
                    msg = self.recv.receive(timeout)
                    if msg:
                        print "(%s): accepting msg %s" % (self.port, msg)
                        self.recv.accept()
                    sleep(timeout)
                except KeyboardInterrupt:
                    self.running = False
                    pass
                except Timeout:
                    pass
                try:
                    print "(%s): bouncing receiver" % self.port
                    self.recv.close()
                    self.subscribe()
                except:
                    print "(%s): receiver thread: failed to close receiver" % self.port
                    self.reconnect()
                    pass

        def stop(self):
            self.running = False

    threads = []
    for i in range(THREADS):
        threads.append(ReceiverThread(sys.argv[1], '%s.%s' % (ADDRESS, i % ADDRESSES)))
        threads[i].start()

    running = True
    while running:
        try:
            sleep(10)
        except KeyboardInterrupt:
            running = False
            pass

    #_in = raw_input("Press Enter to exit:")
    for i in range(THREADS):
        threads[i].stop()
    for i in range(THREADS):
        threads[i].join()

3) Within a very few minutes, segfaults will happen with either backtrace:

    #0  0x0000000000000000 in ?? ()
    #1  0x00007f37ff244788 in pn_class_incref (clazz=<optimized out>, object=object@entry=0x7f37e81ce0a0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/object.c:75
    #2  0x00007f37ff2449f2 in pn_incref (object=object@entry=0x7f37e81ce0a0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/object.c:247
    #3  0x00007f37ff251a96 in pn_delivery_settle (delivery=0x7f37e81ce0a0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/engine/engine.c:1778
    #4  0x00007f37ff494256 in qd_delivery_free_LH (delivery=0x7f37ec0153d0, final_disposition=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:1094
    #5  0x00007f37ff4a03da in router_writable_link_handler (context=0x273c9b0, link=0x28d13d0) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:571
    #6  0x00007f37ff493b9e in do_writable (pn_link=0x299fea0) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:234
    #7  process_handler (unused=<optimized out>, qd_conn=<optimized out>, container=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:625
    #8  handler (handler_context=<optimized out>, conn_context=<optimized out>, event=event@entry=QD_CONN_EVENT_PROCESS, qd_conn=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:675
    #9  0x00007f37ff4a5c7c in process_connector (cxtr=0x7f37ec004180, qd_server=0x27a5f80) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:399
    #10 thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:627
    #11 0x00007f37ff4a6cf0 in qd_server_run (qd=0x24ee030) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:973
    #12 0x0000000000401b38 in main_process (config_path=config_path@entry=0x7ffc50f4a695 "/root/bz1553116_reproducer/qdrouterd.caps03.conf", python_pkgdir=python_pkgdir@entry=0x402261 "/usr/lib/qpid-dispatch/python", fd=fd@entry=2) at /usr/src/debug/qpid-dispatch-0.4/router/src/main.c:135
    #13 0x00000000004017b0 in main (argc=3, argv=0x7ffc50f49098) at /usr/src/debug/qpid-dispatch-0.4/router/src/main.c:335

or:

    #0  pni_record_find (record=0x7fe7846172d0, record=0x7fe7846172d0, key=0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/record.c:71
    #1  pn_record_set (record=0x7fe7846172d0, key=key@entry=0, value=value@entry=0x0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/record.c:132
    #2  0x00007fe797fca351 in pn_delivery_set_context (delivery=<optimized out>, context=context@entry=0x0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/engine/engine.c:1562
    #3  0x00007fe79820d24d in qd_delivery_free_LH (delivery=0x7fe78402b8e0, final_disposition=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:1093
    #4  0x00007fe7982193da in router_writable_link_handler (context=0x1e8d9b0, link=0x7fe77c341220) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:571
    #5  0x00007fe79820cb9e in do_writable (pn_link=0x2381120) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:234
    #6  process_handler (unused=<optimized out>, qd_conn=<optimized out>, container=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:625
    #7  handler (handler_context=<optimized out>, conn_context=<optimized out>, event=event@entry=QD_CONN_EVENT_PROCESS, qd_conn=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:675
    #8  0x00007fe79821ec7c in process_connector (cxtr=0x7fe784004180, qd_server=0x1ef6f80) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:399
    #9  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:627
    #10 0x00007fe797d8fe25 in start_thread (arg=0x7fe78b1a6700) at pthread_create.c:308
    #11 0x00007fe7972e534d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

or also (quite different; maybe because I started multiple routers with the same config?):

    #0  0x00007f89fdaad4d9 in router_disposition_handler (context=0x21c99b0, link=<optimized out>, delivery=0x7f89e0241560) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1110
    #1  0x00007f89fdaa15ce in do_updated (pnd=0x7f89d8464040) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:286
    #2  process_handler (unused=<optimized out>, qd_conn=0x2585430, container=0x21941d0) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:582
    #3  handler (handler_context=0x21941d0, conn_context=<optimized out>, event=event@entry=QD_CONN_EVENT_PROCESS, qd_conn=0x2585430) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:675
    #4  0x00007f89fdab3c7c in process_connector (cxtr=0x7f89e08056f0, qd_server=0x2232f80) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:399
    #5  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:627
    #6  0x00007f89fd624e25 in start_thread (arg=0x7f89f023a700) at pthread_create.c:308
    #7  0x00007f89fcb7a34d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

or (at customer only):

    #0  pn_add_tpwork (delivery=0x7fdf44a2fc90) at /usr/src/debug/qpid-proton-0.9/proton-c/src/engine/engine.c:677
    #1  0x00007fdf70029d3e in pn_delivery_update (delivery=<optimized out>, state=<optimized out>) at /usr/src/debug/qpid-proton-0.9/proton-c/src/engine/engine.c:1928
    #2  0x00007fdf702783b1 in router_writable_link_handler (context=0x25263f0, link=0x7fdf2444e8d0) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:562
    #3  0x00007fdf7026bb9e in do_writable (pn_link=0x26a2360) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:234
    #4  process_handler (unused=<optimized out>, qd_conn=<optimized out>, container=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:625
    #5  handler (handler_context=<optimized out>, conn_context=<optimized out>, event=event@entry=QD_CONN_EVENT_PROCESS, qd_conn=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:675
    #6  0x00007fdf7027dc7c in process_connector (cxtr=0x7fdf2c000dd0, qd_server=0x22e13a0) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:399
    #7  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:627
    #8  0x00007fdf6fdeee25 in start_thread (arg=0x7fdf58ff9700) at pthread_create.c:308
    #9  0x00007fdf6f34434d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

(Technically, the message accept does not work: although I call the method, no message is actually discarded from the broker. I guess this causes the difference between my backtraces and the customer's.)

qdrouterd 0.4-31 fixes the above reproducer from #c4.

** Satellite 6.2 Hotfix Instructions **
1) Download the qpid-dispatch-0.4-31.tar attachment from this bugzilla to your Satellite 6.2 server
2) Untar the archive
3) rpm -Uvh *.rpm
4) katello-service restart
5) Resume operations

Created attachment 1470793 [details]
qpid-dispatch-0.4-31.tar
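The crash-provoking part of the client script above is its bounce logic: close the receiver and resubscribe on every loop iteration, and fall back to a full reconnect when the close fails, so the router sees constant link churn. A stdlib-only sketch of that pattern, with hypothetical stub classes standing in for proton's BlockingConnection and receiver (the stubs, their counters, and the injected close failure are assumptions for illustration, not proton API):

```python
class StubReceiver:
    """Hypothetical stand-in for a proton receiver; close() may fail."""
    def __init__(self, fail_on_close=False):
        self.fail_on_close = fail_on_close
        self.closed = False

    def close(self):
        if self.fail_on_close:
            raise RuntimeError("link already gone")
        self.closed = True


class StubConnection:
    """Hypothetical stand-in for proton.utils.BlockingConnection."""
    def __init__(self):
        self.subscribes = 0   # links created (attach sent to router)
        self.reconnects = 0   # full connection bounces

    def create_receiver(self, fail_on_close=False):
        self.subscribes += 1
        return StubReceiver(fail_on_close)


def bounce(conn, recv):
    """Close and re-open the receiver; on failure, do a full reconnect.

    Mirrors the goferd-like client's loop body: every bounce tears down
    and re-creates a link, which is what leaked links on buggy routers.
    """
    try:
        recv.close()
        return conn.create_receiver()
    except Exception:
        conn.reconnects += 1  # stands in for conn.close() + connect()
        return conn.create_receiver()


conn = StubConnection()
recv = conn.create_receiver()                      # initial subscribe
recv = bounce(conn, recv)                          # clean bounce
recv = bounce(conn, conn.create_receiver(True))    # close fails -> reconnect
print(conn.subscribes, conn.reconnects)            # -> 4 1
```

Four link attaches for two bounces shows why a few hundred such clients generate the (dis)connect load that exposed the segfault.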
Mike, thank you very much for providing this! Best regards, Michael

Removing the 6.2.z flag since 6.2.z is now in maintenance phase 2 [1]. The bug will remain flagged for 6.3.z.
[1] https://access.redhat.com/support/policy/updates/satellite

This was addressed in 6.6 by https://bugzilla.redhat.com/show_bug.cgi?id=1561876. Since 6.3 is EOL, this will not be backported.