Bug 1366232

Summary: qdrouterd segfault with "double free or corruption" in pn_class_decref
Product: Red Hat Satellite
Reporter: Pavel Moravec <pmoravec>
Component: katello-agent
Assignee: Ted Ross <tross>
Status: CLOSED ERRATA
QA Contact: Perry Gagne <pgagne>
Severity: high
Docs Contact:
Priority: medium
Version: 6.2.0
CC: adam, bkearney, cduryee, chartwel, chrobert, egolov, erinn.looneytriggs, jberry86, jbubeck, jhutar, ktordeur, mcressma, mmccune, nmiao, omaciel, oshtaier, pgagne, pmoravec, prsharma, psuriset, sramacha, tross
Target Milestone: Unspecified
Target Release: Unused
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: qpid-dispatch-0.4-17
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-11-10 08:13:35 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
(gdb) thread apply all bt (flags: none)

Description Pavel Moravec 2016-08-11 10:38:48 UTC
Description of problem:
Under unknown circumstances (some related events are listed below), qdrouterd segfaulted while many clients were connecting to it.



Version-Release number of selected component (if applicable):
libqpid-dispatch-0.4-13.el7sat.x86_64
qpid-dispatch-router-0.4-13.el7sat.x86_64
qpid-proton-c-0.9-16.el7.x86_64


How reproducible:
???


Steps to Reproduce:
???


Actual results:
segfault with backtrace:

(gdb) bt
#0  0x00007f9a412f95f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007f9a412face8 in __GI_abort () at abort.c:90
#2  0x00007f9a41339327 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7f9a41443488 "*** Error in `%s': %s: 0x%s ***\n")
    at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
#3  0x00007f9a41341053 in malloc_printerr (ar_ptr=0x7f99b4000020, ptr=<optimized out>, 
    str=0x7f9a41443588 "double free or corruption (!prev)", action=3) at malloc.c:5022
#4  _int_free (av=0x7f99b4000020, p=<optimized out>, have_lock=0) at malloc.c:3842
#5  0x00007f9a4208d806 in pn_class_decref (clazz=0x7f9a422c12e0 <clazz.4933>, object=0x7f99b402af60)
    at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/object.c:103
#6  0x00007f9a4209b580 in pn_event_finalize (event=0x7f99d40847f0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/events/event.c:190
#7  pn_event_finalize_cast (object=0x7f99d40847f0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/events/event.c:235
#8  0x00007f9a4208d7e8 in pn_class_decref (clazz=0x7f9a422c1460 <clazz.2272>, object=0x7f99d40847f0)
    at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/object.c:97
#9  0x00007f9a4208da12 in pn_decref (object=<optimized out>) at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/object.c:252
#10 0x00007f9a4209b722 in pn_collector_pop (collector=collector@entry=0x20dad80)
    at /usr/src/debug/qpid-proton-0.9/proton-c/src/events/event.c:167
#11 0x00007f9a422daf00 in process_handler (unused=<optimized out>, qd_conn=0x7f9a2800cb30, container=0x1fd3e20)
    at /usr/src/debug/qpid-dispatch-0.4/src/container.c:422
#12 handler (handler_context=0x1fd3e20, conn_context=<optimized out>, event=event@entry=QD_CONN_EVENT_PROCESS, qd_conn=0x7f9a2800cb30)
    at /usr/src/debug/qpid-dispatch-0.4/src/container.c:486
#13 0x00007f9a422edb9c in process_connector (cxtr=0x7f9a28010270, qd_server=0x1fe37d0)
    at /usr/src/debug/qpid-dispatch-0.4/src/server.c:398
#14 thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:626
#15 0x00007f9a41e5fdc5 in start_thread (arg=0x7f9a227f4700) at pthread_create.c:308
#16 0x00007f9a413baced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113


Expected results:
no segfault


Additional info:
/var/log/messages from relevant time:

Aug 10 08:29:10 ip-10-1-1-2 qdrouterd: Wed Aug 10 08:29:10 2016 ROUTER_LS (info) Router Link Lost - link_id=3
Aug 10 08:29:10 ip-10-1-1-2 qpidd: 2016-08-10 08:29:10 [Protocol] error Error on attach: Node not found: pulp.agent.bb79963e-92e2-4020-a0db-d34d082b0eb7

(The attach error repeated multiple times, until:)
Aug 10 08:29:11 ip-10-1-1-2 qdrouterd: *** Error in `/usr/sbin/qdrouterd': double free or corruption (!prev): 0x00007f99b402af50 ***
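
To check whether another crash follows the same pattern, the log excerpt above can be scanned for attach errors that precede the crash line. This is a minimal sketch, not a supported tool: the function name is made up for illustration, and the two patterns are derived only from the three log lines quoted above, so other log formats may not match.

```python
import re

# Patterns derived from the log lines quoted in this report (assumption:
# other syslog layouts may need different expressions).
ATTACH_ERROR = re.compile(r"qpidd: .*Error on attach: Node not found: (\S+)")
CRASH = re.compile(r"qdrouterd: \*\*\* Error in .*double free or corruption")


def attach_errors_before_crash(lines):
    """Return node names from attach errors seen before the first crash line.

    Returns None when no crash line is found in the input.
    """
    missing_nodes = []
    for line in lines:
        if CRASH.search(line):
            return missing_nodes
        match = ATTACH_ERROR.search(line)
        if match:
            missing_nodes.append(match.group(1))
    return None
```

The function accepts any iterable of lines, so an open file object for /var/log/messages can be passed directly.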

Comment 3 Pavel Moravec 2016-08-11 11:00:30 UTC
Standalone reproducer:

1) Configure link routing in qdrouterd so that pulp.* is routed to qpidd.
2) Run the script below 10 times in parallel. It tries to create a receiver through qdrouterd to qpidd, but the broker has no such queue (qpidd prints a "Node not found" error):

#!/usr/bin/python

from time import sleep
from uuid import uuid4
from proton.utils import BlockingConnection, LinkDetached

routerURL = "proton+amqp://0.0.0.0:5648"

conn = BlockingConnection(routerURL, ssl_domain=None, heartbeat=2)

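while True:
  sleep(0.05)
  try:
    # Attach to a random pulp.* address that has no queue on the broker.
    rcv = conn.create_receiver("pulp."+str(uuid4()), name=str(uuid4()))
    rcv.close()
  except LinkDetached as e:
    # The broker rejected the attach ("Node not found"); reconnect and retry.
    print(e)
    if conn:
      conn.close()
      conn = BlockingConnection(routerURL, ssl_domain=None, heartbeat=2)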

<end-of-the-script>


This segfault is not usually expected to happen in a Sat6 environment, since it relies on a _missing_ pulp.agent.* queue that goferd tries to subscribe to. Usually, goferd should create its queue during startup.
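
Step 2 of the reproducer (running the script 10 times in parallel) could be driven by a small launcher like the one below. It is a sketch under assumptions: the helper names are made up here, and the script file name in the usage note is hypothetical because the report does not name the file.

```python
import subprocess


def launch_parallel(command, count=10):
    """Start `count` copies of `command` and return their Popen handles."""
    return [subprocess.Popen(command) for _ in range(count)]


def wait_all(processes):
    """Block until every process exits and return the exit codes in order."""
    return [proc.wait() for proc in processes]
```

For example, `wait_all(launch_parallel(["python", "reproducer.py"]))` starts ten copies, assuming the script was saved as reproducer.py. Because the script loops forever, that call only returns once the copies are killed or fail.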

Comment 4 Pavel Moravec 2016-08-11 11:18:33 UTC
*** Bug 1366231 has been marked as a duplicate of this bug. ***

Comment 8 Jeff Ortel 2016-08-16 15:09:25 UTC
May need to keep this assigned to tross.  The mitigation possible by goferd is to re-create the queue when getting LinkDetached with condition = amqp:not-found.  This means goferd could still try to create a receiver (Link) when the queue does not exist and crash the router.

Note: This can only happen in cases where the queue existed (or was created by goferd on startup) and then disappeared.
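
The mitigation described above might look roughly like the following sketch. It is not goferd code: the exception class is a stand-in for a link-detached error that carries an AMQP condition name, and `create_receiver` and `create_queue` are hypothetical caller-supplied hooks.

```python
class LinkDetachedStandIn(Exception):
    """Stand-in for a link-detached error carrying an AMQP condition name."""

    def __init__(self, condition):
        super(LinkDetachedStandIn, self).__init__(condition)
        self.condition = condition


NOT_FOUND = "amqp:not-found"


def open_receiver(create_receiver, create_queue, address, max_attempts=2):
    """Open a receiver, re-creating the queue if the broker reports it missing.

    create_receiver and create_queue are caller-supplied callables
    (hypothetical hooks, not goferd functions).
    """
    for attempt in range(max_attempts):
        try:
            return create_receiver(address)
        except LinkDetachedStandIn as error:
            # Only the "queue is missing" case is recoverable here, and only
            # while attempts remain; anything else is re-raised to the caller.
            if error.condition != NOT_FOUND or attempt == max_attempts - 1:
                raise
            create_queue(address)
```

As the comment notes, the first attach attempt still reaches the router while the queue is missing, so this limits repeated link failures but does not remove the router-side crash.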

Comment 9 Pavel Moravec 2016-08-16 17:54:30 UTC
(In reply to Jeff Ortel from comment #8)
> May need to keep this assigned to tross.  The mitigation possible by goferd
> is to re-create the queue when getting LinkDetached with condition =
> amqp:not-found.  This means goferd could still try to create a receiver
> (Link) when the queue does not exist and crash the router.
> 
> Note: This can only happen in cases where the queue existed (or was created
> by goferd on startup) and then disappeared.

+1.

The primary problem is that qdrouterd segfaults in this scenario. goferd can be improved as Jeff suggests, since repeated link failures from the same agent increase the probability of the segfault.

Comment 14 Jan Hutař 2016-08-25 07:54:16 UTC
Created attachment 1193891
(gdb) thread apply all bt

Comment 27 errata-xmlrpc 2016-11-10 08:13:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:2699

Comment 28 Andrew Kofink 2017-01-05 14:26:34 UTC
*** Bug 1385890 has been marked as a duplicate of this bug. ***