Bug 1366232 - qdrouterd segfault with "double free or corruption" in pn_class_decref
Summary: qdrouterd segfault with "double free or corruption" in pn_class_decref
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: katello-agent
Version: 6.2.0
Hardware: x86_64
OS: Linux
medium
high with 1 vote
Target Milestone: Unspecified
Assignee: Ted Ross
QA Contact: Perry Gagne
URL:
Whiteboard:
Duplicates: 1366231 1385890
Depends On:
Blocks:
Reported: 2016-08-11 10:38 UTC by Pavel Moravec
Modified: 2020-04-15 14:36 UTC (History)
CC List: 22 users

Fixed In Version: qpid-dispatch-0.4-17
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-11-10 08:13:35 UTC
Target Upstream Version:
Embargoed:


Attachments
(gdb) thread apply all bt (59.42 KB, text/plain)
2016-08-25 07:54 UTC, Jan Hutař


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:2699 0 normal SHIPPED_LIVE Satellite 6.2.4 Async Bug Release 2016-11-10 13:12:22 UTC

Description Pavel Moravec 2016-08-11 10:38:48 UTC
Description of problem:
Under unknown circumstances (some relevant events are listed below), qdrouterd segfaulted while many clients were connecting to it.



Version-Release number of selected component (if applicable):
libqpid-dispatch-0.4-13.el7sat.x86_64
qpid-dispatch-router-0.4-13.el7sat.x86_64
qpid-proton-c-0.9-16.el7.x86_64


How reproducible:
???


Steps to Reproduce:
???


Actual results:
segfault with backtrace:

(gdb) bt
#0  0x00007f9a412f95f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007f9a412face8 in __GI_abort () at abort.c:90
#2  0x00007f9a41339327 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7f9a41443488 "*** Error in `%s': %s: 0x%s ***\n")
    at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
#3  0x00007f9a41341053 in malloc_printerr (ar_ptr=0x7f99b4000020, ptr=<optimized out>, 
    str=0x7f9a41443588 "double free or corruption (!prev)", action=3) at malloc.c:5022
#4  _int_free (av=0x7f99b4000020, p=<optimized out>, have_lock=0) at malloc.c:3842
#5  0x00007f9a4208d806 in pn_class_decref (clazz=0x7f9a422c12e0 <clazz.4933>, object=0x7f99b402af60)
    at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/object.c:103
#6  0x00007f9a4209b580 in pn_event_finalize (event=0x7f99d40847f0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/events/event.c:190
#7  pn_event_finalize_cast (object=0x7f99d40847f0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/events/event.c:235
#8  0x00007f9a4208d7e8 in pn_class_decref (clazz=0x7f9a422c1460 <clazz.2272>, object=0x7f99d40847f0)
    at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/object.c:97
#9  0x00007f9a4208da12 in pn_decref (object=<optimized out>) at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/object.c:252
#10 0x00007f9a4209b722 in pn_collector_pop (collector=collector@entry=0x20dad80)
    at /usr/src/debug/qpid-proton-0.9/proton-c/src/events/event.c:167
#11 0x00007f9a422daf00 in process_handler (unused=<optimized out>, qd_conn=0x7f9a2800cb30, container=0x1fd3e20)
    at /usr/src/debug/qpid-dispatch-0.4/src/container.c:422
#12 handler (handler_context=0x1fd3e20, conn_context=<optimized out>, event=event@entry=QD_CONN_EVENT_PROCESS, qd_conn=0x7f9a2800cb30)
    at /usr/src/debug/qpid-dispatch-0.4/src/container.c:486
#13 0x00007f9a422edb9c in process_connector (cxtr=0x7f9a28010270, qd_server=0x1fe37d0)
    at /usr/src/debug/qpid-dispatch-0.4/src/server.c:398
#14 thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:626
#15 0x00007f9a41e5fdc5 in start_thread (arg=0x7f9a227f4700) at pthread_create.c:308
#16 0x00007f9a413baced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
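
The backtrace ends with glibc aborting inside pn_class_decref, i.e. a reference-counted Proton object was released when its memory had already been freed (or the heap metadata was corrupted). This report does not identify which code path drops the extra reference. The toy model below only illustrates that class of error; `RefCounted`, `incref`, `decref` and `release_from_two_owners` are hypothetical names, not Proton's API.

```python
class DoubleFree(Exception):
    """Raised by the toy model where glibc would abort the process."""


class RefCounted(object):
    """Toy reference-counted object (hypothetical; not Proton's API)."""

    def __init__(self):
        self.refcount = 1   # the creator holds the first reference
        self.freed = False

    def incref(self):
        self.refcount += 1

    def decref(self):
        if self.freed:
            # A real allocator reports this as "double free or corruption".
            raise DoubleFree("decref on an already freed object")
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True   # stands in for free()


def release_from_two_owners(obj, second_owner_took_ref):
    """Two owners each release obj; safe only if the second one took a ref."""
    if second_owner_took_ref:
        obj.incref()
    obj.decref()   # first owner lets go
    obj.decref()   # second owner lets go
```

If the second owner never called `incref`, the second `decref` hits an object that is already gone, which is the situation the allocator detected in frame #4.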


Expected results:
no segfault


Additional info:
/var/log/messages from relevant time:

Aug 10 08:29:10 ip-10-1-1-2 qdrouterd: Wed Aug 10 08:29:10 2016 ROUTER_LS (info) Router Link Lost - link_id=3
Aug 10 08:29:10 ip-10-1-1-2 qpidd: 2016-08-10 08:29:10 [Protocol] error Error on attach: Node not found: pulp.agent.bb79963e-92e2-4020-a0db-d34d082b0eb7

(the "Error on attach" message repeated multiple times, until:)
Aug 10 08:29:11 ip-10-1-1-2 qdrouterd: *** Error in `/usr/sbin/qdrouterd': double free or corruption (!prev): 0x00007f99b402af50 ***
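
To check whether another system shows the same pattern, the log can be scanned for the two message types quoted above. This is a minimal sketch that assumes the exact line formats shown in this report; `summarize` and the two regex names are made up for illustration.

```python
import re

# Patterns match the qpidd and qdrouterd lines as quoted in this report.
ATTACH_ERROR = re.compile(r"qpidd: .*Error on attach: Node not found: (\S+)")
GLIBC_ABORT = re.compile(
    r"qdrouterd: \*\*\* Error in `[^']+': (.+): (0x[0-9a-f]+) \*\*\*")


def summarize(lines):
    """Count attach errors per address and collect glibc abort messages."""
    attach_errors = {}
    aborts = []
    for line in lines:
        m = ATTACH_ERROR.search(line)
        if m:
            attach_errors[m.group(1)] = attach_errors.get(m.group(1), 0) + 1
            continue
        m = GLIBC_ABORT.search(line)
        if m:
            aborts.append((m.group(1), m.group(2)))
    return attach_errors, aborts
```

`summarize` accepts any iterable of lines, so an open file object for /var/log/messages can be passed directly.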

Comment 3 Pavel Moravec 2016-08-11 11:00:30 UTC
Standalone reproducer:

1) Configure link routing in qdrouterd so that pulp.* addresses are routed to qpidd.
2) Run the script below 10 times in parallel. Each copy repeatedly tries to create a receiver, via qdrouterd, for a queue that the broker does not have, so qpidd logs a "Node not found" error:

#!/usr/bin/python

from time import sleep
from uuid import uuid4
from proton.utils import BlockingConnection, LinkDetached

routerURL = "proton+amqp://0.0.0.0:5648"

conn = BlockingConnection(routerURL, ssl_domain=None, heartbeat=2)

while True:
  sleep(0.05)
  try:
    # Attach to a random pulp.* address; the broker has no such queue.
    rcv = conn.create_receiver("pulp."+str(uuid4()), name=str(uuid4()))
    rcv.close()
  except LinkDetached as e:
    # Expected path: the attach is refused, so reconnect and try again.
    print(e)
    if conn:
      conn.close()
      conn = BlockingConnection(routerURL, ssl_domain=None, heartbeat=2)

<end-of-the-script>
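
Step 2 asks for 10 parallel copies of the script. One way to start them is a small launcher like the sketch below. It assumes the reproducer was saved to a file and that the interpreter running the launcher also has the proton module installed; `launch_copies` and `stop_all` are hypothetical helper names.

```python
import subprocess
import sys


def launch_copies(script_path, copies=10, popen=subprocess.Popen):
    """Start `copies` instances of the reproducer and return their handles."""
    # sys.executable is the interpreter running this launcher (an assumption:
    # it must be one that can import proton).
    return [popen([sys.executable, script_path]) for _ in range(copies)]


def stop_all(procs):
    """Terminate every child; the reproducer never exits on its own."""
    for proc in procs:
        proc.terminate()
    for proc in procs:
        proc.wait()
```

For example, `procs = launch_copies("reproducer.py")` starts the load (the file name is a placeholder), and `stop_all(procs)` ends it once qdrouterd has crashed or enough time has passed.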


This segfault is not usually expected in a Sat6 environment, since it relies on a _missing_ pulp.agent.* queue that goferd tries to subscribe to. Normally, goferd creates its queue during startup.

Comment 4 Pavel Moravec 2016-08-11 11:18:33 UTC
*** Bug 1366231 has been marked as a duplicate of this bug. ***

Comment 8 Jeff Ortel 2016-08-16 15:09:25 UTC
May need to keep this assigned to tross.  The mitigation possible by goferd is to re-create the queue when getting LinkDetached with condition = amqp:not-found.  This means goferd could still try to create a receiver (Link) when the queue does not exist and crash the router.

Note: This can only happen in cases where the queue existed (or was created by goferd on startup) and then disappeared.
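
The mitigation described above can be sketched roughly as follows. This is not goferd's code: `LinkDetachedStub` stands in for the client library's link-detached error, and `create_receiver` / `create_queue` are hypothetical callables supplied by the caller.

```python
class LinkDetachedStub(Exception):
    """Stand-in for a link-detached error carrying an AMQP condition name."""

    def __init__(self, condition):
        Exception.__init__(self, "link detached: %s" % condition)
        self.condition = condition


def receiver_with_queue_recreate(create_receiver, create_queue, address,
                                 max_attempts=2):
    """Open a receiver; on amqp:not-found re-create the queue and retry."""
    for attempt in range(max_attempts):
        try:
            return create_receiver(address)
        except LinkDetachedStub as e:
            last_attempt = attempt == max_attempts - 1
            if e.condition != "amqp:not-found" or last_attempt:
                raise
            create_queue(address)   # queue vanished; declare it again
```

As the comment points out, the first failed attach still reaches the router, so this only cuts down on repeated failures; it does not remove the crash trigger.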

Comment 9 Pavel Moravec 2016-08-16 17:54:30 UTC
(In reply to Jeff Ortel from comment #8)
> May need to keep this assigned to tross.  The mitigation possible by goferd
> is to re-create the queue when getting LinkDetached with condition =
> amqp:not-found.  This means goferd could still try to create a receiver
> (Link) when the queue does not exist and crash the router.
> 
> Note: This can only happen in cases where the queue existed (or was created
> by goferd on startup) and then disappeared.

+1.

The primary problem is that qdrouterd segfaults in some scenarios. goferd can be improved as Jeff suggests, since the repeated link failures from the same agent increased the probability of the segfault.

Comment 14 Jan Hutař 2016-08-25 07:54:16 UTC
Created attachment 1193891
(gdb) thread apply all bt

Comment 27 errata-xmlrpc 2016-11-10 08:13:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:2699

Comment 28 Andrew Kofink 2017-01-05 14:26:34 UTC
*** Bug 1385890 has been marked as a duplicate of this bug. ***

