# Bug 1519140 - Qdrouterd continues to fail with segfault

| Field | Value |
|---|---|
| Product | Red Hat Satellite |
| Component | Qpid |
| Version | 6.3.0 |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Reporter | sbadhwar |
| Assignee | Mike Cressman <mcressma> |
| QA Contact | jcallaha |
| CC | bbuckingham, bkearney, cduryee, cwelton, ehelms, gmurthy, jhutar, mmccune, pmoravec, psuriset, sbadhwar, smane, tony |
| Keywords | Triaged |
| Whiteboard | scale_lab |
| Target Milestone | Unspecified |
| Target Release | Unused |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | qpid-dispatch-0.8.0-16.el7sat |
| Doc Type | Bug Fix |
| Doc Text | During scale testing, qdrouterd experienced segmentation faults in `libqpid.so`. This is now fixed. |
| Type | Bug |
| Last Closed | 2018-02-21 12:57:54 UTC |
**Description** (sbadhwar, 2017-11-30 09:20:36 UTC)

---

Can you attach the output of `rpm -qa | sort` from one of the capsules? Also, is this occurring on all capsules?

---

Created attachment 1360938 [details]

From one capsule (it fails for 9 of 10, but they do not have an equal load of connected clients).

Attached.

---

**Mike Cressman** (comment #4):

It would be helpful if you could provide some core files so we can investigate why it's crashing. Also, any more info on when this is happening.

---

(In reply to Mike Cressman from comment #4)

Hello Mike,

I am attaching the coredump, collected using abrt, to the bug report. The segfault happens randomly; we haven't made any changes to the capsule servers that could trigger the error. Once the segfault starts happening, it repeats at random times. For example, we will restart the qdrouterd service, it will work fine for a few minutes, and then it will crash. So we are not sure what might be triggering this error.

Regards,
Saurabh Badhwar

---

Is it possible to get an actual core file? I would like to see the route the code is taking before it crashes. Thanks.

---

**Ganesh Murthy** (comment #10):

Hi Pavel,

Thanks for the backtrace. The crash seems to originate from

https://github.com/apache/qpid-dispatch/blob/0.8.x/src/router_node.c#L1098

and it actually happens at

https://github.com/apache/qpid-proton/blob/master/proton-c/src/core/engine.c#L727

(`pn_connection_t *connection = delivery->link->session->connection;`)

I see the following possibilities:

1. The dispatch code freed the `pn_delivery_t` object and is trying to access it again.
2. The link/session that the `pn_delivery_t` belongs to has been freed: the connection between the router and the client went down, automatically freeing all underlying links/sessions belonging to that connection.

I am leaning towards #2 but would not rule out #1.

Is it possible for you to see, using wireshark, whether the connection between router and client abruptly went down? Or maybe the client intentionally closes the connection before receiving a disposition update.

Also, if it is #1, can we start the router in debug mode and get the core dump to see whether the `pn_delivery_t` has been freed or is null?
---

(In reply to Ganesh Murthy from comment #10)

> Also, if it is #1, can we start the router in debug mode and get the core
> dump to see if the pn_delivery_t has been freed or is null?

Hey,

I was trying to get some logs after enabling debug and trace logging in qdrouterd, but as soon as I enabled them, qdrouterd started working perfectly, which makes it hard to gather the trace. I am trying to find a way to reproduce the crash so I can provide a log dump.

I also did some analysis on another capsule whose qdrouterd was failing, and it looks like `delivery->link` has been freed:
```
(gdb) print delivery
$6 = (pn_delivery_t *) 0x7f0b881447c0
(gdb) ptype pn_delivery_t
type = struct pn_delivery_t {
    pn_disposition_t local;
    pn_disposition_t remote;
    pn_link_t *link;
    pn_buffer_t *tag;
    pn_delivery_t *unsettled_next;
    pn_delivery_t *unsettled_prev;
    pn_delivery_t *work_next;
    pn_delivery_t *work_prev;
    pn_delivery_t *tpwork_next;
    pn_delivery_t *tpwork_prev;
    pn_delivery_state_t state;
    pn_buffer_t *bytes;
    pn_record_t *context;
    _Bool updated;
    _Bool settled;
    _Bool work;
    _Bool tpwork;
    _Bool done;
    _Bool referenced;
}
(gdb) print *delivery
$7 = {
  local = {
    condition = {
      name = 0x7f0b881ba660,
      description = 0x7f0b88213ef0,
      info = 0x7f0b8820dc50
    },
    type = 39,
    data = 0x7f0b881446f0,
    annotations = 0x7f0b88145480,
    section_offset = 0,
    section_number = 0,
    failed = true,
    undeliverable = false,
    settled = true
  },
  remote = {
    condition = {
      name = 0x7f0b8820da60,
      description = 0x7f0b8820dab0,
      info = 0x7f0b8820db00
    },
    type = 0,
    data = 0x7f0b88144600,
    annotations = 0x7f0b88214010,
    section_offset = 0,
    section_number = 0,
    failed = false,
    undeliverable = false,
    settled = false
  },
  link = 0x0,
  tag = 0x7f0b881ba920,
  unsettled_next = 0x0,
  unsettled_prev = 0x0,
  work_next = 0x7f0b88048cf0,
  work_prev = 0x0,
  tpwork_next = 0x7f0b8802c230,
  tpwork_prev = 0x0,
  state = {
    id = 0,
    sent = false,
    init = false
  },
  bytes = 0x7f0b881ba770,
  context = 0x7f0b881badd0,
  updated = false,
  settled = true,
  work = false,
  tpwork = false,
  done = true,
  referenced = true
}
```

I tried doing some more analysis from the thread file; here is the per-thread output from gdb:

```
[root@c10-h19-r730xd-vm3 ~]# gdb qdrouterd /var/spool/abrt/ccpp-2017-12-05-09\:16\:21-14072/coredump -ex 'info thread' -ex 'thread apply all bt' -batch
[New LWP 14075]
[New LWP 14072]
[New LWP 14073]
[New LWP 14076]
[New LWP 14074]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/qdrouterd -c /etc/qpid-dispatch/qdrouterd.conf'.
Program terminated with signal 11, Segmentation fault.
#0  pni_add_tpwork (delivery=0x7f0b881447c0) at /usr/src/debug/qpid-proton-0.16.0/proton-c/src/core/engine.c:727
727       pn_connection_t *connection = delivery->link->session->connection;
  Id   Target Id         Frame
  5    Thread 0x7f0b9ed34700 (LWP 14074) 0x00007f0bac2a0c0b in send () from /lib64/libpthread.so.0
  4    Thread 0x7f0b9dd32700 (LWP 14076) 0x00007f0bac2a042d in __lll_lock_wait () from /lib64/libpthread.so.0
  3    Thread 0x7f0b9f748700 (LWP 14073) 0x00007f0bac29d945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  2    Thread 0x7f0bacb6b1c0 (LWP 14072) 0x00007f0baaed9052 in mul4x_internal () from /lib64/libcrypto.so.10
* 1    Thread 0x7f0b9e533700 (LWP 14075) pni_add_tpwork (delivery=0x7f0b881447c0) at /usr/src/debug/qpid-proton-0.16.0/proton-c/src/core/engine.c:727

Thread 5 (Thread 0x7f0b9ed34700 (LWP 14074)):
#0  0x00007f0bac2a0c0b in send () from /lib64/libpthread.so.0
#1  0x00007f0bac72a5e7 in connector_process (c=0x7f0b94098150) at /usr/src/debug/qpid-dispatch-0.8.0/src/posix/driver.c:776
#2  0x00007f0bac742d09 in process_connector (cxtr=<optimized out>, qd_server=0x141a370) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:734
#3  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:1012
#4  0x00007f0bac299e25 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f0bab5cd34d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f0b9dd32700 (LWP 14076)):
#0  0x00007f0bac2a042d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f0bac2a2c6c in _L_cond_lock_789 () from /lib64/libpthread.so.0
#2  0x00007f0bac2a2b01 in __pthread_mutex_cond_lock () from /lib64/libpthread.so.0
#3  0x00007f0bac29d9d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#4  0x00007f0bac72b319 in sys_cond_wait (cond=<optimized out>, held_mutex=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/posix/threading.c:91
#5  0x00007f0bac74292c in thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:894
#6  0x00007f0bac299e25 in start_thread () from /lib64/libpthread.so.0
#7  0x00007f0bab5cd34d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f0b9f748700 (LWP 14073)):
#0  0x00007f0bac29d945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f0bac72b319 in sys_cond_wait (cond=<optimized out>, held_mutex=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/posix/threading.c:91
#2  0x00007f0bac7392d5 in router_core_thread (arg=0x1448e20) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_core/router_core_thread.c:66
#3  0x00007f0bac299e25 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f0bab5cd34d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f0bacb6b1c0 (LWP 14072)):
#0  0x00007f0baaed9052 in mul4x_internal () from /lib64/libcrypto.so.10
#1  0x00007f0baaed92c4 in bn_power5 () from /lib64/libcrypto.so.10
#2  0x000000000175b600 in ?? ()
#3  0x00007ffd21c408f0 in ?? ()
#4  0x000000000175b380 in ?? ()
#5  0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7f0b9e533700 (LWP 14075)):
#0  pni_add_tpwork (delivery=0x7f0b881447c0) at /usr/src/debug/qpid-proton-0.16.0/proton-c/src/core/engine.c:727
#1  0x00007f0bac4cdcae in pn_delivery_update (delivery=<optimized out>, state=<optimized out>) at /usr/src/debug/qpid-proton-0.16.0/proton-c/src/core/engine.c:2026
#2  0x00007f0bac73dcc8 in CORE_delivery_update (context=0x144b970, dlv=0x7f0b90074f10, disp=39, settled=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_node.c:1068
#3  0x00007f0bac732c5d in qdr_connection_process (conn=0x7f0b88030bd0) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_core/connections.c:306
#4  0x00007f0bac71e1d8 in writable_handler (container=0xe658c0, container=0xe658c0, conn=<optimized out>, qd_conn=0x7f0b90000b10) at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:307
#5  handler (handler_context=0xe658c0, conn_context=0x1294f40, event=<optimized out>, qd_conn=0x7f0b90000b10) at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:614
#6  0x00007f0bac742ccb in process_connector (cxtr=<optimized out>, qd_server=0x141a370) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:792
#7  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:1012
#8  0x00007f0bac299e25 in start_thread () from /lib64/libpthread.so.0
#9  0x00007f0bab5cd34d in clone () from /lib64/libc.so.6
```

---

Moving to MODIFIED based on the existence of the build.

---

Verified in Satellite 6.3 Snap 35. Based on discussion with the scale team, we are no longer seeing this issue with the current version of qdrouterd shipped in 6.3.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0338