Red Hat Bugzilla – Bug 1519140
Qdrouterd continues to fail with segfault
Last modified: 2018-08-13 12:42:18 EDT
Description of problem:

We are running scale experiments in the Scale lab right now, and qdrouterd has suddenly started failing on the capsules. On inspecting the capsules, qdrouterd appears to be segfaulting in libqpid-proton.so. A trace from /var/log/messages:

[root@c10-h19-r730xd-vm1 qdrouterd]# cat /var/log/messages | grep qdrouterd
Nov 28 00:59:50 c10-h19-r730xd-vm1 kernel: qdrouterd[28352]: segfault at 180 ip 00007fdd86d37291 sp 00007fdd7951c2b8 error 4 in libqpid-proton.so.10.0.0[7fdd86d19000+4b000]
Nov 28 00:59:50 c10-h19-r730xd-vm1 systemd: qdrouterd.service: main process exited, code=killed, status=11/SEGV
Nov 28 00:59:50 c10-h19-r730xd-vm1 systemd: Unit qdrouterd.service entered failed state.
Nov 28 00:59:50 c10-h19-r730xd-vm1 systemd: qdrouterd.service failed.
Nov 30 03:44:04 c10-h19-r730xd-vm1 /usr/sbin/katello-service: *** status failed: qdrouterd ***
Nov 30 03:45:58 c10-h19-r730xd-vm1 kernel: qdrouterd[26607]: segfault at 180 ip 00007f9f4d230291 sp 00007f9f3fa152b8 error 4 in libqpid-proton.so.10.0.0[7f9f4d212000+4b000]
Nov 30 03:45:58 c10-h19-r730xd-vm1 systemd: qdrouterd.service: main process exited, code=killed, status=11/SEGV
Nov 30 03:45:58 c10-h19-r730xd-vm1 systemd: Unit qdrouterd.service entered failed state.
Nov 30 03:45:58 c10-h19-r730xd-vm1 systemd: qdrouterd.service failed.
Nov 30 03:50:12 c10-h19-r730xd-vm1 /usr/sbin/katello-service: *** status failed: qdrouterd ***
Nov 30 04:06:07 c10-h19-r730xd-vm1 ansible-file: Invoked with directory_mode=None force=False remote_src=None path=/etc/systemd/system/qdrouterd.service.d owner=root follow=False group=root unsafe_writes=None state=directory content=NOT_LOGGING_PARAMETER serole=None diff_peek=None setype=None selevel=None original_basename=None regexp=None validate=None src=None seuser=None recurse=False delimiter=None mode=0755 attributes=None backup=None
Nov 30 04:06:10 c10-h19-r730xd-vm1 ansible-stat: Invoked with checksum_algorithm=sha1 get_checksum=True path=/etc/systemd/system/qdrouterd.service.d/limits.conf checksum_algo=sha1 follow=False get_md5=False get_mime=True get_attributes=True
Nov 30 04:06:13 c10-h19-r730xd-vm1 ansible-copy: Invoked with directory_mode=None force=True remote_src=None owner=root follow=False local_follow=None group=root unsafe_writes=None setype=None content=NOT_LOGGING_PARAMETER serole=None dest=/etc/systemd/system/qdrouterd.service.d/limits.conf selevel=None original_basename=limits-qdrouterd.conf regexp=None validate=None src=/root/.ansible/tmp/ansible-tmp-1512032767.73-18163995376563/source seuser=None delimiter=None mode=0644 attributes=None backup=False
Nov 30 04:06:15 c10-h19-r730xd-vm1 ansible-command: Invoked with warn=True executable=None _uses_shell=False _raw_params=restorecon -vR /etc/systemd/system/qdrouterd.service.d removes=None creates=None chdir=None stdin=None
Nov 30 04:06:42 c10-h19-r730xd-vm1 kernel: qdrouterd[29818]: segfault at 180 ip 00007f0863215291 sp 00007ffd3666b678 error 4 in libqpid-proton.so.10.0.0[7f08631f7000+4b000]
Nov 30 04:06:42 c10-h19-r730xd-vm1 systemd: qdrouterd.service: main process exited, code=killed, status=11/SEGV
Nov 30 04:06:42 c10-h19-r730xd-vm1 systemd: Unit qdrouterd.service entered failed state.
Nov 30 04:06:42 c10-h19-r730xd-vm1 systemd: qdrouterd.service failed.
Nov 30 04:06:58 c10-h19-r730xd-vm1 /usr/sbin/katello-service: *** status failed: qdrouterd ***
Nov 30 04:08:02 c10-h19-r730xd-vm1 ansible-file: Invoked with directory_mode=None force=False remote_src=None path=/etc/systemd/system/qdrouterd.service.d owner=root follow=False group=root unsafe_writes=None state=directory content=NOT_LOGGING_PARAMETER serole=None diff_peek=None setype=None selevel=None original_basename=None regexp=None validate=None src=None seuser=None recurse=False delimiter=None mode=0755 attributes=None backup=None
Nov 30 04:08:05 c10-h19-r730xd-vm1 ansible-stat: Invoked with checksum_algorithm=sha1 get_checksum=True path=/etc/systemd/system/qdrouterd.service.d/limits.conf checksum_algo=sha1 follow=False get_md5=False get_mime=True get_attributes=True
Nov 30 04:08:07 c10-h19-r730xd-vm1 ansible-file: Invoked with directory_mode=None force=False remote_src=None path=/etc/systemd/system/qdrouterd.service.d/limits.conf owner=root follow=False group=root unsafe_writes=None state=file content=NOT_LOGGING_PARAMETER serole=None diff_peek=None setype=None dest=/etc/systemd/system/qdrouterd.service.d/limits.conf selevel=None original_basename=limits-qdrouterd.conf regexp=None validate=None src=limits-qdrouterd.conf seuser=None recurse=False delimiter=None mode=0644 attributes=None backup=None
Nov 30 04:08:09 c10-h19-r730xd-vm1 ansible-command: Invoked with warn=True executable=None _uses_shell=False _raw_params=restorecon -vR /etc/systemd/system/qdrouterd.service.d removes=None creates=None chdir=None stdin=None
Nov 30 04:08:20 c10-h19-r730xd-vm1 /usr/sbin/katello-service: *** status failed: qdrouterd ***
Nov 30 04:08:47 c10-h19-r730xd-vm1 kernel: qdrouterd[31775]: segfault at 180 ip 00007f56880e4291 sp 00007f56799852b8 error 4 in libqpid-proton.so.10.0.0[7f56880c6000+4b000]
Nov 30 04:08:47 c10-h19-r730xd-vm1 systemd: qdrouterd.service: main process exited, code=killed, status=11/SEGV
Nov 30 04:08:47 c10-h19-r730xd-vm1 systemd: Unit qdrouterd.service entered failed state.
Nov 30 04:08:47 c10-h19-r730xd-vm1 systemd: qdrouterd.service failed.
Nov 30 04:09:06 c10-h19-r730xd-vm1 kernel: qdrouterd[32150]: segfault at 180 ip 00007fb455eea291 sp 00007fb446f8a2b8 error 4 in libqpid-proton.so.10.0.0[7fb455ecc000+4b000]
Nov 30 04:09:06 c10-h19-r730xd-vm1 systemd: qdrouterd.service: main process exited, code=killed, status=11/SEGV
Nov 30 04:09:06 c10-h19-r730xd-vm1 systemd: Unit qdrouterd.service entered failed state.
Nov 30 04:09:06 c10-h19-r730xd-vm1 systemd: qdrouterd.service failed.
Nov 30 04:09:21 c10-h19-r730xd-vm1 /usr/sbin/katello-service: *** status failed: qdrouterd ***
Nov 30 04:11:20 c10-h19-r730xd-vm1 /usr/sbin/katello-service: *** status failed: qdrouterd ***
Can you attach the output of "rpm -qa | sort" from one of the capsules? Also, is this occurring on all capsules?
Created attachment 1360938 [details] from one capsule (it fails for 9 of 10, but they do not have equal load of connected clients)
Attached
It would be helpful if you could provide some core files so we can investigate why it's crashing. Also, please share any more information on when this happens.
(In reply to Mike Cressman from comment #4)
> It would be helpful if you could provide some core files so we can
> investigate why it's crashing. Also, any more info on when this is
> happening.

Hello Mike,

I am attaching the coredump, collected using abrt, to the bug report. The segfault happens at random; we haven't made any changes to the capsule servers that could trigger the error. Once the segfault starts happening, it repeats at random intervals: for example, we restart the qdrouterd service, it works fine for a few minutes, and then it crashes again. So we are not sure what is triggering this error.

Regards,
Saurabh Badhwar
Is it possible to get an actual core file? I would like to see the route the code is taking before it crashes. Thanks.
Hi Pavel,

Thanks for the backtrace. The crash seems to originate from

https://github.com/apache/qpid-dispatch/blob/0.8.x/src/router_node.c#L1098

and it actually happens at

https://github.com/apache/qpid-proton/blob/master/proton-c/src/core/engine.c#L727

(pn_connection_t *connection = delivery->link->session->connection;)

I see the following possibilities:

1. The dispatch code freed the pn_delivery_t object and is trying to access it again.
2. The link/session that the pn_delivery_t belongs to has been freed because the connection between the router and the client went down, automatically freeing all underlying links/sessions belonging to that connection.

I am leaning towards #2 but would not rule out #1.

Is it possible for you to check, using wireshark, whether the connection between the router and the client went down abruptly? Or maybe the client intentionally closes the connection before receiving a disposition update.

Also, if it is #1, can we start the router in debug mode and get the core dump to see if the pn_delivery_t has been freed or is null?
(In reply to Ganesh Murthy from comment #10)
> Hi Pavel,
> Thanks for the backtrace. The crash seems to originate from
>
> https://github.com/apache/qpid-dispatch/blob/0.8.x/src/router_node.c#L1098
>
> and it actually happens on
>
> https://github.com/apache/qpid-proton/blob/master/proton-c/src/core/engine.c#L727
>
> (pn_connection_t *connection = delivery->link->session->connection;)
>
> I see the following possibilities
>
> 1. The dispatch code freed the pn_delivery_t object and is trying to access it again.
> 2. link/session that the pn_delivery_t belongs to has been freed because the connection between the router and the client went down and hence automatically freeing all underlying links/sessions belonging to that connection.
>
> I am leaning towards #2 but would not rule out #1
>
> Is it possible for you to see, using wireshark, if the connection between router and client abruptly went down? Or maybe the client intentionally closes the connection before receiving a disposition update.
>
> Also, if it is #1, can we start the router in debug mode and get the core dump to see if the pn_delivery_t has been freed or is null?

Hey,

I tried to gather logs after enabling debug and trace logging in qdrouterd, but as soon as I enabled them, qdrouterd started working perfectly, which makes it hard to capture the trace. I am trying to find a way to reproduce the crash and provide a log dump.

Also, while doing some analysis on another capsule whose qdrouterd also failed, it appears that delivery->link has been freed:
(gdb) print delivery
$6 = (pn_delivery_t *) 0x7f0b881447c0
(gdb) ptype pn_delivery_t
type = struct pn_delivery_t {
    pn_disposition_t local;
    pn_disposition_t remote;
    pn_link_t *link;
    pn_buffer_t *tag;
    pn_delivery_t *unsettled_next;
    pn_delivery_t *unsettled_prev;
    pn_delivery_t *work_next;
    pn_delivery_t *work_prev;
    pn_delivery_t *tpwork_next;
    pn_delivery_t *tpwork_prev;
    pn_delivery_state_t state;
    pn_buffer_t *bytes;
    pn_record_t *context;
    _Bool updated;
    _Bool settled;
    _Bool work;
    _Bool tpwork;
    _Bool done;
    _Bool referenced;
}
(gdb) print *delivery
$7 = {
  local = {
    condition = {
      name = 0x7f0b881ba660,
      description = 0x7f0b88213ef0,
      info = 0x7f0b8820dc50
    },
    type = 39,
    data = 0x7f0b881446f0,
    annotations = 0x7f0b88145480,
    section_offset = 0,
    section_number = 0,
    failed = true,
    undeliverable = false,
    settled = true
  },
  remote = {
    condition = {
      name = 0x7f0b8820da60,
      description = 0x7f0b8820dab0,
      info = 0x7f0b8820db00
    },
    type = 0,
    data = 0x7f0b88144600,
    annotations = 0x7f0b88214010,
    section_offset = 0,
    section_number = 0,
    failed = false,
    undeliverable = false,
    settled = false
  },
  link = 0x0,
  tag = 0x7f0b881ba920,
  unsettled_next = 0x0,
  unsettled_prev = 0x0,
  work_next = 0x7f0b88048cf0,
  work_prev = 0x0,
  tpwork_next = 0x7f0b8802c230,
  tpwork_prev = 0x0,
  state = {
    id = 0,
    sent = false,
    init = false
  },
  bytes = 0x7f0b881ba770,
  context = 0x7f0b881badd0,
  updated = false,
  settled = true,
  work = false,
  tpwork = false,
  done = true,
  referenced = true
}
I tried doing some more analysis from the thread file; here is the per-thread output from gdb:

[root@c10-h19-r730xd-vm3 ~]# gdb qdrouterd /var/spool/abrt/ccpp-2017-12-05-09\:16\:21-14072/coredump -ex 'info thread' -ex 'thread apply all bt' -batch
[New LWP 14075]
[New LWP 14072]
[New LWP 14073]
[New LWP 14076]
[New LWP 14074]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/qdrouterd -c /etc/qpid-dispatch/qdrouterd.conf'.
Program terminated with signal 11, Segmentation fault.
#0  pni_add_tpwork (delivery=0x7f0b881447c0) at /usr/src/debug/qpid-proton-0.16.0/proton-c/src/core/engine.c:727
727	pn_connection_t *connection = delivery->link->session->connection;
  Id   Target Id                            Frame
  5    Thread 0x7f0b9ed34700 (LWP 14074)   0x00007f0bac2a0c0b in send () from /lib64/libpthread.so.0
  4    Thread 0x7f0b9dd32700 (LWP 14076)   0x00007f0bac2a042d in __lll_lock_wait () from /lib64/libpthread.so.0
  3    Thread 0x7f0b9f748700 (LWP 14073)   0x00007f0bac29d945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  2    Thread 0x7f0bacb6b1c0 (LWP 14072)   0x00007f0baaed9052 in mul4x_internal () from /lib64/libcrypto.so.10
* 1    Thread 0x7f0b9e533700 (LWP 14075)   pni_add_tpwork (delivery=0x7f0b881447c0) at /usr/src/debug/qpid-proton-0.16.0/proton-c/src/core/engine.c:727

Thread 5 (Thread 0x7f0b9ed34700 (LWP 14074)):
#0  0x00007f0bac2a0c0b in send () from /lib64/libpthread.so.0
#1  0x00007f0bac72a5e7 in connector_process (c=0x7f0b94098150) at /usr/src/debug/qpid-dispatch-0.8.0/src/posix/driver.c:776
#2  0x00007f0bac742d09 in process_connector (cxtr=<optimized out>, qd_server=0x141a370) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:734
#3  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:1012
#4  0x00007f0bac299e25 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f0bab5cd34d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f0b9dd32700 (LWP 14076)):
#0  0x00007f0bac2a042d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f0bac2a2c6c in _L_cond_lock_789 () from /lib64/libpthread.so.0
#2  0x00007f0bac2a2b01 in __pthread_mutex_cond_lock () from /lib64/libpthread.so.0
#3  0x00007f0bac29d9d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#4  0x00007f0bac72b319 in sys_cond_wait (cond=<optimized out>, held_mutex=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/posix/threading.c:91
#5  0x00007f0bac74292c in thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:894
#6  0x00007f0bac299e25 in start_thread () from /lib64/libpthread.so.0
#7  0x00007f0bab5cd34d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f0b9f748700 (LWP 14073)):
#0  0x00007f0bac29d945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f0bac72b319 in sys_cond_wait (cond=<optimized out>, held_mutex=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/posix/threading.c:91
#2  0x00007f0bac7392d5 in router_core_thread (arg=0x1448e20) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_core/router_core_thread.c:66
#3  0x00007f0bac299e25 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f0bab5cd34d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f0bacb6b1c0 (LWP 14072)):
#0  0x00007f0baaed9052 in mul4x_internal () from /lib64/libcrypto.so.10
#1  0x00007f0baaed92c4 in bn_power5 () from /lib64/libcrypto.so.10
#2  0x000000000175b600 in ?? ()
#3  0x00007ffd21c408f0 in ?? ()
#4  0x000000000175b380 in ?? ()
#5  0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7f0b9e533700 (LWP 14075)):
#0  pni_add_tpwork (delivery=0x7f0b881447c0) at /usr/src/debug/qpid-proton-0.16.0/proton-c/src/core/engine.c:727
#1  0x00007f0bac4cdcae in pn_delivery_update (delivery=<optimized out>, state=<optimized out>) at /usr/src/debug/qpid-proton-0.16.0/proton-c/src/core/engine.c:2026
#2  0x00007f0bac73dcc8 in CORE_delivery_update (context=0x144b970, dlv=0x7f0b90074f10, disp=39, settled=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_node.c:1068
#3  0x00007f0bac732c5d in qdr_connection_process (conn=0x7f0b88030bd0) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_core/connections.c:306
#4  0x00007f0bac71e1d8 in writable_handler (container=0xe658c0, container=0xe658c0, conn=<optimized out>, qd_conn=0x7f0b90000b10) at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:307
#5  handler (handler_context=0xe658c0, conn_context=0x1294f40, event=<optimized out>, qd_conn=0x7f0b90000b10) at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:614
#6  0x00007f0bac742ccb in process_connector (cxtr=<optimized out>, qd_server=0x141a370) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:792
#7  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:1012
#8  0x00007f0bac299e25 in start_thread () from /lib64/libpthread.so.0
#9  0x00007f0bab5cd34d in clone () from /lib64/libc.so.6
Moving to MODIFIED based on the existence of the build.
Verified in Satellite 6.3 Snap 35. Based on discussion with scale team, we are no longer seeing this issue with the current version of qdrouterd shipped in 6.3.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0338