Bug 1519140 - Qdrouterd continues to fail with segfault
Status: CLOSED ERRATA
Product: Red Hat Satellite 6
Classification: Red Hat
Component: Qpid
Version: 6.3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: GA
Target Release: Unused
Assigned To: Mike Cressman
QA Contact: jcallaha
Whiteboard: scale_lab
Keywords: Triaged
Depends On:
Blocks:
Reported: 2017-11-30 04:20 EST by sbadhwar
Modified: 2018-08-13 12:42 EDT
CC List: 13 users

See Also:
Fixed In Version: qpid-dispatch-0.8.0-16.el7sat
Doc Type: Bug Fix
Doc Text:
Previously, during scale testing, the qdrouterd service repeatedly terminated with segmentation faults in `libqpid-proton.so` while updating deliveries whose underlying link had already been freed. With this update, qdrouterd no longer crashes under this workload.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-02-21 07:57:54 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
from one capsule (it fails for 9 of 10, but they do not have equal load of connected clients) (24.17 KB, text/plain)
2017-11-30 08:42 EST, Jan Hutař


External Trackers
Tracker ID: Red Hat Product Errata RHBA-2018:0338
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Satellite 6.3 Tools
Last Updated: 2018-02-21 14:30:02 EST

Description sbadhwar 2017-11-30 04:20:36 EST
Description of problem:
We are running scale experiments in the Scale lab, and qdrouterd has suddenly started failing on the capsules.
Looking through the capsules, it appears that qdrouterd is hitting a segfault in libqpid-proton.so.

A trace from /var/log/messages:

[root@c10-h19-r730xd-vm1 qdrouterd]# cat /var/log/messages | grep qdrouterd
Nov 28 00:59:50 c10-h19-r730xd-vm1 kernel: qdrouterd[28352]: segfault at 180 ip 00007fdd86d37291 sp 00007fdd7951c2b8 error 4 in libqpid-proton.so.10.0.0[7fdd86d19000+4b000]
Nov 28 00:59:50 c10-h19-r730xd-vm1 systemd: qdrouterd.service: main process exited, code=killed, status=11/SEGV
Nov 28 00:59:50 c10-h19-r730xd-vm1 systemd: Unit qdrouterd.service entered failed state.
Nov 28 00:59:50 c10-h19-r730xd-vm1 systemd: qdrouterd.service failed.
Nov 30 03:44:04 c10-h19-r730xd-vm1 /usr/sbin/katello-service: *** status failed: qdrouterd ***
Nov 30 03:45:58 c10-h19-r730xd-vm1 kernel: qdrouterd[26607]: segfault at 180 ip 00007f9f4d230291 sp 00007f9f3fa152b8 error 4 in libqpid-proton.so.10.0.0[7f9f4d212000+4b000]
Nov 30 03:45:58 c10-h19-r730xd-vm1 systemd: qdrouterd.service: main process exited, code=killed, status=11/SEGV
Nov 30 03:45:58 c10-h19-r730xd-vm1 systemd: Unit qdrouterd.service entered failed state.
Nov 30 03:45:58 c10-h19-r730xd-vm1 systemd: qdrouterd.service failed.
Nov 30 03:50:12 c10-h19-r730xd-vm1 /usr/sbin/katello-service: *** status failed: qdrouterd ***
Nov 30 04:06:07 c10-h19-r730xd-vm1 ansible-file: Invoked with directory_mode=None force=False remote_src=None path=/etc/systemd/system/qdrouterd.service.d owner=root follow=False group=root unsafe_writes=None state=directory content=NOT_LOGGING_PARAMETER serole=None diff_peek=None setype=None selevel=None original_basename=None regexp=None validate=None src=None seuser=None recurse=False delimiter=None mode=0755 attributes=None backup=None
Nov 30 04:06:10 c10-h19-r730xd-vm1 ansible-stat: Invoked with checksum_algorithm=sha1 get_checksum=True path=/etc/systemd/system/qdrouterd.service.d/limits.conf checksum_algo=sha1 follow=False get_md5=False get_mime=True get_attributes=True
Nov 30 04:06:13 c10-h19-r730xd-vm1 ansible-copy: Invoked with directory_mode=None force=True remote_src=None owner=root follow=False local_follow=None group=root unsafe_writes=None setype=None content=NOT_LOGGING_PARAMETER serole=None dest=/etc/systemd/system/qdrouterd.service.d/limits.conf selevel=None original_basename=limits-qdrouterd.conf regexp=None validate=None src=/root/.ansible/tmp/ansible-tmp-1512032767.73-18163995376563/source seuser=None delimiter=None mode=0644 attributes=None backup=False
Nov 30 04:06:15 c10-h19-r730xd-vm1 ansible-command: Invoked with warn=True executable=None _uses_shell=False _raw_params=restorecon -vR /etc/systemd/system/qdrouterd.service.d removes=None creates=None chdir=None stdin=None
Nov 30 04:06:42 c10-h19-r730xd-vm1 kernel: qdrouterd[29818]: segfault at 180 ip 00007f0863215291 sp 00007ffd3666b678 error 4 in libqpid-proton.so.10.0.0[7f08631f7000+4b000]
Nov 30 04:06:42 c10-h19-r730xd-vm1 systemd: qdrouterd.service: main process exited, code=killed, status=11/SEGV
Nov 30 04:06:42 c10-h19-r730xd-vm1 systemd: Unit qdrouterd.service entered failed state.
Nov 30 04:06:42 c10-h19-r730xd-vm1 systemd: qdrouterd.service failed.
Nov 30 04:06:58 c10-h19-r730xd-vm1 /usr/sbin/katello-service: *** status failed: qdrouterd ***
Nov 30 04:08:02 c10-h19-r730xd-vm1 ansible-file: Invoked with directory_mode=None force=False remote_src=None path=/etc/systemd/system/qdrouterd.service.d owner=root follow=False group=root unsafe_writes=None state=directory content=NOT_LOGGING_PARAMETER serole=None diff_peek=None setype=None selevel=None original_basename=None regexp=None validate=None src=None seuser=None recurse=False delimiter=None mode=0755 attributes=None backup=None
Nov 30 04:08:05 c10-h19-r730xd-vm1 ansible-stat: Invoked with checksum_algorithm=sha1 get_checksum=True path=/etc/systemd/system/qdrouterd.service.d/limits.conf checksum_algo=sha1 follow=False get_md5=False get_mime=True get_attributes=True
Nov 30 04:08:07 c10-h19-r730xd-vm1 ansible-file: Invoked with directory_mode=None force=False remote_src=None path=/etc/systemd/system/qdrouterd.service.d/limits.conf owner=root follow=False group=root unsafe_writes=None state=file content=NOT_LOGGING_PARAMETER serole=None diff_peek=None setype=None dest=/etc/systemd/system/qdrouterd.service.d/limits.conf selevel=None original_basename=limits-qdrouterd.conf regexp=None validate=None src=limits-qdrouterd.conf seuser=None recurse=False delimiter=None mode=0644 attributes=None backup=None
Nov 30 04:08:09 c10-h19-r730xd-vm1 ansible-command: Invoked with warn=True executable=None _uses_shell=False _raw_params=restorecon -vR /etc/systemd/system/qdrouterd.service.d removes=None creates=None chdir=None stdin=None
Nov 30 04:08:20 c10-h19-r730xd-vm1 /usr/sbin/katello-service: *** status failed: qdrouterd ***
Nov 30 04:08:47 c10-h19-r730xd-vm1 kernel: qdrouterd[31775]: segfault at 180 ip 00007f56880e4291 sp 00007f56799852b8 error 4 in libqpid-proton.so.10.0.0[7f56880c6000+4b000]
Nov 30 04:08:47 c10-h19-r730xd-vm1 systemd: qdrouterd.service: main process exited, code=killed, status=11/SEGV
Nov 30 04:08:47 c10-h19-r730xd-vm1 systemd: Unit qdrouterd.service entered failed state.
Nov 30 04:08:47 c10-h19-r730xd-vm1 systemd: qdrouterd.service failed.
Nov 30 04:09:06 c10-h19-r730xd-vm1 kernel: qdrouterd[32150]: segfault at 180 ip 00007fb455eea291 sp 00007fb446f8a2b8 error 4 in libqpid-proton.so.10.0.0[7fb455ecc000+4b000]
Nov 30 04:09:06 c10-h19-r730xd-vm1 systemd: qdrouterd.service: main process exited, code=killed, status=11/SEGV
Nov 30 04:09:06 c10-h19-r730xd-vm1 systemd: Unit qdrouterd.service entered failed state.
Nov 30 04:09:06 c10-h19-r730xd-vm1 systemd: qdrouterd.service failed.
Nov 30 04:09:21 c10-h19-r730xd-vm1 /usr/sbin/katello-service: *** status failed: qdrouterd ***
Nov 30 04:11:20 c10-h19-r730xd-vm1 /usr/sbin/katello-service: *** status failed: qdrouterd ***
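
For reference when reading the kernel lines above: "segfault at 180" gives the faulting virtual address (0x180, a small offset off a NULL pointer), and "error 4" is the x86 page-fault error code. A minimal sketch of those bits, using flag values that match the kernel's fault handler (the interpretation of this particular crash is an inference, confirmed later by the gdb analysis below):

#define X86_PF_PROT  0x1  /* set: protection fault on a present page; clear: page not present */
#define X86_PF_WRITE 0x2  /* set: the access was a write; clear: it was a read */
#define X86_PF_USER  0x4  /* set: the fault happened in user mode */
/* "error 4" = X86_PF_USER only: a user-mode read of a non-present page,
 * i.e. a plain bad-pointer dereference at address 0x180. */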
Comment 1 Chris Duryee 2017-11-30 08:27:23 EST
Can you attach the output of "rpm -qa | sort" from one of the capsules?

Also, is this occurring on all capsules?
Comment 2 Jan Hutař 2017-11-30 08:42 EST
Created attachment 1360938 [details]
from one capsule (it fails for 9 of 10, but they do not have equal load of connected clients)
Comment 3 Jan Hutař 2017-11-30 08:42:28 EST
Attached
Comment 4 Mike Cressman 2017-11-30 16:47:06 EST
It would be helpful if you could provide some core files so we can investigate why it's crashing. Also, please share any more information on when this is happening.
Comment 5 sbadhwar 2017-12-01 02:49:02 EST
(In reply to Mike Cressman from comment #4)
> It would be helpful if you could provide some core files so we can
> investigate why it's crashing. Also, please share any more information
> on when this is happening.

Hello Mike,

I am attaching the coredump, collected using abrt, to the bug report.

The segfault happens randomly; we haven't made any changes to the capsule servers that could trigger the error. Once the segfault starts, it repeats at random intervals. For example, we restart the qdrouterd service, it works fine for a few minutes, and then it crashes again. So we are not sure what is triggering this error.

Regards,
Saurabh Badhwar
Comment 8 Ganesh Murthy 2017-12-01 16:03:18 EST
Is it possible to get an actual core file? I would like to see the route the code takes before it crashes. Thanks.
Comment 10 Ganesh Murthy 2017-12-04 16:39:18 EST
Hi Pavel,
   Thanks for the backtrace. The crash seems to originate from 

https://github.com/apache/qpid-dispatch/blob/0.8.x/src/router_node.c#L1098

and it actually happens on 

https://github.com/apache/qpid-proton/blob/master/proton-c/src/core/engine.c#L727

(pn_connection_t *connection = delivery->link->session->connection;)

I see the following possibilities

1. The dispatch code freed the pn_delivery_t object and is trying to access it again.
2. The link/session that the pn_delivery_t belongs to has been freed because the connection between the router and the client went down, which automatically frees all underlying links/sessions belonging to that connection.

I am leaning towards #2 but would not rule out #1

Is it possible for you to check, using wireshark, whether the connection between the router and the client went down abruptly? Or maybe the client intentionally closes the connection before receiving a disposition update.

Also, if it is #1, can we start the router in debug mode and get the core dump to see if the pn_delivery_t has been freed or is null?
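
To make the suspected failure concrete, here is a minimal sketch of the dereference chain at engine.c:727; the 0x180 offset is an assumption inferred from the kernel log above, not proton's documented layout:

/* engine.c:727 -- the chain that faults: */
pn_connection_t *connection = delivery->link->session->connection;

/* If the link was freed and the pointer zeroed (possibility #2), the
 * load of delivery->link->session reads from
 * NULL + offsetof(pn_link_t, session), plausibly the 0x180 seen in
 * "segfault at 180" above (the exact offset is an assumption). A
 * purely illustrative guard, NOT the upstream fix: */
if (delivery->link == NULL)
    return;  /* the delivery outlived its link/session/connection */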
Comment 11 sbadhwar 2017-12-05 10:33:10 EST
(In reply to Ganesh Murthy from comment #10)
> Also, if it is #1, can we start the router in debug mode and get the core
> dump to see if the pn_delivery_t has been freed or is null?

Hey,

I tried to gather logs after enabling debug and trace logging in qdrouterd, but as soon as I enabled them, qdrouterd started working perfectly, which makes it hard to capture a trace. I am still trying to find a way to reproduce the crash and provide a log dump.

Also, I did some analysis on another capsule whose qdrouterd was also failing, and it looks like the delivery->link has been freed:

(gdb) print delivery
$6 = (pn_delivery_t *) 0x7f0b881447c0
(gdb) ptype pn_delivery_t
type = struct pn_delivery_t {
    pn_disposition_t local;
    pn_disposition_t remote;
    pn_link_t *link;
    pn_buffer_t *tag;
    pn_delivery_t *unsettled_next;
    pn_delivery_t *unsettled_prev;
    pn_delivery_t *work_next;
    pn_delivery_t *work_prev;
    pn_delivery_t *tpwork_next;
    pn_delivery_t *tpwork_prev;
    pn_delivery_state_t state;
    pn_buffer_t *bytes;
    pn_record_t *context;
    _Bool updated;
    _Bool settled;
    _Bool work;
    _Bool tpwork;
    _Bool done;
    _Bool referenced;
}
(gdb) print *delivery
$7 = {
  local = {
    condition = {
      name = 0x7f0b881ba660, 
      description = 0x7f0b88213ef0, 
      info = 0x7f0b8820dc50
    }, 
    type = 39, 
    data = 0x7f0b881446f0, 
    annotations = 0x7f0b88145480, 
    section_offset = 0, 
    section_number = 0, 
    failed = true, 
    undeliverable = false, 
    settled = true
  }, 
  remote = {
    condition = {
      name = 0x7f0b8820da60, 
      description = 0x7f0b8820dab0, 
      info = 0x7f0b8820db00
    }, 
    type = 0, 
    data = 0x7f0b88144600, 
    annotations = 0x7f0b88214010, 
    section_offset = 0, 
    section_number = 0, 
    failed = false, 
    undeliverable = false, 
    settled = false
  }, 
  link = 0x0, 
  tag = 0x7f0b881ba920, 
  unsettled_next = 0x0, 
  unsettled_prev = 0x0, 
  work_next = 0x7f0b88048cf0, 
  work_prev = 0x0, 
  tpwork_next = 0x7f0b8802c230, 
  tpwork_prev = 0x0, 
  state = {
    id = 0, 
    sent = false, 
    init = false
  }, 
  bytes = 0x7f0b881ba770, 
  context = 0x7f0b881badd0, 
  updated = false, 
  settled = true, 
  work = false, 
  tpwork = false, 
  done = true, 
  referenced = true
}
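
The dump above matches possibility #2 from comment 10: link = 0x0 while the delivery is still being worked (settled = true, done = true), so the delivery has outlived its link. A standalone sketch that reproduces the same class of fault, using hypothetical stand-in structs (not proton's real definitions), padded so the bad load lands at 0x180:

#include <stddef.h>

/* Hypothetical stand-ins for pn_link_t / pn_session_t / pn_delivery_t,
 * for illustration only. */
typedef struct my_session_t { struct my_connection_t *connection; } my_session_t;
typedef struct my_link_t    { char pad[0x180]; my_session_t *session; } my_link_t;
typedef struct my_delivery_t { my_link_t *link; } my_delivery_t;

int main(void) {
    my_delivery_t d = { .link = NULL };   /* link already freed and zeroed */
    /* Mirrors engine.c:727: loads *(NULL + 0x180), producing
     * "segfault at 180 ... error 4" just like the kernel log. */
    my_session_t *s = d.link->session;
    return s != NULL;
}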
Comment 12 sbadhwar 2017-12-06 22:39:00 EST
I tried doing some more analysis from the thread file; here is the per-thread output from gdb:

[root@c10-h19-r730xd-vm3 ~]# gdb qdrouterd /var/spool/abrt/ccpp-2017-12-05-09\:16\:21-14072/coredump -ex 'info thread' -ex 'thread apply all bt' -batch
[New LWP 14075]
[New LWP 14072]
[New LWP 14073]
[New LWP 14076]
[New LWP 14074]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/qdrouterd -c /etc/qpid-dispatch/qdrouterd.conf'.
Program terminated with signal 11, Segmentation fault.
#0  pni_add_tpwork (delivery=0x7f0b881447c0) at /usr/src/debug/qpid-proton-0.16.0/proton-c/src/core/engine.c:727
727	  pn_connection_t *connection = delivery->link->session->connection;
  Id   Target Id         Frame 
  5    Thread 0x7f0b9ed34700 (LWP 14074) 0x00007f0bac2a0c0b in send () from /lib64/libpthread.so.0
  4    Thread 0x7f0b9dd32700 (LWP 14076) 0x00007f0bac2a042d in __lll_lock_wait () from /lib64/libpthread.so.0
  3    Thread 0x7f0b9f748700 (LWP 14073) 0x00007f0bac29d945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  2    Thread 0x7f0bacb6b1c0 (LWP 14072) 0x00007f0baaed9052 in mul4x_internal () from /lib64/libcrypto.so.10
* 1    Thread 0x7f0b9e533700 (LWP 14075) pni_add_tpwork (delivery=0x7f0b881447c0) at /usr/src/debug/qpid-proton-0.16.0/proton-c/src/core/engine.c:727

Thread 5 (Thread 0x7f0b9ed34700 (LWP 14074)):
#0  0x00007f0bac2a0c0b in send () from /lib64/libpthread.so.0
#1  0x00007f0bac72a5e7 in connector_process (c=0x7f0b94098150) at /usr/src/debug/qpid-dispatch-0.8.0/src/posix/driver.c:776
#2  0x00007f0bac742d09 in process_connector (cxtr=<optimized out>, qd_server=0x141a370) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:734
#3  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:1012
#4  0x00007f0bac299e25 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f0bab5cd34d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f0b9dd32700 (LWP 14076)):
#0  0x00007f0bac2a042d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f0bac2a2c6c in _L_cond_lock_789 () from /lib64/libpthread.so.0
#2  0x00007f0bac2a2b01 in __pthread_mutex_cond_lock () from /lib64/libpthread.so.0
#3  0x00007f0bac29d9d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#4  0x00007f0bac72b319 in sys_cond_wait (cond=<optimized out>, held_mutex=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/posix/threading.c:91
#5  0x00007f0bac74292c in thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:894
#6  0x00007f0bac299e25 in start_thread () from /lib64/libpthread.so.0
#7  0x00007f0bab5cd34d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f0b9f748700 (LWP 14073)):
#0  0x00007f0bac29d945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f0bac72b319 in sys_cond_wait (cond=<optimized out>, held_mutex=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/posix/threading.c:91
#2  0x00007f0bac7392d5 in router_core_thread (arg=0x1448e20) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_core/router_core_thread.c:66
#3  0x00007f0bac299e25 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f0bab5cd34d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f0bacb6b1c0 (LWP 14072)):
#0  0x00007f0baaed9052 in mul4x_internal () from /lib64/libcrypto.so.10
#1  0x00007f0baaed92c4 in bn_power5 () from /lib64/libcrypto.so.10
#2  0x000000000175b600 in ?? ()
#3  0x00007ffd21c408f0 in ?? ()
#4  0x000000000175b380 in ?? ()
#5  0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7f0b9e533700 (LWP 14075)):
#0  pni_add_tpwork (delivery=0x7f0b881447c0) at /usr/src/debug/qpid-proton-0.16.0/proton-c/src/core/engine.c:727
#1  0x00007f0bac4cdcae in pn_delivery_update (delivery=<optimized out>, state=<optimized out>) at /usr/src/debug/qpid-proton-0.16.0/proton-c/src/core/engine.c:2026
#2  0x00007f0bac73dcc8 in CORE_delivery_update (context=0x144b970, dlv=0x7f0b90074f10, disp=39, settled=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_node.c:1068
#3  0x00007f0bac732c5d in qdr_connection_process (conn=0x7f0b88030bd0) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_core/connections.c:306
#4  0x00007f0bac71e1d8 in writable_handler (container=0xe658c0, container=0xe658c0, conn=<optimized out>, qd_conn=0x7f0b90000b10) at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:307
#5  handler (handler_context=0xe658c0, conn_context=0x1294f40, event=<optimized out>, qd_conn=0x7f0b90000b10) at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:614
#6  0x00007f0bac742ccb in process_connector (cxtr=<optimized out>, qd_server=0x141a370) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:792
#7  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:1012
#8  0x00007f0bac299e25 in start_thread () from /lib64/libpthread.so.0
#9  0x00007f0bab5cd34d in clone () from /lib64/libc.so.6
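
The backtrace confirms the path: CORE_delivery_update (router_node.c:1068) calls pn_delivery_update, which calls pni_add_tpwork and dereferences the dead link. As a sketch only, one plausible shape of a guard on that path (the actual fix shipped in qpid-dispatch-0.8.0-16.el7sat may be different), using proton's public pn_delivery_link() accessor; reading the link pointer is safe even though dereferencing it is not:

#include <stdbool.h>
#include <stdint.h>
#include <proton/delivery.h>

/* Hypothetical guarded settlement path; an illustration, not the shipped fix. */
static void delivery_update_guarded(pn_delivery_t *pnd, uint64_t disp, bool settled)
{
    if (!pnd || !pn_delivery_link(pnd))
        return;                       /* the delivery outlived its link */
    pn_delivery_update(pnd, disp);    /* safe: link/session/connection still live */
    if (settled)
        pn_delivery_settle(pnd);
}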
Comment 16 Bryan Kearney 2018-01-03 12:08:43 EST
Moving to MODIFIED based on the existence of the build.
Comment 17 jcallaha 2018-02-07 09:11:38 EST
Verified in Satellite 6.3 Snap 35. 

Based on discussion with scale team, we are no longer seeing this issue with the current version of qdrouterd shipped in 6.3.
Comment 20 errata-xmlrpc 2018-02-21 07:57:54 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0338
