Red Hat Bugzilla – Bug 1519140
Qdrouterd continues to fail with segfault
Last modified: 2018-08-13 12:42:18 EDT
Description of problem:

We are running scale experiments in the Scale lab right now, and qdrouterd has suddenly started failing on the capsules. On inspecting the capsules, qdrouterd appears to be segfaulting in libqpid-proton.so. A trace from /var/log/messages:

[root@c10-h19-r730xd-vm1 qdrouterd]# cat /var/log/messages | grep qdrouterd
Nov 28 00:59:50 c10-h19-r730xd-vm1 kernel: qdrouterd[28352]: segfault at 180 ip 00007fdd86d37291 sp 00007fdd7951c2b8 error 4 in libqpid-proton.so.10.0.0[7fdd86d19000+4b000]
Nov 28 00:59:50 c10-h19-r730xd-vm1 systemd: qdrouterd.service: main process exited, code=killed, status=11/SEGV
Nov 28 00:59:50 c10-h19-r730xd-vm1 systemd: Unit qdrouterd.service entered failed state.
Nov 28 00:59:50 c10-h19-r730xd-vm1 systemd: qdrouterd.service failed.
Nov 30 03:44:04 c10-h19-r730xd-vm1 /usr/sbin/katello-service: *** status failed: qdrouterd ***
Nov 30 03:45:58 c10-h19-r730xd-vm1 kernel: qdrouterd[26607]: segfault at 180 ip 00007f9f4d230291 sp 00007f9f3fa152b8 error 4 in libqpid-proton.so.10.0.0[7f9f4d212000+4b000]
Nov 30 03:45:58 c10-h19-r730xd-vm1 systemd: qdrouterd.service: main process exited, code=killed, status=11/SEGV
Nov 30 03:45:58 c10-h19-r730xd-vm1 systemd: Unit qdrouterd.service entered failed state.
Nov 30 03:45:58 c10-h19-r730xd-vm1 systemd: qdrouterd.service failed.
Nov 30 03:50:12 c10-h19-r730xd-vm1 /usr/sbin/katello-service: *** status failed: qdrouterd ***
Nov 30 04:06:07 c10-h19-r730xd-vm1 ansible-file: Invoked with directory_mode=None force=False remote_src=None path=/etc/systemd/system/qdrouterd.service.d owner=root follow=False group=root unsafe_writes=None state=directory content=NOT_LOGGING_PARAMETER serole=None diff_peek=None setype=None selevel=None original_basename=None regexp=None validate=None src=None seuser=None recurse=False delimiter=None mode=0755 attributes=None backup=None
Nov 30 04:06:10 c10-h19-r730xd-vm1 ansible-stat: Invoked with checksum_algorithm=sha1 get_checksum=True path=/etc/systemd/system/qdrouterd.service.d/limits.conf checksum_algo=sha1 follow=False get_md5=False get_mime=True get_attributes=True
Nov 30 04:06:13 c10-h19-r730xd-vm1 ansible-copy: Invoked with directory_mode=None force=True remote_src=None owner=root follow=False local_follow=None group=root unsafe_writes=None setype=None content=NOT_LOGGING_PARAMETER serole=None dest=/etc/systemd/system/qdrouterd.service.d/limits.conf selevel=None original_basename=limits-qdrouterd.conf regexp=None validate=None src=/root/.ansible/tmp/ansible-tmp-1512032767.73-18163995376563/source seuser=None delimiter=None mode=0644 attributes=None backup=False
Nov 30 04:06:15 c10-h19-r730xd-vm1 ansible-command: Invoked with warn=True executable=None _uses_shell=False _raw_params=restorecon -vR /etc/systemd/system/qdrouterd.service.d removes=None creates=None chdir=None stdin=None
Nov 30 04:06:42 c10-h19-r730xd-vm1 kernel: qdrouterd[29818]: segfault at 180 ip 00007f0863215291 sp 00007ffd3666b678 error 4 in libqpid-proton.so.10.0.0[7f08631f7000+4b000]
Nov 30 04:06:42 c10-h19-r730xd-vm1 systemd: qdrouterd.service: main process exited, code=killed, status=11/SEGV
Nov 30 04:06:42 c10-h19-r730xd-vm1 systemd: Unit qdrouterd.service entered failed state.
Nov 30 04:06:42 c10-h19-r730xd-vm1 systemd: qdrouterd.service failed.
Nov 30 04:06:58 c10-h19-r730xd-vm1 /usr/sbin/katello-service: *** status failed: qdrouterd ***
Nov 30 04:08:02 c10-h19-r730xd-vm1 ansible-file: Invoked with directory_mode=None force=False remote_src=None path=/etc/systemd/system/qdrouterd.service.d owner=root follow=False group=root unsafe_writes=None state=directory content=NOT_LOGGING_PARAMETER serole=None diff_peek=None setype=None selevel=None original_basename=None regexp=None validate=None src=None seuser=None recurse=False delimiter=None mode=0755 attributes=None backup=None
Nov 30 04:08:05 c10-h19-r730xd-vm1 ansible-stat: Invoked with checksum_algorithm=sha1 get_checksum=True path=/etc/systemd/system/qdrouterd.service.d/limits.conf checksum_algo=sha1 follow=False get_md5=False get_mime=True get_attributes=True
Nov 30 04:08:07 c10-h19-r730xd-vm1 ansible-file: Invoked with directory_mode=None force=False remote_src=None path=/etc/systemd/system/qdrouterd.service.d/limits.conf owner=root follow=False group=root unsafe_writes=None state=file content=NOT_LOGGING_PARAMETER serole=None diff_peek=None setype=None dest=/etc/systemd/system/qdrouterd.service.d/limits.conf selevel=None original_basename=limits-qdrouterd.conf regexp=None validate=None src=limits-qdrouterd.conf seuser=None recurse=False delimiter=None mode=0644 attributes=None backup=None
Nov 30 04:08:09 c10-h19-r730xd-vm1 ansible-command: Invoked with warn=True executable=None _uses_shell=False _raw_params=restorecon -vR /etc/systemd/system/qdrouterd.service.d removes=None creates=None chdir=None stdin=None
Nov 30 04:08:20 c10-h19-r730xd-vm1 /usr/sbin/katello-service: *** status failed: qdrouterd ***
Nov 30 04:08:47 c10-h19-r730xd-vm1 kernel: qdrouterd[31775]: segfault at 180 ip 00007f56880e4291 sp 00007f56799852b8 error 4 in libqpid-proton.so.10.0.0[7f56880c6000+4b000]
Nov 30 04:08:47 c10-h19-r730xd-vm1 systemd: qdrouterd.service: main process exited, code=killed, status=11/SEGV
Nov 30 04:08:47 c10-h19-r730xd-vm1 systemd: Unit qdrouterd.service entered failed state.
Nov 30 04:08:47 c10-h19-r730xd-vm1 systemd: qdrouterd.service failed.
Nov 30 04:09:06 c10-h19-r730xd-vm1 kernel: qdrouterd[32150]: segfault at 180 ip 00007fb455eea291 sp 00007fb446f8a2b8 error 4 in libqpid-proton.so.10.0.0[7fb455ecc000+4b000]
Nov 30 04:09:06 c10-h19-r730xd-vm1 systemd: qdrouterd.service: main process exited, code=killed, status=11/SEGV
Nov 30 04:09:06 c10-h19-r730xd-vm1 systemd: Unit qdrouterd.service entered failed state.
Nov 30 04:09:06 c10-h19-r730xd-vm1 systemd: qdrouterd.service failed.
Nov 30 04:09:21 c10-h19-r730xd-vm1 /usr/sbin/katello-service: *** status failed: qdrouterd ***
Nov 30 04:11:20 c10-h19-r730xd-vm1 /usr/sbin/katello-service: *** status failed: qdrouterd ***
Can you attach the output of "rpm -qa | sort" from one of the capsules? Also, is this occurring on all capsules?
Created attachment 1360938 [details] from one capsule (it fails for 9 of 10, but they do not have equal load of connected clients)
Attached
It would be helpful if you could provide some core files so we can investigate why it's crashing. Also, please share any more information on when this happens.
(In reply to Mike Cressman from comment #4)
> It would be helpful if you could provide some core files so we can
> investigate why it's crashing. Also, any more info on when this is
> happening.

Hello Mike,

I am attaching the coredump, collected using abrt, to the bug report. The segfault happens at random; we haven't made any changes to the capsule servers that could trigger the error. Once the segfault starts happening, it repeats at random intervals: for example, we restart the qdrouterd service, it works fine for a few minutes, and then it crashes again. So we are not sure what is triggering this error.

Regards,
Saurabh Badhwar
Is it possible to get an actual core file? I would like to see the route the code is taking before it crashes. Thanks.
Hi Pavel,

Thanks for the backtrace. The crash seems to originate from

https://github.com/apache/qpid-dispatch/blob/0.8.x/src/router_node.c#L1098

and it actually happens at

https://github.com/apache/qpid-proton/blob/master/proton-c/src/core/engine.c#L727

(pn_connection_t *connection = delivery->link->session->connection;)

I see the following possibilities:

1. The dispatch code freed the pn_delivery_t object and is trying to access it again.
2. The link/session that the pn_delivery_t belongs to has been freed because the connection between the router and the client went down, automatically freeing all underlying links/sessions belonging to that connection.

I am leaning towards #2 but would not rule out #1.

Is it possible for you to check, using wireshark, whether the connection between the router and the client went down abruptly? Or maybe the client intentionally closes the connection before receiving a disposition update.

Also, if it is #1, can we start the router in debug mode and get the core dump to see if the pn_delivery_t has been freed or is null?
(In reply to Ganesh Murthy from comment #10)
> Hi Pavel,
> Thanks for the backtrace. The crash seems to originate from
>
> https://github.com/apache/qpid-dispatch/blob/0.8.x/src/router_node.c#L1098
>
> and it actually happens on
>
> https://github.com/apache/qpid-proton/blob/master/proton-c/src/core/engine.c#L727
>
> (pn_connection_t *connection = delivery->link->session->connection;)
>
> I see the following possibilities
>
> 1. The dispatch code freed the pn_delivery_t object and is trying to access it again.
> 2. link/session that the pn_delivery_t belongs to has been freed because the connection between the router and the client went down and hence automatically freeing all underlying links/sessions belonging to that connection.
>
> I am leaning towards #2 but would not rule out #1
>
> Is it possible for you to see, using wireshark, if the connection between router and client abruptly went down? Or maybe the client intentionally closes the connection before receiving a disposition update.
>
> Also, if it is #1, can we start the router in debug mode and get the core dump to see if the pn_delivery_t has been freed or is null?

Hey,

I tried to gather logs after enabling debug and trace logging in qdrouterd, but as soon as I enabled them, qdrouterd started working perfectly, which makes it hard to capture the trace. I am trying to find a way to reproduce the crash and provide a log dump.

Also, while doing some analysis on another capsule whose qdrouterd also failed, it appears that delivery->link has been freed:
(gdb) print delivery
$6 = (pn_delivery_t *) 0x7f0b881447c0
(gdb) ptype pn_delivery_t
type = struct pn_delivery_t {
    pn_disposition_t local;
    pn_disposition_t remote;
    pn_link_t *link;
    pn_buffer_t *tag;
    pn_delivery_t *unsettled_next;
    pn_delivery_t *unsettled_prev;
    pn_delivery_t *work_next;
    pn_delivery_t *work_prev;
    pn_delivery_t *tpwork_next;
    pn_delivery_t *tpwork_prev;
    pn_delivery_state_t state;
    pn_buffer_t *bytes;
    pn_record_t *context;
    _Bool updated;
    _Bool settled;
    _Bool work;
    _Bool tpwork;
    _Bool done;
    _Bool referenced;
}
(gdb) print *delivery
$7 = {
  local = {
    condition = {
      name = 0x7f0b881ba660,
      description = 0x7f0b88213ef0,
      info = 0x7f0b8820dc50
    },
    type = 39,
    data = 0x7f0b881446f0,
    annotations = 0x7f0b88145480,
    section_offset = 0,
    section_number = 0,
    failed = true,
    undeliverable = false,
    settled = true
  },
  remote = {
    condition = {
      name = 0x7f0b8820da60,
      description = 0x7f0b8820dab0,
      info = 0x7f0b8820db00
    },
    type = 0,
    data = 0x7f0b88144600,
    annotations = 0x7f0b88214010,
    section_offset = 0,
    section_number = 0,
    failed = false,
    undeliverable = false,
    settled = false
  },
  link = 0x0,
  tag = 0x7f0b881ba920,
  unsettled_next = 0x0,
  unsettled_prev = 0x0,
  work_next = 0x7f0b88048cf0,
  work_prev = 0x0,
  tpwork_next = 0x7f0b8802c230,
  tpwork_prev = 0x0,
  state = {
    id = 0,
    sent = false,
    init = false
  },
  bytes = 0x7f0b881ba770,
  context = 0x7f0b881badd0,
  updated = false,
  settled = true,
  work = false,
  tpwork = false,
  done = true,
  referenced = true
}
I tried doing some more analysis from the thread file; here is the per-thread output from gdb:

[root@c10-h19-r730xd-vm3 ~]# gdb qdrouterd /var/spool/abrt/ccpp-2017-12-05-09\:16\:21-14072/coredump -ex 'info thread' -ex 'thread apply all bt' -batch
[New LWP 14075]
[New LWP 14072]
[New LWP 14073]
[New LWP 14076]
[New LWP 14074]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/qdrouterd -c /etc/qpid-dispatch/qdrouterd.conf'.
Program terminated with signal 11, Segmentation fault.
#0  pni_add_tpwork (delivery=0x7f0b881447c0) at /usr/src/debug/qpid-proton-0.16.0/proton-c/src/core/engine.c:727
727	pn_connection_t *connection = delivery->link->session->connection;
  Id   Target Id                            Frame
  5    Thread 0x7f0b9ed34700 (LWP 14074)   0x00007f0bac2a0c0b in send () from /lib64/libpthread.so.0
  4    Thread 0x7f0b9dd32700 (LWP 14076)   0x00007f0bac2a042d in __lll_lock_wait () from /lib64/libpthread.so.0
  3    Thread 0x7f0b9f748700 (LWP 14073)   0x00007f0bac29d945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  2    Thread 0x7f0bacb6b1c0 (LWP 14072)   0x00007f0baaed9052 in mul4x_internal () from /lib64/libcrypto.so.10
* 1    Thread 0x7f0b9e533700 (LWP 14075)   pni_add_tpwork (delivery=0x7f0b881447c0) at /usr/src/debug/qpid-proton-0.16.0/proton-c/src/core/engine.c:727

Thread 5 (Thread 0x7f0b9ed34700 (LWP 14074)):
#0  0x00007f0bac2a0c0b in send () from /lib64/libpthread.so.0
#1  0x00007f0bac72a5e7 in connector_process (c=0x7f0b94098150) at /usr/src/debug/qpid-dispatch-0.8.0/src/posix/driver.c:776
#2  0x00007f0bac742d09 in process_connector (cxtr=<optimized out>, qd_server=0x141a370) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:734
#3  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:1012
#4  0x00007f0bac299e25 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f0bab5cd34d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f0b9dd32700 (LWP 14076)):
#0  0x00007f0bac2a042d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f0bac2a2c6c in _L_cond_lock_789 () from /lib64/libpthread.so.0
#2  0x00007f0bac2a2b01 in __pthread_mutex_cond_lock () from /lib64/libpthread.so.0
#3  0x00007f0bac29d9d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#4  0x00007f0bac72b319 in sys_cond_wait (cond=<optimized out>, held_mutex=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/posix/threading.c:91
#5  0x00007f0bac74292c in thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:894
#6  0x00007f0bac299e25 in start_thread () from /lib64/libpthread.so.0
#7  0x00007f0bab5cd34d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f0b9f748700 (LWP 14073)):
#0  0x00007f0bac29d945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f0bac72b319 in sys_cond_wait (cond=<optimized out>, held_mutex=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/posix/threading.c:91
#2  0x00007f0bac7392d5 in router_core_thread (arg=0x1448e20) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_core/router_core_thread.c:66
#3  0x00007f0bac299e25 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f0bab5cd34d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f0bacb6b1c0 (LWP 14072)):
#0  0x00007f0baaed9052 in mul4x_internal () from /lib64/libcrypto.so.10
#1  0x00007f0baaed92c4 in bn_power5 () from /lib64/libcrypto.so.10
#2  0x000000000175b600 in ?? ()
#3  0x00007ffd21c408f0 in ?? ()
#4  0x000000000175b380 in ?? ()
#5  0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7f0b9e533700 (LWP 14075)):
#0  pni_add_tpwork (delivery=0x7f0b881447c0) at /usr/src/debug/qpid-proton-0.16.0/proton-c/src/core/engine.c:727
#1  0x00007f0bac4cdcae in pn_delivery_update (delivery=<optimized out>, state=<optimized out>) at /usr/src/debug/qpid-proton-0.16.0/proton-c/src/core/engine.c:2026
#2  0x00007f0bac73dcc8 in CORE_delivery_update (context=0x144b970, dlv=0x7f0b90074f10, disp=39, settled=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_node.c:1068
#3  0x00007f0bac732c5d in qdr_connection_process (conn=0x7f0b88030bd0) at /usr/src/debug/qpid-dispatch-0.8.0/src/router_core/connections.c:306
#4  0x00007f0bac71e1d8 in writable_handler (container=0xe658c0, container=0xe658c0, conn=<optimized out>, qd_conn=0x7f0b90000b10) at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:307
#5  handler (handler_context=0xe658c0, conn_context=0x1294f40, event=<optimized out>, qd_conn=0x7f0b90000b10) at /usr/src/debug/qpid-dispatch-0.8.0/src/container.c:614
#6  0x00007f0bac742ccb in process_connector (cxtr=<optimized out>, qd_server=0x141a370) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:792
#7  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.8.0/src/server.c:1012
#8  0x00007f0bac299e25 in start_thread () from /lib64/libpthread.so.0
#9  0x00007f0bab5cd34d in clone () from /lib64/libc.so.6
Moving to MODIFIED based on the existence of the build.
Verified in Satellite 6.3 Snap 35. Based on discussion with scale team, we are no longer seeing this issue with the current version of qdrouterd shipped in 6.3.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0338