Bug 1393128
Summary: | qdrouterd 0.4-19 segfault when qpidd down for longer time and goferd restarted | |
---|---|---|---
Product: | Red Hat Satellite | Reporter: | Pavel Moravec <pmoravec>
Component: | katello-agent | Assignee: | Mike Cressman <mcressma>
Status: | CLOSED ERRATA | QA Contact: | jcallaha
Severity: | urgent | Docs Contact: |
Priority: | urgent | |
Version: | 6.2.4 | CC: | alexandre.chanu, bbuckingham, bkearney, cdonnell, gmurthy, jcallaha, jentrena, jhutar, mmccune, oshtaier, paul.seymour, pdwyer, sthirugn
Target Milestone: | Unspecified | Keywords: | PrioBumpField, PrioBumpGSS, PrioBumpQA, Triaged
Target Release: | Unused | |
Hardware: | x86_64 | |
OS: | Linux | |
URL: | https://bugzilla.redhat.com/show_bug.cgi?id=1367735 | |
Whiteboard: | | |
Fixed In Version: | qpid-dispatch-0.4-20 | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | |
: | 1395700, 1396568 | Environment: |
Last Closed: | 2016-11-21 18:16:21 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1395700, 1396568 | |
Attachments: | | |
Description
Pavel Moravec
2016-11-08 22:30:18 UTC
Created attachment 1218737 [details]
coredump of one segfault
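
For reference, a minimal sketch of how such a coredump can be inspected with gdb (the core file path and debuginfo package names below are assumptions; adjust them to the actual environment):

    # Pull in debug symbols so the backtraces resolve (assumed package names).
    debuginfo-install -y qpid-dispatch-router qpid-proton-c

    # Open the attached coredump against the qdrouterd binary and dump all thread backtraces.
    gdb /usr/sbin/qdrouterd /var/tmp/core.qdrouterd -batch -ex 'thread apply all bt'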
I forgot to add:

- 0.4-16 does not exhibit this segfault.
- A similar segfault has existed in qdrouterd for a long time, see [1] (though with a different backtrace).
- Since that segfault has not been reported by customers, this one, based on a similar scenario, is also unlikely to be hit in the field.

[1] https://issues.jboss.org/browse/ENTMQIC-50

.. and now it has been hit by a first customer with qpidd running (but the backtrace matches).

(Below I describe what may be a different segfault - please investigate whether the cause is the same or not.)

Trying some different scenarios (with qpidd "running"), I managed to get a segfault by freezing qpidd for a while (and running the same script multiple times in the meantime), simply by running:

    gdb -p $(pgrep qpidd) $(which qpidd)

waiting more than 10 seconds, and then detaching from the qpidd process. Immediately after the detach, qpidd sends the postponed AMQP 1.0 heartbeats to the router, which tries to match them to the already closed session/connection.

Variant of the above: "freeze" qpidd just for a while and stop all the client processes in the meantime - again, sessions/links are deleted in qdrouterd while qpidd sends some traffic on them later on.

Different backtraces seen in those scenarios:

    #0  pn_session_connection (session=0x1a0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/engine/engine.c:232
    232         return session->connection;
    (gdb) bt
    #0  pn_session_connection (session=0x1a0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/engine/engine.c:232
    #1  0x00007f328f8d4e14 in qd_link_connection (link=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:914
    #2  0x00007f328f8e2cd5 in router_link_attach_handler (context=0x125a200, link=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1686
    #3  0x00007f328f8d4105 in handle_link_open (container=<optimized out>, pn_link=0x7f327c0bce20) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:217
    #4  process_handler (unused=<optimized out>, qd_conn=0x7f327c00cb30, container=0x114e6e0) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:470
    #5  handler (handler_context=0x114e6e0, conn_context=<optimized out>, event=event@entry=QD_CONN_EVENT_PROCESS, qd_conn=0x7f327c00cb30) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:624
    #6  0x00007f328f8e69fc in process_connector (cxtr=0x7f327c010290, qd_server=0x1159d60) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:398
    #7  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:626
    #8  0x00007f328f457dc5 in start_thread (arg=0x7f3282f9b700) at pthread_create.c:308
    #9  0x00007f328e9b2ced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
    (gdb) p session
    $1 = (pn_session_t *) 0x1a0
    (gdb) p session->connection
    Cannot access memory at address 0x208
    (gdb)

or:

    #0  pni_record_find (record=<optimized out>, record=<optimized out>, key=key@entry=0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/record.c:71
    71          if (field->key == key) {
    (gdb) bt
    #0  pni_record_find (record=<optimized out>, record=<optimized out>, key=key@entry=0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/record.c:71
    #1  pn_record_get (record=<optimized out>, key=key@entry=0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/record.c:120
    #2  0x00007f17168f1593 in pn_connection_get_context (conn=<optimized out>) at /usr/src/debug/qpid-proton-0.9/proton-c/src/engine/engine.c:184
    #3  0x00007f1716b35e21 in qd_link_connection (link=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:918
    #4  0x00007f1716b43cd5 in router_link_attach_handler (context=0x19eac50, link=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1686
    #5  0x00007f1716b35105 in handle_link_open (container=<optimized out>, pn_link=0x7f1704099e60) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:217
    #6  process_handler (unused=<optimized out>, qd_conn=0x1b06b70, container=0x197e420) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:470
    #7  handler (handler_context=0x197e420, conn_context=<optimized out>, event=event@entry=QD_CONN_EVENT_PROCESS, qd_conn=0x1b06b70) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:624
    #8  0x00007f1716b479fc in process_connector (cxtr=0x1b0a1e0, qd_server=0x19853f0) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:398
    #9  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:626
    #10 0x00007f17166b8dc5 in start_thread (arg=0x7f17091fa700) at pthread_create.c:308
    #11 0x00007f1715c13ced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
    (gdb) p field
    $1 = (pni_field_t *) 0x6e696c6f72614320
    (gdb) p field->key
    Cannot access memory at address 0x6e696c6f72614320
    (gdb)

or:

    #0  0x00007fcdfd193585 in pn_connection_get_context (conn=0x30242d4b0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/engine/engine.c:184
    184         return conn ? pn_record_get(conn->context, PN_LEGCTX) : NULL;
    (gdb) bt
    #0  0x00007fcdfd193585 in pn_connection_get_context (conn=0x30242d4b0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/engine/engine.c:184
    #1  0x00007fcdfd3d7e21 in qd_link_connection (link=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:918
    #2  0x00007fcdfd3e5cd5 in router_link_attach_handler (context=0x2360200, link=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1686
    #3  0x00007fcdfd3d7105 in handle_link_open (container=<optimized out>, pn_link=0x7fcde00decd0) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:217
    #4  process_handler (unused=<optimized out>, qd_conn=0x7fcde000cb30, container=0x22546e0) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:470
    #5  handler (handler_context=0x22546e0, conn_context=<optimized out>, event=event@entry=QD_CONN_EVENT_PROCESS, qd_conn=0x7fcde000cb30) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:624
    #6  0x00007fcdfd3e99fc in process_connector (cxtr=0x7fcde0010290, qd_server=0x225fd60) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:398
    #7  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:626
    #8  0x00007fcdfcf5adc5 in start_thread (arg=0x7fcdf0a9e700) at pthread_create.c:308
    #9  0x00007fcdfc4b5ced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
    (gdb) p conn
    $1 = (pn_connection_t *) 0x30242d4b0
    (gdb) p conn->context
    Cannot access memory at address 0x30242d5a0
    (gdb)

or a combination of the above two:

    #0  pn_record_get (record=0x379830885ace8618, key=key@entry=0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/record.c:118
    118     {
    (gdb) bt
    #0  pn_record_get (record=0x379830885ace8618, key=key@entry=0) at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/record.c:118
    #1  0x00007f75e940d593 in pn_connection_get_context (conn=<optimized out>) at /usr/src/debug/qpid-proton-0.9/proton-c/src/engine/engine.c:184
    #2  0x00007f75e9651e21 in qd_link_connection (link=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:918
    #3  0x00007f75e965fcd5 in router_link_attach_handler (context=0x17e3c50, link=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1686
    #4  0x00007f75e9651105 in handle_link_open (container=<optimized out>, pn_link=0x7f75d00ca720) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:217
    #5  process_handler (unused=<optimized out>, qd_conn=0x7f75d400cb30, container=0x1777420) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:470
    #6  handler (handler_context=0x1777420, conn_context=<optimized out>, event=event@entry=QD_CONN_EVENT_PROCESS, qd_conn=0x7f75d400cb30) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:624
    #7  0x00007f75e96639fc in process_connector (cxtr=0x7f75d4010290, qd_server=0x177e3f0) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:398
    #8  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:626
    #9  0x00007f75e9664a80 in qd_server_run (qd=0x1504030) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:971
    #10 0x0000000000401cd8 in main_process (config_path=config_path@entry=0x7ffd15fad6ba "/etc/qpid-dispatch/qdrouterd.conf", python_pkgdir=python_pkgdir@entry=0x402401 "/usr/lib/qpid-dispatch/python", fd=fd@entry=2) at /usr/src/debug/qpid-dispatch-0.4/router/src/main.c:135
    #11 0x0000000000401950 in main (argc=3, argv=0x7ffd15fac768) at /usr/src/debug/qpid-dispatch-0.4/router/src/main.c:335
    (gdb)

I am getting various qdrouterd segfaults since the 6.2.4 update:

    kernel: [166585.935944] qdrouterd[17142]: segfault at 98 ip 00007f16d1cc3ef0 sp 00007f16b37fc328 error 4 in libqpid-proton.so.2.0.0[7f16d1c9f000+4d000]
    kernel: qdrouterd[17142]: segfault at 98 ip 00007f16d1cc3ef0 sp 00007f16b37fc328 error 4 in libqpid-proton.so.2.0.0[7f16d1c9f000+4d000]
    systemd[1]: qdrouterd.service: main process exited, code=killed, status=11/SEGV
    systemd[1]: Unit qdrouterd.service entered failed state.
    lrprdrhs001 systemd[1]: qdrouterd.service failed.

and:

    kernel: [169696.067382] traps: qdrouterd[25173] general protection ip:7f665b8ece71 sp:7f664d8052d8 error:0 in libc-2.17.so[7f665b789000+1b6000]
    kernel: traps: qdrouterd[25173] general protection ip:7f665b8ece71 sp:7f664d8052d8 error:0 in libc-2.17.so[7f665b789000+1b6000]
    systemd[1]: qdrouterd.service: main process exited, code=killed, status=11/SEGV
    systemd[1]: Unit qdrouterd.service entered failed state.
    systemd[1]: qdrouterd.service failed.

Any ideas or workarounds?

(In reply to Paul Seymour from comment #5)
> I am getting various qdrouterd segfaults since the 6.2.4 update [...]
> Any ideas or workarounds ?

Bugzilla is not the (primary) tool for troubleshooting customer issues - please raise a customer case for that. If you could provide a coredump from the segfault (e.g. via an abrt report) to confirm that you are really hitting this bug, that would be great.

Currently no workaround is known. What *might* help (but does not have to) is installing packages or errata on fewer systems in parallel - at least one already _fixed_ segfault was in that area, so this suggestion may not be 100% accurate. Another workaround is to run a lower version of the qdrouterd packages: 6.2.3 ships 0.4-16, which has a big memory leak but does not exhibit this segfault, so it should be safe to roll back / downgrade qpid-dispatch-router and libqpid-dispatch to that version.
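
A minimal sketch of that downgrade, assuming the 0.4-16 packages are still available in an enabled repository (verify the exact NVRs on your system first):

    # Roll back the dispatch router packages to the 6.2.3 level (assumed to be resolvable by yum).
    yum downgrade qpid-dispatch-router-0.4-16 libqpid-dispatch-0.4-16

    # Restart the affected services; restarting everything via katello-service is the safe option.
    katello-service restart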
Yet another very similar backtrace from a customer:

    (gdb) bt
    #0  pn_string_get (string=0x25) at /usr/src/debug/qpid-proton-0.9/proton-c/src/object/string.c:120
    #1  0x00007f7c77abb01c in pn_link_name (link=<optimized out>) at /usr/src/debug/qpid-proton-0.9/proton-c/src/engine/engine.c:1316
    #2  0x00007f7c77d07fd4 in qd_router_link_name (link=0x7f7c1800a150) at /usr/src/debug/qpid-dispatch-0.4/src/router_agent.c:90
    #3  qd_entity_refresh_router_link (entity=0x7f7c08950010, impl=0x7f7c1800a150) at /usr/src/debug/qpid-dispatch-0.4/src/router_agent.c:98
    #4  0x00007f7c6b834dcc in ffi_call_unix64 () at ../src/x86/unix64.S:90
    #5  0x000000000000000c in ?? ()
    #6  0x00007f7c577f44a0 in ?? ()
    #7  0x00007f7c577f4450 in ?? ()
    #8  0x00007f7c6b8346f5 in ffi_call (cif=<optimized out>, fn=<optimized out>, rvalue=0x7f7c771239c9 <insertdict+25>, avalue=0x7f7c577f4480) at ../src/x86/ffi64.c:524
    Backtrace stopped: previous frame inner to this frame (corrupt stack?)
    (gdb) frame 2
    #2  0x00007f7c77d07fd4 in qd_router_link_name (link=0x7f7c1800a150) at /usr/src/debug/qpid-dispatch-0.4/src/router_agent.c:90
    90          return pn_link_name(qd_link_pn(link->link));
    (gdb) p link->link
    $1 = (qd_link_t *) 0x7f7c18008110
    (gdb) p link->link->pn_link
    $2 = (pn_link_t *) 0x7f7c18002f60
    (gdb) p link->link->pn_link->name
    $3 = (pn_string_t *) 0x25
    (gdb)

(And by the way, an attempt to work around the issue by disabling logging did not help: a `log { enable: critical module: DEFAULT }` section in qdrouterd.conf still leads to the same segfault.)

Best reproducer for Satellite 6 QE (still a little tricky / not a scenario one would see in the field); a consolidated sketch follows this list:

- Have all Satellite services running.
- In a first terminal, freeze the qpidd process for at least 11 seconds and then unfreeze it: `kill -SIGSTOP $(pgrep qpidd); sleep 11; kill -SIGCONT $(pgrep qpidd)`
- Immediately after that freeze command starts running, restart goferd on some Content Host _twice_: `service goferd restart; sleep 3; service goferd restart`
- Check the qdrouterd service status.
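
The same steps as a runnable sketch (the choice of Content Host is arbitrary; it just needs to be registered to the Satellite with goferd running):

    # Terminal 1, on the Satellite server: freeze qpidd for at least 11 seconds, then let it continue.
    kill -SIGSTOP $(pgrep qpidd); sleep 11; kill -SIGCONT $(pgrep qpidd)

    # Terminal 2, on a registered Content Host, immediately after the freeze starts: restart goferd twice.
    service goferd restart; sleep 3; service goferd restart

    # Afterwards, back on the Satellite server, check whether qdrouterd survived.
    systemctl status qdrouterd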
Created attachment 1222002 [details]
qpid-dispatch-router-0.4-20.el7sat.x86_64.rpm
Created attachment 1222003 [details]
qpid-dispatch-tools-0.4-20.el7sat.x86_64.rpm
Created attachment 1222004 [details]
libqpid-dispatch-0.4-20.el7sat.x86_64.rpm

== HOTFIX Instructions ==

We are releasing version 0.4-20 to fix the segfault issue. To install it, download the RPMs attached to this Bugzilla and install them locally via RPM or yum, then run `katello-service restart`. This hotfix will be released in a formal Errata soon.
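
A minimal sketch of that install, assuming the three attached RPMs have been downloaded into the current directory on the Satellite server:

    # Install the hotfixed qpid-dispatch 0.4-20 packages downloaded from this bug.
    yum localinstall -y qpid-dispatch-router-0.4-20.el7sat.x86_64.rpm \
        qpid-dispatch-tools-0.4-20.el7sat.x86_64.rpm \
        libqpid-dispatch-0.4-20.el7sat.x86_64.rpm

    # Restart the Satellite services so qdrouterd picks up the fixed libraries.
    katello-service restart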
Verified in Satellite 6.2.4.a

I have been testing this over the weekend, culminating in an automation run against my remaining RHEL 6 and RHEL 7 systems. After reviewing the automation results, everything is looking good. I was unable to get the reproducer script to work properly (issues with python-qpid-proton), but I did not notice any performance impact or breakage during my stress testing on Saturday.

Since the problem described in this bug report should be resolved by a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:2811