Bug 1535891
| Summary: | qdrouterd 0.4-22 can still return qd:no-route-to-dest, with independent segfault bug | | |
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | Pavel Moravec <pmoravec> |
| Component: | Qpid | Assignee: | Mike Cressman <mcressma> |
| Status: | CLOSED ERRATA | QA Contact: | jcallaha |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.2.13 | CC: | akarimi, bbuckingham, bkearney, daniele, gkonda, gmurthy, hmore, janarula, mcressma, mmccune, mmello, pwaghmar, rraghuwa, smutkule, zhunting |
| Target Milestone: | Unspecified | Keywords: | Triaged |
| Target Release: | Unused | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | qpid-dispatch-0.4-29 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-05-21 20:16:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Pavel Moravec
2018-01-18 08:30:14 UTC
Reproducer - outside Satellite:

1) Have qpidd with a pulp.test queue created.
2) Have a hub router (like Satellite's) inter-routing everything (prefix pulp.) to qpidd.
3) Have 10 routers (like the Capsules') inter-connecting to the hub router and accepting clients.
4) Randomly freeze (kill -SIGSTOP) and unfreeze (kill -SIGCONT) a random router - and do this 5 times concurrently.
5) Run a client (concurrently against each Capsule's router) that tries to link-route via the routers to the pulp.test queue in qpidd. If an instance of the client permanently starts to get 'qd:no-route-to-dest' errors, you have reproduced the problem.

Hi Ganesh, could you please investigate the problem (see the previous update with the reproducer machine)?

I always noticed:

Tue Jan 23 12:22:21 2018 ROUTER (critical) Outgoing router link closing but not in index: bit=0

before the error started. So I *think* the hub router forgets about a link to/from a Capsule's router (and closing the link later on raises the above error) - and if/as that link is used for sending MAU updates, and none arrives in time, old data are purged away as obsolete. And why does the hub router forget the link? I guess due to a race condition: the link is dropped after the Capsule's router has been blocked for 15-16 s, but when the router unfreezes, the hub router reacts wrongly to the two concurrent activities (a timeout leading to the link drop vs. the link getting new traffic).

Created attachment 1385295 [details]
tcpdump between problematic routers
tcpdump between qdrouterd hub (port 5646) and qdrouterd on Capsule (client port 41996) - affected router is caps-3.
Decode port 5646 as AMQP to dissect the traffic properly.
The problem started at Tue Jan 23 21:13:21 CET 2018, since when the qdrouterd on the Capsule started returning 'qd:no-route-to-dest' forever.
Hub's logs:
Tue Jan 23 21:12:33 2018 ROUTER (critical) Outgoing router link closing but not in index: bit=0
Tue Jan 23 21:11:21 2018 ROUTER_LS (info) Router Link Lost - link_id=0
(this might be irrelevant)
..
Tue Jan 23 21:13:01 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-7.gsslab.brq2.redhat.com
Tue Jan 23 21:13:05 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-9.gsslab.brq2.redhat.com
Tue Jan 23 21:13:09 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-4.gsslab.brq2.redhat.com
Tue Jan 23 21:13:10 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-9.gsslab.brq2.redhat.com
Tue Jan 23 21:13:20 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-0.gsslab.brq2.redhat.com
Tue Jan 23 21:13:20 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-6.gsslab.brq2.redhat.com
Tue Jan 23 21:13:20 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps.gsslab.brq2.redhat.com
Tue Jan 23 21:13:20 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-8.gsslab.brq2.redhat.com
Tue Jan 23 21:13:20 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-5.gsslab.brq2.redhat.com
Tue Jan 23 21:13:20 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-7.gsslab.brq2.redhat.com
Tue Jan 23 21:13:20 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-2.gsslab.brq2.redhat.com
Tue Jan 23 21:13:20 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-1.gsslab.brq2.redhat.com
Tue Jan 23 21:13:20 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-3.gsslab.brq2.redhat.com
Tue Jan 23 21:13:20 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-9.gsslab.brq2.redhat.com
Tue Jan 23 21:13:24 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-6.gsslab.brq2.redhat.com
Tue Jan 23 21:13:24 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-1.gsslab.brq2.redhat.com
Tue Jan 23 21:13:30 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-7.gsslab.brq2.redhat.com
Tue Jan 23 21:13:31 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-5.gsslab.brq2.redhat.com
Tue Jan 23 21:13:47 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-0.gsslab.brq2.redhat.com
Tue Jan 23 21:13:47 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-4.gsslab.brq2.redhat.com
Tue Jan 23 21:13:47 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps.gsslab.brq2.redhat.com
Tue Jan 23 21:13:47 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-8.gsslab.brq2.redhat.com
Tue Jan 23 21:13:47 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-2.gsslab.brq2.redhat.com
Tue Jan 23 21:13:47 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-3.gsslab.brq2.redhat.com
Tue Jan 23 21:13:47 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-9.gsslab.brq2.redhat.com
Tue Jan 23 21:13:51 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-2.gsslab.brq2.redhat.com
Tue Jan 23 21:13:51 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-9.gsslab.brq2.redhat.com
Tue Jan 23 21:13:54 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-3.gsslab.brq2.redhat.com
Tue Jan 23 21:13:55 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-2.gsslab.brq2.redhat.com
Tue Jan 23 21:13:58 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-8.gsslab.brq2.redhat.com
The hub's router was frozen starting Tue Jan 23 21:13:31 CET 2018, for 15.95 s.
Capsule's logs:
Tue Jan 23 21:13:15 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-hub.gsslab.brq2.redhat.com
Tue Jan 23 21:13:33 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-hub.gsslab.brq2.redhat.com
Tue Jan 23 21:13:54 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-hub.gsslab.brq2.redhat.com
The Capsule's router was frozen from 21:12:44 till 21:13:01
(but the freeze times make little sense compared to the communication problems in the tcpdump..)
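The freeze/unfreeze step of the reproducer (step 4 above) can be sketched in Python. This is a minimal illustration, not the actual test harness used here; `chaos_freeze` and `proc_state` are hypothetical helper names, and the `/proc` parsing is Linux-specific:

```python
import os
import random
import signal
import time

def proc_state(pid):
    """Return the Linux process state letter ('R', 'S', 'T', ...)
    from field 3 of /proc/<pid>/stat (after the parenthesised comm)."""
    with open(f"/proc/{pid}/stat") as f:
        return f.read().rsplit(")", 1)[1].split()[0]

def chaos_freeze(pids, rounds=5, max_pause=20.0):
    """Randomly SIGSTOP one process, hold it frozen (possibly past the
    router's HELLO timeout), then SIGCONT it -- mimicking step 4 of the
    reproducer that exposed the race."""
    for _ in range(rounds):
        pid = random.choice(pids)
        os.kill(pid, signal.SIGSTOP)                 # freeze the router
        time.sleep(random.uniform(0.1, max_pause))   # hold it frozen
        os.kill(pid, signal.SIGCONT)                 # unfreeze it
```

Run several instances of this concurrently against the qdrouterd PIDs while the clients keep link-routing to pulp.test; with a `max_pause` above the ~16 s link timeout, the pause crosses the window where the hub drops and re-learns the link.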
Per my testing, the BZ is properly fixed in 0.4-29. No permanent qd:no-route-to-dest error, no segfault or other flaw when running the reproducer for a few hours.

I can confirm Sat 6.3 / qdrouterd 0.8 does _not_ contain this bug. Running the reproducer for >1 hour, no issue found.

(In reply to Pavel Moravec from comment #8)
> Testing 0.4-28:
>
> - no further permanent qd:no-route-to-dest (except temporary cases when a
>   router is paused - but that is expected) - so the original flaw seems fixed.
> - BUT letting the reproducer run for a longer time (30 mins), a segfault
>   happens:
> ..
> (gdb) bt
> #0  qd_link_closed (link=0x0) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:982
> #1  0x00007fef5667065f in router_link_detach_handler (context=0xaaa830, link=0xb96920, closed=0)
>     at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1778
> #2  0x00007fef56662381 in close_links (state=0, pn_link=0x7fef3c05dba0)
>     at /usr/src/debug/qpid-dispatch-0.4/src/container.c:297
> #3  close_handler (unused=<optimized out>, qd_conn=0xb82240, conn=0x7fef3c036de0)
>     at /usr/src/debug/qpid-dispatch-0.4/src/container.c:333
> #4  handler (handler_context=0xa7b200, conn_context=<optimized out>,
>     event=event@entry=QD_CONN_EVENT_CLOSE, qd_conn=0xb82240)
>     at /usr/src/debug/qpid-dispatch-0.4/src/container.c:659
> #5  0x00007fef56674ddc in process_connector (cxtr=0xb86640, qd_server=0xafea30)
>     at /usr/src/debug/qpid-dispatch-0.4/src/server.c:392
> #6  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:627
> #7  0x00007fef561e5e25 in start_thread () from /lib64/libpthread.so.0
> #8  0x00007fef5573b34d in clone () from /lib64/libc.so.6
> (gdb)

This segfault seems to happen on 0.4-27 as well (so it isn't a regression introduced by the original qd:no-route-to-dest fix), as a customer is hitting it on 0.4-27.
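The backtrace shows qd_link_closed() being entered with link=0x0 and then dereferencing it: router_link_detach_handler() can race with connection teardown and hand over a link that is already gone. Below is a minimal sketch of the kind of NULL guard that prevents such a crash. It is hypothetical: the qd_link_t stand-in is not the real dispatch structure, and this is not the actual 0.4-29 patch.

```c
#include <stddef.h>

/* Stand-in for the real qd_link_t from qpid-dispatch's container.c;
 * the actual structure is far richer -- this is illustration only. */
typedef struct qd_link_t {
    int closed;
} qd_link_t;

/* During connection close, a detach handler may be invoked for a link
 * that has already been detached and released, so the caller can pass
 * link == NULL (exactly what frame #0 of the backtrace shows).
 * Guarding before the first dereference turns the segfault into a
 * harmless no-op. */
void qd_link_closed(qd_link_t *link)
{
    if (link == NULL)
        return;            /* link already detached/freed: nothing to do */
    link->closed = 1;      /* placeholder for the real close bookkeeping */
}
```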
(Updating the Summary, since this BZ in fact fixes 2 independent bugs.)

Just for reference: the code fix in 0.4-29 prevents not only the segfault above, but also the one in https://bugzilla.redhat.com/show_bug.cgi?id=1484028#c4.

*** Bug 1484028 has been marked as a duplicate of this bug. ***

Verified in Satellite 6.2.15 Snap 2 based on Pavel's testing of the included version.

-bash-4.1# rpm -qa | grep qpid-dispatch
libqpid-dispatch-0.4-29
qpid-dispatch-router-0.4-29

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1672