Bug 1535891
| Summary: | qdrouterd 0.4-22 can still return qd:no-route-to-dest, with independent segfault bug | | |
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | Pavel Moravec <pmoravec> |
| Component: | Qpid | Assignee: | Mike Cressman <mcressma> |
| Status: | CLOSED ERRATA | QA Contact: | jcallaha |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.2.13 | CC: | akarimi, bbuckingham, bkearney, daniele, gkonda, gmurthy, hmore, janarula, mcressma, mmccune, mmello, pwaghmar, rraghuwa, smutkule, zhunting |
| Target Milestone: | Unspecified | Keywords: | Triaged |
| Target Release: | Unused | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | qpid-dispatch-0.4-29 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-05-21 20:16:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Pavel Moravec
2018-01-18 08:30:14 UTC
Reproducer - outside Satellite:

1) Have qpidd with a pulp.test queue created.
2) Have a hub router (like Satellite's) inter-routing everything (prefix pulp.) to qpidd.
3) Have 10 routers (like the Capsules') inter-connecting to the hub router and accepting clients.
4) Randomly freeze (kill -SIGSTOP) and unfreeze (kill -SIGCONT) a random router - and do this 5 times concurrently.
5) Run a client (concurrently against each Capsule's router) that tries to link-route via the routers to the pulp.test queue in qpidd. If an instance of the client permanently starts to get 'qd:no-route-to-dest' errors, you have reproduced the problem.

Hi Ganesh, could you please investigate the problem (see the previous update with the reproducer machine)?

I always noticed:

Tue Jan 23 12:22:21 2018 ROUTER (critical) Outgoing router link closing but not in index: bit=0

before the error started. So I *think* the hub router forgets about a link to/from a Capsule's router (and closing the link later on raises the above error) - and if/as that link is used for sending MAU updates, and none arrives in time, old data are purged away as obsolete. And why does the hub router forget the link? I guess due to a race condition: the link is dropped after the Capsule's router has been blocked for 15-16 s, but when the router unfreezes, the hub router reacts wrongly to the two concurrent activities (a timeout leading to the link drop vs. the link getting new traffic).

Created attachment 1385295 [details]
tcpdump between problematic routers
tcpdump between qdrouterd hub (port 5646) and qdrouterd on Capsule (client port 41996) - affected router is caps-3.
Decode port 5646 as AMQP to dissect the traffic properly.
The problem started at Tue Jan 23 21:13:21 CET 2018, since when the qdrouterd on the Capsule started returning 'qd:no-route-to-dest' forever.
Hub's logs:
Tue Jan 23 21:12:33 2018 ROUTER (critical) Outgoing router link closing but not in index: bit=0
Tue Jan 23 21:11:21 2018 ROUTER_LS (info) Router Link Lost - link_id=0
(this might be irrelevant)
..
Tue Jan 23 21:13:01 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-7.gsslab.brq2.redhat.com
Tue Jan 23 21:13:05 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-9.gsslab.brq2.redhat.com
Tue Jan 23 21:13:09 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-4.gsslab.brq2.redhat.com
Tue Jan 23 21:13:10 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-9.gsslab.brq2.redhat.com
Tue Jan 23 21:13:20 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-0.gsslab.brq2.redhat.com
Tue Jan 23 21:13:20 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-6.gsslab.brq2.redhat.com
Tue Jan 23 21:13:20 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps.gsslab.brq2.redhat.com
Tue Jan 23 21:13:20 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-8.gsslab.brq2.redhat.com
Tue Jan 23 21:13:20 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-5.gsslab.brq2.redhat.com
Tue Jan 23 21:13:20 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-7.gsslab.brq2.redhat.com
Tue Jan 23 21:13:20 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-2.gsslab.brq2.redhat.com
Tue Jan 23 21:13:20 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-1.gsslab.brq2.redhat.com
Tue Jan 23 21:13:20 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-3.gsslab.brq2.redhat.com
Tue Jan 23 21:13:20 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-9.gsslab.brq2.redhat.com
Tue Jan 23 21:13:24 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-6.gsslab.brq2.redhat.com
Tue Jan 23 21:13:24 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-1.gsslab.brq2.redhat.com
Tue Jan 23 21:13:30 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-7.gsslab.brq2.redhat.com
Tue Jan 23 21:13:31 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-5.gsslab.brq2.redhat.com
Tue Jan 23 21:13:47 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-0.gsslab.brq2.redhat.com
Tue Jan 23 21:13:47 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-4.gsslab.brq2.redhat.com
Tue Jan 23 21:13:47 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps.gsslab.brq2.redhat.com
Tue Jan 23 21:13:47 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-8.gsslab.brq2.redhat.com
Tue Jan 23 21:13:47 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-2.gsslab.brq2.redhat.com
Tue Jan 23 21:13:47 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-3.gsslab.brq2.redhat.com
Tue Jan 23 21:13:47 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-9.gsslab.brq2.redhat.com
Tue Jan 23 21:13:51 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-2.gsslab.brq2.redhat.com
Tue Jan 23 21:13:51 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-9.gsslab.brq2.redhat.com
Tue Jan 23 21:13:54 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-3.gsslab.brq2.redhat.com
Tue Jan 23 21:13:55 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-2.gsslab.brq2.redhat.com
Tue Jan 23 21:13:58 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-caps-8.gsslab.brq2.redhat.com
The hub's router was frozen starting Tue Jan 23 21:13:31 CET 2018, for 15.95 s.
Capsule's logs:
Tue Jan 23 21:13:15 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-hub.gsslab.brq2.redhat.com
Tue Jan 23 21:13:33 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-hub.gsslab.brq2.redhat.com
Tue Jan 23 21:13:54 2018 ROUTER_HELLO (info) HELLO peer expired: pmoravec-rhel74-hub.gsslab.brq2.redhat.com
The Capsule's router was frozen from 21:12:44 till 21:13:01
(but the freeze times make little sense compared to the communication problems in the tcpdump..)
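The freeze/unfreeze step of the reproducer (step 4 above) can be sketched in Python. This is a minimal illustration, not the actual test harness used here; `chaos_freeze` and `proc_state` are hypothetical helper names, and the `/proc` parsing is Linux-specific:

```python
import os
import random
import signal
import time

def proc_state(pid):
    """Return the Linux process state letter ('R', 'S', 'T', ...)
    from field 3 of /proc/<pid>/stat (after the parenthesised comm)."""
    with open(f"/proc/{pid}/stat") as f:
        return f.read().rsplit(")", 1)[1].split()[0]

def chaos_freeze(pids, rounds=5, max_pause=20.0):
    """Randomly SIGSTOP one process, hold it frozen (possibly past the
    router's HELLO timeout), then SIGCONT it -- mimicking step 4 of the
    reproducer that exposed the race."""
    for _ in range(rounds):
        pid = random.choice(pids)
        os.kill(pid, signal.SIGSTOP)                 # freeze the router
        time.sleep(random.uniform(0.1, max_pause))   # hold it frozen
        os.kill(pid, signal.SIGCONT)                 # unfreeze it
```

Run several instances of this concurrently against the qdrouterd PIDs while the clients keep link-routing to pulp.test; with a `max_pause` above the ~16 s link timeout, the pause crosses the window where the hub drops and re-learns the link.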
Per my testing, the BZ is properly fixed in 0.4-29. No permanent qd:no-route-to-dest error, no segfault or other flaw when running the reproducer for a few hours.

I can confirm Sat 6.3 / qdrouterd 0.8 does _not_ contain this bug. Running the reproducer for >1 hour, no issue found.

(In reply to Pavel Moravec from comment #8)
> Testing 0.4-28:
>
> - no further permanent qd:no-route-to-dest (except temporary cases when a
>   router is paused - but that is expected) - so the original flaw seems fixed.
> - BUT letting the reproducer run for a longer time (30 mins), a segfault
>   happens:
> ..
> (gdb) bt
> #0  qd_link_closed (link=0x0) at /usr/src/debug/qpid-dispatch-0.4/src/container.c:982
> #1  0x00007fef5667065f in router_link_detach_handler (context=0xaaa830, link=0xb96920, closed=0)
>     at /usr/src/debug/qpid-dispatch-0.4/src/router_node.c:1778
> #2  0x00007fef56662381 in close_links (state=0, pn_link=0x7fef3c05dba0)
>     at /usr/src/debug/qpid-dispatch-0.4/src/container.c:297
> #3  close_handler (unused=<optimized out>, qd_conn=0xb82240, conn=0x7fef3c036de0)
>     at /usr/src/debug/qpid-dispatch-0.4/src/container.c:333
> #4  handler (handler_context=0xa7b200, conn_context=<optimized out>,
>     event=event@entry=QD_CONN_EVENT_CLOSE, qd_conn=0xb82240)
>     at /usr/src/debug/qpid-dispatch-0.4/src/container.c:659
> #5  0x00007fef56674ddc in process_connector (cxtr=0xb86640, qd_server=0xafea30)
>     at /usr/src/debug/qpid-dispatch-0.4/src/server.c:392
> #6  thread_run (arg=<optimized out>) at /usr/src/debug/qpid-dispatch-0.4/src/server.c:627
> #7  0x00007fef561e5e25 in start_thread () from /lib64/libpthread.so.0
> #8  0x00007fef5573b34d in clone () from /lib64/libc.so.6
> (gdb)

This segfault seems to happen on 0.4-27 as well (so it isn't a regression introduced by the original qd:no-route-to-dest fix), as a customer is hitting it on 0.4-27.
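The backtrace shows qd_link_closed() being entered with link=0x0 and then dereferencing it: router_link_detach_handler() can race with connection teardown and hand over a link that is already gone. Below is a minimal sketch of the kind of NULL guard that prevents such a crash. It is hypothetical: the qd_link_t stand-in is not the real dispatch structure, and this is not the actual 0.4-29 patch.

```c
#include <stddef.h>

/* Stand-in for the real qd_link_t from qpid-dispatch's container.c;
 * the actual structure is far richer -- this is illustration only. */
typedef struct qd_link_t {
    int closed;
} qd_link_t;

/* During connection close, a detach handler may be invoked for a link
 * that has already been detached and released, so the caller can pass
 * link == NULL (exactly what frame #0 of the backtrace shows).
 * Guarding before the first dereference turns the segfault into a
 * harmless no-op. */
void qd_link_closed(qd_link_t *link)
{
    if (link == NULL)
        return;            /* link already detached/freed: nothing to do */
    link->closed = 1;      /* placeholder for the real close bookkeeping */
}
```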
(Updating the Summary, since this BZ in fact fixes 2 independent bugs.)

Just for reference: the code fix in 0.4-29 prevents not only the segfault above, but also the one in https://bugzilla.redhat.com/show_bug.cgi?id=1484028#c4.

*** Bug 1484028 has been marked as a duplicate of this bug. ***

Verified in Satellite 6.2.15 Snap 2 based on Pavel's testing of the included version.

-bash-4.1# rpm -qa | grep qpid-dispatch
libqpid-dispatch-0.4-29
qpid-dispatch-router-0.4-29

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1672