Red Hat Satellite engineering is moving the tracking of its product development work on Satellite to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "Satellite project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs will be migrated starting at the end of May. If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "Satellite project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/SAT-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Bug 1426242 - goferd disconnects with no route to destination node on client
Summary: goferd disconnects with no route to destination node on client
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Qpid
Version: 6.2.7
Hardware: Unspecified
OS: Unspecified
high
medium
Target Milestone: Unspecified
Assignee: Mike Cressman
QA Contact: Lukas Pramuk
URL:
Whiteboard:
Depends On:
Blocks: 1353215 1399395
TreeView+ depends on / blocked
 
Reported: 2017-02-23 13:55 UTC by Andrew Kofink
Modified: 2021-08-30 12:18 UTC (History)
26 users (show)

Fixed In Version: qpid-dispatch-0.4-22
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1452183 (view as bug list)
Environment:
Last Closed: 2017-06-20 17:22:46 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
tcpdump of inter-qdrouterd traffic (94.39 KB, application/octet-stream)
2017-04-04 13:09 UTC, Pavel Moravec
no flags Details
reproduced qdrouterd traces with updated mobile.py (49.20 KB, application/x-gzip)
2017-04-20 07:23 UTC, Pavel Moravec
no flags Details
trace logs with verbose mobile.py and tcpdumps (248.43 KB, application/x-gzip)
2017-04-20 20:38 UTC, Pavel Moravec
no flags Details
hotfix version (268.38 KB, application/x-rpm)
2017-04-28 15:50 UTC, Mike McCune
no flags Details
hotfix version (22.23 KB, application/x-rpm)
2017-04-28 15:50 UTC, Mike McCune
no flags Details
[hotfix] qpid-dispatch-tools package (16.11 KB, application/x-rpm)
2017-05-22 06:27 UTC, Pavel Moravec
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Apache JIRA DISPATCH-749 0 None None None 2017-04-22 18:19:34 UTC
Red Hat Bugzilla 1439617 0 medium CLOSED After concurrent startup, qdrouterd on Satellite delays many minutes before advertising qpidd is up 2021-02-22 00:41:40 UTC
Red Hat Knowledge Base (Solution) 3010161 0 None None None 2017-04-23 07:26:07 UTC
Red Hat Product Errata RHBA-2017:1553 0 normal SHIPPED_LIVE Satellite 6.2.10 Async Bug Release 2017-06-20 21:19:07 UTC

Internal Links: 1439617

Description Andrew Kofink 2017-02-23 13:55:00 UTC
Description of problem:
qdrouterd on the capsule complains about a broken connection to the Satellite, causing goferd on clients to log qd:no-route-to-dest errors. Restarting qdrouterd on the capsule temporarily fixes the issue.

Version-Release number of selected component (if applicable):
Satellite 6.2.7

How reproducible:
Sometimes

Steps to Reproduce:
1. Set up a Satellite with an external/isolated Capsule

Actual results:

Capsule:
capsule qdrouterd[19369]: read: Connection reset by peer
capsule qdrouterd[19369]: send: Broken pipe

Client:
[ERROR][pulp.agent.73ca16e7-c3e4-4afe-a4f2-82f645229495] gofer.messaging.adapter.proton.reliability:47 - receiver eb42b639-b85b-4ec9-8377-d777ab39c26c from pulp.agent.73ca16e7-c3e4-4afe-a4f2-82f645229495 closed due to: Condition('qd:no-route-to-dest', 'No route to the destination node')

Expected results:
Messages are able to be routed from client to satellite

Comment 2 Charles Birk 2017-03-15 18:34:06 UTC
I am also hitting this not only on my capsules but on content hosts connected to those capsules. It's filling up logs.

Mar 15 18:29:07 content1-0 goferd: [ERROR][worker-0] gofer.messaging.adapter.proton.reliability:47 - sender b5617a8f-c1db-449d-b9a2-f0055056f23b to qmf.default.direct closed due to: Condition('qd:no-route-to-dest', 'No route to the destination node')

Comment 3 Charles Birk 2017-03-20 18:32:09 UTC
I can confirm that Andrew's solution above fixes the issue at least temporarily.

Comment 4 Daniele Palumbo 2017-03-22 07:21:11 UTC
(In reply to Charles Birk from comment #3)
> I can confirm that Andrew's solution above fixes the issue at least
> temporarily.

You mean:
> Restarting qdrouterd on the capsule temporarily fixes the issue.

Correct?

Comment 5 Charles Birk 2017-03-22 13:42:41 UTC
(In reply to Daniele Palumbo from comment #4)
> (In reply to Charles Birk from comment #3)
> > I can confirm that Andrew's solution above fixes the issue at least
> > temporarily.
> 
> You mean:
> > Restarting qdrouterd on the capsule temporarily fixes the issue.
> 
> Correct?

I am having to restart qdrouterd on the Satellite server as well as every Capsule server that we have (7 total).  This only lasts temporarily before we start experiencing the disconnect issue that this bug was first opened with.

Example:

Mar 21 19:48:29 satellite.example.com qdrouterd[26095]: Tue Mar 21 19:48:29 2017 ROUTER_HELLO (info) HELLO peer expired: capsule6.example.com
Mar 21 19:48:42 satellite.example.com qdrouterd[26095]: Tue Mar 21 19:48:42 2017 ROUTER_LS (info) Router Link Lost - link_id=0
Mar 21 20:46:20 satellite.example.com qdrouterd[26095]: Tue Mar 21 20:46:20 2017 ROUTER_HELLO (info) HELLO peer expired: capsule6.example.com
Mar 21 20:46:32 satellite.example.com qdrouterd[26095]: Tue Mar 21 20:46:32 2017 ROUTER_LS (info) Router Link Lost - link_id=0
Mar 21 22:30:59 satellite.example.com qdrouterd[26095]: Tue Mar 21 22:30:59 2017 ROUTER_LS (info) Router Link Lost - link_id=0
Mar 22 01:01:05 satellite.example.com qdrouterd[26095]: Wed Mar 22 01:01:05 2017 ROUTER_HELLO (info) HELLO peer expired: capsule5.example.com
Mar 22 04:09:39 satellite.example.com qdrouterd[26095]: Wed Mar 22 04:09:39 2017 ROUTER_HELLO (info) HELLO peer expired: capsule4.example.com
Mar 22 06:01:34 satellite.example.com qdrouterd[26095]: Wed Mar 22 06:01:34 2017 ROUTER_HELLO (info) HELLO peer expired: devqacapsule.example.com
Mar 22 07:47:53 satellite.example.com qdrouterd[26095]: Wed Mar 22 07:47:53 2017 ROUTER_HELLO (info) HELLO peer expired: devqacapsule.example.com
Mar 22 07:48:05 satellite.example.com qdrouterd[26095]: Wed Mar 22 07:48:05 2017 ROUTER_LS (info) Router Link Lost - link_id=0

On the capsule servers themselves, we see the following (this is specifically capsule6, just as an example):

Mar 21 12:40:38 capsule6.example.com qdrouterd[31488]: Tue Mar 21 12:40:38 2017 ROUTER (info) Configured Remote Link-route-pattern: prefix=qmf. dir=both
Mar 21 12:40:38 capsule6.example.com qdrouterd[31488]: Tue Mar 21 12:40:38 2017 SERVER (info) Operational, 4 Threads Running
Mar 21 18:15:44 capsule6.example.com qdrouterd[31488]: Tue Mar 21 18:15:44 2017 ROUTER_HELLO (info) HELLO peer expired: satellite.example.com
Mar 21 18:39:17 capsule6.example.com qdrouterd[31488]: Tue Mar 21 18:39:17 2017 ROUTER_HELLO (info) HELLO peer expired: satellite.example.com
Mar 21 18:47:59 capsule6.example.com qdrouterd[31488]: Tue Mar 21 18:47:59 2017 ROUTER_HELLO (info) HELLO peer expired: satellite.example.com
Mar 21 19:09:40 capsule6.example.com qdrouterd[31488]: Tue Mar 21 19:09:40 2017 ROUTER_HELLO (info) HELLO peer expired: satellite.example.com
Mar 21 19:31:17 capsule6.example.com qdrouterd[31488]: Tue Mar 21 19:31:17 2017 ROUTER_HELLO (info) HELLO peer expired: satellite.example.com
Mar 21 19:41:52 capsule6.example.com qdrouterd[31488]: Tue Mar 21 19:41:52 2017 ROUTER_HELLO (info) HELLO peer expired: satellite.example.com
Mar 21 19:48:30 capsule6.example.com qdrouterd[31488]: Tue Mar 21 19:48:30 2017 ROUTER_HELLO (info) HELLO peer expired: satellite.example.com
Mar 21 20:46:18 capsule6.example.com qdrouterd[31488]: Tue Mar 21 20:46:18 2017 ROUTER_HELLO (info) HELLO peer expired: satellite.example.com

And on any content host that is connected to 'capsule6' we are seeing the 'qd:no-route-to-dest' error.

Goferd is also causing high iowait on several of our machines.

Comment 6 Pavel Moravec 2017-03-23 09:58:51 UTC
Guys, do you have some sort of reproducer? Without that, it is hard for engineering to fix it..

Have you migrated the Content Host among Capsules, for example? Since when I migrate a Content Host from Sat to Caps (or vice versa), I get the same:

# set caps to FQDN of Satellite/Capsule where you migrate to

rpm -ev $(rpm -qa | grep katello-ca-consumer)

rpm -Uvh http://${caps}.gsslab.brq.redhat.com/pub/katello-ca-consumer-latest.noarch.rpm

subscription-manager register --user admin --password redhat --environment Library --force

service goferd restart

then I get:

Feb 11 19:35:54 pmoravec-rhel72 goferd: [ERROR][worker-0] gofer.messaging.adapter.proton.reliability:47 - sender ccf5f18e-5e67-4ca2-a763-2769c2185ea8 to qmf.default.direct closed due to: Condition('qd:no-route-to-dest', 'No route to the destination node')

(I planned to investigate this before filing BZ but havent time so far)

Does/Can this reproducer conform with your scenarios?

Comment 7 Pavel Moravec 2017-03-23 10:08:45 UTC
Re-reading the BZ again, I forgot on "qdrouterd on the capsule complains about a broken connection to the Satellite, causing goferd on clients to log qd:no-route-to-dest errors. " :

If qdrouterd on Caps isnt connected to qdrouterd on Sat, then the goferd errors are reasonable. Since qdrouterd is just a _router_ (like IP router), inter-connecting goferd on Content Hosts to qpidd on Satellite. If either router or connection among isnt established, no route to destination (qpidd).

So you should check why the broken connections happen.

1)
"HELLO peer expired: satellite.example.com" - this often happened in 0.4-20 (or older) version(s) of qpid-dispatch, see:

https://bugzilla.redhat.com/show_bug.cgi?id=1398377
https://access.redhat.com/solutions/2778111

2) is the qdrouter trying to reconnect? It should (quite promptly & frequently, I think)

3) how that reconnect attempt succeeds/fails? If it fails, check if e.g. ulimit -n isnt reached by either qdrouterd

4) once qdrouterd connections are established, goferd shall reconnect again (after some timeout between probes) - automatically

Comment 8 Charles Birk 2017-03-23 13:55:46 UTC
(In reply to Pavel Moravec from comment #7)
> Re-reading the BZ again, I forgot on "qdrouterd on the capsule complains
> about a broken connection to the Satellite, causing goferd on clients to log
> qd:no-route-to-dest errors. " :
> 
> If qdrouterd on Caps isnt connected to qdrouterd on Sat, then the goferd
> errors are reasonable. Since qdrouterd is just a _router_ (like IP router),
> inter-connecting goferd on Content Hosts to qpidd on Satellite. If either
> router or connection among isnt established, no route to destination (qpidd).
> 
> So you should check why the broken connections happen.
> 
> 1)
> "HELLO peer expired: satellite.example.com" - this often happened in 0.4-20
> (or older) version(s) of qpid-dispatch, see:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1398377
> https://access.redhat.com/solutions/2778111


1.) On our Satellite servers:

libqpid-dispatch.x86_64      0.4-21.el7sat     @rhel-7-server-satellite-6.2-rpms
qpid-dispatch-router.x86_64  0.4-21.el7sat     @rhel-7-server-satellite-6.2-rpms

On our Capsule servers:

libqpid-dispatch.x86_64              0.4-21.el7sat           @rhel-7-server-satellite-capsule-6.2-rpms
qpid-dispatch-router.x86_64          0.4-21.el7sat           @rhel-7-server-satellite-capsule-6.2-rpms


> 
> 2) is the qdrouter trying to reconnect? It should (quite promptly &
> frequently, I think)

It is, and it's filling /var/log/messages full of this:

goferd: [ERROR][worker-0] gofer.messaging.adapter.proton.reliability:47 - sender b5617a8f-c1db-449d-b9a2-f0055056f23b to qmf.default.direct closed due to: Condition('qd:no-route-to-dest', 'No route to the destination node')

> 
> 3) how that reconnect attempt succeeds/fails? If it fails, check if e.g.
> ulimit -n isnt reached by either qdrouterd

As far as I can tell, this limit is not being hit.

> 
> 4) once qdrouterd connections are established, goferd shall reconnect again
> (after some timeout between probes) - automatically

As I previously stated, restarting qdrouterd on the Satellite server and then all 7 capsules seems to _temporarily_ resolve the issue.

Comment 9 Andrew Kofink 2017-03-23 14:20:00 UTC
Thank you Charles for providing this key information. We don't have a reproducer inside engineering at this time, so I really appreciate your participation.

Pavel,

As Charles mentioned, Sat 6.2.8 ships with qpid-dispatch 0.4-21.

About the no-route-to-dest goferd error messages you refer to in comment 6, does restarting qdrouterd on all the capsules fix the communication problems temporarily, or is the route between goferd and the satellite permanently broken?

Comment 10 Pavel Moravec 2017-03-23 17:20:48 UTC
(In reply to Charles Birk from comment #8)
> (In reply to Pavel Moravec from comment #7)
> > Re-reading the BZ again, I forgot on "qdrouterd on the capsule complains
> > about a broken connection to the Satellite, causing goferd on clients to log
> > qd:no-route-to-dest errors. " :
> > 
> > If qdrouterd on Caps isnt connected to qdrouterd on Sat, then the goferd
> > errors are reasonable. Since qdrouterd is just a _router_ (like IP router),
> > inter-connecting goferd on Content Hosts to qpidd on Satellite. If either
> > router or connection among isnt established, no route to destination (qpidd).
> > 
> > So you should check why the broken connections happen.
> > 
> > 1)
> > "HELLO peer expired: satellite.example.com" - this often happened in 0.4-20
> > (or older) version(s) of qpid-dispatch, see:
> > 
> > https://bugzilla.redhat.com/show_bug.cgi?id=1398377
> > https://access.redhat.com/solutions/2778111
> 
> 
> 1.) On our Satellite servers:
> 
> libqpid-dispatch.x86_64      0.4-21.el7sat    
> @rhel-7-server-satellite-6.2-rpms
> qpid-dispatch-router.x86_64  0.4-21.el7sat    
> @rhel-7-server-satellite-6.2-rpms
> 
> On our Capsule servers:
> 
> libqpid-dispatch.x86_64              0.4-21.el7sat          
> @rhel-7-server-satellite-capsule-6.2-rpms
> qpid-dispatch-router.x86_64          0.4-21.el7sat          
> @rhel-7-server-satellite-capsule-6.2-rpms
> 
> 
> > 
> > 2) is the qdrouter trying to reconnect? It should (quite promptly &
> > frequently, I think)
> 
> It is, and it's filling /var/log/messages full of this:
> 
> goferd: [ERROR][worker-0] gofer.messaging.adapter.proton.reliability:47 -
> sender b5617a8f-c1db-449d-b9a2-f0055056f23b to qmf.default.direct closed due
> to: Condition('qd:no-route-to-dest', 'No route to the destination node')
> 
> > 
> > 3) how that reconnect attempt succeeds/fails? If it fails, check if e.g.
> > ulimit -n isnt reached by either qdrouterd
> 
> As far as I can tell, this limit is not being hit.
> 
> > 
> > 4) once qdrouterd connections are established, goferd shall reconnect again
> > (after some timeout between probes) - automatically
> 
> As I previously stated, restarting qdrouterd on the Satellite server and
> then all 7 capsules seems to _temporarily_ resolve the issue.

Here the real cause is bz1367735 causing qpidd consuming much memory causing OOM killer killed qpidd causing the 'qd:no-route-to-dest' is expected consequence.

(now I will check the other 2 cases linked to this BZ)

Comment 12 Pavel Moravec 2017-03-23 19:53:42 UTC
Data in case 01775422 are insufficient, asking for more there.

Comment 15 Andrew Kofink 2017-03-24 16:12:29 UTC
Pavel,

I set up a 6.2.8 environment following the steps in comment 6, but I'm not able to reproduce the repeated no-route-to-dest errors in goferd.

Here's what I did:

1. Install fresh 6.2.8 satellite and capsule install with RHEL 7 repos and Sat 6.2 RHEL 7 Tools repo
2. Register a RHEL7 content host to the satellite, enable it both repos, install katello-agent.
3. Ensure goferd logs on client and qdrouterd logs on satellite/capsule look good
4. On the client

rpm -ev $(rpm -qa | grep katello-ca-consumer)
rpm -Uvh http://capsule.example.com/pub/katello-ca-consumer-latest.noarch.rpm
subscription-manager register --user admin --password changeme --environment Library --force --auto-attach
systemctl restart goferd

I'm seeing none of the no-route-to-dest errors in goferd, just a successful connection:

Mar 24 15:28:19 sat62-host goferd: [INFO][MainThread] gofer.agent.main:87 - agent started.
Mar 24 15:28:19 sat62-host goferd: [INFO][worker-0] gofer.messaging.adapter.connect:28 - connecting: proton+amqps://capsule.example.com:5647
Mar 24 15:28:19 sat62-host goferd: [INFO][worker-0] gofer.messaging.adapter.proton.connection:87 - open: URL: amqps://capsule.example.com:5647|SSL: ca: /etc/rhsm/ca/katello-default-ca.pem|key: None|certificate: /etc/pki/consumer/bundle.pem|host-validation: None
Mar 24 15:28:19 sat62-host goferd: [INFO][worker-0] root:490 - connecting to capsule.example.com:5647...
Mar 24 15:28:19 sat62-host goferd: [INFO][worker-0] gofer.messaging.adapter.proton.connection:92 - opened: proton+amqps://capsule.example.com:5647
Mar 24 15:28:19 sat62-host goferd: [INFO][worker-0] gofer.messaging.adapter.connect:30 - connected: proton+amqps://capsule.example.com:5647
Mar 24 15:28:19 sat62-host goferd: [INFO][worker-0] root:512 - connected to capsule.example.com:5647
Mar 24 15:28:19 sat62-host goferd: [INFO][worker-0] gofer.messaging.adapter.proton.connection:131 - closed: proton+amqps://capsule.example.com:5647
Mar 24 15:28:19 sat62-host goferd: [INFO][pulp.agent.9aeb80c7-75c0-44b4-bfdd-5da6065b29b1] gofer.messaging.adapter.connect:28 - connecting: proton+amqps://capsule.example.com:5647
Mar 24 15:28:19 sat62-host goferd: [INFO][worker-0] gofer.agent.plugin:368 - plugin:katelloplugin, attached => pulp.agent.9aeb80c7-75c0-44b4-bfdd-5da6065b29b1
Mar 24 15:28:19 sat62-host goferd: [INFO][pulp.agent.9aeb80c7-75c0-44b4-bfdd-5da6065b29b1] gofer.messaging.adapter.proton.connection:87 - open: URL: amqps://capsule.example.com:5647|SSL: ca: /etc/rhsm/ca/katello-default-ca.pem|key: None|certificate: /etc/pki/consumer/bundle.pem|host-validation: None
Mar 24 15:28:19 sat62-host goferd: [INFO][pulp.agent.9aeb80c7-75c0-44b4-bfdd-5da6065b29b1] root:490 - connecting to capsule.example.com:5647...
Mar 24 15:28:20 sat62-host goferd: [INFO][pulp.agent.9aeb80c7-75c0-44b4-bfdd-5da6065b29b1] gofer.messaging.adapter.proton.connection:92 - opened: proton+amqps://capsule.example.com:5647
Mar 24 15:28:20 sat62-host goferd: [INFO][pulp.agent.9aeb80c7-75c0-44b4-bfdd-5da6065b29b1] gofer.messaging.adapter.connect:30 - connected: proton+amqps://capsule.example.com:5647
Mar 24 15:28:20 sat62-host goferd: [INFO][pulp.agent.9aeb80c7-75c0-44b4-bfdd-5da6065b29b1] root:512 - connected to capsule.example.com:5647

Did I miss a step? I still have the environment set up. I'd be really interested to see how your reproducer differs.

Comment 16 Pavel Moravec 2017-03-24 16:18:05 UTC
(In reply to Andrew Kofink from comment #15)
> Pavel,
> 
> I set up a 6.2.8 environment following the steps in comment 6, but I'm not
> able to reproduce the repeated no-route-to-dest errors in goferd.
> 
> Here's what I did:
> 
> 1. Install fresh 6.2.8 satellite and capsule install with RHEL 7 repos and
> Sat 6.2 RHEL 7 Tools repo
> 2. Register a RHEL7 content host to the satellite, enable it both repos,
> install katello-agent.
> 3. Ensure goferd logs on client and qdrouterd logs on satellite/capsule look
> good
> 4. On the client
> 
> rpm -ev $(rpm -qa | grep katello-ca-consumer)
> rpm -Uvh http://capsule.example.com/pub/katello-ca-consumer-latest.noarch.rpm
> subscription-manager register --user admin --password changeme --environment
> Library --force --auto-attach
> systemctl restart goferd
> 
> I'm seeing none of the no-route-to-dest errors in goferd, just a successful
> connection:
> 
> Mar 24 15:28:19 sat62-host goferd: [INFO][MainThread] gofer.agent.main:87 -
> agent started.
> Mar 24 15:28:19 sat62-host goferd: [INFO][worker-0]
> gofer.messaging.adapter.connect:28 - connecting:
> proton+amqps://capsule.example.com:5647
> Mar 24 15:28:19 sat62-host goferd: [INFO][worker-0]
> gofer.messaging.adapter.proton.connection:87 - open: URL:
> amqps://capsule.example.com:5647|SSL: ca:
> /etc/rhsm/ca/katello-default-ca.pem|key: None|certificate:
> /etc/pki/consumer/bundle.pem|host-validation: None
> Mar 24 15:28:19 sat62-host goferd: [INFO][worker-0] root:490 - connecting to
> capsule.example.com:5647...
> Mar 24 15:28:19 sat62-host goferd: [INFO][worker-0]
> gofer.messaging.adapter.proton.connection:92 - opened:
> proton+amqps://capsule.example.com:5647
> Mar 24 15:28:19 sat62-host goferd: [INFO][worker-0]
> gofer.messaging.adapter.connect:30 - connected:
> proton+amqps://capsule.example.com:5647
> Mar 24 15:28:19 sat62-host goferd: [INFO][worker-0] root:512 - connected to
> capsule.example.com:5647
> Mar 24 15:28:19 sat62-host goferd: [INFO][worker-0]
> gofer.messaging.adapter.proton.connection:131 - closed:
> proton+amqps://capsule.example.com:5647
> Mar 24 15:28:19 sat62-host goferd:
> [INFO][pulp.agent.9aeb80c7-75c0-44b4-bfdd-5da6065b29b1]
> gofer.messaging.adapter.connect:28 - connecting:
> proton+amqps://capsule.example.com:5647
> Mar 24 15:28:19 sat62-host goferd: [INFO][worker-0] gofer.agent.plugin:368 -
> plugin:katelloplugin, attached =>
> pulp.agent.9aeb80c7-75c0-44b4-bfdd-5da6065b29b1
> Mar 24 15:28:19 sat62-host goferd:
> [INFO][pulp.agent.9aeb80c7-75c0-44b4-bfdd-5da6065b29b1]
> gofer.messaging.adapter.proton.connection:87 - open: URL:
> amqps://capsule.example.com:5647|SSL: ca:
> /etc/rhsm/ca/katello-default-ca.pem|key: None|certificate:
> /etc/pki/consumer/bundle.pem|host-validation: None
> Mar 24 15:28:19 sat62-host goferd:
> [INFO][pulp.agent.9aeb80c7-75c0-44b4-bfdd-5da6065b29b1] root:490 -
> connecting to capsule.example.com:5647...
> Mar 24 15:28:20 sat62-host goferd:
> [INFO][pulp.agent.9aeb80c7-75c0-44b4-bfdd-5da6065b29b1]
> gofer.messaging.adapter.proton.connection:92 - opened:
> proton+amqps://capsule.example.com:5647
> Mar 24 15:28:20 sat62-host goferd:
> [INFO][pulp.agent.9aeb80c7-75c0-44b4-bfdd-5da6065b29b1]
> gofer.messaging.adapter.connect:30 - connected:
> proton+amqps://capsule.example.com:5647
> Mar 24 15:28:20 sat62-host goferd:
> [INFO][pulp.agent.9aeb80c7-75c0-44b4-bfdd-5da6065b29b1] root:512 - connected
> to capsule.example.com:5647
> 
> Did I miss a step? I still have the environment set up. I'd be really
> interested to see how your reproducer differs.

My reproducer is very unreliable. I succeeded with this reproducer just once in past and once today. Today with above logs & coredump.

Today I know I freezed qdrouterd on Satellite for a while when I was checking the process via gdb - that triggered hello peer expired and link lost (and this is trivially reproducible, i.e. via:

pid=$(pgrep qdrouterd)
echo "$(date): freezing"
kill -s SIGSTOP $pid
sleep $((RANDOM%10+10))
echo "$(date): unfreezing"
kill -s SIGCONT $pid

Now I let running the above in a loop together with "migrating" the goferd to connect to Satellite / Capsule, but without a success.

Simply we dont have a reliable reproducer :(

Comment 17 Pavel Moravec 2017-03-27 20:55:15 UTC
I think this sequence of qpidd(S) and qdrouterd(S) restarts led to the qdrouterd(C) misbehaviour:

[root@pmoravec-sat62-rhel7 ~]# ./random_restart_qpidd_qdrouterd.sh 
Redirecting to /bin/systemctl start  qpidd.service
Redirecting to /bin/systemctl start  qdrouterd.service
Mon Mar 27 22:43:38 CEST 2017: everyting started
Mon Mar 27 22:43:38 CEST 2017: applying qpidd stop
Redirecting to /bin/systemctl stop  qpidd.service
Mon Mar 27 22:43:38 CEST 2017: status is: qpidd=0, qdrouterd=1
Mon Mar 27 22:43:44 CEST 2017: applying qdrouterd stop
Redirecting to /bin/systemctl stop  qdrouterd.service
Mon Mar 27 22:43:44 CEST 2017: status is: qpidd=0, qdrouterd=0
Mon Mar 27 22:43:49 CEST 2017: applying qdrouterd start
Redirecting to /bin/systemctl start  qdrouterd.service
Mon Mar 27 22:43:49 CEST 2017: status is: qpidd=0, qdrouterd=1
Mon Mar 27 22:43:58 CEST 2017: applying qpidd start
Redirecting to /bin/systemctl start  qpidd.service
Mon Mar 27 22:43:58 CEST 2017: status is: qpidd=1, qdrouterd=1
Mon Mar 27 22:44:06 CEST 2017: applying qpidd stop
Redirecting to /bin/systemctl stop  qpidd.service
Mon Mar 27 22:44:07 CEST 2017: status is: qpidd=0, qdrouterd=1
Mon Mar 27 22:44:12 CEST 2017: applying qdrouterd stop
Redirecting to /bin/systemctl stop  qdrouterd.service
Mon Mar 27 22:44:12 CEST 2017: status is: qpidd=0, qdrouterd=0
Mon Mar 27 22:44:17 CEST 2017: applying qdrouterd start
Redirecting to /bin/systemctl start  qdrouterd.service
Mon Mar 27 22:44:17 CEST 2017: status is: qpidd=0, qdrouterd=1
Mon Mar 27 22:44:23 CEST 2017: applying qpidd start
Redirecting to /bin/systemctl start  qpidd.service
Mon Mar 27 22:44:23 CEST 2017: status is: qpidd=1, qdrouterd=1
^C[root@pmoravec-sat62-rhel7 ~]#

Reproducer script:

service qpidd start
service qdrouterd start
qpidd=1
qdrouterd=1

echo "$(date): everyting started"
while true; do
	if [ $((RANDOM%2)) -eq 0 ]; then
		serv="qpidd"
		if [ $qpidd -eq 1 ]; then
			qpidd=0
			act="stop"
		else
			qpidd=1
			act="start"
		fi
	else
                serv="qdrouterd"
                if [ $qdrouterd -eq 1 ]; then
                        qdrouterd=0
                        act="stop"
                else
                        qdrouterd=1
                        act="start"
                fi
	fi
	echo "$(date): applying $serv $act"
	service $serv $act
	echo "$(date): status is: qpidd=$qpidd, qdrouterd=$qdrouterd"
	sleep 5
	sleep $((RANDOM%5))
done


Anyway, repeating the same steps did _not_ trigger the same bug - sounds some race condition affects it..

Comment 18 Pavel Moravec 2017-03-28 09:51:36 UTC
Is there a way to dump "routing table" from either qdrouterd (I think it isnt due to the management data deliberate drop recently)? Or how to get it from gdb / coredump?

Or what data are worth to collect that could help you? (enable proton traces? some debugging?)

Just a thought where the problem might also be: theoretically, qdrouterd(S) could also be the guilty one, if it does not update (due to some error) the qdrouterd(C) that link routing to qpidd is available (again / for the first time).

Comment 19 Pavel Moravec 2017-03-28 13:23:08 UTC
(In reply to Pavel Moravec from comment #18)
> Just a thought where the problem might also be: theoretically, qdrouterd(S)
> could also be the guilty one, if it does not update (due to some error) the
> qdrouterd(C) that link routing to qpidd is available (again / for the first
> time).

One reproducer for this exists - NOT seen at a customer.

If I artificially restrict ulimit -n for qdrouterd such that it cant open TCP connections _or_ some files it dynamically loads _after_ the TCP and AMQP connection is opened, I can reproduce situation where:

- qdrouterd(S) complains about Too many open files (expected, this does not occur at customers)
- TCP connection between the routers is established (hence user _thinks_ everything is inter-connected)
- but the AMQP link isnt established, thus qd:no-route-to-dest is reasonable


So could either of the routers establish TCP connection but not open the AMQP link?

Comment 20 Ganesh Murthy 2017-03-28 14:50:49 UTC
(In reply to Pavel Moravec from comment #19)
> (In reply to Pavel Moravec from comment #18)
> > Just a thought where the problem might also be: theoretically, qdrouterd(S)
> > could also be the guilty one, if it does not update (due to some error) the
> > qdrouterd(C) that link routing to qpidd is available (again / for the first
> > time).
> 
> One reproducer for this exists - NOT seen at a customer.
> 
> If I artificially restrict ulimit -n for qdrouterd such that it cant open
> TCP connections _or_ some files it dynamically loads _after_ the TCP and
> AMQP connection is opened, I can reproduce situation where:
> 
> - qdrouterd(S) complains about Too many open files (expected, this does not
> occur at customers)
> - TCP connection between the routers is established (hence user _thinks_
> everything is inter-connected)
> - but the AMQP link isnt established, thus qd:no-route-to-dest is reasonable
> 
> 
> So could either of the routers establish TCP connection but not open the
> AMQP link?

Can we try doing the following on both routers 
1. turning on PN_TRACE_FRM=1 
2. turn on trace logging 

When the connection between the routers drops, you should see a message like this on both router logs indicating loss of connection - 

Closed 127.0.0.1:39328
Tue Mar 28 10:34:13 2017 ROUTER_LS (info) Router Link Lost - link_id=0

When the connection is re-established between the routers, you should see the SENT and RCVD HELLO messages going between the routers after an AMQP connection is established on top of the TCP connection like this - 

Tue Mar 28 10:34:12 2017 ROUTER_MA (trace) RCVD: MAR(id=Router.B area=0 have_seq=0)
Tue Mar 28 10:34:12 2017 ROUTER_MA (trace) SENT: MAU(id=Router.A area=0 mobile_seq=1 add=['Cpulp.', 'Dpulp.'] del=[])

When a router receives an attach with a target address, it looks in its routing table. If the routing table *does* have this target address, it tries to find a connection on which this attach can be forwarded. If it cannot find any relevant connection, it throws the qd:no-route-to-dest error.

Note here that the fact that you are seeing the qd:no-route-to-dest implicitly means that the router was successfully able to find the target address in its routing table but does not know where to forward the attach because the connection that the attach was supposed to be forwarded is not available or established. So, you can be sure that the router's routing table already has target address.

Comment 25 Sachin Ghai 2017-04-03 11:17:18 UTC
I found similar issue while performing n-1 capsule testing w/ satellite 6.2.9. I upgraded my satellite from 6.2.8 -> 6.2.9 while keeping the capsule at 6.2.8.

Satellite was upgraded successfully, Later, I created a host through capsule and I was not able to install package on that host through satellite server.

goferd was running. However, I was getting following error on checking goferd status:

Apr 03 09:40:37 capn1rhel7test.capqe.lab.eng.rdu2.redhat.com goferd[10812]: [ERROR][worker-0] gofer.messaging.adapter.proton.reliability:47 - sender 5093ca70-6bdf-422a-99da-a9e32245...on node')


and following error found in /var/log/messages:


Apr  3 09:38:03 capn1rhel7test goferd: [ERROR][worker-0] gofer.messaging.adapter.proton.reliability:47 - sender 825ac9f4-9701-432e-b97c-c882eb8c41f4 to qmf.default.direct closed due to: Condition('qd:no-route-to-dest', 'No route to the destination node')


Please note that this issue appears only for those hosts which were installed through capsule. And as soon as I restarted "qdrouterd" service on capsule, issue was gone.

I'll try to reproduce this issue.

Comment 27 Pavel Moravec 2017-04-04 13:09:03 UTC
Created attachment 1268648 [details]
tcpdump of inter-qdrouterd traffic

When qdrouterd(C) was already returning the error, qdrouterd(S) restarted with above tcpdump running (just decode 5646 as AMQP, no SSL).

See frame 101 and 5th message there - qdrouterd(S) tells qdrouterd(C) to add "pulp" and "qmf" prefixes for link routing.

That suggests me qdrouterd(C) was well informed about the connectivity status, but it hasnt updated its routing table/rules accordingly, since then - and has been still replying the "no link" error to clients.

Questions:
- when qdrouterd(C) can ignore such "add prefix" request?
- can it ignore when it thinks there is _another_ link that can be used (but the link is orphaned record, such that when a client request comes in, this link record is skipped)?

Comment 31 Andrew Kofink 2017-04-19 18:32:09 UTC
From all the cases of this bug I can find, it seems the capsule version is one z stream behind the satellite server version (i.e. satellite 6.2.8, capsule 6.2.7) . It may be helpful to upgrade the capsule version to match the satellite.

Comment 32 Charles Birk 2017-04-19 18:42:02 UTC
I can add that our Satellite server is at 6.2.7 and our capsules are as well.

Comment 33 Pavel Moravec 2017-04-19 20:03:27 UTC
(In reply to Charles Birk from comment #32)
> I can add that our Satellite server is at 6.2.7 and our capsules are as well.

Didnt the error occur roughly during upgrade from 6.2.6 (or older) to 6.2.7? Since any upgrade starts by upgrading Satellite.

I also reproduced it when upgrading from 6.2.8 to 6.2.9 QA snapshot - will debug more and restrict the reproducer just to standalone qdrouterd.

Comment 34 Pavel Moravec 2017-04-20 07:23:21 UTC
Created attachment 1272873 [details]
reproduced qdrouterd traces with updated mobile.py

The attached logs are from qdrouterd(C) - the faulty one - that curiously survived Satellite's upgrade, but didnt survive its reboot. The goferd client was mimicked by script:


#!/usr/bin/python
from time import sleep
from uuid import uuid4
import datetime

from proton import ConnectionException, Message, Timeout
from proton import SSLDomain, SSLException
from proton.utils import BlockingConnection
from proton.reactor import DynamicNodeProperties

ca_certificate='/etc/rhsm/ca/katello-server-ca.pem'
client_certificate='/etc/pki/consumer/bundle.pem'
client_key=None

domain = SSLDomain(SSLDomain.MODE_CLIENT)
domain.set_trusted_ca_db(ca_certificate)
domain.set_credentials(client_certificate, client_key or client_certificate, None)
domain.set_peer_authentication(SSLDomain.VERIFY_PEER)

qmf_snd = None
qmf_rec = None
qmf_conn = None

bounce = True

while True:
    try:
	exception = False
	if qmf_conn is None:
  	    qmf_conn = BlockingConnection("amqps://pmoravec-caps62-bz1426242.gsslab.brq.redhat.com:5647", ssl_domain=domain, heartbeat=5)
            qmf_snd = qmf_conn.create_sender("qmf.default.direct", name=str(uuid4()))
#	    qmf_rec = qmf_conn.create_receiver(None, name=str(uuid4()), dynamic=True, options=DynamicNodeProperties({'x-opt-qd.address': unicode("qmf.default.direct")}))
    except Exception, e:
        print "%s: %s" % (str(datetime.datetime.now()), e)
	exception = True
        pass
    finally:
	sleep(5)
	if not exception:
	    print "%s: OK" % str(datetime.datetime.now())
        if bounce or exception:
   	    try:
                if qmf_snd:
                    qmf_snd.close()
                if qmf_rec:
    	            qmf_rec.close()
                if qmf_conn:
    	            qmf_conn.close()
            except Exception:
                pass
            qmf_conn = None


Script output (relevant part):
..
2017-04-20 09:07:24.867724: OK
2017-04-20 09:07:25.911981: sender e4524b8c-4b7d-40ed-a5f0-dd60a1d532e4 to qmf.default.direct closed due to: Condition('qd:no-route-to-dest', 'No route to the destination node')
..
(then only no-route-to-dest errors)


Curiously, when I modified the script by commenting out qmf_snd = .. and uncommenting qmf_rec = .. (just to try using _consumer_ and not _producer_ to the same address), the script succeeded:

2017-04-20 09:14:57.519501: OK


So it seems just _producer_ link routing is affected (??? how is that even possible?).

Comment 35 Pavel Moravec 2017-04-20 07:33:02 UTC
A small correction to the timestamps of the client:

..
2017-04-20 09:07:12.863213: OK
2017-04-20 09:07:18.910720: OK
2017-04-20 09:07:24.867724: OK
2017-04-20 09:07:25.911981: sender e4524b8c-4b7d-40ed-a5f0-dd60a1d532e4 to qmf.default.direct closed due to: Condition('qd:no-route-to-dest', 'No route to the destination node')
2017-04-20 09:07:31.412827: sender 2c44770d-f2ad-4f37-94c6-9c6750f361d7 to qmf.default.direct closed due to: Condition('qd:no-route-to-dest', 'No route to the destination node')
2017-04-20 09:07:36.918367: sender 8df3689c-e9fb-4f1f-819f-d90c5299a036 to qmf.default.direct closed due to: Condition('qd:no-route-to-dest', 'No route to the destination node')
..

The timestamps are printed _after_ the 5s sleep or _immediatelly_ when an exception occurs.

That means, at 09:07:19, client succeeded, waited 5s to 09:07:24. A second later another attempt failed.

Relevant trace logs from the router are:


Thu Apr 20 09:07:17 2017 MESSAGE (trace) Received Message{to='amqp:/_local/qdhello' body='\d1\00\00\00\94\00\00\00\08\a1\04seen\d0'} on link qd.internode.1
Thu Apr 20 09:07:17 2017 ROUTER_HELLO (trace) RCVD: HELLO(id=pmoravec-sat62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492671393 seen=['pmoravec-caps62-bz1426242.gsslab.brq.redhat.com'])
Thu Apr 20 09:07:18 2017 ROUTER_HELLO (trace) SENT: HELLO(id=pmoravec-caps62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492670292 seen=['pmoravec-sat62-bz1426242.gsslab.brq.redhat.com'])
Thu Apr 20 09:07:18 2017 MESSAGE (trace) Sending Message{to='amqp:/_local/qdhello' body='\d1\00\00\00\94\00\00\00\08\a1\04seen\d0'} on link qd.internode.2
Thu Apr 20 09:07:18 2017 MESSAGE (trace) Received Message{to='amqp:/_local/qdhello' body='\d1\00\00\00\94\00\00\00\08\a1\04seen\d0'} on link qd.internode.1
Thu Apr 20 09:07:18 2017 ROUTER_HELLO (trace) RCVD: HELLO(id=pmoravec-sat62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492671393 seen=['pmoravec-caps62-bz1426242.gsslab.brq.redhat.com'])
Thu Apr 20 09:07:18 2017 SERVER (debug) Accepting incoming connection from 10.34.85.74:38078 to 0.0.0.0:5647
Thu Apr 20 09:07:18 2017 SERVER (trace) Configuring SSL on incoming connection from 10.34.85.74:38078 to 0.0.0.0:5647
Thu Apr 20 09:07:19 2017 ROUTER_HELLO (trace) SENT: HELLO(id=pmoravec-caps62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492670292 seen=['pmoravec-sat62-bz1426242.gsslab.brq.redhat.com'])
Thu Apr 20 09:07:19 2017 MESSAGE (trace) Sending Message{to='amqp:/_local/qdhello' body='\d1\00\00\00\94\00\00\00\08\a1\04seen\d0'} on link qd.internode.2
Thu Apr 20 09:07:20 2017 ROUTER_HELLO (trace) SENT: HELLO(id=pmoravec-caps62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492670292 seen=['pmoravec-sat62-bz1426242.gsslab.brq.redhat.com'])
Thu Apr 20 09:07:20 2017 MESSAGE (trace) Sending Message{to='amqp:/_local/qdhello' body='\d1\00\00\00\94\00\00\00\08\a1\04seen\d0'} on link qd.internode.2
Thu Apr 20 09:07:21 2017 ROUTER_HELLO (trace) SENT: HELLO(id=pmoravec-caps62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492670292 seen=['pmoravec-sat62-bz1426242.gsslab.brq.redhat.com'])
Thu Apr 20 09:07:21 2017 MESSAGE (trace) Sending Message{to='amqp:/_local/qdhello' body='\d1\00\00\00\94\00\00\00\08\a1\04seen\d0'} on link qd.internode.2
Thu Apr 20 09:07:22 2017 ROUTER_HELLO (info) HELLO peer expired: pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Thu Apr 20 09:07:22 2017 ROUTER_HELLO (trace) SENT: HELLO(id=pmoravec-caps62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492670292 seen=[])
Thu Apr 20 09:07:22 2017 ROUTER (trace) Node pmoravec-sat62-bz1426242.gsslab.brq.redhat.com link removed
Thu Apr 20 09:07:22 2017 ROUTER (trace) Entered Router Flux Mode
Thu Apr 20 09:07:22 2017 ROUTER_LS (trace) Local Link State: LS(id=pmoravec-caps62-bz1426242.gsslab.brq.redhat.com area=0 ls_seq=6 peers=[])
Thu Apr 20 09:07:22 2017 ROUTER_LS (trace) Computed next hops: {}
Thu Apr 20 09:07:22 2017 ROUTER_LS (trace) Computed valid origins: {}
Thu Apr 20 09:07:22 2017 ROUTER_LS (trace) SENT: RA(id=pmoravec-caps62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492670292 ls_seq=6 mobile_seq=0)
Thu Apr 20 09:07:22 2017 MESSAGE (trace) Sending Message{to='amqp:/_local/qdhello' body='\d1\00\00\00\\00\00\00\08\a1\04seenE'} on link qd.internode.2
Thu Apr 20 09:07:23 2017 ROUTER_HELLO (trace) SENT: HELLO(id=pmoravec-caps62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492670292 seen=[])
Thu Apr 20 09:07:23 2017 MESSAGE (trace) Sending Message{to='amqp:/_local/qdhello' body='\d1\00\00\00\\00\00\00\08\a1\04seenE'} on link qd.internode.2
Thu Apr 20 09:07:24 2017 SERVER (debug) Accepting incoming connection from 10.34.85.74:38080 to 0.0.0.0:5647
Thu Apr 20 09:07:24 2017 SERVER (trace) Configuring SSL on incoming connection from 10.34.85.74:38080 to 0.0.0.0:5647
Thu Apr 20 09:07:24 2017 ROUTER_HELLO (trace) SENT: HELLO(id=pmoravec-caps62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492670292 seen=[])
Thu Apr 20 09:07:24 2017 MESSAGE (trace) Sending Message{to='amqp:/_local/qdhello' body='\d1\00\00\00\\00\00\00\08\a1\04seenE'} on link qd.internode.2
Thu Apr 20 09:07:25 2017 ROUTER_HELLO (trace) SENT: HELLO(id=pmoravec-caps62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492670292 seen=[])
Thu Apr 20 09:07:25 2017 MESSAGE (trace) Sending Message{to='amqp:/_local/qdhello' body='\d1\00\00\00\\00\00\00\08\a1\04seenE'} on link qd.internode.2
Thu Apr 20 09:07:26 2017 ROUTER_HELLO (trace) SENT: HELLO(id=pmoravec-caps62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492670292 seen=[])
Thu Apr 20 09:07:26 2017 ROUTER_LS (trace) SENT: RA(id=pmoravec-caps62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492670292 ls_seq=6 mobile_seq=0)
Thu Apr 20 09:07:26 2017 MESSAGE (trace) Sending Message{to='amqp:/_local/qdhello' body='\d1\00\00\00\\00\00\00\08\a1\04seenE'} on link qd.internode.2
Thu Apr 20 09:07:27 2017 ROUTER_HELLO (trace) SENT: HELLO(id=pmoravec-caps62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492670292 seen=[])
Thu Apr 20 09:07:27 2017 MESSAGE (trace) Sending Message{to='amqp:/_local/qdhello' body='\d1\00\00\00\\00\00\00\08\a1\04seenE'} on link qd.internode.2
Thu Apr 20 09:07:29 2017 ROUTER_HELLO (trace) SENT: HELLO(id=pmoravec-caps62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492670292 seen=[])
Thu Apr 20 09:07:29 2017 MESSAGE (trace) Sending Message{to='amqp:/_local/qdhello' body='\d1\00\00\00\\00\00\00\08\a1\04seenE'} on link qd.internode.2
Thu Apr 20 09:07:30 2017 ROUTER_HELLO (trace) SENT: HELLO(id=pmoravec-caps62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492670292 seen=[])
Thu Apr 20 09:07:30 2017 ROUTER (trace) Exited Router Flux Mode
Thu Apr 20 09:07:30 2017 MESSAGE (trace) Sending Message{to='amqp:/_local/qdhello' body='\d1\00\00\00\\00\00\00\08\a1\04seenE'} on link qd.internode.2
Thu Apr 20 09:07:30 2017 SERVER (debug) Accepting incoming connection from 10.34.85.74:38082 to 0.0.0.0:5647
Thu Apr 20 09:07:30 2017 SERVER (trace) Configuring SSL on incoming connection from 10.34.85.74:38082 to 0.0.0.0:5647



(I *think* the "ROUTER_LS (trace) SENT: RA(id=.." is important here - noticed it in one another case as well)

Comment 36 Pavel Moravec 2017-04-20 11:12:26 UTC
Simpler reproducer (mimicking Satellite upgrade by pausing the VM):

- run qdrouterd(S) on a VM
- restart qdrouterd
- immediately to that, pause the VM
- wait a minute or two
- resume the VM
- optionally, restart qdrouterd(S) (to ensure this router isnt in a broken shape)
- wait whatever long, but qdrouterd(C) will bounce any _sender_ client with qd:no-route-to-dest error
  - while the router has link to the peer router established, destinations mapped etc.

Curiously, check if traces on the qdrouterd(C) have:

Thu Apr 20 13:05:32 2017 ROUTER_LS (trace) RCVD: RA(id=pmoravec-sat62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492686127 ls_seq=1 mobile_seq=1)
Thu Apr 20 13:05:33 2017 ROUTER_LS (trace) RCVD: RA(id=pmoravec-sat62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492686127 ls_seq=3 mobile_seq=1)


but the ls_seq=2 is missing.

(while traces on the qdrouterd(S) have also the missing SENT ls_seq=2 one)

Comment 37 Pavel Moravec 2017-04-20 20:38:49 UTC
Created attachment 1273067 [details]
trace logs with verbose mobile.py and tcpdumps

qdrouterd-S.log = Satellite's router
qdrouterd-C.log = Capsule's router
amqp-caps.cap = tcpdump from Capsule (port 5646 = AMQP)
amqp-sat.cap = tcpdump from Satellite (port 5646 = AMQP)


VM on Satellite was frozen from approx. 20:25:10 - 20:27:51 but the Satellite corrected its clock even at 20:28:58 (I deduce).

qdrouterd(S) thought it sent ls_seq=3:

Thu Apr 20 20:25:16 2017 ROUTER_LS (trace) SENT: LSU(id=pmoravec-sat62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492712695 ls_seq=2 ls=LS(id=pmoravec-sat62-bz1426242.gsslab.brq.redhat.com area=0 ls_seq=2 peers=['pmoravec-caps62-bz1426242.gsslab.brq.redhat.com']))
Thu Apr 20 20:25:20 2017 ROUTER_LS (trace) SENT: RA(id=pmoravec-sat62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492712695 ls_seq=2 mobile_seq=1)
Thu Apr 20 20:25:50 2017 ROUTER_LS (trace) SENT: RA(id=pmoravec-sat62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492712695 ls_seq=2 mobile_seq=1)
Thu Apr 20 20:26:20 2017 ROUTER_LS (trace) SENT: RA(id=pmoravec-sat62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492712695 ls_seq=2 mobile_seq=1)
Thu Apr 20 20:28:58 2017 ROUTER_LS (trace) SENT: RA(id=pmoravec-sat62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492712695 ls_seq=2 mobile_seq=1)
Thu Apr 20 20:28:58 2017 ROUTER_LS (trace) SENT: RA(id=pmoravec-sat62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492712695 ls_seq=3 mobile_seq=1)
Thu Apr 20 20:28:59 2017 ROUTER_LS (trace) SENT: RA(id=pmoravec-sat62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492712695 ls_seq=4 mobile_seq=1)
Thu Apr 20 20:28:59 2017 ROUTER_LS (trace) SENT: LSU(id=pmoravec-sat62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492712695 ls_seq=4 ls=LS(id=pmoravec-sat62-bz1426242.gsslab.brq.redhat.com area=0 ls_seq=4 peers=['pmoravec-caps62-bz1426242.gsslab.brq.redhat.com']))
Thu Apr 20 20:29:03 2017 ROUTER_LS (trace) SENT: RA(id=pmoravec-sat62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492712695 ls_seq=4 mobile_seq=1)
Thu Apr 20 20:29:33 2017 ROUTER_LS (trace) SENT: RA(id=pmoravec-sat62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492712695 ls_seq=4 mobile_seq=1)
Thu Apr 20 20:30:04 2017 ROUTER_LS (trace) SENT: RA(id=pmoravec-sat62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492712695 ls_seq=4 mobile_seq=1)
Thu Apr 20 20:30:34 2017 ROUTER_LS (trace) SENT: RA(id=pmoravec-sat62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492712695 ls_seq=4 mobile_seq=1)
Thu Apr 20 20:31:04 2017 ROUTER_LS (trace) SENT: RA(id=pmoravec-sat62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492712695 ls_seq=4 mobile_seq=1)


but qdrouterd(C) didnt get it:
Thu Apr 20 20:27:58 2017 ROUTER_LS (trace) RCVD: RA(id=pmoravec-sat62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492712695 ls_seq=2 mobile_seq=1)
Thu Apr 20 20:28:28 2017 ROUTER_LS (trace) RCVD: RA(id=pmoravec-sat62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492712695 ls_seq=2 mobile_seq=1)
Thu Apr 20 20:28:57 2017 ROUTER_LS (trace) RCVD: RA(id=pmoravec-sat62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492712695 ls_seq=2 mobile_seq=1)
Thu Apr 20 20:28:58 2017 ROUTER_LS (trace) RCVD: RA(id=pmoravec-sat62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492712695 ls_seq=2 mobile_seq=1)
Thu Apr 20 20:28:59 2017 ROUTER_LS (trace) RCVD: RA(id=pmoravec-sat62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492712695 ls_seq=4 mobile_seq=1)
Thu Apr 20 20:28:59 2017 ROUTER_LS (trace) RCVD: LSU(id=pmoravec-sat62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492712695 ls_seq=4 ls=LS(id=pmoravec-sat62-bz1426242.gsslab.brq.redhat.com area=0 ls_seq=4 peers=['pmoravec-caps62-bz1426242.gsslab.brq.redhat.com']))
Thu Apr 20 20:29:03 2017 ROUTER_LS (trace) RCVD: RA(id=pmoravec-sat62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492712695 ls_seq=4 mobile_seq=1)
Thu Apr 20 20:29:33 2017 ROUTER_LS (trace) RCVD: RA(id=pmoravec-sat62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492712695 ls_seq=4 mobile_seq=1)
Thu Apr 20 20:30:03 2017 ROUTER_LS (trace) RCVD: RA(id=pmoravec-sat62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492712695 ls_seq=4 mobile_seq=1)
Thu Apr 20 20:30:34 2017 ROUTER_LS (trace) RCVD: RA(id=pmoravec-sat62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492712695 ls_seq=4 mobile_seq=1)
Thu Apr 20 20:31:04 2017 ROUTER_LS (trace) RCVD: RA(id=pmoravec-sat62-bz1426242.gsslab.brq.redhat.com area=0 inst=1492712695 ls_seq=4 mobile_seq=1)

and neither tcpdump contains that AMQP frame (search for "frame contains 6c:73:5f:73:65:71:55:03"). I guess qdrouterd(S) was so confused due to timer adjusting/reacting to OS time change.

Not sure if this is the culprit (i.e. if ls_seq must be incremented by one otherwise it is ignored?), but this is the best weird thing I found in the traces.

Comment 39 Pavel Moravec 2017-04-22 18:19:34 UTC
under some circumstances, the router does not remove some old router reference from addr->rnodes, when getting new MAU(?). See e.g.:

# grep mapped /var/log/qdrouterd.log | grep pulp
Fri Apr 21 08:49:01 2017 ROUTER (debug) Remote destination pulp.;class=C mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Fri Apr 21 08:49:01 2017 ROUTER (debug) Remote destination pulp.;class=D mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Fri Apr 21 08:50:34 2017 ROUTER (debug) Remote destination pulp.;class=C unmapped from router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Fri Apr 21 08:52:03 2017 ROUTER (debug) Remote destination pulp.;class=C mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Fri Apr 21 08:52:03 2017 ROUTER (debug) Remote destination pulp.;class=D mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Fri Apr 21 09:03:02 2017 ROUTER (debug) Remote destination pulp.;class=C unmapped from router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Fri Apr 21 09:04:10 2017 ROUTER (debug) Remote destination pulp.;class=C mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Fri Apr 21 09:04:10 2017 ROUTER (debug) Remote destination pulp.;class=D mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Fri Apr 21 09:09:46 2017 ROUTER (debug) Remote destination pulp.;class=C unmapped from router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Fri Apr 21 09:10:26 2017 ROUTER (debug) Remote destination pulp.;class=C mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Fri Apr 21 09:10:26 2017 ROUTER (debug) Remote destination pulp.;class=D mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Fri Apr 21 09:11:49 2017 ROUTER (debug) Remote destination pulp.;class=C unmapped from router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Fri Apr 21 09:13:07 2017 ROUTER (debug) Remote destination pulp.;class=C mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Fri Apr 21 09:13:07 2017 ROUTER (debug) Remote destination pulp.;class=D mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:00:23 2017 ROUTER (debug) Remote destination pulp.;class=C unmapped from router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:00:26 2017 ROUTER (debug) Remote destination pulp.;class=C mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:00:26 2017 ROUTER (debug) Remote destination pulp.;class=D mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:04:33 2017 ROUTER (debug) Remote destination pulp.;class=C unmapped from router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:11:20 2017 ROUTER (debug) Remote destination pulp.;class=C mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:11:20 2017 ROUTER (debug) Remote destination pulp.;class=D mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:19:26 2017 ROUTER (debug) Remote destination pulp.;class=C unmapped from router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:23:18 2017 ROUTER (debug) Remote destination pulp.;class=C mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:23:18 2017 ROUTER (debug) Remote destination pulp.;class=D mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:25:52 2017 ROUTER (debug) Remote destination pulp.;class=C unmapped from router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:25:52 2017 ROUTER (debug) Remote destination pulp.;class=C mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:25:52 2017 ROUTER (debug) Remote destination pulp.;class=D mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:27:12 2017 ROUTER (debug) Remote destination pulp.;class=D unmapped from router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:27:12 2017 ROUTER (debug) Remote destination pulp.;class=C unmapped from router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:34:04 2017 ROUTER (debug) Remote destination pulp.;class=C mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:34:04 2017 ROUTER (debug) Remote destination pulp.;class=D mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 13:16:10 2017 ROUTER (debug) Remote destination pulp.;class=C unmapped from router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 13:16:12 2017 ROUTER (debug) Remote destination pulp.;class=C mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 13:16:12 2017 ROUTER (debug) Remote destination pulp.;class=D mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 13:32:36 2017 ROUTER (debug) Remote destination pulp.;class=C unmapped from router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 13:32:36 2017 ROUTER (debug) Remote destination pulp.;class=C mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 13:32:36 2017 ROUTER (debug) Remote destination pulp.;class=D mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 13:33:55 2017 ROUTER (debug) Remote destination pulp.;class=D unmapped from router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 13:33:55 2017 ROUTER (debug) Remote destination pulp.;class=C unmapped from router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 13:33:57 2017 ROUTER (debug) Remote destination pulp.;class=C mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 13:33:57 2017 ROUTER (debug) Remote destination pulp.;class=D mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
#

split for C/D:

# grep mapped /var/log/qdrouterd.log | grep "pulp.;class=C"
Fri Apr 21 08:49:01 2017 ROUTER (debug) Remote destination pulp.;class=C mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Fri Apr 21 08:50:34 2017 ROUTER (debug) Remote destination pulp.;class=C unmapped from router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Fri Apr 21 08:52:03 2017 ROUTER (debug) Remote destination pulp.;class=C mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Fri Apr 21 09:03:02 2017 ROUTER (debug) Remote destination pulp.;class=C unmapped from router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Fri Apr 21 09:04:10 2017 ROUTER (debug) Remote destination pulp.;class=C mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Fri Apr 21 09:09:46 2017 ROUTER (debug) Remote destination pulp.;class=C unmapped from router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Fri Apr 21 09:10:26 2017 ROUTER (debug) Remote destination pulp.;class=C mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Fri Apr 21 09:11:49 2017 ROUTER (debug) Remote destination pulp.;class=C unmapped from router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Fri Apr 21 09:13:07 2017 ROUTER (debug) Remote destination pulp.;class=C mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:00:23 2017 ROUTER (debug) Remote destination pulp.;class=C unmapped from router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:00:26 2017 ROUTER (debug) Remote destination pulp.;class=C mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:04:33 2017 ROUTER (debug) Remote destination pulp.;class=C unmapped from router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:11:20 2017 ROUTER (debug) Remote destination pulp.;class=C mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:19:26 2017 ROUTER (debug) Remote destination pulp.;class=C unmapped from router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:23:18 2017 ROUTER (debug) Remote destination pulp.;class=C mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:25:52 2017 ROUTER (debug) Remote destination pulp.;class=C unmapped from router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:25:52 2017 ROUTER (debug) Remote destination pulp.;class=C mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:27:12 2017 ROUTER (debug) Remote destination pulp.;class=C unmapped from router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:34:04 2017 ROUTER (debug) Remote destination pulp.;class=C mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 13:16:10 2017 ROUTER (debug) Remote destination pulp.;class=C unmapped from router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 13:16:12 2017 ROUTER (debug) Remote destination pulp.;class=C mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 13:32:36 2017 ROUTER (debug) Remote destination pulp.;class=C unmapped from router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 13:32:36 2017 ROUTER (debug) Remote destination pulp.;class=C mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 13:33:55 2017 ROUTER (debug) Remote destination pulp.;class=C unmapped from router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 13:33:57 2017 ROUTER (debug) Remote destination pulp.;class=C mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
#

# grep mapped /var/log/qdrouterd.log | grep "pulp.;class=D"
Fri Apr 21 08:49:01 2017 ROUTER (debug) Remote destination pulp.;class=D mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Fri Apr 21 08:52:03 2017 ROUTER (debug) Remote destination pulp.;class=D mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Fri Apr 21 09:04:10 2017 ROUTER (debug) Remote destination pulp.;class=D mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Fri Apr 21 09:10:26 2017 ROUTER (debug) Remote destination pulp.;class=D mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Fri Apr 21 09:13:07 2017 ROUTER (debug) Remote destination pulp.;class=D mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:00:26 2017 ROUTER (debug) Remote destination pulp.;class=D mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:11:20 2017 ROUTER (debug) Remote destination pulp.;class=D mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:23:18 2017 ROUTER (debug) Remote destination pulp.;class=D mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:25:52 2017 ROUTER (debug) Remote destination pulp.;class=D mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:27:12 2017 ROUTER (debug) Remote destination pulp.;class=D unmapped from router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 12:34:04 2017 ROUTER (debug) Remote destination pulp.;class=D mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 13:16:12 2017 ROUTER (debug) Remote destination pulp.;class=D mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 13:32:36 2017 ROUTER (debug) Remote destination pulp.;class=D mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 13:33:55 2017 ROUTER (debug) Remote destination pulp.;class=D unmapped from router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
Sat Apr 22 13:33:57 2017 ROUTER (debug) Remote destination pulp.;class=D mapped to router pmoravec-sat62-bz1426242.gsslab.brq.redhat.com
#

Why every "class=C mapped" is followed (some time later on) by "class=C unmapped", but the same isnt true for class=D .

The reason is described in https://issues.apache.org/jira/browse/DISPATCH-749 and explained by fact that 'C' addresses are mapped on odd positions while 'D' on even, and the current "removal" deletes just odd positions.

Applying the fix from the https://github.com/apache/qpid-dispatch/pull/160, reproducer does not trigger the infinite no-route-to-host further more \o/ .

Comment 40 Pavel Moravec 2017-04-26 07:03:15 UTC
Successfully pre-verified this on 0.4-22 build.

Reproducer steps:

0) setup:
- Satellite on a VM you can control
- external Capsule
- client registered to the Capsule with goferd running

1) on Satellite:

service qdrouterd restart

2) Immediately(*) after that, suspend the VM with Satellite. "Immediately" means humen-acceptably delay, like execute the restart and click to Suspend after 10 seconds.

3) wait at least 2 minutes

4) Resume the VM

5) on the client system, restart goferd (just to ensure it will re-try to connect)


Current behaviour:
below error appears forever (with some delays between goferd connection attempts)
Apr 26 08:44:16 pmoravec-rhel72 goferd: [ERROR][worker-0] gofer.messaging.adapter.proton.reliability:47 - sender 13e1494a-b3ff-4cde-a15f-47dd22b16c22 to qmf.default.direct closed due to: Condition('qd:no-route-to-dest', 'No route to the destination node')

(optionally, "receiver" can be affected as well, but in 99% it is the sender)

Expected behaviour:
no such error (or say as some very temporary, in case you restart goferd "too soon", before qdrouterd on Caps re-connects to the Sat and recalculates routing rules)

Comment 41 Daniele Palumbo 2017-04-26 20:50:38 UTC
Hi Pavel,

"suspend" would means also take a vmware snapshot that last for some time?

Comment 42 Pavel Moravec 2017-04-26 21:00:45 UTC
(In reply to Daniele Palumbo from comment #41)
> Hi Pavel,
> 
> "suspend" would means also take a vmware snapshot that last for some time?

I guess so - something that makes the Satellite not responsive both in qdrouterd process and on kernel level, such that the inter-qdrouterd connection has absolutely silent Satellite's end (but when I tried to mimic this via iptables, no luck with repro - there must be something more..)

Anyway the above reproducer is one that trigger the relevant code, there can be more ways how to trigger the bug, I guess.

Comment 44 Mike McCune 2017-04-28 15:49:31 UTC
*** Hotfix instructions for 6.2.9 and prior versions of 6.2 ***


1) download libqpid-dispatch-0.4-22.el7sat.x86_64.rpm and qpid-dispatch-router-0.4-22.el7sat.x86_64.rpm from this bug

2) copy them to your Satellite and Capsules 

3) upgrade the 2 packages on both the Satellite and Capsules: 

# yum update ./libqpid-dispatch-0.4-22.el7sat.x86_64.rpm ./qpid-dispatch-router-0.4-22.el7sat.x86_64.rpm 

4) restart services 

# katello-service restart 

5) Proceed with operations

Comment 45 Mike McCune 2017-04-28 15:50:15 UTC
Created attachment 1274970 [details]
hotfix version

Comment 46 Mike McCune 2017-04-28 15:50:49 UTC
Created attachment 1274971 [details]
hotfix version

Comment 47 Natale Vinto 2017-05-04 14:51:38 UTC
Hotfix worked for me with Satellite 6.2.9 with this issue on some Hosts, applied on Satellite and Capsules, don't forget to:

# systemctl daemon-reload 

due change in qrouterd.service before restarting katello-service, and restart goferd on Hosts

Thanks

Comment 49 Pavel Moravec 2017-05-22 06:27:37 UTC
Created attachment 1280909 [details]
[hotfix] qpid-dispatch-tools package

(In reply to Mike McCune from comment #44)
> *** Hotfix instructions for 6.2.9 and prior versions of 6.2 ***
> 
> 
> 1) download libqpid-dispatch-0.4-22.el7sat.x86_64.rpm and
> qpid-dispatch-router-0.4-22.el7sat.x86_64.rpm from this bug
> 
> 2) copy them to your Satellite and Capsules 
> 
> 3) upgrade the 2 packages on both the Satellite and Capsules: 
> 
> # yum update ./libqpid-dispatch-0.4-22.el7sat.x86_64.rpm
> ./qpid-dispatch-router-0.4-22.el7sat.x86_64.rpm 
> 
> 4) restart services 
> 
> # katello-service restart 
> 
> 5) Proceed with operations

There is a dependency to qpid-dispatch-tools package that needs to be updated too, if present. Attaching it here.

Comment 50 Koen.Diels 2017-05-30 07:51:23 UTC
+1 also affected on 6.2.9 with capsule server.

Comment 52 Lukas Pramuk 2017-06-08 16:13:26 UTC
VERIFIED.

@satellite-capsule-6.2.10-3.0.el7sat.noarch
libqpid-dispatch-0.4-22.el7sat.x86_64
qpid-dispatch-router-0.4-22.el7sat.x86_64

using manual reproducer described in comment#40 :

5) service goferd restart

REPRO:

Jun 08 15:33:17 vm1.example.com systemd[1]: Starting Gofer Agent...
Jun 08 15:33:17 vm1.example.com goferd[3072]: [INFO][Thread-1] gofer.rmi.store:114 - Using: /var/lib/gofer/messaging/pending/demo
Jun 08 15:33:17 vm1.example.com goferd[3072]: [WARNING][MainThread] gofer.agent.plugin:639 - plugin:demo, DISABLED
Jun 08 15:33:17 vm1.example.com goferd[3072]: [INFO][Thread-2] gofer.rmi.store:114 - Using: /var/lib/gofer/messaging/pending/katelloplugin
Jun 08 15:33:17 vm1.example.com goferd[3072]: [INFO][Thread-3] gofer.rmi.store:114 - Using: /var/lib/gofer/messaging/pending/katelloplugin
Jun 08 15:33:17 vm1.example.com goferd[3072]: [INFO][MainThread] gofer.agent.plugin:682 - plugin:katelloplugin loaded using: /usr/lib/gofer/plugins/katelloplugin.py
Jun 08 15:33:17 vm1.example.com goferd[3072]: [INFO][MainThread] rhsm.connection:830 - Connection built: host=cap.example.com port=8443 handler=/rhsm auth=identity_cert ca_dir=/etc/rhsm/ca/ verify=False
Jun 08 15:33:18 vm1.example.com goferd[3072]: [INFO][MainThread] katelloplugin:178 - Using /etc/rhsm/ca/katello-default-ca.pem as the ca cert for qpid connection
Jun 08 15:33:18 vm1.example.com goferd[3072]: [INFO][MainThread] rhsm.connection:830 - Connection built: host=cap.example.com port=8443 handler=/rhsm auth=identity_cert ca_dir=/etc/rhsm/ca/ verify=False
Jun 08 15:33:18 vm1.example.com goferd[3072]: Loaded plugins: product-id
Jun 08 15:33:18 vm1.example.com goferd[3072]: [INFO][MainThread] katelloplugin:409 - reporting: {'enabled_repos': {'repos': [{'baseurl': ['https://cap.example.com/pulp/repos/Default_Organization/Library/content/dist/rhel/server/7/7Server/x86_64/os'], 'repositoryid': 'rhel-7-server-rpms'}, {'baseurl': ['https://cap.example.com/pulp/repos/Default_Organization/Library/content/dist/rhel/server/7/7Server/x86_64/sat-tools/6.2/os'], 'repositoryid': 'rhel-7-server-satellite-tools-6.2-rpms'}]}}
Jun 08 15:33:18 vm1.example.com goferd[3072]: [INFO][worker-0] gofer.messaging.adapter.connect:28 - connecting: proton+amqps://cap.example.com:5647
Jun 08 15:33:18 vm1.example.com goferd[3072]: [INFO][worker-0] gofer.messaging.adapter.proton.connection:87 - open: URL: amqps://cap.example.com:5647|SSL: ca: /etc/rhsm/ca/katello-default-ca.pem|key: None|certificate: /etc/pki/consumer/bundle.pem|host-validation: None
Jun 08 15:33:18 vm1.example.com goferd[3072]: [INFO][MainThread] gofer.agent.main:87 - agent started.
Jun 08 15:33:18 vm1.example.com goferd[3072]: [INFO][worker-0] root:490 - connecting to cap.example.com:5647...
Jun 08 15:33:18 vm1.example.com goferd[3072]: [INFO][worker-0] gofer.messaging.adapter.proton.connection:92 - opened: proton+amqps://cap.example.com:5647
Jun 08 15:33:18 vm1.example.com goferd[3072]: [INFO][worker-0] gofer.messaging.adapter.connect:30 - connected: proton+amqps://cap.example.com:5647
Jun 08 15:33:18 vm1.example.com goferd[3072]: [INFO][worker-0] root:512 - connected to cap.example.com:5647
Jun 08 15:33:18 vm1.example.com goferd[3072]: [ERROR][worker-0] gofer.messaging.adapter.proton.reliability:47 - sender b7d01549-a41d-4d61-a774-099b6f325a33 to qmf.default.direct closed due to: Condition('qd:no-route-to-dest', 'No route to the destination node')


vs.


FIX:

Jun 08 15:55:47 vm1.example.com systemd[1]: Starting Gofer Agent...
Jun 08 15:55:47 vm1.example.com goferd[3116]: [INFO][Thread-1] gofer.rmi.store:114 - Using: /var/lib/gofer/messaging/pending/demo
Jun 08 15:55:47 vm1.example.com goferd[3116]: [WARNING][MainThread] gofer.agent.plugin:639 - plugin:demo, DISABLED
Jun 08 15:55:47 vm1.example.com goferd[3116]: [INFO][Thread-2] gofer.rmi.store:114 - Using: /var/lib/gofer/messaging/pending/katelloplugin
Jun 08 15:55:47 vm1.example.com goferd[3116]: [INFO][Thread-3] gofer.rmi.store:114 - Using: /var/lib/gofer/messaging/pending/katelloplugin
Jun 08 15:55:47 vm1.example.com goferd[3116]: [INFO][MainThread] gofer.agent.plugin:682 - plugin:katelloplugin loaded using: /usr/lib/gofer/plugins/katelloplugin.py
Jun 08 15:55:48 vm1.example.com goferd[3116]: [INFO][MainThread] rhsm.connection:830 - Connection built: host=cap.example.com port=8443 handler=/rhsm auth=identity_cert ca_dir=/etc/rhsm/ca/ verify=False
Jun 08 15:56:39 vm1.example.com goferd[3116]: [INFO][MainThread] katelloplugin:178 - Using /etc/rhsm/ca/katello-default-ca.pem as the ca cert for qpid connection
Jun 08 15:56:39 vm1.example.com goferd[3116]: [INFO][MainThread] rhsm.connection:830 - Connection built: host=cap.example.com port=8443 handler=/rhsm auth=identity_cert ca_dir=/etc/rhsm/ca/ verify=False
Jun 08 15:56:39 vm1.example.com goferd[3116]: [INFO][MainThread] katelloplugin:409 - reporting: {'enabled_repos': {'repos': [{'baseurl': ['https://cap.example.com/pulp/repos/Default_Organization/Library/content/dist/rhel/server/7/7Server/x86_64/os'], 'repositoryid': 'rhel-7-server-rpms'}, {'baseurl': ['https://cap.example.com/pulp/repos/Default_Organization/Library/content/dist/rhel/server/7/7Server/x86_64/sat-tools/6.2/os'], 'repositoryid': 'rhel-7-server-satellite-tools-6.2-rpms'}]}}
Jun 08 15:56:39 vm1.example.com goferd[3116]: Loaded plugins: product-id
Jun 08 15:56:39 vm1.example.com goferd[3116]: [INFO][worker-0] gofer.messaging.adapter.connect:28 - connecting: proton+amqps://cap.example.com:5647
Jun 08 15:56:39 vm1.example.com goferd[3116]: [INFO][worker-0] gofer.messaging.adapter.proton.connection:87 - open: URL: amqps://cap.example.com:5647|SSL: ca: /etc/rhsm/ca/katello-default-ca.pem|key: None|certificate: /etc/pki/consumer/bundle.pem|host-validation: None
Jun 08 15:56:39 vm1.example.com goferd[3116]: [INFO][MainThread] gofer.agent.main:87 - agent started.
Jun 08 15:56:39 vm1.example.com goferd[3116]: [INFO][worker-0] root:490 - connecting to cap.example.com:5647...
Jun 08 15:56:39 vm1.example.com goferd[3116]: [INFO][worker-0] gofer.messaging.adapter.proton.connection:92 - opened: proton+amqps://cap.example.com:5647
Jun 08 15:56:39 vm1.example.com goferd[3116]: [INFO][worker-0] gofer.messaging.adapter.connect:30 - connected: proton+amqps://cap.example.com:5647
Jun 08 15:56:39 vm1.example.com goferd[3116]: [INFO][worker-0] root:512 - connected to cap.example.com:5647
Jun 08 15:56:40 vm1.example.com goferd[3116]: [INFO][worker-0] gofer.messaging.adapter.proton.connection:131 - closed: proton+amqps://cap.example.com:5647
Jun 08 15:56:40 vm1.example.com goferd[3116]: [INFO][pulp.agent.e0092f2a-e0a6-4a43-9979-2b0cd784d21d] gofer.messaging.adapter.connect:28 - connecting: proton+amqps://cap.example.com:5647
Jun 08 15:56:40 vm1.example.com goferd[3116]: [INFO][pulp.agent.e0092f2a-e0a6-4a43-9979-2b0cd784d21d] gofer.messaging.adapter.proton.connection:87 - open: URL: amqps://cap.example.com:5647|SSL: ca: /etc/rhsm/ca/katello-default-ca.pem|key: None|certificate: /etc/pki/consumer/bundle.pem|host-validation: None
Jun 08 15:56:40 vm1.example.com goferd[3116]: [INFO][worker-0] gofer.agent.plugin:368 - plugin:katelloplugin, attached => pulp.agent.e0092f2a-e0a6-4a43-9979-2b0cd784d21d
Jun 08 15:56:40 vm1.example.com goferd[3116]: [INFO][pulp.agent.e0092f2a-e0a6-4a43-9979-2b0cd784d21d] root:490 - connecting to cap.example.com:5647...
Jun 08 15:56:40 vm1.example.com goferd[3116]: [INFO][pulp.agent.e0092f2a-e0a6-4a43-9979-2b0cd784d21d] gofer.messaging.adapter.proton.connection:92 - opened: proton+amqps://cap.example.com:5647
Jun 08 15:56:40 vm1.example.com goferd[3116]: [INFO][pulp.agent.e0092f2a-e0a6-4a43-9979-2b0cd784d21d] gofer.messaging.adapter.connect:30 - connected: proton+amqps://cap.example.com:5647
Jun 08 15:56:40 vm1.example.com goferd[3116]: [INFO][pulp.agent.e0092f2a-e0a6-4a43-9979-2b0cd784d21d] root:512 - connected to cap.example.com:5647

>>> the error Condition('qd:no-route-to-dest', 'No route to the destination node') no longer appears after the Capsule<->Satellite connection is re-established

Comment 54 errata-xmlrpc 2017-06-20 17:22:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1553


Note You need to log in before you can comment on or make changes to this bug.