Bug 1439617 - After concurrent startup, qdrouterd on Satellite delays many minutes before advertising qpidd is up
Summary: After concurrent startup, qdrouterd on Satellite delays many minutes before a...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Qpid
Version: 6.2.7
Hardware: x86_64
OS: Linux
medium
high
Target Milestone: Unspecified
Assignee: Mike Cressman
QA Contact: Katello QA List
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2017-04-06 10:25 UTC by Pavel Moravec
Modified: 2017-04-06 18:06 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-04-06 17:12:31 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
reproducer stuff and logs (391.89 KB, application/x-gzip)
2017-04-06 10:37 UTC, Pavel Moravec
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1426242 0 high CLOSED goferd disconnects with no route to destination node on client 2021-08-30 12:18:11 UTC

Internal Links: 1426242

Description Pavel Moravec 2017-04-06 10:25:28 UTC
Description of problem:
Having 6 Capsules (the more Capsules the more evident / affecting it is) and (re)starting qdrouterd on Sat and the Capsules altogether, it can take even 10 minutes(!) until qdrouterd on Satellite informs routers on Capsule that they can route pulp.* and qmf.* requests via it.

That time seems to be very random (sometimes few seconds, thats fully OK, sometimes 30 seconds (bit problematic), sometimes 6 minutes (ouch), once 10 minutes. Who knows whats the upper limit, if there is any. The information is sent to Capsules at random times, usually the very first Capsule connected gets the update as the latest one:

(everything started at 11:51:38 - 11:51:45)
qdrouterd-S.log:Thu Apr  6 11:51:38 2017 ROUTER_MA (trace) SENT: MAU(id=satellite-router area=0 mobile_seq=1 add=['Cpulp.', 'Dpulp.', 'Cqmf.', 'Dqmf.'] del=[])
qdrouterd-S.log:Thu Apr  6 11:51:40 2017 ROUTER_MA (trace) SENT: MAU(id=satellite-router area=0 mobile_seq=1 add=['Cpulp.', 'Dpulp.', 'Cqmf.', 'Dqmf.'] del=[])
qdrouterd-S.log:Thu Apr  6 11:51:44 2017 ROUTER_MA (trace) SENT: MAU(id=satellite-router area=0 mobile_seq=1 add=['Cpulp.', 'Dpulp.', 'Cqmf.', 'Dqmf.'] del=[])
qdrouterd-S.log:Thu Apr  6 11:52:41 2017 ROUTER_MA (trace) SENT: MAU(id=satellite-router area=0 mobile_seq=1 add=['Cpulp.', 'Dpulp.', 'Cqmf.', 'Dqmf.'] del=[])
qdrouterd-S.log:Thu Apr  6 12:00:53 2017 ROUTER_MA (trace) SENT: MAU(id=satellite-router area=0 mobile_seq=1 add=['Cpulp.', 'Dpulp.', 'Cqmf.', 'Dqmf.'] del=[])
qdrouterd-S.log:Thu Apr  6 12:02:22 2017 ROUTER_MA (trace) SENT: MAU(id=satellite-router area=0 mobile_seq=1 add=['Cpulp.', 'Dpulp.', 'Cqmf.', 'Dqmf.'] del=[])


This impacts goferd on clients registered to Capsules since they are disconnected meanwhile.

This is _not_ bz1426242 where traces show the MAU message was sent to the Capsule's routers.


Version-Release number of selected component (if applicable):
qpid-dispatch-router-0.4-21.el7sat.x86_64


How reproducible:
100% within few tests - it is nondeterministic behaviour, sometimes the address advertising is fast, sometimes very slow


Steps to Reproduce:
0. Reproducer contains only qpid bits and emulates everything (Satellite, Capsules and a client) on just one machine.
1. Have a system with qpidd and qdrouterd and proton reactor installed, with qpidd default config (will be run as a service)
2. Run reproducer that will:
   - restart qpidd service
   - start qdrouterd listening on 5646 for inter-router traffic and connected to qpidd port 5672
   - start qdrouterd-S listening on 5647 to client connections and link-routing everything via qdrouterd on 5646
   - start another 5 qdrouterd-C-others routers listening on ports 5648 - 5642 and link-routing via 5646
   - run a pytho script trying to open a consumer of qmf.* address via 5647

tar xvzf qdrouterd-very-late-MAU.tar.gz
cd qdrouterd-very-late-MAU
./start_qpid_broker_and_routers.sh
python simulate_first_goferd_run-in-loop.py

3. Check when the python client starts to write "OK"
4. grep MAU qdrouterd-S.log


Actual results:
3. can take minutes
4. grep MAU output like above


Expected results:
3. to take up to 10 seconds
4. MAU appears within 10 seconds since start


Additional info:
- Adding more routers will probably increase the (average) delay of sending MAU.
- the attached tarball contains all the configs and start script, and also *replies-after-6mins-with-MAU logfiles from one such test

Comment 2 Pavel Moravec 2017-04-06 10:36:47 UTC
Cc Ganesh.

Recall, this is _not_ bz1426242 since here the problem is on _sending_ the MAU updates (while 1426242 is about ignoring(?) them on receive) and here the 'qd:no-route-to-dest' is resolved once upon a time (in 1426242, it persists forever).

Comment 3 Pavel Moravec 2017-04-06 10:37:26 UTC
Created attachment 1269323 [details]
reproducer stuff and logs

Comment 4 Ted Ross 2017-04-06 16:41:00 UTC
Pavel,

Is it possible that you have more than one capsule router with the same router-id?

Comment 5 Pavel Moravec 2017-04-06 17:12:31 UTC
(In reply to Ted Ross from comment #4)
> Pavel,
> 
> Is it possible that you have more than one capsule router with the same
> router-id?

Mea culpa. Forgot to customize router-id in the copies of router conf.

Once I fixed this, MAU messages are sent from the Satellite's router few seconds after the peer router is connected.

Sorry for false alarm.

Comment 6 Ted Ross 2017-04-06 18:06:34 UTC
For the record:  In later versions of qpid-dispatch, router-id duplicate detection is improved.  This situation would have been highlighted in the logs with CRITICAL events.


Note You need to log in before you can comment on or make changes to this bug.