Description of problem: Having 6 Capsules (the more Capsules the more evident / affecting it is) and (re)starting qdrouterd on Sat and the Capsules altogether, it can take even 10 minutes(!) until qdrouterd on Satellite informs routers on Capsule that they can route pulp.* and qmf.* requests via it. That time seems to be very random (sometimes few seconds, thats fully OK, sometimes 30 seconds (bit problematic), sometimes 6 minutes (ouch), once 10 minutes. Who knows whats the upper limit, if there is any. The information is sent to Capsules at random times, usually the very first Capsule connected gets the update as the latest one: (everything started at 11:51:38 - 11:51:45) qdrouterd-S.log:Thu Apr 6 11:51:38 2017 ROUTER_MA (trace) SENT: MAU(id=satellite-router area=0 mobile_seq=1 add=['Cpulp.', 'Dpulp.', 'Cqmf.', 'Dqmf.'] del=[]) qdrouterd-S.log:Thu Apr 6 11:51:40 2017 ROUTER_MA (trace) SENT: MAU(id=satellite-router area=0 mobile_seq=1 add=['Cpulp.', 'Dpulp.', 'Cqmf.', 'Dqmf.'] del=[]) qdrouterd-S.log:Thu Apr 6 11:51:44 2017 ROUTER_MA (trace) SENT: MAU(id=satellite-router area=0 mobile_seq=1 add=['Cpulp.', 'Dpulp.', 'Cqmf.', 'Dqmf.'] del=[]) qdrouterd-S.log:Thu Apr 6 11:52:41 2017 ROUTER_MA (trace) SENT: MAU(id=satellite-router area=0 mobile_seq=1 add=['Cpulp.', 'Dpulp.', 'Cqmf.', 'Dqmf.'] del=[]) qdrouterd-S.log:Thu Apr 6 12:00:53 2017 ROUTER_MA (trace) SENT: MAU(id=satellite-router area=0 mobile_seq=1 add=['Cpulp.', 'Dpulp.', 'Cqmf.', 'Dqmf.'] del=[]) qdrouterd-S.log:Thu Apr 6 12:02:22 2017 ROUTER_MA (trace) SENT: MAU(id=satellite-router area=0 mobile_seq=1 add=['Cpulp.', 'Dpulp.', 'Cqmf.', 'Dqmf.'] del=[]) This impacts goferd on clients registered to Capsules since they are disconnected meanwhile. This is _not_ bz1426242 where traces show the MAU message was sent to the Capsule's routers. Version-Release number of selected component (if applicable): qpid-dispatch-router-0.4-21.el7sat.x86_64 How reproducible: 100% within few tests - it is nondeterministic behaviour, sometimes the address advertising is fast, sometimes very slow Steps to Reproduce: 0. Reproducer contains only qpid bits and emulates everything (Satellite, Capsules and a client) on just one machine. 1. Have a system with qpidd and qdrouterd and proton reactor installed, with qpidd default config (will be run as a service) 2. Run reproducer that will: - restart qpidd service - start qdrouterd listening on 5646 for inter-router traffic and connected to qpidd port 5672 - start qdrouterd-S listening on 5647 to client connections and link-routing everything via qdrouterd on 5646 - start another 5 qdrouterd-C-others routers listening on ports 5648 - 5642 and link-routing via 5646 - run a pytho script trying to open a consumer of qmf.* address via 5647 tar xvzf qdrouterd-very-late-MAU.tar.gz cd qdrouterd-very-late-MAU ./start_qpid_broker_and_routers.sh python simulate_first_goferd_run-in-loop.py 3. Check when the python client starts to write "OK" 4. grep MAU qdrouterd-S.log Actual results: 3. can take minutes 4. grep MAU output like above Expected results: 3. to take up to 10 seconds 4. MAU appears within 10 seconds since start Additional info: - Adding more routers will probably increase the (average) delay of sending MAU. - the attached tarball contains all the configs and start script, and also *replies-after-6mins-with-MAU logfiles from one such test
Cc Ganesh. Recall, this is _not_ bz1426242 since here the problem is on _sending_ the MAU updates (while 1426242 is about ignoring(?) them on receive) and here the 'qd:no-route-to-dest' is resolved once upon a time (in 1426242, it persists forever).
Created attachment 1269323 [details] reproducer stuff and logs
Pavel, Is it possible that you have more than one capsule router with the same router-id?
(In reply to Ted Ross from comment #4) > Pavel, > > Is it possible that you have more than one capsule router with the same > router-id? Mea culpa. Forgot to customize router-id in the copies of router conf. Once I fixed this, MAU messages are sent from the Satellite's router few seconds after the peer router is connected. Sorry for false alarm.
For the record: In later versions of qpid-dispatch, router-id duplicate detection is improved. This situation would have been highlighted in the logs with CRITICAL events.