Bug 836141 - destination cluster de-sync when federation link used for a longer time
Summary: destination cluster de-sync when federation link used for a longer time
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 2.1
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: 2.3
Assignee: mick
QA Contact: Zdenek Kraus
URL:
Whiteboard:
Depends On:
Blocks: 866627
 
Reported: 2012-06-28 08:10 UTC by Pavel Moravec
Modified: 2018-12-01 18:23 UTC
CC: 10 users

Fixed In Version: qpid-cpp-0.18-4
Doc Type: Bug Fix
Doc Text:
BZ 836141
---------------------------------
A bug report in iambic pentameter, by Mick Goulish.

Cause:
The eldest cluster broker sends commands
Creates its federation link and bridge,
But leaves behind its juniors, all confused.
Not privy to these mysteries, they soon
Fall out of sync, and drag the cluster down.

Consequence:
Without agreement, how can clusters run?
If elder and its juniors disagree,
Divide against itself this fragile house,
How can it stand? How can the perfect
synchrony exist? It can't. And therein lies the rub.

Fix:
The elder broker builds its links and bridge,
But tells the youngsters: "Do thou as I do!"
They too can build that bridge, maintaining it
Against the day when elder fails. All their
Commands are thus maintained in sync.

Result:
The cluster hums, the traffic leaps and flows.
Where once was strife and error, harmony has come.
The synchrony will no doubt be disturbed again.
But not by this BZ.
Clone Of:
: 866627 (view as bug list)
Environment:
Last Closed: 2013-03-06 18:56:15 UTC
Target Upstream Version:
Embargoed:


Attachments
reproducer stuff (1013 bytes, application/x-compressed-tar) - 2012-08-09 08:47 UTC, Pavel Moravec
replication script (1.22 KB, application/x-sh) - 2012-08-13 08:07 UTC, Pavel Moravec
Patch based on 0.10-8, does not work. (11.25 KB, patch) - 2012-08-13 21:56 UTC, Alan Conway
Alan's fix ported to proper hotfix 0.10-8_gs_2 (11.33 KB, application/octet-stream) - 2012-08-14 18:43 UTC, Pavel Moravec
patch - current (my final) status (101.83 KB, patch) - 2012-08-22 11:20 UTC, Pavel Moravec
qpid traces from GS hotfix 4 with cluster de-sync (123.59 KB, application/x-gzip) - 2012-09-05 07:27 UTC, Pavel Moravec
additional patch fixing failover problems (1.08 KB, patch) - 2012-09-20 09:34 UTC, Pavel Moravec


Links
Red Hat Product Errata RHSA-2013:0561 (normal, SHIPPED_LIVE) - Moderate: Red Hat Enterprise MRG Messaging 2.3 security update - last updated 2013-03-06 23:48:13 UTC

Description Pavel Moravec 2012-06-28 08:10:39 UTC
Description of problem:
Using queue state replication from a broker (possibly clustered - this does not matter) to a cluster of brokers causes the destination cluster to de-sync after a long time:

2012-06-28 08:28:30 critical Error delivering frames: local error did not occur on all cluster members : invalid-argument: @QPID.77153a41-7531-47f6-bf55-b30ffed69922: confirmed < (4799+0) but only sent < (4797+0) (qpid/SessionState.cpp:154) (qpid/cluster/ErrorCheck.cpp:89)



Version-Release number of selected component (if applicable):
every version checked


How reproducible:
Depends on run time, but roughly 10% for the default scenario.


Steps to Reproduce:
(ideally, if possible, rebuild qpid with cpp/src/qpid/SessionState.cpp changed to static const uint32_t SPONTANEOUS_REQUEST_INTERVAL = 64 to dramatically speed up the reproducer; see the sketch after these steps)

1) Have a source broker (or cluster, this does not matter) and a dest.cluster with queue state replication of just one queue from the source to the dest.cluster.
2) On the federation route, set --ack to some low number (to speed up replication; I used --ack 5).
3) Randomly produce and consume messages on the src.broker, on the queue being replicated - ideally, alternate the enqueues and dequeues as much as possible. I don't know why, but more alternation speeds up the reproducer as well.
4) Now, be patient. After SPONTANEOUS_REQUEST_INTERVAL (by default 64k) synchronization messages have been sent _from_ the backup cluster (which requires around 100 times more messages to be enqueued and dequeued on the replicated queue), there is a chance of hitting the bug. Once it was hit on the first attempt (after 2^16 = 64k such synchronization messages), once only after 720896 messages (in the 11th "round" / "trial").
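
A minimal standalone sketch (not qpid code; the loop and counters below are invented) of the "spontaneous request" behaviour that step 4 waits for: every SPONTANEOUS_REQUEST_INTERVAL commands received, the peer sends an unsolicited completion, which is why lowering the constant in cpp/src/qpid/SessionState.cpp from its default of 65536 to 64 shortens the reproducer dramatically.

// Toy model only - illustrates the spontaneous completion counter, not the real SessionState code.
#include <cstdint>
#include <iostream>

static const uint32_t SPONTANEOUS_REQUEST_INTERVAL = 64;  // qpid default is 65536

int main() {
    uint64_t received = 0;
    uint64_t unsolicitedCompletions = 0;
    for (int i = 0; i < 200; ++i) {                        // pretend 200 commands arrive
        ++received;
        if (received % SPONTANEOUS_REQUEST_INTERVAL == 0)
            ++unsolicitedCompletions;                      // an unsolicited completion goes out here
    }
    std::cout << "unsolicited completions sent: " << unsolicitedCompletions << "\n";
    return 0;
}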

  
Actual results:
All brokers in dst.cluster - except the one that has the fed.link established - shut down with log:

2012-06-27 15:39:46 critical Error delivering frames: local error did not occur on all cluster members : invalid-argument: @QPID.314e73e8-8bc3-4f5a-b77d-6bdd4ee17e39: confirmed < (720895+0) but only sent < (720893+0) (qpid/SessionState.cpp:154) (qpid/cluster/ErrorCheck.cpp:89)


Expected results:
No such cluster de-sync


Additional info:
- interesting fact: I was able to reproduce it using queue state replication only. Although the bug is in the federation link session, using a fed.link without queue state replication did not lead to the bug.

- the difference comes from the _beginning_ of the session communication: per the traces, these AMQP commands sent from the dst.cluster to the source are _not_ replayed by (or even multicast to) the "other dst.brokers" (which have the session / connection as a shadow, not local). So these messages are not replayed:

2012-06-27 07:12:09 trace @QPID.2d7fe3c3-b0de-4f36-a028-23ffaed6e9a5: sent cmd 0: {MessageSubscribeBody: queue=replication-queue; destination=replication-exchange; accept-mode=0; acquire-mode=0; resume-id resume-ttl=0; arguments={qpid.sync_frequency:F4:int32(100)}; }
2012-06-27 07:12:09 trace @QPID.2d7fe3c3-b0de-4f36-a028-23ffaed6e9a5: sent cmd 1: {MessageFlowBody: destination=replication-exchange; unit=0; value=4294967295; }
2012-06-27 07:12:09 trace @QPID.2d7fe3c3-b0de-4f36-a028-23ffaed6e9a5: sent cmd 2: {MessageFlowBody: destination=replication-exchange; unit=1; value=4294967295; }

Comment 1 Pavel Moravec 2012-06-28 11:21:58 UTC
In the Bridge::create method in ./cpp/src/qpid/broker/Bridge.cpp, these lines are not replayed on shadow connections:

        peer->getMessage().subscribe(args.i_src, args.i_dest, args.i_sync ? 0 : 1, 0, false, "", 0, options);
        peer->getMessage().flow(args.i_dest, 0, 0xFFFFFFFF);
        peer->getMessage().flow(args.i_dest, 1, 0xFFFFFFFF);
        QPID_LOG(debug, "Activated route from queue " << args.i_src << " to " << args.i_dest);

That's why the "shadow brokers" (i.e. those that have the fed.link only as a shadow connection, not a local one) don't execute these methods directly - so far ok, I think. But...

The above is not executed on shadow brokers because the Link::maintenanceVisit() method in ./cpp/src/qpid/broker/Link.cpp takes a different path for them. In particular:

..
else if (state == STATE_OPERATIONAL && (!created.empty() || !cancellations.empty()) && connection != 0) 
        connection->requestIOProcessing (boost::bind(&Link::ioThreadProcessing, this));

the if statement evaluates to false (since state = STATE_PASSIVE and connection = 0), so connection->requestIOProcessing (which in turn calls Link::ioThreadProcessing, which calls Bridge::create) is not executed on "shadow" brokers.
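
For illustration, a minimal standalone sketch (simplified, invented names - not qpid code) of why the guard quoted above never fires on a shadow broker:

// Toy model only - evaluates the Link::maintenanceVisit guard with the values a shadow broker holds.
#include <iostream>

enum State { STATE_OPERATIONAL, STATE_PASSIVE };

int main() {
    State state = STATE_PASSIVE;      // a shadow broker's link stays passive
    bool havePendingBridges = true;   // created/cancellations may well be non-empty
    void* connection = nullptr;       // no local Connection object on a shadow broker
    bool runsIoThreadProcessing =
        state == STATE_OPERATIONAL && havePendingBridges && connection != nullptr;
    std::cout << std::boolalpha << runsIoThreadProcessing << "\n";  // prints "false"
    return 0;
}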


The question is: what piece in the chain should multicast the three messages (from Bridge::create)?

Comment 2 Pavel Moravec 2012-06-28 12:29:53 UTC
Backtrace of creating a bridge:
-------------------------------

#0  qpid::broker::Bridge::create (this=0x1aa7840, c=...) at qpid/broker/Bridge.cpp:74
#1  0x00007fabed2eee79 in qpid::broker::Link::ioThreadProcessing (this=0x1a9d8f0) at qpid/broker/Link.cpp:339
#2  0x00007fabed2b5fcc in operator() (this=0x1a9dcf0) at /usr/include/boost/function/function_template.hpp:1013
#3  qpid::broker::Connection::doIoCallbacks (this=0x1a9dcf0) at qpid/broker/Connection.cpp:341
#4  0x00007fabed2b6460 in qpid::broker::Connection::doOutput (this=0x1a9dcf0) at qpid/broker/Connection.cpp:348
#5  0x00007fabec215d01 in qpid::cluster::OutputInterceptor::deliverDoOutput (this=0x1aa03d0, limit=2048) at qpid/cluster/OutputInterceptor.cpp:89
#6  0x00007fabece5fbf2 in invoke<qpid::framing::AMQP_AllOperations::ClusterConnectionHandler> (this=0x7fabe7a05a70, body=<value optimized out>)
    at ./qpid/framing/ClusterConnectionDeliverDoOutputBody.h:63
#7  qpid::framing::AMQP_AllOperations::ClusterConnectionHandler::Invoker::visit (this=0x7fabe7a05a70, body=<value optimized out>) at qpid/framing/AllInvoker.cpp:1116
#8  0x00007fabec1f8876 in qpid::framing::invoke<qpid::cluster::Connection> (target=<value optimized out>, body=<value optimized out>) at ./qpid/framing/Invoker.h:80
#9  0x00007fabec1f1e09 in qpid::cluster::Connection::deliveredFrame (this=0x1aa0370, f=<value optimized out>) at qpid/cluster/Connection.cpp:248
#10 0x00007fabec1cb14e in qpid::cluster::Cluster::processFrame (this=<value optimized out>, e=..., l=<value optimized out>) at qpid/cluster/Cluster.cpp:537
#11 0x00007fabec1cf9b8 in qpid::cluster::Cluster::deliveredFrame (this=0x1a91e30, efConst=<value optimized out>) at qpid/cluster/Cluster.cpp:521
#12 0x00007fabec1d8528 in operator() (this=0x1a92448, values=std::deque with 1 elements = {...}) at /usr/include/boost/function/function_template.hpp:1013
#13 qpid::cluster::PollableQueue<qpid::cluster::EventFrame>::handleBatch (this=0x1a92448, values=std::deque with 1 elements = {...}) at ./qpid/cluster/PollableQueue.h:59
#14 0x00007fabec1d2dbc in operator() (function_obj_ptr=<value optimized out>, a0=<value optimized out>) at /usr/include/boost/bind/mem_fn_template.hpp:162
#15 operator()<std::_Deque_iterator<qpid::cluster::EventFrame, const qpid::cluster::EventFrame&, const qpid::cluster::EventFrame*>, boost::_mfi::mf1<std::_Deque_iterator<qpid::cluster::EventFrame, const qpid::cluster::EventFrame&, const qpid::cluster::EventFrame*>, qpid::cluster::PollableQueue<qpid::cluster::EventFrame>, const std::deque<qpid::cluster::EventFrame, std::allocator<qpid::cluster::EventFrame> >&>, boost::_bi::list1<const std::deque<qpid::cluster::EventFrame, std::allocator<qpid::cluster::EventFrame> >&> > (function_obj_ptr=<value optimized out>, 
    a0=<value optimized out>) at /usr/include/boost/bind/bind.hpp:296
#16 operator()<std::deque<qpid::cluster::EventFrame, std::allocator<qpid::cluster::EventFrame> > > (function_obj_ptr=<value optimized out>, a0=<value optimized out>)
    at /usr/include/boost/bind/bind_template.hpp:47
#17 boost::detail::function::function_obj_invoker1<boost::_bi::bind_t<std::_Deque_iterator<qpid::cluster::EventFrame, qpid::cluster::EventFrame const&, qpid::cluster::EventFrame const*>, boost::_mfi::mf1<std::_Deque_iterator<qpid::cluster::EventFrame, qpid::cluster::EventFrame const&, qpid::cluster::EventFrame const*>, qpid::cluster::PollableQueue<qpid::cluster::EventFrame>, std::deque<qpid::cluster::EventFrame, std::allocator<qpid::cluster::EventFrame> > const&>, boost::_bi::list2<boost::_bi::value<qpid::cluster::PollableQueue<qpid::cluster::EventFrame>*>, boost::arg<1> > >, std::_Deque_iterator<qpid::cluster::EventFrame, qpid::cluster::EventFrame const&, qpid::cluster::EventFrame const*>, std::deque<qpid::cluster::EventFrame, std::allocator<qpid::cluster::EventFrame> > const&>::invoke (function_obj_ptr=<value optimized out>, a0=<value optimized out>) at /usr/include/boost/function/function_template.hpp:132
#18 0x00007fabec1dccb2 in operator() (this=0x1a92448) at /usr/include/boost/function/function_template.hpp:1013
#19 qpid::sys::PollableQueue<qpid::cluster::EventFrame>::process (this=0x1a92448) at ./qpid/sys/PollableQueue.h:152
#20 0x00007fabec1e00ac in qpid::sys::PollableQueue<qpid::cluster::EventFrame>::dispatch (this=0x1a92448, cond=...) at ./qpid/sys/PollableQueue.h:138
#21 0x00007fabece08e3d in operator() (this=<value optimized out>) at /usr/include/boost/function/function_template.hpp:1013
#22 qpid::sys::PollableConditionPrivate::dispatch (this=<value optimized out>) at qpid/sys/posix/PollableCondition.cpp:95
#23 0x00007fabecec8223 in boost::function1<void, qpid::sys::DispatchHandle&>::operator() (this=<value optimized out>, a0=<value optimized out>)
    at /usr/include/boost/function/function_template.hpp:1013
#24 0x00007fabecec5426 in qpid::sys::DispatchHandle::processEvent (this=0x1a94fd0, type=qpid::sys::Poller::READABLE) at qpid/sys/DispatchHandle.cpp:280
#25 0x00007fabece0d50d in process (this=0x1a77130) at ./qpid/sys/Poller.h:131
#26 qpid::sys::Poller::run (this=0x1a77130) at qpid/sys/epoll/EpollPoller.cpp:519
#27 0x00007fabece04e9a in qpid::sys::(anonymous namespace)::runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35
#28 0x00000037b5a077f1 in start_thread () from /lib64/libpthread.so.0
#29 0x00000037b56e592d in clone () from /lib64/libc.so.6



The problem is that the IO callbacks (see frame #3) are not cluster safe because they are queued for execution in non-IO threads.

Comment 3 Pavel Moravec 2012-07-03 10:54:09 UTC
The issue is more general - it occurs regardless of queue state replication, as it depends only on long-term use of the federation link.

(the replication scenario is still valid, anyway)

Comment 9 Pavel Moravec 2012-08-09 08:47:40 UTC
Created attachment 603210 [details]
reproducer stuff

Content:
- reproducer_836141.sh (the main script)
- directories 4001 5001 5002 (for running brokers on those ports; 4001 is the src. broker, 5001 and 5002 are the two dst. brokers in the cluster, 5001 is the elder node, 5002 is the one being restarted)
- run_spout_drain.sh (to run spout and drain; assuming spout and drain are in $PATH)

Comment 10 Pavel Moravec 2012-08-10 11:41:14 UTC
Comment on attachment 603210 [details]
reproducer stuff

Sorry for the wrong reproducer scripts; it is quite tricky.

Proper reproducer:
1) create queue state replication etc., with "--ack 5" in qpid-route.
2) send 163800 messages to dataQ and in parallel consume them (i.e. to generate 163800*2 event messages on the fed.link)
3) slowly send 100 messages to the queue (i.e. 2 messages per second). Around the 70th message, the non-elder dst.broker shuts down.

Comment 12 Alan Conway 2012-08-10 21:43:59 UTC
A proposed patch for this issue is https://reviews.apache.org/r/6493/
The patch is based on trunk r1370792.
I have not yet been able to reproduce the problem, so the patch is not confirmed.

Comment 13 Pavel Moravec 2012-08-13 08:07:49 UTC
Created attachment 603869 [details]
replication script

Attached is the proper version of the reproducer script. After the 65th message is sent in the "until egrep" loop, broker 5002 (the new joiner) shuts down.

Comment 14 Alan Conway 2012-08-13 21:56:09 UTC
Created attachment 604098 [details]
Patch based on 0.10-8, does not work.

It is not straightforward to back-port the fix to 0.10. The attached patch is an attempt, but it fails the test because in this earlier code bridges are named with a UUID, which is not consistent across the cluster. We need to synchronize bridge names to get this to work.

Comment 15 Alan Conway 2012-08-13 22:20:40 UTC
Summary of status

Background: If you create a link to a cluster, only the elder broker creates a connection and sets up a bridge subscription. The other brokers do get a shadow connection but they don't create a link or a bridge - they treat it like a regular client connection.

Problem: To subscribe the bridge, the elder sends MessageSubscribe, MessageFlow, etc., which the other brokers know nothing about, so the command numbers differ by 3 between the elder and the non-elders. This doesn't show up immediately because the other end doesn't send completions unless asked, and we don't ask. The problem shows up (sporadically) when the other end hits the "spontaneous request" limit, by default 65536, and sends an unsolicited completion. It doesn't always fail there reliably because, if there are more than 3 messages in transit, the command number will be low enough to be in the range of commands that the non-elders know about.
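
For illustration, a toy model (invented names and numbers - not qpid code) of the numbering divergence just described: the elder counts three extra commands, so an unsolicited completion that is perfectly valid on the elder can look out of range on a non-elder and trigger the "confirmed < ... but only sent < ..." error from the original report.

// Toy model only - mimics the session-state check, not the real qpid::SessionState code.
#include <cstdint>
#include <iostream>
#include <stdexcept>
#include <string>

struct SessionModel {
    uint64_t sent = 0;                          // commands this broker believes it has sent
    void checkConfirmed(uint64_t confirmed) const {
        if (confirmed > sent)
            throw std::runtime_error("confirmed < (" + std::to_string(confirmed) +
                                     "+0) but only sent < (" + std::to_string(sent) + "+0)");
    }
};

int main() {
    const uint64_t transfers = 65533;           // ordinary traffic, seen by every cluster member
    SessionModel elder, nonElder;
    elder.sent = transfers + 3;                 // subscribe + flow + flow, sent only by the elder
    nonElder.sent = transfers;                  // never replayed on the shadow connection

    const uint64_t confirmed = elder.sent;      // peer eventually confirms everything the elder sent
    elder.checkConfirmed(confirmed);            // fine on the elder
    try {
        nonElder.checkConfirmed(confirmed);     // the non-elder trips here, as in the bug
    } catch (const std::exception& e) {
        std::cout << "non-elder error: " << e.what() << "\n";
    }
    return 0;
}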

Solution: Non-elder brokers create passive links and bridges, and associate the link with the shadow connection. The _only_ difference between an active and a passive link is that an active link opens a connection, while a passive link just waits for the shadow connection to appear. Other than that they send identical commands and so have identical numbering. The passive links go through exactly the same state changes as the active one.
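
As a conceptual sketch only (the class and method names below are invented, not the actual qpid ones), this is the shape of that fix: the passive link skips opening a connection but otherwise issues the same commands in the same order, so the command numbering stays identical across the cluster.

// Conceptual sketch only - not the qpid Link/Bridge classes.
#include <iostream>
#include <string>

struct LinkSketch {
    bool passive;
    int nextCommand = 0;

    explicit LinkSketch(bool p) : passive(p) {}

    void establish() {
        if (passive)
            std::cout << "wait for the shadow connection to appear\n";
        else
            std::cout << "open a real outbound connection\n";
        // From here on, active and passive links behave identically:
        sendCommand("MessageSubscribe");
        sendCommand("MessageFlow unit=0");
        sendCommand("MessageFlow unit=1");
    }

    void sendCommand(const std::string& name) {
        std::cout << "cmd " << nextCommand++ << ": " << name << "\n";
    }
};

int main() {
    LinkSketch elder(false), shadow(true);
    elder.establish();   // elder: real connection, commands 0..2
    shadow.establish();  // non-elder: passive, same commands 0..2, so numbering stays in sync
    return 0;
}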

There are 2 patches:

1. The one mentioned in comment 12 (https://reviews.apache.org/r/6493/): This is based on a trunk revision. There is a new test in the patch (cluster_tests.py:test_federation_bridges_consistent) to verify that all members of a cluster send the same bridge subscription commands; that test is passing. I wasn't able to reproduce the original problem on this trunk revision with or without the patch.


2. attachment 603869 [details]: This is patch 1 rebased onto 0.10-8. Unfortunately it doesn't work. In the older code bridges are assigned UUID names, and these are not synchronized in a cluster. I am able to reproduce the bug on un-patched 0.10, but the patched version fails with a "wrong session name". We need to figure out a way to sync the session names. (In the newer trunk code, bridges have user-defined names that are synchronized.)

Comment 16 Pavel Moravec 2012-08-14 18:43:11 UTC
Created attachment 604397 [details]
Alan's fix ported to proper hotfix 0.10-8_gs_2

Alan's fix ported to proper hotfix 0.10-8_gs_2 (i.e. not to pure 0.10-8).

To have link names generated without a UUID, QPID-3767 (SVN revisions r1342442, r1334139 and r1334138) needs to be backported on top of it.

Comment 17 Pavel Moravec 2012-08-22 11:20:04 UTC
Created attachment 606198 [details]
patch - current (my final) status

Attaching a patch that can be the base for the customer's hotfix (fixing BZ 826989, the adjacent BZ 844618, and this BZ 836141).

What the patch joins and contains:
1) patch from 826989
2) patch from 844618
3) Alan's backport of QPID-4201
4) Ken's backport of QPID-3767
4) qpid.link for failover renamed to get rid of Uuid (see https://reviews.apache.org/r/6493/, first comment of Kenneth Giusti)
5) one subtle change in Alan's backport: swapping two methods in Connection::Connection

I successfully verified the patch against both reproducers.

However, this patch has two small issues:
1) in 1 out of 30 cases, federation link establishment results in an immediate cluster de-sync and the brokers consequently stop. I can provide details on request.
2) sometimes when abruptly stopping a broker, it writes to stderr:
pure virtual method called
terminate called without an active exception

Comment 18 Alan Conway 2012-09-04 20:56:45 UTC
Can you attach details on reproducing issue 1 from comment 17, and any log info you might have around the failure?

I'm less worried about issue 2, since it only happens during shutdown.

Comment 19 Alan Conway 2012-09-04 20:57:59 UTC
What was the "Subtle change" mentioned in comment 17 point 5?

Comment 20 Pavel Moravec 2012-09-05 07:27:37 UTC
Created attachment 609884 [details]
qpid traces from GS hotfix 4 with cluster de-sync

Trace logs of cluster de-sync after federation route creation in GS hotfix 4.

Comment 21 Pavel Moravec 2012-09-05 07:41:07 UTC
(In reply to comment #19)
> What was the "Subtle change" mentioned in comment 17 point 5?

Originally I thought I had attached the patch with the "subtle change" which, based on some tests, had a lower probability of cluster de-sync when the fed.route is created. But reviewing it, I committed the patch _without_ the change.

(the change was about moving links.notifyConnection to just _after_ the addManagementObject call in https://bugzilla.redhat.com/attachment.cgi?id=606198&action=diff#a/qpid/cpp/src/qpid/broker/Connection.cpp_sec1; but as I wrote, I didn't apply the change, cf. the original https://bugzilla.redhat.com/attachment.cgi?id=604098&action=diff#a/qpid/cpp/src/qpid/broker/Connection.cpp_sec1)

Comment 22 mick 2012-09-20 04:53:38 UTC
On my box - mrg26 - I am getting much more frequent failures on start-up -- close to 50%.

I also see that if we get past the start-up without a problem, then the rest of the test runs fine for thousands of iterations.


The failure mechanism, at the bottom level, is:


amqp_0_10::SessionHandler::attached checks the incoming session name against the name it was expecting, from getState()->getId().getName(), and finds that they differ.  It becomes upset.

The names differ only in the UUID-looking part at the end of the name. (Is it a management ID?)


If I make amqp_0_10::SessionHandler::attached not bother its little head about this check, then the rest of the test runs forever with no problems....


still poking...

Comment 23 mick 2012-09-20 06:03:49 UTC
The problematic names come through
SessionHandler::attachAs

There are two calls to SessionHandler::attachAs

both calls come from Bridge::create  ( should there be two? )

both calls come from different threads

when the queue name is the same in both calls, the test succeeds.

when the two names are different, the test fails.

hmmm.

Comment 24 Pavel Moravec 2012-09-20 09:34:15 UTC
Created attachment 614799 [details]
additional patch fixing failover problems

Additional patch fixing some failures in the automated tests of federation link cluster failover.

Comment 25 Justin Ross 2012-09-20 13:12:06 UTC
While there remains a problem with the 0.10 backport, this issue is addressed (fixed on trunk, comment 12) for 2.3. -> POST

Comment 26 Mike Cressman 2012-09-24 13:17:14 UTC
Comment 12 states that the fix "depends on" trunk r1370792.
The fix itself does not yet appear to be checked into trunk.
Moving back to ASSIGNED.

Comment 27 Alan Conway 2012-09-24 16:39:19 UTC
This patch can't be applied to trunk any more; the cluster code was removed from trunk in the Big Refactor. So we only need the backports.

Comment 31 Zdenek Kraus 2013-01-18 08:10:39 UTC
The issue was successfully reproduced on qpid-cpp-0.14-22 in 100% of runs; on qpid-cpp-0.18-13 the issue didn't occur at all after several tries. Thus this issue is fixed.

Tested on RHEL 5.9 and 6.4, on i686 and x86_64, with qpid-cpp-0.18-13.

-> VERIFIED

Comment 33 errata-xmlrpc 2013-03-06 18:56:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0561.html

