Bug 835628

Summary: Broker crashes after creating federated links
Product: Red Hat Enterprise MRG Reporter: Jason Dillaman <jdillama>
Component: qpid-cpp Assignee: Ken Giusti <kgiusti>
Status: CLOSED ERRATA QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: high Docs Contact:
Priority: high    
Version: 2.0 CC: iboverma, jneedle, jross, kgiusti, mcressma, pematous
Target Milestone: 2.2   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: qpid-cpp-0.14-19 Doc Type: Bug Fix
Doc Text:
Cause: Creating a federation link where the source broker is not a member of a cluster (source broker is stand-alone). Consequence: This would occasionally cause the destination broker to crash as there was a race between the thread that configures the link and the thread that sends traffic over it. Fix: The race was removed by moving the configuration code to the same thread as the data handling code. Result: The link configuration is completed fully before traffic is sent over it.
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-09-17 11:11:18 UTC Type: Bug
Bug Depends On:    
Bug Blocks: 698367    

Description Jason Dillaman 2012-06-26 16:11:08 UTC
Description of problem:
Creating a federated link between two non-clustered brokers may result in a broker crash.  Shortly after connecting to the remote broker, the Link class attempts to bind to the 'amq.failover' exchange.  If that exchange does not exist, as is the case when the broker is not running in a clustered environment, the bind call fails and the session is detached.  Due to a race condition, the session may be invalidated before the queue subscribe is sent.  Once the session is detached, its SessionHandler is cleaned up, so the subscribe call ends up invoking a pure virtual function on the destructed SessionHandler.
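
For illustration only, the failure mode above can be avoided at the call site by checking that the session's handler is still alive before sending through it. The sketch below is not the qpid code and not the fix that was shipped: FrameSink and LinkSketch are hypothetical stand-ins, and it assumes the handler's lifetime is managed with std::shared_ptr so that the link can hold a std::weak_ptr to it.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a session's frame handler (not qpid's SessionHandler).
struct FrameSink {
    std::vector<std::string> sent;
    void send(const std::string& frame) { sent.push_back(frame); }
};

// Hypothetical link that keeps only a weak reference to the session's sink.
class LinkSketch {
public:
    void attach(const std::shared_ptr<FrameSink>& sink) {
        assert(sink && "attach requires a live sink");
        sink_ = sink;
    }

    // Returns true if the subscribe was sent, false if the session is gone.
    bool subscribe(const std::string& queue) {
        // lock() yields an empty pointer once the sink has been destroyed,
        // so a detached session is detected instead of dereferenced.
        if (std::shared_ptr<FrameSink> sink = sink_.lock()) {
            sink->send("message.subscribe " + queue);
            return true;
        }
        return false;
    }

private:
    std::weak_ptr<FrameSink> sink_;
};
```

With this shape, a subscribe issued after the session has been torn down returns false rather than calling into a destroyed object.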

Version-Release number of selected component (if applicable):
qpid-cpp-server-0.12-6_ptc_hotfix_4.el6_2.x86_64 
**contains QPID-3963 / BZ824990 patch

How reproducible:
Frequently

Steps to Reproduce:
1. Start two non-clustered brokers
2. Establish a federated link between the brokers
  
Actual results:
The debug logs from the two brokers will show that the link's failover session was invalidated after attempting to bind to the missing exchange.  Depending on timing, it is possible that the session is detached prior to the subscribe call.  This will result in a broker crash.

Expected results:
The link gracefully handles the missing amq.failover exchange.

Additional info:


#0  0x00007f34cb361885 in raise () from /lib64/libc.so.6
#1  0x00007f34cb363065 in abort () from /lib64/libc.so.6
#2  0x0000003c43cbea7d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#3  0x0000003c43cbcc06 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x0000003c43cbcc33 in std::terminate() () from /usr/lib64/libstdc++.so.6
#5  0x0000003c43cbd55f in __cxa_pure_virtual () from /usr/lib64/libstdc++.so.6
#6  0x0000003e695e4eca in qpid::framing::Proxy::send (this=0x7f34ca213070, b=...) at qpid/framing/Proxy.cpp:37
#7  0x0000003e695552ca in qpid::framing::AMQP_ServerProxy::Message::subscribe (this=0x7f34ca213070, queue=..., destination=..., acceptMode=1 '\001', acquireMode=0 '\000', exclusive=false, 
    resumeId=..., resumeTtl=0, arguments=...) at qpid/framing/AMQP_ServerProxy.cpp:240
#8  0x00000033b6b992bc in qpid::broker::Link::maintenanceVisit (this=0xc15090) at qpid/broker/Link.cpp:418
#9  0x00000033b6ba2191 in qpid::broker::LinkRegistry::periodicMaintenance (this=0xb045e0) at qpid/broker/LinkRegistry.cpp:92
#10 0x00000033b6ba23e1 in qpid::broker::LinkRegistry::Periodic::fire (this=0xafc3c0) at qpid/broker/LinkRegistry.cpp:72
#11 0x0000003e69600c86 in qpid::sys::Timer::fire (this=<optimized out>, t=...) at qpid/sys/Timer.cpp:195
#12 0x0000003e69601e51 in qpid::sys::Timer::run (this=0xb04250) at qpid/sys/Timer.cpp:129
#13 0x0000003e69539b4a in qpid::sys::(anonymous namespace)::runRunnable (p=<optimized out>) at qpid/sys/posix/Thread.cpp:35
#14 0x00007f34cb11a7f1 in start_thread () from /lib64/libpthread.so.0
#15 0x00007f34cb414ccd in clone () from /lib64/libc.so.6

Comment 2 Ken Giusti 2012-07-20 17:57:13 UTC
This bug was caused by the incorrect backport of this upstream fix:
https://issues.apache.org/jira/browse/QPID-3963

The upstream fix allowed the broker to subscribe for failover events.  This patch did not port cleanly to our MRG downstream repos, causing the above crash.

The fix was originally submitted to the mrg_2_ptc_hotfix branch of the MRG git repo:

http://mrg1.lab.bos.redhat.com/git/?p=qpid.git;a=commitdiff;h=fa4ef35981defb5daa0256eebafa0e458a6c3af3

It is relevant only to 0.12 and 0.14-based MRG repos - 0.18 is not affected.

I believe mcressman has ported the fix to 0.12 and 0.14 - Mike can you confirm?

Comment 3 Ken Giusti 2012-07-20 18:06:09 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause
    Creating a federation link where the source broker is not a member of a cluster (source broker is stand-alone).
Consequence
    This would occasionally cause the destination broker to crash as there was a race between the thread that configures the link and the thread that sends traffic over it.
Fix
    The race was removed by moving the configuration code to the same thread as the data handling code.
Result
    The link configuration is completed fully before traffic is sent over it.
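
The idea behind the fix described in this note, running configuration and data handling on one thread so they cannot interleave, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual patch: SerialQueue and SerialLink are hypothetical names, and the sketch assumes that only the thread owning the connection ever calls drain().

```cpp
#include <deque>
#include <functional>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical single-consumer task queue: any thread may post, but only the
// thread that owns the connection calls drain().
class SerialQueue {
public:
    void post(std::function<void()> task) {
        std::lock_guard<std::mutex> guard(lock_);
        tasks_.push_back(std::move(task));
    }

    // Runs queued tasks in FIFO order on the calling thread.
    void drain() {
        for (;;) {
            std::function<void()> task;
            {
                std::lock_guard<std::mutex> guard(lock_);
                if (tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop_front();
            }
            task();  // run outside the lock so a task may post follow-up work
        }
    }

private:
    std::mutex lock_;
    std::deque<std::function<void()>> tasks_;
};

// Hypothetical link whose state is touched only from drained tasks.
struct SerialLink {
    SerialQueue queue;
    bool configured = false;
    std::vector<std::string> log;

    void requestConfigure() {
        queue.post([this] { configured = true; log.push_back("configured"); });
    }

    void requestSubscribe(const std::string& q) {
        queue.post([this, q] {
            log.push_back(configured ? "subscribe " + q : "skipped " + q);
        });
    }
};
```

Because both requests are executed by the same thread in the order they were posted, the subscribe task always observes the outcome of the configuration task that preceded it.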

Comment 5 Petr Matousek 2012-07-30 16:22:46 UTC
While verifying Bug 831365 on the latest packages (qpid-cpp-*14-19), I noticed that the following error message is logged to the broker log. The broker no longer crashes, but the error message should not appear in the log. The issue is easily reproducible by creating a federated link between two non-clustered brokers.

src_broker log snip:
2012-07-27 17:56:16 notice Broker running
2012-07-27 17:56:28 info Connection is a federation link
2012-07-27 17:56:30 info Queue "qpid.link.032acf3f-260e-47c1-abfe-d51c95a66a85": Policy created: type=reject; maxCount=0; maxSize=104857600
2012-07-27 17:56:30 info Queue "qpid.link.032acf3f-260e-47c1-abfe-d51c95a66a85": Flow limit created: flowStopCount=0, flowResumeCount=0, flowStopSize=83886080, flowResumeSize=73400320
2012-07-27 17:56:30 error Execution exception: not-found: Exchange not found: amq.failover (qpid/broker/ExchangeRegistry.cpp:97)

Ken, should I move this bz back to assigned or create a separate bz for this issue?

Comment 6 Ken Giusti 2012-07-30 19:58:40 UTC
Hi Petr,

The log message is, unfortunately, expected: the broker logs - as an error - any command that it cannot complete.  In this case, the remote is attempting to bind to a non-existing exchange (amq.failover).  The command will fail, resulting in the log message.

Even though it has logged an error, there really is nothing wrong with the broker at this point - the bind fails, the session ends and both sides clean up.

Ideally, the log message shouldn't be issued in this particular case: since the source broker is not part of a cluster, there is no need for failover.  Thus, there is no amq.failover exchange.  The AMQP 0-10 spec is pretty clear about this - amq.failover should only exist if the broker supports failover.

The problem is that the other (destination) broker doesn't know if the source is part of a cluster (and has amq.failover) or not.  It only finds out by attempting to bind to amq.failover, and dealing with the result (success or failure).
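
As a rough sketch of that "attempt the bind and deal with the result" approach: everything below is hypothetical (NotFoundError, FailoverSupport and probeFailover are illustrative names, not qpid APIs), and it assumes the bind attempt surfaces a missing exchange as an exception.

```cpp
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical exception standing in for a broker "not-found" execution error.
struct NotFoundError : std::runtime_error {
    explicit NotFoundError(const std::string& what) : std::runtime_error(what) {}
};

enum class FailoverSupport { Available, Unavailable };

// bindFn is a caller-supplied callable that attempts the bind and throws
// NotFoundError when the named exchange does not exist on the source broker.
FailoverSupport probeFailover(
    const std::function<void(const std::string&)>& bindFn) {
    try {
        bindFn("amq.failover");
        return FailoverSupport::Available;
    } catch (const NotFoundError&) {
        // Expected for a stand-alone source broker: carry on without
        // failover updates instead of treating this as fatal.
        return FailoverSupport::Unavailable;
    }
}
```

Only the not-found case is treated as benign here; any other exception still propagates to the caller.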

You'll see the same result if you try to run qpid-receive with the --failover-updates parameter:

 qpid-receive -b 127.0.0.1:7777 -a amq.direct --failover-updates
2012-07-30 15:54:00 [Client] warning Exception received from broker: not-found: not-found: Queue not found: amq.failover (../../../qpid/cpp/src/qpid/broker/SessionAdapter.cpp:692) [caused by 2 \x08:\x01]
qpid-receive: Queue amq.failover does not exist

And the same error will be logged by the broker.

Perhaps the destination broker could query for the existence of amq.failover exchange first, before deciding to bind.  But that would add another level of complexity to the federation link setup.  Another option would be to reduce the log level for failure to bind to amq.failover - though from my quick glance at the code this would be more difficult than it sounds.

In either case, I'd open a new BZ.

Comment 7 Petr Matousek 2012-07-31 09:55:51 UTC
New bug 844655 created for the issue mentioned in Comment 5 and Comment 6.