Bug 466822 - When using RDMA, topic exchange can cause deadlocks
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Hardware: All Linux
Priority: urgent
Severity: medium
: 1.1
: ---
Assigned To: Gordon Sim
Kim van der Riet
Depends On:
Reported: 2008-10-13 16:53 EDT by Gordon Sim
Modified: 2009-02-04 10:35 EST (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2009-02-04 10:35:21 EST

Attachments
Patch to rearrange the locking order in TopicExchange::route() (1.70 KB, patch)
2008-11-26 15:52 EST, Andrew Stitcher
Test program that fails to reproduce the bug (6.92 KB, text/x-c++src)
2008-11-26 16:01 EST, Andrew Stitcher
Test program to trigger deadlock (5.74 KB, text/x-c++src)
2008-11-27 11:30 EST, Gordon Sim

Description Gordon Sim 2008-10-13 16:53:30 EDT
TopicExchange holds a read lock while routing a message. When RDMA is in use, it is possible that a command on another connection is handled on that same thread before returning from route, and if this then tries to add/remove a binding from the topic exchange it will deadlock.
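
To make the hazard concrete, here is a minimal sketch of one common way to avoid it: copy the matching bindings while holding the read lock, release the lock, and only then deliver. Everything in it is an assumption made for illustration. SketchExchange, its exact-match keys, and routeWithReentrantUnbind are hypothetical, the class is not the qpid-cpp TopicExchange, and this is not necessarily what the attached patch does. The delivery callback stands in for "a command on another connection handled on that same thread" by calling unbind() from inside route().

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for a topic exchange (exact-match keys only).
class SketchExchange {
public:
    using Queue = std::function<void(const std::string&)>;

    void bind(const std::string& key, Queue q) {
        std::unique_lock<std::shared_mutex> w(lock_);  // write lock
        bindings_[key].push_back(std::move(q));
    }

    void unbind(const std::string& key) {
        std::unique_lock<std::shared_mutex> w(lock_);  // write lock
        bindings_.erase(key);
    }

    // Snapshot the matching queues under the read lock, then deliver
    // with no lock held, so delivery may safely re-enter bind()/unbind().
    void route(const std::string& key, const std::string& msg) {
        std::vector<Queue> matches;
        {
            std::shared_lock<std::shared_mutex> r(lock_);  // read lock
            auto it = bindings_.find(key);
            if (it != bindings_.end()) matches = it->second;
        }  // read lock released here
        for (auto& q : matches) q(msg);
    }

    std::size_t bindingCount() const {
        std::shared_lock<std::shared_mutex> r(lock_);
        return bindings_.size();
    }

private:
    mutable std::shared_mutex lock_;
    std::map<std::string, std::vector<Queue>> bindings_;
};

// Delivers one message whose handler unbinds on the same thread.
// If route() still held the read lock during delivery, the unbind()
// call below would block forever waiting for the write lock.
int routeWithReentrantUnbind() {
    SketchExchange ex;
    int delivered = 0;
    ex.bind("a.b", [&](const std::string&) {
        ++delivered;
        ex.unbind("a.b");
    });
    ex.route("a.b", "hello");
    ex.route("a.b", "again");  // binding is gone, nothing delivered
    assert(ex.bindingCount() == 0);
    return delivered;
}
```

The trade-off in this sketch is a copy of the matching bindings per routed message in exchange for never running delivery code under the lock.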
Comment 1 Frantisek Reznicek 2008-10-31 12:21:44 EDT
No test info. Putting NEEDINFO flag.
Comment 2 Andrew Stitcher 2008-11-07 10:30:29 EST
The original bug is theoretical based on code examination - When I succeed in getting a good reproducer I'll attach it here.

[This is also a gate on the bug fix]
Comment 3 Andrew Stitcher 2008-11-26 15:52:39 EST
Created attachment 324797
Patch to rearrange the locking order in TopicExchange::route()
Comment 4 Andrew Stitcher 2008-11-26 15:54:43 EST
I have produced a fix for this putative bug and attached it here. It builds and passes the standard checks.

It also doesn't cause a noticeable regression when running perftest through the topic exchange.

I attach it here.

It has not been applied because I am unable to actually reproduce the bug itself.

I attach here also a program which attempts (but fails) to reproduce the bug.
Comment 5 Andrew Stitcher 2008-11-26 16:01:18 EST
Created attachment 324799
Test program that fails to reproduce the bug

It does run for a very long time though and uses a lot of CPU in the broker (on a single CPU as there is only one connection)
Comment 6 Andrew Stitcher 2008-11-26 16:04:24 EST
Gordon - Any more thoughts on this "bug"?
Comment 7 Gordon Sim 2008-11-27 11:30:34 EST
Created attachment 324903
Test program to trigger deadlock

To hit the deadlock we need concurrent transfers through, and bind/unbind operations on, the topic exchange. The attached topic_test uses a single connection and would never hit the issue unless multiple instances were run concurrently (with appropriate staggering between them, such that the start-up and end phases of one run would be concurrent with the transfers of another).

The attached test case targets the problem more directly and resulted in the suspected deadlock on mrg12. Will now test with the patch applied.
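
The overlap described above (transfers on some threads while bindings are added and removed on another) can be sketched in-process as follows. This is only an illustration under stated assumptions: BindingTable and runOverlap are hypothetical names, this is not the attached test program, and it does not model the RDMA behaviour of handling another connection's command on the same thread, so it shows the access pattern rather than reproducing the deadlock.

```cpp
#include <atomic>
#include <mutex>
#include <set>
#include <shared_mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical in-process model of a binding table guarded by a
// reader/writer lock; not the qpid-cpp TopicExchange.
class BindingTable {
public:
    void bind(const std::string& key) {
        std::unique_lock<std::shared_mutex> w(lock_);  // write lock
        keys_.insert(key);
    }
    void unbind(const std::string& key) {
        std::unique_lock<std::shared_mutex> w(lock_);  // write lock
        keys_.erase(key);
    }
    // Returns true if a message with this key would be routed.
    bool route(const std::string& key) const {
        std::shared_lock<std::shared_mutex> r(lock_);  // read lock
        return keys_.count(key) != 0;
    }

private:
    mutable std::shared_mutex lock_;
    std::set<std::string> keys_;
};

// Runs `messages` transfers on each of `publishers` threads while a
// separate thread repeatedly binds and unbinds an unrelated key.
// Returns the number of messages that matched the stable binding.
long runOverlap(int publishers, int messages) {
    BindingTable table;
    table.bind("stable");
    std::atomic<long> routed{0};
    std::atomic<bool> done{false};

    std::thread binder([&] {
        while (!done.load()) {
            table.bind("churn");
            table.unbind("churn");
        }
    });

    std::vector<std::thread> pubs;
    for (int p = 0; p < publishers; ++p) {
        pubs.emplace_back([&] {
            for (int i = 0; i < messages; ++i) {
                if (table.route("stable")) ++routed;
            }
        });
    }
    for (auto& t : pubs) t.join();
    done = true;
    binder.join();
    return routed.load();
}
```

For example, runOverlap(4, 1000) runs four publisher threads against one thread churning a binding; every transfer matches because the "stable" binding is never removed.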
Comment 8 Gordon Sim 2008-11-27 12:10:51 EST
Running my test case with --messages 50000 seemed to reliably (5 deadlocks out of 5 attempts) cause the deadlock (the default, 10000, passed on one occasion).

With the patch applied it passed 5 out of 5 tries of the same test; patch committed as r721243.

FYI: to reproduce, the IB interface must be used as the broker address. E.g. on mrg12: ./bz466822  -b --messages 50000
Comment 10 David Sommerseth 2008-12-08 15:08:24 EST
Tried to reproduce deadlock, without luck.  Based on RPMs from SVN r722891.  Test verified on mrg14
Comment 12 errata-xmlrpc 2009-02-04 10:35:21 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.