Red Hat Bugzilla – Bug 466822
When using RDMA, topic exchange can cause deadlocks
Last modified: 2009-02-04 10:35:21 EST
TopicExchange holds a read lock while routing a message. When RDMA is in use, a command on another connection can be handled on that same thread before route returns. If that command then tries to add or remove a binding on the topic exchange, it will deadlock.
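To make the hazard concrete, here is a minimal single-threaded model of it. Everything in it (ToyRWLock, ToyExchange, bindDuringRoute) is hypothetical and is not Qpid code. The toy lock reports a conflict instead of blocking, so the point where the real broker would hang shows up as a false return value.

```cpp
#include <functional>

// Toy reader/writer lock that reports conflicts instead of blocking.
// Hypothetical illustration only; not the lock type used by the broker.
class ToyRWLock {
    int readers = 0;
    bool writer = false;
public:
    bool tryReadLock()  { if (writer) return false; ++readers; return true; }
    void readUnlock()   { --readers; }
    bool tryWriteLock() {
        if (writer || readers > 0) return false;  // a real lock would block here
        writer = true;
        return true;
    }
    void writeUnlock()  { writer = false; }
};

struct ToyExchange {
    ToyRWLock lock;
    int bindings = 0;

    // Returns false in the situation where a blocking lock would wait forever.
    bool bind() {
        if (!lock.tryWriteLock()) return false;
        ++bindings;
        lock.writeUnlock();
        return true;
    }

    // Holds the read lock for the whole delivery, as described in this report.
    void route(const std::function<void()>& deliver) {
        (void)lock.tryReadLock();
        deliver();  // assumption: this may end up processing another connection's command
        lock.readUnlock();
    }
};

// Simulates a bind arriving on the routing thread while route() is still active.
bool bindDuringRoute() {
    ToyExchange ex;
    bool ok = true;
    ex.route([&] { ok = ex.bind(); });
    return ok;
}
```

In this model bindDuringRoute() returns false: the write lock is requested on a thread that still holds the read lock, so the reader can never be released.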
No test info. Putting NEEDINFO flag.
The original bug is theoretical, based on code examination. When I succeed in getting a good reproducer I'll attach it here.
[This is also a gate on the bug fix]
Created attachment 324797 [details]
Patch to rearrange the locking order in TopicExchange::route()
I have produced a fix for this putative bug and attached it here. It builds and passes the standard checks.
It also doesn't cause a noticeable regression when running perftest through the topic exchange.
It has not been applied because I am unable to actually reproduce the bug itself.
I have also attached a program which attempts (but fails) to reproduce the bug.
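For readers without access to the attachment, one common way to avoid holding a read lock across delivery is to copy the matching bindings while locked and deliver only after unlocking. The sketch below shows that pattern; it is not the attached patch, whose actual changes may differ. SketchExchange, its exact-match keys, and rebindWhileRouting are assumptions made for illustration, and std::shared_mutex requires C++17.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a topic exchange; keys are matched exactly,
// with no wildcard handling.
class SketchExchange {
    using Deliver = std::function<void(const std::string&)>;
    mutable std::shared_mutex lock;
    std::multimap<std::string, Deliver> bindings;
public:
    void bind(const std::string& key, Deliver d) {
        std::unique_lock<std::shared_mutex> w(lock);  // exclusive (write) lock
        bindings.emplace(key, std::move(d));
    }

    std::size_t bindingCount() const {
        std::shared_lock<std::shared_mutex> r(lock);
        return bindings.size();
    }

    void route(const std::string& key, const std::string& msg) {
        std::vector<Deliver> matches;
        {
            // Hold the read lock only long enough to snapshot the matches.
            std::shared_lock<std::shared_mutex> r(lock);
            auto range = bindings.equal_range(key);
            for (auto it = range.first; it != range.second; ++it)
                matches.push_back(it->second);
        }  // read lock released before any delivery happens
        for (auto& d : matches) d(msg);
    }
};

// A delivery callback that adds a binding, standing in for a bind command
// handled on the routing thread. Returns {deliveries, final binding count}.
std::pair<int, std::size_t> rebindWhileRouting() {
    SketchExchange ex;
    int delivered = 0;
    ex.bind("a.b", [&](const std::string&) {
        ++delivered;
        ex.bind("c.d", [](const std::string&) {});
    });
    ex.route("a.b", "hello");
    return {delivered, ex.bindingCount()};
}
```

The trade-off of this pattern is that a binding removed after the snapshot can still receive the in-flight message.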
Created attachment 324799 [details]
Test program that fails to reproduce the bug
It does run for a very long time, though, and uses a lot of CPU in the broker (on a single CPU, as there is only one connection).
Gordon - Any more thoughts on this "bug"?
Created attachment 324903 [details]
Test program to trigger deadlock
To hit the deadlock we need to have concurrent transfers through, and bind/unbind operations on, the topic exchange. The attached topic_test uses a single connection and would never hit the issue unless multiple instances were run concurrently (with appropriate staggering between them, such that the start-up and end phases of one run would be concurrent with the transfers of another).
The attached test case targets the problem more directly and resulted in the suspected deadlock on mrg12. Will now test with the patch applied.
Running my test case with --messages 50000 seemed to reliably (5 deadlocks out of 5 attempts) cause the deadlock (the default, 10000, passed on one occasion).
With the patch applied it passed 5 out of 5 tries of the same test; patch committed as r721243.
FYI, when reproducing, the IB interface address must be used as the broker address. E.g. on mrg12: ./bz466822 -b 192.168.10.36 --messages 50000
Tried to reproduce the deadlock, without luck. Based on RPMs from SVN r722891. Test verified on mrg14.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.