Bug 604688
| Field | Value |
| --- | --- |
| Summary | rdma stability issues including client crashing when broker is killed from underneath it and broker likewise |
| Product | Red Hat Enterprise MRG |
| Component | qpid-cpp |
| Status | CLOSED CURRENTRELEASE |
| Severity | urgent |
| Priority | urgent |
| Version | beta |
| Target Milestone | 1.3 |
| Hardware | All |
| OS | Linux |
| Reporter | Gordon Sim <gsim> |
| Assignee | Andrew Stitcher <astitcher> |
| QA Contact | MRG Quality Engineering <mrgqe-bugs> |
| CC | jross |
| Doc Type | Bug Fix |
| Last Closed | 2012-12-11 18:53:55 UTC |
Description (Gordon Sim, 2010-06-16 13:55:25 UTC)
See also https://bugzilla.redhat.com/show_bug.cgi?id=603802#c3 for a different crash. That was without setting ulimit -l 131072 for both client and server. The above trace was after setting this.

Is the IP address 30.0.20.15 from the above reproducer an error?

No, that's the IP of the IB interface on mrg15 where I was testing. Why? It appears to work (perftest will get through its run if allowed to complete, though it will often core dump on exit).

That address is not using the same convention as the other machines "nearby". The addresses in use on mrg12/mrg14 are 20.0.40.12/14. Also there seem to be 2 IB interfaces (ib0/ib1) on mrg15, each with this same address; that seems odd.

The interfaces are addr:30.0.10.15 and addr:30.0.20.15. The first one doesn't work for me, so I used the second, which does (barring these issues).

Having looked at crash dumps from this and similar crashes, I'm now tending to the conclusion that there is a heap corruption bug somewhere in the rdma code that only exhibits on the client side of the code. A leading candidate would be freeing a block of memory back to the memory allocator before it has been removed from the rdma receive ring.

I've carried out an experiment that would seem to disprove the idea that it's the rdma hardware overwriting things: if I stop the client-side code from ever freeing the buffers that are used by the rdma hardware, the crash seems to happen faster rather than slower (and in different places). It still seems consistent with memory corruption of some sort, though.

A big raft of changes which massively improves the rdma stability has been checked in as of trunk r995165. Also checked into the mrg_1.3 release branch.

There are essentially 3 tests to use to verify the bug. All of them run the broker and a client in parallel in different windows.

Prerequisites: Infiniband/IBoE installed and working on the machine.
Either installed packages including the qpid-rdma* packages, or compile (make install) from source with a --prefix that is writable (to hold the necessary loadable modules).

1. Run the client in a loop and kill the broker.

   Broker:
       while [ ! -f core.* ]; do date; src/qpidd --auth no & sleep 1; kill %%; sleep 1; done

   Client:
       while [ ! -f core.* ]; do date; src/tests/qpid-perftest -Prdma -b 20.0.40.14 --qt 4 --count 10; done

2. Run the broker; loop killing the client.

   Broker:
       src/qpidd --auth no

   Client:
       while [ ! -f core.* ]; do date; src/tests/qpid-perftest --qt 4 -Prdma -b 20.0.40.14 & sleep 5; kill %%; sleep 1; done

3. Run the broker; loop the client with very small counts (so exercising the teardown and setup logic frequently).

   Broker:
       src/qpidd --auth no

   Client:
       while [ ! -f core.* ]; do date; src/tests/qpid-perftest -Prdma -b 20.0.40.14 --qt 4 --count 10; done