Description of problem:
Running Perftest with either Mellanox 10Gb with RoCE or Mellanox IB with RDMA crashes the qpidd server.

Version-Release number of selected component (if applicable):
qpid-cpp-server-0.9.1073306-1.el6.x86_64

How reproducible:
Happens with every test run started.

Steps to Reproduce:
1. Start the qpidd process:
   /usr/sbin/qpidd --auth no -m no --no-data-dir --mgmt-qmf1 no --mgmt-qmf2 no --load-module /usr/lib64/qpid/daemon/rdma.so --log-enable=info+
2. Start the test:
   /usr/bin/qpid-perftest -P rdma -b 192.168.10.42 --count 100000 --size 8 --npubs 1 --nsubs 1 --qt 13

Actual results:
2011-03-01 13:19:51 error Caught exception in state: 3 with event: 1: Couldn't find existing Connection

From the client:
2011-03-01 13:19:28 error RDMA: qp=0x7fd5b000bf50: Deleting queue before all write buffers finished
2011-03-01 13:19:28 warning Connection [192.168.10.42:34451 192.168.10.44:5672] closed
PublishThread exception: Connection [192.168.10.42:34451 192.168.10.44:5672] closed

Expected results:
No crashes, and the test runs to completion.

Additional info:
I went back and tested different builds and found that this started with the qpid-cpp-0.7.946106-9.el6 build. I have attached the latest coredump to this crash.
The coredump is too large to attach, so here is a pointer to where I put it on the web: http://perf1.lab.bos.redhat.com/network/qpidd_rdma_crash.tar
This may be the same underlying cause as Bug 674056
We need to reconsider the blocker status of this bug. I don't think we can ship with this bug.
This affects both Mellanox 10Gb and Mellanox InfiniBand. It also affects publishing reference papers on the subject, so I consider it a blocker.
At the point where the client fails, the following error is generated on the broker side:

#5 Rdma::AsynchIO::processCompletions (this=0x7f8619814cf0) at qpid/sys/rdma/RdmaIO.cpp:385
385         errorCallback(*this);
(gdb) list
380             } else {
381                 ++recvEvents;
382             }
383             continue;
384         }
385         errorCallback(*this);
386         // TODO: Probably need to flush queues at this point
387         return;
388     }
389
(gdb) info local
e = {cq = {px = 0x7f86198144a0, pn = {pi_ = 0x7f8619814570}}, wc = {
    wr_id = 140213934450336, status = IBV_WC_LOC_LEN_ERR, opcode = 32646,
    vendor_err = 215, byte_len = 32646, imm_data = 0, qp_num = 3539037,
    src_qp = 0, wc_flags = 0, pkey_index = 16960, slid = 6529,
    sl = 134 '\206', dlid_path_bits = 127 '\177'}, dir = Rdma::RECV}
status = <value optimized out>
dir = <value optimized out>
q = {px = 0x7f8619814240}
recvEvents = 0
sendEvents = 0
__PRETTY_FUNCTION__ = "void Rdma::AsynchIO::processCompletions()"
And the following queue error appears on the client side when the failure occurs:

(gdb) info locals
e = {cq = {px = 0x7fbdd8000cf0, pn = {pi_ = 0x7fbdd8000e30}}, wc = {
    wr_id = 140453349439504, status = IBV_WC_REM_INV_REQ_ERR, opcode = 32701,
    vendor_err = 138, byte_len = 53, imm_data = 16, qp_num = 2621532,
    src_qp = 3355462216, wc_flags = 32701, pkey_index = 37744, slid = 28430,
    sl = 53 '5', dlid_path_bits = 0 '\000'}, dir = Rdma::SEND}
status = <value optimized out>
dir = <value optimized out>
q = {px = 0x7fbdd8000b20}
recvEvents = 0
sendEvents = 0
__PRETTY_FUNCTION__ = "void Rdma::AsynchIO::processCompletions()"
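For anyone reading the two traces side by side: the broker sees IBV_WC_LOC_LEN_ERR on a RECV completion and the client sees IBV_WC_REM_INV_REQ_ERR on a SEND completion. As far as I understand the libibverbs documentation, the first generally means a message did not fit the buffer involved locally, and the second means the remote side rejected the request as invalid. The sketch below just spells that reading out; it is not the qpid code. WcStatus, Dir and describeCompletion are hypothetical names, and the enum is a local stand-in so the snippet compiles without libibverbs (with the real library, ibv_wc_status_str() from <infiniband/verbs.h> would give the status text).

```cpp
#include <string>

// Local stand-in for a few values of `enum ibv_wc_status`
// (<infiniband/verbs.h>). Only the names come from the gdb output;
// the numeric values here are NOT claimed to match the real header.
enum class WcStatus {
    Success,
    LocLenErr,     // IBV_WC_LOC_LEN_ERR (broker side, dir = Rdma::RECV)
    RemInvReqErr,  // IBV_WC_REM_INV_REQ_ERR (client side, dir = Rdma::SEND)
    Other
};

// Stand-in for the Rdma::SEND / Rdma::RECV direction seen in the traces.
enum class Dir { Send, Recv };

// Hypothetical helper: turn a completion status and direction into a
// short diagnostic hint that could be logged before the error callback.
std::string describeCompletion(WcStatus status, Dir dir) {
    const std::string side = (dir == Dir::Recv) ? "recv" : "send";
    switch (status) {
    case WcStatus::Success:
        return side + ": ok";
    case WcStatus::LocLenErr:
        return side + ": local length error (message does not fit the buffer)";
    case WcStatus::RemInvReqErr:
        return side + ": remote invalid request (peer rejected the message)";
    case WcStatus::Other:
        break;
    }
    return side + ": other error";
}
```

Calling describeCompletion(WcStatus::LocLenErr, Dir::Recv) gives the hint for the broker-side trace, and describeCompletion(WcStatus::RemInvReqErr, Dir::Send) gives the one for the client-side trace. Whether the pairing really points at a buffer size mismatch between the two ends is an assumption on my part, not something the traces prove.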
Created attachment 494764 [details] Notes from the latest debug session.
Upstream bug report: https://issues.apache.org/jira/browse/QPID-3227
Patched upstream: http://svn.apache.org/viewvc?view=revision&revision=1097102
Fix merged to git mrg_2.0.x repo: http://mrg1.lab.bos.redhat.com/git/?p=qpid.git;a=commitdiff;h=a635e062394bb30cd5aa6ca41ca0bc88773fd51e
*** Bug 674056 has been marked as a duplicate of this bug. ***
We see BZ674011 (root cause: "hanging perftest"), which is also marked as a blocker for BZ484691, as a potential blocker for this defect. At the moment, the specific (Mellanox) hardware this issue requires is in limited supply and is not available for as long as testing this issue would take. However, we believe this issue might not be hardware-specific, since BZ674056 is marked as a duplicate of it and was seen on mrg4 and mrg5 with Chelsio cards. Our effort to verify whether this issue is fixed on Chelsio cards is blocked by the hanging perftest described in BZ674011.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHEA-2011-0890.html