| Summary: | Qpidd Server crashes when implementing RDMA | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Tom Tracy <ttracy> | ||||
| Component: | qpid-cpp | Assignee: | Ken Giusti <kgiusti> | ||||
| Status: | CLOSED ERRATA | QA Contact: | ppecka <ppecka> | ||||
| Severity: | urgent | Docs Contact: | |||||
| Priority: | urgent | ||||||
| Version: | Development | CC: | astitcher, freznice, gsim, iboverma, jross, kgiusti, ppecka, tross, ttracy | ||||
| Target Milestone: | 2.0 | ||||||
| Target Release: | --- | ||||||
| Hardware: | x86_64 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | qpid-cpp-mrg-0.10-6 | Doc Type: | Bug Fix | ||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | |||||||
| : | 700156 (view as bug list) | Environment: | |||||
| Last Closed: | 2011-06-23 15:43:02 UTC | Type: | --- | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Bug Depends On: | 674011 | ||||||
| Bug Blocks: | 700156, 484691 | ||||||
| Attachments: |
Description
Tom Tracy
2011-03-01 18:40:38 UTC
The coredump is too large to attach, so here is a pointer to where I put it on the web: http://perf1.lab.bos.redhat.com/network/qpidd_rdma_crash.tar

This may be the same underlying cause as Bug 674056.

We need to reconsider the blocker status of this bug. I don't think we can ship with this bug. This affects both Mellanox 10Gb and Mellanox Infiniband. It also affects publishing reference papers on the subject matter, so I consider it a blocker.

At the point where the client fails, the following error is generated on the broker side:
#5 Rdma::AsynchIO::processCompletions (this=0x7f8619814cf0)
at qpid/sys/rdma/RdmaIO.cpp:385
385 errorCallback(*this);
(gdb) list
380 } else {
381 ++recvEvents;
382 }
383 continue;
384 }
385 errorCallback(*this);
386 // TODO: Probably need to flush queues at this point
387 return;
388 }
389
(gdb) info local
e = {cq = {px = 0x7f86198144a0, pn = {pi_ = 0x7f8619814570}}, wc = {
wr_id = 140213934450336, status = IBV_WC_LOC_LEN_ERR, opcode = 32646,
vendor_err = 215, byte_len = 32646, imm_data = 0, qp_num = 3539037,
src_qp = 0, wc_flags = 0, pkey_index = 16960, slid = 6529,
sl = 134 '\206', dlid_path_bits = 127 '\177'}, dir = Rdma::RECV}
status = <value optimized out>
dir = <value optimized out>
q = {px = 0x7f8619814240}
recvEvents = 0
sendEvents = 0
__PRETTY_FUNCTION__ = "void Rdma::AsynchIO::processCompletions()"
And the following queue error appears on the client side when the failure occurs:
(gdb) info locals
e = {cq = {px = 0x7fbdd8000cf0, pn = {pi_ = 0x7fbdd8000e30}}, wc = {wr_id = 140453349439504, status = IBV_WC_REM_INV_REQ_ERR, opcode = 32701, vendor_err = 138,
byte_len = 53, imm_data = 16, qp_num = 2621532, src_qp = 3355462216, wc_flags = 32701, pkey_index = 37744, slid = 28430, sl = 53 '5',
dlid_path_bits = 0 '\000'}, dir = Rdma::SEND}
status = <value optimized out>
dir = <value optimized out>
q = {px = 0x7fbdd8000b20}
recvEvents = 0
sendEvents = 0
__PRETTY_FUNCTION__ = "void Rdma::AsynchIO::processCompletions()"
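A note on reading the two `wc` dumps above: per the `ibv_poll_cq` man page, when a completion's status is anything other than `IBV_WC_SUCCESS`, only `wr_id`, `status`, `qp_num` and `vendor_err` are valid. That fits the dumps: `opcode = 32646` is 0x7f86 and `opcode = 32701` is 0x7fbd, the same upper bits as the pointers in each frame, so those fields look like leftover memory rather than real values. The following is a minimal sketch of logging only the valid fields of a failed completion. `WcStatus`, `WorkCompletion`, `statusName`, `describeCompletion` and `brokerSampleCheck` are hypothetical stand-ins so the example builds without libibverbs; they are not taken from RdmaIO.cpp or from the fix.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical stand-ins so the sketch builds without libibverbs.
// Real code would use ibv_wc and ibv_wc_status from <infiniband/verbs.h>.
enum WcStatus { WC_SUCCESS, WC_LOC_LEN_ERR, WC_REM_INV_REQ_ERR, WC_OTHER_ERR };

struct WorkCompletion {
    std::uint64_t wr_id;
    WcStatus status;
    std::uint32_t opcode;
    std::uint32_t vendor_err;
    std::uint32_t byte_len;
    std::uint32_t qp_num;
};

// Map the stand-in status to the verbs status name seen in the gdb output.
const char* statusName(WcStatus s) {
    switch (s) {
        case WC_SUCCESS:         return "IBV_WC_SUCCESS";
        case WC_LOC_LEN_ERR:     return "IBV_WC_LOC_LEN_ERR";
        case WC_REM_INV_REQ_ERR: return "IBV_WC_REM_INV_REQ_ERR";
        default:                 return "other error";
    }
}

// Build a log line from a completion. The four fields that are valid for
// any status are always printed; opcode and byte_len are printed only for
// a successful completion, because they are undefined otherwise.
std::string describeCompletion(const WorkCompletion& wc) {
    std::ostringstream os;
    os << "wr_id=" << wc.wr_id
       << " status=" << statusName(wc.status)
       << " qp_num=" << wc.qp_num
       << " vendor_err=" << wc.vendor_err;
    if (wc.status == WC_SUCCESS) {
        os << " opcode=" << wc.opcode << " byte_len=" << wc.byte_len;
    }
    return os.str();
}

// Usage check with the broker-side values from the gdb dump: the bogus
// opcode and byte_len (32646) must not appear in the log line.
inline void brokerSampleCheck() {
    WorkCompletion wc = {140213934450336ULL, WC_LOC_LEN_ERR,
                         32646, 215, 32646, 3539037};
    assert(describeCompletion(wc).find("opcode") == std::string::npos);
}
```

With real verbs types, `ibv_wc_status_str()` would replace the hand-written `statusName` here.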
Created attachment 494764 [details]
Notes from the latest debug session.
Upstream bug report: https://issues.apache.org/jira/browse/QPID-3227
Patched upstream: http://svn.apache.org/viewvc?view=revision&revision=1097102
Fix merged to git mrg_2.0.x repo: http://mrg1.lab.bos.redhat.com/git/?p=qpid.git;a=commitdiff;h=a635e062394bb30cd5aa6ca41ca0bc88773fd51e

*** Bug 674056 has been marked as a duplicate of this bug. ***

We see BZ674011 as a potential blocker for this defect, where the root cause is "hanging perftest", which is also marked as a blocker for BZ484691. At the moment, the specific (Mellanox) hardware required by this issue is available only in limited amounts and is not accessible for as long as testing this issue requires. However, we believe this issue might not be hardware specific, since BZ674056 is marked as a duplicate bug and was seen on mrg4 and mrg5 with Chelsio cards. Our effort to verify whether this issue is fixed on Chelsio cards is blocked by the hanging perftest described in BZ674011.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0890.html