Bug 681313 - Qpidd Server crashes when implementing RDMA
Summary: Qpidd Server crashes when implementing RDMA
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: Development
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: 2.0
Target Release: ---
Assignee: Ken Giusti
QA Contact: ppecka
URL:
Whiteboard:
Duplicates: 674056
Depends On: 674011
Blocks: 700156 484691
 
Reported: 2011-03-01 18:40 UTC by Tom Tracy
Modified: 2011-06-23 15:43 UTC
CC List: 9 users

Fixed In Version: qpid-cpp-mrg-0.10-6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 700156
Environment:
Last Closed: 2011-06-23 15:43:02 UTC
Target Upstream Version:


Attachments
Notes from the latest debug session. (16.19 KB, text/plain)
2011-04-25 21:12 UTC, Ken Giusti


Links
System ID: Red Hat Product Errata RHEA-2011:0890
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Enterprise MRG Messaging 2.0 Release
Last Updated: 2011-06-23 15:42:41 UTC

Description Tom Tracy 2011-03-01 18:40:38 UTC
Description of problem: Running qpid-perftest with either Mellanox 10Gb (RoCE) or Mellanox IB (RDMA) crashes the qpidd server


Version-Release number of selected component (if applicable):

qpid-cpp-server-0.9.1073306-1.el6.x86_64

How reproducible: Happens on every test run


Steps to Reproduce:
1. start qpidd process /usr/sbin/qpidd --auth no -m no --no-data-dir --mgmt-qmf1 no --mgmt-qmf2 no --load-module /usr/lib64/qpid/daemon/rdma.so --log-enable=info+
2. start the test 
 /usr/bin/qpid-perftest -P rdma -b 192.168.10.42 --count 100000 --size 8 --npubs 1 --nsubs 1 --qt 13

  
Actual results:
2011-03-01 13:19:51 error Caught exception in state: 3 with event: 1: Couldn't find existing Connection

From the client
2011-03-01 13:19:28 error RDMA: qp=0x7fd5b000bf50: Deleting queue before all write buffers finished
2011-03-01 13:19:28 warning Connection [192.168.10.42:34451 192.168.10.44:5672] closed
PublishThread exception: Connection [192.168.10.42:34451 192.168.10.44:5672] closed


Expected results:

No crashes; the test runs to completion.


Additional info: I went back and tested different builds and found that this problem first appeared in the qpid-cpp-0.7.946106-9.el6 build.

I have attached the latest coredump from this crash.

Comment 1 Tom Tracy 2011-03-01 18:42:51 UTC
The coredump is too large to attach, so here is a pointer to where I put it on the web:

http://perf1.lab.bos.redhat.com/network/qpidd_rdma_crash.tar

Comment 2 Andrew Stitcher 2011-03-02 15:57:26 UTC
This may have the same underlying cause as Bug 674056.

Comment 5 Ted Ross 2011-03-23 13:36:16 UTC
We need to reconsider the blocker status of this bug. I don't think we can ship with it.

Comment 6 Tom Tracy 2011-03-23 13:41:54 UTC
This affects both Mellanox 10Gb and Mellanox InfiniBand. It also affects publishing reference papers on the subject matter, so I consider it a blocker.

Comment 8 Ken Giusti 2011-04-25 20:23:45 UTC
At the point where the client fails, the following error is generated on the broker side:

#5  Rdma::AsynchIO::processCompletions (this=0x7f8619814cf0)
    at qpid/sys/rdma/RdmaIO.cpp:385
385                     errorCallback(*this);
(gdb) list
380                         } else {
381                             ++recvEvents;
382                         }
383                         continue;
384                     }
385                     errorCallback(*this);
386                     // TODO: Probably need to flush queues at this point
387                     return;
388                 }
389
(gdb) info local
e = {cq = {px = 0x7f86198144a0, pn = {pi_ = 0x7f8619814570}}, wc = {
    wr_id = 140213934450336, status = IBV_WC_LOC_LEN_ERR, opcode = 32646,
    vendor_err = 215, byte_len = 32646, imm_data = 0, qp_num = 3539037,
    src_qp = 0, wc_flags = 0, pkey_index = 16960, slid = 6529,
    sl = 134 '\206', dlid_path_bits = 127 '\177'}, dir = Rdma::RECV}
status = <value optimized out>
dir = <value optimized out>
q = {px = 0x7f8619814240}
recvEvents = 0
sendEvents = 0
__PRETTY_FUNCTION__ = "void Rdma::AsynchIO::processCompletions()"
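
For reference, here is a minimal sketch, assuming the libibverbs API, of how a failed work completion reaches an error branch like the one above when draining a completion queue. The function below is illustrative, not the actual RdmaIO.cpp code:

#include <infiniband/verbs.h>
#include <cstdio>

// Drain one completion queue and dispatch on completion status. Any
// status other than IBV_WC_SUCCESS (e.g. the IBV_WC_LOC_LEN_ERR above,
// meaning the incoming message did not fit the posted receive buffer)
// takes the error branch, analogous to errorCallback(*this) in the
// frame shown in the backtrace.
void drainCompletions(ibv_cq* cq) {
    ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) > 0) {
        if (wc.status != IBV_WC_SUCCESS) {
            std::fprintf(stderr, "completion failed: %s (wr_id=%llu)\n",
                         ibv_wc_status_str(wc.status),
                         (unsigned long long)wc.wr_id);
            return; // mirrors the errorCallback(*this); return; sequence
        }
        // success path: count send vs. recv events and keep polling
    }
}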

Comment 9 Ken Giusti 2011-04-25 20:44:24 UTC
And the following completion error appears on the client side when the failure occurs. Here the SEND work request fails with IBV_WC_REM_INV_REQ_ERR, which would be consistent with the broker-side IBV_WC_LOC_LEN_ERR above: the responder rejects an incoming send that does not fit its posted receive buffer.

(gdb) info locals
e = {cq = {px = 0x7fbdd8000cf0, pn = {pi_ = 0x7fbdd8000e30}}, wc = {wr_id = 140453349439504, status = IBV_WC_REM_INV_REQ_ERR, opcode = 32701, vendor_err = 138,
    byte_len = 53, imm_data = 16, qp_num = 2621532, src_qp = 3355462216, wc_flags = 32701, pkey_index = 37744, slid = 28430, sl = 53 '5',
    dlid_path_bits = 0 '\000'}, dir = Rdma::SEND}
status = <value optimized out>
dir = <value optimized out>
q = {px = 0x7fbdd8000b20}
recvEvents = 0
sendEvents = 0
__PRETTY_FUNCTION__ = "void Rdma::AsynchIO::processCompletions()"
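
For reference, a minimal sketch of the receive-side contract behind this pairing, assuming a reliable-connection QP; postRecv is an illustrative helper, not qpid code. If the posted receive buffer is smaller than the peer's send, the receive completes locally with IBV_WC_LOC_LEN_ERR and the peer's send completes with IBV_WC_REM_INV_REQ_ERR:

#include <infiniband/verbs.h>
#include <cstdint>

// Post a single receive buffer. len must cover the largest message the
// peer may send; otherwise the receive completes with IBV_WC_LOC_LEN_ERR
// (broker side, comment 8) and the peer's send completes with
// IBV_WC_REM_INV_REQ_ERR (client side, this comment).
bool postRecv(ibv_qp* qp, ibv_mr* mr, void* buf, uint32_t len) {
    ibv_sge sge = {};
    sge.addr   = reinterpret_cast<uintptr_t>(buf);
    sge.length = len;
    sge.lkey   = mr->lkey;

    ibv_recv_wr wr = {};
    ibv_recv_wr* bad = nullptr;
    wr.wr_id   = reinterpret_cast<uintptr_t>(buf);
    wr.sg_list = &sge;
    wr.num_sge = 1;
    return ibv_post_recv(qp, &wr, &bad) == 0;
}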

Comment 10 Ken Giusti 2011-04-25 21:12:35 UTC
Created attachment 494764 [details]
Notes from the latest debug session.

Comment 11 Ken Giusti 2011-04-26 20:11:57 UTC
Upstream bug report:
https://issues.apache.org/jira/browse/QPID-3227

Comment 12 Ken Giusti 2011-04-27 12:53:37 UTC
Patched upstream:

http://svn.apache.org/viewvc?view=revision&revision=1097102

Comment 14 Andrew Stitcher 2011-05-13 03:37:02 UTC
*** Bug 674056 has been marked as a duplicate of this bug. ***

Comment 15 ppecka 2011-05-30 09:19:04 UTC
We see BZ674011 as a potential blocker for this defect: its root cause is a hanging perftest, and it is also marked as a blocker for BZ484691.
At the moment only a limited amount of the specific (Mellanox) hardware required to reproduce this issue is available, and it is not accessible for as long as testing would require.
However, we believe this issue might not be hardware-specific, since BZ674056, which was seen on mrg4 and mrg5 with Chelsio cards, is marked as a duplicate of it. Our effort to verify whether this issue is fixed on Chelsio cards is blocked by the hanging perftest described in BZ674011.

Comment 18 errata-xmlrpc 2011-06-23 15:43:02 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0890.html

