Bug 681313 - Qpidd Server crashes when implementing RDMA
Summary: Qpidd Server crashes when implementing RDMA
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: Development
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: 2.0
Target Release: ---
Assignee: Ken Giusti
QA Contact: ppecka
URL:
Whiteboard:
Duplicates: 674056
Depends On: 674011
Blocks: 700156 484691
 
Reported: 2011-03-01 18:40 UTC by Tom Tracy
Modified: 2011-06-23 15:43 UTC
CC List: 9 users

Fixed In Version: qpid-cpp-mrg-0.10-6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 700156
Environment:
Last Closed: 2011-06-23 15:43:02 UTC
Target Upstream Version:


Attachments
Notes from the latest debug session. (16.19 KB, text/plain)
2011-04-25 21:12 UTC, Ken Giusti


Links
System ID: Red Hat Product Errata RHEA-2011:0890
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Enterprise MRG Messaging 2.0 Release
Last Updated: 2011-06-23 15:42:41 UTC

Description Tom Tracy 2011-03-01 18:40:38 UTC
Description of problem: Running qpid-perftest with either Mellanox 10Gb (RoCE) or Mellanox IB (RDMA) crashes the qpidd server


Version-Release number of selected component (if applicable):

qpid-cpp-server-0.9.1073306-1.el6.x86_64

How reproducible: Happens on every test run


Steps to Reproduce:
1. start qpidd process /usr/sbin/qpidd --auth no -m no --no-data-dir --mgmt-qmf1 no --mgmt-qmf2 no --load-module /usr/lib64/qpid/daemon/rdma.so --log-enable=info+
2. start the test 
 /usr/bin/qpid-perftest -P rdma -b 192.168.10.42 --count 100000 --size 8 --npubs 1 --nsubs 1 --qt 13

  
Actual results:
2011-03-01 13:19:51 error Caught exception in state: 3 with event: 1: Couldn't find existing Connection

From the client
2011-03-01 13:19:28 error RDMA: qp=0x7fd5b000bf50: Deleting queue before all write buffers finished
2011-03-01 13:19:28 warning Connection [192.168.10.42:34451 192.168.10.44:5672] closed
PublishThread exception: Connection [192.168.10.42:34451 192.168.10.44:5672] closed


Expected results:

No crashes; the test runs to completion.


Additional info: I went back and tested different builds and found that this problem first appeared in the qpid-cpp-0.7.946106-9.el6 build.

I have attached the latest coredump from this crash.

Comment 1 Tom Tracy 2011-03-01 18:42:51 UTC
The coredump is too large to attach, so here is a pointer to where I put it on the web:

http://perf1.lab.bos.redhat.com/network/qpidd_rdma_crash.tar

Comment 2 Andrew Stitcher 2011-03-02 15:57:26 UTC
This may have the same underlying cause as Bug 674056.

Comment 5 Ted Ross 2011-03-23 13:36:16 UTC
We need to reconsider the blocker status of this bug. I don't think we can ship with it.

Comment 6 Tom Tracy 2011-03-23 13:41:54 UTC
This affects both Mellanox 10Gb and Mellanox InfiniBand. It also affects publishing reference papers on the subject matter, so I consider it a blocker.

Comment 8 Ken Giusti 2011-04-25 20:23:45 UTC
At the point where the client fails, the following error is generated on the broker side:

#5  Rdma::AsynchIO::processCompletions (this=0x7f8619814cf0)
    at qpid/sys/rdma/RdmaIO.cpp:385
385                     errorCallback(*this);
(gdb) list
380                         } else {
381                             ++recvEvents;
382                         }
383                         continue;
384                     }
385                     errorCallback(*this);
386                     // TODO: Probably need to flush queues at this point
387                     return;
388                 }
389
(gdb) info local
e = {cq = {px = 0x7f86198144a0, pn = {pi_ = 0x7f8619814570}}, wc = {
    wr_id = 140213934450336, status = IBV_WC_LOC_LEN_ERR, opcode = 32646,
    vendor_err = 215, byte_len = 32646, imm_data = 0, qp_num = 3539037,
    src_qp = 0, wc_flags = 0, pkey_index = 16960, slid = 6529,
    sl = 134 '\206', dlid_path_bits = 127 '\177'}, dir = Rdma::RECV}
status = <value optimized out>
dir = <value optimized out>
q = {px = 0x7f8619814240}
recvEvents = 0
sendEvents = 0
__PRETTY_FUNCTION__ = "void Rdma::AsynchIO::processCompletions()"
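
For reference, here is a minimal sketch, assuming the libibverbs API, of how a failed work completion reaches an error branch like the one above when draining a completion queue. The function below is illustrative, not the actual RdmaIO.cpp code:

#include <infiniband/verbs.h>
#include <cstdio>

// Drain one completion queue and dispatch on completion status. Any
// status other than IBV_WC_SUCCESS (e.g. the IBV_WC_LOC_LEN_ERR above,
// meaning the incoming message did not fit the posted receive buffer)
// takes the error branch, analogous to errorCallback(*this) in the
// frame shown in the backtrace.
void drainCompletions(ibv_cq* cq) {
    ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) > 0) {
        if (wc.status != IBV_WC_SUCCESS) {
            std::fprintf(stderr, "completion failed: %s (wr_id=%llu)\n",
                         ibv_wc_status_str(wc.status),
                         (unsigned long long)wc.wr_id);
            return; // mirrors the errorCallback(*this); return; sequence
        }
        // success path: count send vs. recv events and keep polling
    }
}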

Comment 9 Ken Giusti 2011-04-25 20:44:24 UTC
And the following completion error appears on the client side when the failure occurs. Here the SEND work request fails with IBV_WC_REM_INV_REQ_ERR, which would be consistent with the broker-side IBV_WC_LOC_LEN_ERR above: the responder rejects an incoming send that does not fit its posted receive buffer.

(gdb) info locals
e = {cq = {px = 0x7fbdd8000cf0, pn = {pi_ = 0x7fbdd8000e30}}, wc = {wr_id = 140453349439504, status = IBV_WC_REM_INV_REQ_ERR, opcode = 32701, vendor_err = 138,
    byte_len = 53, imm_data = 16, qp_num = 2621532, src_qp = 3355462216, wc_flags = 32701, pkey_index = 37744, slid = 28430, sl = 53 '5',
    dlid_path_bits = 0 '\000'}, dir = Rdma::SEND}
status = <value optimized out>
dir = <value optimized out>
q = {px = 0x7fbdd8000b20}
recvEvents = 0
sendEvents = 0
__PRETTY_FUNCTION__ = "void Rdma::AsynchIO::processCompletions()"
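
For reference, a minimal sketch of the receive-side contract behind this pairing, assuming a reliable-connection QP; postRecv is an illustrative helper, not qpid code. If the posted receive buffer is smaller than the peer's send, the receive completes locally with IBV_WC_LOC_LEN_ERR and the peer's send completes with IBV_WC_REM_INV_REQ_ERR:

#include <infiniband/verbs.h>
#include <cstdint>

// Post a single receive buffer. len must cover the largest message the
// peer may send; otherwise the receive completes with IBV_WC_LOC_LEN_ERR
// (broker side, comment 8) and the peer's send completes with
// IBV_WC_REM_INV_REQ_ERR (client side, this comment).
bool postRecv(ibv_qp* qp, ibv_mr* mr, void* buf, uint32_t len) {
    ibv_sge sge = {};
    sge.addr   = reinterpret_cast<uintptr_t>(buf);
    sge.length = len;
    sge.lkey   = mr->lkey;

    ibv_recv_wr wr = {};
    ibv_recv_wr* bad = nullptr;
    wr.wr_id   = reinterpret_cast<uintptr_t>(buf);
    wr.sg_list = &sge;
    wr.num_sge = 1;
    return ibv_post_recv(qp, &wr, &bad) == 0;
}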

Comment 10 Ken Giusti 2011-04-25 21:12:35 UTC
Created attachment 494764 [details]
Notes from the latest debug session.

Comment 11 Ken Giusti 2011-04-26 20:11:57 UTC
Upstream bug report:
https://issues.apache.org/jira/browse/QPID-3227

Comment 12 Ken Giusti 2011-04-27 12:53:37 UTC
Patched upstream:

http://svn.apache.org/viewvc?view=revision&revision=1097102

Comment 14 Andrew Stitcher 2011-05-13 03:37:02 UTC
*** Bug 674056 has been marked as a duplicate of this bug. ***

Comment 15 ppecka 2011-05-30 09:19:04 UTC
We see BZ674011 as a potential blocker for this defect: its root cause is a hanging perftest, and it is also marked as a blocker for BZ484691.
At the moment only a limited amount of the specific (Mellanox) hardware required to reproduce this issue is available, and it is not accessible for as long as testing would require.
However, we believe this issue might not be hardware-specific, since BZ674056, which was seen on mrg4 and mrg5 with Chelsio cards, is marked as a duplicate of it. Our effort to verify whether this issue is fixed on Chelsio cards is blocked by the hanging perftest described in BZ674011.

Comment 18 errata-xmlrpc 2011-06-23 15:43:02 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0890.html

