Bug 466086 - IPoIB-CM connectivity problem with eHCA adapters
Summary: IPoIB-CM connectivity problem with eHCA adapters
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.2
Hardware: ppc64
OS: Linux
Priority: medium
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Doug Ledford
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2008-10-08 08:39 UTC by Yury Konovalov
Modified: 2009-09-03 14:18 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-09-02 08:44:30 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2009:1243 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.4 kernel security and bug fix update 2009-09-01 08:53:34 UTC

Description Yury Konovalov 2008-10-08 08:39:11 UTC
Dual-port eHCA adapters lose connectivity to random nodes for no visible reason. It can happen over time, with or without any load on IB. Ping from the affected IBM p520 (with eHCA) to some nodes in the cluster (IBM x86_64 blades with mthca adapters, or another p520) simply stops working. When this happens, ibv_ud_pingpong doesn't work either. There are no meaningful messages in the log or dmesg besides messages from the IBM GPFS cluster about losing membership in the cluster. Sometimes I see eHCA dump messages in dmesg, but they don't seem to be correlated with the actual connectivity failure. Sometimes dmesg contains the following message: "ib0: failed to allocate receive buffer ###". After some time we get tons of such messages in dmesg. Then the server becomes unresponsive and basically hangs. Powering off and rebooting restores connectivity with all nodes in the cluster. There are no such problems on nodes with different HCAs.

Hardware:
Server: IBM p520
HCA: eHCA (hw_ver: 0x1000002) according to ibv_devinfo
IB Switch: Cisco-BCH TopspinOS 2.9.0 releng #163

Software details:
Kernel: 2.6.18-92.1.13.el5
HCA driver: SVNEHCA_0025
IPoIB interface settings:
      - mode set to connected at boot by a custom script (the RHEL openibd scripts have no option for IPoIB-CM).
      - MTU set to 65520 at boot by a custom script (for the same reason).
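
For reference, the two settings above can be sketched roughly as follows. This is not the reporter's script (which is not attached to this bug); it is a minimal illustration that assumes the IPoIB driver exposes a writable `mode` attribute next to the standard `mtu` attribute under /sys/class/net/<iface>/. The function name and the `sysfs_root` parameter are hypothetical and exist only so the sketch can be tried against a scratch directory.

```python
from pathlib import Path


def set_ipoib_connected(iface, mtu=65520, sysfs_root="/sys/class/net"):
    """Switch an IPoIB interface to connected mode, then raise its MTU.

    Assumption: <sysfs_root>/<iface>/mode and .../mtu are writable
    (root privileges are needed on a real system).
    """
    iface_dir = Path(sysfs_root) / iface
    # Change the mode first: the 65520-byte MTU is only expected to be
    # accepted once the interface is in connected mode.
    (iface_dir / "mode").write_text("connected\n")
    (iface_dir / "mtu").write_text(f"{mtu}\n")
    # Read both values back so the caller can confirm what was written.
    mode = (iface_dir / "mode").read_text().strip()
    current_mtu = int((iface_dir / "mtu").read_text())
    return mode, current_mtu
```

On a real node this would be called once per port at boot, e.g. `set_ipoib_connected("ib0")` and `set_ipoib_connected("ib1")`.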
 
ib0       Link encap:InfiniBand  HWaddr 80:00:04:CC:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:192.168.1.253  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::202:5500:5187:b000/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
          RX packets:1781 errors:0 dropped:0 overruns:0 frame:0
          TX packets:15 errors:0 dropped:4 overruns:0 carrier:0
          collisions:0 txqueuelen:256
          RX bytes:145768 (142.3 KiB)  TX bytes:1056 (1.0 KiB)

ib1       Link encap:InfiniBand  HWaddr 80:00:04:CD:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:192.168.2.253  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::202:5500:5187:b040/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
          RX packets:2172615 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1419686 errors:0 dropped:1418 overruns:0 carrier:0
          collisions:0 txqueuelen:256
          RX bytes:83580962739 (77.8 GiB)  TX bytes:4524818982 (4.2 GiB)

---------
dmesg:
---------------
ehca lhca@1bc0c38: PU0000 EHCA_ERR:print_error_data QP 0x50d (resource=200000010000050d) has errors.
ehca lhca@1bc0c38: PU0000 EHCA_ERR:print_error_data Error data is available: 200000010000050d.
ehca lhca@1bc0c38: PU0000 EHCA_ERR:print_error_data EHCA ----- error data begin ---------------------------------------------------
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c000 ofs=0000 00000000000004d0 200000010000050d
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c010 ofs=0010 0100000000000310 8000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c020 ofs=0020 a000000500000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c030 ofs=0030 0000000001000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c040 ofs=0040 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c050 ofs=0050 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c060 ofs=0060 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c070 ofs=0070 000000000000ffff 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c080 ofs=0080 0000000000139a80 0000000000139a80
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c090 ofs=0090 0000000000139a80 00000000139bd900
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c0a0 ofs=00a0 00000000000003c2 0000000000000007
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c0b0 ofs=00b0 0000000000000002 0000000000000006
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c0c0 ofs=00c0 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c0d0 ofs=00d0 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c0e0 ofs=00e0 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c0f0 ofs=00f0 0000000000000000 0000000000000004
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c100 ofs=0100 0000000000000003 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c110 ofs=0110 0000000000000000 0000000000000018
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c120 ofs=0120 00000001ef9f2c80 0000000003afaa90
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c130 ofs=0130 0000000000139bdb 0000000000139bdb
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c140 ofs=0140 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c150 ofs=0150 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c160 ofs=0160 0000000000000199 000000000000015b
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c170 ofs=0170 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c180 ofs=0180 0000000000000007 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c190 ofs=0190 0000000000000000 000000011f89224a
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c1a0 ofs=01a0 0000000000000000 0000000000000159
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c1b0 ofs=01b0 00000001ef9f2c80 0000000003afaa90
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c1c0 ofs=01c0 0000000000000001 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c1d0 ofs=01d0 00000001ec110000 0000000003afb680
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c1e0 ofs=01e0 0000000000000005 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c1f0 ofs=01f0 0000000000000003 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c200 ofs=0200 0000000000139bd9 0000000000000002
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c210 ofs=0210 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c220 ofs=0220 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c230 ofs=0230 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c240 ofs=0240 0000000000000000 0000000000139bda
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c250 ofs=0250 00000000ab330f80 0000000000000013
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c260 ofs=0260 0000000000000013 0000000000000004
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c270 ofs=0270 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c280 ofs=0280 0000000000139a80 0000000000139a80
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c290 ofs=0290 0000000000139a80 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c2a0 ofs=02a0 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c2b0 ofs=02b0 139bda0000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c2c0 ofs=02c0 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c2d0 ofs=02d0 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c2e0 ofs=02e0 0000000000000000 143c000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c2f0 ofs=02f0 0000000000000000 6400000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c300 ofs=0300 6800000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c310 ofs=0310 4000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c320 ofs=0320 0000000000000000 02000000000000c8
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c330 ofs=0330 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c340 ofs=0340 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c350 ofs=0350 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c360 ofs=0360 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c370 ofs=0370 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c380 ofs=0380 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c390 ofs=0390 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c3a0 ofs=03a0 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c3b0 ofs=03b0 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c3c0 ofs=03c0 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c3d0 ofs=03d0 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c3e0 ofs=03e0 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c3f0 ofs=03f0 0000000000000000 0400000000000060
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c400 ofs=0400 8000000000000000 c000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c410 ofs=0410 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c420 ofs=0420 0000000003adf28b 00000001edb91ac0
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c430 ofs=0430 00000000000002bf 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c440 ofs=0440 0000000000000000 0004000000000004
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c450 ofs=0450 0000000000000004 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c460 ofs=0460 0300000000000068 8040000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c470 ofs=0470 c000c00000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c480 ofs=0480 0000000000000000 0000000006b9b03b
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c490 ofs=0490 0000000027e674d0 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c4a0 ofs=04a0 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c4b0 ofs=04b0 0000000000000000 0000000000000004
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c4c0 ofs=04c0 0000000000000002 0000000000000000
ehca lhca@1bc0c38: PU0000 EHCA_ERR:print_error_data EHCA ----- error data end ----------------------------------------------------
GPFS Deadman Switch timer [0] has expired; IOs in progress: 0




And another example:
-----------------------------------------
ib1: failed to allocate receive buffer 110
ib1: failed to allocate receive buffer 110
ib1: failed to allocate receive buffer 110
ib0: failed to allocate receive buffer 110
ib1: failed to allocate receive buffer 110
ib1: failed to allocate receive buffer 110
ib1: failed to allocate receive buffer 101
ib1: failed to allocate receive buffer 94
ib1: failed to allocate receive buffer 91
ib1: failed to allocate receive buffer 104
ib1: failed to allocate receive buffer 104
ib1: failed to allocate receive buffer 106
ib1: failed to allocate receive buffer 88
ib1: failed to allocate receive buffer 101
ib1: failed to allocate receive buffer 98
ib1: failed to allocate receive buffer 97
ib1: failed to allocate receive buffer 97
ib1: failed to allocate receive buffer 91
ib1: failed to allocate receive buffer 93
ib0: failed to allocate receive buffer 105
ib1: failed to allocate receive buffer 105
ib1: failed to allocate receive buffer 105
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib0: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib0: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 104
ib1: failed to allocate receive buffer 104
ib1: failed to allocate receive buffer 103
ib1: failed to allocate receive buffer 107
ib1: failed to allocate receive buffer 107
ib1: failed to allocate receive buffer 107
ib1: failed to allocate receive buffer 107
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib0: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 108
ib1: failed to allocate receive buffer 108
ib1: failed to allocate receive buffer 108
ib1: failed to allocate receive buffer 108
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib0: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
-----------------------------------------
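
To see how the failures split between the two ports, a log excerpt like the one above can be tallied with a few lines of Python. This is only a sketch: it assumes the messages appear one per line in exactly the "ibN: failed to allocate receive buffer M" form shown here, with no timestamp prefix, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as "ib1: failed to allocate receive buffer 110".
FAIL_RE = re.compile(r"^(ib\d+): failed to allocate receive buffer (\d+)$")


def tally_rx_alloc_failures(lines):
    """Count receive-buffer allocation failures per IPoIB interface."""
    per_iface = Counter()
    for line in lines:
        match = FAIL_RE.match(line.strip())
        if match:
            per_iface[match.group(1)] += 1
    return per_iface
```

Feeding it the saved dmesg output, for example `tally_rx_alloc_failures(open("dmesg.txt"))` with a hypothetical file name, gives a per-interface count that can be compared between the idle port and the busy one.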

Comment 1 Doug Ledford 2009-04-22 22:59:37 UTC
There were errors in the connected mode support (part of the reason for no script to enable it) prior to kernel 2.6.18-128.1.1.el5.  In addition, the ehca driver and the IPoIB driver have been updated as part of the rhel5.4 update cycle.  This issue should not exist with later kernels.
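
A rough way to check whether a given node is already past the kernel named above is to compare release strings numerically. The sketch below is an assumption-laden illustration, not the RPM version comparison algorithm: it only handles release strings of the form seen in this bug (digits separated by "." or "-" followed by ".el5"), and both function names are hypothetical.

```python
import re


def release_key(release):
    """Turn '2.6.18-92.1.13.el5' into the tuple (2, 6, 18, 92, 1, 13)."""
    numeric = release.split(".el")[0]
    return tuple(int(part) for part in re.split(r"[.-]", numeric))


def has_cm_fixes(release, first_fixed="2.6.18-128.1.1.el5"):
    """True if `release` is at or after the kernel named in this comment."""
    return release_key(release) >= release_key(first_fixed)
```

With these definitions the reporter's kernel, 2.6.18-92.1.13.el5, compares as older than 2.6.18-128.1.1.el5. On a live system the string to test would come from `platform.release()` or `uname -r`, which may carry an architecture or variant suffix that this sketch does not strip.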

Comment 2 RHEL Program Management 2009-04-27 14:49:18 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 3 Don Zickus 2009-05-06 17:14:46 UTC
The fix is included in kernel-2.6.18-144.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 7 errata-xmlrpc 2009-09-02 08:44:30 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html

