Dual-port eHCA adapters lose connectivity to random nodes for no visible reason. It can happen over time, with or without any load on IB. Ping from the affected IBM p520 (with eHCA) to some nodes in the cluster (IBM x86_64 blades with mthca adapters, or another p520) simply stops working. When it happens, ibv_ud_pingpong also doesn't work. There are no meaningful messages in the logs or dmesg besides messages from the IBM GPFS cluster about losing membership in the cluster. Sometimes I see eHCA dump messages in dmesg, but they don't seem to be correlated with the actual connectivity failure. Sometimes the following message appears in dmesg: "ib0: failed to allocate receive buffer ###". After some time we got tons of such messages in dmesg. Then the server becomes unresponsive and basically hangs. Pressing the power button and rebooting restores connectivity with all nodes in the cluster. There are no such problems on nodes with different HCAs.

Hardware:
Server: IBM p520
HCA: eHCA (hw_ver: 0x1000002) according to ibv_devinfo
IB Switch: Cisco-BCH TopspinOS 2.9.0 releng #163

Software details:
Kernel: 2.6.18-92.1.13.el5
HCA driver: SVNEHCA_0025
IPoIB interface settings:
- mode set to connected on boot with a custom script (since there is no option for IPoIB-CM in the RHEL openibd scripts).
- MTU set to 65520 on boot with a custom script (since there is no option for IPoIB-CM in the RHEL openibd scripts).
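The custom boot script itself is not included in this report. As a rough illustration only, the two IPoIB settings listed above could be applied along these lines, assuming the interface exposes the usual `mode` and `mtu` files under `/sys/class/net/<iface>/`. The function name `configure_ipoib_cm` and the `sysfs_root` parameter are hypothetical and exist only to make the sketch testable.

```python
from pathlib import Path


def configure_ipoib_cm(iface, mtu=65520, sysfs_root="/sys/class/net"):
    """Switch an IPoIB interface to connected mode, then raise its MTU.

    Hypothetical helper: assumes the IPoIB driver exposes 'mode' and
    'mtu' attributes for the interface under sysfs_root.
    """
    dev = Path(sysfs_root) / iface
    # Set the mode first: datagram mode limits the MTU to roughly the
    # IB link MTU, so 65520 is only expected to be accepted once the
    # interface is in connected mode.
    (dev / "mode").write_text("connected\n")
    (dev / "mtu").write_text("%d\n" % mtu)
```

On the affected server this would be called once per port, for example `configure_ipoib_cm("ib0")` and `configure_ipoib_cm("ib1")`, as root and after the interfaces exist.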
ib0       Link encap:InfiniBand  HWaddr 80:00:04:CC:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:192.168.1.253  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::202:5500:5187:b000/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
          RX packets:1781 errors:0 dropped:0 overruns:0 frame:0
          TX packets:15 errors:0 dropped:4 overruns:0 carrier:0
          collisions:0 txqueuelen:256
          RX bytes:145768 (142.3 KiB)  TX bytes:1056 (1.0 KiB)

ib1       Link encap:InfiniBand  HWaddr 80:00:04:CD:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:192.168.2.253  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::202:5500:5187:b040/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
          RX packets:2172615 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1419686 errors:0 dropped:1418 overruns:0 carrier:0
          collisions:0 txqueuelen:256
          RX bytes:83580962739 (77.8 GiB)  TX bytes:4524818982 (4.2 GiB)

--------- dmesg: ---------------
ehca lhca@1bc0c38: PU0000 EHCA_ERR:print_error_data QP 0x50d (resource=200000010000050d) has errors.
ehca lhca@1bc0c38: PU0000 EHCA_ERR:print_error_data Error data is available: 200000010000050d.
ehca lhca@1bc0c38: PU0000 EHCA_ERR:print_error_data EHCA ----- error data begin ---------------------------------------------------
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c000 ofs=0000 00000000000004d0 200000010000050d
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c010 ofs=0010 0100000000000310 8000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c020 ofs=0020 a000000500000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c030 ofs=0030 0000000001000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c040 ofs=0040 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c050 ofs=0050 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c060 ofs=0060 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c070 ofs=0070 000000000000ffff 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c080 ofs=0080 0000000000139a80 0000000000139a80
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c090 ofs=0090 0000000000139a80 00000000139bd900
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c0a0 ofs=00a0 00000000000003c2 0000000000000007
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c0b0 ofs=00b0 0000000000000002 0000000000000006
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c0c0 ofs=00c0 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c0d0 ofs=00d0 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c0e0 ofs=00e0 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c0f0 ofs=00f0 0000000000000000 0000000000000004
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c100 ofs=0100 0000000000000003 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c110 ofs=0110 0000000000000000 0000000000000018
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c120 ofs=0120 00000001ef9f2c80 0000000003afaa90
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c130 ofs=0130 0000000000139bdb 0000000000139bdb
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c140 ofs=0140 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c150 ofs=0150 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c160 ofs=0160 0000000000000199 000000000000015b
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c170 ofs=0170 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c180 ofs=0180 0000000000000007 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c190 ofs=0190 0000000000000000 000000011f89224a
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c1a0 ofs=01a0 0000000000000000 0000000000000159
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c1b0 ofs=01b0 00000001ef9f2c80 0000000003afaa90
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c1c0 ofs=01c0 0000000000000001 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c1d0 ofs=01d0 00000001ec110000 0000000003afb680
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c1e0 ofs=01e0 0000000000000005 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c1f0 ofs=01f0 0000000000000003 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c200 ofs=0200 0000000000139bd9 0000000000000002
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c210 ofs=0210 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c220 ofs=0220 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c230 ofs=0230 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c240 ofs=0240 0000000000000000 0000000000139bda
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c250 ofs=0250 00000000ab330f80 0000000000000013
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c260 ofs=0260 0000000000000013 0000000000000004
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c270 ofs=0270 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c280 ofs=0280 0000000000139a80 0000000000139a80
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c290 ofs=0290 0000000000139a80 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c2a0 ofs=02a0 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c2b0 ofs=02b0 139bda0000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c2c0 ofs=02c0 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c2d0 ofs=02d0 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c2e0 ofs=02e0 0000000000000000 143c000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c2f0 ofs=02f0 0000000000000000 6400000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c300 ofs=0300 6800000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c310 ofs=0310 4000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c320 ofs=0320 0000000000000000 02000000000000c8
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c330 ofs=0330 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c340 ofs=0340 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c350 ofs=0350 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c360 ofs=0360 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c370 ofs=0370 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c380 ofs=0380 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c390 ofs=0390 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c3a0 ofs=03a0 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c3b0 ofs=03b0 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c3c0 ofs=03c0 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c3d0 ofs=03d0 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c3e0 ofs=03e0 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c3f0 ofs=03f0 0000000000000000 0400000000000060
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c400 ofs=0400 8000000000000000 c000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c410 ofs=0410 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c420 ofs=0420 0000000003adf28b 00000001edb91ac0
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c430 ofs=0430 00000000000002bf 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c440 ofs=0440 0000000000000000 0004000000000004
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c450 ofs=0450 0000000000000004 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c460 ofs=0460 0300000000000068 8040000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c470 ofs=0470 c000c00000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c480 ofs=0480 0000000000000000 0000000006b9b03b
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c490 ofs=0490 0000000027e674d0 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c4a0 ofs=04a0 0000000000000000 0000000000000000
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c4b0 ofs=04b0 0000000000000000 0000000000000004
EHCA_DMP:print_error_data resource=200000010000050d adr=c0000001d266c4c0 ofs=04c0 0000000000000002 0000000000000000
ehca lhca@1bc0c38: PU0000 EHCA_ERR:print_error_data EHCA ----- error data end ----------------------------------------------------
GPFS Deadman Switch timer [0] has expired; IOs in progress: 0

And another example:
-----------------------------------------
ib1: failed to allocate receive buffer 110
ib1: failed to allocate receive buffer 110
ib1: failed to allocate receive buffer 110
ib0: failed to allocate receive buffer 110
ib1: failed to allocate receive buffer 110
ib1: failed to allocate receive buffer 110
ib1: failed to allocate receive buffer 101
ib1: failed to allocate receive buffer 94
ib1: failed to allocate receive buffer 91
ib1: failed to allocate receive buffer 104
ib1: failed to allocate receive buffer 104
ib1: failed to allocate receive buffer 106
ib1: failed to allocate receive buffer 88
ib1: failed to allocate receive buffer 101
ib1: failed to allocate receive buffer 98
ib1: failed to allocate receive buffer 97
ib1: failed to allocate receive buffer 97
ib1: failed to allocate receive buffer 91
ib1: failed to allocate receive buffer 93
ib0: failed to allocate receive buffer 105
ib1: failed to allocate receive buffer 105
ib1: failed to allocate receive buffer 105
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib0: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib0: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 104
ib1: failed to allocate receive buffer 104
ib1: failed to allocate receive buffer 103
ib1: failed to allocate receive buffer 107
ib1: failed to allocate receive buffer 107
ib1: failed to allocate receive buffer 107
ib1: failed to allocate receive buffer 107
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib0: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 111
ib1: failed to allocate receive buffer 108
ib1: failed to allocate receive buffer 108
ib1: failed to allocate receive buffer 108
ib1: failed to allocate receive buffer 108
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib0: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
ib1: failed to allocate receive buffer 109
-----------------------------------------
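To see how the receive-buffer failures are split between the two ports, a dmesg capture like the one above could be tallied per interface. This is only a sketch: the function name `count_alloc_failures` is made up for illustration, and the pattern assumes the message text is exactly as it appears in this report.

```python
import re
from collections import Counter

# Matches lines such as "ib1: failed to allocate receive buffer 110".
ALLOC_FAIL = re.compile(r"(ib\d+): failed to allocate receive buffer (\d+)")


def count_alloc_failures(log_text):
    """Count receive-buffer allocation failures per IPoIB interface."""
    return Counter(match.group(1) for match in ALLOC_FAIL.finditer(log_text))


# Three messages copied from the dmesg excerpt in this report.
sample = (
    "ib1: failed to allocate receive buffer 110\n"
    "ib0: failed to allocate receive buffer 110\n"
    "ib1: failed to allocate receive buffer 101\n"
)
print(count_alloc_failures(sample))
```

Because the pattern is searched anywhere in the text rather than line by line, the same function also works on a capture whose line breaks were lost.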
There were errors in the connected mode support prior to kernel 2.6.18-128.1.1.el5 (part of the reason there is no script option to enable it). In addition, the ehca driver and the IPoIB driver have been updated as part of the RHEL 5.4 update cycle. This issue should not exist with later kernels.
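A quick way to check whether a node is still running a kernel older than 2.6.18-128.1.1.el5 is to compare the numeric fields of its release string. The sketch below is a simplification, not full RPM version comparison, and the helper names `release_key` and `predates_fix` are hypothetical.

```python
import re

# Kernel release named in the comment above as the connected-mode threshold.
FIXED_IN = "2.6.18-128.1.1.el5"


def release_key(release):
    """Turn '2.6.18-92.1.13.el5' into (2, 6, 18, 92, 1, 13)."""
    # Drop the '.el5' distribution suffix, then keep only the numeric fields.
    numeric_part = release.split(".el")[0]
    return tuple(int(field) for field in re.findall(r"\d+", numeric_part))


def predates_fix(release, fixed_in=FIXED_IN):
    """Return True if 'release' sorts before 'fixed_in' field by field."""
    return release_key(release) < release_key(fixed_in)
```

With this, `predates_fix("2.6.18-92.1.13.el5")` is true for the reporter's kernel, while the threshold release itself is not flagged.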
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
in kernel-2.6.18-144.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1243.html