Bug 1886812
Summary: [RHEL8.3] [opa-ff] create soft link for dsap plugin

| Field | Value | Field | Value |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Brian Chae <bchae> |
| Component: | opa-ff | Assignee: | Honggang LI <honli> |
| Status: | CLOSED ERRATA | QA Contact: | zguo <zguo> |
| Severity: | unspecified | Priority: | unspecified |
| Version: | 8.3 | CC: | rdma-dev-team, zguo |
| Target Milestone: | rc | Keywords: | Triaged |
| Target Release: | 8.4 | Fixed In Version: | opa-ff-10.10.3.0.11-1.el8 |
| Hardware: | Unspecified | OS: | Unspecified |
| Last Closed: | 2021-05-18 14:44:44 UTC | Type: | Bug |
| Bug Blocks: | 1842946 | | |
Created attachment 1720263: client log for IBACM failure on HFI1 OPA
I reproduced this issue with opafm running on rdma-master. The opafm service on rdma-master was shown as running but was effectively inactive, so I had to restart it.

```
[root@rdma-master ~]$ /sbin/service opafm status
Redirecting to /bin/systemctl status opafm.service
● opafm.service - OPA Fabric Manager
   Loaded: loaded (/usr/lib/systemd/system/opafm.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2020-09-30 16:13:28 EDT; 1 weeks 5 days ago
  Process: 135046 ExecStart=/usr/lib/opa-fm/bin/opafmd -D (code=exited, status=0/SUCCESS)
 Main PID: 135047 (opafmd)
    Tasks: 102
   CGroup: /system.slice/opafm.service
           ├─135047 /usr/lib/opa-fm/bin/opafmd -D
           ├─135051 /usr/lib/opa-fm/runtime/sm -e sm_0
           └─135052 /usr/lib/opa-fm/runtime/fe -e fe_0

Oct 13 08:39:11 rdma-master fm0_fe[135052]: ERROR[main]: MAI: mai_send_stl_timeout: mai_send_stl: Invalid fd: 18446744073709551615
Oct 13 08:39:11 rdma-master fm0_fe[135052]: ERROR[main]: IF3: rmpp_send_single_sa: can't send reply to GET request to LID[0x4] for TID[...0016952]
Oct 13 08:39:24 rdma-master fm0_fe[135052]: PROGR[main]: IF3: if3_mngr_get_port_guid: Was not able to communicate with SA (SM/SA may ha...d), rc:7
Oct 13 08:39:24 rdma-master fm0_fe[135052]: PROGR[main]: IF3: if3_mngr_query_service: Error getting GUID rc: 7
Oct 13 08:40:11 rdma-master fm0_fe[135052]: PROGR[main]: IF3: if3_mngr_get_sa_classportInfo: could not talk with SA (SM/SA may have mov... Timeout
Oct 13 08:40:11 rdma-master fm0_fe[135052]: PROGR[main]: IF3: if3_register_fe: can't register FE with the SA at this time rc: 7: Timeout
Oct 13 08:40:11 rdma-master fm0_fe[135052]: ERROR[main]: MAI: mai_send_stl_timeout: mai_send_stl: Invalid fd: 18446744073709551615
Oct 13 08:40:11 rdma-master fm0_fe[135052]: ERROR[main]: IF3: rmpp_send_single_sa: can't send reply to GET request to LID[0x4] for TID[...0016956]
Oct 13 08:40:24 rdma-master fm0_fe[135052]: PROGR[main]: IF3: if3_mngr_get_port_guid: Was not able to communicate with SA (SM/SA may ha...d), rc:7
Oct 13 08:40:24 rdma-master fm0_fe[135052]: PROGR[main]: IF3: if3_mngr_query_service: Error getting GUID rc: 7
Hint: Some lines were ellipsized, use -l to show in full.

[root@rdma-master ~]$ /sbin/service opafm restart
Redirecting to /bin/systemctl restart opafm.service
[root@rdma-master ~]$ /sbin/service opafm status
Redirecting to /bin/systemctl status opafm.service
● opafm.service - OPA Fabric Manager
   Loaded: loaded (/usr/lib/systemd/system/opafm.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2020-10-13 08:40:35 EDT; 4s ago
  Process: 50201 ExecStop=/usr/lib/opa-fm/bin/opafmd halt (code=exited, status=0/SUCCESS)
  Process: 50212 ExecStart=/usr/lib/opa-fm/bin/opafmd -D (code=exited, status=0/SUCCESS)
 Main PID: 50213 (opafmd)
    Tasks: 99
   CGroup: /system.slice/opafm.service
           ├─50213 /usr/lib/opa-fm/bin/opafmd -D
           ├─50217 /usr/lib/opa-fm/runtime/sm -e sm_0
           └─50218 /usr/lib/opa-fm/runtime/fe -e fe_0

Oct 13 08:40:35 rdma-master fm0_fe[50218]: [main]: FE: Using: HFI 2 Port 1 PortGuid 0x00117501016749b1
Oct 13 08:40:36 rdma-master fm0_sm[50217]: [main]: SM: Using: HFI 2 Port 1 PortGuid 0x00117501016749b1
Oct 13 08:40:39 rdma-master fm0_sm[50217]: PROGR[topology]: SM: sm_transition: SM STATE TRANSITION from NOTACTIVE to DISCOVERING
Oct 13 08:40:39 rdma-master fm0_sm[50217]: PROGR[topology]: SM: topology_main: TT: DISCOVERY CYCLE START - REASON: Initial sweep.
Oct 13 08:40:39 rdma-master fm0_sm[50217]: PROGR[topology]: SM: sm_set_local_port_pkey: sm pkey table already set
Oct 13 08:40:39 rdma-master fm0_sm[50217]: WARN [ParallelSw[1]]: SM: sm_send_stl_request_impl: Bad MAD status received from Path:[ 1 44...ethod:01
Oct 13 08:40:39 rdma-master fm0_sm[50217]: PROGR[ParallelSw[1]]: SM: _get_sminfo: failed to get SmInfo from remote SM rdma-perf-07 hfi1...01096c60
Oct 13 08:40:39 rdma-master fm0_sm[50217]: WARN [ParallelSw[1]]: SM: _discover_worker: unable to setup port[44] of node OmniPth0011750...ng port!
Oct 13 08:40:39 rdma-master fm0_sm[50217]: rdma-master; MSG:NOTICE|SM:rdma-master:port 1|COND:#5 SM state to master|NODE:rdma-master:po...o MASTER
Oct 13 08:40:39 rdma-master fm0_sm[50217]: PROGR[topology]: SM: sm_set_delayed_pkeys: sm pkey table refresh
Hint: Some lines were ellipsized, use -l to show in full.
```

The fabric itself looks healthy:

```
[root@rdma-qe-15 ~]$ ibstat
CA 'hfi1_0'
        CA type:
        Number of ports: 1
        Firmware version: 1.27.0
        Hardware version: 10
        Node GUID: 0x0011750101670fb0
        System image GUID: 0x0011750101670fb0
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 5
                LMC: 0
                SM lid: 1
                Capability mask: 0x00490020
                Port GUID: 0x0011750101670fb0
                Link layer: InfiniBand

[root@rdma-qe-15 ~]$ ssh rdma-master ibstat hfi1_1
CA 'hfi1_1'
        CA type:
        Number of ports: 1
        Firmware version: 1.27.0
        Hardware version: 10
        Node GUID: 0x00117501016749b1
        System image GUID: 0x00117501010fa0e1
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 1
                LMC: 0
                SM lid: 1
                Capability mask: 0x00490022
                Port GUID: 0x00117501016749b1
                Link layer: InfiniBand

[root@rdma-qe-15 ~]$ opafabricinfo
Fabric 0:0 Information:
SM: rdma-master hfi1_1 Guid: 0x00117501016749b1 State: Master
Number of HFIs: 6
Number of Switches: 1
Number of Links: 5
Number of HFI Links: 5 (Internal: 0 External: 5)
Number of ISLs: 0 (Internal: 0 External: 0)
Number of Degraded Links: 0 (HFI Links: 0 ISLs: 0)
Number of Omitted Links: 0 (HFI Links: 0 ISLs: 0)
-------------------------------------------------------------------------------
```

Yet address resolution still fails:

```
[root@rdma-qe-15 ~]$ ib_acme -f i -d 172.31.20.14 -v
Service: /run/ibacm-unix.sock
Destination: 172.31.20.14
ib_acm_resolve_ip failed: No data available
return status 0xffffffff
```

The regression was introduced by this rdma-core commit:

```
commit ad5d934d688911149d795aee1d3b9fa06bf171a9
Author: Mark Haywood <mark.haywood>
Date:   Tue Mar 24 16:54:55 2020 +0100

    ibacm: check provider file ends with .so extension

    acm_open_providers() reads filenames and checks to see if the filenames
    contain the .so (shared object) extension via strstr(). But this does
    not verify that the extension is found at the end of the filename -
    only that .so is a substring somewhere in the filename. This means
    filenames of the sort "libibacmp.so.org" will be matched and loaded as
    providers. The check should be modified to verify that the extension
    is at the end
```

This ibacm commit introduces a regression for opa-address-resolution. The commit forces ibacm to open only providers (libraries) whose names end with '.so', but opa-address-resolution provides "/usr/lib64/ibacm/libdsap.so.1.0.0". As a result, the provider was no longer used for address resolution on OPA devices. We need to create a soft link "/usr/lib64/ibacm/libdsap.so" pointing to "/usr/lib64/ibacm/libdsap.so.1.0.0".

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RDMA stack bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1594
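The workaround can be sketched in shell. This is a minimal illustration, not the packaged fix: the `loads` function and the temporary directory are mine, used to mimic the post-ad5d934d "name must end with .so" check and show why the versioned library name is skipped until the symlink exists (the real provider directory is /usr/lib64/ibacm).

```shell
#!/bin/sh
# Illustration only: mimic ibacm's provider-name check after commit
# ad5d934d, which requires the filename to END with ".so" rather than
# merely contain it (the old strstr() behavior).
set -e

loads() {
    case "$1" in
        *.so) echo "loaded: $1" ;;
        *)    echo "skipped: $1" ;;
    esac
}

dir=$(mktemp -d)
touch "$dir/libdsap.so.1.0.0"

loads "libdsap.so.1.0.0"   # versioned name fails the new check

# The fix shipped for this bug: a "libdsap.so" symlink to the real
# library (in the opa-ff package this lives under /usr/lib64/ibacm).
ln -s libdsap.so.1.0.0 "$dir/libdsap.so"

loads "libdsap.so"         # the symlink name passes the check

rm -rf "$dir"
```

Note that the commit message's example "libibacmp.so.org" would also be skipped by the new check, which is exactly the behavior it intended; libdsap.so.1.0.0 is simply collateral damage.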
Created attachment 1720240: client log when IBACM worked on RHEL-8.3.0-20200909.1

Description of problem:

IBACM between an HFI1 OPA server and client host fails to resolve the IP address. The RDMA hosts in the lab for this test are rdma-qe-14 and rdma-qe-15.

Version-Release number of selected component (if applicable):

```
DISTRO=RHEL-8.3.0-20200909.1
+ [20-10-08 12:21:39] cat /etc/redhat-release
Red Hat Enterprise Linux release 8.3 Beta (Ootpa)
+ [20-10-08 12:21:39] uname -a
Linux rdma-qe-15.lab.bos.redhat.com 4.18.0-235.el8.x86_64 #1 SMP Thu Sep 3 10:48:30 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
+ [20-10-08 12:21:39] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-235.el8.x86_64 root=/dev/mapper/rhel_rdma--qe--15-root ro intel_idle.max_cstate=0 processor.max_cstate=0 console=tty0 rd_NO_PLYMOUTH intel_iommu=on crashkernel=auto resume=/dev/mapper/rhel_rdma--qe--15-swap rd.lvm.lv=rhel_rdma-qe-15/root rd.lvm.lv=rhel_rdma-qe-15/swap console=ttyS1,115200
+ [20-10-08 12:21:39] rpm -q rdma-core linux-firmware
rdma-core-29.0-3.el8.x86_64
linux-firmware-20200619-99.git3890db36.el8.noarch
+ [20-10-08 12:21:39] tail /sys/class/infiniband/hfi1_0/fw_ver
1.27.0
+ [20-10-08 12:21:39] lspci
+ [20-10-08 12:21:39] grep -i -e ethernet -e infiniband -e omni -e ConnectX
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
03:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
03:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
04:00.0 Fabric controller: Intel Corporation Omni-Path HFI Silicon 100 Series [discrete] (rev 10)
+ [20-10-08 12:21:39] ibstat
CA 'hfi1_0'
        CA type:
        Number of ports: 1
        Firmware version: 1.27.0
        Hardware version: 10
        Node GUID: 0x0011750101670fb0
        System image GUID: 0x0011750101670fb0
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 5
                LMC: 0
                SM lid: 1
                Capability mask: 0x00490022
                Port GUID: 0x0011750101670fb0
                Link layer: InfiniBand
+ [20-10-08 12:21:39] ibstatus
+ [20-10-08 12:21:39] rpm -q ibacm
ibacm-29.0-3.el8.x86_64
```

server interfaces
=================

```
6: hfi1_opa0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UP group default qlen 256
    link/infiniband 80:00:00:02:fe:80:00:00:00:00:00:00:00:11:75:01:01:67:10:f0 brd 00:ff:ff:ff:ff:12:40:1b:80:20:00:00:00:00:00:00:ff:ff:ff:ff
    inet 172.31.20.14/24 brd 172.31.20.255 scope global dynamic noprefixroute hfi1_opa0
       valid_lft 3449sec preferred_lft 3449sec
    inet6 fe80::211:7501:167:10f0/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
7: hfi1_opa0.8024@hfi1_opa0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UP group default qlen 256
    link/infiniband 80:00:00:04:fe:80:00:00:00:00:00:00:00:11:75:01:01:67:10:f0 brd 00:ff:ff:ff:ff:12:40:1b:80:24:00:00:00:00:00:00:ff:ff:ff:ff
    inet 172.31.24.14/24 brd 172.31.24.255 scope global dynamic noprefixroute hfi1_opa0.8024
       valid_lft 3449sec preferred_lft 3449sec
    inet6 fe80::211:7501:167:10f0/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
8: hfi1_opa0.8022@hfi1_opa0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UP group default qlen 256
    link/infiniband 80:00:00:06:fe:80:00:00:00:00:00:00:00:11:75:01:01:67:10:f0 brd 00:ff:ff:ff:ff:12:40:1b:80:22:00:00:00:00:00:00:ff:ff:ff:ff
    inet 172.31.22.14/24 brd 172.31.22.255 scope global dynamic noprefixroute hfi1_opa0.8022
       valid_lft 3449sec preferred_lft 3449sec
    inet6 fe80::211:7501:167:10f0/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
```

client interfaces
=================

```
6: hfi1_opa0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UP group default qlen 256
    link/infiniband 80:00:00:02:fe:80:00:00:00:00:00:00:00:11:75:01:01:67:0f:b0 brd 00:ff:ff:ff:ff:12:40:1b:80:20:00:00:00:00:00:00:ff:ff:ff:ff
    inet 172.31.20.15/24 brd 172.31.20.255 scope global noprefixroute hfi1_opa0
       valid_lft forever preferred_lft forever
    inet6 fe80::211:7501:167:fb0/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
7: hfi1_opa0.8024@hfi1_opa0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UP group default qlen 256
    link/infiniband 80:00:00:04:fe:80:00:00:00:00:00:00:00:11:75:01:01:67:0f:b0 brd 00:ff:ff:ff:ff:12:40:1b:80:24:00:00:00:00:00:00:ff:ff:ff:ff
    inet 172.31.24.15/24 brd 172.31.24.255 scope global dynamic noprefixroute hfi1_opa0.8024
       valid_lft 3282sec preferred_lft 3282sec
8: hfi1_opa0.8022@hfi1_opa0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UP group default qlen 256
    link/infiniband 80:00:00:06:fe:80:00:00:00:00:00:00:00:11:75:01:01:67:0f:b0 brd 00:ff:ff:ff:ff:12:40:1b:80:22:00:00:00:00:00:00:ff:ff:ff:ff
    inet 172.31.22.15/24 brd 172.31.22.255 scope global dynamic noprefixroute hfi1_opa0.8022
       valid_lft 3282sec preferred_lft 3282sec
```

How reproducible: 100%

Steps to Reproduce:
1. Bring up the RDMA hosts with HFI1 OPA devices and the software shown above.
2. On both hosts, set up the IBACM service: /usr/bin/systemctl restart ibacm
3. On the client host, generate the /etc/rdma/ibacm_addr.cfg and /etc/rdma/ibacm_opts.cfg files: timeout 3m ib_acme -A -O -V
4. On the client side, run ibacm: timeout 3m ib_acme -f i -d 172.31.20.14 -v
5. On the server side, run ibacm: timeout 3m ib_acme -f i -d 172.31.20.14 -v

Actual results:

client side:

```
+ [20-10-08 12:21:40] timeout 3m ib_acme -f i -d 172.31.20.14 -v
Service: /run/ibacm-unix.sock
Destination: 172.31.20.14
ib_acm_resolve_ip failed: No data available
return status 0xffffffff
```

server side:

```
+ [20-10-08 12:22:26] timeout 3m ib_acme -f i -d 172.31.20.14 -v
Service: /run/ibacm-unix.sock
Destination: 172.31.20.14
ib_acm_resolve_ip failed: No data available
return status 0xffffffff
```

Expected results:

client side:

```
+ [20-09-09 21:57:47] timeout 3m ib_acme -f i -d 172.31.20.14 -v
Service: /run/ibacm-unix.sock
Destination: 172.31.20.14
Source: 172.31.20.15
Path information
  dgid: fe80::11:7501:167:10f0
  sgid: fe80::11:7501:167:fb0
  dlid: 5  slid: 3
  flow label: 0x0  hop limit: 0  tclass: 0
  reversible: 1  pkey: 0x8001  sl: 0  mtu: 7  rate: 16  packet lifetime: 14
SA verification: success
return status 0x0
```

server side:

```
+ [20-09-09 21:58:43] timeout 3m ib_acme -f i -d 172.31.20.14 -v
Service: /run/ibacm-unix.sock
Destination: 172.31.20.14
Source: 172.31.20.14
Path information
  dgid: fe80::11:7501:167:10f0
  sgid: fe80::11:7501:167:10f0
  dlid: 5  slid: 5
  flow label: 0x0  hop limit: 0  tclass: 0
  reversible: 1  pkey: 0x8001  sl: 0  mtu: 5  rate: 16  packet lifetime: 0
SA verification: success
return status 0x0
```

Additional info:

This used to work on RHEL-8.3.0-20200909.1, but now it shows the same issue as described in this bugzilla. Attaching the client test log from when it was successful. The only difference when it worked was that the sub-interfaces for HFI1 OPA were out of service, with DOWN states.
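The pass/fail criterion used in the results above can be captured in a small shell helper. This is a sketch: the `check` function is mine, the sample transcripts are abbreviated from the Actual/Expected results in this report, and the marker string "SA verification: success" is my choice, not an output line ib_acme documents as stable.

```shell
#!/bin/sh
# Classify ib_acme output read from stdin: resolution succeeded only
# if the SA verification line is present.
check() {
    if grep -q "SA verification: success"; then
        echo PASS
    else
        echo FAIL
    fi
}

# Failing transcript (before the libdsap.so symlink existed):
check <<'EOF'
Service: /run/ibacm-unix.sock
Destination: 172.31.20.14
ib_acm_resolve_ip failed: No data available
return status 0xffffffff
EOF

# Working transcript (RHEL-8.3.0-20200909.1):
check <<'EOF'
Service: /run/ibacm-unix.sock
Destination: 172.31.20.14
Source: 172.31.20.15
SA verification: success
return status 0x0
EOF
```

On a live fabric the same helper would be fed directly, e.g. `timeout 3m ib_acme -f i -d 172.31.20.14 -v | check`.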