Bug 1886812 - [RHEL8.3] [opa-ff] create soft link for dsap plugin
Summary: [RHEL8.3] [opa-ff] create soft link for dsap plugin
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: opa-ff
Version: 8.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 8.4
Assignee: Honggang LI
QA Contact: zguo
URL:
Whiteboard:
Depends On:
Blocks: 1842946
Reported: 2020-10-09 12:57 UTC by Brian Chae
Modified: 2021-05-18 14:44 UTC (History)

Fixed In Version: opa-ff-10.10.3.0.11-1.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-18 14:44:44 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
client log when IBACM worked on RHEL-8.3.0-20200909.1 (37.16 KB, text/plain)
2020-10-09 12:57 UTC, Brian Chae
client log for IBACM failure on HFI1 OPA (39.34 KB, text/plain)
2020-10-09 13:01 UTC, Brian Chae

Description Brian Chae 2020-10-09 12:57:32 UTC
Created attachment 1720240 [details]
client log when IBACM worked on RHEL-8.3.0-20200909.1

Description of problem:

IBACM fails to resolve the IP address between the HFI1 OPA server and client hosts. The RDMA hosts in the lab for this test are rdma-qe-14 and rdma-qe-15.

Version-Release number of selected component (if applicable):

DISTRO=RHEL-8.3.0-20200909.1

+ [20-10-08 12:21:39] cat /etc/redhat-release
Red Hat Enterprise Linux release 8.3 Beta (Ootpa)
+ [20-10-08 12:21:39] uname -a
Linux rdma-qe-15.lab.bos.redhat.com 4.18.0-235.el8.x86_64 #1 SMP Thu Sep 3 10:48:30 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

+ [20-10-08 12:21:39] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-235.el8.x86_64 root=/dev/mapper/rhel_rdma--qe--15-root ro intel_idle.max_cstate=0 processor.max_cstate=0 console=tty0 rd_NO_PLYMOUTH intel_iommu=on crashkernel=auto resume=/dev/mapper/rhel_rdma--qe--15-swap rd.lvm.lv=rhel_rdma-qe-15/root rd.lvm.lv=rhel_rdma-qe-15/swap console=ttyS1,115200
+ [20-10-08 12:21:39] rpm -q rdma-core linux-firmware
rdma-core-29.0-3.el8.x86_64
linux-firmware-20200619-99.git3890db36.el8.noarch
+ [20-10-08 12:21:39] tail /sys/class/infiniband/hfi1_0/fw_ver
1.27.0
+ [20-10-08 12:21:39] lspci
+ [20-10-08 12:21:39] grep -i -e ethernet -e infiniband -e omni -e ConnectX
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
03:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
03:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
04:00.0 Fabric controller: Intel Corporation Omni-Path HFI Silicon 100 Series [discrete] (rev 10)

+ [20-10-08 12:21:39] ibstat
CA 'hfi1_0'
	CA type: 
	Number of ports: 1
	Firmware version: 1.27.0
	Hardware version: 10
	Node GUID: 0x0011750101670fb0
	System image GUID: 0x0011750101670fb0
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 5
		LMC: 0
		SM lid: 1
		Capability mask: 0x00490022
		Port GUID: 0x0011750101670fb0
		Link layer: InfiniBand
+ [20-10-08 12:21:39] ibstatus



+ [20-10-08 12:21:39] rpm -q ibacm
ibacm-29.0-3.el8.x86_64

server interfaces
=================

6: hfi1_opa0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UP group default qlen 256
    link/infiniband 80:00:00:02:fe:80:00:00:00:00:00:00:00:11:75:01:01:67:10:f0 brd 00:ff:ff:ff:ff:12:40:1b:80:20:00:00:00:00:00:00:ff:ff:ff:ff
    inet 172.31.20.14/24 brd 172.31.20.255 scope global dynamic noprefixroute hfi1_opa0
       valid_lft 3449sec preferred_lft 3449sec
    inet6 fe80::211:7501:167:10f0/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
7: hfi1_opa0.8024@hfi1_opa0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UP group default qlen 256
    link/infiniband 80:00:00:04:fe:80:00:00:00:00:00:00:00:11:75:01:01:67:10:f0 brd 00:ff:ff:ff:ff:12:40:1b:80:24:00:00:00:00:00:00:ff:ff:ff:ff
    inet 172.31.24.14/24 brd 172.31.24.255 scope global dynamic noprefixroute hfi1_opa0.8024
       valid_lft 3449sec preferred_lft 3449sec
    inet6 fe80::211:7501:167:10f0/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
8: hfi1_opa0.8022@hfi1_opa0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UP group default qlen 256
    link/infiniband 80:00:00:06:fe:80:00:00:00:00:00:00:00:11:75:01:01:67:10:f0 brd 00:ff:ff:ff:ff:12:40:1b:80:22:00:00:00:00:00:00:ff:ff:ff:ff
    inet 172.31.22.14/24 brd 172.31.22.255 scope global dynamic noprefixroute hfi1_opa0.8022
       valid_lft 3449sec preferred_lft 3449sec
    inet6 fe80::211:7501:167:10f0/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

client interfaces
=================

6: hfi1_opa0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UP group default qlen 256
    link/infiniband 80:00:00:02:fe:80:00:00:00:00:00:00:00:11:75:01:01:67:0f:b0 brd 00:ff:ff:ff:ff:12:40:1b:80:20:00:00:00:00:00:00:ff:ff:ff:ff
    inet 172.31.20.15/24 brd 172.31.20.255 scope global noprefixroute hfi1_opa0
       valid_lft forever preferred_lft forever
    inet6 fe80::211:7501:167:fb0/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
7: hfi1_opa0.8024@hfi1_opa0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UP group default qlen 256
    link/infiniband 80:00:00:04:fe:80:00:00:00:00:00:00:00:11:75:01:01:67:0f:b0 brd 00:ff:ff:ff:ff:12:40:1b:80:24:00:00:00:00:00:00:ff:ff:ff:ff
    inet 172.31.24.15/24 brd 172.31.24.255 scope global dynamic noprefixroute hfi1_opa0.8024
       valid_lft 3282sec preferred_lft 3282sec
8: hfi1_opa0.8022@hfi1_opa0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UP group default qlen 256
    link/infiniband 80:00:00:06:fe:80:00:00:00:00:00:00:00:11:75:01:01:67:0f:b0 brd 00:ff:ff:ff:ff:12:40:1b:80:22:00:00:00:00:00:00:ff:ff:ff:ff
    inet 172.31.22.15/24 brd 172.31.22.255 scope global dynamic noprefixroute hfi1_opa0.8022
       valid_lft 3282sec preferred_lft 3282sec




How reproducible:

100%

Steps to Reproduce:
1. Bring up the RDMA hosts with HFI1 OPA devices, running the software shown above
2. On both hosts, set up the IBACM services
/usr/bin/systemctl restart ibacm
3. On the client host, generate the /etc/rdma/ibacm_addr.cfg and /etc/rdma/ibacm_opts.cfg files

timeout 3m ib_acme -A -O -V 

4. On the client side, run ib_acme

timeout 3m ib_acme -f i -d 172.31.20.14 -v

5. On the server side, run ib_acme

timeout 3m ib_acme -f i -d 172.31.20.14 -v


Actual results:

client side
===========

+ [20-10-08 12:21:40] timeout 3m ib_acme -f i -d 172.31.20.14 -v
Service: /run/ibacm-unix.sock
Destination: 172.31.20.14
ib_acm_resolve_ip failed: No data available

return status 0xffffffff

server side
===========


+ [20-10-08 12:22:26] timeout 3m ib_acme -f i -d 172.31.20.14 -v
Service: /run/ibacm-unix.sock
Destination: 172.31.20.14
ib_acm_resolve_ip failed: No data available

return status 0xffffffff



Expected results:

client side
===========



+ [20-09-09 21:57:47] timeout 3m ib_acme -f i -d 172.31.20.14 -v
Service: /run/ibacm-unix.sock
Destination: 172.31.20.14
Source: 172.31.20.15
Path information
  dgid: fe80::11:7501:167:10f0
  sgid: fe80::11:7501:167:fb0
  dlid: 5
  slid: 3
  flow label: 0x0
  hop limit: 0
  tclass: 0
  reversible: 1
  pkey: 0x8001
  sl: 0
  mtu: 7
  rate: 16
  packet lifetime: 14
SA verification: success

return status 0x0


server side
===========


+ [20-09-09 21:58:43] timeout 3m ib_acme -f i -d 172.31.20.14 -v
Service: /run/ibacm-unix.sock
Destination: 172.31.20.14
Source: 172.31.20.14
Path information
  dgid: fe80::11:7501:167:10f0
  sgid: fe80::11:7501:167:10f0
  dlid: 5
  slid: 5
  flow label: 0x0
  hop limit: 0
  tclass: 0
  reversible: 1
  pkey: 0x8001
  sl: 0
  mtu: 5
  rate: 16
  packet lifetime: 0
SA verification: success

return status 0x0



Additional info:

This worked on RHEL-8.3.0-20200909.1, but it now fails as described in this bug. The client test log from the successful run is attached. The only difference when it worked was that the HFI1 OPA sub-interfaces were out of service (in the DOWN state).

Comment 1 Brian Chae 2020-10-09 13:01:29 UTC
Created attachment 1720263 [details]
client log for IBACM failure on HFI1 OPA

Comment 2 Honggang LI 2020-10-13 12:50:34 UTC
I reproduced this issue with opafm running on rdma-master.

The opafm on rdma-master was running but inactive, so I had to restart it.

[root@rdma-master ~]$ /sbin/service opafm status
Redirecting to /bin/systemctl status opafm.service
● opafm.service - OPA Fabric Manager
   Loaded: loaded (/usr/lib/systemd/system/opafm.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2020-09-30 16:13:28 EDT; 1 weeks 5 days ago
  Process: 135046 ExecStart=/usr/lib/opa-fm/bin/opafmd -D (code=exited, status=0/SUCCESS)
 Main PID: 135047 (opafmd)
    Tasks: 102
   CGroup: /system.slice/opafm.service
           ├─135047 /usr/lib/opa-fm/bin/opafmd -D
           ├─135051 /usr/lib/opa-fm/runtime/sm -e sm_0
           └─135052 /usr/lib/opa-fm/runtime/fe -e fe_0

Oct 13 08:39:11 rdma-master fm0_fe[135052]: ERROR[main]: MAI: mai_send_stl_timeout: mai_send_stl: Invalid fd: 18446744073709551615
Oct 13 08:39:11 rdma-master fm0_fe[135052]: ERROR[main]: IF3: rmpp_send_single_sa: can't send reply to GET request to LID[0x4] for TID[...0016952]
Oct 13 08:39:24 rdma-master fm0_fe[135052]: PROGR[main]: IF3: if3_mngr_get_port_guid: Was not able to communicate with SA (SM/SA may ha...d), rc:7
Oct 13 08:39:24 rdma-master fm0_fe[135052]: PROGR[main]: IF3: if3_mngr_query_service: Error getting GUID rc: 7
Oct 13 08:40:11 rdma-master fm0_fe[135052]: PROGR[main]: IF3: if3_mngr_get_sa_classportInfo: could not talk with SA (SM/SA may have mov... Timeout
Oct 13 08:40:11 rdma-master fm0_fe[135052]: PROGR[main]: IF3: if3_register_fe: can't register FE with the SA at this time rc: 7: Timeout
Oct 13 08:40:11 rdma-master fm0_fe[135052]: ERROR[main]: MAI: mai_send_stl_timeout: mai_send_stl: Invalid fd: 18446744073709551615
Oct 13 08:40:11 rdma-master fm0_fe[135052]: ERROR[main]: IF3: rmpp_send_single_sa: can't send reply to GET request to LID[0x4] for TID[...0016956]
Oct 13 08:40:24 rdma-master fm0_fe[135052]: PROGR[main]: IF3: if3_mngr_get_port_guid: Was not able to communicate with SA (SM/SA may ha...d), rc:7
Oct 13 08:40:24 rdma-master fm0_fe[135052]: PROGR[main]: IF3: if3_mngr_query_service: Error getting GUID rc: 7
Hint: Some lines were ellipsized, use -l to show in full.



[root@rdma-master ~]$ /sbin/service opafm  restart
Redirecting to /bin/systemctl restart opafm.service
[root@rdma-master ~]$ 
[root@rdma-master ~]$ 
[root@rdma-master ~]$ /sbin/service opafm status
Redirecting to /bin/systemctl status opafm.service
● opafm.service - OPA Fabric Manager
   Loaded: loaded (/usr/lib/systemd/system/opafm.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2020-10-13 08:40:35 EDT; 4s ago
  Process: 50201 ExecStop=/usr/lib/opa-fm/bin/opafmd halt (code=exited, status=0/SUCCESS)
  Process: 50212 ExecStart=/usr/lib/opa-fm/bin/opafmd -D (code=exited, status=0/SUCCESS)
 Main PID: 50213 (opafmd)
    Tasks: 99
   CGroup: /system.slice/opafm.service
           ├─50213 /usr/lib/opa-fm/bin/opafmd -D
           ├─50217 /usr/lib/opa-fm/runtime/sm -e sm_0
           └─50218 /usr/lib/opa-fm/runtime/fe -e fe_0

Oct 13 08:40:35 rdma-master fm0_fe[50218]: [main]: FE: Using: HFI 2 Port 1 PortGuid 0x00117501016749b1
Oct 13 08:40:36 rdma-master fm0_sm[50217]: [main]: SM: Using: HFI 2 Port 1 PortGuid 0x00117501016749b1
Oct 13 08:40:39 rdma-master fm0_sm[50217]: PROGR[topology]: SM: sm_transition: SM STATE TRANSITION from NOTACTIVE to DISCOVERING
Oct 13 08:40:39 rdma-master fm0_sm[50217]: PROGR[topology]: SM: topology_main: TT: DISCOVERY CYCLE START - REASON: Initial sweep.
Oct 13 08:40:39 rdma-master fm0_sm[50217]: PROGR[topology]: SM: sm_set_local_port_pkey: sm pkey table already set
Oct 13 08:40:39 rdma-master fm0_sm[50217]: WARN [ParallelSw[1]]: SM: sm_send_stl_request_impl: Bad MAD status received from Path:[ 1 44...ethod:01
Oct 13 08:40:39 rdma-master fm0_sm[50217]: PROGR[ParallelSw[1]]: SM: _get_sminfo: failed to get SmInfo from remote SM rdma-perf-07 hfi1...01096c60
Oct 13 08:40:39 rdma-master fm0_sm[50217]: WARN [ParallelSw[1]]: SM: _discover_worker:  unable to setup port[44] of node OmniPth0011750...ng port!
Oct 13 08:40:39 rdma-master fm0_sm[50217]: rdma-master; MSG:NOTICE|SM:rdma-master:port 1|COND:#5 SM state to master|NODE:rdma-master:po...o MASTER
Oct 13 08:40:39 rdma-master fm0_sm[50217]: PROGR[topology]: SM: sm_set_delayed_pkeys: sm pkey table refresh
Hint: Some lines were ellipsized, use -l to show in full.


[root@rdma-qe-15 ~]$ ibstat
CA 'hfi1_0'
	CA type: 
	Number of ports: 1
	Firmware version: 1.27.0
	Hardware version: 10
	Node GUID: 0x0011750101670fb0
	System image GUID: 0x0011750101670fb0
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 5
		LMC: 0
		SM lid: 1
		Capability mask: 0x00490020
		Port GUID: 0x0011750101670fb0
		Link layer: InfiniBand
[root@rdma-qe-15 ~]$ 
[root@rdma-qe-15 ~]$ 
[root@rdma-qe-15 ~]$ ssh rdma-master ibstat hfi1_1
CA 'hfi1_1'
	CA type: 
	Number of ports: 1
	Firmware version: 1.27.0
	Hardware version: 10
	Node GUID: 0x00117501016749b1
	System image GUID: 0x00117501010fa0e1
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 1
		LMC: 0
		SM lid: 1
		Capability mask: 0x00490022
		Port GUID: 0x00117501016749b1
		Link layer: InfiniBand
[root@rdma-qe-15 ~]$ 
[root@rdma-qe-15 ~]$ 
[root@rdma-qe-15 ~]$ opafabricinfo 
Fabric 0:0 Information:
SM: rdma-master hfi1_1 Guid: 0x00117501016749b1 State: Master
Number of HFIs: 6
Number of Switches: 1
Number of Links: 5
Number of HFI Links: 5              (Internal: 0   External: 5)
Number of ISLs: 0                   (Internal: 0   External: 0)
Number of Degraded Links: 0         (HFI Links: 0   ISLs: 0)
Number of Omitted Links: 0          (HFI Links: 0   ISLs: 0)
-------------------------------------------------------------------------------
[root@rdma-qe-15 ~]$ ib_acme -f i -d 172.31.20.14 -v
Service: /run/ibacm-unix.sock
Destination: 172.31.20.14
ib_acm_resolve_ip failed: No data available

return status 0xffffffff
[root@rdma-qe-15 ~]$

Comment 6 Honggang LI 2020-10-14 01:41:18 UTC
commit ad5d934d688911149d795aee1d3b9fa06bf171a9
Author: Mark Haywood <mark.haywood>
Date:   Tue Mar 24 16:54:55 2020 +0100

    ibacm: check provider file ends with .so extension
    
    acm_open_providers() reads filenames and checks to see if the filenames
    contain the .so (shared object) extension via strstr(). But this does
    not verify that the extension is found at the end of the filename - only
    that .so is a substring somewhere in the filename. This means filenames
    of the sort "libibacmp.so.org" will be matched and loaded as providers.
    The check should be modified to verify that the extension is at the end

This ibacm commit introduces a regression for opa-address-resolution. The commit forces ibacm to open only providers (libraries) whose file names end with '.so', but opa-address-resolution ships "/usr/lib64/ibacm/libdsap.so.1.0.0".
As a result, the provider was not used for address resolution on OPA devices. We need to create a soft link "/usr/lib64/ibacm/libdsap.so" that points to "/usr/lib64/ibacm/libdsap.so.1.0.0".
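
The effect of the stricter check can be sketched in shell. This is only an illustration under stated assumptions, not the ibacm source: the helper names contains_so and ends_with_so are hypothetical, and the files are created in a scratch directory instead of /usr/lib64/ibacm. It mimics the substring test described in the commit message, the new end-of-name test, and the soft link proposed as the fix.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers that mimic the two filename checks
# described in the commit message. They are not taken from the ibacm code.

# Old behaviour: ".so" may appear anywhere in the name (substring match).
contains_so() {
    case "$1" in
        *.so*) echo match ;;
        *)     echo skip ;;
    esac
}

# New behaviour: the name must end with ".so".
ends_with_so() {
    case "$1" in
        *.so) echo match ;;
        *)    echo skip ;;
    esac
}

# Use a scratch directory (assumption) rather than /usr/lib64/ibacm.
demo_dir=$(mktemp -d)
touch "$demo_dir/libdsap.so.1.0.0"

# Relative soft link, the same shape as the fix proposed in this comment.
ln -s libdsap.so.1.0.0 "$demo_dir/libdsap.so"

# Show how each file name fares under the old and the new check.
for f in "$demo_dir"/*; do
    name=${f##*/}
    printf '%-20s old=%s new=%s\n' "$name" \
        "$(contains_so "$name")" "$(ends_with_so "$name")"
done

rm -rf "$demo_dir"
```

In this sketch the versioned name libdsap.so.1.0.0 passes only the old substring check, while the soft link libdsap.so passes both, which is why adding the link should let the provider be picked up again.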

Comment 12 errata-xmlrpc 2021-05-18 14:44:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RDMA stack bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1594

