Bug 1534869

Summary: [NM/IPoIB] Use large MTU when the IPoIB device supports it
Product: Red Hat Enterprise Linux 8
Component: NetworkManager
Version: 8.1
Hardware: Unspecified
OS: Unspecified
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Target Milestone: rc
Target Release: 8.1
Reporter: Honggang LI <honli>
Assignee: Thomas Haller <thaller>
QA Contact: Desktop QE <desktop-qa-list>
Docs Contact:
CC: atragler, bgalvani, ddutile, dledford, fgiudici, honli, infiniband-qe, jmaxwell, lrintel, rdma-dev-team, rkhan, sukulkar, thaller, vbenes
Flags: pm-rhel: mirror+
Whiteboard:
Fixed In Version: NetworkManager-1.18.0-0.3.20190408git43d9187c14.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-09-10 13:31:25 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1689408, 1701002
Attachments:
- NM trace info before active IPoIB interface with 'nmtui' (flags: none)
- NM trace info after active IPoIB interface with 'nmtui' (flags: none)

Description Honggang LI 2018-01-16 07:29:15 UTC
Description of problem:

In the old days, the IB L2 MTU was 2K, so using 2044 as the default MTU for "datagram" mode was acceptable. But newer IB HCAs support an L2 MTU larger than 2K, so NM should check the HCA's MTU before suggesting/guessing an MTU for an IPoIB device.

[root@rdma-dev-10 ~]$  find /sys/ |   grep -i mtu | grep -i port |  xargs tail
==> /sys/devices/pci0000:00/0000:00:01.0/0000:05:00.0/mlx4_port1_mtu <==
4096

==> /sys/devices/pci0000:00/0000:00:01.0/0000:05:00.0/mlx4_port2_mtu <==
4096


Version-Release number of selected component (if applicable):


How reproducible:
always

Steps to Reproduce:
1. Use nmtui or nmcli to set up and activate an IPoIB interface running in "datagram" mode.

2. ip addr show | grep -i mtu

Actual results:
4: mlx4_ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc....
                                                   ^^^^

Expected results:
NM should use 4092 as the default MTU when the IB L2 MTU is 4096.

Additional info:
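
As a minimal sketch (not NetworkManager code), the default could be derived from the HCA port's L2 MTU in sysfs instead of being hardcoded; the path below is the mlx4 one shown above, and the 4-byte subtraction is the IPoIB encapsulation header:

#include <stdio.h>

#define IPOIB_ENCAP_SIZE 4  /* the 4-byte IPoIB header prepended on the wire */

/* Read the port's L2 MTU (e.g. 4096) and derive the usable IPoIB
 * datagram MTU (e.g. 4092); fall back to the old 2K default. */
static int default_datagram_mtu(const char *port_mtu_path)
{
    int l2_mtu = 2048;
    FILE *f = fopen(port_mtu_path, "r");

    if (f) {
        if (fscanf(f, "%d", &l2_mtu) != 1)
            l2_mtu = 2048;
        fclose(f);
    }
    return l2_mtu - IPOIB_ENCAP_SIZE;
}

int main(void)
{
    printf("%d\n", default_datagram_mtu(
        "/sys/devices/pci0000:00/0000:00:01.0/0000:05:00.0/mlx4_port1_mtu"));
    return 0;
}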

Comment 2 Thomas Haller 2018-01-16 16:12:18 UTC
Could you please provide a logfile of NetworkManager?

Please enable level=TRACE. See the hints for how to get a logfile at https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/contrib/fedora/rpm/NetworkManager.conf

Comment 3 Honggang LI 2018-01-17 07:20:34 UTC
[root@rdma-dev-10 ~]$ mv /etc/sysconfig/network-scripts/ifcfg-mlx4_ib* /root
[root@rdma-dev-10 ~]$ mv /etc/udev/rules.d/70-persistent-ipoib.rules /root
[root@rdma-dev-10 ~]$ cat /etc/NetworkManager/conf.d/NetworkManager.conf
[logging]
level=TRACE
[root@rdma-dev-10 ~]$ reboot


[root@rdma-dev-10 ~]$ find /sys/ |   grep -i mtu | grep -i port |  xargs tail
==> /sys/devices/pci0000:00/0000:00:01.0/0000:05:00.0/mlx4_port1_mtu <==
4096

==> /sys/devices/pci0000:00/0000:00:01.0/0000:05:00.0/mlx4_port2_mtu <==
4096
[root@rdma-dev-10 ~]$

[root@rdma-dev-10 ~]$ journalctl -u NetworkManager 2>&1 | tee fresh-reboot-before-config-ib0.txt
[root@rdma-dev-10 ~]$ nmtui

Note: I did *not* change anything when I activated the IPoIB interface with 'nmtui'. I used all default values.

[root@rdma-dev-10 ~]$ ip addr show | grep -w inet
    inet 127.0.0.1/8 scope host lo
    inet 10.16.45.206/21 brd 10.16.47.255 scope global noprefixroute dynamic lom_1
    inet 172.31.1.40/24 brd 172.31.1.255 scope global noprefixroute dynamic ib1
[root@rdma-dev-10 ~]$ 
[root@rdma-dev-10 ~]$ journalctl -u NetworkManager 2>&1 | tee after-config-ib0.txt

[root@rdma-dev-10 ~]$ find /sys/ -name mode | grep ib | xargs tail
==> /sys/devices/pci0000:00/0000:00:01.0/0000:05:00.0/net/ib1/mode <==
datagram

==> /sys/devices/pci0000:00/0000:00:01.0/0000:05:00.0/net/ib0/mode <==
datagram
[root@rdma-dev-10 ~]$ ip addr show ib0
4: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast state UP group default qlen 256
    link/infiniband 80:00:02:08:fe:80:00:00:00:00:00:00:f4:52:14:03:00:7b:e1:61 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
[root@rdma-dev-10 ~]$ ip addr show ib1
5: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast state UP group default qlen 256
    link/infiniband 80:00:02:09:fe:80:00:00:00:00:00:01:f4:52:14:03:00:7b:e1:62 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 172.31.1.40/24 brd 172.31.1.255 scope global noprefixroute dynamic ib1
       valid_lft 3459sec preferred_lft 3459sec
    inet6 fe80::2ea5:3c:a4de:68d0/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
[root@rdma-dev-10 ~]$

Comment 4 Honggang LI 2018-01-17 07:23:16 UTC
Created attachment 1382252 [details]
NM trace info before active IPoIB interface with 'nmtui'

Comment 5 Honggang LI 2018-01-17 07:24:14 UTC
Created attachment 1382253 [details]
NM trace info after active IPoIB interface with 'nmtui'

Comment 6 Honggang LI 2018-01-17 07:25:38 UTC
(In reply to Thomas Haller from comment #2)
> Could you please provide a logfile of NetworkManager?

Please see comment #3, #4, #5.

Comment 7 Thomas Haller 2018-01-17 08:41:34 UTC
> In the old days, the IB L2 MTU was 2K, so using 2044 as the default MTU for
> "datagram" mode was acceptable. But newer IB HCAs support an L2 MTU larger than
> 2K, so NM should check the HCA's MTU before suggesting/guessing an MTU for an
> IPoIB device.

Somewhat unrelated first:

In NetworkManager, you may optionally configure infiniband.mtu property. But for datagram mode, we restrict it to 2044. https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/libnm-core/nm-setting-infiniband.c?id=3c6cc7c2e0361f6651f58469ab76f7deb37a1cbe#n197
So, even if you tried to configure something larger, NM would silently reduce it to 2044 (in datagram mode). Do you think that is a bug?

Anyway, you can leave infiniband.mtu at zero (like you did). In this case NM should not re-configure the MTU.
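
For illustration, a minimal sketch of the clamping behavior described above (the constant and function names are illustrative, not NM's actual internals):

#include <stdio.h>

#define MAX_MTU_DATAGRAM  2044
#define MAX_MTU_CONNECTED 65520

/* 0 means "leave the MTU alone"; anything above the per-mode cap is
 * silently reduced, which is the behavior questioned above. */
static unsigned int clamp_infiniband_mtu(unsigned int mtu, int is_datagram)
{
    unsigned int max = is_datagram ? MAX_MTU_DATAGRAM : MAX_MTU_CONNECTED;

    if (mtu == 0)
        return 0;
    return mtu > max ? max : mtu;
}

int main(void)
{
    printf("%u\n", clamp_infiniband_mtu(4092, 1)); /* prints 2044 */
    return 0;
}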


(In reply to Honggang LI from comment #6)
> (In reply to Thomas Haller from comment #2)
> > Could you please provide a logfile of NetworkManager?
> 
> Please see comment #3, #4, #5.

Thank you.

In the provided logfile, there is no indication that NM would change the MTU of the device. It seems that the MTU is reset by the kernel when upping the interface.

Do you see any message from the kernel in dmesg/journal about truncating the MTU?

Comment 8 Honggang LI 2018-01-17 12:36:32 UTC
(In reply to Thomas Haller from comment #7)
> > In the old days, the IB L2 MTU was 2K, so using 2044 as the default MTU for
> > "datagram" mode was acceptable. But newer IB HCAs support an L2 MTU larger
> > than 2K, so NM should check the HCA's MTU before suggesting/guessing an MTU
> > for an IPoIB device.
> 
> Somewhat unrelated first:
> 
> In NetworkManager, you may optionally configure infiniband.mtu property. But
> for datagram mode, we restrict it to 2044.

Yes, I see. But it is not reasonable for modern HCAs which support a 4K L2 MTU.

https://bugzilla.redhat.com/show_bug.cgi?id=1532638#c4

> https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/libnm-core/
> nm-setting-infiniband.c?id=3c6cc7c2e0361f6651f58469ab76f7deb37a1cbe#n197
> So, even if you tried to configure something larger, NM would silently
> reduce it to 2044 (in datagram mode). Do you think that is a bug?

Yes, it is a bug.

> 
> Anyway, you can leave infiniband.mtu at zero (like you did). In this case NM
> should not re-configure the MTU.
> 
> 
> (In reply to Honggang LI from comment #6)
> > (In reply to Thomas Haller from comment #2)
> > > Could you please provide a logfile of NetworkManager?
> > 
> > Please see comment #3, #4, #5.
> 
> Thank you.
> 
> In the provided logfile, there is no indication that NM would change the MTU
> of the device. It seems that the MTU is reset by kernel when upping the
> interface.
> 
> Do you see any message from kernel in dmesg/journal about truncating the MTU?

No, I did not see any message from the kernel about truncating the MTU.

But something truncated the MTU. After I removed all NetworkManager* RPMs and rebooted the machine, the MTU was still 2044 after I activated the IPoIB interface.

I checked the opensm partition configuration file. The IPoIB MTU has been set to 4K, but the mgid MTU is 2K. It seems there are more bugs/issues preventing IPoIB datagram mode from using a 4K MTU.

$ ibdiagnet -r -c 1000
Loading IBDIAGNET from: /usr/lib64/ibdiagnet1.5.7
-W- Topology file is not specified.
    Reports regarding cluster links will use direct routes.
Loading IBDM from: /usr/lib64/ibdm1.5.7
-W- A few ports of local device are up.
    Since port-num was not specified (-p option), port 1 of device 1 will be
    used as the local port.
-I- Discovering ... 25 nodes (3 Switches & 22 CA-s) discovered.

-I- Parsing Subnet file:/var/cache/ibutils/ibdiagnet.lst
-I- Defined 25/25 systems/nodes 

-I---------------------------------------------------
-I- Bad Guids/LIDs Info
-I---------------------------------------------------
-I- No bad Guids were found

-I---------------------------------------------------
-I- Links With Logical State = INIT
-I---------------------------------------------------
-I- No bad Links (with logical state = INIT) were found

-I---------------------------------------------------
-I- General Device Info
-I---------------------------------------------------

-I---------------------------------------------------
-I- PM Counters Info
-I---------------------------------------------------
-I- No illegal PM counters values were found

-I---------------------------------------------------
-I- Fabric Partitions Report (see ibdiagnet.pkey for a full hosts list)
-I---------------------------------------------------
-I-    PKey:0x0002 Hosts:22 full:22 limited:0
-I-    PKey:0x0004 Hosts:22 full:22 limited:0
-I-    PKey:0x0006 Hosts:22 full:22 limited:0
-I-    PKey:0x7fff Hosts:37 full:22 limited:15

-I---------------------------------------------------
-I- IPoIB Subnets Check
-I---------------------------------------------------
-I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:20Gbps SL:0x00

-I---------------------------------------------------
-I- Bad Links Info
-I- No bad link were found
-I---------------------------------------------------

-I---------------------------------------------------
-I- Summary Fabric SM-state-priority
-I---------------------------------------------------
  SM - master
    rdma-master/P1 lid=0x0002 guid=0xf4521403007be131 dev=4099  priority:15
  SM - standby
    rdma-storage-01/P1 lid=0x001e guid=0xf4521403007bcba1 dev=4099  priority:14

-I---------------------------------------------------
-I- Fabric qualities report
-I---------------------------------------------------
-I- Parsing FDBs file:/var/cache/ibutils/ibdiagnet.fdbs
-I- Defined 141 fdb entries for:3 switches
-I- Parsing Multicast FDBs file:/var/cache/ibutils/ibdiagnet.mcfdbs
-I- Defined 265 Multicast Fdb entries for:3 switches
-I- 
-I- Verifying all CA to CA paths ... 
    ---------------------- CA to CA : LFT ROUTE HOP HISTOGRAM -----------------
    The number of CA pairs that are in each number of hops distance.
    This data is based on the result of the routing algorithm.
    
    HOPS NUM-CA-CA-PAIRS
  2   160
  3   170
  4   132
  ---------------------------------------------------------------------------
  
  ---------- LFT CA to CA : SWITCH OUT PORT - NUM DLIDS HISTOGRAM -----------
  Number of actual Destination LIDs going through each switch out port
  considering
  all the CA to CA paths. Ports driving CAs are ignored (as they must
  have = Nca - 1). If the fabric is routed correctly the histogram
  should be narrow for all ports on same level of the tree.
  A detailed report is provided in
  /var/cache/ibutils/ibdmchk.sw_out_port_num_dlids.
  
  NUM-DLIDS NUM-SWITCH-PORTS
       1   2
       2   2
       4   4
       5   2
       6   2
       ---------------------------------------------------------------------------
       
-I- Scanned:462 CA to CA paths 
    ---------------------------------------------------------------------------
    
-I- Scanning all multicast groups for loops and connectivity...
-I- Multicast Group:0xC000 has:3 switches and:22 HCAs
-I- Multicast Group:0xC001 has:3 switches and:22 HCAs
-I- Multicast Group:0xC004 has:3 switches and:3 HCAs
-I- Multicast Group:0xC008 has:3 switches and:22 HCAs
-I- Multicast Group:0xC00D has:3 switches and:7 HCAs
-I- Multicast Group:0xC00F has:3 switches and:14 HCAs
-I- Multicast Group:0xC010 has:3 switches and:14 HCAs
-I- Multicast Group:0xC017 has:3 switches and:14 HCAs
-I- Multicast Group:0xC01C has:1 switches and:3 HCAs
-I- Multicast Group:0xC01E has:3 switches and:15 HCAs
-I- Multicast Group:0xC01F has:3 switches and:15 HCAs
-I- Multicast Group:0xC022 has:3 switches and:15 HCAs
-I- Multicast Group:0xC025 has:3 switches and:8 HCAs
-I- Multicast Group:0xC026 has:3 switches and:8 HCAs
-I- Multicast Group:0xC02D has:3 switches and:8 HCAs
-I- Multicast Group:0xC032 has:1 switches and:2 HCAs
-I- Multicast Group:0xC046 has:3 switches and:4 HCAs
-I- Multicast Group:0xC050 has:1 switches and:2 HCAs
-I- Multicast Group:0xC053 has:3 switches and:3 HCAs
    ---------------------------------------------------------------------------
    
    

-I---------------------------------------------------
-I- Checking credit loops
-I---------------------------------------------------
-I- 
-I- Analyzing Fabric for Credit Loops 1 SLs, 1 VLs used.
-I- no credit loops found
    

-I---------------------------------------------------
-I- mgid-mlid-HCAs table
-I---------------------------------------------------
mgid                                  | mlid   | PKey   | QKey       | MTU   | rate     | HCAs
0xff12401bffff0000:0x0000000000000001 | 0xc001 | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 22
0xff12401bffff0000:0x0000000000000002 | 0xc002 | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 1
0xff12401bffff0000:0x0000000000000016 | 0xc003 | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 1
0xff12401bffff0000:0x00000000000000fb | 0xc004 | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 3
0xff12401bffff0000:0x00000000000000fc | 0xc005 | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 1
0xff12401bffff0000:0x0000000000000101 | 0xc006 | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 1
0xff12401bffff0000:0x0000000000000202 | 0xc007 | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 1
0xff12401bffff0000:0x00000000ffffffff | 0xc000 | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 22
0xff12601bffff0000:0x0000000000000001 | 0xc008 | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 22
0xff12601bffff0000:0x0000000000000002 | 0xc009 | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 1
0xff12601bffff0000:0x0000000000000016 | 0xc00a | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 1
0xff12601bffff0000:0x00000000000000fb | 0xc00b | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 1
0xff12601bffff0000:0x0000000000000101 | 0xc00c | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 1
0xff12601bffff0000:0x0000000000000202 | 0xc00d | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 7
0xff12601bffff0000:0x0000000000010003 | 0xc00e | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 1
0xff12601bffff0000:0x00000001ff085d60 | 0xc03c | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 1
0xff12601bffff0000:0x00000001ff085ef0 | 0xc03e | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 1
0xff12601bffff0000:0x00000001ff0ba73d | 0xc05f | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 1
0xff12601bffff0000:0x00000001ff0ba741 | 0xc064 | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 1
0xff12601bffff0000:0x00000001ff1d6471 | 0xc03b | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 1
0xff12601bffff0000:0x00000001ff21acc1 | 0xc037 | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 1
0xff12601bffff0000:0x00000001ff21ad35 | 0xc049 | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 1
0xff12601bffff0000:0x00000001ff317791 | 0xc038 | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 1
0xff12601bffff0000:0x00000001ff40a2b8 | 0xc059 | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 1
0xff12601bffff0000:0x00000001ff49d468 | 0xc05e | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 1
0xff12601bffff0000:0x00000001ff6f3370 | 0xc039 | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 1
0xff12601bffff0000:0x00000001ff6f33c6 | 0xc069 | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 1
0xff12601bffff0000:0x00000001ff6f33de | 0xc061 | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 1
0xff12601bffff0000:0x00000001ff6f33f6 | 0xc060 | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 1
0xff12601bffff0000:0x00000001ff77d3cc | 0xc040 | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 1
0xff12601bffff0000:0x00000001ff77d81a | 0xc03d | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 1
0xff12601bffff0000:0x00000001ff7bcba1 | 0xc055 | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 1
0xff12601bffff0000:0x00000001ff7be131 | 0xc036 | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 1
0xff12601bffff0000:0x00000001ff7be161 | 0xc05d | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 1
0xff12601bffff0000:0x00000001ffe70e7e | 0xc057 | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 1
0xff12601bffff0000:0x00000001ffe70e86 | 0xc042 | 0xffff | 0x00000b1b | =2048 | =20Gbps  | 1
0xff154001ffff0003:0x0400000000000000 | 0xc050 | 0xffff | 0x80010000 | =2048 | =10Gbps  | 2
----------------------------------------------------------------
-I- Stages Status Report:
    STAGE                                    Errors Warnings
    Bad GUIDs/LIDs Check                     0      0     
    Link State Active Check                  0      0     
    General Devices Info Report              0      0     
    Performance Counters Report              0      0     
    Partitions Check                         0      0     
    IPoIB Subnets Check                      0      0     
    Subnet Manager Check                     0      0     
    Fabric Qualities Report                  0      0     
    Credit Loops Check                       0      0     
    Multicast Groups Report                  0      0     

Please see /var/cache/ibutils/ibdiagnet.log for complete log
----------------------------------------------------------------
 
-I- Done. Run time was 10 seconds.



[root@rdma-master ~]$ cat /etc/rdma/partitions-ib0.conf  | grep mtu
# mtu = 
Default=0x7fff, rate=6 mtu=5 scope=2, defmember=full:
Default=0x7fff, ipoib, rate=6 mtu=5 scope=2:
ib0_2=0x0002, rate=7 mtu=5 scope=2, defmember=full:
ib0_2=0x0002, ipoib, rate=7 mtu=5 scope=2:
ib0_4=0x0004, rate=3 mtu=5 scope=2, defmember=full:
ib0_4=0x0004, ipoib, rate=3 mtu=5 scope=2:
ib0_6=0x0006, rate=12 mtu=5 scope=2, defmember=full:
ib0_6=0x0006, ipoib, rate=12 mtu=5 scope=2:


mtu=5 means MTU==4K.
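
For reference, the partition file encodes the MTU as the InfiniBand MTU enum rather than in bytes; a small sketch of the standard mapping:

#include <stdio.h>

enum ib_mtu {
    IB_MTU_256  = 1,
    IB_MTU_512  = 2,
    IB_MTU_1024 = 3,
    IB_MTU_2048 = 4,
    IB_MTU_4096 = 5  /* "mtu=5" in the partition file above */
};

static int ib_mtu_to_bytes(enum ib_mtu mtu)
{
    return 128 << mtu;  /* 1 -> 256, 4 -> 2048, 5 -> 4096 */
}

int main(void)
{
    printf("%d\n", ib_mtu_to_bytes(IB_MTU_4096)); /* prints 4096 */
    return 0;
}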

Comment 9 Honggang LI 2018-01-17 12:45:18 UTC
I also increased the MTU in the ifup-ib script, but it does not make any difference.

]$ grep 4092 /etc/sysconfig/network-scripts/ifup-ib 
	[ -z "$MTU" ] && MTU=4092
	[ "$MTU" -gt 4092 ] && MTU=4092

Comment 10 Honggang LI 2018-01-17 12:54:48 UTC
Doug,
 Do you know why the active IPoIB datagram MTU and the mgroup MTU are always 2K? It seems the MTU had been set to 4K in the opensm configuration files.

thanks

Comment 11 Doug Ledford 2018-01-17 14:30:38 UTC
Getting an entire subnet to use 4k MTU is not as simple as setting it in the opensm partitions file, unfortunately.  As IPoIB is layered on top of the underlying IB interface, you can't enable a 4k MTU on the IPoIB device unless the underlying MTU is 4k.

According to the ibv_devinfo output on rdma-master, the ib links are at 4k mtu, so that part is ok (this requires that all switches also be set to 4k).

However, even though the underlying link is at 4k, and the config file is saying to use 4k, opensm is indeed creating the groups with only a 2k limit.  As long as opensm is creating the groups with a 2k limit, all of the other machines will honor that and refuse to set the device mtu > 2k.  The real question is why opensm is doing that, and I suspect the answer is because we have rdma-dev-04/05 in the cluster.  These are the older mthca cards, and it's entirely possible that we have the cards that have a hard limit of 2k.  I'm pretty sure that if opensm finds any 2k limited cards in the cluster at all, it limits the entire cluster to 2k on the IPoIB interface.  But, we would have to dig into that to be sure.

However, that all said, I think it's also worth noting that in the early days of IB, a 2k maximum was the default.  While it's possible to use a 4k MTU on most cards today, it does not appreciably help actual RDMA performance for connected mode ports.  This is because connected mode data transfers never see the MTU: the disassembly/reassembly of the larger RDMA requests into MTU-sized packets happens in the card and is transparent to the software.  Only UD/RD connections are MTU limited and therefore impacted by the link layer MTU size.  And only when IPoIB is in datagram mode does the MTU size impact it.  Also, the upside to a 2k MTU is that it reduces the latency of the fabric overall.  So, it's a tradeoff.  A bigger MTU gives better streaming performance for large data transfers; a more moderate MTU gives lower latency but reduced overall streaming throughput.  So if you are more interested in 1,000,000 tiny messages getting through as fast as possible and are willing to let your SRP data slow down a little bit, then a smaller MTU is better.

Anyway, right now, to see why our MTU is limited to 2k would require enabling debug options in opensm on rdma-master and seeing why opensm is putting the limit in place.

Comment 12 Honggang LI 2018-01-17 14:58:21 UTC
(In reply to Doug Ledford from comment #11)

> Anyway, right now, to see why our MTU is limited to 2k would require
> enabling debug options in opensm on rdma-master and seeing why opensm is
> putting the limit in place.

Fine, I will try to debug this with rdma03/04 in the PEK2 lab. It is not acceptable to repeatedly restart the opensm daemon on rdma-master, as that would impact the whole RDMA cluster. And there are two opensm daemons running on the cluster; the standby opensm daemon makes the problem more complex.

rdma03/04 are connected end to end, so it is safe to abuse the opensm daemon there. The HCAs on rdma03/04 support an L2 4K MTU.

Comment 13 Don Dutile (Red Hat) 2018-01-17 15:29:22 UTC
Note:
There's a bz that QE opened complaining that connected-mode IPoIB does not exist/work on mlx5.  Kamal reported that the new IPoIB accelerator on mlx5 only works in datagram mode, thus there is no way to force connected mode on mlx5 IPoIB -- regression??? -- part of the rhel-7.5 rdma backport of upstream's mlx5-ipoib accelerator.
You may want to reach out to Kamal and ask him if MLNX sees a 2K vs 4K MTU limitation in datagram mode in their lab.

Doug: thanks for the explanation in c#11. yet another lesson about our Franken-cluster! ;-)

Honggang: thanks for effort to test on your cluster to verify Doug's hypothesis.
        : many folks are working on getting non-working systems working again in the Westford RDMA cluster, so not disturbing it is appreciated.  Art Benoit gave me a disturbing update on how dev-ops 'rebalances' the dhcp servers w/o telling our lab-mgrs, unexpectedly creating a set of failed ip-assignments in our cluster.  Art is working on correcting those problems today.
(as Art also got two of the qe machines back up to operational b/c the raid controllers on them were (somehow) modified such that (jbod, lvm-) provisioning suddenly failed.)

Comment 14 Thomas Haller 2018-01-17 16:12:07 UTC
(In reply to Honggang LI from comment #8)
> (In reply to Thomas Haller from comment #7)
> > In NetworkManager, you may optionally configure infiniband.mtu property. But
> > for datagram mode, we restrict it to 2044.
> 
> Yes, I see. But it is not reasonable for modern HCA which support 4K L2 MTU.
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1532638#c4
> 
> > https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/libnm-core/
> > nm-setting-infiniband.c?id=3c6cc7c2e0361f6651f58469ab76f7deb37a1cbe#n197
> > So, even if you tried to configure something larger, NM would silently
> > reduce it to 2044 (in datagram mode). Do you think that is a bug?
> 
> Yes, it is a bug.

So, what is the correct range for datagram's MTU?

In the report rh#1532638, it is not clear that the user actually configured "infiniband.mtu" to a non-zero value in NetworkManager. It doesn't sound like this limitation in the profile is the cause for rh#1532638.

Comment 15 Doug Ledford 2018-01-17 19:12:59 UTC
(In reply to Don Dutile from comment #13)
> Note:
> There's a bz that qe opened complaining that connected-mode IPoIB does not
> exist/work on mlx5.  Kamal reported that the new, IPoIB accelerator on mlx5
> only works in datagram mode, thus no way to force connected-mode on mlx5
> IPoIB -- regression??? -- part of rhel-7.5 rdma backport of upstream's
> mlx5-ipoib accelerator.

I'm not sure if Mellanox considers it a regression or not.  Even in connected mode we don't get the hardware offloads that we want.  When they added the ipoib accelerator support, it all boiled down to basic IP checksum offloads, but they only work on datagram mode IPoIB connections where the checksum is limited to a single on-the-wire packet.  In connected mode, the TCP checksum would span link layer packets, making the checksum support much more difficult (if not impossible) for the card to do.  It is likely, I think, that datagram mode + CSUM offloads is in fact faster than connected mode without offloads.  You can argue that we will have lots of customer calls about this unexpected behavior, but it is also likely that once we explain the situation, they'll be fine with it.  A release note to head those calls off at the pass is likely in order.

> You may want to reach out to Kamal and ask him if MLNX sees a 2K vs 4K MTU
> limitation in datagram mode in their lab.

The ipoib code, in ipoib_main.c:ipoib_change_mtu() caps the mtu for the ipoib device at priv->mcast_mtu.  We init priv->mcast_mtu to the largest supported by the underlying link layer when we create the ipoib interface.  We then adjust this later when the ipoib link comes up as we will then have a valid assigned maximum mtu from the SM.  This raises an issue:

If the ipoib device is created before the underlying IB device link layer is up, we will set the mcast_mtu to whatever the driver's default link layer size is prior to being brought up by the SM.  Depending on the driver, or firmware, or other issues, the card might default to a max of 2k MTU prior to the SM telling it that it can go higher.  I haven't checked this, it might or might not be an issue.  But, what I clearly see in the ipoib driver is that we only query the port at ipoib device creation time and set our maximum MTU then.  It's common, I think, that our ipoib module and devices are loaded/created prior to the link on the underlying device coming up.  So I would actually expect this to be the common case.  Later on, when we join the broadcast group, we query the port again to make sure the underlying IB device is marked up, but we don't double-check our MTU.  That's a bug, I think.  In fact, if a user were to go into the normal IB tools and change the maximum MTU on a port, there is currently no mechanism whereby that is ever picked up by the ipoib driver without simply unloading and reloading the ipoib module.  While I think this is a bug, it is not, however, the cause of *this* bug: we have documented that the OpenSM-created multicast groups have the wrong MTU, and since those groups should precede the creation of the ipoib devices, and not depend on the ipoib devices, this bug cannot possibly cause the SM to create 2048 MTU pkeys instead of the requested 4096 MTU pkeys.
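
To make the capping concrete, here is a heavily reduced sketch of the behavior described above (the real logic lives in ipoib_main.c:ipoib_change_mtu(); the structure keeps only the one relevant field):

#include <stdio.h>

struct ipoib_priv_sketch {
    unsigned int mcast_mtu; /* queried from the port at ipoib device creation;
                               may still hold a pre-link-up 2K default, and is
                               not re-checked later */
};

/* Requests above mcast_mtu are capped rather than honored, so a stale
 * 2K mcast_mtu pins the device MTU at 2044 no matter what is asked. */
static unsigned int change_mtu_sketch(const struct ipoib_priv_sketch *priv,
                                      unsigned int new_mtu)
{
    if (new_mtu > priv->mcast_mtu)
        new_mtu = priv->mcast_mtu;
    return new_mtu;
}

int main(void)
{
    struct ipoib_priv_sketch priv = { .mcast_mtu = 2044 };

    printf("%u\n", change_mtu_sketch(&priv, 4092)); /* prints 2044 */
    return 0;
}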

It is entirely possible that merely a difference in SMs between Mellanox's lab and our lab would prevent Mellanox from seeing the issue we are.  I'm positive that I've identified a shortcoming in the IPoIB MTU management code, but it doesn't necessarily mean that a 4k MTU isn't obtainable in a network where the SM creates the right groups.

> Doug: thanks for the explanation in c#11. yet another lesson about our
> Franken-cluster! ;-)

;-)

> Honggang: thanks for effort to test on your cluster to verify Doug's
> hypothesis.
>         : many folks working on getting non-working systems working again in
> the Westford RDMA cluster, so not disturbing it is appreciated.  Art Benoit
> gave me a disturbing update how dev-ops 'rebalances' the dhcp servers w/o
> telling our lab-mgrs, creating a set of failed ip-assignments in our
> cluster, unexpectedly.  Art is working on correcting those problems today.
> (as Art also got two of the qe machines back up to operational b/c the raid
> controllers on them were (somehow) modified such that (jbod, lvm-)
> provisioning suddenly failed.)

Comment 16 Doug Ledford 2018-01-17 19:41:43 UTC
(In reply to Thomas Haller from comment #14)
> (In reply to Honggang LI from comment #8)
> > (In reply to Thomas Haller from comment #7)
> > > In NetworkManager, you may optionally configure infiniband.mtu property. But
> > > for datagram mode, we restrict it to 2044.
> > 
> > Yes, I see. But it is not reasonable for modern HCA which support 4K L2 MTU.
> > 
> > https://bugzilla.redhat.com/show_bug.cgi?id=1532638#c4
> > 
> > > https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/libnm-core/
> > > nm-setting-infiniband.c?id=3c6cc7c2e0361f6651f58469ab76f7deb37a1cbe#n197
> > > So, even if you tried to configure something larger, NM would silently
> > > reduce it to 2044 (in datagram mode). Do you think that is a bug?
> > 
> > Yes, it is a bug.
> 
> So, what is the correct range for datagram's MTU?

Datagram MTU is limited to the physical link layer MTU - 4 (the IPOIB_ENCAP_SIZE: when we send the packet out on the wire it has a 4-byte IPoIB header added, and that's it, so we merely have to fit the entire IP packet plus our IPoIB header into a link-layer-MTU-sized packet).  At one point this was commonly 2044, but in recent years it is more common that 4092 is also allowed.

For connected mode, we are sending over a reliable connected queue pair connection, so our limit is nearly infinite.  We could easily do 4MB packets.  However, a single IP packet has a size limit of 65535 due to the 16-bit length field in the IP header.  As a result, the 65520 MTU limit was selected so that the largest combination of valid TCP/UDP headers allowed on our interface (obviously there are lots of TCP/UDP headers that aren't valid on our interfaces due to lack of hardware support for the item) plus the payload would still be less than the IP length limit.
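
The two limits can be restated as a couple of lines of arithmetic (a sketch of the values above, not kernel code):

#include <stdio.h>

#define IPOIB_ENCAP_SIZE 4                   /* the 4-byte IPoIB header */
#define IPOIB_UD_MTU(link_mtu) ((link_mtu) - IPOIB_ENCAP_SIZE)
#define IPOIB_CM_MTU 65520                   /* fits the 16-bit IP length */

int main(void)
{
    printf("datagram over 2K link: %d\n", IPOIB_UD_MTU(2048)); /* 2044 */
    printf("datagram over 4K link: %d\n", IPOIB_UD_MTU(4096)); /* 4092 */
    printf("connected mode cap:    %d\n", IPOIB_CM_MTU);
    return 0;
}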

> In the report rh#1532638, it is not clear that the user actually configured
> "infiniband.mtu" to a non-zero value in NetworkManager. It doesn't sound
> like this limitation in the profile is the cause for rh#1532638.

Comment 17 Honggang LI 2018-01-18 08:49:41 UTC
(In reply to Doug Ledford from comment #11)
> The real
> question is why opensm is doing that, and I suspect the answer is because we
> have rdma-dev-04/05 in the cluster.  These are the older mthca cards, and

I have not found the right answer yet. But it is unlikely to be because of the old mthca cards in rdma-dev-04/05, as the MTU persists at 2044 with rdma03/04, which have FDR mlx4 HCAs. Those FDR HCAs support a 4K MTU.

Comment 18 Honggang LI 2018-01-18 11:21:38 UTC
(In reply to Honggang LI from comment #17)
> (In reply to Doug Ledford from comment #11)
> > The real
> > question is why opensm is doing that, and I suspect the answer is because we
> > have rdma-dev-04/05 in the cluster.  These are the older mthca cards, and
> 
> I have not found the right answer yet. But it is unlikely to be because of the
> old mthca cards in rdma-dev-04/05, as the MTU persists at 2044 with rdma03/04,
> which have FDR mlx4 HCAs. Those FDR HCAs support a 4K MTU.

Confirmed this is an opensm bug. And it is a very STUPID bug.

Partition Definition:
  [PartitionName][=PKey][,indx0][,ipoib_bc_flags][,defmember=full|limited|both]

    ipoib_bc_flags:
      ipoib_flag|[mgroup_flag]*
      
      mgroup_flag:
        rate=<val>  - specifies rate for this MC group
                      (default is 3 (10GBps))
        mtu=<val>   - specifies MTU for this MC group
                      (default is 4 (2048))
        sl=<val>    - specifies SL for this MC group
                      (default is 0)


OPENSM ONLY HONORS THE FIRST ITEM OF THE MGROUP_FLAGS.

[root@rdma03 opensm]# cat partitions.conf.default 
Default=0x7fff,ipoib, mtu=5 rate=12:ALL=full;
[root@rdma03 opensm]# 
[root@rdma03 opensm]# opensm -D 0xff -P partitions.conf.default 
-------------------------------------------------
OpenSM 4.9.1.MLNX20171001.1764298
 Reading Cached Option File: /etc/opensm/opensm.conf
Command Line Arguments:
 verbose option -D = 0xff
 Log File: /var/log/opensm.log
-------------------------------------------------
OpenSM 4.9.1.MLNX20171001.1764298

Using default GUID 0x2c90300b3cff1
Entering DISCOVERING state

Entering MASTER state


=======================================================================================================
Vendor      : Ty : #  : Sta : LID  : LMC : MTU  : LWA : LSA  : Port GUID        : Neighbor Port (Port #)
Mellanox    : CA : 01 : ACT : 0002 :  0  : 4096 : 4x  : 14   : 0002c90300b3c7c1 : 0002c90300b3cff1 (01)
------------------------------------------------------------------------------------------------------
Mellanox    : CA : 01 : ACT : 0001 :  0  : 4096 : 4x  : 14   * 0002c90300b3cff1 * 0002c90300b3c7c1 (01)
------------------------------------------------------------------------------------------------------

=======================================================================================================
Vendor      : Ty : #  : Sta : LID  : LMC : MTU  : LWA : LSA  : Port GUID        : Neighbor Port (Port #)
Mellanox    : CA : 01 : ACT : 0002 :  0  : 4096 : 4x  : 14   : 0002c90300b3c7c1 : 0002c90300b3cff1 (01)
------------------------------------------------------------------------------------------------------
Mellanox    : CA : 01 : ACT : 0001 :  0  : 4096 : 4x  : 14   * 0002c90300b3cff1 * 0002c90300b3c7c1 (01)
------------------------------------------------------------------------------------------------------

=======================================================================================================
Vendor      : Ty : #  : Sta : LID  : LMC : MTU  : LWA : LSA  : Port GUID        : Neighbor Port (Port #)
Mellanox    : CA : 01 : ACT : 0002 :  0  : 4096 : 4x  : 14   : 0002c90300b3c7c1 : 0002c90300b3cff1 (01)
------------------------------------------------------------------------------------------------------
Mellanox    : CA : 01 : ACT : 0001 :  0  : 4096 : 4x  : 14   * 0002c90300b3cff1 * 0002c90300b3c7c1 (01)
------------------------------------------------------------------------------------------------------
^COpenSM: Got signal 2 - exiting...
Exiting SM






[root@rdma04 ~]# /opt/ibutils/bin/ibdiagnet -r 
Loading IBDIAGNET from: /opt/ibutils/lib64/ibdiagnet1.5.7.1
-W- Topology file is not specified.
    Reports regarding cluster links will use direct routes.
Loading IBDM from: /opt/ibutils/lib64/ibdm1.5.7.1
-W- A few ports of local device are up.
    Since port-num was not specified (-p option), port 1 of device 1 will be
    used as the local port.
-I- Discovering ... 2 nodes (0 Switches & 2 CA-s) discovered.

-I- Parsing Subnet file:/tmp/ibdiagnet.lst
-I- Defined 2/2 systems/nodes 

-I---------------------------------------------------
-I- Bad Guids/LIDs Info
-I---------------------------------------------------
-I- No bad Guids were found

-I---------------------------------------------------
-I- Links With Logical State = INIT
-I---------------------------------------------------
-I- No bad Links (with logical state = INIT) were found

-I---------------------------------------------------
-I- General Device Info
-I---------------------------------------------------

-I---------------------------------------------------
-I- PM Counters Info
-I---------------------------------------------------
-I- No illegal PM counters values were found

-I---------------------------------------------------
-I- Fabric Partitions Report (see ibdiagnet.pkey for a full hosts list)
-I---------------------------------------------------
-I-    PKey:0x7fff Hosts:2 full:2 limited:0

-I---------------------------------------------------
-I- IPoIB Subnets Check
-I---------------------------------------------------
-I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:4096Byte rate:10Gbps SL:0x00
-W- Suboptimal rate for group. Lowest member rate:40Gbps > group-rate:10Gbps

-I---------------------------------------------------
-I- Bad Links Info
-I- No bad link were found
-I---------------------------------------------------

-I---------------------------------------------------
-I- Summary Fabric SM-state-priority
-I---------------------------------------------------
  SM - master
    MT25408/P1 lid=0x0001 guid=0x0002c90300b3cff1 dev=4099  priority:0

-I---------------------------------------------------
-I- Fabric qualities report
-I---------------------------------------------------
-I- Parsing FDBs file:/tmp/ibdiagnet.fdbs
-I- Defined 0 fdb entries for:0 switches
-I- Parsing Multicast FDBs file:/tmp/ibdiagnet.mcfdbs
-I- Defined 0 Multicast Fdb entries for:0 switches
-I- 
-I- Verifying all CA to CA paths ... 
-E- Provided starting point is not connected to a switch !lid:2
-E- Fail to find a path from:S0002c90300b3c7c3/U1/1 to:S0002c90300b3cff3/U1/1
-E- Provided starting point is not connected to a switch !lid:1
-E- Fail to find a path from:S0002c90300b3cff3/U1/1 to:S0002c90300b3c7c3/U1/1
    ---------------------- CA to CA : LFT ROUTE HOP HISTOGRAM -----------------
    The number of CA pairs that are in each number of hops distance.
    This data is based on the result of the routing algorithm.
    
    HOPS NUM-CA-CA-PAIRS
    ---------------------------------------------------------------------------
    
    ---------- LFT CA to CA : SWITCH OUT PORT - NUM DLIDS HISTOGRAM -----------
    Number of actual Destination LIDs going through each switch out port
    considering
    all the CA to CA paths. Ports driving CAs are ignored (as they must
    have = Nca - 1). If the fabric is routed correctly the histogram
    should be narrow for all ports on same level of the tree.
    A detailed report is provided in /tmp/ibdmchk.sw_out_port_num_dlids.
    
    NUM-DLIDS NUM-SWITCH-PORTS
    ---------------------------------------------------------------------------
    
-E- Found 2 missing paths out of:2 paths
    ---------------------------------------------------------------------------
    
-I- Scanning all multicast groups for loops and connectivity...
    ---------------------------------------------------------------------------
    
    
-E- Total Qualities Check Errors:5

-I---------------------------------------------------
-I- Checking credit loops
-I---------------------------------------------------
-I- 
-I- Analyzing Fabric for Credit Loops 1 SLs, 1 VLs used.
-I- no credit loops found
    

-I---------------------------------------------------
-I- mgid-mlid-HCAs table
-I---------------------------------------------------
mgid                                  | mlid   | PKey   | QKey       | MTU   | rate     | HCAs
0xff12401bffff0000:0x0000000000000001 | 0xc001 | 0xffff | 0x00000b1b | =4096 | =10Gbps  | 1
0xff12401bffff0000:0x00000000ffffffff | 0xc000 | 0xffff | 0x00000b1b | =4096 | =10Gbps  | 1
0xff12601bffff0000:0x0000000000000001 | 0xc002 | 0xffff | 0x00000b1b | =4096 | =10Gbps  | 1
0xff12601bffff0000:0x00000001ffb3c7c1 | 0xc003 | 0xffff | 0x00000b1b | =4096 | =10Gbps  | 1
----------------------------------------------------------------
-I- Stages Status Report:
    STAGE                                    Errors Warnings
    Bad GUIDs/LIDs Check                     0      0     
    Link State Active Check                  0      0     
    General Devices Info Report              0      0     
    Performance Counters Report              0      0     
    Partitions Check                         0      0     
    IPoIB Subnets Check                      0      1     
    Subnet Manager Check                     0      0     
    Fabric Qualities Report                  5      0     
    Credit Loops Check                       0      0     
    Multicast Groups Report                  0      0     

Please see /tmp/ibdiagnet.log for complete log
----------------------------------------------------------------
 
-I- Done. Run time was 0 seconds.
[root@rdma04 ~]# 
[root@rdma04 ~]# 
[root@rdma04 ~]# mtu
==> /sys/devices/pci0000:00/0000:00:01.0/0000:07:00.0/net/ib0/mtu <==
4092

==> /sys/devices/pci0000:00/0000:00:01.0/0000:07:00.0/net/ib1/mtu <==
4092

==> /sys/devices/pci0000:00/0000:00:01.0/0000:07:00.0/mlx4_port1_mtu <==
4096

==> /sys/devices/pci0000:00/0000:00:01.0/0000:07:00.0/mlx4_port2_mtu <==
4096
[root@rdma04 ~]# 


*******************************************************************
[root@rdma03 opensm]# vi  partitions.conf.default 
[root@rdma03 opensm]# 
[root@rdma03 opensm]# 
[root@rdma03 opensm]# cat   partitions.conf.default 
Default=0x7fff,ipoib, mtu=3 rate=12:ALL=full;
[root@rdma03 opensm]# 
[root@rdma03 opensm]# 
[root@rdma03 opensm]# opensm -D 0xff -P partitions.conf.default 
-------------------------------------------------
OpenSM 4.9.1.MLNX20171001.1764298
 Reading Cached Option File: /etc/opensm/opensm.conf
Command Line Arguments:
 verbose option -D = 0xff
 Log File: /var/log/opensm.log
-------------------------------------------------
OpenSM 4.9.1.MLNX20171001.1764298

Using default GUID 0x2c90300b3cff1
Entering DISCOVERING state

Entering MASTER state


=======================================================================================================
Vendor      : Ty : #  : Sta : LID  : LMC : MTU  : LWA : LSA  : Port GUID        : Neighbor Port (Port #)
Mellanox    : CA : 01 : ACT : 0002 :  0  : 4096 : 4x  : 14   : 0002c90300b3c7c1 : 0002c90300b3cff1 (01)
------------------------------------------------------------------------------------------------------
Mellanox    : CA : 01 : ACT : 0001 :  0  : 4096 : 4x  : 14   * 0002c90300b3cff1 * 0002c90300b3c7c1 (01)
------------------------------------------------------------------------------------------------------

=======================================================================================================
Vendor      : Ty : #  : Sta : LID  : LMC : MTU  : LWA : LSA  : Port GUID        : Neighbor Port (Port #)
Mellanox    : CA : 01 : ACT : 0002 :  0  : 4096 : 4x  : 14   : 0002c90300b3c7c1 : 0002c90300b3cff1 (01)
------------------------------------------------------------------------------------------------------
Mellanox    : CA : 01 : ACT : 0001 :  0  : 4096 : 4x  : 14   * 0002c90300b3cff1 * 0002c90300b3c7c1 (01)
------------------------------------------------------------------------------------------------------

=======================================================================================================
Vendor      : Ty : #  : Sta : LID  : LMC : MTU  : LWA : LSA  : Port GUID        : Neighbor Port (Port #)
Mellanox    : CA : 01 : ACT : 0002 :  0  : 4096 : 4x  : 14   : 0002c90300b3c7c1 : 0002c90300b3cff1 (01)
------------------------------------------------------------------------------------------------------
Mellanox    : CA : 01 : ACT : 0001 :  0  : 4096 : 4x  : 14   * 0002c90300b3cff1 * 0002c90300b3c7c1 (01)
------------------------------------------------------------------------------------------------------
^COpenSM: Got signal 2 - exiting...
Exiting SM



[root@rdma04 ~]# 
[root@rdma04 ~]# /opt/ibutils/bin/ibdiagnet -r 
Loading IBDIAGNET from: /opt/ibutils/lib64/ibdiagnet1.5.7.1
-W- Topology file is not specified.
    Reports regarding cluster links will use direct routes.
Loading IBDM from: /opt/ibutils/lib64/ibdm1.5.7.1
-W- A few ports of local device are up.
    Since port-num was not specified (-p option), port 1 of device 1 will be
    used as the local port.
-I- Discovering ... 2 nodes (0 Switches & 2 CA-s) discovered.

-I- Parsing Subnet file:/tmp/ibdiagnet.lst
-I- Defined 2/2 systems/nodes 

-I---------------------------------------------------
-I- Bad Guids/LIDs Info
-I---------------------------------------------------
-I- No bad Guids were found

-I---------------------------------------------------
-I- Links With Logical State = INIT
-I---------------------------------------------------
-I- No bad Links (with logical state = INIT) were found

-I---------------------------------------------------
-I- General Device Info
-I---------------------------------------------------

-I---------------------------------------------------
-I- PM Counters Info
-I---------------------------------------------------
-I- No illegal PM counters values were found

-I---------------------------------------------------
-I- Fabric Partitions Report (see ibdiagnet.pkey for a full hosts list)
-I---------------------------------------------------
-I-    PKey:0x7fff Hosts:2 full:2 limited:0

-I---------------------------------------------------
-I- IPoIB Subnets Check
-I---------------------------------------------------
-I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:1024Byte rate:10Gbps SL:0x00
-W- Suboptimal rate for group. Lowest member rate:40Gbps > group-rate:10Gbps

-I---------------------------------------------------
-I- Bad Links Info
-I- No bad link were found
-I---------------------------------------------------

-I---------------------------------------------------
-I- Summary Fabric SM-state-priority
-I---------------------------------------------------
  SM - master
    MT25408/P1 lid=0x0001 guid=0x0002c90300b3cff1 dev=4099  priority:0

-I---------------------------------------------------
-I- Fabric qualities report
-I---------------------------------------------------
-I- Parsing FDBs file:/tmp/ibdiagnet.fdbs
-I- Defined 0 fdb entries for:0 switches
-I- Parsing Multicast FDBs file:/tmp/ibdiagnet.mcfdbs
-I- Defined 0 Multicast Fdb entries for:0 switches
-I- 
-I- Verifying all CA to CA paths ... 
-E- Provided starting point is not connected to a switch !lid:2
-E- Fail to find a path from:S0002c90300b3c7c3/U1/1 to:S0002c90300b3cff3/U1/1
-E- Provided starting point is not connected to a switch !lid:1
-E- Fail to find a path from:S0002c90300b3cff3/U1/1 to:S0002c90300b3c7c3/U1/1
    ---------------------- CA to CA : LFT ROUTE HOP HISTOGRAM -----------------
    The number of CA pairs that are in each number of hops distance.
    This data is based on the result of the routing algorithm.
    
    HOPS NUM-CA-CA-PAIRS
    ---------------------------------------------------------------------------
    
    ---------- LFT CA to CA : SWITCH OUT PORT - NUM DLIDS HISTOGRAM -----------
    Number of actual Destination LIDs going through each switch out port
    considering
    all the CA to CA paths. Ports driving CAs are ignored (as they must
    have = Nca - 1). If the fabric is routed correctly the histogram
    should be narrow for all ports on same level of the tree.
    A detailed report is provided in /tmp/ibdmchk.sw_out_port_num_dlids.
    
    NUM-DLIDS NUM-SWITCH-PORTS
    ---------------------------------------------------------------------------
    
-E- Found 2 missing paths out of:2 paths
    ---------------------------------------------------------------------------
    
-I- Scanning all multicast groups for loops and connectivity...
    ---------------------------------------------------------------------------
    
    
-E- Total Qualities Check Errors:5

-I---------------------------------------------------
-I- Checking credit loops
-I---------------------------------------------------
-I- 
-I- Analyzing Fabric for Credit Loops 1 SLs, 1 VLs used.
-I- no credit loops found
    

-I---------------------------------------------------
-I- mgid-mlid-HCAs table
-I---------------------------------------------------
mgid                                  | mlid   | PKey   | QKey       | MTU   | rate     | HCAs
0xff12401bffff0000:0x0000000000000001 | 0xc001 | 0xffff | 0x00000b1b | =1024 | =10Gbps  | 1
0xff12401bffff0000:0x00000000ffffffff | 0xc000 | 0xffff | 0x00000b1b | =1024 | =10Gbps  | 1
----------------------------------------------------------------
-I- Stages Status Report:
    STAGE                                    Errors Warnings
    Bad GUIDs/LIDs Check                     0      0     
    Link State Active Check                  0      0     
    General Devices Info Report              0      0     
    Performance Counters Report              0      0     
    Partitions Check                         0      0     
    IPoIB Subnets Check                      0      1     
    Subnet Manager Check                     0      0     
    Fabric Qualities Report                  5      0     
    Credit Loops Check                       0      0     
    Multicast Groups Report                  0      0     

Please see /tmp/ibdiagnet.log for complete log
----------------------------------------------------------------
 
-I- Done. Run time was 0 seconds.
[root@rdma04 ~]# 
[root@rdma04 ~]# 
[root@rdma04 ~]# mtu
==> /sys/devices/pci0000:00/0000:00:01.0/0000:07:00.0/net/ib0/mtu <==
1020

==> /sys/devices/pci0000:00/0000:00:01.0/0000:07:00.0/net/ib1/mtu <==
4092

==> /sys/devices/pci0000:00/0000:00:01.0/0000:07:00.0/mlx4_port1_mtu <==
4096

==> /sys/devices/pci0000:00/0000:00:01.0/0000:07:00.0/mlx4_port2_mtu <==
4096
[root@rdma04 ~]#

Comment 21 Honggang LI 2018-08-29 12:26:51 UTC
(In reply to Honggang LI from comment #8)

> [root@rdma-master ~]$ cat /etc/rdma/partitions-ib0.conf  | grep mtu
> # mtu = 
> Default=0x7fff, rate=6 mtu=5 scope=2, defmember=full:
> Default=0x7fff, ipoib, rate=6 mtu=5 scope=2:
                         ^^^^^^^^^^^^^^^^^^^^

Because of two issues, we failed to set the MTU to 4K.
1) The configuration file is wrong. There MUST be a comma (,) between the mgroup_flag fields.

Default=0x7fff, ipoib, rate=6 mtu=5 scope=2:

should be:

Default=0x7fff, ipoib, rate=6, mtu=5, scope=2:
                             ^      ^


I believe we had been misled by the example configuration file "/etc/rdma/partitions.conf" and the upstream doc source file "opensm-top-dir/doc/partition-config.txt". No doc emphasizes that the fields of mgroup_flag must be separated by a comma.

We should update these two files.

2) The function "parse_name_token" is error-prone. It returns a wrong 'flval' when a wrong configuration is passed to it. In fact, it should raise an error.


I instrumented the upstream opensm source code. Output with the wrong configuration file:
----------------------------
osm_prtn_config_parse_file open /etc/opensm/partitions.conf
osm_prtn_config_parse_file read line (1) (# Bad configuration, ib0's mtu will be 2044
)
osm_prtn_config_parse_file read line (2) (# Default=0x7fff,ipoib, rate=12 mtu=5:ALL=full;
)
osm_prtn_config_parse_file read line (3) (
)
osm_prtn_config_parse_file read line (4) (# Good configuration, ib0's mtu will be 4092
)
osm_prtn_config_parse_file read line (5) (Default=0x7fff,ipoib, mtu=5 rate=12:ALL=full;
)

===>  parse_name_token return ret=(15) name=(Default), id=(0x7fff)

===>  parse_name_token return ret=(6) flag=(ipoib), flval=((null))

===>  parse_name_token return ret=(15) flag=(mtu), flval=(5 rate=12) <=====
                                                   ^^^^^^^^^^^^^^^^^

IT SHOULD RAISE AN ERROR HERE, AS A WRONG 'FLVAL' WAS RETURNED. THAT IS WHY OPENSM ONLY HONORS THE FIRST FIELD OF MGROUP_FLAG.

===>  parse_name_token return ret=(9) name=(ALL), flag=(full)
----------------------------


> ib0_2=0x0002, rate=7 mtu=5 scope=2, defmember=full:
> ib0_2=0x0002, ipoib, rate=7 mtu=5 scope=2:
> ib0_4=0x0004, rate=3 mtu=5 scope=2, defmember=full:
> ib0_4=0x0004, ipoib, rate=3 mtu=5 scope=2:
> ib0_6=0x0006, rate=12 mtu=5 scope=2, defmember=full:
> ib0_6=0x0006, ipoib, rate=12 mtu=5 scope=2:
> 
> 
> mtu=5 means MTU==4K.
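
To illustrate why parse_name_token returned flval=(5 rate=12) above, here is a hypothetical mini-check (not opensm's actual code) showing how such a value could be flagged instead of being silently accepted:

#include <stdio.h>
#include <string.h>

/* A value scanned up to the next comma/colon swallows the following
 * flags when a space was used as the separator; flag it as suspicious. */
static int check_group_flag_value(const char *flag, const char *flval)
{
    if (strchr(flval, ' ')) {
        fprintf(stderr,
                "suspicious value '%s' for flag '%s': mgroup_flag fields "
                "must be separated by commas\n", flval, flag);
        return -1;
    }
    return 0;
}

int main(void)
{
    /* the flval observed in the instrumented trace above */
    check_group_flag_value("mtu", "5 rate=12");
    return 0;
}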

Comment 22 Don Dutile (Red Hat) 2018-08-29 18:22:19 UTC
(In reply to Honggang LI from comment #21)


Nice debug!

So, do we have a lab setup problem (only), or do we have a bad script generating improper config files?

Comment 23 Doug Ledford 2018-08-29 18:32:05 UTC
(In reply to Don Dutile from comment #22)
> (In reply to Honggang LI from comment #21)
> 
> 
> Nice debug!

Agreed, good job Honggang.

> So, do we have a lab setup problem (only), or do we have a bad script
> generating improper config files?

The script doesn't spit out the partition files; they are static files on the lookaside server that are downloaded during an install.  So we just need to update the git repo copy of the files, and that resolves the issue as far as the config files are concerned.

Comment 24 Don Dutile (Red Hat) 2018-08-29 19:31:35 UTC
(In reply to Doug Ledford from comment #23)
> (In reply to Don Dutile from comment #22)
> > (In reply to Honggang LI from comment #21)
> > 
> > 
> > Nice debug!
> 
> Agreed, good job Honggang.
> 
> > So, do we have a lab setup problem (only), or do we have a bad script
> > generating improper config files?
> 
> The script doesn't spit out the partition files, they are static files on
> the lookaside server that are downloaded during an install.  So, just need
> to update the git repo copy of the files and then that resolves the issue as
> far as the config files are concerned.

ok, so who is updating the git repo & pushing it out?

Comment 25 Honggang LI 2018-09-03 09:32:17 UTC
(In reply to Don Dutile from comment #24)
 
> ok, so who is updating the git repo & pushing it out?

I sent two opensm patches upstream for review. I will try to update the opensm config for rdma-master this weekend, if the patches are accepted upstream.

Comment 26 Honggang LI 2018-09-05 11:38:23 UTC
(In reply to Honggang LI from comment #25)
> (In reply to Don Dutile from comment #24)
>  
> > ok, so who is updating the git repo & pushing it out?
> 
> I sent two opensm patches upstream for review. I will try to update the
> opensm config for rdma-master this weekend, if the patches are accepted
> upstream.

024fe73e4481 opensm.8.in:  Emphasize that the fields of mgroup_flag must be split with "comma"
1f82c22a1237 partition-config.txt: Emphasize that the fields of mgroup_flag must be split with "comma"
04d2a8be0305 osm_prtn_config.c: parse_group_flag log suspicious group flag value

The patches have been accepted upstream. I will update the opensm configuration file for the RDMA cluster this weekend.

Comment 27 Honggang LI 2018-09-09 16:15:51 UTC
I have updated the opensm configuration file for the RDMA cluster.

Comment 28 Thomas Haller 2019-02-26 15:42:47 UTC
see also bug 1653494

Comment 29 Thomas Haller 2019-04-05 07:56:35 UTC
This bug is assigned to NetworkManager, but it contains a lot of discussion beyond the specific NetworkManager issue.
As such, this is a duplicate of bug 1653494 (for rhel-7).

Comment 33 Thomas Haller 2019-04-09 06:11:10 UTC
Note that this bug is against rhel-8.

Bug 1653494 is a duplicate of this for rhel-7.

The issue

  - was fixed upstream (after 1.17.2-dev)
  - will be fixed in rhel-7.7 (bug 1653494, NetworkManager-1.18.0-0.3.20190408git43d9187c14.el7)
  - will be fixed in rhel-8.1 (this bug)

As this rhel-8 bug is on the RPL for 7.7, this causes slight confusion (the bug cannot be added to a rhel-7 erratum in this form).

Comment 34 Thomas Haller 2019-04-09 15:36:44 UTC
Dropping this bug from RPL-7.7.
Instead, adding the duplicate bug 1653494 to RPL-7.7.

Comment 36 Vladimir Benes 2019-09-10 13:31:25 UTC
This missed inclusion in the RHEL 8.1 erratum but was already fixed in 7.7.