Bug 1422778

Summary: [mlx5] Failed to create device for nic_driver mlx5_core
Product: Red Hat Enterprise MRG
Reporter: Ma Yuying <yuma>
Component: realtime-kernel
Assignee: Daniel Bristot de Oliveira <daolivei>
Status: CLOSED ERRATA
QA Contact: Jiri Kastner <jkastner>
Severity: high
Docs Contact:
Priority: unspecified
Version: 2.5
CC: bhu, daolivei, jsvarova, lgoncalv, williams, yuma
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: 3.10.0-693.15.1
Doc Type: Bug Fix
Doc Text:
The mlx5 driver has a number of configuration options, including selective support for network protocols such as InfiniBand and Ethernet. Due to a regression in the configuration of the MRG-RT kernel, the Ethernet mode of the driver was turned off. The regression has been resolved by enabling the mlx5 Ethernet mode, so the Ethernet protocol works again.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-01-25 12:45:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments:
Description                                                       Flags
boot log with kernel 3.10.0-693.2.1.rt56.585.el6rt.x86_64         none
sosreport: RHEL-RT-7 on hp-dl388g8-19.rhts.eng.pek2.redhat.com    none

Description Ma Yuying 2017-02-16 08:21:04 UTC
Description of problem:
After installing 6.9_MRG, no network device was created for nic_driver mlx5_core.

Version-Release number of selected component (if applicable):
3.10.0-514.rt56.210.el6rt.x86_64

How reproducible:
3/3

Steps to Reproduce:
1. Install 6.9 MRG
2. lsmod | grep mlx5   --confirmed that mlx5_core is loaded
3. ip link show $nic
   ethtool -i $nic     --found no device bound to mlx5_core
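
Steps 2 and 3 can also be checked without knowing the interface names in advance. A minimal sketch, assuming the usual sysfs layout in which /sys/class/net/<iface>/device/driver is a symlink to the bound driver. The function name ifaces_for_driver and its optional directory argument are hypothetical, added only for illustration:

```shell
# List network interfaces bound to a given kernel driver.
# Usage: ifaces_for_driver DRIVER [SYSFS_NET_DIR]
# SYSFS_NET_DIR defaults to /sys/class/net; the override is an assumption
# added so the function can be exercised without real hardware.
ifaces_for_driver() {
    drv=$1
    netdir=${2:-/sys/class/net}
    for dev in "$netdir"/*; do
        # Virtual interfaces such as lo have no device/driver link.
        [ -L "$dev/device/driver" ] || continue
        bound=$(basename "$(readlink "$dev/device/driver")")
        if [ "$bound" = "$drv" ]; then
            basename "$dev"
        fi
    done
    return 0
}

# Example: ifaces_for_driver mlx5_core
```

On an affected kernel this would be expected to print nothing for mlx5_core even though lsmod lists the module.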

Actual results:
No network device is created for the mlx5_core driver.

Expected results:
The device is created successfully.

Additional info:
#### Details with MRG 514.rt56.210.el6rt:
[root@cisco-c220m3-01 ~]# ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether c0:67:af:98:03:5d brd ff:ff:ff:ff:ff:ff
3: eth5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether c0:67:af:98:03:5e brd ff:ff:ff:ff:ff:ff
4: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether f8:72:ea:a4:01:78 brd ff:ff:ff:ff:ff:ff
5: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether f8:72:ea:a4:01:79 brd ff:ff:ff:ff:ff:ff
[root@cisco-c220m3-01 ~]# uname -a
Linux cisco-c220m3-01.rhts.eng.pek2.redhat.com 3.10.0-514.rt56.210.el6rt.x86_64 #1 SMP PREEMPT RT Tue Dec 13 22:46:02 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
[root@cisco-c220m3-01 ~]# lsmod | grep mlx5
mlx5_ib               159074  0
ib_core               207935  11 mlx4_ib,ib_ipoib,rdma_ucm,ib_ucm,ib_uverbs,ib_umad,rdma_cm,ib_cm,iw_cm,usnic_verbs,mlx5_ib
mlx5_core             175590  1 mlx5_ib
[root@cisco-c220m3-01 ~]# ethtool -i eth4
driver: enic
version: 2.3.0.20
firmware-version: 2.1(2aS3)
bus-info: 0000:0b:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
[root@cisco-c220m3-01 ~]# ethtool -i eth1
driver: igb
version: 5.3.0-k
firmware-version: 1.63, 0x80000aa4, 0.309.17
bus-info: 0000:04:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

[root@cisco-c220m3-01 core]# cat /proc/net/dev
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0: 8991112   35568    0    0    0     0          0      3065  2798939    7138    0    0    0     0       0          0
  eth1:       0       0    0    0    0     0          0         0        0       0    0    0    0     0       0          0
  eth4:       0       0    0    0    0     0          0         0        0       0    0    0    0     0       0          0
  eth5:       0       0    0    0    0     0          0         0        0       0    0    0    0     0       0          0
    lo: 1740694   10186    0    0    0     0          0         0  1740694   10186    0    0    0     0       0          0
[root@cisco-c220m3-01 core]# modinfo mlx5_core
filename:       /lib/modules/3.10.0-514.rt56.210.el6rt.x86_64/kernel/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko
version:        3.0-1
license:        Dual BSD/GPL
description:    Mellanox Connect-IB, ConnectX-4 core driver
author:         Eli Cohen <eli@mellanox.com>
rhelversion:    7.3
srcversion:     0D21B16CF9CD92A5142D03B
alias:          pci:v000015B3d00001018sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001017sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001016sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001015sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001014sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001013sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001012sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001011sv*sd*bc*sc*i*
depends:        
intree:         Y
vermagic:       3.10.0-514.rt56.210.el6rt.x86_64 SMP preempt mod_unload
parm:           debug_mask:debug mask: 1 = dump cmd data, 2 = dump cmd exec time, 3 = both. Default=0 (int)
parm:           prof_sel:profile selector. Valid range 0 - 2 (int)


#### Checked that it works fine with RHEL7-rt. Details:
[root@cisco-c220m3-01 ~]# uname -a
Linux cisco-c220m3-01.rhts.eng.pek2.redhat.com 3.10.0-514.rt56.420.el7.x86_64 #1 SMP PREEMPT RT Wed Oct 19 15:51:13 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
[root@cisco-c220m3-01 ~]# lsmod | grep mlx5
mlx5_ib               157087  0
ib_core               210859  15 rdma_cm,ib_cm,iw_cm,rpcrdma,mlx5_ib,ib_srp,ib_ucm,usnic_verbs,ib_iser,ib_srpt,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib,ib_isert
mlx5_core             279942  1 mlx5_ib
ptp                    19267  2 igb,mlx5_core
[root@cisco-c220m3-01 ~]# ethtool -i enp130s0f0
driver: mlx5_core
version: 3.0-1 (January 2015)
firmware-version: 14.17.2020
expansion-rom-version:
bus-info: 0000:82:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no

Comment 1 Clark Williams 2017-07-19 20:40:25 UTC
So from what I can see, the mlx5 modules are being loaded, but device creation is not happening. Do you see any failure messages in the boot log?
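
One way to look for such messages in a saved boot log is sketched below. The function name mlx5_failures and the keyword list (fail, error, timeout) are assumptions chosen for illustration, not an exhaustive set of what the driver can print:

```shell
# Print mlx5 lines from a saved boot log that look like failures.
# Usage: mlx5_failures LOGFILE   (e.g. a file produced by `dmesg > boot.log`)
mlx5_failures() {
    # First keep only mlx5 lines, then keep those with a failure keyword.
    # `|| true` makes "no matches" a normal, non-error result.
    grep -i 'mlx5' "$1" | grep -iE 'fail|error|timeout' || true
}

# Example: mlx5_failures boot.log
```

Empty output only means the driver logged nothing that matches these keywords. It does not show that a network device was actually created.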

Comment 3 Ma Yuying 2017-08-31 06:09:52 UTC
Created attachment 1320457 [details]
boot log with kernel 3.10.0-693.2.1.rt56.585.el6rt.x86_64

Comment 4 Ma Yuying 2017-08-31 06:13:06 UTC
Hi Beth,

My apologies for the late reply. I missed this needinfo request earlier.
I have tried the new kernel, but unfortunately I still hit the
same issue.
I have also attached the boot log; please see attachment 1320457 [details]. There do not seem to be any failure messages. Please help check, thanks.

[root@hp-dl388g8-19 ~]# uname -a 
Linux hp-dl388g8-19.rhts.eng.pek2.redhat.com 3.10.0-693.2.1.rt56.585.el6rt.x86_64 #1 SMP PREEMPT RT Tue Aug 15 14:37:49 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
[root@hp-dl388g8-19 ~]# ip link show 
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth6: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 00:07:43:14:8d:50 brd ff:ff:ff:ff:ff:ff
3: eth7: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 00:07:43:14:8d:58 brd ff:ff:ff:ff:ff:ff
4: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 8c:7c:ff:2e:14:00 brd ff:ff:ff:ff:ff:ff
5: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 8c:7c:ff:2e:14:01 brd ff:ff:ff:ff:ff:ff
6: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 2c:44:fd:7f:9f:ac brd ff:ff:ff:ff:ff:ff
7: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 2c:44:fd:7f:9f:ad brd ff:ff:ff:ff:ff:ff
8: eth4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 2c:44:fd:7f:9f:ae brd ff:ff:ff:ff:ff:ff
9: eth5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 2c:44:fd:7f:9f:af brd ff:ff:ff:ff:ff:ff
[root@hp-dl388g8-19 ~]# grep mlx5 loginfo_693.log 
mlx5_core 0000:21:00.1: Shutdown was called
mlx5_core 0000:21:00.0: Shutdown was called
mlx5_core 0000:21:00.0: firmware version: 14.18.1000
mlx5_core 0000:21:00.0: Port module event: module 0, Cable plugged
mlx5_core 0000:21:00.1: firmware version: 14.18.1000
mlx5_core 0000:21:00.1: Port module event: module 1, Cable plugged
mlx5_ib: Mellanox Connect-IB Infiniband driver v2.2-1 (Feb 2014)
[root@hp-dl388g8-19 ~]# modinfo mlx5_core
filename:       /lib/modules/3.10.0-693.2.1.rt56.585.el6rt.x86_64/kernel/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko
version:        3.0-1
license:        Dual BSD/GPL
description:    Mellanox Connect-IB, ConnectX-4 core driver
author:         Eli Cohen <eli@mellanox.com>
rhelversion:    7.4
srcversion:     0C8A83E32073E3E0DBB4223
alias:          pci:v000015B3d0000101Asv*sd*bc*sc*i*
alias:          pci:v000015B3d00001019sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001018sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001017sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001016sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001015sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001014sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001013sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001012sv*sd*bc*sc*i*
alias:          pci:v000015B3d00001011sv*sd*bc*sc*i*
depends:        
intree:         Y
vermagic:       3.10.0-693.2.1.rt56.585.el6rt.x86_64 SMP preempt mod_unload 
parm:           debug_mask:debug mask: 1 = dump cmd data, 2 = dump cmd exec time, 3 = both. Default=0 (uint)
parm:           prof_sel:profile selector. Valid range 0 - 2 (uint)

[root@hp-dl388g8-19 ~]# test(){ for i in `seq 1 7`; do ethtool -i eth$i | grep driver & done; }
[root@hp-dl388g8-19 ~]# test

[root@hp-dl388g8-19 ~]# 
driver: bna
driver: tg3
driver: tg3
driver: tg3
driver: tg3
driver: cxgb4
driver: cxgb4

Comment 5 Beth Uptagrafft 2017-08-31 11:36:27 UTC
Hi Yuying,
Thank you for the additional information. We were discussing this yesterday in our engineering call. Can you please tell me what rt-firmware package you have installed? Our latest is rt-firmware-2.4-1.el6rt I believe.

Thanks for the help!
Beth

Comment 7 Ma Yuying 2017-09-01 06:15:57 UTC
(In reply to Beth Uptagrafft from comment #5)
> Hi Yuying,
> Thank you for the additional information. We were discussing this yesterday
> in our engineering call. Can you please tell me what rt-firmware package you
> have installed? Our latest is rt-firmware-2.4-1.el6rt I believe.
> 
> Thanks for the help!
> Beth

Hi Beth,

I checked the testing log and found that the installed rt-firmware package is rt-firmware-2.4-1.el6rt.x86_64. Thanks.

Some log info:
Installing : rt-firmware-2.4-1.el6rt.x86_64
Verifying  : rt-firmware-2.4-1.el6rt.x86_64 

Thanks,
Yuying.

Comment 9 Daniel Bristot de Oliveira 2017-11-28 20:19:31 UTC
Created attachment 1360021 [details]
sosreport: RHEL-RT-7 on hp-dl388g8-19.rhts.eng.pek2.redhat.com

SOS report containing all the info of the host with RHEL-7-RT installed.
It shows all NICs.

Comment 10 Daniel Bristot de Oliveira 2017-11-29 00:37:14 UTC
Hello there!

Good news from the Pizza Planet! I got the NIC to work as expected:

-------------- %< --------------------
[root@hp-dl388g8-19 ~]# ip l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth6: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 00:07:43:14:8d:50 brd ff:ff:ff:ff:ff:ff
3: eth7: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 00:07:43:14:8d:58 brd ff:ff:ff:ff:ff:ff
4: eth8: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether e4:1d:2d:c0:85:a2 brd ff:ff:ff:ff:ff:ff
5: eth9: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether e4:1d:2d:c0:85:a3 brd ff:ff:ff:ff:ff:ff
6: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 8c:7c:ff:2e:14:00 brd ff:ff:ff:ff:ff:ff
7: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 8c:7c:ff:2e:14:01 brd ff:ff:ff:ff:ff:ff
8: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 2c:44:fd:7f:9f:ac brd ff:ff:ff:ff:ff:ff
9: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 2c:44:fd:7f:9f:ad brd ff:ff:ff:ff:ff:ff
10: eth4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 2c:44:fd:7f:9f:ae brd ff:ff:ff:ff:ff:ff
11: eth5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 2c:44:fd:7f:9f:af brd ff:ff:ff:ff:ff:ff
[root@hp-dl388g8-19 ~]# for i in `seq 1 9`; do ethtool -i eth$i | grep driver ; done
driver: bna
driver: tg3
driver: tg3
driver: tg3
driver: tg3
driver: cxgb4
driver: cxgb4
driver: mlx5_core
driver: mlx5_core
-------------- >% --------------

It turns out that the problem was a kernel misconfiguration.
I synced the MLX config of the MRG-RT kernel with that of RHEL-RT, and then things started to work.

These are the config changes required to make it work:
--------------- %< --------------
--- /boot/config-3.10.0-693.5.2.rt56.592.el6rt.x86_64	2017-10-13 18:50:07.000000000 -0400
+++ .config	2017-11-28 19:15:51.506616306 -0500
@@ -1,6 +1,6 @@
 #
 # Automatically generated file; DO NOT EDIT.
-# Linux/x86_64 3.10.0-693.5.2.rt56.592.el6rt.x86_64 Kernel Configuration
+# Linux/x86 3.10.0 Kernel Configuration
 #
 CONFIG_64BIT=y
 CONFIG_X86_64=y
@@ -1245,7 +1245,7 @@
 # CONFIG_NETLINK_MMAP is not set
 # CONFIG_NETLINK_DIAG is not set
 CONFIG_NET_MPLS_GSO=m
-# CONFIG_NET_SWITCHDEV is not set
+CONFIG_NET_SWITCHDEV=y
 CONFIG_RPS=y
 CONFIG_RFS_ACCEL=y
 CONFIG_XPS=y
@@ -1344,8 +1344,8 @@
 # CONFIG_NFC is not set
 # CONFIG_LWTUNNEL is not set
 CONFIG_DST_CACHE=y
-# CONFIG_NET_DEVLINK is not set
-CONFIG_MAY_USE_DEVLINK=y
+CONFIG_NET_DEVLINK=m
+CONFIG_MAY_USE_DEVLINK=m
 CONFIG_HAVE_BPF_JIT=y
 
 #
@@ -2096,8 +2096,18 @@
 CONFIG_MLX4_CORE=m
 CONFIG_MLX4_DEBUG=y
 CONFIG_MLX5_CORE=m
-# CONFIG_MLX5_CORE_EN is not set
-# CONFIG_MLXSW_CORE is not set
+CONFIG_MLX5_CORE_EN=y
+CONFIG_MLX5_CORE_EN_DCB=y
+CONFIG_MLXSW_CORE=m
+CONFIG_MLXSW_CORE_HWMON=y
+CONFIG_MLXSW_CORE_THERMAL=y
+CONFIG_MLXSW_PCI=m
+CONFIG_MLXSW_I2C=m
+CONFIG_MLXSW_SWITCHIB=m
+CONFIG_MLXSW_SWITCHX2=m
+CONFIG_MLXSW_SPECTRUM=m
+CONFIG_MLXSW_SPECTRUM_DCB=y
+CONFIG_MLXSW_MINIMAL=m
 # CONFIG_NET_VENDOR_MICREL is not set
 CONFIG_NET_VENDOR_MYRI=y
 CONFIG_MYRI10GE=m
@@ -4818,6 +4828,7 @@
 # CONFIG_RBTREE_TEST is not set
 # CONFIG_INTERVAL_TREE_TEST is not set
 # CONFIG_TEST_RHASHTABLE is not set
+# CONFIG_TEST_PARMAN is not set
 CONFIG_PROVIDE_OHCI1394_DMA_INIT=y
 CONFIG_FIREWIRE_OHCI_REMOTE_DMA=y
 # CONFIG_BUILD_DOCSRC is not set
@@ -5145,5 +5156,6 @@
 CONFIG_SG_POOL=y
 CONFIG_ARCH_HAS_PMEM_API=y
 CONFIG_ARCH_HAS_MMIO_FLUSH=y
+CONFIG_PARMAN=m
 # CONFIG_RH_KABI_SIZE_ALIGN_CHECKS is not set
 CONFIG_RH_MRG_RT=y
------------ >% --------------
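
To check whether a given kernel is affected, the option to look at in the diff above is CONFIG_MLX5_CORE_EN, which controls the Ethernet mode of mlx5_core. A minimal sketch, assuming the kernel config is installed as /boot/config-$(uname -r) as in the diff header. The function names mlx5_config and mlx5_en_state are hypothetical:

```shell
# Print the MLX5-related options (set or unset) from a kernel config file.
# Usage: mlx5_config [CONFIG_FILE]; defaults to the running kernel's config.
mlx5_config() {
    cfg=${1:-/boot/config-$(uname -r)}
    grep -E '^(# )?CONFIG_MLX5_' "$cfg"
}

# Report whether the Ethernet mode of mlx5_core is compiled in.
# Prints "enabled" if CONFIG_MLX5_CORE_EN=y, otherwise "disabled".
mlx5_en_state() {
    cfg=${1:-/boot/config-$(uname -r)}
    if grep -q '^CONFIG_MLX5_CORE_EN=y$' "$cfg"; then
        echo enabled
    else
        echo disabled
    fi
}

# Example: mlx5_en_state /boot/config-3.10.0-693.5.2.rt56.592.el6rt.x86_64
```

With the unpatched config shown on the "-" side of the diff, mlx5_en_state would report disabled, which matches the symptom of mlx5_core loading without creating any Ethernet device.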

Comment 15 errata-xmlrpc 2018-01-25 12:45:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:0181