Bug 2014094 - Missing infiniband network interfaces after update to 5.14
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 34
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-10-14 13:18 UTC by Liss Heidrich
Modified: 2022-06-07 22:48 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2022-06-07 22:48:52 UTC
Type: Bug
Embargoed:


Attachments
kernel log with kernel 5.15 (93.61 KB, text/plain)
2021-10-14 13:18 UTC, Liss Heidrich
Log testing kernel 5.14 (debug) (260.12 KB, text/plain)
2021-11-03 06:33 UTC, T.I. "Luna" Ericson

Description Liss Heidrich 2021-10-14 13:18:37 UTC
Created attachment 1832945
kernel log with kernel 5.15

1. Please describe the problem:

InfiniBand network interfaces no longer show up, and the cards do not appear to be detected correctly, after updating to kernel 5.14.* or later with Mellanox ConnectX cards. Everything worked flawlessly before (5.13.* and below).

# uname -r
5.14.7-200.fc34.x86_64


# dmesg (partial, complete is attached)
...
infiniband mlx4_0: Couldn't register device with driver model
...


(lspci still shows card, but that's about the only place)
# lspci -nnv
...
01:00.0 InfiniBand [0c06]: Mellanox Technologies MT25408A0-FCC-GI ConnectX, Dual Port 20Gb/s InfiniBand / 10GigE Adapter IC with PCIe 2.0 x8 5.0GT/s In... (rev b0)
        Subsystem: Mellanox Technologies Device [15b3:0022]
        Flags: bus master, fast devsel, latency 0, IRQ 16, IOMMU group 1
        Memory at f7e00000 (64-bit, non-prefetchable) [size=1M]
        Memory at f7000000 (64-bit, prefetchable) [size=8M]
        Capabilities: [40] Power Management version 3
        Capabilities: [48] Vital Product Data
        Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
        Capabilities: [60] Express Endpoint, MSI 00
        Capabilities: [100] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [148] Device Serial Number 00-02-c9-03-00-55-1c-c6
        Kernel driver in use: mlx4_core
        Kernel modules: mlx4_core
...


# ibv_devinfo
No IB devices found



Expected behaviour (as with 5.13.19-200.fc34.x86_64):

# uname -r
5.13.19-200.fc34.x86_64

# ibv_devinfo
hca_id: ibp1s0
        transport:                      InfiniBand (0)
        fw_ver:                         2.9.1000
        node_guid:                      0002:c903:0055:1cc6
        sys_image_guid:                 0002:c903:0055:1cc9
        vendor_id:                      0x02c9
        vendor_part_id:                 26418
        hw_ver:                         0xB0
        board_id:                       MT_0F90120008
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1
                        port_lmc:               0x00
                        link_layer:             InfiniBand

                port:   2
                        state:                  PORT_DOWN (1)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             InfiniBand


# ip link
...
4: ibp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc fq_codel state UP mode DEFAULT group default qlen 256
    link/infiniband 80:00:02:08:fe:80:00:00:00:00:00:00:58:49:56:0e:62:db:0d:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
5: ibp1s0d1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc fq_codel state DOWN mode DEFAULT group default qlen 256
    link/infiniband 80:00:02:09:fe:80:00:00:00:00:00:00:58:49:56:0e:62:db:0d:02 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
...


2. What is the Version-Release number of the kernel:

5.14.7-200.fc34.x86_64 and onwards

tested with:
- 5.14.7-200.fc34.x86_64
- 5.14.10-200.fc34.x86_64
- 5.14.11-200.fc34.x86_64
- 5.15.0-0.rc5.39.fc36.x86_64
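
To check whether a machine is running a kernel from the range listed above, the base version from `uname -r` can be compared against 5.14. This is only a minimal sketch and not part of the original report: the helper names `kernel_base` and `version_ge` are made up for illustration, and it assumes a `sort` that supports `-V` (as GNU coreutils does).

```shell
#!/bin/sh
# Sketch: compare a kernel release string against the first version
# reported as affected (5.14). Helper names are hypothetical.

# Strip the distribution suffix: "5.14.7-200.fc34.x86_64" -> "5.14.7".
kernel_base() {
    printf '%s\n' "${1%%-*}"
}

# Succeed when version $1 is greater than or equal to version $2.
# Relies on version-aware sorting (sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Example use (commented out so that sourcing this file has no side effects):
#   if version_ge "$(kernel_base "$(uname -r)")" 5.14; then
#       echo "kernel is 5.14 or later; the mlx4 registration issue may apply"
#   fi
```

The check only says that the kernel is 5.14 or newer. It cannot tell whether a particular build already carries a fix.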


3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

It worked previously (on kernel-5.13.19-200.fc34.x86_64 and below). The issue first appeared on 5.14.7-200.fc34.x86_64, which is the earliest affected version I tested that was available for download.


4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

- have working infiniband setup
- update to kernel 5.14 or later

=> `ip link`/`ibstat`/`ibv_devinfo` do not show card
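
The symptom described here (the PCI device is bound to `mlx4_core`, but no InfiniBand device is registered) can also be checked directly in sysfs. The following is a rough sketch under stated assumptions: it uses the usual sysfs locations `/sys/bus/pci/drivers/mlx4_core` and `/sys/class/infiniband`, the function names are invented for illustration, and it counts every RDMA device on the system, so a second working adapter from another vendor would hide the problem.

```shell
#!/bin/sh
# Sketch: detect "driver bound, but no IB device registered".
# Function names are hypothetical; directories can be overridden for testing.

# Count PCI devices bound to a driver. Bound devices appear as entries
# named like 0000:01:00.0; other files (bind, unbind, ...) are skipped.
count_bound_pci() {
    dir=$1
    n=0
    for entry in "$dir"/*:*:*.*; do
        [ -e "$entry" ] && n=$((n + 1))
    done
    echo "$n"
}

# Count registered RDMA devices (for example mlx4_0 or ibp1s0).
count_ib_devices() {
    dir=$1
    n=0
    for entry in "$dir"/*; do
        [ -e "$entry" ] && n=$((n + 1))
    done
    echo "$n"
}

# Succeed when the driver has at least one device but nothing is registered.
ib_registration_missing() {
    pci_dir=${1:-/sys/bus/pci/drivers/mlx4_core}
    ib_dir=${2:-/sys/class/infiniband}
    [ "$(count_bound_pci "$pci_dir")" -gt 0 ] &&
        [ "$(count_ib_devices "$ib_dir")" -eq 0 ]
}

# Example use (commented out):
#   ib_registration_missing && echo "mlx4_core is bound but no IB device exists"
```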


5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

Yes (tested with 5.15.0-0.rc5.39.fc36.x86_64)


6. Are you running any modules that are not shipped directly with Fedora's kernel?:

ZFS dkms, but I did not install it while testing with 5.15.0-0.rc5.39.fc36.x86_64 or 5.14.7-200.fc34.x86_64


7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.
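
Once the log has been saved as described, it can be searched for the registration failure quoted in this report. This is a small sketch, assuming the message wording matches the `dmesg` excerpt above exactly; the function name `ib_register_failures` is made up for illustration.

```shell
#!/bin/sh
# Sketch: print the name of every device whose registration failed,
# based on the message text quoted in this report.
# Reads the files given as arguments, or standard input if none are given.
ib_register_failures() {
    sed -n "s/.*infiniband \([^:]*\): Couldn't register device with driver model.*/\1/p" "$@"
}

# Example use (commented out):
#   journalctl --no-hostname -k > dmesg.txt
#   ib_register_failures dmesg.txt
```

An empty result means only that this exact message is absent from the log. It does not prove that the devices registered correctly.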

Comment 1 T.I. "Luna" Ericson 2021-11-02 21:31:10 UTC
This is not an isolated problem. I am also affected by this issue.

Kernel 5.13 versions work fine, but any current 5.14 kernel causes the interfaces to simply disappear.

Not all the modules seem to get loaded on boot, but loading them manually doesn't help either.
I don't remember seeing any obvious warnings in dmesg either (I don't have the logs on hand right now, I'll get them after work today). It's like it partially initializes the kernel modules, then stops, and never finishes getting the device ready.

Comment 2 T.I. "Luna" Ericson 2021-11-03 06:33:39 UTC
Created attachment 1839498
Log testing kernel 5.14 (debug)

I had hoped running the debug kernel might, by chance, give something more revealing, but no.

The only difference between the module initialization in 5.14 vs 5.13 kernels is that, after the "mlx4_ib_add" messages, 5.13 continues and registers the device, while 5.14 throws a single error:

kernel: infiniband mlx4_0: Couldn't register device with driver model

No big warnings, just that.
That's all we get.

I'm not a kernel developer, so my knowledge here is minimal, but some tips on how to get more information out of the kernel here might help.

Comment 3 redhat 2021-11-16 10:29:23 UTC
Can you try the fix here, it works for me with ConnectX 
03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe
2.0 5GT/s - IB QDR / 10GigE] (rev b0)


https://lore.kernel.org/linux-rdma/20211115101519.27210-1-jinpu.wang@ionos.com/T/#u

you need to apply the patch, and rebuild the kernel with it.
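
One possible way to fetch the patch for applying is to turn the thread link above into the raw-message link that lore.kernel.org (public-inbox) serves under `/raw`. This is only a sketch: the helper name `lore_raw_url` is invented, it assumes the linked message itself carries the patch, and the actual kernel rebuild depends on the local packaging workflow and is not shown.

```shell
#!/bin/sh
# Sketch: convert a lore thread link (".../<message-id>/T/#u") into the
# raw-message link (".../<message-id>/raw"). Helper name is hypothetical.
lore_raw_url() {
    url=${1%%#*}    # drop the "#u" fragment
    url=${url%/}    # drop a trailing slash, if any
    url=${url%/T}   # drop the thread-view suffix
    printf '%s/raw\n' "$url"
}

# Example use inside a kernel git checkout (commented out, needs network):
#   thread='https://lore.kernel.org/linux-rdma/20211115101519.27210-1-jinpu.wang@ionos.com/T/#u'
#   curl -fsSL "$(lore_raw_url "$thread")" | git am
```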

Comment 4 Liss Heidrich 2021-11-18 22:05:41 UTC
(In reply to redhat from comment #3)
> Can you try the fix here, it works for me with ConnectX 
> 03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe
> 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
> 
> 
> https://lore.kernel.org/linux-rdma/20211115101519.27210-1-jinpu.wang@ionos.com/T/#u
> 
> you need to apply the patch, and rebuild the kernel with it.

The patch seems to work as expected on my main machine (Fedora 35). Interfaces show up and work as they are supposed to.
My other machine (Fedora 34) kernel panics on boot with "attempted to kill init", but I guess that is probably something unrelated; in the unlikely event that it turns out to be related I will update you.

Comment 5 redhat 2021-11-19 08:34:37 UTC
(In reply to Liss Heidrich from comment #4)
> (In reply to redhat from comment #3)
> > Can you try the fix here, it works for me with ConnectX 
> > 03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe
> > 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
> > 
> > 
> > https://lore.kernel.org/linux-rdma/20211115101519.27210-1-jinpu.wang@ionos.com/T/#u
> > 
> > you need to apply the patch, and rebuild the kernel with it.
> 
> The patch seems to work as expected on my main machine (Fedora 35).
> Interfaces show up and work as they are supposed to.
Thanks for testing. Good to know it works.
> My other machine (Fedora 34) kernel panics on boot with "attempted to kill
> init", but I guess that is probably something unrelated; in the unlikely
> event that it turns out to be related I will update you.
The error seems unrelated to IB. Please do let me know otherwise.

Comment 6 Liss Heidrich 2021-11-19 21:11:43 UTC
(In reply to redhat from comment #5)
> the error seems unrelated to IB. Please do let me know otherwise.

I can now confirm that it was indeed unrelated: I tried a different kernel version and now it works flawlessly. Thanks for the patch.

Comment 7 redhat 2021-11-22 07:31:38 UTC
Thank you Liss. 

FTR, the patch has been merged upstream in 5.16-rc6 and should soon be in stable 5.15.5.

Comment 8 Ben Cotton 2022-05-12 15:55:20 UTC
This message is a reminder that Fedora Linux 34 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 34 on 2022-06-07.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '34'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 34 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 9 Ben Cotton 2022-06-07 22:48:52 UTC
Fedora Linux 34 entered end-of-life (EOL) status on 2022-06-07.

Fedora Linux 34 is no longer maintained, which means that it
will not receive any further security or bug fix updates. As a result we
are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release.

Thank you for reporting this bug and we are sorry it could not be fixed.

