Created attachment 1832945 [details]
kernel log with kernel 5.15

1. Please describe the problem:

InfiniBand network interfaces no longer show up, and the cards in general don't seem to be detected correctly, after updating to kernel 5.14.* or later with Mellanox ConnectX cards. It worked flawlessly before (5.13.* and below).

# uname -r
5.14.7-200.fc34.x86_64

# dmesg (partial, complete is attached)
...
infiniband mlx4_0: Couldn't register device with driver model
...

(lspci still shows the card, but that's about the only place)

# lspci -nnv
...
01:00.0 InfiniBand [0c06]: Mellanox Technologies MT25408A0-FCC-GI ConnectX, Dual Port 20Gb/s InfiniBand / 10GigE Adapter IC with PCIe 2.0 x8 5.0GT/s In... (rev b0)
    Subsystem: Mellanox Technologies Device [15b3:0022]
    Flags: bus master, fast devsel, latency 0, IRQ 16, IOMMU group 1
    Memory at f7e00000 (64-bit, non-prefetchable) [size=1M]
    Memory at f7000000 (64-bit, prefetchable) [size=8M]
    Capabilities: [40] Power Management version 3
    Capabilities: [48] Vital Product Data
    Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
    Capabilities: [60] Express Endpoint, MSI 00
    Capabilities: [100] Alternative Routing-ID Interpretation (ARI)
    Capabilities: [148] Device Serial Number 00-02-c9-03-00-55-1c-c6
    Kernel driver in use: mlx4_core
    Kernel modules: mlx4_core
...

# ibv_devinfo
No IB devices found

Expected behaviour (as with 5.13.19-200.fc34.x86_64):

# uname -r
5.13.19-200.fc34.x86_64

# ibv_devinfo
hca_id: ibp1s0
    transport:          InfiniBand (0)
    fw_ver:             2.9.1000
    node_guid:          0002:c903:0055:1cc6
    sys_image_guid:     0002:c903:0055:1cc9
    vendor_id:          0x02c9
    vendor_part_id:     26418
    hw_ver:             0xB0
    board_id:           MT_0F90120008
    phys_port_cnt:      2
        port: 1
            state:          PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:     4096 (5)
            sm_lid:         1
            port_lid:       1
            port_lmc:       0x00
            link_layer:     InfiniBand

        port: 2
            state:          PORT_DOWN (1)
            max_mtu:        4096 (5)
            active_mtu:     4096 (5)
            sm_lid:         0
            port_lid:       0
            port_lmc:       0x00
            link_layer:     InfiniBand

# ip link
...
4: ibp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc fq_codel state UP mode DEFAULT group default qlen 256
    link/infiniband 80:00:02:08:fe:80:00:00:00:00:00:00:58:49:56:0e:62:db:0d:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
5: ibp1s0d1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc fq_codel state DOWN mode DEFAULT group default qlen 256
    link/infiniband 80:00:02:09:fe:80:00:00:00:00:00:00:58:49:56:0e:62:db:0d:02 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
...

2. What is the Version-Release number of the kernel:

5.14.7-200.fc34.x86_64 and onwards

tested with:
- 5.14.7-200.fc34.x86_64
- 5.14.10-200.fc34.x86_64
- 5.14.11-200.fc34.x86_64
- 5.15.0-0.rc5.39.fc36.x86_64

3. Did it work previously in Fedora? If so, what kernel version did the issue *first* appear? Old kernels are available for download at https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

It worked previously (on kernel-5.13.19-200.fc34.x86_64 and below). It first appeared on 5.14.7-200.fc34.x86_64 (at least that is the earliest version I tested that was available for download).

4. Can you reproduce this issue? If so, please provide the steps to reproduce the issue below:

- have a working InfiniBand setup
- update to kernel 5.14 or later
=> `ip link`/`ibstat`/`ibv_devinfo` do not show the card

5. Does this problem occur with the latest Rawhide kernel? To install the Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by ``sudo dnf update --enablerepo=rawhide kernel``:

Yes (tested with 5.15.0-0.rc5.39.fc36.x86_64)

6. Are you running any modules that are not shipped directly with Fedora's kernel?:

ZFS dkms, but I did not install it while testing with 5.15.0-0.rc5.39.fc36.x86_64 or 5.14.7-200.fc34.x86_64

7. Please attach the kernel logs. You can get the complete kernel log for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the issue occurred on a previous boot, use the journalctl ``-b`` flag.
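For anyone who wants to script the check from step 4 across several kernels, here is a minimal sketch that parses the text output of `ibv_devinfo` as it appears in this report. The function name `parse_ibv_devinfo` and the `SAMPLE` string are made up for illustration; the parser assumes only the `hca_id:`, `port:` and `state:` lines shown above and may need adjusting for other output formats.

```python
# Trimmed copy of the working output from the report (kernel 5.13.19).
SAMPLE = """\
hca_id: ibp1s0
    transport:          InfiniBand (0)
    phys_port_cnt:      2
        port: 1
            state:          PORT_ACTIVE (4)
            port_lid:       1
        port: 2
            state:          PORT_DOWN (1)
            port_lid:       0
"""


def parse_ibv_devinfo(text):
    """Return {hca_id: {port_number: state}} from ibv_devinfo text output.

    An empty dict means no device was listed, e.g. "No IB devices found".
    """
    devices = {}
    current_dev = None
    current_port = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("hca_id:"):
            current_dev = line.split(":", 1)[1].strip()
            devices[current_dev] = {}
            current_port = None
        elif line.startswith("port:") and current_dev is not None:
            # "port: 1" starts a new port section; "port_lid:" does not match.
            current_port = int(line.split(":", 1)[1].strip())
        elif line.startswith("state:") and current_port is not None:
            # "PORT_ACTIVE (4)" -> "PORT_ACTIVE"
            state = line.split(":", 1)[1].split("(")[0].strip()
            devices[current_dev][current_port] = state
    return devices


if __name__ == "__main__":
    print(parse_ibv_devinfo(SAMPLE))
    print(parse_ibv_devinfo("No IB devices found"))
```

On an affected kernel the second call is the interesting one: the parser returns an empty dict, which matches the "No IB devices found" symptom described above.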
This is not an isolated problem. I am also suffering from this issue. Kernel 5.13 versions work fine, but any current 5.14 kernel causes the interfaces to simply disappear. Not all the modules seem to get loaded on boot, but loading them manually doesn't help either. I don't remember seeing any obvious warnings in dmesg either (I don't have the logs on hand right now, I'll get them after work today). It's as if the kernel partially initializes the modules, then stops, and never finishes getting the device ready.
Created attachment 1839498 [details]
Log testing kernel 5.14 (debug)

I had hoped running the debug kernel might, by chance, give something more revealing, but no. The only difference between the module initialization in 5.14 vs 5.13 kernels is that, after the "mlx4_ib_add" messages, 5.13 continues and registers the device, while 5.14 throws a single error:

kernel: infiniband mlx4_0: Couldn't register device with driver model

No big warnings, just that. That's all we get. I'm not a kernel developer, so my knowledge here is minimal, but some tips on how to possibly get more out of the kernel here might help.
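Since the failure shows up as a single log line, a small filter makes it easier to compare logs from different kernels. The following is a rough sketch only: the helper names `mlx4_lines` and `registration_failed` are invented here, and the only message text it relies on is the error line quoted above.

```python
# The one error line reported in this bug; everything else is left alone.
FAILURE = "Couldn't register device with driver model"


def mlx4_lines(log_text):
    """Return all log lines that mention the mlx4 driver."""
    return [line for line in log_text.splitlines() if "mlx4" in line]


def registration_failed(log_text):
    """True if any mlx4 line contains the registration error from this bug."""
    return any(FAILURE in line for line in mlx4_lines(log_text))


if __name__ == "__main__":
    example = "kernel: infiniband mlx4_0: Couldn't register device with driver model"
    print(registration_failed(example))
```

To use it on a real boot, save the log with ``journalctl --no-hostname -k > dmesg.txt`` and pass the file contents (for example `open("dmesg.txt").read()`) to `registration_failed`.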
Can you try the fix here? It works for me with ConnectX:

03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)

https://lore.kernel.org/linux-rdma/20211115101519.27210-1-jinpu.wang@ionos.com/T/#u

You need to apply the patch and rebuild the kernel with it.
(In reply to redhat from comment #3)
> Can you try the fix here, it works for me with ConnectX
> 03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe
> 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
>
> https://lore.kernel.org/linux-rdma/20211115101519.27210-1-jinpu.wang@ionos.com/T/#u
>
> you need to apply the patch, and rebuild the kernel with it.

The patch seems to work as expected on my main machine (Fedora 35). Interfaces show up and work as they are supposed to.

My other machine (Fedora 34) kernel panics on boot with "attempted to kill init", but I guess that is probably something unrelated; in the unlikely event that it turns out to be related I will update you.
(In reply to Liss Heidrich from comment #4)
> (In reply to redhat from comment #3)
> > Can you try the fix here, it works for me with ConnectX
> > 03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe
> > 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
> >
> > https://lore.kernel.org/linux-rdma/20211115101519.27210-1-jinpu.wang@ionos.com/T/#u
> >
> > you need to apply the patch, and rebuild the kernel with it.
>
> The patch seems to work as expected on my main machine (Fedora 35).
> Interfaces show up and work as they are supposed to.

Thanks for testing. Good to know it works.

> My other machine (Fedora 34) kernel panics on boot with "attempted to kill
> init", but I guess that is probably something unrelated; in the unlikely
> event that it turns out to be related I will update you.

The error seems unrelated to IB. Please do let me know otherwise.
(In reply to redhat from comment #5)
> (In reply to Liss Heidrich from comment #4)
> > (In reply to redhat from comment #3)
> > > Can you try the fix here, it works for me with ConnectX
> > > 03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe
> > > 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
> > >
> > > https://lore.kernel.org/linux-rdma/20211115101519.27210-1-jinpu.wang@ionos.com/T/#u
> > >
> > > you need to apply the patch, and rebuild the kernel with it.
> >
> > The patch seems to work as expected on my main machine (Fedora 35).
> > Interfaces show up and work as they are supposed to.
> Thanks for testing. good to know it works.
> > My other machine (Fedora 34) kernel panics on boot with "attempted to kill
> > init", but I guess that is probably something unrelated; in the unlikely
> > event that it turns out to be related I will update you.
> the error seems unrelated to IB. Please do let me know otherwise.

I can now confirm that it was indeed unrelated. I tried a different kernel version and now it works flawlessly. Thanks for the patch.
Thank you, Liss. FTR, the patch has been merged upstream in 5.16-rc6 and should soon be in stable 5.15.5.
This message is a reminder that Fedora Linux 34 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora Linux 34 on 2022-06-07. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a 'version' of '34'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, change the 'version' to a later Fedora Linux version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora Linux 34 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora Linux, you are encouraged to change the 'version' to a later version prior to this bug being closed.
Fedora Linux 34 entered end-of-life (EOL) status on 2022-06-07. Fedora Linux 34 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. Thank you for reporting this bug and we are sorry it could not be fixed.