Bug 2362414

Summary: Fedora 42 Kernel Cannot Initialize Mellanox ConnectX Ethernet Adaptor
Product: [Fedora] Fedora
Reporter: Xinya Zhang <zxy_thf>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: 42
CC: acaringi, adscvr, airlied, bskeggs, hdegoede, hpa, josef, kernel-maint, linville, masami256, mchehab, ptalbert, steved, suraj.ghimire7
Hardware: x86_64
OS: Linux
Fixed In Version: kernel-6.14.5-300.fc42 kernel-6.14.5-200.fc41 kernel-6.14.5-100.fc40
Last Closed: 2025-05-06 01:16:13 UTC
Type: Bug
Attachments:
  mlx4_core failed to initialize the device, reporting command 0x23 timed out (flags: none)

Description Xinya Zhang 2025-04-25 23:46:42 UTC
Created attachment 2087321
mlx4_core failed to initialize the device, reporting command 0x23 timed out

1. Please describe the problem:

mlx4_core cannot initialize the network device.

2. What is the Version-Release number of the kernel:

Linux version 6.14.3-300.fc42.x86_64 (mockbuild@f38d0f39c8424fc98f6be30363aa5e29) (gcc (GCC) 15.0.1 20250329 (Red Hat 15.0.1-0), GNU ld version 2.44-3.fc42) #1 SMP PREEMPT_DYNAMIC Sun Apr 20 16:08:39 UTC >

3. Did it work previously in Fedora? If so, in what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

Yes. It works with the Fedora 41 kernel on an otherwise Fedora 42 system. Kernel version:

Linux version 6.13.12-200.fc41.x86_64 (mockbuild@9778ff19adb64779af61fc903691b7f0) (gcc (GCC) 14.2.1 20250110 (Red Hat 14.2.1-7), GNU ld version 2.43.1-5.fc41) #1 SMP PREEMPT_DYNAMIC Sun Apr 20 15:52:43 U>

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

Reboot the system, and initialization fails.
Unloading the mlx4-related modules with `modprobe -r` and then running `modprobe mlx4_en` also reproduces the same dmesg output.
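
The module reload above can be scripted roughly as follows. This is a minimal sketch, assuming only mlx4_en and mlx4_core are loaded (if mlx4_ib is present it would need to be unloaded first). `reload_mlx4` and the `MODPROBE` override are hypothetical names added for illustration; `MODPROBE=echo` turns it into a dry run. `modprobe -r mlx4_en` also removes mlx4_core once nothing else uses it, and loading mlx4_en brings mlx4_core back as a dependency, so the core driver's probe runs again.

```shell
#!/bin/sh
# Sketch: re-run the mlx4 probe without rebooting (step 4 of the report).
# Assumption: mlx4_ib is not loaded; if it is, unload it before mlx4_en.
# MODPROBE is a hypothetical override; MODPROBE=echo prints the arguments instead.

reload_mlx4() {
    mp=${MODPROBE:-modprobe}
    # Removes mlx4_en, and mlx4_core too once nothing else uses it.
    "$mp" -r mlx4_en || return 1
    # mlx4_en depends on mlx4_core, so this reloads the core driver and re-runs its probe.
    "$mp" mlx4_en
}

# Usage (as root), then inspect the new kernel messages:
#   reload_mlx4
#   journalctl -k | grep mlx4 | tail -n 20
```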

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

Yes. Rawhide still has this problem. Kernel version:

kernel: Linux version 6.15.0-0.rc3.20250424gita79be02bba5c.31.fc43.x86_64 (mockbuild@f7d7d97eb2d74025a7388a5e09a583ad) (gcc (GCC) 15.0.1 20250418 (Red Hat 15.0.1-0), GNU ld version 2.44-3.fc43) #1 SMP PREEMPT_DYN>

6. Are you running any modules that are not shipped directly with Fedora's kernel?:

No. This is a newly upgraded Fedora 42 system.

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

Relevant sections of the failing log:

mlx4_core: Mellanox ConnectX core driver v4.0-0
mlx4_core: Initializing 0000:05:00.0
mlx4_core 0000:05:00.0: DMFS high rate steer mode is: disabled performance optimized steering
mlx4_core 0000:05:00.0: 31.504 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x4 link at 0000:03:02.0 (capable of 63.008 Gb/s with 8.0 GT/s PCIe x8 link)
mlx4_core 0000:05:00.0: command 0x23 timed out (go bit not cleared)
mlx4_core 0000:05:00.0: device is going to be reset
mlx4_core 0000:05:00.0: crdump: FW doesn't support health buffer access, skipping
mlx4_core 0000:05:00.0: device was reset successfully
mlx4_core 0000:05:00.0: Failed to initialize queue pair table, aborting
mlx4_core 0000:05:00.0: probe with driver mlx4_core failed with error -5

On a working kernel, the dmesg looks like:

kernel: mlx4_core 0000:05:00.0: DMFS high rate steer mode is: disabled performance optimized steering
kernel: mlx4_core 0000:05:00.0: 31.504 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x4 link at 0000:03:02.0 (capable of 63.008 Gb/s with 8.0 GT/s PCIe x8 link)
kernel: mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0
kernel: mlx4_en 0000:05:00.0: Activating port:1
kernel: mlx4_en: 0000:05:00.0: Port 1: Using 24 TX rings
kernel: mlx4_en: 0000:05:00.0: Port 1: Using 16 RX rings
kernel: mlx4_en: 0000:05:00.0: Port 1: Initializing port
kernel: mlx4_en 0000:05:00.0: registered PHC clock
kernel: mlx4_en 0000:05:00.0: Activating port:2
kernel: mlx4_en: 0000:05:00.0: Port 2: Using 24 TX rings
kernel: mlx4_en: 0000:05:00.0: Port 2: Using 16 RX rings
kernel: mlx4_en: 0000:05:00.0: Port 2: Initializing port
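
To tell which of the two outcomes above a saved log shows, the messages quoted in this report can be matched directly. This is a sketch, assuming the log was saved with `journalctl --no-hostname -k > dmesg.txt` as step 7 suggests; `mlx4_log_status` is a hypothetical helper, and its patterns cover only the messages seen here, not other mlx4 failure modes.

```shell
#!/bin/sh
# Sketch: classify a kernel log as "failed", "working", or "unknown"
# using only the mlx4 messages quoted in this report.

mlx4_log_status() {
    log=$(cat)   # read the whole log from stdin so it can be searched twice
    if printf '%s\n' "$log" | grep -q -e 'command 0x23 timed out' \
            -e 'probe with driver mlx4_core failed'; then
        echo failed
    elif printf '%s\n' "$log" | grep -q 'mlx4_en .*Activating port'; then
        echo working
    else
        echo unknown   # neither set of messages appears in the input
    fi
}

# Usage: mlx4_log_status < dmesg.txt
```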

Comment 1 Xinya Zhang 2025-04-26 06:30:55 UTC
Update: rebuilding the kernel with CONFIG_PCI_REALLOC_ENABLE_AUTO=n fixes the problem (in practice, this means changing the corresponding line in the config file to "# CONFIG_PCI_REALLOC_ENABLE_AUTO is not set").

Does any feature depend on CONFIG_PCI_REALLOC_ENABLE_AUTO=y?
This option seems to be incompatible with the mainline Linux mlx4 driver.
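
A minimal sketch for checking and flipping the option from Comment 1, assuming the running kernel's config is installed as `/boot/config-$(uname -r)` (typical on Fedora) and that GNU `sed` is available; the helper names are hypothetical. In a kernel source tree, `scripts/config --disable PCI_REALLOC_ENABLE_AUTO` followed by `make olddefconfig` should make the same change. The option's Kconfig help also says that `pci=realloc=off` on the kernel command line overrides its automatic detection, which might avoid a rebuild, but that was not tested in this report.

```shell
#!/bin/sh
# Sketch: inspect and disable CONFIG_PCI_REALLOC_ENABLE_AUTO in a .config-style file.
# Assumption: GNU sed (for "sed -i" without a backup suffix).

# Prints "y", "not set", or "absent" for the option in the given config file.
realloc_auto_state() {
    cfg=$1
    if grep -qx 'CONFIG_PCI_REALLOC_ENABLE_AUTO=y' "$cfg"; then
        echo y
    elif grep -qx '# CONFIG_PCI_REALLOC_ENABLE_AUTO is not set' "$cfg"; then
        echo "not set"
    else
        echo absent
    fi
}

# Rewrites the "=y" line into the "is not set" form quoted in Comment 1.
# Intended for a build tree's .config, not for the installed file under /boot.
disable_realloc_auto() {
    sed -i 's/^CONFIG_PCI_REALLOC_ENABLE_AUTO=y$/# CONFIG_PCI_REALLOC_ENABLE_AUTO is not set/' "$1"
}

# Usage:
#   realloc_auto_state "/boot/config-$(uname -r)"
```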

Comment 2 Xinya Zhang 2025-04-28 16:12:43 UTC
There seems to be a related issue in the upstream Linux kernel Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=220016

Comment 3 Fedora Update System 2025-05-02 16:38:44 UTC
FEDORA-2025-309c37bb06 (kernel-6.14.5-100.fc40) has been submitted as an update to Fedora 40.
https://bodhi.fedoraproject.org/updates/FEDORA-2025-309c37bb06

Comment 4 Fedora Update System 2025-05-02 16:38:54 UTC
FEDORA-2025-8bc830e072 (kernel-6.14.5-200.fc41) has been submitted as an update to Fedora 41.
https://bodhi.fedoraproject.org/updates/FEDORA-2025-8bc830e072

Comment 5 Fedora Update System 2025-05-02 16:39:04 UTC
FEDORA-2025-c5a31cf649 (kernel-6.14.5-300.fc42) has been submitted as an update to Fedora 42.
https://bodhi.fedoraproject.org/updates/FEDORA-2025-c5a31cf649

Comment 6 Fedora Update System 2025-05-03 02:50:45 UTC
FEDORA-2025-c5a31cf649 has been pushed to the Fedora 42 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2025-c5a31cf649`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2025-c5a31cf649

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 7 Fedora Update System 2025-05-03 03:04:07 UTC
FEDORA-2025-8bc830e072 has been pushed to the Fedora 41 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2025-8bc830e072`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2025-8bc830e072

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 8 Fedora Update System 2025-05-03 03:26:26 UTC
FEDORA-2025-309c37bb06 has been pushed to the Fedora 40 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2025-309c37bb06`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2025-309c37bb06

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 9 Fedora Update System 2025-05-06 01:16:13 UTC
FEDORA-2025-c5a31cf649 (kernel-6.14.5-300.fc42) has been pushed to the Fedora 42 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 10 Fedora Update System 2025-05-06 01:37:32 UTC
FEDORA-2025-8bc830e072 (kernel-6.14.5-200.fc41) has been pushed to the Fedora 41 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 11 Fedora Update System 2025-05-06 02:25:51 UTC
FEDORA-2025-309c37bb06 (kernel-6.14.5-100.fc40) has been pushed to the Fedora 40 stable repository.
If problem still persists, please make note of it in this bug report.
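
Before noting that the problem persists, it may help to confirm that the running kernel is at least the fixed version. This is a sketch, assuming GNU `sort -V`; `at_least_fixed_version` is a hypothetical helper that compares only the upstream part (6.14.5) and ignores the Fedora release suffix. It is only meaningful for the 6.14.x builds listed under Fixed In Version: the Rawhide 6.15.0-rc3 build from step 5 had a higher version number and still showed the problem.

```shell
#!/bin/sh
# Sketch: succeed if a kernel release string is at least upstream 6.14.5.
# Assumption: only valid for the 6.14.x stable builds named in "Fixed In Version".

at_least_fixed_version() {
    ver=${1%%-*}   # "6.14.3-300.fc42.x86_64" -> "6.14.3"
    oldest=$(printf '%s\n' 6.14.5 "$ver" | sort -V | head -n 1)
    [ "$oldest" = 6.14.5 ]
}

# Usage:
#   at_least_fixed_version "$(uname -r)" && echo "6.14.5 or newer" || echo "older than 6.14.5"
```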