Created attachment 1754958: photo of crash #1

1. Please describe the problem:
Attempting to boot either kernel-5.10.8-100.fc32.x86_64 or kernel-5.8.11-200.fc32.x86_64 results in a kernel panic naming the sfc module.

2. What is the Version-Release number of the kernel:
kernel-5.8.11-200.fc32.x86_64

3. Did it work previously in Fedora? If so, what kernel version did the issue *first* appear? Old kernels are available for download at https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :
Last known to work: kernel-5.9.16-100.fc32.x86_64

4. Can you reproduce this issue? If so, please provide the steps to reproduce the issue below:
Yes. Installing either of the 2 newer kernels and attempting to boot with a Solarflare SFN6122F card installed results in various kernel panics.

5. Does this problem occur with the latest Rawhide kernel? To install the Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by ``sudo dnf update --enablerepo=rawhide kernel``:

6. Are you running any modules that are not shipped directly with Fedora's kernel?:
vboxdrv from rpmfusion

7. Please attach the kernel logs. You can get the complete kernel log for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the issue occurred on a previous boot, use the journalctl ``-b`` flag.
Unable to, since the system never boots.
Created attachment 1754959: photo of 2nd crash on next boot
Oops, just spotted a copy'n'paste error there: the other non-working kernel is NOT kernel-5.8.11-200.fc32.x86_64, which is older and does work. It is kernel-5.10.11-100.fc32.x86_64, which is the current F32 kernel.
Still not working with kernel 5.11.12-100.fc32.x86_64.

In case it had anything to do with it, prior to updating to the 5.11 kernel I completely removed all traces of VirtualBox from my system and also uninstalled dkms.

Unfortunately, since this is a complete kernel crash, my ability to provide more info is limited. This is an AMD 3700x processor on an Asus Prime X570-Pro motherboard with 2 x 32GB sticks of ECC RAM, 3 separate nvme drives, 2 SATA SSDs and 2 4TB HGST spinning rust devices in mdadm RAID 1.

kernel-devel-5.11.12-100.fc32.x86_64
kernel-5.11.12-100.fc32.x86_64
kernel-core-5.11.12-100.fc32.x86_64
kernel-modules-extra-5.11.12-100.fc32.x86_64
kernel-modules-5.11.12-100.fc32.x86_64
Created attachment 1771984: latest 5.11.12 kernel crash console capture
[trevor@trevor4 SRPMS]$ cat /proc/sys/kernel/tainted
0

That is from the latest working kernel, 5.9.16-100.

This also forces a complete resync of my RAID 1 array every time it crashes, so it's more than a bit annoying and not exactly easy to try new kernels.
Just in case anyone is looking at this and keeping silent about it, I also sent to netdev@vger and got the following response:

On 15/04/2021 10:03, Trevor Hemsley wrote:
> Hi,
>
> I run Fedora 32 and since kernels in the 5.10 series I have been unable to boot without getting a panic in the sfc module. I tried on 5.11.12 tonight and the crash still occurs. I have tried reporting this via Fedora channels but the silence has been deafening

Seems Red Hat couldn't even be bothered to forward it to us :sigh:

> and I suspect this is an upstream issue anyway.

You could try building an upstream kernel and driver, and attempting to reproduce the issue there. That would remove some of the unknowns.

> BUG: kernel NULL pointer dereference, address: 0000000000000104
> RIP: 0010:efx_farch_ev_process+0x3d2/0x910 [sfc]
> Code: c0 02 39 f0 76 34 c1 fe 02 41 03 b6 28 07 00 00 83 e1 03 49 8b 84 f6 d0 00 00 00 48 8b 94 c8 80 09 00 00 b0 01 00 00 00 31 c9 <f0> 8f b1 8a 04 81 00 00 05 c0 0f 05 37 03 00 00 48 8d 74 24 20 4c

Hmm, I think this is actually
    <f0> 0f b1 8a 04 01 00 00 85...
which decodes as
    lock cmpxchg %ecx,0x104(%rdx)
With other transcription errors fixed, the key sequence appears to be
    mov $0x1,%eax
    xor %ecx,%ecx
    lock cmpxchg %ecx,0x104(%rdx)
So we're saying "if (rdx[0x104] == 1) rdx[0x104] = 0", only atomically.

I'd *guess* this is the atomic_cmpxchg() in efx_farch_handle_tx_flush_done() (though it'd be nice to have your sfc.ko, with debugging symbols, to check for certain).

Which in turn tells us that tx_queue is NULL; this is suspicious because the relevant commits
    a81dcd85a7c1 ("sfc: assign TXQs without gaps")
    12804793b17c ("sfc: decouple TXQ type from label")
happened at about the right time to cause this regression.

So now I have to go off and figure out exactly what the semantics of this TX flush done event's 'subdata' field are... looks like it probably corresponds to tx_queue->queue from efx_farch_flush_tx_queue().
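
For readers who do not read x86 assembly, the decoded sequence in the reply above can be sketched in user-space C. This is only an illustration: it uses C11 atomic_compare_exchange_strong() as a stand-in for the kernel's atomic_cmpxchg(), and the structure demo_tx_queue, its padding and the field name flush_flag are hypothetical, arranged only so that the field sits at offset 0x104. With a NULL base pointer, touching that field is an access to address 0x104, which matches the address in the panic.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's TX queue structure. The padding
 * exists only so that flush_flag lands at offset 0x104, like the memory
 * operand of the faulting cmpxchg; the real layout is not reproduced. */
struct demo_tx_queue {
    char pad[0x104];
    atomic_int flush_flag;
};

/* "if (flag == 1) flag = 0", performed atomically.
 * Returns true when the flag was 1 and has now been cleared.
 * The NULL check is an addition for this sketch; without it a NULL txq
 * would be dereferenced at address 0x104. */
static bool demo_clear_flush_flag(struct demo_tx_queue *txq)
{
    int expected = 1;

    if (txq == NULL)
        return false;
    return atomic_compare_exchange_strong(&txq->flush_flag, &expected, 0);
}
```

Calling demo_clear_flush_flag() on a queue whose flag is 1 clears it and returns true; a second call returns false because the compare no longer matches.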
Unfortunately, there is no simple lookup to convert from qid to tx_queue, because we just allocate queues as-needed in efx_set_channels() and don't store the reverse mapping (everything else works by label rather than queue, so doesn't need it).

I think the right fix is probably just to have efx_farch_handle_tx_flush_done() (and presumably also efx_farch_handle_rx_flush_done()) iterate over all queues (or at least all queues on the channel that received the event; but possibly the events might always be delivered to channel 0 rather than necessarily the channel that owns the queue) and perform the handling on any queue whose qid matches.

I will followup with a patch, hopefully some time next week if I can find a 6122F to test with.

> Just prior to the crash I get a pair of messages that don't look particularly right but I get these on 5.9.16 too and that survives.
>
> [ 9.027961] sfc 0000:0b:00.0 enp11s0f0np0: MC command 0x2a inlen 16 failed rc=-22 (raw=0) arg=0
> [ 9.029895] sfc 0000:0b:00.1 enp11s0f1np1: MC command 0x2a inlen 16 failed rc=-22 (raw=0) arg=0

0x2a is MC_CMD_SET_LINK, which gets called in a variety of situations like MTU change, link advertising change (e.g. ethtool -s), and SFP+ module hotplug. An -EINVAL failure typically means we've asked for some combination of link modes that is unsupported or nonsensical; to investigate this further you could try with the mcdi_logging_default=1 module parameter, which will log all MC commands and responses at KERN_INFO; these can then be decoded by reference to mcdi_pcol.h.

In any case this seems to be unrelated to the above issue.

-ed
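
The approach described in the reply above (scan the queues and act on whichever one has a matching qid, instead of indexing by an assumed position) can be sketched roughly as follows. This is not the patch that was posted: every type, field and function name here (demo_nic, demo_channel, demo_txq, demo_find_txq, demo_handle_tx_flush_done) is hypothetical, and the fixed-size arrays only stand in for however the driver really stores its channels and queues.

```c
#include <stddef.h>

#define DEMO_MAX_CHANNELS    4
#define DEMO_TXQ_PER_CHANNEL 2

/* Hypothetical, simplified queue: 'queue' plays the role of the hardware
 * queue id (qid) carried in the flush-done event. */
struct demo_txq {
    int in_use;
    unsigned int queue;
    int flushed;
};

struct demo_channel {
    struct demo_txq txq[DEMO_TXQ_PER_CHANNEL];
};

struct demo_nic {
    unsigned int n_channels;
    struct demo_channel channel[DEMO_MAX_CHANNELS];
};

/* Walk every queue on every channel and return the one whose qid matches,
 * or NULL when no allocated queue has that qid. */
static struct demo_txq *demo_find_txq(struct demo_nic *nic, unsigned int qid)
{
    unsigned int c, q;

    for (c = 0; c < nic->n_channels && c < DEMO_MAX_CHANNELS; c++) {
        for (q = 0; q < DEMO_TXQ_PER_CHANNEL; q++) {
            struct demo_txq *txq = &nic->channel[c].txq[q];

            if (txq->in_use && txq->queue == qid)
                return txq;
        }
    }
    return NULL;
}

/* Handle a flush-done event for 'qid'. Returns 0 on success and -1 when
 * the qid does not correspond to any queue, rather than dereferencing NULL. */
static int demo_handle_tx_flush_done(struct demo_nic *nic, unsigned int qid)
{
    struct demo_txq *txq = demo_find_txq(nic, qid);

    if (txq == NULL)
        return -1;
    txq->flushed = 1;
    return 0;
}
```

A linear scan costs more than a direct lookup, but flush-done events are rare, so trading a reverse-mapping table for a short loop seems a reasonable choice in a sketch like this.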
The patches in https://lore.kernel.org/netdev/6b97b589-91fe-d71e-a7d0-5662a4f7a91c@gmail.com/T/#t apply cleanly to the Fedora 32 kernel-5.11.14-00.fc32.x86_64 package, rebuild with no errors, and fix the crash.
Thanks, looks like they were applied to netdev, but they have not gotten any feedback yet. I will give it a day or 2 to see how things go, and then pull them back into 5.11.x if all is well. They will likely make 5.11.17, though it could be 5.11.18 depending on timing for that release.
It was me that reported it on netdev, and I fed back to the maintainer via email, as none of the lists on vger accept mail from either my private email address (my entire ISP appears to be blocked) or from O365 (HTML attachments even when "send in plain text" is selected in thunderturd!), so I had to bodge it to get email through to the list at all.
These patches were included in the pull request for Linus for 5.13 last night. As a result, I have applied them to the Fedora trees for 5.11 and 5.12. They should be included in the 5.11.18 kernel build when it happens, and 5.12.1. For 5.13 (rawhide), it will depend on when Linus does the pull, but I would expect by end of week, and definitely by rc1. Please report back and close this bug if that fixes your issues.
This message is a reminder that Fedora 32 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 32 on 2021-05-25. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '32'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 32 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 32 changed to end-of-life (EOL) status on 2021-05-25. Fedora 32 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.