Bug 1924982 - kernels later than 5.9.16-100.fc32 crash in the sfc module
Summary: kernels later than 5.9.16-100.fc32 crash in the sfc module
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 32
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-02-04 04:45 UTC by Trevor Hemsley
Modified: 2021-05-25 17:31 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-25 17:31:55 UTC
Type: Bug
Embargoed:


Attachments
photo of crash #1 (3.25 MB, image/jpeg)
2021-02-04 04:45 UTC, Trevor Hemsley
Photo of 2nd crash on next boot (2.55 MB, image/jpeg)
2021-02-04 04:47 UTC, Trevor Hemsley
latest 5.11.12 kernel crash console capture (2.78 MB, image/jpeg)
2021-04-14 20:54 UTC, Trevor Hemsley

Description Trevor Hemsley 2021-02-04 04:45:45 UTC
Created attachment 1754958 [details]
photo of crash #1

1. Please describe the problem:
Attempting to boot both kernel-5.10.8-100.fc32.x86_64 and kernel-5.8.11-200.fc32.x86_64 results in a kernel panic naming the sfc module

2. What is the Version-Release number of the kernel:
kernel-5.8.11-200.fc32.x86_64

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :
Last known to work kernel-5.9.16-100.fc32.x86_64

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:
Yes. Installing either of the 2 newer kernels and attempting to boot with a Solarflare
SFN6122F card installed results in various kernel panics.

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:


6. Are you running any modules that are not shipped directly with Fedora's kernel?:
vboxdrv from rpmfusion

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.
Unable to since the system never boots.

Comment 1 Trevor Hemsley 2021-02-04 04:47:19 UTC
Created attachment 1754959 [details]
Photo of 2nd crash on next boot

Comment 2 Trevor Hemsley 2021-02-04 15:39:33 UTC
Oops, just spotted a copy'n'paste error there: the other non-working kernel is NOT kernel-5.8.11-200.fc32.x86_64, which is older and does work; it is kernel-5.10.11-100.fc32.x86_64, which is the current F32 kernel.

Comment 3 Trevor Hemsley 2021-04-14 20:50:31 UTC
Still not working with kernel 5.11.12-100.fc32.x86_64

In case it had anything to do with it, prior to updating to the 5.11 kernel I completely removed all traces of VirtualBox from my system and also uninstalled dkms. Unfortunately since this is a complete kernel crash my ability to provide more info is limited.

This is an AMD 3700x processor on an Asus Prime X570-Pro motherboard with 2 x 32GB sticks of ECC RAM, 3 separate nvme drives, 2 SATA SSDs and 2 4TB HGST spinning rust devices in mdadm RAID 1.

kernel-devel-5.11.12-100.fc32.x86_64
kernel-5.11.12-100.fc32.x86_64
kernel-core-5.11.12-100.fc32.x86_64
kernel-modules-extra-5.11.12-100.fc32.x86_64
kernel-modules-5.11.12-100.fc32.x86_64

Comment 4 Trevor Hemsley 2021-04-14 20:54:45 UTC
Created attachment 1771984 [details]
latest 5.11.12 kernel crash console capture

Comment 5 Trevor Hemsley 2021-04-14 21:07:05 UTC
[trevor@trevor4 SRPMS]$ cat /proc/sys/kernel/tainted
0

From latest working kernel, 5.9.16-100.

This also forces a complete resync of my RAID 1 array every time it crashes, so it's more than a bit annoying and not exactly easy to try new kernels.

Comment 6 Trevor Hemsley 2021-04-15 12:57:56 UTC
Just in case anyone is looking at this and keeping silent about it, I also sent to netdev@vger and got the following response:

On 15/04/2021 10:03, Trevor Hemsley wrote:
> Hi,
>
> I run Fedora 32 and since kernels in the 5.10 series I have been unable to boot without getting a panic in the sfc module. I tried on 5.11.12 tonight and the crash still occurs. I have tried reporting this via Fedora channels but the silence has been deafening
Seems Red Hat couldn't even be bothered to forward it to us :sigh:

> and I suspect this is an upstream issue anyway.
You could try building an upstream kernel and driver, and attempting to
 reproduce the issue there.  That would remove some of the unknowns.

> BUG: kernel NULL pointer dereference, address: 0000000000000104

> RIP: 0010:efx_farch_ev_process+0x3d2/0x910 [sfc]
> Code: c0 02 39 f0 76 34 c1 fe 02 41 03 b6 28 07 00 00 83 e1 03 49 8b 84 f6 d0 00 00 00 48 8b 94 c8 80 09 00 00 b0 01 00 00 00 31 c9 <f0> 8f b1 8a 04 81 00 00 05 c0 0f 05 37 03 00 00 48 8d 74 24 20 4c
Hmm, I think this is actually <f0> 0f b1 8a 04 01 00 00 85...
 which decodes as lock cmpxchg %ecx,0x104(%rdx)
With other transcription errors fixed, the key sequence appears to be
    mov $0x1,%eax
    xor %ecx,%ecx
    lock cmpxchg %ecx,0x104(%rdx)
So we're saying "if (rdx[0x104] == 1) rdx[0x104] = 0", only atomically.
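
As a rough userspace illustration of that compare-and-swap (not the driver code: the struct, field and function names below are made up for the example, and C11 atomics stand in for the kernel's atomic_cmpxchg()), a sketch might look like this. If the pointer passed in were NULL, the access would land at address 0x104, the same address as in the BUG line.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for whatever %rdx points at; the real sfc
 * structure layout is not reproduced here. */
struct demo_queue {
    char pad[0x104];               /* filler so the counter sits at offset 0x104 */
    atomic_int flush_outstanding;  /* the 32-bit word targeted by the cmpxchg */
};

/* Holds on common ABIs where atomic_int is 4 bytes with 4-byte alignment. */
_Static_assert(offsetof(struct demo_queue, flush_outstanding) == 0x104,
               "counter expected at offset 0x104");

/* Atomically performs "if (counter == 1) counter = 0".
 * Returns 1 if the swap happened, 0 if the counter held some other value. */
static int clear_if_one(struct demo_queue *q)
{
    int expected = 1;   /* mov $0x1,%eax  : the value we compare against */
    int desired = 0;    /* xor %ecx,%ecx  : the value stored on a match  */

    /* lock cmpxchg %ecx,0x104(%rdx) */
    return atomic_compare_exchange_strong(&q->flush_outstanding,
                                          &expected, desired);
}
```

Calling clear_if_one() on a queue whose counter is 1 returns 1 and leaves the counter at 0; a second call returns 0 and changes nothing.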
I'd *guess* this is the atomic_cmpxchg() in efx_farch_handle_tx_flush_done()
 (though it'd be nice to have your sfc.ko, with debugging symbols, to
 check for certain).
Which in turn tells us that tx_queue is NULL; this is suspicious
 because the relevant commits
    a81dcd85a7c1 ("sfc: assign TXQs without gaps")
    12804793b17c ("sfc: decouple TXQ type from label")
 happened at about the right time to cause this regression.
So now I have to go off and figure out exactly what the semantics
 of this TX flush done event's 'subdata' field are... looks like it
 probably corresponds to tx_queue->queue from
 efx_farch_flush_tx_queue().
Unfortunately, there is no simple lookup to convert from qid to
 tx_queue, because we just allocate queues as-needed in
 efx_set_channels() and don't store the reverse mapping (everything
 else works by label rather than queue, so doesn't need it).
I think the right fix is probably just to have
 efx_farch_handle_tx_flush_done() (and presumably also
 efx_farch_handle_rx_flush_done()) iterate over all queues (or at
 least all queues on the channel that received the event; but
 possibly the events might always be delivered to channel 0 rather
 than necessarily the channel that owns the queue) and perform the
 handling on any queue whose qid matches.
I will follow up with a patch, hopefully some time next week if I can
 find a 6122F to test with.
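
The idea proposed above (walk every queue and handle the ones whose qid matches, instead of indexing directly by qid) can be sketched in plain C as follows. This is only an illustration under assumed names and sizes: sketch_nic, sketch_txq, the array dimensions and the in_use flag are all hypothetical, and it is not the patch that was eventually posted.

```c
#include <stdatomic.h>
#include <stddef.h>

#define SKETCH_MAX_CHANNELS    4   /* assumed size, for illustration only */
#define SKETCH_TXQ_PER_CHANNEL 4   /* assumed size, for illustration only */

/* Hypothetical, heavily simplified TX queue. */
struct sketch_txq {
    unsigned int queue;            /* hardware queue id (qid) */
    int in_use;                    /* nonzero if this slot was allocated */
    atomic_int flush_outstanding;  /* 1 while a flush is pending */
};

/* Hypothetical NIC: queues are stored per channel, with no qid -> queue map. */
struct sketch_nic {
    struct sketch_txq txq[SKETCH_MAX_CHANNELS][SKETCH_TXQ_PER_CHANNEL];
};

/* Handle a "TX flush done" event for hardware queue id `qid` by scanning
 * all channels and all queues. Unallocated slots and non-matching qids are
 * skipped, so an unknown qid is ignored rather than dereferenced.
 * Returns the number of queues whose pending flush was completed. */
static int sketch_handle_tx_flush_done(struct sketch_nic *nic, unsigned int qid)
{
    int handled = 0;

    for (size_t c = 0; c < SKETCH_MAX_CHANNELS; c++) {
        for (size_t i = 0; i < SKETCH_TXQ_PER_CHANNEL; i++) {
            struct sketch_txq *q = &nic->txq[c][i];
            int expected = 1;

            if (!q->in_use || q->queue != qid)
                continue;

            /* Same "if (x == 1) x = 0" step as in the disassembly. */
            if (atomic_compare_exchange_strong(&q->flush_outstanding,
                                               &expected, 0))
                handled++;
        }
    }
    return handled;
}
```

The scan costs a few dozen comparisons per event, which seems acceptable for something as rare as a flush completion; the alternative of storing a reverse qid-to-queue mapping would avoid the loop at the price of extra bookkeeping when queues are allocated.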

> Just prior to the crash I get a pair of messages that don't look particularly right but I get these on 5.9.16 too and that survives.
>
> [    9.027961] sfc 0000:0b:00.0 enp11s0f0np0: MC command 0x2a inlen 16 failed rc=-22 (raw=0) arg=0
> [    9.029895] sfc 0000:0b:00.1 enp11s0f1np1: MC command 0x2a inlen 16 failed rc=-22 (raw=0) arg=0

0x2a is MC_CMD_SET_LINK, which gets called in a variety of situations
 like MTU change, link advertising change (e.g. ethtool -s), and SFP+
 module hotplug.  An -EINVAL failure typically means we've asked for
 some combination of link modes that is unsupported or nonsensical; to
 investigate this further you could try with the mcdi_logging_default=1
 module parameter, which will log all MC commands and responses at
 KERN_INFO — these can then be decoded by reference to mcdi_pcol.h.
In any case this seems to be unrelated to the above issue.

-ed

Comment 7 Trevor Hemsley 2021-04-20 21:51:03 UTC
The patches in https://lore.kernel.org/netdev/6b97b589-91fe-d71e-a7d0-5662a4f7a91c@gmail.com/T/#t apply cleanly to the Fedora 32 kernel-5.11.14-00.fc32.x86_64 package, rebuild with no errors, and fix the crash.

Comment 8 Justin M. Forbes 2021-04-21 16:33:41 UTC
Thanks, looks like they were applied to netdev, but have not gotten any feedback yet. I will give it a day or 2 to see how things go, and then pull it back into 5.11.x if all is well. It will likely make 5.11.17, though it could be 5.11.18 depending on timing for that release.

Comment 9 Trevor Hemsley 2021-04-21 16:40:50 UTC
It was me that reported it on netdev, and I fed back to the maintainer via email, as none of the lists on vger accept mail from either my private email address (my entire ISP appears to be blocked) or from O365 (HTML attachments even when send in plain text is selected in thunderturd!), so I had to bodge it to get email through to the list at all.

Comment 10 Justin M. Forbes 2021-04-29 14:13:01 UTC
These patches were included in the pull request for Linus for 5.13 last night. As a result, I have applied them to the Fedora trees for 5.11 and 5.12.  They should be included in the 5.11.18 kernel build when it happens, and 5.12.1.  For 5.13 (rawhide), it will depend on when Linus does the pull, but I would expect by end of week, and definitely by rc1.  Please report back and close this bug if that fixes your issues.

Comment 11 Fedora Program Management 2021-04-29 16:58:26 UTC
This message is a reminder that Fedora 32 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 32 on 2021-05-25.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '32'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 32 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 12 Ben Cotton 2021-05-25 17:31:55 UTC
Fedora 32 changed to end-of-life (EOL) status on 2021-05-25. Fedora 32 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

