Bug 2330681

Summary: Recent debug kernels fail to boot with "failed to validate module" errors (BPF / BTF)
Product: [Fedora] Fedora
Reporter: Mikhail <mikhail.v.gavrilov>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: unspecified
Version: rawhide
CC: acaringi, adscvr, airlied, alciregi, awilliam, bskeggs, bugzilla, hdegoede, hpa, jforbes, jmontleo, josef, kernel-maint, kparal, linville, lruzicka, masami256, mchehab, ptalbert, robatino, steved, suraj.ghimire7, zbyszek
Keywords: Reopened
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Linux
Whiteboard: RejectedBlocker AcceptedFreezeException
Fixed In Version: kernel-6.14.0-0.rc6.49.fc42
Last Closed: 2025-03-15 00:43:43 UTC
Bug Blocks: 2291264    
Attachments:
Terminal photo (flags: none)
Terminal photo (flags: none)
virsh console "dmesg" (flags: none)

Description Mikhail 2024-12-05 22:49:05 UTC
Something changed between kernel-6.13.0-0.rc0.20241126git7eef7e306d3c.10.fc42 and kernel-6.13.0-0.rc1.20241202gite70140ba0d2b.14.fc42: kernel-6.13.0-0.rc1.20241202gite70140ba0d2b.14.fc42 and all subsequent kernels fail to boot.

Instead of booting, the terminal shows a lot of "failed to validate module" messages.

However, an upstream kernel that I built from the same commit with the same .config works fine.

Reproducible: Always

Comment 1 Mikhail 2024-12-05 22:50:42 UTC
Created attachment 2061419 [details]
Terminal photo

Comment 2 Fedora Blocker Bugs Application 2024-12-06 06:46:27 UTC
Proposed as a Blocker and Freeze Exception for 42-beta by Fedora user mikhail using the blocker tracking app because:

 Some changes in the RHEL patchset made all my systems completely unbootable.

Comment 3 Adam Williamson 2024-12-06 07:06:30 UTC
It boots fine on openQA (or else it wouldn't have passed gating, and all the Rawhide validation tests would fail).

On my system kernel-6.13.0-0.rc1.20241203gitcdd30ebb1b9f.16.fc42.x86_64 doesn't work for graphics, but does at least get me to a console (I haven't had time to look into why yet). I'm not seeing this BPF stuff.

Comment 4 Mikhail 2024-12-08 19:26:25 UTC
Adam, please test the debug kernel.
# dnf install kernel-debug kernel-debug-modules-extra


This issue only affects the debug kernel. The non-debug kernel works as intended.

Comment 5 Adam Williamson 2025-01-20 18:35:09 UTC
Mikhail, is this still happening?

Anyhow, if it only affects the debug kernel, I don't think it can be a blocker, as no install uses that by default...

Comment 6 Mikhail 2025-02-03 20:09:58 UTC
Created attachment 2075031 [details]
Terminal photo

(In reply to Adam Williamson from comment #5)
> Mikhail, is this still happening?
Yes, the latest builds https://koji.fedoraproject.org/koji/buildinfo?buildID=2649629 still do not work. The messages in the terminal have changed a bit, I suspect due to a problem with the dwarves package. https://bugzilla.redhat.com/show_bug.cgi?id=2342785
 
> Anyhow, if it only affects the debug kernel, I don't think it can be a
> blocker, as no install uses that by default...

The debug kernel allows you to see many problems that are usually hidden. That is why I use the debug kernel on a daily basis.

Comment 7 Adam Williamson 2025-02-03 23:03:07 UTC
That's a good reason to use it for testing, but it doesn't mean bugs in it are a release blocker.

Comment 8 Chris Murphy 2025-02-18 23:16:26 UTC
I'm hitting this also. These kernels do not boot, but their non-debug equivalents boot fine. What I get is a series of "failed to validate module" messages and then an apparent hang: no Plymouth prompt to unlock the root volume, and the ESC key does nothing.

kernel-debug-6.14.0-0.rc3.29.fc42.x86_64
kernel-debug-6.13.3-200.fc41.x86_64

Comment 9 Chris Murphy 2025-02-19 20:07:17 UTC
Created attachment 2077149 [details]
virsh console "dmesg"

Reproduced it in qemu/kvm. Other than having UEFI enabled (without Secure Boot), it's a stock VMM VM.

Comment 10 Chris Murphy 2025-02-19 20:26:56 UTC
Fedora-Workstation-Live-Rawhide-20250219.n.0.x86_64.iso is using 6.14.0-0.rc3.29.fc43.x86_64 which is a non-debug kernel. This is probably why openQA hasn't caught this problem.

Comment 11 Chris Murphy 2025-02-19 20:34:02 UTC
Just to be extra sure, looking in /run/rootfsbase/usr/lib/modules/6.14.0-0.rc3.29.fc43.x86_64/config I see:

# CONFIG_KASAN is not set
# CONFIG_BTRFS_ASSERT is not set

At least those two options are set on Fedora debug kernels. Another way to check is the kernel image size: non-debug kernels are 16-17M, while debug kernels are 31-32M.

root@localhost-live:~# ls -lsh /run/initramfs/live/boot/x86_64/loader/linux
17M -rwxr-xr-x. 1 root root 17M Feb 19 06:44 /run/initramfs/live/boot/x86_64/loader/linux
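
The config check above can be sketched in a few lines of Python. This is only a heuristic under stated assumptions: it treats the two options named in this comment (CONFIG_KASAN and CONFIG_BTRFS_ASSERT) as the markers of a debug kernel, and the function names are hypothetical helpers, not part of any Fedora tooling.

```python
import re

# Options that this comment reports as set on Fedora debug kernels.
# Assumption: checking only these two is a heuristic, not a complete test.
DEBUG_HINT_OPTIONS = ("CONFIG_KASAN", "CONFIG_BTRFS_ASSERT")


def enabled_options(config_text):
    """Return the set of option names set to y or m in a kernel config."""
    enabled = set()
    for line in config_text.splitlines():
        match = re.match(r"^(CONFIG_\w+)=(y|m)$", line.strip())
        if match:
            enabled.add(match.group(1))
    return enabled


def looks_like_debug_config(config_text):
    """True if every hint option is enabled in the given config text."""
    enabled = enabled_options(config_text)
    return all(opt in enabled for opt in DEBUG_HINT_OPTIONS)


# The two lines quoted above from the live image's config file.
sample = "# CONFIG_KASAN is not set\n# CONFIG_BTRFS_ASSERT is not set\n"
print(looks_like_debug_config(sample))  # False
```

In practice the text would be read from a config file such as the one under /run/rootfsbase/usr/lib/modules/ shown above.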

Comment 12 Chris Murphy 2025-02-19 20:46:46 UTC
Fails in both UEFI and BIOS qemu/kvm.

Comment 13 Adam Williamson 2025-02-19 20:58:06 UTC
> Fedora-Workstation-Live-Rawhide-20250219.n.0.x86_64.iso is using 6.14.0-0.rc3.29.fc43.x86_64 which is a non-debug kernel. This is probably why openQA hasn't caught this problem.

Well, yes, that's what all my comments above mean. It also means the bug isn't particularly critical; it just makes debugging kernel problems harder.

Comment 14 Mikhail 2025-02-19 21:48:18 UTC
Anyway, it's a regression. A user may have removed all non-debug kernels, and after upgrading to the next Fedora release, the system would be broken.

Comment 15 Adam Williamson 2025-02-23 19:30:39 UTC
-4 in https://pagure.io/fedora-qa/blocker-review/issue/1745 , marking rejected blocker. FE vote is still open.

Comment 16 Kamil Páral 2025-02-24 18:52:48 UTC
Discussed on 2025-02-24 in a blocker review meeting [1]:

!agreed 2330681 - AcceptedBetaFE - We would like to fix debug kernels ASAP, and we don't ship them on any medium, so this should be a safe freeze exception to grant.

[1] https://meetbot.fedoraproject.org/blocker-review_matrix_fedoraproject-org/2025-02-24/f42-blocker-review.2025-02-24-17.01.log.html

Comment 17 Adam Williamson 2025-03-02 17:46:09 UTC
*** Bug 2334643 has been marked as a duplicate of this bug. ***

Comment 18 Adam Williamson 2025-03-02 17:48:31 UTC
Useful comment from the other bug, from Jason Montleon:

From serial I collected some output:
```
[    9.670579] BPF: [145778] ENUM ee 
[    9.672260] BPF: size=4 vlen=53
[    9.673775] BPF:  
[    9.675183] BPF: Invalid name
[    9.676689] BPF: 
[    9.678155] failed to validate module [fuse] BTF: -22
[    9.901438] BPF: [145778] ENUM ee 
[    9.903195] BPF: size=4 vlen=53
[    9.904717] BPF:  
[    9.906061] BPF: Invalid name
[    9.907546] BPF: 
[    9.908922] failed to validate module [fuse] BTF: -22
[    9.994502] BPF: 	 type_id=350 bits_offset=64
[    9.996322] BPF:  
[    9.997616] BPF: Invalid name
[    9.999123] BPF: 
[   10.000428] failed to validate module [scsi_dh_alua] BTF: -22
[   10.065557] BPF: [145788] FUNC  
[   10.067110] BPF: type_id=199
[   10.068530] BPF:  
[   10.069743] BPF: Invalid name
[   10.071136] BPF: 
[   10.072428] failed to validate module [scsi_dh_emc] BTF: -22
[   10.143174] BPF: 	 type_id=18 bits_offset=296
[   10.144713] BPF:  
[   10.145914] BPF: Invalid name
[   10.147255] BPF: 
[   10.148445] failed to validate module [scsi_dh_rdac] BTF: -22
[   10.242935] systemd[1]: systemd-modules-load.service: Main process exited, code=exited, status=1/FAILURE
[   10.261194] systemd[1]: systemd-modules-load.service: Failed with result 'exit-code'.
[   10.279406] systemd[1]: Failed to start systemd-modules-load.service - Load Kernel Modules.
[FAILED] Failed to start systemd-modules-load.service - Load Kernel Modules.
See 'systemctl status systemd-modules-load.service' for details.
```
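
For anyone triaging a similar serial log, the affected modules can be pulled out with a short script. The following is a minimal sketch, assuming the "failed to validate module [name] BTF: code" line format seen in the output above; the function name is a hypothetical helper. The error code -22 corresponds to -EINVAL on Linux.

```python
import re
from collections import Counter

# Matches kernel lines like:
#   [    9.678155] failed to validate module [fuse] BTF: -22
FAILED_RE = re.compile(r"failed to validate module \[([^\]]+)\] BTF: (-?\d+)")


def failed_btf_modules(log_text):
    """Count BTF validation failures per (module, error code)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILED_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# A few lines copied from the serial output above.
sample_log = """\
[    9.678155] failed to validate module [fuse] BTF: -22
[    9.908922] failed to validate module [fuse] BTF: -22
[   10.000428] failed to validate module [scsi_dh_alua] BTF: -22
"""

for (module, code), count in sorted(failed_btf_modules(sample_log).items()):
    print(f"{module}: {count} failure(s), error {code}")
```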

Comment 19 Zbigniew Jędrzejewski-Szmek 2025-03-06 20:21:55 UTC
What is the libbpf version? There were some bugs that were only fixed in 1.5.0, but were also later backported to 1.4.7.

Comment 20 Justin M. Forbes 2025-03-07 17:58:49 UTC
Want to give https://koji.fedoraproject.org/koji/taskinfo?taskID=129948696 a try?

Comment 21 Jason Montleon 2025-03-07 21:06:59 UTC
Created attachment 2079293 [details]
6.14.0-0.rc5.20250307git00a7d39898c8.47.fc43.x86_64+debug journal

This boots, so it is much better. I do still see two cases of "Invalid offset", but I can't see what module(s) might be causing them. I have uploaded the journal in case someone else can pick it out.
```
Mar 07 15:54:41 fedora kernel: BPF:          type_id=3067 offset=0 size=1
Mar 07 15:54:41 fedora kernel: BPF:  
Mar 07 15:54:41 fedora kernel: BPF: Invalid offset
Mar 07 15:54:41 fedora kernel: BPF: 
```
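
One way to hunt for the source of such messages is to group consecutive BPF lines and keep the first non-BPF line that follows each group, since in the failing boots that line named the module. This is a rough sketch under that assumption; the helper name is hypothetical, and for the journal excerpt above it may find no following line at all.

```python
import re

# Captures the message part of any line containing "BPF:".
BPF_LINE_RE = re.compile(r"BPF:\s?(.*)$")


def bpf_error_blocks(log_text):
    """Group consecutive BPF lines; keep groups that report 'Invalid ...'.

    Each result is (messages, following_line), where following_line is the
    first non-BPF line after the group, or None if the log ends there.
    """
    blocks, current = [], []
    for line in log_text.splitlines():
        match = BPF_LINE_RE.search(line)
        if match:
            text = match.group(1).strip()
            if text:  # skip the empty "BPF:" separator lines
                current.append(text)
        elif current:
            blocks.append((current, line.strip()))
            current = []
    if current:
        blocks.append((current, None))
    return [b for b in blocks if any(t.startswith("Invalid") for t in b[0])]


# The journal excerpt quoted above.
journal = """\
Mar 07 15:54:41 fedora kernel: BPF:          type_id=3067 offset=0 size=1
Mar 07 15:54:41 fedora kernel: BPF:
Mar 07 15:54:41 fedora kernel: BPF: Invalid offset
Mar 07 15:54:41 fedora kernel: BPF:
"""

for messages, following in bpf_error_blocks(journal):
    print(messages, "->", following)
```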

Comment 22 Chris Murphy 2025-03-11 16:46:34 UTC
Tested fixed in kernel-debug-6.14.0-0.rc6.49.fc42.x86_64, reported fixed in 6.14.0-0.rc5.2a520073e74f.47

Comment 23 Adam Williamson 2025-03-11 16:58:27 UTC
Re-opening for F42 tracking.

Comment 24 Fedora Update System 2025-03-11 16:58:54 UTC
FEDORA-2025-1b8a020e07 (kernel-6.14.0-0.rc6.49.fc42 and kernel-headers-6.14.0-0.rc6.49.fc42) has been submitted as an update to Fedora 42.
https://bodhi.fedoraproject.org/updates/FEDORA-2025-1b8a020e07

Comment 25 Fedora Update System 2025-03-12 01:44:34 UTC
FEDORA-2025-1b8a020e07 has been pushed to the Fedora 42 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2025-1b8a020e07`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2025-1b8a020e07

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 26 Lukas Ruzicka 2025-03-12 11:37:06 UTC
The latest update of the debug kernel boots normally.

Comment 27 Fedora Update System 2025-03-15 00:43:43 UTC
FEDORA-2025-1b8a020e07 (kernel-6.14.0-0.rc6.49.fc42 and kernel-headers-6.14.0-0.rc6.49.fc42) has been pushed to the Fedora 42 stable repository.
If the problem still persists, please make note of it in this bug report.