Bug 1920857

Summary: in-kernel BTF is malformed
Product: [Fedora] Fedora Reporter: Chris Murphy <bugzilla>
Component: kernelAssignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED RAWHIDE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: rawhideCC: acaringi, adscvr, agedosier, airlied, alciregi, aoliva, berrange, bskeggs, clalancette, dmalcolm, fweimer, hdegoede, itamar, jakub, jarodwilson, jeremy, jforbes, jglisse, jonathan, josef, jwakely, kernel-maint, laine, law, lgoncalv, libvirt-maint, linville, masami256, mchehab, mpolacek, mprivozn, msebor, ncarrilho, nickc, ptalbert, pwhalen, santiago, sipoyare, steved, veillard, virt-maint, vrothber
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: dwarves1-1.20-1.fc34 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-02-08 20:47:36 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
virsh dumpxml
none
journal
none
bpf.c
none
dmesg
none
dmesg 5.11.0-0.rc6.134.fc34.x86_64 none

Description Chris Murphy 2021-01-27 06:49:48 UTC
Description of problem:

VM starts with kernel 5.10.10 but fails with 5.11.0-rc5.


Version-Release number of selected component (if applicable):
libvirt-daemon-7.0.0-1.fc34.x86_64
qemu-kvm-5.2.0-4.fc34.x86_64
systemd-247.2-1.fc34.x86_64
kernel-debug-5.11.0-0.rc5.134.fc34.x86_64


How reproducible:
Always


Steps to Reproduce:
1. Boot with kernel-debug-5.11.0-0.rc5.134.fc34.x86_64
2. Try to start "fedora33" VM using virt-manager
3.

Actual results:

VM doesn't start, I get an error in virt-manager:
Error starting domain: failed to load cgroup BPF prog: Invalid argument

Traceback (most recent call last):
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 65, in cb_wrapper
    callback(asyncjob, *args, **kwargs)
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 101, in tmpcb
    callback(*args, **kwargs)
  File "/usr/share/virt-manager/virtManager/object/libvirtobject.py", line 57, in newfn
    ret = fn(self, *args, **kwargs)
  File "/usr/share/virt-manager/virtManager/object/domain.py", line 1329, in startup
    self._backend.create()
  File "/usr/lib64/python3.9/site-packages/libvirt.py", line 1234, in create
    if ret == -1: raise libvirtError ('virDomainCreate() failed', dom=self)
libvirt.libvirtError: failed to load cgroup BPF prog: Invalid argument



Journal


[  165.419108] systemd-machined[800]: New machine qemu-1-fedora33.
[  165.468109] systemd[1]: Started Virtual Machine qemu-1-fedora33.
[  165.502623] libvirtd[986]: failed to load cgroup BPF prog: Invalid argument
[  165.506860] audit[986]: VIRT_RESOURCE pid=986 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:virtd_t:s0-s0:c0.c1023 msg='virt=kvm resrc=cgroup reason=deny vm="fedora33" uuid=c35d7a47-eb22-4de9-8291-3f6c3f01a7c2 cgroup="/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dfedora33.scope/" class=all exe="/usr/sbin/libvirtd" hostname=? addr=? terminal=? res=failed'
[  165.700885] kernel: virbr0: port 1(vnet0) entered disabled state
[  165.797256] kernel: device vnet0 left promiscuous mode
[  165.797448] kernel: virbr0: port 1(vnet0) entered disabled state
[  165.646976] audit: ANOM_PROMISCUOUS dev=vnet0 prom=0 old_prom=256 auid=4294967295 uid=107 gid=107 ses=4294967295
[  165.672594] NetworkManager[863]: <info>  [1611729971.4410] device (vnet0): state change: activated -> unmanaged (reason 'unmanaged', sys-iface-state: 'removed')
[  165.674047] NetworkManager[863]: <info>  [1611729971.4423] device (vnet0): released from master device virbr0
[  165.676858] gnome-shell[1142]: Removing a network device that was not added
[  165.682699] gnome-shell[1142]: JS ERROR: TypeError: this._devices[section] is undefined
                                  _connectionRemoved@resource:///org/gnome/shell/ui/status/network.js:1950:27
[  165.699697] systemd[1]: machine-qemu\x2d1\x2dfedora33.scope: Succeeded.
[  165.721082] systemd-machined[800]: Machine qemu-1-fedora33 terminated.
[  165.728116] libvirtd[986]: unable to open '/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dfedora33.scope/': No such file or directory
[  165.728782] libvirtd[986]: Failed to remove cgroup for fedora33



Expected results:

Should start

Additional info:

Comment 1 Chris Murphy 2021-01-27 06:50:55 UTC
Created attachment 1751144 [details]
virsh dumpxml

Comment 2 Chris Murphy 2021-01-27 06:51:08 UTC
Created attachment 1751145 [details]
journal

Comment 3 Chris Murphy 2021-01-27 06:52:08 UTC
systemd-247.2-1.fc34.x86_64

Comment 4 Michal Privoznik 2021-02-01 12:10:33 UTC
I've managed to reproduce and found that virBPFLoadProg() logs the following message:

in-kernel BTF is malformed\nprocessed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0\n

This error message is produced here:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/bpf/verifier.c#n11994

	if (IS_ERR(btf_vmlinux)) {
		/* Either gcc or pahole or kernel are broken. */
		verbose(env, "in-kernel BTF is malformed\n");
		ret = PTR_ERR(btf_vmlinux);
		goto skip_full_check;
	}

Based on the comment, I don't think that libvirt is the broken one. I think the bug should be switched over to kernel. In fact, I was able to write a small reproducer that takes libvirt out of the picture.

Comment 5 Michal Privoznik 2021-02-01 12:12:13 UTC
Created attachment 1752813 [details]
bpf.c

Comment 6 Michal Privoznik 2021-02-01 12:13:22 UTC
Reassigning to kernel based on comment 4.

Comment 7 Paul Whalen 2021-02-03 20:03:37 UTC
Seeing the same error when using a Dockerfile with podman:

STEP 1: FROM registry.fedoraproject.org/fedora:latest
STEP 2: RUN /usr/bin/dnf install -y httpd
error running container: error creating container for [/bin/sh -c /usr/bin/dnf install -y httpd]: bpf create `in-kernel BTF is malformed
processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
`: Invalid argument
: exit status 1
Error: error building at STEP "RUN /usr/bin/dnf install -y httpd": error while running runtime: exit status 1


In OpenQA - https://openqa.fedoraproject.org/tests/768948#step/podman/18

podman-plugins-3.0.0-0.184.dev.gitfb653c4.fc34.x86_64
podman-3.0.0-0.184.dev.gitfb653c4.fc34.x86_64

5.11.0-0.rc6.141.fc34.x86_64

Comment 8 Chris Murphy 2021-02-03 20:28:21 UTC
I sent an email to the usptream list:
https://lore.kernel.org/bpf/CAJCQCtSQLc0VHqO4BY_-YB2OmCNNmHCS6fNdQKmMWGn2v=Jpdw@mail.gmail.com/T/#u

Comment 9 Paul Whalen 2021-02-03 20:30:50 UTC
Also fails with Kernel 5.11.0-0.rc5.134.fc34.

Downgrading to Kernel 5.11.0-0.rc4.129.fc34, podman works without error. 

Also seen in OpenQA - https://openqa.fedoraproject.org/tests/763612#step/podman/16

Comment 10 Chris Murphy 2021-02-03 20:47:52 UTC
Confirmed, no error in virt-manager using 5.11.0-0.rc4.129.fc34.x86_64+debug either. The VM and its guest OS start up normally.

Comment 11 Chris Murphy 2021-02-03 22:29:15 UTC
Created attachment 1754912 [details]
dmesg

Maybe it's related...

Feb 03 15:06:26 fmac.local kernel: BPF:        sched_reset_on_fork type_id=6 bitfield_size=0 bits_offset=0
Feb 03 15:06:26 fmac.local kernel: BPF: 
Feb 03 15:06:26 fmac.local kernel: BPF:Invalid member bits_offset
Feb 03 15:06:26 fmac.local kernel: BPF:

Comment 12 Chris Murphy 2021-02-03 22:33:58 UTC
I'm trying to bisect between rc4 and rc5, but the resulting kernel is failing to startup due to many messages like this, many of which also repeat.

Feb 03 15:05:41 kernel: failed to validate module [fuse] BTF: -22
Feb 03 15:05:41 kernel: failed to validate module [hid_apple] BTF: -22
Feb 03 15:05:41 kernel: failed to validate module [hid_apple] BTF: -22
Feb 03 15:05:41 kernel: failed to validate module [video] BTF: -22
Feb 03 15:05:41 kernel: failed to validate module [crc_itu_t] BTF: -22
Feb 03 15:05:41 kernel: failed to validate module [hid_apple] BTF: -22
Feb 03 15:05:41 kernel: failed to validate module [hid_appleir] BTF: -22
..snip...


So I'm stuck for the moment until I figure that out.

Comment 13 Chris Murphy 2021-02-04 00:13:39 UTC
Created attachment 1754934 [details]
dmesg 5.11.0-0.rc6.134.fc34.x86_64

Comment 14 Chris Murphy 2021-02-04 01:22:07 UTC
Looks like this bug is gcc 11 related, doesn't happen when I recompile with gcc 10.2.1.

Comment 15 Jakub Jelinek 2021-02-04 12:24:05 UTC
It just seems kernel or whatever kernel uses (pahole) isn't able to deal with DWARF5.

Comment 16 Daniel Berrangé 2021-02-04 12:31:53 UTC
Yep, dwarves  package is currently broken wrt DWARF5 format.  There are patches upstream and a new release coming real soon, meanwhile you have to build with -gdwarf-4 in CFLAGS if you need to use any of the dwarves tools like pahole. See https://bugzilla.redhat.com/show_bug.cgi?id=1919965

Comment 17 Jakub Jelinek 2021-02-04 12:49:56 UTC
And to be precise, it is broken for DWARF4 already, just GCC doesn't emit that DWARF4 attribute unless -gdwarf-5 because too many consumers were broken 4 years ago.

Comment 18 Pavel Hrdina 2021-02-04 14:12:33 UTC
*** Bug 1925158 has been marked as a duplicate of this bug. ***

Comment 19 Nuno Carrilho 2021-02-04 22:36:24 UTC
kernel-5.11.0-0.rc6.20210203git3aaf0a27ffc2.143.fc34.x86_64 does not resolve this issue :(

Still had to  go back to 5.11.0-0.rc4.129.fc34

Comment 20 Chris Murphy 2021-02-05 03:06:32 UTC
dwarves-1.20-1.fc34.x86_64
libdwarves1-1.20-1.fc34.x86_64

Fixes both "failed to validate module [?????] BTF: -22" type errors, and 'in-kernel BTF is malformed" with qemu-kvm and libvirt.

Comment 21 Chris Murphy 2021-02-05 03:09:30 UTC
OK maybe the second problem is fixed with gcc-11.0.0-0.18.fc34.x86_64 which results in (GCC) 11.0.0 20210130 attached to the kernel build I just did. I'm not sure what else it could have been.

Comment 22 Chris Murphy 2021-02-05 03:26:00 UTC
As Nuno reports, this kernel still has the problem, but it also is built with GCC 11.0.0 20210130 so... that's not it? Bah! Anyway, I've also updated the upstream thread.

Feb 04 18:36:31 fmac.local kernel: Linux version 5.11.0-0.rc6.20210203git3aaf0a27ffc2.143.fc34.x86_64 (mockbuild.fedoraproject.org) (gcc (GCC) 11.0.0 20210130 (Red Hat 11.0.0-0), GNU ld version 2.35.1-26.fc34) #1 SMP Wed Feb 3 19:07:40 UTC 2021

Comment 23 Nuno Carrilho 2021-02-05 09:11:46 UTC
Hi Chris,

I believe dwarves-1.20-1.fc34.x86_64 and libdwarves1-1.20-1.fc34.x86_64 packages nothing have to do with this issue.
O my server both are not installed.
Thanks

Comment 24 Florian Weimer 2021-02-05 09:13:23 UTC
From https://kojipkgs.fedoraproject.org//packages/kernel/5.11.0/0.rc6.20210203git3aaf0a27ffc2.143.fc34/data/logs/x86_64/root.log:

DEBUG util.py:446:   dwarves                   x86_64  1.19-2.fc34                            build  121 k
DEBUG util.py:446:   libdwarves1               x86_64  1.19-2.fc34                            build  186 k

So the .143. kernel was still built using the old dwarves version, so I think the failure is expected?

Comment 25 Paul Whalen 2021-02-08 20:32:31 UTC
Podman is working again with Kernel 5.11.0-0.rc7.149.fc34.aarch64, which was built with libdwarves1-1.20-1.fc34

Comment 26 Chris Murphy 2021-02-08 20:48:56 UTC
And also fixed in 5.11.0-0.rc7.149.fc34, et al.