Hide Forgot
Description of problem: VM starts with kernel 5.10.10 but fails with 5.11.0-rc5. Version-Release number of selected component (if applicable): libvirt-daemon-7.0.0-1.fc34.x86_64 qemu-kvm-5.2.0-4.fc34.x86_64 systemd-247.2-1.fc34.x86_64 kernel-debug-5.11.0-0.rc5.134.fc34.x86_64 How reproducible: Always Steps to Reproduce: 1. Boot with kernel-debug-5.11.0-0.rc5.134.fc34.x86_64 2. Try to start "fedora33" VM using virt-manager 3. Actual results: VM doesn't start, I get an error in virt-manager: Error starting domain: failed to load cgroup BPF prog: Invalid argument Traceback (most recent call last): File "/usr/share/virt-manager/virtManager/asyncjob.py", line 65, in cb_wrapper callback(asyncjob, *args, **kwargs) File "/usr/share/virt-manager/virtManager/asyncjob.py", line 101, in tmpcb callback(*args, **kwargs) File "/usr/share/virt-manager/virtManager/object/libvirtobject.py", line 57, in newfn ret = fn(self, *args, **kwargs) File "/usr/share/virt-manager/virtManager/object/domain.py", line 1329, in startup self._backend.create() File "/usr/lib64/python3.9/site-packages/libvirt.py", line 1234, in create if ret == -1: raise libvirtError ('virDomainCreate() failed', dom=self) libvirt.libvirtError: failed to load cgroup BPF prog: Invalid argument Journal [ 165.419108] systemd-machined[800]: New machine qemu-1-fedora33. [ 165.468109] systemd[1]: Started Virtual Machine qemu-1-fedora33. [ 165.502623] libvirtd[986]: failed to load cgroup BPF prog: Invalid argument [ 165.506860] audit[986]: VIRT_RESOURCE pid=986 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:virtd_t:s0-s0:c0.c1023 msg='virt=kvm resrc=cgroup reason=deny vm="fedora33" uuid=c35d7a47-eb22-4de9-8291-3f6c3f01a7c2 cgroup="/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dfedora33.scope/" class=all exe="/usr/sbin/libvirtd" hostname=? addr=? terminal=? res=failed' [ 165.700885] kernel: virbr0: port 1(vnet0) entered disabled state [ 165.797256] kernel: device vnet0 left promiscuous mode [ 165.797448] kernel: virbr0: port 1(vnet0) entered disabled state [ 165.646976] audit: ANOM_PROMISCUOUS dev=vnet0 prom=0 old_prom=256 auid=4294967295 uid=107 gid=107 ses=4294967295 [ 165.672594] NetworkManager[863]: <info> [1611729971.4410] device (vnet0): state change: activated -> unmanaged (reason 'unmanaged', sys-iface-state: 'removed') [ 165.674047] NetworkManager[863]: <info> [1611729971.4423] device (vnet0): released from master device virbr0 [ 165.676858] gnome-shell[1142]: Removing a network device that was not added [ 165.682699] gnome-shell[1142]: JS ERROR: TypeError: this._devices[section] is undefined _connectionRemoved@resource:///org/gnome/shell/ui/status/network.js:1950:27 [ 165.699697] systemd[1]: machine-qemu\x2d1\x2dfedora33.scope: Succeeded. [ 165.721082] systemd-machined[800]: Machine qemu-1-fedora33 terminated. [ 165.728116] libvirtd[986]: unable to open '/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dfedora33.scope/': No such file or directory [ 165.728782] libvirtd[986]: Failed to remove cgroup for fedora33 Expected results: Should start Additional info:
Created attachment 1751144 [details] virsh dumpxml
Created attachment 1751145 [details] journal
systemd-247.2-1.fc34.x86_64
I've managed to reproduce and found that virBPFLoadProg() logs the following message: in-kernel BTF is malformed\nprocessed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0\n This error message is produced here: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/bpf/verifier.c#n11994 if (IS_ERR(btf_vmlinux)) { /* Either gcc or pahole or kernel are broken. */ verbose(env, "in-kernel BTF is malformed\n"); ret = PTR_ERR(btf_vmlinux); goto skip_full_check; } Based on the comment, I don't think that libvirt is the broken one. I think the bug should be switched over to kernel. In fact, I was able to write a small reproducer that takes libvirt out of the picture.
Created attachment 1752813 [details] bpf.c
Reassigning to kernel based on comment 4.
Seeing the same error when using a Dockerfile with podman: STEP 1: FROM registry.fedoraproject.org/fedora:latest STEP 2: RUN /usr/bin/dnf install -y httpd error running container: error creating container for [/bin/sh -c /usr/bin/dnf install -y httpd]: bpf create `in-kernel BTF is malformed processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0 `: Invalid argument : exit status 1 Error: error building at STEP "RUN /usr/bin/dnf install -y httpd": error while running runtime: exit status 1 In OpenQA - https://openqa.fedoraproject.org/tests/768948#step/podman/18 podman-plugins-3.0.0-0.184.dev.gitfb653c4.fc34.x86_64 podman-3.0.0-0.184.dev.gitfb653c4.fc34.x86_64 5.11.0-0.rc6.141.fc34.x86_64
I sent an email to the usptream list: https://lore.kernel.org/bpf/CAJCQCtSQLc0VHqO4BY_-YB2OmCNNmHCS6fNdQKmMWGn2v=Jpdw@mail.gmail.com/T/#u
Also fails with Kernel 5.11.0-0.rc5.134.fc34. Downgrading to Kernel 5.11.0-0.rc4.129.fc34, podman works without error. Also seen in OpenQA - https://openqa.fedoraproject.org/tests/763612#step/podman/16
Confirmed, no error in virt-manager using 5.11.0-0.rc4.129.fc34.x86_64+debug either. The VM and its guest OS start up normally.
Created attachment 1754912 [details] dmesg Maybe it's related... Feb 03 15:06:26 fmac.local kernel: BPF: sched_reset_on_fork type_id=6 bitfield_size=0 bits_offset=0 Feb 03 15:06:26 fmac.local kernel: BPF: Feb 03 15:06:26 fmac.local kernel: BPF:Invalid member bits_offset Feb 03 15:06:26 fmac.local kernel: BPF:
I'm trying to bisect between rc4 and rc5, but the resulting kernel is failing to startup due to many messages like this, many of which also repeat. Feb 03 15:05:41 kernel: failed to validate module [fuse] BTF: -22 Feb 03 15:05:41 kernel: failed to validate module [hid_apple] BTF: -22 Feb 03 15:05:41 kernel: failed to validate module [hid_apple] BTF: -22 Feb 03 15:05:41 kernel: failed to validate module [video] BTF: -22 Feb 03 15:05:41 kernel: failed to validate module [crc_itu_t] BTF: -22 Feb 03 15:05:41 kernel: failed to validate module [hid_apple] BTF: -22 Feb 03 15:05:41 kernel: failed to validate module [hid_appleir] BTF: -22 ..snip... So I'm stuck for the moment until I figure that out.
Created attachment 1754934 [details] dmesg 5.11.0-0.rc6.134.fc34.x86_64
Looks like this bug is gcc 11 related, doesn't happen when I recompile with gcc 10.2.1.
It just seems kernel or whatever kernel uses (pahole) isn't able to deal with DWARF5.
Yep, dwarves package is currently broken wrt DWARF5 format. There are patches upstream and a new release coming real soon, meanwhile you have to build with -gdwarf-4 in CFLAGS if you need to use any of the dwarves tools like pahole. See https://bugzilla.redhat.com/show_bug.cgi?id=1919965
And to be precise, it is broken for DWARF4 already, just GCC doesn't emit that DWARF4 attribute unless -gdwarf-5 because too many consumers were broken 4 years ago.
*** Bug 1925158 has been marked as a duplicate of this bug. ***
kernel-5.11.0-0.rc6.20210203git3aaf0a27ffc2.143.fc34.x86_64 does not resolve this issue :( Still had to go back to 5.11.0-0.rc4.129.fc34
dwarves-1.20-1.fc34.x86_64 libdwarves1-1.20-1.fc34.x86_64 Fixes both "failed to validate module [?????] BTF: -22" type errors, and 'in-kernel BTF is malformed" with qemu-kvm and libvirt.
OK maybe the second problem is fixed with gcc-11.0.0-0.18.fc34.x86_64 which results in (GCC) 11.0.0 20210130 attached to the kernel build I just did. I'm not sure what else it could have been.
As Nuno reports, this kernel still has the problem, but it also is built with GCC 11.0.0 20210130 so... that's not it? Bah! Anyway, I've also updated the upstream thread. Feb 04 18:36:31 fmac.local kernel: Linux version 5.11.0-0.rc6.20210203git3aaf0a27ffc2.143.fc34.x86_64 (mockbuild.fedoraproject.org) (gcc (GCC) 11.0.0 20210130 (Red Hat 11.0.0-0), GNU ld version 2.35.1-26.fc34) #1 SMP Wed Feb 3 19:07:40 UTC 2021
Hi Chris, I believe dwarves-1.20-1.fc34.x86_64 and libdwarves1-1.20-1.fc34.x86_64 packages nothing have to do with this issue. O my server both are not installed. Thanks
From https://kojipkgs.fedoraproject.org//packages/kernel/5.11.0/0.rc6.20210203git3aaf0a27ffc2.143.fc34/data/logs/x86_64/root.log: DEBUG util.py:446: dwarves x86_64 1.19-2.fc34 build 121 k DEBUG util.py:446: libdwarves1 x86_64 1.19-2.fc34 build 186 k So the .143. kernel was still built using the old dwarves version, so I think the failure is expected?
Podman is working again with Kernel 5.11.0-0.rc7.149.fc34.aarch64, which was built with libdwarves1-1.20-1.fc34
And also fixed in 5.11.0-0.rc7.149.fc34, et al.