Bug 2022075 - Nested KVM, RHEL 8.5 L0, Fedora 35 L1: qemu-kvm: ../target/i386/kvm/kvm.c:2833: kvm_buf_set_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: qemu
Version: 35
Assignee: Fedora Virtualization Maintainers
QA Contact: Fedora Extras Quality Assurance
Reported: 2021-11-10 17:26 UTC by Kevin Fenzi
Modified: 2022-12-13 15:49 UTC (History)
Last Closed: 2022-12-13 15:49:55 UTC
Type: Bug

Description Kevin Fenzi 2021-11-10 17:26:17 UTC
Output from libguestfs-test-tool:

[root@compose-rawhide01 ~][PROD-IAD2]# LIBGUESTFS_BACKEND=direct libguestfs-test-tool               
     ************************************************************                                   
     *                    IMPORTANT NOTICE
     *
     * When reporting bugs, include the COMPLETE, UNEDITED                                          
     * output below in your bug report.
     *
     ************************************************************                                   
LIBGUESTFS_BACKEND=direct
PATH=/root/.local/bin:/root/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin                   
XDG_RUNTIME_DIR=/run/user/0
SELinux: Enforcing
guestfs_get_append: (null)
guestfs_get_autosync: 1
guestfs_get_backend: direct
guestfs_get_backend_settings: []
guestfs_get_cachedir: /var/tmp                                                                      
guestfs_get_hv: /usr/bin/qemu-kvm                                                                   
guestfs_get_memsize: 1280                                                                           
guestfs_get_network: 0
guestfs_get_path: /usr/lib64/guestfs
guestfs_get_pgroup: 0                                                                               
guestfs_get_program: libguestfs-test-tool
guestfs_get_recovery_proc: 1                                                                        
guestfs_get_smp: 1
guestfs_get_sockdir: /tmp
guestfs_get_tmpdir: /tmp
guestfs_get_trace: 0                                                                                
guestfs_get_verbose: 1                                                                              
host_cpu: x86_64                                                                                    
Launching appliance, timeout set to 600 seconds.                                                    
libguestfs: launch: program=libguestfs-test-tool                                                    
libguestfs: launch: version=1.46.0fedora=35,release=1.fc35,libvirt                                  
libguestfs: launch: backend registered: direct                                                      
libguestfs: launch: backend registered: libvirt
libguestfs: launch: backend registered: uml                                                         
libguestfs: launch: backend registered: unix                                                        
libguestfs: launch: backend=direct
libguestfs: launch: tmpdir=/tmp/libguestfsk756Jb                                                    
libguestfs: launch: umask=0022                                                                      
libguestfs: launch: euid=0                                                                          
libguestfs: begin building supermin appliance
libguestfs: run supermin                                                                            
libguestfs: command: run: /usr/bin/supermin
libguestfs: command: run: \ --build   
libguestfs: command: run: \ --verbose                                                               
libguestfs: command: run: \ --if-newer
libguestfs: command: run: \ --lock /var/tmp/.guestfs-0/lock                                         
libguestfs: command: run: \ --copy-kernel                                                           
libguestfs: command: run: \ -f ext2
libguestfs: command: run: \ --host-cpu x86_64
libguestfs: command: run: \ /usr/lib64/guestfs/supermin.d                                           
libguestfs: command: run: \ -o /var/tmp/.guestfs-0/appliance.d                                      
supermin: version: 5.3.1                
supermin: rpm: detected RPM version 4.17
supermin: rpm: detected RPM architecture x86_64
supermin: package handler: fedora/rpm                                                               
supermin: acquiring lock on /var/tmp/.guestfs-0/lock                                                
supermin: if-newer: output does not need rebuilding                                                 
libguestfs: finished building supermin appliance                                                    
libguestfs: begin testing qemu features                                                             
libguestfs: checking for previously cached test results of /usr/bin/qemu-kvm, in /var/tmp/.guestfs-0
libguestfs: loading previously cached test results
libguestfs: qemu version: 6.1                                                                       
libguestfs: qemu mandatory locking: yes                                                             
libguestfs: qemu KVM: enabled             
libguestfs: finished testing qemu features
/usr/bin/qemu-kvm \                                                                                 
    -global virtio-blk-pci.scsi=off \  
    -no-user-config \
    -nodefaults \                                                                                   
    -display none \      
    -machine accel=kvm:tcg,graphics=off \                                                           
    -cpu max \             
    -m 1280 \     
    -no-reboot \          
    -rtc driftfix=slew \
    -no-hpet \             
    -global kvm-pit.lost_tick_policy=discard \
    -kernel /var/tmp/.guestfs-0/appliance.d/kernel \                                                
    -initrd /var/tmp/.guestfs-0/appliance.d/initrd \                                                
    -object rng-random,filename=/dev/urandom,id=rng0 \                                              
    -device virtio-rng-pci,rng=rng0 \
    -device virtio-scsi-pci,id=scsi \
    -drive file=/tmp/libguestfsk756Jb/scratch1.img,cache=unsafe,format=raw,id=hd0,if=none \         
    -device scsi-hd,drive=hd0 \          
    -drive file=/var/tmp/.guestfs-0/appliance.d/root,snapshot=on,id=appliance,cache=unsafe,if=none \
    -device scsi-hd,drive=appliance \
    -device virtio-serial-pci \
    -serial stdio \     
    -chardev socket,path=/tmp/libguestfsl7pJlc/guestfsd.sock,id=channel0 \                          
    -device virtserialport,chardev=channel0,name=org.libguestfs.channel.0 \                         
    -append "panic=1 console=ttyS0 edd=off udevtimeout=6000 udev.event-timeout=6000 no_timer_check printk.time=1 cgroup_disable=memory usbcore.nousb cryptomgr.notests tsc=reliable 8250.nr_uarts=1 root=UUID=e42f185a-b4a2-4864-95b9-37bd7186fab3 selinux=0 guestfs_verbose=1 TERM=screen"
qemu-kvm: error: failed to set MSR 0x345 to 0x2000                                                  
qemu-kvm: ../target/i386/kvm/kvm.c:2833: kvm_buf_set_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.
libguestfs: error: appliance closed the connection unexpectedly, see earlier error messages         
libguestfs: child_cleanup: 0x5587c998cd20: child process died                                       
libguestfs: sending SIGTERM to process 1272
libguestfs: error: /usr/bin/qemu-kvm killed by signal 6 (Aborted), see debug messages above         
libguestfs: error: guestfs_launch failed, see earlier error messages                                
libguestfs: closing guestfs handle 0x5587c998cd20 (state 0)                                         
libguestfs: command: run: rm                 
libguestfs: command: run: \ -rf /tmp/libguestfsk756Jb                                               
libguestfs: command: run: rm               
libguestfs: command: run: \ -rf /tmp/libguestfsl7pJlc
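
The write that fails in the output above targets MSR 0x345 with the value 0x2000. In the Intel SDM, 0x345 is IA32_PERF_CAPABILITIES. The following is a minimal sketch for decoding that value, assuming the low-bit layout documented there; the field names are labels chosen for this illustration, not identifiers taken from QEMU.

```python
# Low bits of IA32_PERF_CAPABILITIES (MSR 0x345) as (name, shift, width).
# The names are illustrative labels; bits above 13 are not modelled here.
PERF_CAP_FIELDS = [
    ("lbr_format", 0, 6),
    ("pebs_trap", 6, 1),
    ("pebs_arch_reg", 7, 1),
    ("pebs_format", 8, 4),
    ("smm_freeze", 12, 1),
    ("full_width_write", 13, 1),
]


def decode_perf_capabilities(value):
    """Return {field_name: field_value} for every non-zero modelled field."""
    decoded = {}
    for name, shift, width in PERF_CAP_FIELDS:
        field = (value >> shift) & ((1 << width) - 1)
        if field:
            decoded[name] = field
    return decoded


if __name__ == "__main__":
    # The value from the error message above.
    print(decode_perf_capabilities(0x2000))
```

Under this layout, 0x2000 has only bit 13 set, the full-width counter write capability.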

This is the Fedora rawhide composer VM. I just upgraded it from F34, where it was working fine, to F35.

The host is a RHEL 8.5 box running kernel-4.18.0-348.el8.x86_64 with 'options kvm_intel nested=1' set.
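
Whether the nested=1 module option actually took effect can be checked on the host by reading the kvm_intel module parameter from sysfs. The helper below is a small sketch, assuming the parameter reads back as "Y" or "1" when enabled; the function names are made up for this example.

```python
from pathlib import Path

# sysfs location of the kvm_intel "nested" module parameter.
NESTED_PARAM = Path("/sys/module/kvm_intel/parameters/nested")


def nested_enabled(raw):
    """Interpret the raw parameter text; 'Y' or '1' is treated as enabled."""
    return raw.strip() in ("Y", "1")


def read_nested(path=NESTED_PARAM):
    """Return True/False, or None if the parameter cannot be read
    (for example when kvm_intel is not loaded)."""
    try:
        return nested_enabled(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print("nested virtualization enabled:", read_nested())
```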

/proc/cpuinfo has: 
...
processor       : 95
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz
stepping        : 7
microcode       : 0x5003102
cpu MHz         : 3700.000
cache size      : 36608 KB
physical id     : 1
siblings        : 48
core id         : 27
cpu cores       : 24
apicid          : 119
initial apicid  : 119
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
bugs            : spectre_v1 spectre_v2 spec_store_bypass swapgs taa itlb_multihit
bogomips        : 4205.22
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

The guest cpuinfo has: 
processor       : 15
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel Xeon Processor (Cascadelake)
stepping        : 6
microcode       : 0x1
cpu MHz         : 2095.076
cache size      : 16384 KB
physical id     : 15
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 15
initial apicid  : 15
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities
vmx flags       : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid shadow_vmcs pml
bugs            : spectre_v1 spectre_v2 spec_store_bypass swapgs taa
bogomips        : 4190.15
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
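
Comparing the two flag lists above by eye is error-prone. The sketch below diffs the "flags" line of two /proc/cpuinfo dumps. It assumes each dump is available as text, and the function names are hypothetical.

```python
def cpu_flags(cpuinfo_text):
    """Return the flags from the first 'flags' line of /proc/cpuinfo text.

    The separate 'vmx flags' line is skipped because its key differs.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def flag_diff(host_text, guest_text):
    """Return (flags only on the host, flags only in the guest), sorted."""
    host = cpu_flags(host_text)
    guest = cpu_flags(guest_text)
    return sorted(host - guest), sorted(guest - host)


if __name__ == "__main__":
    # Tiny made-up excerpts, only to show the call shape.
    host = "flags\t\t: fpu vmx pdcm arch_perfmon\n"
    guest = "flags\t\t: fpu vmx pdcm hypervisor umip\nvmx flags\t: ept vpid\n"
    print(flag_diff(host, guest))
```

Applied to the two listings above, this would report, for example, arch_perfmon on the host only, and hypervisor and umip in the guest only, while pdcm appears in both.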

Comment 1 Richard W.M. Jones 2021-11-11 08:46:44 UTC
Nested KVM is unfortunately known to be very flaky.  Adding a few people
who work in this area.

Comment 2 Vitaly Kuznetsov 2021-11-11 11:51:08 UTC
Could you please also give QEMU command line from L0 and /proc/cpuinfo from L1 (in case the one from https://bugzilla.redhat.com/show_bug.cgi?id=2022075#c0 corresponds to L0)?

Comment 3 Richard W.M. Jones 2021-11-11 12:11:37 UTC
Definitely agree we'd want to see the L0 qemu command and the other information.
It would also be good to have the exact kernel and qemu versions of each layer.

I believe the scenario is:

  L0 : RHEL 8.5 (AV or non-AV?)

  L1 : Fedora 35      <-- qemu command from comment 0 runs here

  L2 : libguestfs appliance with same Fedora 35 kernel as L1

Comment 4 Laszlo Ersek 2021-11-11 13:27:14 UTC
This assertion failure is really strange; it implies that the kernel set *more* MSRs than what QEMU requested. When the kernel sets *fewer* than requested, QEMU logs an error but continues otherwise fine. This looks like a misunderstanding on MSRs between KVM and QEMU.
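
For reference, the KVM API documents the KVM_SET_MSRS ioctl as returning the number of MSRs it successfully set. Read against the output in comment 0, where the "failed to set MSR" line is printed immediately before the assertion, the relationship between that count and the two messages can be modelled roughly as below. This is a simplified sketch and not QEMU's source; the function name and the first MSR in the example batch are hypothetical.

```python
ASSERT_MSG = "Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed."


def check_set_msrs(requested, num_set):
    """Model the messages produced for a batch of (index, value) MSR writes.

    num_set stands in for the KVM_SET_MSRS return value, i.e. how many
    entries from the start of the batch were accepted.
    """
    messages = []
    if num_set < len(requested):
        # The first entry that was not accepted is the one at position num_set.
        index, value = requested[num_set]
        messages.append(f"error: failed to set MSR {index:#x} to {value:#x}")
    if num_set != len(requested):
        messages.append(ASSERT_MSG)
    return messages


if __name__ == "__main__":
    # Hypothetical two-entry batch in which the second write is rejected.
    batch = [(0x10, 0x0), (0x345, 0x2000)]
    for line in check_set_msrs(batch, 1):
        print(line)
```

In this model a short count yields both lines seen in comment 0, in the same order.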

Comment 5 Kevin Fenzi 2021-11-11 19:01:01 UTC
Thanks for all the quick replies here. :) 

In the meantime, last night I downgraded the guest to F34's qemu, and that got everything working fine, so it sounds to me like qemu is the issue here. ;)

(In reply to Vitaly Kuznetsov from comment #2)
> Could you please also give QEMU command line from L0 and /proc/cpuinfo from

qemu      247996 24.5 33.9 136310620 133897228 ? Sl   Nov10 377:28 /usr/libexec/qemu-kvm -name guest=compose-rawhide01.iad2.fedoraproject.org,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-13-compose-rawhide01.ia/master-key.aes -machine pc-q35-rhel8.2.0,accel=kvm,usb=off,vmport=off,dump-guest-core=off -cpu Cascadelake-Server,ss=on,vmx=on,pdcm=on,hypervisor=on,tsc-adjust=on,umip=on,pku=on,md-clear=on,stibp=on,arch-capabilities=on,xsaves=on,ibpb=on,ibrs=on,amd-stibp=on,amd-ssbd=on,rdctl-no=on,ibrs-all=on,skip-l1dfl-vmentry=on,mds-no=on,pschange-mc-no=on,tsx-ctrl=on,hle=off,rtm=off -m 131072 -overcommit mem-lock=off -smp 16,maxcpus=80,sockets=80,cores=1,threads=1 -uuid 7864f4a9-499d-4cee-bebd-31c4ca735fad -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=46,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global ICH9-LPC.disable_s3=1 -global ICH9-LPC.disable_s4=1 -boot strict=on -device pcie-root-port,port=0x10,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x2 -device pcie-root-port,port=0x11,chassis=2,id=pci.2,bus=pcie.0,addr=0x2.0x1 -device pcie-root-port,port=0x12,chassis=3,id=pci.3,bus=pcie.0,addr=0x2.0x2 -device pcie-root-port,port=0x13,chassis=4,id=pci.4,bus=pcie.0,addr=0x2.0x3 -device pcie-root-port,port=0x14,chassis=5,id=pci.5,bus=pcie.0,addr=0x2.0x4 -device pcie-root-port,port=0x15,chassis=6,id=pci.6,bus=pcie.0,addr=0x2.0x5 -device pcie-root-port,port=0x16,chassis=7,id=pci.7,bus=pcie.0,addr=0x2.0x6 -device pcie-pci-bridge,id=pci.8,bus=pci.1,addr=0x0 -device pcie-root-port,port=0x17,chassis=9,id=pci.9,bus=pcie.0,addr=0x2.0x7 -device qemu-xhci,p2=15,p3=15,id=usb,bus=pci.3,addr=0x0 -device virtio-serial-pci,id=virtio-serial0,bus=pci.4,addr=0x0 -blockdev 
{"driver":"host_device","filename":"/dev/vg_guests/compose-rawhide01.iad2.fedoraproject.org","aio":"native","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-1-storage"} -device virtio-blk-pci,scsi=off,bus=pci.5,addr=0x0,drive=libvirt-1-format,id=virtio-disk0,bootindex=1,write-cache=on -netdev tap,fd=48,id=hostnet0,vhost=on,vhostfd=49 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:57:74:7c,bus=pci.2,addr=0x0 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev socket,id=charchannel0,fd=50,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel1,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=com.redhat.spice.0 -device usb-tablet,id=input0,bus=usb.0,port=1 -spice port=5906,addr=127.0.0.1,disable-ticketing,image-compression=off,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pcie.0,addr=0x1 -device ich9-intel-hda,id=sound0,bus=pcie.0,addr=0x1b -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -device i6300esb,id=watchdog0,bus=pci.8,addr=0x1 -watchdog-action reset -chardev spicevmc,id=charredir0,name=usbredir -device usb-redir,chardev=charredir0,id=redir0,bus=usb.0,port=2 -chardev spicevmc,id=charredir1,name=usbredir -device usb-redir,chardev=charredir1,id=redir1,bus=usb.0,port=3 -device virtio-balloon-pci,id=balloon0,bus=pci.6,addr=0x0 -object rng-random,id=objrng0,filename=/dev/random -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.7,addr=0x0 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on
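
The CPU model and feature switches are buried in the middle of that command line. A small sketch for pulling out the -cpu argument follows; it assumes the command line is available as a single string, the helper name is made up, and the sample string is an abbreviated excerpt of the line above.

```python
import shlex


def cpu_option(cmdline):
    """Return (model, {feature: setting}) from the -cpu argument, or None."""
    args = shlex.split(cmdline)
    for i, arg in enumerate(args[:-1]):
        if arg == "-cpu":
            # First comma-separated item is the model, the rest are properties.
            model, *props = args[i + 1].split(",")
            features = dict(p.split("=", 1) for p in props if "=" in p)
            return model, features
    return None


if __name__ == "__main__":
    sample = (
        "/usr/libexec/qemu-kvm -machine pc-q35-rhel8.2.0,accel=kvm "
        "-cpu Cascadelake-Server,ss=on,vmx=on,pdcm=on,hle=off,rtm=off "
        "-m 131072"
    )
    model, features = cpu_option(sample)
    print(model, features.get("pdcm"))
```

On the L0 command line above, this gives the model Cascadelake-Server with pdcm=on among the enabled features.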


> L1 (in case the one from
> https://bugzilla.redhat.com/show_bug.cgi?id=2022075#c0 corresponds to L0).

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel Xeon Processor (Cascadelake)
stepping	: 6
microcode	: 0x1
cpu MHz		: 2095.076
cache size	: 16384 KB
physical id	: 0
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities
vmx flags	: vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid shadow_vmcs pml
bugs		: spectre_v1 spectre_v2 spec_store_bypass swapgs taa
bogomips	: 4190.15
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel Xeon Processor (Cascadelake)
stepping	: 6
microcode	: 0x1
cpu MHz		: 2095.076
cache size	: 16384 KB
physical id	: 1
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities
vmx flags	: vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid shadow_vmcs pml
bugs		: spectre_v1 spectre_v2 spec_store_bypass swapgs taa
bogomips	: 4190.15
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 2
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel Xeon Processor (Cascadelake)
stepping	: 6
microcode	: 0x1
cpu MHz		: 2095.076
cache size	: 16384 KB
physical id	: 2
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 2
initial apicid	: 2
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities
vmx flags	: vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid shadow_vmcs pml
bugs		: spectre_v1 spectre_v2 spec_store_bypass swapgs taa
bogomips	: 4190.15
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel Xeon Processor (Cascadelake)
stepping	: 6
microcode	: 0x1
cpu MHz		: 2095.076
cache size	: 16384 KB
physical id	: 3
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 3
initial apicid	: 3
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities
vmx flags	: vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid shadow_vmcs pml
bugs		: spectre_v1 spectre_v2 spec_store_bypass swapgs taa
bogomips	: 4190.15
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 4
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel Xeon Processor (Cascadelake)
stepping	: 6
microcode	: 0x1
cpu MHz		: 2095.076
cache size	: 16384 KB
physical id	: 4
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 4
initial apicid	: 4
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities
vmx flags	: vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid shadow_vmcs pml
bugs		: spectre_v1 spectre_v2 spec_store_bypass swapgs taa
bogomips	: 4190.15
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 5
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel Xeon Processor (Cascadelake)
stepping	: 6
microcode	: 0x1
cpu MHz		: 2095.076
cache size	: 16384 KB
physical id	: 5
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 5
initial apicid	: 5
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities
vmx flags	: vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid shadow_vmcs pml
bugs		: spectre_v1 spectre_v2 spec_store_bypass swapgs taa
bogomips	: 4190.15
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 6
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel Xeon Processor (Cascadelake)
stepping	: 6
microcode	: 0x1
cpu MHz		: 2095.076
cache size	: 16384 KB
physical id	: 6
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 6
initial apicid	: 6
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities
vmx flags	: vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid shadow_vmcs pml
bugs		: spectre_v1 spectre_v2 spec_store_bypass swapgs taa
bogomips	: 4190.15
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 7
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel Xeon Processor (Cascadelake)
stepping	: 6
microcode	: 0x1
cpu MHz		: 2095.076
cache size	: 16384 KB
physical id	: 7
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 7
initial apicid	: 7
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities
vmx flags	: vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid shadow_vmcs pml
bugs		: spectre_v1 spectre_v2 spec_store_bypass swapgs taa
bogomips	: 4190.15
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 8
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel Xeon Processor (Cascadelake)
stepping	: 6
microcode	: 0x1
cpu MHz		: 2095.076
cache size	: 16384 KB
physical id	: 8
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 8
initial apicid	: 8
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities
vmx flags	: vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid shadow_vmcs pml
bugs		: spectre_v1 spectre_v2 spec_store_bypass swapgs taa
bogomips	: 4190.15
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 9
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel Xeon Processor (Cascadelake)
stepping	: 6
microcode	: 0x1
cpu MHz		: 2095.076
cache size	: 16384 KB
physical id	: 9
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 9
initial apicid	: 9
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities
vmx flags	: vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid shadow_vmcs pml
bugs		: spectre_v1 spectre_v2 spec_store_bypass swapgs taa
bogomips	: 4190.15
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 10
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel Xeon Processor (Cascadelake)
stepping	: 6
microcode	: 0x1
cpu MHz		: 2095.076
cache size	: 16384 KB
physical id	: 10
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 10
initial apicid	: 10
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities
vmx flags	: vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid shadow_vmcs pml
bugs		: spectre_v1 spectre_v2 spec_store_bypass swapgs taa
bogomips	: 4190.15
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 11
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel Xeon Processor (Cascadelake)
stepping	: 6
microcode	: 0x1
cpu MHz		: 2095.076
cache size	: 16384 KB
physical id	: 11
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 11
initial apicid	: 11
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities
vmx flags	: vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid shadow_vmcs pml
bugs		: spectre_v1 spectre_v2 spec_store_bypass swapgs taa
bogomips	: 4190.15
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 12
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel Xeon Processor (Cascadelake)
stepping	: 6
microcode	: 0x1
cpu MHz		: 2095.076
cache size	: 16384 KB
physical id	: 12
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 12
initial apicid	: 12
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities
vmx flags	: vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid shadow_vmcs pml
bugs		: spectre_v1 spectre_v2 spec_store_bypass swapgs taa
bogomips	: 4190.15
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 13
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel Xeon Processor (Cascadelake)
stepping	: 6
microcode	: 0x1
cpu MHz		: 2095.076
cache size	: 16384 KB
physical id	: 13
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 13
initial apicid	: 13
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities
vmx flags	: vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid shadow_vmcs pml
bugs		: spectre_v1 spectre_v2 spec_store_bypass swapgs taa
bogomips	: 4190.15
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 14
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel Xeon Processor (Cascadelake)
stepping	: 6
microcode	: 0x1
cpu MHz		: 2095.076
cache size	: 16384 KB
physical id	: 14
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 14
initial apicid	: 14
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities
vmx flags	: vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid shadow_vmcs pml
bugs		: spectre_v1 spectre_v2 spec_store_bypass swapgs taa
bogomips	: 4190.15
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 15
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel Xeon Processor (Cascadelake)
stepping	: 6
microcode	: 0x1
cpu MHz		: 2095.076
cache size	: 16384 KB
physical id	: 15
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 15
initial apicid	: 15
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities
vmx flags	: vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid shadow_vmcs pml
bugs		: spectre_v1 spectre_v2 spec_store_bypass swapgs taa
bogomips	: 4190.15
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

(In reply to Richard W.M. Jones from comment #3)
> Definitely agree we'd want to see the L0 qemu command and the other
> information.
> Also be good to have exact kernel and qemu versions of each layer.
> 
> I believe the scenario is:
> 
>   L0 : RHEL 8.5 (AV or non-AV?)

Yes, 8.5 (not sure what AV is in this context?)
 
>   L1 : Fedora 35      <-- qemu command from comment 0 runs here

Yes
 
>   L2 : libguestfs appliance with same Fedora 35 kernel as L1

yes. Or more importantly qemu from pungi for making images. 

L0: 

qemu-kvm-4.2.0-59.module+el8.5.0+12817+cb650d43.x86_64
kernel-4.18.0-348.el8.x86_64

L1:

kernel-5.14.16-301.fc35.x86_64
qemu-kvm-6.1.0-10.fc35 (broken)
qemu-kvm-5.2.0-8.fc34.x86_64 (works)

Comment 6 Vitaly Kuznetsov 2021-11-12 14:48:32 UTC
Thanks for the info!

QEMU's error:

"qemu-kvm: error: failed to set MSR 0x345 to 0x2000"

is likely the culprit. MSR 0x345 is MSR_IA32_PERF_CAPABILITIES. '0x2000' is 'full width counting'.
Support for the feature was added in Linux-5.8 (see commit full width counting) and QEMU-5.1 (see
commit ea39f9b643959). 'pdcm' flag in /proc/cpuinfo indicates the presence of the feature and as we
can see, both L0 and L1 have it.
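
As a rough illustration of the values mentioned above, here is a minimal sketch that decodes an MSR_IA32_PERF_CAPABILITIES value and checks /proc/cpuinfo-style text for the 'pdcm' flag. The bit layout follows the Intel SDM as I understand it; the function names are made up for this example and are not part of QEMU or KVM.

```python
# Sketch only: helper names are hypothetical, not QEMU/KVM APIs.
MSR_IA32_PERF_CAPABILITIES = 0x345
PERF_CAP_FW_WRITES = 1 << 13  # "full width counting" (full-width counter writes)


def decode_perf_capabilities(value: int) -> dict:
    """Split an IA32_PERF_CAPABILITIES value into its documented fields."""
    return {
        "lbr_format": value & 0x3F,                # bits 5:0
        "pebs_trap": bool(value & (1 << 6)),       # bit 6
        "pebs_arch_regs": bool(value & (1 << 7)),  # bit 7
        "pebs_record_format": (value >> 8) & 0xF,  # bits 11:8
        "freeze_while_smm": bool(value & (1 << 12)),
        "full_width_writes": bool(value & PERF_CAP_FW_WRITES),
    }


def cpu_flags(cpuinfo_text: str) -> set:
    """Return the 'flags' set of the first processor in cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, val = line.partition(":")
        # Match only the plain "flags" line, not "vmx flags".
        if sep and key.strip() == "flags":
            return set(val.split())
    return set()


if __name__ == "__main__":
    print(decode_perf_capabilities(0x2000))
    with open("/proc/cpuinfo") as f:
        print("pdcm" in cpu_flags(f.read()))
```

Run inside L1, the last line would show whether the guest kernel sees 'pdcm', which is the same check as reading the flags line by eye.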

Looking at KVM code, write to MSR_IA32_PERF_CAPABILITIES is denied when guest (L2 in our case)
CPU doesn't have X86_FEATURE_PDCM exposed. L2's QEMU command line looks like

"-cpu max"

Maybe this doesn't expose pdcm? How hard would it be to modify this to '-cpu max,pdcm=on' ?

Comment 7 Daniel Berrangé 2021-11-12 14:57:24 UTC
(In reply to Vitaly Kuznetsov from comment #6)
> Maybe this doesn't expose pdcm? How hard would it be to modify this to '-cpu
> max,pdcm=on' ?

That doesn't make sense as a question. The 'max' CPU model is defined & implemented as exposing *all* the features supported by the accelerator (kvm or tcg). In KVM case 'max' is identical to 'host'. In TCG case 'max' is simply everything TCG supports. So if some feature isn't exposed, it means it is either not implemented in TCG, or not available from the KVM kmod.

Comment 8 Richard W.M. Jones 2021-11-12 15:54:51 UTC
(In reply to Vitaly Kuznetsov from comment #6)
> Maybe this doesn't expose pdcm? How hard would it be to modify this to '-cpu
> max,pdcm=on' ?

To answer just this question, you can't test it with libguestfs directly,
but one way to test it would be to run the following commands (in L1):

$ libguestfs-test-tool
$ cd /var/tmp/.guestfs-`id -u`/appliance.d
$ qemu-system-x86_64 -no-user-config -nodefaults -display none -no-reboot -machine accel=kvm -cpu max -m 1280 -kernel ./kernel -initrd ./initrd  -append 'panic=1 console=ttyS0' -serial stdio

vs

$ qemu-system-x86_64 -no-user-config -nodefaults -display none -no-reboot -machine accel=kvm -cpu max,pdcm=on -m 1280 -kernel ./kernel -initrd ./initrd  -append 'panic=1 console=ttyS0' -serial stdio

The first one is expected to fail with the MSRs assert fail.

If the second one starts to run the guest kernel, that would indicate
that the problem is fixed by adding pdcm=on.

Comment 9 Kevin Fenzi 2021-11-12 21:51:23 UTC
Both fail: 

# qemu-system-x86_64 -no-user-config -nodefaults -display none -no-reboot -machine accel=kvm -cpu max -m 1280 -kernel ./kernel -initrd ./initrd  -append 'panic=1 console=ttyS0' -serial stdio
qemu-system-x86_64: error: failed to set MSR 0x345 to 0x2000
qemu-system-x86_64: ../target/i386/kvm/kvm.c:2833: kvm_buf_set_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.
Aborted (core dumped)
# qemu-system-x86_64 -no-user-config -nodefaults -display none -no-reboot -machine accel=kvm -cpu max,pdcm=on -m 1280 -kernel ./kernel -initrd ./initrd -append 'panic=1 console=ttyS0' -serial stdio
qemu-system-x86_64: error: failed to set MSR 0x345 to 0x2000
qemu-system-x86_64: ../target/i386/kvm/kvm.c:2833: kvm_buf_set_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.
Aborted (core dumped)
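
For anyone collecting this output from many runs, the failing MSR and value can be pulled out of QEMU's stderr with a small sketch like the following. It assumes the message format shown above ("failed to set MSR 0x... to 0x..."); the function name is hypothetical.

```python
import re

# Matches the error line format seen in this report.
_MSR_ERROR = re.compile(r"failed to set MSR (0x[0-9a-fA-F]+) to (0x[0-9a-fA-F]+)")


def failed_msr_writes(stderr_text: str) -> list:
    """Return (msr_index, value) pairs for every failed MSR write reported."""
    return [
        (int(m.group(1), 16), int(m.group(2), 16))
        for m in _MSR_ERROR.finditer(stderr_text)
    ]
```

Both runs above would yield the same single pair, MSR 0x345 with value 0x2000, which is why adding pdcm=on made no visible difference.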

FWIW, the backtrace on the core is: 

                Stack trace of thread 415021:                                                       
                #0  0x00007fe6bfb5585c __pthread_kill_implementation (libc.so.6 + 0x8f85c)           
                #1  0x00007fe6bfb086b6 raise (libc.so.6 + 0x426b6)                                  
                #2  0x00007fe6bfaf27d3 abort (libc.so.6 + 0x2c7d3)                                  
                #3  0x00007fe6bfaf26fb __assert_fail_base.cold (libc.so.6 + 0x2c6fb)                
                #4  0x00007fe6bfb013a6 __assert_fail (libc.so.6 + 0x3b3a6)                          
                #5  0x000055da10105804 kvm_buf_set_msrs (qemu-system-x86_64 + 0x52b804)             
                #6  0x000055da10107e84 kvm_arch_init_vcpu (qemu-system-x86_64 + 0x52de84)            
                #7  0x000055da102688f3 kvm_init_vcpu (qemu-system-x86_64 + 0x68e8f3)                
                #8  0x000055da1026cb89 kvm_vcpu_thread_fn (qemu-system-x86_64 + 0x692b89)           
                #9  0x000055da103b2033 qemu_thread_start (qemu-system-x86_64 + 0x7d8033)             
                #10 0x00007fe6bfb53b17 start_thread (libc.so.6 + 0x8db17)                           
                #11 0x00007fe6bfbd86c0 __clone3 (libc.so.6 + 0x1126c0)                              
                                                                                                     
                Stack trace of thread 415017:                                                        
                #0  0x00007fe6bfb5077a __futex_abstimed_wait_common (libc.so.6 + 0x8a77a)           
                #1  0x00007fe6bfb52ef0 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x8cef0)         
                #2  0x000055da103b246d qemu_cond_wait_impl (qemu-system-x86_64 + 0x7d846d)          
                #3  0x000055da10189417 qemu_init_vcpu (qemu-system-x86_64 + 0x5af417)                
                #4  0x000055da1014011f x86_cpu_realizefn (qemu-system-x86_64 + 0x56611f)            
                #5  0x000055da1028bc2d device_set_realized (qemu-system-x86_64 + 0x6b1c2d)          
                #6  0x000055da1028e79a property_set_bool (qemu-system-x86_64 + 0x6b479a)            
                #7  0x000055da1029153c object_property_set (qemu-system-x86_64 + 0x6b753c)          
                #8  0x000055da10294b94 object_property_set_qobject (qemu-system-x86_64 + 0x6bab94)  
                #9  0x000055da10291b79 object_property_set_bool (qemu-system-x86_64 + 0x6b7b79)     
                #10 0x000055da10117005 x86_cpu_new (qemu-system-x86_64 + 0x53d005)                   
                #11 0x000055da101170ee x86_cpus_init (qemu-system-x86_64 + 0x53d0ee)                 
                #12 0x000055da1011b53d pc_init1.constprop.0 (qemu-system-x86_64 + 0x54153d)          
                #13 0x000055da1000b263 machine_run_board_init (qemu-system-x86_64 + 0x431263)       
                #14 0x000055da101a5a09 qmp_x_exit_preconfig.part.0 (qemu-system-x86_64 + 0x5cba09)  
                #15 0x000055da101a9737 qemu_init (qemu-system-x86_64 + 0x5cf737)                    
                #16 0x000055da0ff33e7d main (qemu-system-x86_64 + 0x359e7d)                         
                #17 0x00007fe6bfaf3560 __libc_start_call_main (libc.so.6 + 0x2d560)                 
                #18 0x00007fe6bfaf360c __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x2d60c)          
                #19 0x000055da0ff37bc5 _start (qemu-system-x86_64 + 0x35dbc5)                       
                                                                                                     
                Stack trace of thread 415018:                                                        
                #0  0x00007fe6bfb9e3b5 clock_nanosleep.5 (libc.so.6 + 0xd83b5)             
                #1  0x00007fe6bfba2fb7 __nanosleep (libc.so.6 + 0xdcfb7)                            
                #2  0x00007fe6c002f6a7 g_usleep (libglib-2.0.so.0 + 0x7f6a7)                        
                #3  0x000055da103bbbb2 call_rcu_thread (qemu-system-x86_64 + 0x7e1bb2)              
                #4  0x000055da103b2033 qemu_thread_start (qemu-system-x86_64 + 0x7d8033)             
                #5  0x00007fe6bfb53b17 start_thread (libc.so.6 + 0x8db17)                           
                #6  0x00007fe6bfbd86c0 __clone3 (libc.so.6 + 0x1126c0)

Happy to gather additional info or try more things. Thanks.

Comment 10 Vitaly Kuznetsov 2021-11-15 17:51:30 UTC
The weird thing here is that 'pdcm' shouldn't be in L1 cpu flags.

QEMU command line for L1 is:

... cpu Cascadelake-Server,ss=on,vmx=on,pdcm=on,hypervisor=on,tsc-adjust=on,umip=on,pku=on,md-clear=on,stibp=on,arch-capabilities=on,xsaves=on,ibpb=on,ibrs=on,amd-stibp=on,amd-ssbd=on,rdctl-no=on,ibrs-all=on,skip-l1dfl-vmentry=on,mds-no=on,pschange-mc-no=on,tsx-ctrl=on,hle=off,rtm=off ...

and while we see explicit 'pdcm=on', QEMU has the following code

cpu_x86_cpuid():

     case 1:
...
         if (!cpu->enable_pmu) {
             *ecx &= ~CPUID_EXT_PDCM;
         }
...

'enable_pmu' is false by default and there's no 'pmu=on' on the command line above.
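
To make the effect of that snippet concrete, here is a small model of the masking step. It is a sketch of the logic quoted above, not QEMU code: the function name is invented, and the only constant used is the PDCM position in CPUID.01H:ECX (bit 15).

```python
CPUID_EXT_PDCM = 1 << 15  # CPUID.01H:ECX bit 15 (Perfmon and Debug Capability)


def guest_cpuid_1_ecx(ecx: int, enable_pmu: bool) -> int:
    """Model of the quoted check: hide PDCM unless the vPMU is enabled."""
    if not enable_pmu:
        ecx &= ~CPUID_EXT_PDCM
    return ecx


# With 'pdcm=on' but no 'pmu=on', the bit is expected to be cleared.
requested = CPUID_EXT_PDCM | 0x1
print(hex(guest_cpuid_1_ecx(requested, enable_pmu=False)))
print(hex(guest_cpuid_1_ecx(requested, enable_pmu=True)))
```

Under this model a guest started with 'pdcm=on' and no 'pmu=on' should not see 'pdcm' at all, which is what makes the L1 flags in this report surprising.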

What's even more weird is that QEMU actually works as expected for me. Without "pmu=on", there's no
'pdcm' in guest's /proc/cpuinfo even with explicit 'pdcm=on' (our case). I'm certainly missing
something important here to be able to reproduce the problem.

What's the exact RHEL8.5 QEMU version in L0? I'll try to recreate the exact setup then.

Comment 11 Kevin Fenzi 2021-11-15 18:02:09 UTC
qemu-kvm-4.2.0-59.module+el8.5.0+12817+cb650d43.x86_64

Comment 12 Vitaly Kuznetsov 2021-11-15 18:20:18 UTC
Oh, I see, it's not from RHEL-AV (advanced virtualization), it's from plain RHEL.

Sad story is: QEMU gained support for 'pdcm' feature bit long time ago (e117f7725af84)
but it wasn't until QEMU-5.1 when support for MSR_IA32_PERF_CAPABILITIES was added (ea39f9b643)
so when qemu-4.2 is used to create L1, MSR_IA32_PERF_CAPABILITIES MSR is actually unsupported
but the feature bit for it is set. Newer QEMU in L1 tries to add MSR_IA32_PERF_CAPABILITIES for
L2 but KVM in L1 wants to access non-existent MSR_IA32_PERF_CAPABILITIES.
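
The mismatch can be sketched as a simple check on the L0 side: the feature bit is visible in L1, but the L0 QEMU predates the MSR support. This is only a model under assumptions. It compares the upstream version embedded in the package name against 5.1 and ignores downstream backports, and the helper names are hypothetical.

```python
import re


def qemu_version(nvr: str) -> tuple:
    """Extract (major, minor, patch) from a package name like 'qemu-kvm-4.2.0-59...'."""
    m = re.search(r"qemu-kvm-(\d+)\.(\d+)\.(\d+)", nvr)
    if m is None:
        raise ValueError(f"cannot parse QEMU version from {nvr!r}")
    return tuple(int(part) for part in m.groups())


def has_perf_capabilities_msr(nvr: str) -> bool:
    """Assumption: MSR_IA32_PERF_CAPABILITIES support arrived in QEMU 5.1."""
    return qemu_version(nvr)[:2] >= (5, 1)


def pdcm_bit_without_msr(l0_nvr: str, l1_sees_pdcm: bool) -> bool:
    """True when L1 is told it has PDCM but L0 QEMU cannot back the MSR."""
    return l1_sees_pdcm and not has_perf_capabilities_msr(l0_nvr)


l0 = "qemu-kvm-4.2.0-59.module+el8.5.0+12817+cb650d43.x86_64"
print(pdcm_bit_without_msr(l0, l1_sees_pdcm=True))
```

This model covers only the L0 side of the problem; it does not say anything about which QEMU versions in L1 trip over the inconsistency.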

Now the question is what can we do about it. Immediate solutions:

1) Drop 'pdcm=on' from QEMU command line in L1
2) Add 'pdcm=off' to QEMU command line in L2
3) Use newer QEMU (probably from 'advanced virt' module) in L0.

No.3 actually makes a lot of sense as nested virt can be broken in multiple other places with QEMU-4.2,
afaiu nobody probably tests it.
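
For options 1 and 2, the change amounts to editing the '-cpu' argument string. A minimal sketch of that edit follows, with invented helper names; it handles only the 'name=value' property form, not the legacy '+feature'/'-feature' syntax. Whether either option avoids the assertion was not verified in this report.

```python
def drop_cpu_property(cpu_arg: str, name: str) -> str:
    """Remove every 'name=...' property from a '-cpu model,prop=val,...' string."""
    model, *props = cpu_arg.split(",")
    kept = [p for p in props if p.split("=", 1)[0] != name]
    return ",".join([model, *kept])


def set_cpu_property(cpu_arg: str, name: str, value: str) -> str:
    """Set 'name=value', replacing any earlier setting of the same property."""
    return f"{drop_cpu_property(cpu_arg, name)},{name}={value}"


# Option 1: drop 'pdcm=on' from the CPU definition used for L1 (shortened here).
print(drop_cpu_property("Cascadelake-Server,ss=on,vmx=on,pdcm=on,hypervisor=on", "pdcm"))
# Option 2: add 'pdcm=off' to the CPU definition used for L2.
print(set_cpu_property("max", "pdcm", "off"))
```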

What I'm still puzzled about is why 'qemu-kvm-5.2.0-8.fc34.x86_64' in L1 works. Are you using the same
kernel in L1?

Comment 13 Kevin Fenzi 2021-11-18 20:24:45 UTC
(In reply to Vitaly Kuznetsov from comment #12)
> Oh, I see, it's not from RHEL-AV (advanced virtualization), it's from plain
> RHEL.

Yeah. In the past we used qemu from some other channel, but I thought it didn't matter in 8 anymore. I guess it does. ;( 
 
> Sad story is: QEMU gained support for 'pdcm' feature bit long time ago
> (e117f7725af84)
> but it wasn't until QEMU-5.1 when support for MSR_IA32_PERF_CAPABILITIES was
> added (ea39f9b643)
> so when qemu-4.2 is used to create L1, MSR_IA32_PERF_CAPABILITIES MSR is
> actually unsupported
> but the feature bit for it is set. Newer QEMU in L1 tries to add
> MSR_IA32_PERF_CAPABILITIES for
> L2 but KVM in L1 wants to access non-existent MSR_IA32_PERF_CAPABILITIES.
> 
> Now the question is what can we do about it. Immediate solutions:
> 
> 1) Drop 'pdcm=on' from QEMU command line in L1
> 2) Add 'pdcm=off' to QEMU command line in L2
> 3) Use newer QEMU (probably from 'advanced virt' module) in L0.
> 
> No.3 actually makes a lot of sense as nested virt can be broken in multiple
> other places with QEMU-4.2,
> afaiu nobody probably tests it.

Yeah, ok, I can look at moving to that. 
 
> What I'm still puzzled about is why 'qemu-kvm-5.2.0-8.fc34.x86_64' in L1
> works. Are you using the same
> kernel in L1?

L1 is f35: 5.14.16-301.fc35.x86_64

Comment 14 Kevin Fenzi 2021-12-02 00:05:13 UTC
I've switched the L0/rhel8.5/hypervisor box to use the virt packages from the advanced virt 8.3 stream and can confirm that the L1 guest passes libguestfs-test fine now. 
Will see how it does tonight with a rawhide compose.

Comment 15 Ben Cotton 2022-11-29 17:17:22 UTC
This message is a reminder that Fedora Linux 35 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 35 on 2022-12-13.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '35'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 35 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 16 Ben Cotton 2022-12-13 15:49:55 UTC
Fedora Linux 35 entered end-of-life (EOL) status on 2022-12-13.

Fedora Linux 35 is no longer maintained, which means that it
will not receive any further security or bug fix updates. As a result we
are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux
please feel free to reopen this bug against that version. Note that the version
field may be hidden. Click the "Show advanced fields" button if you do not see
the version field.

If you are unable to reopen this bug, please file a new report against an
active release.

Thank you for reporting this bug and we are sorry it could not be fixed.

