Bug 2209965

Summary: libguestfs-test-tool hangs with Intel Xeon Processor (Icelake) cpu model: KVM: entry failed, hardware error 0x8
Product: Red Hat Enterprise Linux 9 Reporter: YongkuiGuo <yoguo>
Component: qemu-kvm Assignee: Virtualization Maintenance <virt-maint>
qemu-kvm sub component: CPU Models QA Contact: yduan
Status: CLOSED INSUFFICIENT_DATA Docs Contact:
Severity: medium    
Priority: unspecified CC: ailan, bdas, coli, eesposit, huzhao, imammedo, jinzhao, juzhang, lersek, nilal, peterx, qcheng, qzhang, rjones, tweining, virt-maint, vkuznets, yduan, ymao, zhguo
Version: 9.3   
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-08-03 09:01:30 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description     Flags
qemu command    none

Description YongkuiGuo 2023-05-25 10:55:15 UTC
Description of problem:
libguestfs-test-tool hangs in a VM in an OpenStack environment when the CPU model is Intel Xeon Processor (Icelake).

$ libguestfs-test-tool
     ************************************************************
     *                    IMPORTANT NOTICE
     *
     * When reporting bugs, include the COMPLETE, UNEDITED
     * output below in your bug report.
     *
     ************************************************************
PATH=/home/cloud-user/.local/bin:/home/cloud-user/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
XDG_RUNTIME_DIR=/run/user/1000
SELinux: Enforcing
guestfs_get_append: (null)
guestfs_get_autosync: 1
guestfs_get_backend: libvirt
guestfs_get_backend_settings: []
guestfs_get_cachedir: /var/tmp
guestfs_get_hv: /usr/libexec/qemu-kvm
guestfs_get_memsize: 1280
guestfs_get_network: 0
guestfs_get_path: /usr/lib64/guestfs
guestfs_get_pgroup: 0
guestfs_get_program: libguestfs-test-tool
guestfs_get_recovery_proc: 1
guestfs_get_smp: 1
guestfs_get_sockdir: /run/user/1000
guestfs_get_tmpdir: /tmp
guestfs_get_trace: 0
guestfs_get_verbose: 1
host_cpu: x86_64
Launching appliance, timeout set to 600 seconds.
libguestfs: launch: program=libguestfs-test-tool
libguestfs: launch: version=1.50.1rhel=9,release=4.el9,libvirt
libguestfs: launch: backend registered: direct
libguestfs: launch: backend registered: libvirt
libguestfs: launch: backend=libvirt
libguestfs: launch: tmpdir=/tmp/libguestfsBkGgeK
libguestfs: launch: umask=0022
libguestfs: launch: euid=1000
libguestfs: libvirt version = 9003000 (9.3.0)
libguestfs: guest random name = guestfs-g04zn61ghazs33su
libguestfs: connect to libvirt
libguestfs: opening libvirt handle: URI = qemu:///session, auth = default+wrapper, flags = 0
libguestfs: successfully opened libvirt handle: conn = 0x55920ba11030
libguestfs: qemu version (reported by libvirt) = 8000000 (8.0.0)
libguestfs: get libvirt capabilities
libguestfs: parsing capabilities XML
libguestfs: parsing domcapabilities XML
libguestfs: build appliance
libguestfs: begin building supermin appliance
libguestfs: run supermin
libguestfs: command: run: /usr/bin/supermin
libguestfs: command: run: \ --build
libguestfs: command: run: \ --verbose
libguestfs: command: run: \ --if-newer
libguestfs: command: run: \ --lock /var/tmp/.guestfs-1000/lock
libguestfs: command: run: \ --copy-kernel
libguestfs: command: run: \ -f ext2
libguestfs: command: run: \ --host-cpu x86_64
libguestfs: command: run: \ /usr/lib64/guestfs/supermin.d
libguestfs: command: run: \ -o /var/tmp/.guestfs-1000/appliance.d
supermin: version: 5.3.3
supermin: rpm: detected RPM version 4.16
supermin: rpm: detected RPM architecture x86_64
supermin: package handler: fedora/rpm
supermin: acquiring lock on /var/tmp/.guestfs-1000/lock
supermin: build: /usr/lib64/guestfs/supermin.d
supermin: reading the supermin appliance
supermin: build: visiting /usr/lib64/guestfs/supermin.d/base.tar.gz type gzip base image (tar)
supermin: build: visiting /usr/lib64/guestfs/supermin.d/daemon.tar.gz type gzip base image (tar)
supermin: build: visiting /usr/lib64/guestfs/supermin.d/excludefiles type uncompressed excludefiles
supermin: build: visiting /usr/lib64/guestfs/supermin.d/hostfiles type uncompressed hostfiles
supermin: build: visiting /usr/lib64/guestfs/supermin.d/init.tar.gz type gzip base image (tar)
supermin: build: visiting /usr/lib64/guestfs/supermin.d/packages type uncompressed packages
supermin: build: visiting /usr/lib64/guestfs/supermin.d/udev-rules.tar.gz type gzip base image (tar)
supermin: mapping package names to installed packages
supermin: resolving full list of package dependencies
supermin: build: 189 packages, including dependencies
supermin: build: 32323 files
supermin: build: 8465 files, after matching excludefiles
supermin: build: 8476 files, after adding hostfiles
supermin: build: 8462 files, after removing unreadable files
supermin: build: 8487 files, after munging
supermin: kernel: looking for kernel using environment variables ...
supermin: kernel: looking for kernels in /lib/modules/*/vmlinuz ...
supermin: kernel: picked vmlinuz /lib/modules/5.14.0-316.el9.x86_64/vmlinuz
supermin: kernel: kernel_version 5.14.0-316.el9.x86_64
supermin: kernel: modpath /lib/modules/5.14.0-316.el9.x86_64
supermin: ext2: creating empty ext2 filesystem '/var/tmp/.guestfs-1000/appliance.d.t2lwt5zk/root'
supermin: ext2: populating from base image
supermin: ext2: copying files from host filesystem
supermin: warning: /usr/libexec/utempter/utempter: Permission denied (ignored)
Some distro files are not public readable, so supermin cannot copy them
into the appliance.  This is a problem with your Linux distro.  Please ask
your distro to stop doing pointless security by obscurity.
You can ignore these warnings.  You *do not* need to use sudo.
supermin: warning: /usr/sbin/unix_update: Permission denied (ignored)
supermin: warning: /var/lib/systemd/random-seed: Permission denied (ignored)
supermin: ext2: copying kernel modules
supermin: warning: /lib/modules/5.14.0-316.el9.x86_64/System.map: Permission denied (ignored)
supermin: ext2: creating minimal initrd '/var/tmp/.guestfs-1000/appliance.d.t2lwt5zk/initrd'
supermin: ext2: wrote 38 modules to minimal initrd
supermin: renaming /var/tmp/.guestfs-1000/appliance.d.t2lwt5zk to /var/tmp/.guestfs-1000/appliance.d
libguestfs: finished building supermin appliance
libguestfs: command: run: qemu-img --help | grep -sqE -- '\binfo\b.*-U\b'
libguestfs: command: run: qemu-img
libguestfs: command: run: \ info
libguestfs: command: run: \ -U
libguestfs: command: run: \ --output json
libguestfs: command: run: \ /var/tmp/.guestfs-1000/appliance.d/root
libguestfs: parse_json: qemu-img info JSON output:\n{\n    "children": [\n        {\n            "name": "file",\n            "info": {\n                "children": [\n                ],\n                "virtual-size": 4294967296,\n                "filename": "/var/tmp/.guestfs-1000/appliance.d/root",\n                "format": "file",\n                "actual-size": 298606592,\n                "format-specific": {\n                    "type": "file",\n                    "data": {\n                    }\n                },\n                "dirty-flag": false\n            }\n        }\n    ],\n    "virtual-size": 4294967296,\n    "filename": "/var/tmp/.guestfs-1000/appliance.d/root",\n    "format": "raw",\n    "actual-size": 298606592,\n    "dirty-flag": false\n}\n\n
libguestfs: command: run: qemu-img
libguestfs: command: run: \ create
libguestfs: command: run: \ -f qcow2
libguestfs: command: run: \ -o backing_file=/var/tmp/.guestfs-1000/appliance.d/root,backing_fmt=raw
libguestfs: command: run: \ /tmp/libguestfsBkGgeK/overlay2.qcow2
Formatting '/tmp/libguestfsBkGgeK/overlay2.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=4294967296 backing_file=/var/tmp/.guestfs-1000/appliance.d/root backing_fmt=raw lazy_refcounts=off refcount_bits=16
libguestfs: create libvirt XML
libguestfs: libvirt XML:\n<?xml version="1.0"?>\n<domain type="kvm" xmlns:qemu="http://libvirt.org/schemas/domain/qemu/1.0">\n  <name>guestfs-g04zn61ghazs33su</name>\n  <memory unit="MiB">1280</memory>\n  <currentMemory unit="MiB">1280</currentMemory>\n  <cpu mode="maximum">\n    <feature policy="disable" name="la57"/>\n  </cpu>\n  <vcpu>1</vcpu>\n  <clock offset="utc">\n    <timer name="rtc" tickpolicy="catchup"/>\n    <timer name="pit" tickpolicy="delay"/>\n    <timer name="hpet" present="no"/>\n  </clock>\n  <os>\n    <type machine="q35">hvm</type>\n    <kernel>/var/tmp/.guestfs-1000/appliance.d/kernel</kernel>\n    <initrd>/var/tmp/.guestfs-1000/appliance.d/initrd</initrd>\n    <cmdline>panic=1 console=ttyS0 edd=off udevtimeout=6000 udev.event-timeout=6000 no_timer_check printk.time=1 cgroup_disable=memory usbcore.nousb cryptomgr.notests tsc=reliable 8250.nr_uarts=1 root=UUID=23e8de0a-4a72-4ffc-b711-b66d736f097d selinux=0 guestfs_verbose=1 TERM=xterm-256color</cmdline>\n    <bios useserial="yes"/>\n  </os>\n  <on_reboot>destroy</on_reboot>\n  <devices>\n    <rng model="virtio">\n      <backend model="random">/dev/urandom</backend>\n    </rng>\n    <controller type="scsi" index="0" model="virtio-scsi"/>\n    <disk device="disk" type="file">\n      <source file="/tmp/libguestfsBkGgeK/scratch1.img"/>\n      <target dev="sda" bus="scsi"/>\n      <driver name="qemu" type="raw" cache="unsafe"/>\n      <address type="drive" controller="0" bus="0" target="0" unit="0"/>\n    </disk>\n    <disk type="file" device="disk">\n      <source file="/tmp/libguestfsBkGgeK/overlay2.qcow2"/>\n      <target dev="sdb" bus="scsi"/>\n      <driver name="qemu" type="qcow2" cache="unsafe"/>\n      <address type="drive" controller="0" bus="0" target="1" unit="0"/>\n    </disk>\n    <serial type="unix">\n      <source mode="connect" path="/run/user/1000/libguestfsc6FGM1/console.sock"/>\n      <target port="0"/>\n    </serial>\n    <channel type="unix">\n      <source mode="connect" 
path="/run/user/1000/libguestfsc6FGM1/guestfsd.sock"/>\n      <target type="virtio" name="org.libguestfs.channel.0"/>\n    </channel>\n    <controller type="usb" model="none"/>\n    <memballoon model="none"/>\n  </devices>\n  <qemu:commandline>\n    <qemu:env name="TMPDIR" value="/var/tmp"/>\n  </qemu:commandline>\n</domain>\n
libguestfs: command: run: ls
libguestfs: command: run: \ -a
libguestfs: command: run: \ -l
libguestfs: command: run: \ -R
libguestfs: command: run: \ -Z /var/tmp/.guestfs-1000
libguestfs: /var/tmp/.guestfs-1000:
libguestfs: total 4
libguestfs: drwxr-xr-x. 3 cloud-user cloud-user unconfined_u:object_r:user_tmp_t:s0   37 May 25 04:12 .
libguestfs: drwxrwxrwt. 8 root       root       system_u:object_r:tmp_t:s0          4096 May 25 04:12 ..
libguestfs: drwxr-xr-x. 2 cloud-user cloud-user unconfined_u:object_r:user_tmp_t:s0   46 May 25 04:12 appliance.d
libguestfs: -rw-r--r--. 1 cloud-user cloud-user unconfined_u:object_r:user_tmp_t:s0    0 May 25 04:12 lock
libguestfs:
libguestfs: /var/tmp/.guestfs-1000/appliance.d:
libguestfs: total 311140
libguestfs: drwxr-xr-x. 2 cloud-user cloud-user unconfined_u:object_r:user_tmp_t:s0         46 May 25 04:12 .
libguestfs: drwxr-xr-x. 3 cloud-user cloud-user unconfined_u:object_r:user_tmp_t:s0         37 May 25 04:12 ..
libguestfs: -rw-r--r--. 1 cloud-user cloud-user unconfined_u:object_r:user_tmp_t:s0    7569408 May 25 04:12 initrd
libguestfs: -rwxr-xr-x. 1 cloud-user cloud-user unconfined_u:object_r:user_tmp_t:s0   12429176 May 25 04:12 kernel
libguestfs: -rw-r--r--. 1 cloud-user cloud-user unconfined_u:object_r:user_tmp_t:s0 4294967296 May 25 04:12 root
libguestfs: command: run: ls
libguestfs: command: run: \ -a
libguestfs: command: run: \ -l
libguestfs: command: run: \ -Z /run/user/1000/libguestfsc6FGM1
libguestfs: total 0
libguestfs: drwx------. 2 cloud-user cloud-user unconfined_u:object_r:user_tmp_t:s0  80 May 25 04:12 .
libguestfs: drwx------. 5 cloud-user cloud-user unconfined_u:object_r:user_tmp_t:s0 120 May 25 04:12 ..
libguestfs: srwxr-xr-x. 1 cloud-user cloud-user unconfined_u:object_r:user_tmp_t:s0   0 May 25 04:12 console.sock
libguestfs: srwxr-xr-x. 1 cloud-user cloud-user unconfined_u:object_r:user_tmp_t:s0   0 May 25 04:12 guestfsd.sock
libguestfs: launch libvirt guest      <------- hang


$ cat ~/.cache/libvirt/qemu/log/guestfs-g04zn61ghazs33su.log
2023-05-25 08:12:45.952+0000: starting up libvirt version: 9.3.0, package: 2.el9 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2023-05-16-10:12:41, ), qemu version: 8.0.0qemu-kvm-8.0.0-3.el9, kernel: 5.14.0-316.el9.x86_64, hostname: yoguo-test
LC_ALL=C \
PATH=/home/cloud-user/.local/bin:/home/cloud-user/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin \
HOME=/home/cloud-user \
USER=cloud-user \
LOGNAME=cloud-user \
XDG_CACHE_HOME=/home/cloud-user/.config/libvirt/qemu/lib/domain-1-guestfs-g04zn61ghazs/.cache \
TMPDIR=/var/tmp \
/usr/libexec/qemu-kvm \
-name guest=guestfs-g04zn61ghazs33su,debug-threads=on \
-S \
-object '{"qom-type":"secret","id":"masterKey0","format":"raw","file":"/home/cloud-user/.config/libvirt/qemu/lib/domain-1-guestfs-g04zn61ghazs/master-key.aes"}' \
-machine pc-q35-rhel9.2.0,usb=off,dump-guest-core=off,memory-backend=pc.ram,graphics=off,hpet=off,acpi=off \
-accel kvm \
-cpu max,la57=off \
-m 1280 \
-object '{"qom-type":"memory-backend-ram","id":"pc.ram","size":1342177280}' \
-overcommit mem-lock=off \
-smp 1,sockets=1,cores=1,threads=1 \
-uuid 5a9a35ba-e32f-450c-adfb-c32f44644822 \
-display none \
-no-user-config \
-nodefaults \
-chardev socket,id=charmonitor,fd=22,server=on,wait=off \
-mon chardev=charmonitor,id=monitor,mode=control \
-rtc base=utc,driftfix=slew \
-global kvm-pit.lost_tick_policy=delay \
-no-shutdown \
-boot strict=on \
-kernel /var/tmp/.guestfs-1000/appliance.d/kernel \
-initrd /var/tmp/.guestfs-1000/appliance.d/initrd \
-append 'panic=1 console=ttyS0 edd=off udevtimeout=6000 udev.event-timeout=6000 no_timer_check printk.time=1 cgroup_disable=memory usbcore.nousb cryptomgr.notests tsc=reliable 8250.nr_uarts=1 root=UUID=23e8de0a-4a72-4ffc-b711-b66d736f097d selinux=0 guestfs_verbose=1 TERM=xterm-256color' \
-device '{"driver":"pcie-root-port","port":8,"chassis":1,"id":"pci.1","bus":"pcie.0","multifunction":true,"addr":"0x1"}' \
-device '{"driver":"pcie-root-port","port":9,"chassis":2,"id":"pci.2","bus":"pcie.0","addr":"0x1.0x1"}' \
-device '{"driver":"pcie-root-port","port":10,"chassis":3,"id":"pci.3","bus":"pcie.0","addr":"0x1.0x2"}' \
-device '{"driver":"pcie-root-port","port":11,"chassis":4,"id":"pci.4","bus":"pcie.0","addr":"0x1.0x3"}' \
-device '{"driver":"virtio-scsi-pci","id":"scsi0","bus":"pci.1","addr":"0x0"}' \
-device '{"driver":"virtio-serial-pci","id":"virtio-serial0","bus":"pci.2","addr":"0x0"}' \
-blockdev '{"driver":"file","filename":"/tmp/libguestfsBkGgeK/scratch1.img","node-name":"libvirt-2-storage","cache":{"direct":false,"no-flush":true},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-2-format","read-only":false,"cache":{"direct":false,"no-flush":true},"driver":"raw","file":"libvirt-2-storage"}' \
-device '{"driver":"scsi-hd","bus":"scsi0.0","channel":0,"scsi-id":0,"lun":0,"device_id":"drive-scsi0-0-0-0","drive":"libvirt-2-format","id":"scsi0-0-0-0","bootindex":1,"write-cache":"on"}' \
-blockdev '{"driver":"file","filename":"/var/tmp/.guestfs-1000/appliance.d/root","node-name":"libvirt-3-storage","cache":{"direct":false,"no-flush":true},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-3-format","read-only":true,"cache":{"direct":false,"no-flush":true},"driver":"raw","file":"libvirt-3-storage"}' \
-blockdev '{"driver":"file","filename":"/tmp/libguestfsBkGgeK/overlay2.qcow2","node-name":"libvirt-1-storage","cache":{"direct":false,"no-flush":true},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":false,"no-flush":true},"driver":"qcow2","file":"libvirt-1-storage","backing":"libvirt-3-format"}' \
-device '{"driver":"scsi-hd","bus":"scsi0.0","channel":0,"scsi-id":1,"lun":0,"device_id":"drive-scsi0-0-1-0","drive":"libvirt-1-format","id":"scsi0-0-1-0","write-cache":"on"}' \
-chardev socket,id=charserial0,path=/run/user/1000/libguestfsc6FGM1/console.sock \
-device '{"driver":"isa-serial","chardev":"charserial0","id":"serial0","index":0}' \
-chardev socket,id=charchannel0,path=/run/user/1000/libguestfsc6FGM1/guestfsd.sock \
-device '{"driver":"virtserialport","bus":"virtio-serial0.0","nr":1,"chardev":"charchannel0","id":"channel0","name":"org.libguestfs.channel.0"}' \
-audiodev '{"id":"audio1","driver":"none"}' \
-global ICH9-LPC.noreboot=off \
-watchdog-action reset \
-object '{"qom-type":"rng-random","id":"objrng0","filename":"/dev/urandom"}' \
-device '{"driver":"virtio-rng-pci","rng":"objrng0","id":"rng0","bus":"pci.3","addr":"0x0"}' \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-msg timestamp=on
2023-05-25 08:12:45.953+0000: Domain id=1 is tainted: custom-argv
KVM: entry failed, hardware error 0x8
EAX=00000000 EBX=00000000 ECX=00000000 EDX=00080660
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
EIP=0000fff0 EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 0000ffff 00009300
CS =f000 ffff0000 0000ffff 00009b00
SS =0000 00000000 0000ffff 00009300
DS =0000 00000000 0000ffff 00009300
FS =0000 00000000 0000ffff 00009300
GS =0000 00000000 0000ffff 00009300
LDT=0000 00000000 0000ffff 00008200
TR =0000 00000000 0000ffff 00008b00
GDT=     00000000 0000ffff
IDT=     00000000 0000ffff
CR0=60000010 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=04 66 41 eb f1 66 83 c9 ff 66 89 c8 66 5b 66 5e 66 5f 66 c3 <ea> 5b e0 00 f0 30 36 2f 32 33 2f 39 39 00 fc 00 ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
2023-05-25T08:13:15.885647Z qemu-kvm: terminating on signal 15 from pid 3101 (<unknown process>)
2023-05-25 08:13:16.086+0000: shutting down, reason=destroyed


$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 57 bits virtual
  Byte Order:            Little Endian
CPU(s):                  2
  On-line CPU(s) list:   0,1
Vendor ID:               GenuineIntel
  Model name:            Intel Xeon Processor (Icelake)
    CPU family:          6
    Model:               134
    Thread(s) per core:  1
    Core(s) per socket:  1
    Socket(s):           2
    Stepping:            0
    BogoMIPS:            4589.21
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdt
                         scp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic
                          movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single
                         ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erm
                         s invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xge
                         tbv1 xsaves wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpop
                         cntdq la57 md_clear arch_capabilities
Virtualization features:
  Virtualization:        VT-x
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):    
  L1d:                   64 KiB (2 instances)
  L1i:                   64 KiB (2 instances)
  L2:                    8 MiB (2 instances)
  L3:                    32 MiB (2 instances)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0,1
Vulnerabilities:        
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected


Version-Release number of selected component (if applicable):
libguestfs-1.50.1-4.el9.x86_64
kernel-5.14.0-316.el9.x86_64
qemu-kvm-common-8.0.0-3.el9.x86_64
qemu-kvm-core-8.0.0-3.el9.x86_64
libvirt-libs-9.3.0-2.el9.x86_64


How reproducible:
100%


Steps:

1. Create a VM on OpenStack env with the latest rhel9.3 nightly compose
2. Run libguestfs-test-tool

Actual results:
As above

Expected results:
libguestfs-test-tool works.

Additional info:
1. The issue does not occur with the AMD EPYC-Rome Processor CPU model in the OpenStack environment.

Comment 1 Laszlo Ersek 2023-05-25 11:04:21 UTC
(In reply to YongkuiGuo from comment #0)

> Steps:
> 
> 1. Create a VM on OpenStack env with the latest rhel9.3 nightly compose
> 2. Run libguestfs-test-tool
> 
> [...]
> 
> Expected results:
> libguestfs-test-tool works.

No, libguestfs has never been expected to work in L2, in a nested virtualization setup (when L1 is a KVM guest and L2, i.e. the libguestfs appliance, is also a KVM guest). Such attempts have always been defeated by non-deterministic nested virtualization bugs.

If you insist, you can reassign to kernel/KVM. The problem is potentially triggered by specifying "-cpu max" on the qemu command line that is invoked in L1, while the L0 CPU is Icelake.

Comment 2 Richard W.M. Jones 2023-05-25 11:15:52 UTC
Interestingly this is 16/32 bit code:

0:  04 66                   add    al,0x66
2:  41                      inc    ecx
3:  eb f1                   jmp    0xfffffff6
5:  66 83 c9 ff             or     cx,0xffff
9:  66 89 c8                mov    ax,cx
c:  66 5b                   pop    bx
e:  66 5e                   pop    si
10: 66 5f                   pop    di
12: 66 c3                   retw
14: ea 5b e0 00 f0 30 36    jmp    0x3630:0xf000e05b
1b: 2f                      das
1c: 32 33                   xor    dh,BYTE PTR [ebx]
1e: 2f                      das
1f: 39 39                   cmp    DWORD PTR [ecx],edi
21: 00 fc                   add    ah,bh

> No, libguestfs has never been expected to work in L2

My reading is that this would be an L1 guest, with the libguestfs appliance running as L2,
which is something we do try to support (albeit, as you say, subject to various and multiple
nested virtualization bugs).

Comment 5 Laszlo Ersek 2023-05-25 11:38:37 UTC
(In reply to Richard W.M. Jones from comment #2)
> Interestingly this is 16/32 bit code:
> 
> 14: ea 5b e0 00 f0 30 36    jmp    0x3630:0xf000e05b

Hm. This raises vague memories.

According to the Intel SDM, this seems like "JMP ptr16:32", "Jump far, absolute, address given in operand"; valid in compat/legacy mode, invalid in 64-bit mode.

This reminds me of early Linux boot code that performs the CPU mode switches, for entering long mode.

Something something about "unrestricted guest" support. This kind of jump had been very problematic in OVMF on KVM years ago (OVMF switches CPU modes), kept triggering KVM emulation failures. I don't remember the details.

Check "/sys/module/kvm_intel/parameters/unrestricted_guest" perhaps? If the processor lacks unrestricted guest support, I expect it will just not work; otherwise it should work fine. Seems like a pretty modern PCPU, so I'm unsure.

Comment 6 Richard W.M. Jones 2023-05-25 11:41:19 UTC
> Check "/sys/module/kvm_intel/parameters/unrestricted_guest" perhaps?

We don't have access to the L0 host (and are unlikely to get it).  However,
in L1, where we run libguestfs, that file is:

$ cat /sys/module/kvm_intel/parameters/unrestricted_guest
Y
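
For anyone repeating this check on another L1 guest, the following is a minimal sketch in Python that reads several kvm_intel module parameters at once. The helper name read_kvm_intel_params is hypothetical, and the set of parameters queried (nested, ept, unrestricted_guest) is only an assumption about what is useful to record in a report; parameters that cannot be read are reported as unavailable rather than treated as errors.

```python
from pathlib import Path


def read_kvm_intel_params(names, base="/sys/module/kvm_intel/parameters"):
    """Return {name: value-or-None} for the requested module parameters.

    `base` is overridable so the helper can be exercised without kvm_intel
    loaded; None means the parameter file could not be read.
    """
    result = {}
    for name in names:
        try:
            result[name] = (Path(base) / name).read_text().strip()
        except OSError:
            result[name] = None
    return result


if __name__ == "__main__":
    # Illustrative selection only; any file in the parameters directory works.
    wanted = ["nested", "ept", "unrestricted_guest"]
    for name, value in read_kvm_intel_params(wanted).items():
        print(f"{name}: {value if value is not None else 'unavailable'}")
```

Note that this only describes the hypervisor level it runs on (L1 here); it says nothing about how L0 is configured.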

Comment 7 Richard W.M. Jones 2023-05-25 11:45:24 UTC
Created attachment 1966861 [details]
qemu command

Further information:

TCG works (not surprising).  Another indication that it's a problem with nested virt.

The qemu command is attached.

Comment 8 Laszlo Ersek 2023-05-25 11:46:23 UTC
Confusing:

(In reply to YongkuiGuo from comment #0)

> EIP=0000fff0 [...]
> [...]
> CS =f000 ffff0000 0000ffff 00009b00

this points to the reset vector where the BIOS is supposed to start... I wouldn't expect a far jump there (to a different code segment), before setting up segment descriptors. I'm unsure how reliable this register dump from QEMU is.

Comment 9 Richard W.M. Jones 2023-05-25 11:46:59 UTC
Also same error when running qemu directly:

/usr/libexec/qemu-kvm \
    -global virtio-blk-pci.scsi=off \
    -no-user-config \
    -nodefaults \
    -display none \
    -machine q35,accel=kvm:tcg,graphics=off \
    -cpu max,la57=off \
    -m 1280 \
    -no-reboot \
    -rtc driftfix=slew \
    -no-hpet \
    -global kvm-pit.lost_tick_policy=discard \
    -kernel /var/tmp/.guestfs-1000/appliance.d/kernel \
    -initrd /var/tmp/.guestfs-1000/appliance.d/initrd \
    -object rng-random,filename=/dev/urandom,id=rng0 \
    -device virtio-rng-pci,rng=rng0 \
    -device virtio-scsi-pci,id=scsi \
    -drive file=/tmp/libguestfsfGeYfb/scratch1.img,cache=unsafe,format=raw,id=hd0,if=none \
    -device scsi-hd,drive=hd0 \
    -drive file=/var/tmp/.guestfs-1000/appliance.d/root,snapshot=on,id=appliance,cache=unsafe,if=none \
    -device scsi-hd,drive=appliance \
    -device virtio-serial-pci \
    -serial stdio \
    -chardev socket,path=/run/user/1000/libguestfsHuYYGU/guestfsd.sock,id=channel0 \
    -device virtserialport,chardev=channel0,name=org.libguestfs.channel.0 \
    -append "panic=1 console=ttyS0 edd=off udevtimeout=6000 udev.event-timeout=6000 no_timer_check printk.time=1 cgroup_disable=memory usbcore.nousb cryptomgr.notests tsc=reliable 8250.nr_uarts=1 root=UUID=23e8de0a-4a72-4ffc-b711-b66d736f097d selinux=0 guestfs_verbose=1 TERM=xterm-256color"
qemu-kvm: -no-hpet: warning: -no-hpet is deprecated, use '-machine hpet=off' instead
KVM: entry failed, hardware error 0x8
EAX=00000000 EBX=00000000 ECX=00000000 EDX=00080660
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
EIP=0000fff0 EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 0000ffff 00009300
CS =f000 ffff0000 0000ffff 00009b00
SS =0000 00000000 0000ffff 00009300
DS =0000 00000000 0000ffff 00009300
FS =0000 00000000 0000ffff 00009300
GS =0000 00000000 0000ffff 00009300
LDT=0000 00000000 0000ffff 00008200
TR =0000 00000000 0000ffff 00008b00
GDT=     00000000 0000ffff
IDT=     00000000 0000ffff
CR0=60000010 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=04 66 41 eb f1 66 83 c9 ff 66 89 c8 66 5b 66 5e 66 5f 66 c3 <ea> 5b e0 00 f0 30 36 2f 32 33 2f 39 39 00 fc 00 ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??

Comment 10 Laszlo Ersek 2023-05-25 12:11:09 UTC
(In reply to Richard W.M. Jones from comment #2)

> 14: ea 5b e0 00 f0 30 36    jmp    0x3630:0xf000e05b
> 1b: 2f                      das
> 1c: 32 33                   xor    dh,BYTE PTR [ebx]

Wait, this disassembly is incorrect. The debugger / disassembler assumed an incorrect CPU mode, or it's a plain disassembler bug. The JMP instruction is actually

EA5BE000F0        jmp 0xf000:0xe05b

(i.e., ptr16:16, not ptr16:32), meaning a jump to 0xfe05b in real mode. And such a jump instruction is indeed valid as the first (and only) instruction that the reset vector points at.

The constant 0xe05b is present in SeaBIOS's "src/romlayout.S":

        ORG 0xe05b
entry_post:
        cmpl $0, %cs:HaveRunPost                // Check for resume/reboot
        jnz entry_resume
        ENTRY_INTO32 _cfunc32flat_handle_post   // Normal entry point

so that's the code we're trying to jump *to*, but the L2 domain crashes on that very first instruction, the jump at the reset vector. IOW, the CS:EIP info in the QEMU register dump is actually correct.
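
The decoding above can be double-checked by hand. The following is a small sketch in Python, not taken from any tool used in this bug: the helper names decode_far_jmp16 and real_mode_linear are hypothetical, and the input bytes are copied from the "Code=" line of the register dump. It assumes a 16-bit operand size, which is what real mode at the reset vector implies.

```python
import struct


def decode_far_jmp16(code: bytes):
    """Decode opcode EA as JMP ptr16:16 (16-bit operand size assumed).

    Returns (segment, offset). The encoding stores the 16-bit offset first,
    then the 16-bit segment, both little-endian.
    """
    if len(code) < 5 or code[0] != 0xEA:
        raise ValueError("not a JMP ptr16:16 encoding")
    offset, segment = struct.unpack_from("<HH", code, 1)
    return segment, offset


def real_mode_linear(segment: int, offset: int) -> int:
    """Real-mode address translation: segment * 16 + offset."""
    return (segment << 4) + offset


if __name__ == "__main__":
    # Bytes starting at the <ea> marker in the QEMU "Code=" dump.
    tail = bytes.fromhex("ea5be000f030362f32332f3939")

    segment, offset = decode_far_jmp16(tail)
    print(f"jmp {segment:#06x}:{offset:#06x} -> linear "
          f"{real_mode_linear(segment, offset):#x}")

    # The instruction itself sits at CS.base + EIP from the register dump.
    print(f"reset vector: {0xffff0000 + 0xfff0:#x}")

    # The 8 bytes after the 5-byte jump are printable ASCII, not code.
    print("trailing bytes:", tail[5:13].decode("ascii"))
```

The trailing bytes decode to the ASCII text 06/23/99, which looks like a BIOS date string rather than instructions. That would explain why a disassembler using a 32-bit operand size swallowed "30 36" into the jump target and then produced das/xor/cmp from the remainder.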

Comment 11 Laszlo Ersek 2023-05-25 12:20:31 UTC
hardware error 0x8 seems to be EXIT_REASON_NMI_WINDOW.

Comment 12 Richard W.M. Jones 2023-05-25 12:31:30 UTC
Setting kvm_intel.dump_invalid_vmcs=1 (in L1):

[18499.856522] VMCS 00000000b780e336, last attempted VM-entry on CPU 0
[18499.856526] *** Guest State ***
[18499.856527] CR0: actual=0x0000000000000030, shadow=0x0000000060000010, gh_mask=fffffffffffffff7
[18499.856530] CR4: actual=0x0000000000002040, shadow=0x0000000000000000, gh_mask=fffffffffffef871
[18499.856531] CR3 = 0x0000000000000000
[18499.856534] PDPTR0 = 0x0000000000000000  PDPTR1 = 0x0000000000000000
[18499.856538] PDPTR2 = 0x0000000000000000  PDPTR3 = 0x0000000000000000
[18499.856539] RSP = 0x0000000000000000  RIP = 0x000000000000fff0
[18499.856540] RFLAGS=0x00000002         DR7 = 0x0000000000000400
[18499.856545] Sysenter RSP=0000000000000000 CS:RIP=0000:0000000000000000
[18499.856550] CS:   sel=0xf000, attr=0x0009b, limit=0x0000ffff, base=0x00000000ffff0000
[18499.856557] DS:   sel=0x0000, attr=0x00093, limit=0x0000ffff, base=0x0000000000000000
[18499.856562] SS:   sel=0x0000, attr=0x00093, limit=0x0000ffff, base=0x0000000000000000
[18499.856568] ES:   sel=0x0000, attr=0x00093, limit=0x0000ffff, base=0x0000000000000000
[18499.856574] FS:   sel=0x0000, attr=0x00093, limit=0x0000ffff, base=0x0000000000000000
[18499.856579] GS:   sel=0x0000, attr=0x00093, limit=0x0000ffff, base=0x0000000000000000
[18499.856582] GDTR:                           limit=0x0000ffff, base=0x0000000000000000
[18499.856594] LDTR: sel=0x0000, attr=0x00082, limit=0x0000ffff, base=0x0000000000000000
[18499.856598] IDTR:                           limit=0x0000ffff, base=0x0000000000000000
[18499.856603] TR:   sel=0x0000, attr=0x0008b, limit=0x0000ffff, base=0x0000000000000000
[18499.856605] EFER= 0x0000000000000000
[18499.856607] PAT = 0x0007040600070406
[18499.856611] DebugCtl = 0x0000000000000000  DebugExceptions = 0x0000000000000000
[18499.856613] Interruptibility = 00000000  ActivityState = 00000000
[18499.856614] InterruptStatus = 0000
[18499.856617] *** Host State ***
[18499.856620] RIP = 0xffffffffc0a0e863  RSP = 0xff2b812dc0947cb0
[18499.856627] CS=0010 SS=0018 DS=0000 ES=0000 FS=0000 GS=0000 TR=0040
[18499.856629] FSBase=00007fe67bdff640 GSBase=ff1ce2653ba00000 TRBase=fffffe0000003000
[18499.856633] GDTBase=fffffe0000001000 IDTBase=fffffe0000000000
[18499.856637] CR0=0000000080050033 CR3=0000000103f2c003 CR4=0000000000773ef0
[18499.856642] Sysenter RSP=fffffe0000003000 CS:RIP=0010:ffffffffaa0015f0
[18499.856644] EFER= 0x0000000000000d01
[18499.856646] PAT = 0x0407050600070106
[18499.856647] *** Control State ***
[18499.856648] CPUBased=0xb5a06dfa SecondaryExec=0x000033eb TertiaryExec=0x0000000000000000
[18499.856649] PinBased=0x000000ff EntryControls=0000d1ff ExitControls=002befff
[18499.856652] ExceptionBitmap=00060042 PFECmask=00000000 PFECmatch=00000000
[18499.856653] VMEntry: intr_info=00000000 errcode=00000000 ilen=00000000
[18499.856654] VMExit: intr_info=00000000 errcode=00000000 ilen=00000000
[18499.856654]         reason=00000000 qualification=0000000000000000
[18499.856655] IDTVectoring: info=00000000 errcode=00000000
[18499.856657] TSC Offset = 0xffffd96342ad3040
[18499.856657] SVI|RVI = 00|00 TPR Threshold = 0x00
[18499.856660] APIC-access addr = 0x000000010776d000 virt-APIC addr = 0x0000000108c43000
[18499.856664] PostedIntrVec = 0xf2
[18499.856665] EPT pointer = 0x000000000122305e
[18499.856667] Virtual processor ID = 0x0001
[18503.601139] VMCS 00000000cd5f9037, last attempted VM-entry on CPU 0
[18503.601144] *** Guest State ***
[18503.601145] CR0: actual=0x0000000000000030, shadow=0x0000000060000010, gh_mask=fffffffffffffff7
[18503.601148] CR4: actual=0x0000000000002040, shadow=0x0000000000000000, gh_mask=fffffffffffef871
[18503.601149] CR3 = 0x0000000000000000
[18503.601152] PDPTR0 = 0x0000000000000000  PDPTR1 = 0x0000000000000000
[18503.601155] PDPTR2 = 0x0000000000000000  PDPTR3 = 0x0000000000000000
[18503.601155] RSP = 0x0000000000000000  RIP = 0x000000000000fff0
[18503.601157] RFLAGS=0x00000002         DR7 = 0x0000000000000400
[18503.601162] Sysenter RSP=0000000000000000 CS:RIP=0000:0000000000000000
[18503.601166] CS:   sel=0xf000, attr=0x0009b, limit=0x0000ffff, base=0x00000000ffff0000
[18503.601173] DS:   sel=0x0000, attr=0x00093, limit=0x0000ffff, base=0x0000000000000000
[18503.601178] SS:   sel=0x0000, attr=0x00093, limit=0x0000ffff, base=0x0000000000000000
[18503.601184] ES:   sel=0x0000, attr=0x00093, limit=0x0000ffff, base=0x0000000000000000
[18503.601189] FS:   sel=0x0000, attr=0x00093, limit=0x0000ffff, base=0x0000000000000000
[18503.601195] GS:   sel=0x0000, attr=0x00093, limit=0x0000ffff, base=0x0000000000000000
[18503.601198] GDTR:                           limit=0x0000ffff, base=0x0000000000000000
[18503.601204] LDTR: sel=0x0000, attr=0x00082, limit=0x0000ffff, base=0x0000000000000000
[18503.601207] IDTR:                           limit=0x0000ffff, base=0x0000000000000000
[18503.601213] TR:   sel=0x0000, attr=0x0008b, limit=0x0000ffff, base=0x0000000000000000
[18503.601215] EFER= 0x0000000000000000
[18503.601216] PAT = 0x0007040600070406
[18503.601219] DebugCtl = 0x0000000000000000  DebugExceptions = 0x0000000000000000
[18503.601221] Interruptibility = 00000000  ActivityState = 00000000
[18503.601222] InterruptStatus = 0000
[18503.601225] *** Host State ***
[18503.601228] RIP = 0xffffffffc0a0e863  RSP = 0xff2b812dc0c17ca0
[18503.601235] CS=0010 SS=0018 DS=0000 ES=0000 FS=0000 GS=0000 TR=0040
[18503.601237] FSBase=00007f93a6930640 GSBase=ff1ce2653ba00000 TRBase=fffffe0000003000
[18503.601240] GDTBase=fffffe0000001000 IDTBase=fffffe0000000000
[18503.601245] CR0=0000000080050033 CR3=0000000001316006 CR4=0000000000773ef0
[18503.601249] Sysenter RSP=fffffe0000003000 CS:RIP=0010:ffffffffaa0015f0
[18503.601251] EFER= 0x0000000000000d01
[18503.601253] PAT = 0x0407050600070106
[18503.601255] *** Control State ***
[18503.601255] CPUBased=0xb5a06dfa SecondaryExec=0x000033eb TertiaryExec=0x0000000000000000
[18503.601256] PinBased=0x000000ff EntryControls=0000d1ff ExitControls=002befff
[18503.601260] ExceptionBitmap=00060042 PFECmask=00000000 PFECmatch=00000000
[18503.601260] VMEntry: intr_info=00000000 errcode=00000000 ilen=00000000
[18503.601261] VMExit: intr_info=00000000 errcode=00000000 ilen=00000000
[18503.601262]         reason=00000000 qualification=0000000000000000
[18503.601262] IDTVectoring: info=00000000 errcode=00000000
[18503.601264] TSC Offset = 0xffffd9612628d62e
[18503.601265] SVI|RVI = 00|00 TPR Threshold = 0x00
[18503.601267] APIC-access addr = 0x000000000771c000 virt-APIC addr = 0x000000000114f000
[18503.601271] PostedIntrVec = 0xf2
[18503.601273] EPT pointer = 0x000000000117d05e
[18503.601275] Virtual processor ID = 0x0002

Comment 13 Laszlo Ersek 2023-05-25 12:33:34 UTC
The Intel SDM writes in appendix "VMX BASIC EXIT REASONS" (emphasis
mine):

> Every VM exit writes a 32-bit exit reason to the VMCS (see Section
> 21.9.1). *Certain VM-entry failures also do this* (see Section 23.7)
> [...]
>
> 8 -- NMI window. At the beginning of an instruction, there was no
>      virtual-NMI blocking; events were not blocked by MOV SS; and the
>      "NMI-window exiting" VM-execution control was 1.

Can we perhaps disable the "vnmi" kvm_intel module parameter in L1? Or
else (I assume: equivalently) remove the vmx-vnmi CPU model feature on
the QEMU command line?

... I've tried the former, it doesn't help.

Comment 14 Bandan Das 2023-05-25 14:34:18 UTC
Emanuele, any guesses on this one ? A long shot but could this be related to bug 2127128 even
though the error there was related to invalid control fields ?

Comment 15 Laszlo Ersek 2023-05-26 09:01:19 UTC
(In reply to Bandan Das from comment #14)
> Emanuele, any guesses on this one ? A long shot but could this be related to
> bug 2127128 even
> though the error there was related to invalid control fields ?

That bug leads to some other bugs that appear more fitting: bug 2103118, bug 2105408, bug 2099216.

What I find relatively annoying is that bug 2103118 -- which was originally encountered via virt-customize, and which produced identical symptoms, i.e., crashing at the SeaBIOS reset vector: <ea> 5b e0 00 f0 -- had never been *root-caused*. We only said "broken in 8.2, functional in 8.6, let's go shopping". We never located the actual fix! In such cases a reverse bisection is recommended, to see what precisely fixed the problem. If we had done that there, for RHEL-8, now we wouldn't be standing here, with our trousers around our ankles.
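
For illustration, a reverse bisection is an ordinary binary search with the roles swapped: it looks for the first build that is *fixed* rather than the first that is broken. The sketch below is hypothetical, not the procedure used on any of the bugs above. The build names and the `is_fixed` callback are assumptions; in practice `is_fixed` would boot the build and run the reproducer.

```python
from typing import Callable, Sequence


def first_fixed(builds: Sequence[str], is_fixed: Callable[[str], bool]) -> str:
    """Return the earliest build for which is_fixed() is true.

    Assumes builds are ordered oldest to newest, builds[0] is known broken,
    builds[-1] is known fixed, and every build after the fix stays fixed.
    """
    lo, hi = 0, len(builds) - 1  # lo: known broken, hi: known fixed
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_fixed(builds[mid]):
            hi = mid  # fix is at mid or earlier
        else:
            lo = mid  # fix landed after mid
    return builds[hi]


if __name__ == "__main__":
    # Hypothetical build list; pretend the fix landed in "build-6".
    builds = [f"build-{i}" for i in range(10)]
    print(first_fixed(builds, lambda b: int(b.split("-")[1]) >= 6))
```

Each probe halves the range, so even a few hundred candidate builds need under ten test runs.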

Comment 16 Richard W.M. Jones 2023-05-26 09:49:33 UTC
I agree, and added a comment to the other bug to alert people that the bug
has probably not been fixed.

Comment 17 Qinghua Cheng 2023-05-26 14:33:25 UTC
Latest rhel9.3 nightly compose is the L1 guest vm ? How about the host version? 

Does the workaround listed in the comment https://bugzilla.redhat.com/show_bug.cgi?id=2103118#c50 work?

Comment 18 Richard W.M. Jones 2023-05-27 06:50:05 UTC
In L1:
host kernel = 5.14.0-316.el9.x86_64
qemu = 8.0.0-3.el9
seabios = 1.16.1-1.el9

I don't think we have any information about the L0 host, but maybe Yongkui has access.

The workaround probably only applies to OpenStack, but we're not using OpenStack to
launch the L2 VM.  The L2 VM is started with -cpu max so it'll have all features
potentially enabled.  If there is a particular CPU feature implicated then we could
try modifying the -cpu parameter if you can give us guidance.

Comment 19 YongkuiGuo 2023-05-29 02:40:34 UTC
(In reply to Qinghua Cheng from comment #17)
> Latest rhel9.3 nightly compose is the L1 guest VM ? 

Yes.

> How about the host version? 

I created this VM on our OpenStack env, and I have no permission to access the L0 host.

Comment 21 YongkuiGuo 2023-06-01 10:38:04 UTC
OpenStack      flavor                     L1 guest CPU model                       L1 guest           libguestfs-test-tool      
--------------------------------------------------------------------------------------------------------------------------  
rhos-d         ci.standard.medium         Intel Xeon Processor (Icelake)           RHEL9.3 nightly      failed
rhos-d         ci.nested.virt.m1.medium   Intel Xeon Processor (Skylake, IBRS)     RHEL9.3 nightly      passed
rhos-01        ci.standard.medium         AMD EPYC-Rome Processor                  RHEL9.3 nightly      passed
--------------------------------------------------------------------------------------------------------------------------

1. In short, libguestfs-test-tool fails when the ci.xxx (excluding ci.nested.xxx) flavor is used and the L1 guest CPU model is Icelake in rhos-d OpenStack env. 
2. According to these docs[1][2], nested Virtualization is enabled on ci.nested.xxx flavor but disabled on ci.xxx flavor in rhos-d.
 
[1] https://docs.engineering.redhat.com/pages/viewpage.action?spaceKey=KB&title=PSI+OpenStack+Onboarding#PSIOpenStackOnboarding-NestedVirtualizationinRHOS-D
[2] https://docs.engineering.redhat.com/display/KB/PSI+OpenStack+Use+Cases+-+Aggregates

Comment 23 Peter Xu 2023-07-18 21:19:51 UTC
With my pretty limited knowledge on cpu models... here's what I see as a summary so far.

The hardware error seems to be caused by a VM-entry failure: 0x8, read as a VM-instruction error (VM_INSTRUCTION_ERROR), means "VM entry with invalid host-state field(s)" (according to SDM 30.4).  It was trapped in L0 and then delivered to L1 KVM with the same error.

So some host state in L2's VMCS seems to fail the hardware checks, probably one of those in SDM "26.2 CHECKS ON VMX CONTROLS AND HOST-STATE AREA".
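
A minimal sketch of why the number 8 can be read two ways, depending on which VMCS field it is taken from: as a basic exit reason (the reading in comment 13) or as a VM-instruction error (the reading in this comment). The tables are deliberately partial and list only the codes mentioned in this bug; the function names are hypothetical.

```python
from typing import Tuple

# Partial tables: only the codes discussed in this bug are listed.
VM_INSTRUCTION_ERRORS = {
    7: "VM entry with invalid control field(s)",
    8: "VM entry with invalid host-state field(s)",
}
BASIC_EXIT_REASONS = {
    8: "NMI window",
}
VM_ENTRY_FAILURE = 1 << 31  # bit 31 of the 32-bit exit-reason field


def describe_instruction_error(code: int) -> str:
    """Interpret a value as a VM-instruction error number."""
    return VM_INSTRUCTION_ERRORS.get(code, "not in this partial table")


def describe_exit_reason(raw: int) -> Tuple[int, bool, str]:
    """Split a raw exit-reason value into (basic reason, entry-failed flag, name)."""
    basic = raw & 0xFFFF
    entry_failed = bool(raw & VM_ENTRY_FAILURE)
    return basic, entry_failed, BASIC_EXIT_REASONS.get(basic, "not in this partial table")


if __name__ == "__main__":
    print(describe_instruction_error(0x8))   # the reading used in this comment
    print(describe_exit_reason(0x8))         # the reading used in comment 13
    print(describe_exit_reason(0x00000000))  # "reason=00000000" from the dump
```

Note that the dump itself reports reason=00000000, i.e. no exit reason with the VM-entry-failure bit set was recorded.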

I can try to read more of the VMCS dump in comment 12 later this week (it seems to be quite useful).  Before that I am curious about two things:

1. It seems that we're not able to access the host (so far I don't think anything has even been mentioned about the host kernel version). Does that mean that even if we know it's a host kvm bug and we know a fix, it won't be fixed (because it'll need an upgrade of host kvm)?

2. Can we try a similar workaround as mentioned in comment 17 by Qinghua?  Is "-cpu max" required?  According to comment 21 where we do have a PASS use case, I'd give it a shot starting with "-cpu Skylake-Server-IBRS".

Comment 24 Igor Mammedov 2023-07-24 14:11:38 UTC
Also adding Vitaly (who sometimes has dealt with nested issues) to CC.

Comment 25 Vitaly Kuznetsov 2023-07-25 11:29:49 UTC
Getting information about L0 is crucial: most nested bugs end up being L0 KVM bugs, L1 hypervisor is usually innocent...

Comment 26 Nitesh Narayan Lal 2023-08-01 08:45:53 UTC
(In reply to Vitaly Kuznetsov from comment #25)
> Getting information about L0 is crucial: most nested bugs end up being L0
> KVM bugs, L1 hypervisor is usually innocent...

I am assuming this is something that reporter can provide. Hence, setting the needinfo.

Comment 27 YongkuiGuo 2023-08-02 05:28:26 UTC
(In reply to Nitesh Narayan Lal from comment #26)
> (In reply to Vitaly Kuznetsov from comment #25)
> > Getting information about L0 is crucial: most nested bugs end up being L0
> > KVM bugs, L1 hypervisor is usually innocent...
> 
> I am assuming this is something that reporter can provide. Hence, setting
> the needinfo.

I cannot get any info about the L0 host (see comment 19). Currently, I use the ci.nested.xxx flavor or create the VM on rhos-01 as a workaround (see comment 21).

Comment 30 Bandan Das 2023-08-02 19:01:15 UTC
(In reply to Nitesh Narayan Lal from comment #29)
> (In reply to yduan from comment #28)
> > (In reply to Peter Xu from comment #23)
> > > With my pretty limited knowledge on cpu models... here's what I see as a
> > > summary so far.
> > > 
> > > Hardware error seems to be caused by vmenter failure, which means 0x8
> > > (VM_INSTRUCTION_ERROR) -> VM entry with invalid host-state field(s)
> > > (according to SDM 30.4).  It got trapped in L0 then delivered to L1 kvm with
> > > the same error.
> > > 
> > > So some host state seems wrong in L2's VMCS when hardware checks, probably
> > > something falls into SDM "26.2 CHECKS ON VMX CONTROLS AND HOST-STATE AREA".
> > > 
> > > I can try to read more in the vmsd dump in comment 12 in the latter days
> > > this week (which seems to be quite useful),  Before that I am curious on two
> > > things:
> > > 
> > > 1. It seems that we're not be able to access the host (even so far I don't
> > > think anything mentioned on the host kernel version), then does it mean that
> > > even if we know it's a host kvm bug and we know a fix, it won't be fixed
> > > (because it'll need an upgrade of host kvm)?
> > > 
> > 
> > IIUC, QE has no permission to touch the underlying host in PSI(PnT Shared
> > Infrastructure) OpenStack environment.
> > 
> > Additional info:
> > The test environment, Production Cloud-D - RHOS 16.1[1], is running on RHEL
> > 8.2 [2] which is same as Red Hat
> > https://bugzilla.redhat.com/show_bug.cgi?id=2103118#c30.
> 
> Hi, Since the 8.2 kernel is pretty old and we need L0 information for
> further investigation.
> Can you please reproduce this issue on the latest rhel9/rhel8 host/guest
> combination?
>  
The problem, as mentioned above, is that this is RHOS 16.1 and it's tied to the 
8.2 kernel, so to fix this for RHOS we have to fix the 8.2 kernel (which brings us 
to your comment above that 8.2 is pretty old and we probably don't want to be pushing 
nested virtualization fixes to it).

Coming back to this bug, I was able to reproduce it fairly easily with an Icelake system
running 8.2 for L0 and RHEL 9 for L1.

Unlike bug 2127128 that I linked above, it's not the hardware that's complaining. Rather, L0 KVM 
is crafting a message similar to what the host would do in such cases.

In my test, the error comes from L0 finding a mismatch between L1's CR4 (vmcs12->cr4) and the
acceptable values that it has kept for cr4_fixed1, specifically CR4.LA57[12]. I am pretty sure 
this could happen for any other cpuid bits that the 8.2 kernel doesn't know about.

to_vmx(vcpu)->nested.msrs.cr4_fixed1 is filled in by nested_vmx_cr_fixed1_bits_update(), which checks
whether the guest cpuid has the bit set and whether the host supports it too. All would have been fine if we had this
one line in the 8.2 kernel:

cr4_fixed1_update(X86_CR4_LA57, ecx, bit(X86_FEATURE_LA57));

What this would have done is set LA57 in nested.msrs.cr4_fixed1 if the host supported 5-level page tables
and it's also set in the guest's cpu model. (CR4_FIXED1 basically says that a bit in CR4 can be set to 1
if it's set in CR4_FIXED1, but if it's 0 there, setting that bit in CR4 would cause an exception.)

Anyway, without that one-liner, to_vmx(vcpu)->nested.msrs.cr4_fixed1[12] = 0. When L1 tries to read MSR_IA32_VMX_CR4_FIXED1,
vmx_get_vmx_msr() returns the real hardware value of the MSR. (Another issue in the 8.2 kernel! The 8.2 kernel returns
vmx_get_vmx_msr(&vmcs_config.nested, msr->index, &msr->data), whereas newer kernels return 
vmx_get_vmx_msr(&vmx->nested.msrs...).) Assuming the host has 5-level page tables, L1 would want to 
set that bit in CR4. 

So, in the end, we would hit (on L0):
if (CC(!nested_host_cr4_valid(vcpu, vmcs12->host_cr4)))
                return -EINVAL;
  |
  |
 \ /
if (nested_vmx_check_host_state(vcpu, vmcs12))
                return nested_vmx_fail(vcpu, VMXERR_ENTRY_INVALID_HOST_STATE_FIELD);  
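
The check that fails can be sketched as follows. This is a simplified model of the fixed-bits test, not the kernel code itself. It assumes the Host State CR4=0x773ef0 in the VMCS dump above is what L0 sees as vmcs12->host_cr4; the FIXED0/FIXED1 values are hypothetical stand-ins chosen only so that they differ in bit 12 (LA57).

```python
X86_CR4_LA57 = 1 << 12  # CR4.LA57: 5-level paging
X86_CR4_VMXE = 1 << 13  # CR4.VMXE


def fixed_bits_valid(val: int, fixed0: int, fixed1: int) -> bool:
    """True if every bit set in fixed0 is 1 in val and every bit clear in fixed1 is 0 in val."""
    return ((val & fixed1) | fixed0) == val


# "Host State" CR4 from the VMCS dump in this bug (L1's own CR4).
HOST_CR4 = 0x773EF0

# Hypothetical values for illustration only.
FIXED0 = X86_CR4_VMXE
FIXED1_WITH_LA57 = 0x77FFFF                             # LA57 allowed
FIXED1_WITHOUT_LA57 = FIXED1_WITH_LA57 & ~X86_CR4_LA57  # bit 12 missing, as on the 8.2 L0

if __name__ == "__main__":
    print(bool(HOST_CR4 & X86_CR4_LA57))                              # L1 runs with LA57 set
    print(fixed_bits_valid(HOST_CR4, FIXED0, FIXED1_WITH_LA57))       # check passes
    print(fixed_bits_valid(HOST_CR4, FIXED0, FIXED1_WITHOUT_LA57))    # check fails
    print(hex(HOST_CR4 & ~FIXED1_WITHOUT_LA57))                       # the offending bit(s)
```

With bit 12 absent from the fixed1 mask, the only offending bit in this model is LA57 itself, which matches the mismatch described above.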


Regarding a workaround, the only one I can think of is booting L1 with la57=off but given that this is
a managed instance, I don't know how much of it would be possible.

Regarding a fix, I think the change is trivial. I am not too keen on it though. This is 
nvmx after all and I can't guarantee that the la57 feature that's affecting my setup applies
to the reporter's environment. We would have to know what the host hardware is. Honestly,
I am inclined to close this as WONTFIX.

Comment 31 Richard W.M. Jones 2023-08-02 19:22:16 UTC
Thanks - I very much agree that we should close this WONTFIX.  I didn't
initially realise that the L0 kernel was so ancient.

By the way would you mind making your analysis in comment 30 public?
That way if someone hits this error message (which to be honest is not
a very good one) then they will find this bug and have an explanation
of what is going on.

Comment 32 Nitesh Narayan Lal 2023-08-03 09:01:30 UTC
Thanks Bandan for the analysis, closing this bug.