Bug 1804224
Summary: libvirtd: error : virCPUx86UpdateLive:3110 : operation failed: guest CPU doesn't match specification: missing features: fxsr_opt
Product: Red Hat Enterprise Linux 8
Reporter: Riccardo Pittau <rpittau>
Component: libvirt
Assignee: Jiri Denemark <jdenemar>
Status: CLOSED ERRATA
QA Contact: jiyan <jiyan>
Severity: high
Priority: high
Version: 8.1
CC: dyuan, jdenemar, lhuang, lmen, mtessun, virt-maint, xuzhang
Target Milestone: rc
Keywords: FutureFeature, Regression, Triaged, ZStream
Hardware: Unspecified
OS: Unspecified
Fixed In Version: libvirt-4.5.0-41.el8
Clones: 1809510
Last Closed: 2020-04-28 15:33:47 UTC
Type: Feature Request
Bug Blocks: 1809510
Description
Riccardo Pittau
2020-02-18 13:43:11 UTC
Created attachment 1663747: libvirt cpu map
Created attachment 1663748: libvirtd logs
Created attachment 1663749: ls cpu output
Created attachment 1663750: testvm1 xml dump
Created attachment 1663752: libvirt cpu map
Created attachment 1663754: libvirtd logs
Created attachment 1663755: ls cpu output
Created attachment 1663756: testvm1 xml dump
Created attachment 1663758: testvm1 logs
Created attachment 1663759: virsh capabilities
Created attachment 1663760: virsh domcapabilities
Oh, you're creating a TCG domain, i.e., of type 'qemu'. This is not supported on RHEL; we only support KVM domains (type='kvm'). I guess TCG is used because you're doing all this in a VM, where you would normally use nested KVM, but that first-level VM was created with a very poor CPU model.

Since the libvirt log does not contain anything useful, could you please enable debug logs (see the guidance at https://wiki.libvirt.org/page/DebugLogs) and reproduce the issue again?

The testvm1_dump.xml describes the already running domain. Could you also provide the XML used to initially create the domain? And what does "restart the vm" mean here? If you use virsh or the libvirt API, could you provide the exact commands you use to reproduce this bug?

Created attachment 1664083: libvirt debug logs
Created attachment 1664084: libvirt template
Hey Jiri, I attached the libvirt debug logs and the libvirt domain template used to create the domain. About the commands: we define the VM using the attached template and then start it with virtualbmc (i.e., through the libvirt API via libvirt-python) by running 'vbmc start testvm1' (roughly the same as 'virsh start [domain]'). Thank you again for your help.

It doesn't seem that libvirt_template.xml is actually used for starting the domain, at least not after you "restart" it (whatever that means). The error suggests there must be check='full' in the inactive domain XML. BTW, check='full' will always be shown in the XML describing the running domain (i.e., in the "virsh dumpxml testvm1" output while the domain is running).

Looking at the debug logs, I can see the domain was defined with a host-model CPU and started. The real CPU definition which was used as host-model is not visible in the logs, but we can infer it from the QEMU command line:

/usr/libexec/qemu-kvm \
-name guest=testvm1,debug-threads=on \
-S \
-object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-testvm1/master-key.aes \
-machine pc-i440fx-rhel7.6.0,accel=tcg,usb=off,dump-guest-core=off \
-cpu EPYC,acpi=on,ss=on,hypervisor=on,erms=on,mpx=on,pcommit=on,clwb=on,\
pku=on,ospke=on,la57=on,3dnowext=on,3dnow=on,vme=off,fma=off,avx=off,\
f16c=off,rdrand=off,avx2=off,rdseed=off,sha-ni=off,xsavec=off,\
misalignsse=off,3dnowprefetch=off,osvw=off,topoext=off \
-m 3072 \
...
which would translate to

<cpu match='exact'>
  <model>EPYC</model>
  <feature name='acpi' policy='require'/>
  <feature name='ss' policy='require'/>
  <feature name='hypervisor' policy='require'/>
  <feature name='erms' policy='require'/>
  <feature name='mpx' policy='require'/>
  <feature name='pcommit' policy='require'/>
  <feature name='clwb' policy='require'/>
  <feature name='pku' policy='require'/>
  <feature name='ospke' policy='require'/>
  <feature name='la57' policy='require'/>
  <feature name='3dnowext' policy='require'/>
  <feature name='3dnow' policy='require'/>
  <feature name='vme' policy='disable'/>
  <feature name='fma' policy='disable'/>
  <feature name='avx' policy='disable'/>
  <feature name='f16c' policy='disable'/>
  <feature name='rdrand' policy='disable'/>
  <feature name='avx2' policy='disable'/>
  <feature name='rdseed' policy='disable'/>
  <feature name='sha-ni' policy='disable'/>
  <feature name='xsavec' policy='disable'/>
  <feature name='misalignsse' policy='disable'/>
  <feature name='3dnowprefetch' policy='disable'/>
  <feature name='osvw' policy='disable'/>
  <feature name='topoext' policy='disable'/>
</cpu>

The fxsr-opt feature was not explicitly disabled, and when QEMU starts, we can detect that fxsr-opt (enabled implicitly by asking for the EPYC model) could not be enabled. This would result in

<feature policy="disable" name="fxsr_opt"/>

being added to the active domain XML (in addition to setting the cpu/@check attribute to 'full').
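The start-up check described above (on the first start, features QEMU could not provide are added as disabled and check is switched to 'full'; once check='full' is set, the same situation is fatal) can be modeled in a few lines of Python. This is a hypothetical sketch, not libvirt's virCPUx86UpdateLive: the function names, the dict layout, the tiny feature sets, and the simplification that "missing" means features implied by the CPU model that were neither turned off on the QEMU command line nor supported by the accelerator are all assumptions for illustration.

```python
class CPUMismatchError(Exception):
    """Raised when QEMU could not provide features the definition expects."""


def filtered_features(model_features, cmdline_off, supported):
    """Features implied by the CPU model that QEMU had to filter out.

    Only features actually turned off on the -cpu command line are excluded;
    a feature disabled in the domain XML but never passed to QEMU still counts.
    """
    return (set(model_features) - set(cmdline_off)) - set(supported)


def update_live_cpu(cpu, missing_features):
    """Hypothetical model of the post-start CPU update.

    cpu: dict with 'check' ('partial' or 'full'; 'none' is not modeled)
         and 'features' ({name: policy} taken from the domain XML).
    missing_features: features QEMU could not provide.
    """
    missing = sorted(missing_features)
    if missing and cpu["check"] == "full":
        raise CPUMismatchError(
            "guest CPU doesn't match specification: missing features: "
            + ",".join(missing)
        )
    # check='partial': record missing features as disabled and switch the
    # active definition to check='full'.
    features = dict(cpu["features"])
    for name in missing:
        features[name] = "disable"
    return dict(cpu, features=features, check="full")
```

With a hypothetical model implying {"fxsr_opt", "sse2"} and an accelerator supporting only "sse2", the first call (check='partial') yields an active definition with fxsr_opt disabled and check='full'; feeding that definition back in fails unless "fxsr_opt" is also passed in cmdline_off.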
A bit later it seems the domain is redefined using its active definition, i.e., the host-model CPU is replaced with

<cpu check="full" match="exact" mode="custom">
  <model fallback="forbid">EPYC</model>
  <vendor>AMD</vendor>
  <feature name="acpi" policy="require" />
  <feature name="ss" policy="require" />
  <feature name="hypervisor" policy="require" />
  <feature name="erms" policy="require" />
  <feature name="mpx" policy="require" />
  <feature name="pcommit" policy="require" />
  <feature name="clwb" policy="require" />
  <feature name="pku" policy="require" />
  <feature name="ospke" policy="require" />
  <feature name="la57" policy="require" />
  <feature name="3dnowext" policy="require" />
  <feature name="3dnow" policy="require" />
  <feature name="vme" policy="disable" />
  <feature name="fma" policy="disable" />
  <feature name="avx" policy="disable" />
  <feature name="f16c" policy="disable" />
  <feature name="rdrand" policy="disable" />
  <feature name="avx2" policy="disable" />
  <feature name="rdseed" policy="disable" />
  <feature name="sha-ni" policy="disable" />
  <feature name="xsavec" policy="disable" />
  <feature name="misalignsse" policy="disable" />
  <feature name="3dnowprefetch" policy="disable" />
  <feature name="osvw" policy="disable" />
  <feature name="topoext" policy="disable" />
  <feature name="fxsr_opt" policy="disable" />
</cpu>

This happens twice within 10 seconds. A bit later the domain is shut down, and then libvirt is asked to start the domain.
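Given the redefined CPU element above, each explicit feature of a custom-mode definition is expected to show up on the QEMU -cpu option. A minimal sketch of that mapping, using Python's standard xml.etree.ElementTree; the function name cpu_arg and the one-entry spelling table are hypothetical, and the policy handling (force/require to "=on", disable/forbid to "=off", optional skipped) is an assumption about how a custom CPU is rendered, not libvirt's actual command-line builder.

```python
import xml.etree.ElementTree as ET

# Hypothetical libvirt-to-QEMU spelling table; only the entry relevant to this
# bug is listed, and any other feature name is passed through unchanged.
LIBVIRT_TO_QEMU = {"fxsr_opt": "fxsr-opt"}


def cpu_arg(cpu_xml):
    """Build a -cpu value from a custom-mode <cpu> element (sketch only)."""
    cpu = ET.fromstring(cpu_xml)
    parts = [cpu.findtext("model")]
    for feature in cpu.findall("feature"):
        name = feature.get("name")
        name = LIBVIRT_TO_QEMU.get(name, name)
        policy = feature.get("policy")
        if policy in ("force", "require"):
            parts.append(name + "=on")
        elif policy in ("disable", "forbid"):
            parts.append(name + "=off")
        # policy='optional' is skipped here: the CPU model's default applies.
    return ",".join(parts)
```

Applied to the redefined CPU above, the sketch's output would end with topoext=off,fxsr-opt=off.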
So in other words, the answer to my original question ("And what does 'restart the vm' mean here?") is: redefining the domain using its active definition, stopping the domain, and starting it again with the following QEMU command line:

/usr/libexec/qemu-kvm \
-name guest=testvm1,debug-threads=on \
-S \
-object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-2-testvm1/master-key.aes \
-machine pc-i440fx-rhel7.6.0,accel=tcg,usb=off,dump-guest-core=off \
-cpu EPYC,acpi=on,ss=on,hypervisor=on,erms=on,mpx=on,pcommit=on,clwb=on,\
pku=on,ospke=on,la57=on,3dnowext=on,3dnow=on,vme=off,fma=off,avx=off,\
f16c=off,rdrand=off,avx2=off,rdseed=off,sha-ni=off,xsavec=off,\
misalignsse=off,3dnowprefetch=off,osvw=off,topoext=off \
-m 3072 \
...

However, fxsr-opt=off is not passed to the -cpu option even though we know for sure the domain XML contains

<feature name="fxsr_opt" policy="disable"/>

Thus, when we check which features could not be provided by QEMU, we get fxsr_opt again, as it wasn't explicitly disabled on the QEMU command line. The cpu/@check attribute is set to 'full', which turns this situation into a fatal error.

This makes me think that even the CPU definition used to replace the original host-model CPU when the domain was first started explicitly disabled the fxsr_opt feature. But since the cpu/@check attribute was set to 'partial' at that point, the domain started just fine.

And indeed, even asking libvirt to create a QEMU command line for the following simple XML reveals the issue:

# cat test.xml
<domain type='kvm'>
  <name>test</name>
  <memory>4096</memory>
  <os>
    <type arch='x86_64'>hvm</type>
  </os>
  <cpu mode='custom' match='exact' check='none'>
    <model>EPYC</model>
    <feature name='fxsr_opt' policy='disable'/>
  </cpu>
</domain>

# virsh domxml-to-native qemu-argv --xml test.xml
...
/usr/libexec/qemu-kvm \
-name guest=test,debug-threads=on \
-object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain--1-test/master-key.aes \
-machine pc-i440fx-rhel7.6.0,accel=kvm,usb=off,dump-guest-core=off \
-cpu EPYC \
-m 4 \
...

Upstream and even RHEL-AV-8.1.1 work fine, which means this bug is limited to the 4.5.0-based version of libvirt used in RHEL-8.1. Bisecting pointed to the following patch, backported for libvirt-4.5.0-35.1.el8 (RHEL-8.1.0.z) and libvirt-4.5.0-36.el8 (RHEL-8.2.0):

commit ac34e141596fab70fbe91a396311f80db6cb57c5
Refs: v5.9.0-145-gac34e14159
Author: Jiri Denemark <jdenemar>
AuthorDate: Fri Oct 18 14:33:19 2019 +0200
Commit: Jiri Denemark <jdenemar>
CommitDate: Tue Nov 12 20:14:16 2019 +0100

    qemu: Drop disabled CPU features unknown to QEMU

    When a CPU definition wants to explicitly disable some features that are unknown to QEMU, we can safely drop them from the definition before starting QEMU. Naturally QEMU won't enable such features implicitly.

    Signed-off-by: Jiri Denemark <jdenemar>
    Reviewed-by: Daniel P. Berrangé <berrange>

However, this commit works as expected upstream, because the function used for gathering CPU features supported by QEMU (virQEMUCapsGetCPUFeatures) was fixed upstream earlier by:

commit 1fd28a2e79692babd63d6b8e9eea90168dd0897e
Refs: v5.5.0-333-g1fd28a2e79
Author: Jiri Denemark <jdenemar>
AuthorDate: Thu Jul 25 10:27:45 2019 +0200
Commit: Jiri Denemark <jdenemar>
CommitDate: Fri Jul 26 16:37:30 2019 +0200

    qemu: Translate features in virQEMUCapsGetCPUFeatures

    Starting with QEMU 4.1 qemuMonitorCPUModelInfo structure in virQEMUCaps stores only canonical feature names which may differ from the name used by libvirt. We need translate these canonical names into libvirt names for further consumption.

    This fixes a bug in qemuConnectBaselineHypervisorCPU which would remove all features for which libvirt's spelling differs from the QEMU's preferred name.
    For example, the following result of qemuConnectBaselineHypervisorCPU on my host with QEMU 4.1 is wrong:

    <cpu mode='custom' match='exact'>
      <model fallback='forbid'>Skylake-Client</model>
      <vendor>Intel</vendor>
      <feature policy='require' name='ss'/>
      <feature policy='require' name='vmx'/>
      <feature policy='require' name='hypervisor'/>
      <feature policy='require' name='clflushopt'/>
      <feature policy='require' name='umip'/>
      <feature policy='require' name='arch-capabilities'/>
      <feature policy='require' name='xsaves'/>
      <feature policy='require' name='pdpe1gb'/>
      <feature policy='require' name='invtsc'/>
      <feature policy='disable' name='pclmuldq'/>
      <feature policy='disable' name='lahf_lm'/>
    </cpu>

    The 'pclmuldq' and 'lahf_lm' should not be disabled in the baseline CPU as they are supported by QEMU on this host.

    Signed-off-by: Jiri Denemark <jdenemar>
    Reviewed-by: Andrea Bolognani <abologna>

Unfortunately, this earlier fix was never backported downstream. This is a regression since RHEL-8.1.0 GA (in both 8.2.0 and 8.1.0.z).

Reproduced this bug with libvirt-4.5.0-35.2.module+el8.1.0+5256+4b9ab730.x86_64.
Verified this bug with libvirt-4.5.0-35.3.module+el8.1.0+5931+8897e7e1.x86_64.

Version:
libvirt-4.5.0-35.2.module+el8.1.0+5256+4b9ab730.x86_64
qemu-kvm-2.12.0-88.module+el8.1.0+5708+85d8e057.3.x86_64
kernel-4.18.0-147.el8.x86_64

Steps:
1. Prepare a shut-off VM with the following configuration:
# virsh domstate test81
shut off

# virsh dumpxml test81 --inactive |grep "<domain t"
<domain type='qemu'>

# virsh dumpxml test81 --inactive |grep "<cpu" -A29
<cpu mode='custom' match='exact' check='full'>
  <model fallback='forbid'>EPYC</model>
  <vendor>AMD</vendor>
  <feature policy='require' name='acpi'/>
  <feature policy='require' name='ss'/>
  <feature policy='require' name='hypervisor'/>
  <feature policy='require' name='erms'/>
  <feature policy='require' name='mpx'/>
  <feature policy='require' name='pcommit'/>
  <feature policy='require' name='clwb'/>
  <feature policy='require' name='pku'/>
  <feature policy='require' name='ospke'/>
  <feature policy='require' name='la57'/>
  <feature policy='require' name='3dnowext'/>
  <feature policy='require' name='3dnow'/>
  <feature policy='disable' name='vme'/>
  <feature policy='disable' name='fma'/>
  <feature policy='disable' name='avx'/>
  <feature policy='disable' name='f16c'/>
  <feature policy='disable' name='rdrand'/>
  <feature policy='disable' name='avx2'/>
  <feature policy='disable' name='rdseed'/>
  <feature policy='disable' name='sha-ni'/>
  <feature policy='disable' name='xsavec'/>
  <feature policy='disable' name='misalignsse'/>
  <feature policy='disable' name='3dnowprefetch'/>
  <feature policy='disable' name='osvw'/>
  <feature policy='disable' name='topoext'/>
  <feature policy='disable' name='fxsr_opt'/>
</cpu>

2. Start the VM and check the log:
# virsh start test81
error: Failed to start domain test81
error: operation failed: guest CPU doesn't match specification: missing features: fxsr_opt

# tail -f /var/log/libvirt/qemu/test81.log
2020-03-09T02:23:15.501270Z qemu-kvm: warning: TCG doesn't support requested feature: CPUID.80000001H:EDX.fxsr-opt [bit 25]
2020-03-09T02:23:15.501943Z qemu-kvm: warning: TCG doesn't support requested feature: CPUID.80000001H:EDX.fxsr-opt [bit 25]

3. Update libvirt and restart libvirtd:
# yum update libvirt* -y
# systemctl restart libvirtd
# rpm -qa libvirt
libvirt-4.5.0-35.3.module+el8.1.0+5931+8897e7e1.x86_64

4. Start the VM again and check the QEMU command line and the active XML:
# virsh start test81
Domain test81 started

# ps -ef |grep test81
-cpu EPYC,acpi=on,ss=on,hypervisor=on,erms=on,mpx=on,pcommit=on,clwb=on,pku=on,ospke=on,la57=on,3dnowext=on,3dnow=on,vme=off,fma=off,avx=off,f16c=off,rdrand=off,avx2=off,rdseed=off,sha-ni=off,xsavec=off,misalignsse=off,3dnowprefetch=off,osvw=off,topoext=off,fxsr-opt=off

# virsh dumpxml test81 | grep "<cpu" -A30
<cpu mode='custom' match='exact' check='full'>
  <model fallback='forbid'>EPYC</model>
  <vendor>AMD</vendor>
  <feature policy='require' name='acpi'/>
  <feature policy='require' name='ss'/>
  <feature policy='require' name='hypervisor'/>
  <feature policy='require' name='erms'/>
  <feature policy='require' name='mpx'/>
  <feature policy='require' name='pcommit'/>
  <feature policy='require' name='clwb'/>
  <feature policy='require' name='pku'/>
  <feature policy='require' name='ospke'/>
  <feature policy='require' name='la57'/>
  <feature policy='require' name='3dnowext'/>
  <feature policy='require' name='3dnow'/>
  <feature policy='disable' name='vme'/>
  <feature policy='disable' name='fma'/>
  <feature policy='disable' name='avx'/>
  <feature policy='disable' name='f16c'/>
  <feature policy='disable' name='rdrand'/>
  <feature policy='disable' name='avx2'/>
  <feature policy='disable' name='rdseed'/>
  <feature policy='disable' name='sha-ni'/>
  <feature policy='disable' name='xsavec'/>
  <feature policy='disable' name='misalignsse'/>
  <feature policy='disable' name='3dnowprefetch'/>
  <feature policy='disable' name='osvw'/>
  <feature policy='disable' name='topoext'/>
  <feature policy='disable' name='fxsr_opt'/>
</cpu>

Sorry for the wrong update to this bug; the previous comment belongs to Bug 1809510. I will update this bug with the correct versions and steps later.
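The root cause identified in the analysis can be illustrated with a small model of the "drop disabled features unknown to QEMU" step: it compares feature names in libvirt spelling (fxsr_opt) against the names QEMU reports (fxsr-opt), so without a translation step the feature looks unknown. This is a hypothetical sketch, not code from commit ac34e14159 or virQEMUCapsGetCPUFeatures: the function names, the list-of-pairs representation, and the one-entry translation table are illustrative assumptions, and only policy='disable' is modeled.

```python
# Hypothetical QEMU-to-libvirt spelling table; only the entry relevant to
# this bug is listed, other names are assumed to be spelled identically.
QEMU_TO_LIBVIRT = {"fxsr-opt": "fxsr_opt"}


def qemu_known_features(reported, translate):
    """Names of CPU features QEMU reports, optionally in libvirt spelling."""
    if not translate:
        return set(reported)  # buggy path: raw QEMU spellings
    return {QEMU_TO_LIBVIRT.get(name, name) for name in reported}


def drop_unknown_disabled(features, known):
    """Drop explicitly disabled features that appear unknown to QEMU.

    features: list of (libvirt_name, policy) pairs from the CPU definition.
    known: feature names QEMU is believed to support.
    """
    return [
        (name, policy)
        for name, policy in features
        if policy != "disable" or name in known
    ]


if __name__ == "__main__":
    reported = ["fxsr-opt", "topoext"]  # hypothetical subset of QEMU's list
    wanted = [("topoext", "disable"), ("fxsr_opt", "disable")]
    for translate in (False, True):
        known = qemu_known_features(reported, translate)
        print("translate" if translate else "no translation",
              drop_unknown_disabled(wanted, known))
```

Without translation, fxsr_opt is treated as unknown and silently dropped, so fxsr-opt=off never reaches the command line and the later check='full' comparison fails; with translation it is kept, which matches the fxsr-opt=off visible on the fixed command line above.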
Reproduced this bug with libvirt-4.5.0-40.module+el8.2.0+5761+d16d25e7.x86_64.
Verified this bug with libvirt-4.5.0-41.module+el8.2.0+5928+db9eea38.x86_64.

Version:
libvirt-4.5.0-40.module+el8.2.0+5761+d16d25e7.x86_64
qemu-kvm-2.12.0-99.module+el8.2.0+5827+8c39933c.x86_64
kernel-4.18.0-185.el8.x86_64

Steps:
1. Prepare a shut-off VM with the following configuration:
# virsh domstate test82
shut off

# virsh dumpxml test82 --inactive | grep "<domain t"
<domain type='qemu'>

# virsh dumpxml test82 --inactive | grep "<cpu" -A29
<cpu mode='custom' match='exact' check='full'>
  <model fallback='forbid'>EPYC</model>
  <vendor>AMD</vendor>
  <feature policy='require' name='acpi'/>
  <feature policy='require' name='ss'/>
  <feature policy='require' name='hypervisor'/>
  <feature policy='require' name='erms'/>
  <feature policy='require' name='mpx'/>
  <feature policy='require' name='pcommit'/>
  <feature policy='require' name='clwb'/>
  <feature policy='require' name='pku'/>
  <feature policy='require' name='ospke'/>
  <feature policy='require' name='la57'/>
  <feature policy='require' name='3dnowext'/>
  <feature policy='require' name='3dnow'/>
  <feature policy='disable' name='vme'/>
  <feature policy='disable' name='fma'/>
  <feature policy='disable' name='avx'/>
  <feature policy='disable' name='f16c'/>
  <feature policy='disable' name='rdrand'/>
  <feature policy='disable' name='avx2'/>
  <feature policy='disable' name='rdseed'/>
  <feature policy='disable' name='sha-ni'/>
  <feature policy='disable' name='xsavec'/>
  <feature policy='disable' name='misalignsse'/>
  <feature policy='disable' name='3dnowprefetch'/>
  <feature policy='disable' name='osvw'/>
  <feature policy='disable' name='topoext'/>
  <feature policy='disable' name='fxsr_opt'/>
</cpu>

2. Start the VM:
# virsh start test82
error: Failed to start domain test82
error: operation failed: guest CPU doesn't match specification: missing features: fxsr_opt

# tail -f /var/log/libvirt/qemu/test82.log
2020-03-09T03:06:06.780420Z qemu-kvm: warning: TCG doesn't support requested feature: CPUID.80000001H:EDX.fxsr-opt [bit 25]
2020-03-09T03:06:06.781249Z qemu-kvm: warning: TCG doesn't support requested feature: CPUID.80000001H:EDX.fxsr-opt [bit 25]

3. Update libvirt and restart libvirtd:
# yum update libvirt* -y
# systemctl restart libvirtd
# rpm -qa libvirt
libvirt-4.5.0-41.module+el8.2.0+5928+db9eea38.x86_64

4. Start the VM again:
# virsh start test82
Domain test82 started

# ps -ef |grep test82
...cpu EPYC,acpi=on,ss=on,hypervisor=on,erms=on,mpx=on,pcommit=on,clwb=on,pku=on,ospke=on,la57=on,3dnowext=on,3dnow=on,vme=off,fma=off,avx=off,f16c=off,rdrand=off,avx2=off,rdseed=off,sha-ni=off,xsavec=off,misalignsse=off,3dnowprefetch=off,osvw=off,topoext=off,fxsr-opt=off

# virsh dumpxml test82 | grep "<cpu" -A30
<cpu mode='custom' match='exact' check='full'>
  <model fallback='forbid'>EPYC</model>
  <vendor>AMD</vendor>
  <feature policy='require' name='acpi'/>
  <feature policy='require' name='ss'/>
  <feature policy='require' name='hypervisor'/>
  <feature policy='require' name='erms'/>
  <feature policy='require' name='mpx'/>
  <feature policy='require' name='pcommit'/>
  <feature policy='require' name='clwb'/>
  <feature policy='require' name='pku'/>
  <feature policy='require' name='ospke'/>
  <feature policy='require' name='la57'/>
  <feature policy='require' name='3dnowext'/>
  <feature policy='require' name='3dnow'/>
  <feature policy='disable' name='vme'/>
  <feature policy='disable' name='fma'/>
  <feature policy='disable' name='avx'/>
  <feature policy='disable' name='f16c'/>
  <feature policy='disable' name='rdrand'/>
  <feature policy='disable' name='avx2'/>
  <feature policy='disable' name='rdseed'/>
  <feature policy='disable' name='sha-ni'/>
  <feature policy='disable' name='xsavec'/>
  <feature policy='disable' name='misalignsse'/>
  <feature policy='disable' name='3dnowprefetch'/>
  <feature policy='disable' name='osvw'/>
  <feature policy='disable' name='topoext'/>
  <feature policy='disable' name='fxsr_opt'/>
</cpu>

All the test results are as expected; moving this bug to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:1587