Bug 1897948
| Summary: | VM live migration fails with AMD: virt-ssbd flag gets lost on migration, guest VM freezes | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Oliver Freyermuth <o.freyermuth> |
| Component: | libvirt | Assignee: | Tim Wiederhake <twiederh> |
| Status: | CLOSED WONTFIX | QA Contact: | Fangge Jin <fjin> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 7.9 | CC: | hu.zhou, jsuchane, juzhang, o.freyermuth, twiederh, wienemann, xiaohli, yalzhang, ymankad |
| Target Milestone: | rc | Flags: | pm-rhel: mirror+ |
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2022-10-13 17:27:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Oliver Freyermuth
2020-11-15 19:33:15 UTC
Hi Oliver, just to confirm the qemu version: is it qemu-kvm-1.5.3? I have tried to reproduce the bug with the latest libvirt/qemu-kvm for RHEL 7; the details are below. The virt-ssbd flag disappears from the qemu command line and the guest live XML, but the guest does not freeze.
# rpm -q libvirt qemu-kvm kernel
libvirt-4.5.0-36.el7_9.3.x86_64
qemu-kvm-1.5.3-175.el7_9.1.x86_64
kernel-3.10.0-1160.6.1.el7.x86_64
guest kernel:
3.10.0-1160.el7.x86_64
1. Prepare two hosts with the same CPU model, "EPYC-IBPB".
2. Start a VM with CPU mode "host-model" on the source host:
# virsh dumpxml rhel
...
<cpu mode='host-model' check='partial'>
<model fallback='allow'/>
</cpu>
...
# virsh start rhel
# virsh dumpxml rhel
...
<cpu mode='custom' match='exact' check='full'>
<model fallback='forbid'>EPYC-IBPB</model>
<vendor>AMD</vendor>
<feature policy='disable' name='ht'/>
<feature policy='disable' name='osxsave'/>
<feature policy='require' name='cmp_legacy'/>
<feature policy='disable' name='extapic'/>
<feature policy='disable' name='skinit'/>
<feature policy='disable' name='wdt'/>
<feature policy='disable' name='tce'/>
<feature policy='disable' name='topoext'/>
<feature policy='disable' name='perfctr_core'/>
<feature policy='disable' name='perfctr_nb'/>
**** <feature policy='require' name='virt-ssbd'/> *****
<feature policy='disable' name='monitor'/>
<feature policy='require' name='hypervisor'/>
<feature policy='disable' name='arat'/>
<feature policy='disable' name='svm'/>
</cpu>
...
Check the qemu command line:
# ps aux | grep qemu
...
-cpu EPYC-IBPB,+ht,+osxsave,+cmp_legacy,+extapic,+skinit,+wdt,+tce,+topoext,+perfctr_core,+perfctr_nb,*** +virt-ssbd ***
...
3. Migrate to another host, then check on the dst host:
# ps aux |grep qemu
-cpu EPYC-IBPB,+ht,+osxsave,+cmp_legacy,+extapic,+skinit,+wdt,+tce,+topoext,+perfctr_core,+perfctr_nb =====> no "+virt-ssbd" here
# virsh dumpxml rhel
...
<cpu mode='custom' match='exact' check='full'>
<model fallback='forbid'>EPYC-IBPB</model>
<vendor>AMD</vendor>
<feature policy='disable' name='ht'/>
<feature policy='disable' name='osxsave'/>
<feature policy='require' name='cmp_legacy'/>
<feature policy='disable' name='extapic'/>
<feature policy='disable' name='skinit'/>
<feature policy='disable' name='wdt'/>
<feature policy='disable' name='tce'/>
<feature policy='disable' name='topoext'/>
<feature policy='disable' name='perfctr_core'/>
<feature policy='disable' name='perfctr_nb'/>
<feature policy='disable' name='monitor'/>
<feature policy='require' name='hypervisor'/>
<feature policy='disable' name='arat'/>
<feature policy='disable' name='svm'/> =======> no "<feature policy='require' name='virt-ssbd'/>" here
</cpu>
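The manual comparison of the two `-cpu` strings above can also be scripted. The following is only a minimal sketch, assuming the `-cpu` argument has been copied by hand from the `ps aux` output on each host; `missing_cpu_features` is a hypothetical helper name and not part of libvirt or qemu, and the strings are shortened versions of the ones shown above.

```shell
#!/bin/sh
# missing_cpu_features SRC DST
# Print every comma-separated entry of SRC that does not appear in DST.
# Hypothetical helper for illustration only.
missing_cpu_features() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r feat; do
        case ",$2," in
            *",$feat,"*) ;;                 # entry exists on both hosts
            *) printf '%s\n' "$feat" ;;     # entry was lost on the destination
        esac
    done
}

# Shortened -cpu strings modelled on the source and destination hosts.
src_cpu='EPYC-IBPB,+ht,+osxsave,+cmp_legacy,+perfctr_nb,+virt-ssbd'
dst_cpu='EPYC-IBPB,+ht,+osxsave,+cmp_legacy,+perfctr_nb'

missing_cpu_features "$src_cpu" "$dst_cpu"   # prints: +virt-ssbd
```

Wrapping both strings in commas before matching keeps one entry from matching as a substring of another.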
Log in to the guest to check: "ssbd" and "virt_ssbd" are still present and the guest does not freeze, which is expected.
# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 1
On-line CPU(s) list: 0
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 1
Model name: AMD EPYC Processor (with IBPB)
Stepping: 2
CPU MHz: 2096.060
BogoMIPS: 4192.12
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 64K
L1i cache: 64K
L2 cache: 512K
NUMA node0 CPU(s): 0
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm art rep_good nopl extd_apicid eagerfpu pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw retpoline_amd ****ssbd**** ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 ****virt_ssbd**** arat
# uname -a
Linux localhost.localdomain 3.10.0-1160.el7.x86_64 #1 SMP Tue Aug 18 14:50:17 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
4. Reboot the guest and check the CPU again: ssbd and virt-ssbd disappear.
# reboot
# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 1
On-line CPU(s) list: 0
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 1
Model name: AMD EPYC Processor (with IBPB)
Stepping: 2
CPU MHz: 2096.060
BogoMIPS: 4192.12
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 64K
L1i cache: 64K
L2 cache: 512K
NUMA node0 CPU(s): 0
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm art rep_good nopl extd_apicid eagerfpu pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw retpoline_amd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 arat
# lscpu | grep ssbd
(no output)
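Note that `lscpu | grep ssbd` matches both `ssbd` and `virt_ssbd`, so it cannot tell which of the two is missing. A whole-word check on the flags line separates them. This is a minimal sketch; `cpu_flag_state` is a hypothetical helper, and the flag lists are shortened versions of the lscpu output before and after the reboot.

```shell
#!/bin/sh
# cpu_flag_state FLAGS NAME
# Print "present" if NAME occurs as a whole word in the space-separated
# FLAGS string, otherwise "absent". Hypothetical helper for illustration.
cpu_flag_state() {
    case " $1 " in
        *" $2 "*) echo present ;;
        *)        echo absent ;;
    esac
}

# Shortened flag lists modelled on the guest before and after the reboot.
before='osvw retpoline_amd ssbd ibpb vmmcall xgetbv1 virt_ssbd arat'
after='osvw retpoline_amd ibpb vmmcall xgetbv1 arat'

for flag in ssbd virt_ssbd; do
    echo "before reboot: $flag $(cpu_flag_state "$before" "$flag")"
    echo "after reboot:  $flag $(cpu_flag_state "$after" "$flag")"
done
```

Inside a real guest, the flags string could presumably be taken from the `flags` line of `/proc/cpuinfo` instead of the sample strings used here.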
Yes, this is with:
# rpm -q libvirt qemu-kvm kernel
libvirt-4.5.0-36.el7_9.2.x86_64
qemu-kvm-1.5.3-175.el7_9.1.x86_6
kernel-3.10.0-1160.2.2.el7.x86_64
Potentially, the actual freeze might depend on the workload, guest OS and CPU model? We used CentOS 8.2 in several of the hanging guests.
I have now run several migration tests, and it seems the guest freezes do not appear with 100 % probability. Sometimes the guest freezes immediately, sometimes a few seconds after migration, and in one case I only got a kernel trace and the VM continued running:
kernel: unchecked MSR access error: WRMSR to 0xc001011f (tried to write 0x0000000000000004) at rIP: 0xffffffffb8661f14 (native_write_msr+0x4/0x20)
kernel: Call Trace:
kernel: __switch_to_xtra+0x2e1/0x5e0
What seems to help reproduce the crashes is running something inside the VM during the migration. I was quite successful with:
while true; do date; sleep 1; done
For reference, this is with a:
Model name: AMD Opteron 63xx class CPU
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic popcnt aes xsave avx f16c hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw xop fma4 tbm ssbd ibpb vmmcall bmi1 virt_ssbd arat
In general, I would expect that losing the flag from the hypervisor configuration while it is still present in the VM is reason enough for potentially undefined behaviour, no?
I was unable to reproduce this issue with a newer version of libvirt, 8.8.0. As RHEL 7 is currently in the Maintenance Support 2 phase, I have to close this issue as WONTFIX.
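The `while true; do date; sleep 1; done` loop used to provoke the freeze can also be made to report stalls on its own. The sketch below assumes the loop prints epoch seconds (`date +%s`) rather than a formatted date; `detect_stalls` is a hypothetical helper name. It only reports stalls the guest recovers from, and only if the guest clock advances across the stall; a permanent freeze still has to be observed from outside the guest.

```shell
#!/bin/sh
# detect_stalls THRESHOLD
# Read epoch timestamps (one per line) on stdin and report every gap between
# consecutive timestamps that exceeds THRESHOLD seconds.
# Hypothetical helper for illustration only.
detect_stalls() {
    threshold=$1
    prev=
    while IFS= read -r now; do
        if [ -n "$prev" ] && [ $((now - prev)) -gt "$threshold" ]; then
            printf 'stall: %s seconds between %s and %s\n' \
                "$((now - prev))" "$prev" "$now"
        fi
        prev=$now
    done
}

# Demonstration with canned timestamps: one 8-second gap.
printf '%s\n' 100 101 102 110 111 | detect_stalls 3
# prints: stall: 8 seconds between 102 and 110

# Inside the guest during a migration, the live variant would be:
#   while true; do date +%s; sleep 1; done | detect_stalls 3
```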