Bug 1897948
Summary: | VM live migration fails with AMD: virt-ssbd flag gets lost on migration, guest VM freezes | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Oliver Freyermuth <o.freyermuth> |
Component: | libvirt | Assignee: | Tim Wiederhake <twiederh> |
Status: | CLOSED WONTFIX | QA Contact: | Fangge Jin <fjin> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 7.9 | CC: | hu.zhou, jsuchane, juzhang, o.freyermuth, twiederh, wienemann, xiaohli, yalzhang, ymankad |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Last Closed: | 2022-10-13 17:27:30 UTC | Type: | Bug |
Description
Oliver Freyermuth
2020-11-15 19:33:15 UTC
Hi Oliver, just to confirm the qemu version: is it qemu-kvm-1.5.3? I have tried to reproduce the bug with the latest libvirt/qemu-kvm for RHEL 7; details are below. The virt-ssbd flag disappears from the qemu command line and the guest live XML, but the guest does not freeze.

    # rpm -q libvirt qemu-kvm kernel
    libvirt-4.5.0-36.el7_9.3.x86_64
    qemu-kvm-1.5.3-175.el7_9.1.x86_64
    kernel-3.10.0-1160.6.1.el7.x86_64

guest kernel: 3.10.0-1160.el7.x86_64

1. Prepare 2 hosts with the same CPU, "EPYC-IBPB".

2. Start the VM with CPU mode "host-model" on the src host:

    # virsh dumpxml rhel
    ...
    <cpu mode='host-model' check='partial'>
      <model fallback='allow'/>
    </cpu>
    ...
    # virsh start rhel
    # virsh dumpxml rhel
    ...
    <cpu mode='custom' match='exact' check='full'>
      <model fallback='forbid'>EPYC-IBPB</model>
      <vendor>AMD</vendor>
      <feature policy='disable' name='ht'/>
      <feature policy='disable' name='osxsave'/>
      <feature policy='require' name='cmp_legacy'/>
      <feature policy='disable' name='extapic'/>
      <feature policy='disable' name='skinit'/>
      <feature policy='disable' name='wdt'/>
      <feature policy='disable' name='tce'/>
      <feature policy='disable' name='topoext'/>
      <feature policy='disable' name='perfctr_core'/>
      <feature policy='disable' name='perfctr_nb'/>
      **** <feature policy='require' name='virt-ssbd'/> *****
      <feature policy='disable' name='monitor'/>
      <feature policy='require' name='hypervisor'/>
      <feature policy='disable' name='arat'/>
      <feature policy='disable' name='svm'/>
    </cpu>
    ...

Check the qemu command line:

    # ps aux | grep qemu
    ... -cpu EPYC-IBPB,+ht,+osxsave,+cmp_legacy,+extapic,+skinit,+wdt,+tce,+topoext,+perfctr_core,+perfctr_nb,*** +virt-ssbd *** ...

3. Migrate to another host, then check on the dst host:

    # ps aux |grep qemu
    -cpu EPYC-IBPB,+ht,+osxsave,+cmp_legacy,+extapic,+skinit,+wdt,+tce,+topoext,+perfctr_core,+perfctr_nb
    =====> no "+virt-ssbd" here

    # virsh dumpxml rhel
    ...
    <cpu mode='custom' match='exact' check='full'>
      <model fallback='forbid'>EPYC-IBPB</model>
      <vendor>AMD</vendor>
      <feature policy='disable' name='ht'/>
      <feature policy='disable' name='osxsave'/>
      <feature policy='require' name='cmp_legacy'/>
      <feature policy='disable' name='extapic'/>
      <feature policy='disable' name='skinit'/>
      <feature policy='disable' name='wdt'/>
      <feature policy='disable' name='tce'/>
      <feature policy='disable' name='topoext'/>
      <feature policy='disable' name='perfctr_core'/>
      <feature policy='disable' name='perfctr_nb'/>
      <feature policy='disable' name='monitor'/>
      <feature policy='require' name='hypervisor'/>
      <feature policy='disable' name='arat'/>
      <feature policy='disable' name='svm'/>
      =======> no "<feature policy='require' name='virt-ssbd'/>" here
    </cpu>

Log in to the guest to check: the "ssbd" and "virt_ssbd" flags still exist and the guest does not freeze, which is expected.

    # lscpu
    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                1
    On-line CPU(s) list:   0
    Thread(s) per core:    1
    Core(s) per socket:    1
    Socket(s):             1
    NUMA node(s):          1
    Vendor ID:             AuthenticAMD
    CPU family:            23
    Model:                 1
    Model name:            AMD EPYC Processor (with IBPB)
    Stepping:              2
    CPU MHz:               2096.060
    BogoMIPS:              4192.12
    Hypervisor vendor:     KVM
    Virtualization type:   full
    L1d cache:             64K
    L1i cache:             64K
    L2 cache:              512K
    NUMA node0 CPU(s):     0
    Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm art rep_good nopl extd_apicid eagerfpu pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw retpoline_amd ****ssbd**** ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 ****virt_ssbd**** arat

    # uname -a
    Linux localhost.localdomain 3.10.0-1160.el7.x86_64 #1 SMP Tue Aug 18 14:50:17 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

4. Reboot the guest and check the CPU again: the ssbd and virt_ssbd flags disappear.

    # reboot
    # lscpu
    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                1
    On-line CPU(s) list:   0
    Thread(s) per core:    1
    Core(s) per socket:    1
    Socket(s):             1
    NUMA node(s):          1
    Vendor ID:             AuthenticAMD
    CPU family:            23
    Model:                 1
    Model name:            AMD EPYC Processor (with IBPB)
    Stepping:              2
    CPU MHz:               2096.060
    BogoMIPS:              4192.12
    Hypervisor vendor:     KVM
    Virtualization type:   full
    L1d cache:             64K
    L1i cache:             64K
    L2 cache:              512K
    NUMA node0 CPU(s):     0
    Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm art rep_good nopl extd_apicid eagerfpu pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw retpoline_amd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 arat

    # lscpu | grep ssbd
    (no output)

Yes, this is with:

    # rpm -q libvirt qemu-kvm kernel
    libvirt-4.5.0-36.el7_9.2.x86_64
    qemu-kvm-1.5.3-175.el7_9.1.x86_6
    kernel-3.10.0-1160.2.2.el7.x86_64

Potentially, the actual freeze might be workload, guest OS and CPU model related? We used CentOS 8.2 in several of the hanging guests.

I have now run several migration tests, and it seems the guest freezes do not appear with 100 % probability. Sometimes the guest freezes immediately, sometimes a few seconds after migration, and in one case I only got a kernel trace and the VM continued running:

    kernel: unchecked MSR access error: WRMSR to 0xc001011f (tried to write 0x0000000000000004) at rIP: 0xffffffffb8661f14 (native_write_msr+0x4/0x20)
    kernel: Call Trace:
    kernel: __switch_to_xtra+0x2e1/0x5e0

What seems to help to reproduce the crashes is to run something inside the VM during the migration. I was quite successful with

    while true; do date; sleep 1; done

For reference, this is with a:

    Model name: AMD Opteron 63xx class CPU
    Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic popcnt aes xsave avx f16c hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw xop fma4 tbm ssbd ibpb vmmcall bmi1 virt_ssbd arat

In general, I would expect that the loss of the flag from the hypervisor configuration while it is still present in the VM is reason enough for potentially undefined behaviour, no?

I was unable to reproduce this issue with a newer version of libvirt, 8.8.0. As RHEL 7 is currently in the Maintenance Support 2 phase, I have to close this issue as WONTFIX.
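For anyone checking whether a setup is affected, the manual comparison of the `<cpu>` section of `virsh dumpxml` on the src and dst hosts can be scripted. The following is only a minimal sketch, assuming the two XML dumps have already been saved as strings. The helper names `cpu_features` and `lost_features` and the shortened sample XML are hypothetical and not part of libvirt.

```python
import xml.etree.ElementTree as ET


def cpu_features(xml_text):
    """Map feature name -> policy for the first <cpu> element in the XML."""
    root = ET.fromstring(xml_text)
    # iter() also matches the root itself, so a bare <cpu> fragment works too.
    cpu = next(root.iter("cpu"), None)
    if cpu is None:
        return {}
    return {f.get("name"): f.get("policy") for f in cpu.findall("feature")}


def lost_features(src_xml, dst_xml):
    """Features listed on the source host but missing on the destination."""
    src = cpu_features(src_xml)
    dst = cpu_features(dst_xml)
    return {name: policy for name, policy in src.items() if name not in dst}


# Shortened, hypothetical samples modelled on the dumps in this report.
SRC = """<domain><cpu mode='custom' match='exact' check='full'>
  <model fallback='forbid'>EPYC-IBPB</model>
  <feature policy='require' name='virt-ssbd'/>
  <feature policy='disable' name='monitor'/>
  <feature policy='require' name='hypervisor'/>
</cpu></domain>"""

DST = """<domain><cpu mode='custom' match='exact' check='full'>
  <model fallback='forbid'>EPYC-IBPB</model>
  <feature policy='disable' name='monitor'/>
  <feature policy='require' name='hypervisor'/>
</cpu></domain>"""

if __name__ == "__main__":
    print(lost_features(SRC, DST))  # {'virt-ssbd': 'require'}
```

A non-empty result after migration would indicate the same symptom as described above: a feature the guest was started with is no longer part of the destination's CPU definition.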