Bug 1897948

Summary: VM live migration fails with AMD: virt-ssbd flag gets lost on migration, guest VM freezes
Product: Red Hat Enterprise Linux 7 Reporter: Oliver Freyermuth <o.freyermuth>
Component: libvirt Assignee: Tim Wiederhake <twiederh>
Status: CLOSED WONTFIX QA Contact: Fangge Jin <fjin>
Severity: high Docs Contact:
Priority: unspecified    
Version: 7.9 CC: hu.zhou, jsuchane, juzhang, o.freyermuth, twiederh, wienemann, xiaohli, yalzhang, ymankad
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Last Closed: 2022-10-13 17:27:30 UTC Type: Bug

Description Oliver Freyermuth 2020-11-15 19:33:15 UTC
Description of problem:
After live migration of VMs on AMD-based hypervisors, guest VMs freeze, lock up, or otherwise misbehave before finally freezing, because the virt-ssbd flag is lost on migration.

Version-Release number of selected component (if applicable):
4.5.0-36.el7_9.2

How reproducible:
Always. 

Steps to Reproduce:
1. Start a VM on an AMD-based hypervisor using host-model CPU. 
2. Confirm it has the virt-ssbd flag set (either inside the VM, or via dumpxml, or in the log in /var/log/libvirt/qemu). 
3. Live-migrate it to another hypervisor with the same host CPU. 
4. Observe that the virt-ssbd flag gets lost, e.g. in the log in /var/log/libvirt/qemu, and that the guest freezes.
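
Steps 2 and 4 can be checked mechanically by comparing the <cpu> element of the live XML before and after migration. The following is a minimal sketch, assuming the output of "virsh dumpxml" has been captured on each host. The function names cpu_features and lost_features are illustrative and not part of libvirt, and the XML snippets are shortened, hypothetical examples.

```python
import xml.etree.ElementTree as ET


def cpu_features(domain_xml):
    """Return {feature name: policy} from the <cpu> element of a domain XML."""
    root = ET.fromstring(domain_xml)
    # Accept either a full <domain> document or a bare <cpu> element.
    cpu = root if root.tag == "cpu" else root.find("cpu")
    if cpu is None:
        return {}
    return {f.get("name"): f.get("policy") for f in cpu.findall("feature")}


def lost_features(before_xml, after_xml):
    """Features required before migration that are no longer required afterwards."""
    before = cpu_features(before_xml)
    after = cpu_features(after_xml)
    return sorted(
        name
        for name, policy in before.items()
        if policy == "require" and after.get(name) != "require"
    )


# Shortened, hypothetical snippets for illustration only.
SRC = """<domain type='kvm'><cpu mode='custom' match='exact'>
  <model fallback='forbid'>EPYC-IBPB</model>
  <feature policy='require' name='cmp_legacy'/>
  <feature policy='require' name='virt-ssbd'/>
  <feature policy='disable' name='svm'/>
</cpu></domain>"""
DST = """<domain type='kvm'><cpu mode='custom' match='exact'>
  <model fallback='forbid'>EPYC-IBPB</model>
  <feature policy='require' name='cmp_legacy'/>
  <feature policy='disable' name='svm'/>
</cpu></domain>"""

print(lost_features(SRC, DST))
```

A non-empty result after migration means that a required feature was dropped from the guest CPU definition.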

Actual results:
Guest freezes. 

Expected results:
Guest continues to run, flags are honoured during migration. 

Additional info:
This is a long-standing regression caused by https://bugzilla.redhat.com/show_bug.cgi?id=1745181 , and was fixed for managedsave via https://bugzilla.redhat.com/show_bug.cgi?id=1815572 , but it still breaks live migration in 7.9.

Comment 2 yalzhang@redhat.com 2020-11-17 01:21:08 UTC
Hi Oliver, just to confirm the qemu version: is it qemu-kvm-1.5.3?

Comment 3 yalzhang@redhat.com 2020-11-17 03:23:40 UTC
I have tried to reproduce the bug with the latest libvirt/qemu-kvm for RHEL 7; the details are below. The virt-ssbd flag disappears from the qemu command line and the guest live XML, but the guest does not freeze.

# rpm -q libvirt qemu-kvm kernel
libvirt-4.5.0-36.el7_9.3.x86_64
qemu-kvm-1.5.3-175.el7_9.1.x86_64
kernel-3.10.0-1160.6.1.el7.x86_64

guest kernel:
3.10.0-1160.el7.x86_64

1. Prepare 2 hosts with the same CPU "EPYC-IBPB";
2. Start the VM with CPU mode "host-model" on the source host:

# virsh dumpxml rhel
...
 <cpu mode='host-model' check='partial'>
    <model fallback='allow'/>
  </cpu>
...
# virsh start rhel

# virsh dumpxml rhel 
...
<cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>EPYC-IBPB</model>
    <vendor>AMD</vendor>
    <feature policy='disable' name='ht'/>
    <feature policy='disable' name='osxsave'/>
    <feature policy='require' name='cmp_legacy'/>
    <feature policy='disable' name='extapic'/>
    <feature policy='disable' name='skinit'/>
    <feature policy='disable' name='wdt'/>
    <feature policy='disable' name='tce'/>
    <feature policy='disable' name='topoext'/>
    <feature policy='disable' name='perfctr_core'/>
    <feature policy='disable' name='perfctr_nb'/>
 **** <feature policy='require' name='virt-ssbd'/> *****
    <feature policy='disable' name='monitor'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='disable' name='arat'/>
    <feature policy='disable' name='svm'/>
  </cpu>
...

Check the qemu command line:
# ps aux | grep qemu
...
 -cpu EPYC-IBPB,+ht,+osxsave,+cmp_legacy,+extapic,+skinit,+wdt,+tce,+topoext,+perfctr_core,+perfctr_nb,*** +virt-ssbd ***
...

3. Migrate to another host, then check on the dst host:
# ps aux |grep qemu
-cpu EPYC-IBPB,+ht,+osxsave,+cmp_legacy,+extapic,+skinit,+wdt,+tce,+topoext,+perfctr_core,+perfctr_nb  =====> no "+virt-ssbd" here

# virsh dumpxml rhel
...
  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>EPYC-IBPB</model>
    <vendor>AMD</vendor>
    <feature policy='disable' name='ht'/>
    <feature policy='disable' name='osxsave'/>
    <feature policy='require' name='cmp_legacy'/>
    <feature policy='disable' name='extapic'/>
    <feature policy='disable' name='skinit'/>
    <feature policy='disable' name='wdt'/>
    <feature policy='disable' name='tce'/>
    <feature policy='disable' name='topoext'/>
    <feature policy='disable' name='perfctr_core'/>
    <feature policy='disable' name='perfctr_nb'/>
    <feature policy='disable' name='monitor'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='disable' name='arat'/>
    <feature policy='disable' name='svm'/>  =======> no "<feature policy='require' name='virt-ssbd'/>" here 
  </cpu>
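
The difference between the two qemu command lines in steps 2 and 3 can also be extracted programmatically. This is only a sketch: the -cpu values are copied from the output above, while the helper names parse_cpu_arg and dropped_flags are made up for illustration.

```python
def parse_cpu_arg(cpu_arg):
    """Split a QEMU -cpu value into (model, set of signed flag tokens)."""
    model, *flags = cpu_arg.strip().split(",")
    # Keep the leading '+'/'-' so that a sign change also shows up as a difference.
    return model, {flag for flag in flags if flag}


def dropped_flags(src_arg, dst_arg):
    """Flag tokens present on the source command line but missing on the destination."""
    _, src_flags = parse_cpu_arg(src_arg)
    _, dst_flags = parse_cpu_arg(dst_arg)
    return sorted(src_flags - dst_flags)


SRC_CPU = ("EPYC-IBPB,+ht,+osxsave,+cmp_legacy,+extapic,+skinit,+wdt,+tce,"
           "+topoext,+perfctr_core,+perfctr_nb,+virt-ssbd")
DST_CPU = ("EPYC-IBPB,+ht,+osxsave,+cmp_legacy,+extapic,+skinit,+wdt,+tce,"
           "+topoext,+perfctr_core,+perfctr_nb")

print(dropped_flags(SRC_CPU, DST_CPU))
```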

Log in to the guest to check: the "ssbd" and "virt_ssbd" flags are still present, and the guest does not freeze, which is expected.

# lscpu 
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                1
On-line CPU(s) list:   0
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 1
Model name:            AMD EPYC Processor (with IBPB)
Stepping:              2
CPU MHz:               2096.060
BogoMIPS:              4192.12
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             64K
L1i cache:             64K
L2 cache:              512K
NUMA node0 CPU(s):     0
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm art rep_good nopl extd_apicid eagerfpu pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw retpoline_amd ****ssbd**** ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 ****virt_ssbd**** arat

# uname -a
Linux localhost.localdomain 3.10.0-1160.el7.x86_64 #1 SMP Tue Aug 18 14:50:17 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

4. Reboot the guest and check the CPU again: the ssbd and virt-ssbd flags disappear.
# reboot
# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                1
On-line CPU(s) list:   0
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 1
Model name:            AMD EPYC Processor (with IBPB)
Stepping:              2
CPU MHz:               2096.060
BogoMIPS:              4192.12
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             64K
L1i cache:             64K
L2 cache:              512K
NUMA node0 CPU(s):     0
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm art rep_good nopl extd_apicid eagerfpu pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw retpoline_amd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 arat

# lscpu | grep ssbd
(no output)
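
The flag comparison inside the guest (before and after the reboot) can be sketched as follows, assuming the lscpu output was saved at both points. The helper name lscpu_flags is illustrative, and the two sample outputs are shortened, hypothetical stand-ins for the full listings above.

```python
def lscpu_flags(lscpu_output):
    """Return the set of CPU flags from the 'Flags:' line of lscpu output."""
    for line in lscpu_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Flags":
            return set(value.split())
    return set()


# Shortened, hypothetical samples; the real outputs contain many more flags.
BEFORE_REBOOT = """Vendor ID:             AuthenticAMD
Flags:                 fpu vme ssbd ibpb vmmcall xgetbv1 virt_ssbd arat
"""
AFTER_REBOOT = """Vendor ID:             AuthenticAMD
Flags:                 fpu vme ibpb vmmcall xgetbv1 arat
"""

missing = sorted(lscpu_flags(BEFORE_REBOOT) - lscpu_flags(AFTER_REBOOT))
print(missing)
```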

Comment 4 Oliver Freyermuth 2020-11-17 08:28:52 UTC
Yes, this is with:

# rpm -q libvirt qemu-kvm kernel
libvirt-4.5.0-36.el7_9.2.x86_64
qemu-kvm-1.5.3-175.el7_9.1.x86_6
kernel-3.10.0-1160.2.2.el7.x86_64

Potentially, the actual freeze might be workload, guest OS and CPU model related? 
We used CentOS 8.2 in several of the hanging guests. 

I have now run several migration tests, and it seems the guest freezes do not occur with 100 % probability. Sometimes the guest freezes immediately, sometimes a few seconds after migration, and in one case I only got a kernel trace and the VM continued running:
 kernel: unchecked MSR access error: WRMSR to 0xc001011f (tried to write 0x0000000000000004) at rIP: 0xffffffffb8661f14 (native_write_msr+0x4/0x20)
 kernel: Call Trace:
 kernel:  __switch_to_xtra+0x2e1/0x5e0
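
For readers unfamiliar with the MSR number in this trace: to my knowledge, 0xc001011f is what the Linux kernel sources call MSR_AMD64_VIRT_SPEC_CTRL, and the written value 0x4 is bit 2, the SSBD bit. This would be consistent with the guest still trying to use virt-ssbd after the hypervisor stopped exposing it. The sketch below decodes such a log line; the constant values reflect my reading of the kernel headers, and the function name decode_wrmsr is made up for illustration.

```python
import re

# Assumed values, as I understand the Linux kernel's msr-index.h.
MSR_AMD64_VIRT_SPEC_CTRL = 0xC001011F
SPEC_CTRL_SSBD = 1 << 2

TRACE_RE = re.compile(
    r"WRMSR to (0x[0-9a-fA-F]+) \(tried to write (0x[0-9a-fA-F]+)\)"
)


def decode_wrmsr(line):
    """Extract MSR number and value from an 'unchecked MSR access error' line."""
    match = TRACE_RE.search(line)
    if match is None:
        return None
    msr = int(match.group(1), 16)
    value = int(match.group(2), 16)
    return {
        "msr": msr,
        "value": value,
        "is_virt_spec_ctrl": msr == MSR_AMD64_VIRT_SPEC_CTRL,
        "ssbd_bit_set": bool(value & SPEC_CTRL_SSBD),
    }


LINE = ("kernel: unchecked MSR access error: WRMSR to 0xc001011f "
        "(tried to write 0x0000000000000004) at rIP: 0xffffffffb8661f14 "
        "(native_write_msr+0x4/0x20)")

print(decode_wrmsr(LINE))
```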

What seems to help to reproduce the crashes is to run something inside the VM during the migration. I was quite successful with
 while true; do date; sleep 1; done

For reference, this is with a:
Model name:          AMD Opteron 63xx class CPU
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic popcnt aes xsave avx f16c hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw xop fma4 tbm ssbd ibpb vmmcall bmi1 virt_ssbd arat

In general, I would expect that losing the flag from the hypervisor configuration while it is still present in the VM is reason enough for potentially undefined behaviour, no?

Comment 9 Tim Wiederhake 2022-10-13 17:27:30 UTC
I was unable to reproduce this issue with a newer version of libvirt, 8.8.0.

As RHEL 7 is currently in the Maintenance Support 2 phase, I have to close this issue as WONTFIX.