Bug 1897948 - VM live migration fails with AMD: virt-ssbd flag gets lost on migration, guest VM freezes
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libvirt
Version: 7.9
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Tim Wiederhake
QA Contact: Fangge Jin
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-11-15 19:33 UTC by Oliver Freyermuth
Modified: 2022-10-13 17:27 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-10-13 17:27:30 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1745181 0 urgent CLOSED virt-ssbd not included on CPU mode='host-model' 2023-09-07 20:27:44 UTC
Red Hat Bugzilla 1815572 0 unspecified CLOSED VM live migration fails: the CPU is incompatible with host CPU: Host CPU does not provide required features: virt-ssbd 2022-09-02 06:30:38 UTC

Description Oliver Freyermuth 2020-11-15 19:33:15 UTC
Description of problem:
After live migration of VMs on AMD-based hypervisors, guest VMs lock up or otherwise misbehave before finally freezing, because the virt-ssbd flag is lost on migration.

Version-Release number of selected component (if applicable):
4.5.0-36.el7_9.2

How reproducible:
Always. 

Steps to Reproduce:
1. Start a VM on an AMD-based hypervisor using host-model CPU. 
2. Confirm it has the virt-ssbd flag set (either inside the VM, or via dumpxml, or in the log in /var/log/libvirt/qemu). 
3. Live-migrate it to another hypervisor with the same host CPU. 
4. Observe the virt-ssbd flag gets lost e.g. in the log in /var/log/libvirt/qemu, and the guest freezes. 
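Steps 2 and 4 rely on spotting the flag in the QEMU command line that libvirt writes to the log. A minimal sketch of that check, assuming the `-cpu` argument uses the `model,+flag,-flag` form; the helper name `cpu_flags` and the sample command line are illustrative and not taken from a real log:

```python
import shlex

def cpu_flags(cmdline):
    """Parse the -cpu argument of a QEMU command line.

    Returns (model, enabled, disabled), or None if there is no -cpu argument.
    Only the "+flag" / "-flag" spelling is handled; "flag=on" is ignored.
    """
    args = shlex.split(cmdline)
    if "-cpu" not in args or args.index("-cpu") + 1 >= len(args):
        return None
    spec = args[args.index("-cpu") + 1]
    model, *props = spec.split(",")
    enabled = {p[1:] for p in props if p.startswith("+")}
    disabled = {p[1:] for p in props if p.startswith("-")}
    return model, enabled, disabled

# Hypothetical, shortened command line for illustration only.
SAMPLE = "/usr/libexec/qemu-kvm -name guest=rhel -cpu EPYC-IBPB,+ht,+cmp_legacy,+virt-ssbd -m 1024"

if __name__ == "__main__":
    model, enabled, _ = cpu_flags(SAMPLE)
    print(model, "virt-ssbd" in enabled)
```

Running the same check against the command line logged on the source and on the destination host would show whether the flag survived the migration.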

Actual results:
Guest freezes. 

Expected results:
Guest continues to run, flags are honoured during migration. 

Additional info:
This is a long-standing regression caused by https://bugzilla.redhat.com/show_bug.cgi?id=1745181 , and was fixed for managedsave via https://bugzilla.redhat.com/show_bug.cgi?id=1815572 , but it still breaks live migration in 7.9.

Comment 2 yalzhang@redhat.com 2020-11-17 01:21:08 UTC
Hi Oliver, just to confirm the qemu version: is it qemu-kvm-1.5.3?

Comment 3 yalzhang@redhat.com 2020-11-17 03:23:40 UTC
I have tried to reproduce the bug with the latest libvirt/qemu-kvm for RHEL 7; the details are below. The virt-ssbd flag disappears from the qemu command line and from the guest's live XML, but the guest does not freeze.

# rpm -q libvirt qemu-kvm kernel
libvirt-4.5.0-36.el7_9.3.x86_64
qemu-kvm-1.5.3-175.el7_9.1.x86_64
kernel-3.10.0-1160.6.1.el7.x86_64

guest kernel:
3.10.0-1160.el7.x86_64

1. Prepare 2 hosts with the same CPU model, "EPYC-IBPB".
2. Start the VM with CPU mode "host-model" on the source host:

# virsh dumpxml rhel
...
 <cpu mode='host-model' check='partial'>
    <model fallback='allow'/>
  </cpu>
...
# virsh start rhel

# virsh dumpxml rhel 
...
<cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>EPYC-IBPB</model>
    <vendor>AMD</vendor>
    <feature policy='disable' name='ht'/>
    <feature policy='disable' name='osxsave'/>
    <feature policy='require' name='cmp_legacy'/>
    <feature policy='disable' name='extapic'/>
    <feature policy='disable' name='skinit'/>
    <feature policy='disable' name='wdt'/>
    <feature policy='disable' name='tce'/>
    <feature policy='disable' name='topoext'/>
    <feature policy='disable' name='perfctr_core'/>
    <feature policy='disable' name='perfctr_nb'/>
 **** <feature policy='require' name='virt-ssbd'/> *****
    <feature policy='disable' name='monitor'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='disable' name='arat'/>
    <feature policy='disable' name='svm'/>
  </cpu>
...

check the qemu cmd line
# ps aux | grep qemu
...
 -cpu EPYC-IBPB,+ht,+osxsave,+cmp_legacy,+extapic,+skinit,+wdt,+tce,+topoext,+perfctr_core,+perfctr_nb,*** +virt-ssbd ***
...

3. Migrate to another host, then check on the dst host:
# ps aux |grep qemu
-cpu EPYC-IBPB,+ht,+osxsave,+cmp_legacy,+extapic,+skinit,+wdt,+tce,+topoext,+perfctr_core,+perfctr_nb  =====> no "+virt-ssbd" here

# virsh dumpxml rhel
...
  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>EPYC-IBPB</model>
    <vendor>AMD</vendor>
    <feature policy='disable' name='ht'/>
    <feature policy='disable' name='osxsave'/>
    <feature policy='require' name='cmp_legacy'/>
    <feature policy='disable' name='extapic'/>
    <feature policy='disable' name='skinit'/>
    <feature policy='disable' name='wdt'/>
    <feature policy='disable' name='tce'/>
    <feature policy='disable' name='topoext'/>
    <feature policy='disable' name='perfctr_core'/>
    <feature policy='disable' name='perfctr_nb'/>
    <feature policy='disable' name='monitor'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='disable' name='arat'/>
    <feature policy='disable' name='svm'/>  =======> no "<feature policy='require' name='virt-ssbd'/>" here 
  </cpu>
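The difference between the two `<cpu>` dumps can also be found mechanically. A minimal sketch, assuming both dumps have been saved as strings; the helper names and the shortened XML fragments are illustrative, and only the feature names and policies are compared:

```python
import xml.etree.ElementTree as ET

def cpu_features(xml_text):
    """Map feature name -> policy for every <feature> element in the XML."""
    root = ET.fromstring(xml_text)
    return {f.get("name"): f.get("policy") for f in root.iter("feature")}

def lost_features(src_xml, dst_xml):
    """Return the features present on the source but absent on the destination."""
    src = cpu_features(src_xml)
    dst = cpu_features(dst_xml)
    return {name: policy for name, policy in src.items() if name not in dst}

# Shortened fragments modelled on the dumps in this comment.
SRC = """<cpu mode='custom' match='exact' check='full'>
  <model fallback='forbid'>EPYC-IBPB</model>
  <feature policy='disable' name='perfctr_nb'/>
  <feature policy='require' name='virt-ssbd'/>
  <feature policy='disable' name='monitor'/>
</cpu>"""

DST = """<cpu mode='custom' match='exact' check='full'>
  <model fallback='forbid'>EPYC-IBPB</model>
  <feature policy='disable' name='perfctr_nb'/>
  <feature policy='disable' name='monitor'/>
</cpu>"""

if __name__ == "__main__":
    print(lost_features(SRC, DST))
```

With these fragments the only entry reported is `virt-ssbd` with policy `require`, matching the observation above.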

Log in to the guest to check: the "ssbd" and "virt_ssbd" flags are still present, and the guest does not freeze, which is expected.

# lscpu 
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                1
On-line CPU(s) list:   0
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 1
Model name:            AMD EPYC Processor (with IBPB)
Stepping:              2
CPU MHz:               2096.060
BogoMIPS:              4192.12
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             64K
L1i cache:             64K
L2 cache:              512K
NUMA node0 CPU(s):     0
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm art rep_good nopl extd_apicid eagerfpu pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw retpoline_amd ****ssbd**** ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 ****virt_ssbd**** arat

# uname -a
Linux localhost.localdomain 3.10.0-1160.el7.x86_64 #1 SMP Tue Aug 18 14:50:17 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

4. Reboot the guest and check the CPU again: the ssbd and virt_ssbd flags disappear.
# reboot
# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                1
On-line CPU(s) list:   0
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 1
Model name:            AMD EPYC Processor (with IBPB)
Stepping:              2
CPU MHz:               2096.060
BogoMIPS:              4192.12
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             64K
L1i cache:             64K
L2 cache:              512K
NUMA node0 CPU(s):     0
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm art rep_good nopl extd_apicid eagerfpu pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw retpoline_amd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 arat

# lscpu | grep ssbd
(no output)
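The two `lscpu` outputs can be compared as sets rather than by eye. A minimal sketch, assuming the output has been captured as text before and after the reboot; the helper name `lscpu_flags` is illustrative and the sample flag lists are shortened:

```python
def lscpu_flags(lscpu_output):
    """Return the set of CPU flags from the 'Flags:' line of lscpu output."""
    for line in lscpu_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Flags":
            return set(value.split())
    return set()

# Shortened samples; the real flag lists are much longer.
BEFORE = (
    "Model name:            AMD EPYC Processor (with IBPB)\n"
    "Flags:                 fpu vme retpoline_amd ssbd ibpb xgetbv1 virt_ssbd arat\n"
)
AFTER = (
    "Model name:            AMD EPYC Processor (with IBPB)\n"
    "Flags:                 fpu vme retpoline_amd ibpb xgetbv1 arat\n"
)

if __name__ == "__main__":
    # Flags the guest saw before the reboot but not after it.
    print(sorted(lscpu_flags(BEFORE) - lscpu_flags(AFTER)))  # ['ssbd', 'virt_ssbd']
```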

Comment 4 Oliver Freyermuth 2020-11-17 08:28:52 UTC
Yes, this is with:

# rpm -q libvirt qemu-kvm kernel
libvirt-4.5.0-36.el7_9.2.x86_64
qemu-kvm-1.5.3-175.el7_9.1.x86_64
kernel-3.10.0-1160.2.2.el7.x86_64

Potentially, the actual freeze might be workload, guest OS and CPU model related? 
We used CentOS 8.2 in several of the hanging guests. 

I did several migration tests now, it seems the guest freezes do not appear with 100 % probability. Sometimes, the guest freezes immediately, sometimes a few seconds after migration, and in one case I only got a kernel trace and the VM continued running:
 kernel: unchecked MSR access error: WRMSR to 0xc001011f (tried to write 0x0000000000000004) at rIP: 0xffffffffb8661f14 (native_write_msr+0x4/0x20)
 kernel: Call Trace:
 kernel:  __switch_to_xtra+0x2e1/0x5e0
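The MSR number and value in that trace can be decoded. In the Linux kernel headers, 0xc001011f is MSR_AMD64_VIRT_SPEC_CTRL and bit 2 of the SPEC_CTRL layout is SSBD, so the failed write looks like the guest trying to turn SSBD on through the virtualized interface. A minimal sketch of the decoding; the helper name `decode_wrmsr` is illustrative and the regular expression only targets this message format:

```python
import re

# Constants as named in the Linux kernel's msr-index.h.
MSR_AMD64_VIRT_SPEC_CTRL = 0xC001011F
SPEC_CTRL_SSBD = 1 << 2  # Speculative Store Bypass Disable

WRMSR_RE = re.compile(
    r"WRMSR to (0x[0-9a-fA-F]+) \(tried to write (0x[0-9a-fA-F]+)\)"
)

def decode_wrmsr(line):
    """Extract the MSR and value from an 'unchecked MSR access error' line."""
    m = WRMSR_RE.search(line)
    if m is None:
        return None
    msr = int(m.group(1), 16)
    value = int(m.group(2), 16)
    return {
        "msr": msr,
        "value": value,
        "is_virt_spec_ctrl": msr == MSR_AMD64_VIRT_SPEC_CTRL,
        "sets_ssbd": bool(value & SPEC_CTRL_SSBD),
    }

LINE = (
    "kernel: unchecked MSR access error: WRMSR to 0xc001011f "
    "(tried to write 0x0000000000000004) at rIP: 0xffffffffb8661f14 "
    "(native_write_msr+0x4/0x20)"
)

if __name__ == "__main__":
    print(decode_wrmsr(LINE))
```

For the line above, both `is_virt_spec_ctrl` and `sets_ssbd` come out true, which appears consistent with the guest still believing virt-ssbd is available after the destination stopped exposing it.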

What seems to help reproduce the crashes is running something inside the VM during the migration. I was quite successful with
 while true; do date; sleep 1; done

For reference, this is with a:
Model name:          AMD Opteron 63xx class CPU
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic popcnt aes xsave avx f16c hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw xop fma4 tbm ssbd ibpb vmmcall bmi1 virt_ssbd arat

In general, I would expect the loss of the flag from the hypervisor configuration while it is still present in the VM is reason enough for potentially undefined behaviour, no?

Comment 9 Tim Wiederhake 2022-10-13 17:27:30 UTC
I was unable to reproduce this issue with a newer version of libvirt, 8.8.0.

As RHEL 7 is currently in the Maintenance Support 2 phase, I have to close this issue as WONTFIX.

