Bug 1592276 - EPYC-IBPB not working with Windows 1803
Summary: EPYC-IBPB not working with Windows 1803
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: x86_64
OS: Linux
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Duplicates: 1615160
Depends On:
Blocks:
Reported: 2018-06-18 10:24 UTC by Michael Lipp
Modified: 2019-07-29 17:53 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-18 18:12:24 UTC
Type: Bug
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1593190 1 None None None 2021-01-20 06:05:38 UTC

Internal Links: 1593190 1716306

Description Michael Lipp 2018-06-18 10:24:21 UTC
Description of problem:

Windows 1803 does not boot from the (downloaded) installation DVD with the predefined configuration for a Windows 10 guest (BSOD "System Thread exception not handled").

Version-Release number of selected component (if applicable):

libvirt is 3.7.0-4 running on fc27.

I have an EPYC 7351p. lscpu gives:

fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 arat

How reproducible:

The virtual machine created when selecting Windows 10 as guest configures cpu EPYC-IBPB. This does not work.

Strangely enough, providing the EPYC-IBPB configuration as a delta to an Opteron_G5 (using the information from cpu.xml) works, i.e. this boots:

  <cpu mode='custom' match='exact' check='partial'>
    <model fallback='forbid'>Opteron_G5</model>
    <topology sockets='1' cores='4' threads='1'/>
    <feature policy='disable' name='xop'/>
    <feature policy='disable' name='fma4'/>
    <feature policy='disable' name='tbm'/>
    <feature policy='optional' name='adx'/>
    <feature policy='optional' name='arat'/>
    <feature policy='optional' name='avx2'/>
    <feature policy='optional' name='bmi1'/>
    <feature policy='optional' name='bmi2'/>
    <feature policy='optional' name='clflushopt'/>
    <feature policy='optional' name='cr8legacy'/>
    <feature policy='optional' name='fsgsbase'/>
    <feature policy='optional' name='fxsr_opt'/>
    <feature policy='optional' name='ibpb'/>
    <feature policy='optional' name='mmxext'/>
    <feature policy='optional' name='monitor'/>
    <feature policy='optional' name='movbe'/>
    <feature policy='optional' name='osvw'/>
    <feature policy='optional' name='rdrand'/>
    <feature policy='optional' name='rdseed'/>
    <feature policy='optional' name='sha-ni'/>
    <feature policy='optional' name='smap'/>
    <feature policy='optional' name='smep'/>
    <feature policy='optional' name='vme'/>
    <feature policy='optional' name='xgetbv1'/>
    <feature policy='optional' name='xsavec'/>
    <feature policy='optional' name='xsaveopt'/>
  </cpu>
 
I added the features one by one, hoping to identify the culprit, but they all work. It must be something more subtle that prevents the Windows 1803 guest from working with the EPYC model. Or I didn't get the cpu/feature syntax right.

Comment 1 Daniel Berrangé 2018-06-18 10:31:46 UTC
(In reply to Michael Lipp from comment #0)
> 
> The virtual machine created when selecting Windows 10 as guest configures
> cpu EPYC-IBPB. This does not work.
> 
> Strange enough, providing the EPYC-IBPB configuration as a delta to an
> Opteron_G5 (using the information from cpu.xml) works, i.e. this boots:

This changes the CPU family reported to the guest, which suggests the guest OS has some specific logic it's applying for the EPYC family that it doesn't apply for the Opteron_G5 family.

There was someone on the QEMU IRC channel last week complaining of the same problem.  They suggested that adding the 'virt-ssbd' feature fixed the problem.

eg

   <cpu mode='custom'>
     <model fallback='forbid'>EPYC</model>
     <feature policy='require' name='virt-ssbd'/>
   </cpu>

This 'virt-ssbd' feature is for fixing a recent CPU vulnerability on x86. Unfortunately that feature is not yet available in the Fedora RPMs, so it's not easy to test it right now.

Comment 2 Daniel Berrangé 2018-06-18 14:51:27 UTC
(In reply to Michael Lipp from comment #0)
> Description of problem:
> 
> Windows 1803 does not boot from the (downloaded) installation DVD with the
> predefined configuration for a Windows 10 guest (BSOD "System Thread
> exception not handled").
> 
> Version-Release number of selected component (if applicable):
> 
> libvirt is 3.7.0-4 running on fc27.

[snip]

> The virtual machine created when selecting Windows 10 as guest configures
> cpu EPYC-IBPB. This does not work.

Are you sure that is correct? Neither libvirt 3.7.0 nor QEMU in Fedora 27 provides an EPYC CPU model or an EPYC-IBPB CPU model, so I'm not sure how you can be using them with standard Fedora packages.

Have you installed newer libvirt/qemu by chance ?

Comment 3 Jon Masters 2018-06-18 14:58:52 UTC
Can you confirm a baseline EPYC without the -IBPB variant is booting ok?

Comment 4 Michael Lipp 2018-06-18 16:48:00 UTC
(In reply to Daniel Berrange from comment #2)
> Are you sure that is correct - neither libvirt 3.7.0, or QEMU in Fedora 27
> provide an EPYC CPU, nor a EPYC-IBPB  CPU, so I'm not sure how you can be
> using them with standard Fedora packages.
> 
> Have you installed newer libvirt/qemu by chance ?

Very sorry about that. The server is newly installed and *FC28*. libvirt/qemu are the vanilla fc28 packages.

(My everyday workhorse guest system running on the server is still fc27, which must be the reason for my confusion.)

Comment 5 Michael Lipp 2018-06-18 18:13:21 UTC
(In reply to Jon Masters from comment #3)
> Can you confirm a baseline EPYC without the -IBPB variant is booting ok?

Quite the opposite. Configuring EPYC instead of EPYC-IBPB was the first thing I tried to work around the problem. Doesn't work with EPYC either.

Comment 6 Laszlo Ersek 2018-06-18 18:37:41 UTC
Michael,

can you check your host dmesg when the guest crashes?

Also, can you try to load the "kvm" module with "ignore_msrs=1"? (See this thread on vfio-users: <https://www.redhat.com/archives/vfio-users/2018-May/msg00004.html>.)

Thanks.

Comment 7 Daniel Berrangé 2018-06-18 18:50:06 UTC
(In reply to Michael Lipp from comment #4)
> (In reply to Daniel Berrange from comment #2)
> > Are you sure that is correct - neither libvirt 3.7.0, or QEMU in Fedora 27
> > provide an EPYC CPU, nor a EPYC-IBPB  CPU, so I'm not sure how you can be
> > using them with standard Fedora packages.
> > 
> > Have you installed newer libvirt/qemu by chance ?
> 
> Very sorry about that. The server is newly installed and *FC28*.
> libvirt/qemu are the vanilla fc28 packages.
> 
> (My everyday working-horse guest system running on the server is still fc27,
> must be the reason for my confusion.)


No worries, that's in fact good ! 

As Laszlo asks, could you check if you see anything in dmesg / systemd journal that is related to KVM, and/or any warnings in /var/log/libvirt/qemu/$GUEST.log when you get the crashed guest.

I've just built updates for Fedora 28 that provide the new virt-ssbd feature flag 

  https://bodhi.fedoraproject.org/updates/qemu-2.11.1-3.fc28
  https://bodhi.fedoraproject.org/updates/libvirt-4.1.0-3.fc28

If you install those and ensure you're running kernel >= 4.16.10-301 then you should be able to add the virt-ssbd feature flag to your guest XML CPU config. It shouldn't require any microcode changes to use virt-ssbd.

Assuming our hypothesis is correct, virt-ssbd feature should fix the guest, but the ignore_msrs=1 suggestion may well also fix it. Would be good if you are able to confirm both.
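
When confirming the ignore_msrs=1 result, it helps to verify that the parameter is really active in the loaded "kvm" module before booting the guest. The following is a minimal sketch, not part of the original report: it assumes the parameter is exposed at the usual sysfs location /sys/module/kvm/parameters/ignore_msrs and reads back as "Y" or "N". The function name and the accepted values are illustrative assumptions.

```python
from pathlib import Path

# Assumed sysfs location of the kvm module parameter (standard module
# parameter layout; adjust if your system differs).
DEFAULT_PARAM_PATH = Path("/sys/module/kvm/parameters/ignore_msrs")


def ignore_msrs_enabled(param_path=DEFAULT_PARAM_PATH):
    """Return True or False for the parameter state, or None if unreadable.

    None usually means the kvm module is not loaded or the file is not
    accessible to the current user.
    """
    try:
        value = Path(param_path).read_text().strip()
    except OSError:
        return None
    # Boolean module parameters normally read back as "Y" or "N";
    # "1" is accepted as well to be lenient.
    return value in ("Y", "y", "1")
```

Calling ignore_msrs_enabled() on the host returns None if the module is not loaded, so a None result should be treated as "unknown" rather than "disabled".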

Comment 8 Michael Lipp 2018-06-18 21:39:21 UTC
1) Nothing special in the logs (just the usual startup messages). Actually, there is no "crash" of the guest. The guest enters a reboot loop with the BSOD, which it performs very reliably and without any messages showing up in the journal.

2) "kvm" module with "ignore_msrs=1" fixes things.

3) I've downloaded the new RPMs and updated all packages that had been installed:

# rpm -qa | fgrep qemu-
ipxe-roms-qemu-20170710-3.git0600d3ae.fc28.noarch
qemu-block-nfs-2.11.1-3.fc28.x86_64
qemu-system-x86-core-2.11.1-3.fc28.x86_64
qemu-kvm-2.11.1-3.fc28.x86_64
qemu-block-iscsi-2.11.1-3.fc28.x86_64
qemu-block-rbd-2.11.1-3.fc28.x86_64
qemu-img-2.11.1-3.fc28.x86_64
qemu-common-2.11.1-3.fc28.x86_64
qemu-block-curl-2.11.1-3.fc28.x86_64
qemu-block-gluster-2.11.1-3.fc28.x86_64
qemu-block-ssh-2.11.1-3.fc28.x86_64
qemu-system-x86-2.11.1-3.fc28.x86_64
libvirt-daemon-driver-qemu-4.1.0-3.fc28.x86_64
qemu-block-dmg-2.11.1-3.fc28.x86_64

# rpm -qa | fgrep libvirt-
libvirt-daemon-4.1.0-3.fc28.x86_64
libvirt-daemon-config-nwfilter-4.1.0-3.fc28.x86_64
libvirt-daemon-driver-storage-mpath-4.1.0-3.fc28.x86_64
libvirt-daemon-driver-libxl-4.1.0-3.fc28.x86_64
libvirt-daemon-driver-nodedev-4.1.0-3.fc28.x86_64
libvirt-daemon-driver-storage-logical-4.1.0-3.fc28.x86_64
libvirt-daemon-driver-xen-4.1.0-3.fc28.x86_64
libvirt-devel-4.1.0-3.fc28.x86_64
libvirt-glib-1.0.0-5.fc28.x86_64
libvirt-daemon-driver-network-4.1.0-3.fc28.x86_64
libvirt-daemon-driver-secret-4.1.0-3.fc28.x86_64
libvirt-daemon-config-network-4.1.0-3.fc28.x86_64
libvirt-daemon-driver-storage-sheepdog-4.1.0-3.fc28.x86_64
libvirt-daemon-driver-storage-rbd-4.1.0-3.fc28.x86_64
libvirt-daemon-driver-vbox-4.1.0-3.fc28.x86_64
libvirt-client-4.1.0-3.fc28.x86_64
python2-libvirt-4.1.0-1.fc28.x86_64
libvirt-libs-4.1.0-3.fc28.x86_64
libvirt-daemon-driver-nwfilter-4.1.0-3.fc28.x86_64
libvirt-daemon-driver-interface-4.1.0-3.fc28.x86_64
libvirt-daemon-driver-storage-gluster-4.1.0-3.fc28.x86_64
libvirt-daemon-driver-storage-disk-4.1.0-3.fc28.x86_64
libvirt-daemon-driver-storage-iscsi-4.1.0-3.fc28.x86_64
libvirt-daemon-driver-uml-4.1.0-3.fc28.x86_64
libvirt-4.1.0-3.fc28.x86_64
libvirt-daemon-driver-qemu-4.1.0-3.fc28.x86_64
libvirt-daemon-driver-storage-scsi-4.1.0-3.fc28.x86_64
libvirt-daemon-driver-storage-4.1.0-3.fc28.x86_64
libvirt-daemon-kvm-4.1.0-3.fc28.x86_64
libvirt-daemon-driver-storage-core-4.1.0-3.fc28.x86_64
libvirt-daemon-driver-lxc-4.1.0-3.fc28.x86_64
libvirt-daemon-driver-storage-zfs-4.1.0-3.fc28.x86_64
libvirt-bash-completion-4.1.0-3.fc28.x86_64

I restarted the system to make sure that everything is "in place". Then I've changed the configuration to:

  <cpu mode='custom' match='exact' check='partial'>
    <model fallback='forbid'>EPYC-IBPB</model>
    <topology sockets='1' cores='4' threads='1'/>
    <feature policy='require' name='virt-ssbd'/>
  </cpu>

This does NOT fix the problem.

Comment 9 Laszlo Ersek 2018-06-18 21:50:28 UTC
Hello Michael,

your results are fully consistent with my own test results.

The issue seems to be that the Windows guest attempts to read the MSR (Model Specific Register) C001_102C. From a KVM trace I captured:

             CPU-14681 [017]  2685.221364: kvm_msr:              msr_read c001102c = 0x0 (#GP)

Note the "#GP".
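
For longer traces, the faulting accesses can be picked out mechanically. This is only a sketch, assuming every kvm_msr event has the same shape as the line quoted above and that non-faulting events simply lack the "(#GP)" suffix; the function names are illustrative.

```python
import re

# Matches the payload of a kvm_msr trace event, e.g.
#   kvm_msr:   msr_read c001102c = 0x0 (#GP)
KVM_MSR_RE = re.compile(
    r"kvm_msr:\s+msr_(?P<op>read|write)\s+(?P<msr>[0-9a-fA-F]+)\s+=\s+"
    r"0x(?P<value>[0-9a-fA-F]+)(?P<fault>\s+\(#GP\))?"
)


def parse_kvm_msr(line):
    """Parse one trace line; return a dict, or None if it is not a kvm_msr event."""
    m = KVM_MSR_RE.search(line)
    if m is None:
        return None
    return {
        "op": m.group("op"),                     # "read" or "write"
        "msr": int(m.group("msr"), 16),          # MSR index
        "value": int(m.group("value"), 16),      # data read or written
        "faulted": m.group("fault") is not None, # True if "(#GP)" was logged
    }


def faulting_msrs(lines):
    """Return the sorted, de-duplicated MSR indices whose access raised #GP."""
    found = set()
    for line in lines:
        event = parse_kvm_msr(line)
        if event is not None and event["faulted"]:
            found.add(event["msr"])
    return sorted(found)
```

Feeding the captured trace through faulting_msrs() and formatting the results with hex() gives a short list of MSRs to look up in the AMD documentation.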

According to the latest publicly available AMD documentation [*], there is no such MSR. So this looks like a Windows bug to me. (Or, maybe a KVM bug that misleads Windows to read this MSR? I'm unsure.)

[*] See AMD publication "Preliminary Processor Programming Reference (PPR) for AMD Family 17h Models 00h-0Fh Processors" <https://developer.amd.com/resources/developer-guides-manuals/>, section "2.1.12.5 MSRs - MSRC001_1xxx". The last MSR before C001_102C is C001_1027 (DR0_ADDR_MASK, [Address Mask For DR0 Breakpoints]), while the first MSR after C001_102C is C001_1030 (IBS_FETCH_CTL, [IBS Fetch Control]).

Comment 10 Eduardo Habkost 2018-06-20 11:46:24 UTC
There's one match on Google for "MSRc001102c": http://dev.exherbo.org/~arkanoid/atlas-dmesg-3.2.5-20120209-mtrr.txt

So it looks like this MSR exists and is readable on some hosts; we can ask AMD for help figuring out what it's supposed to contain.

Comment 11 Jon Masters 2018-06-27 00:26:04 UTC
Should we ping AMD about this?

Comment 14 Bruce Campbell 2018-07-10 16:21:48 UTC
(In reply to Jon Masters from comment #3)
> Can you confirm a baseline EPYC without the -IBPB variant is booting ok?

I can confirm it is not.

Comment 15 Bruce Campbell 2018-07-10 16:23:49 UTC
(In reply to Jon Masters from comment #3)
> Can you confirm a baseline EPYC without the -IBPB variant is booting ok?

I can confirm it is not.

(In reply to Daniel Berrange from comment #7)
> (In reply to Michael Lipp from comment #4)
> > (In reply to Daniel Berrange from comment #2)
> > > Are you sure that is correct - neither libvirt 3.7.0, or QEMU in Fedora 27
> > > provide an EPYC CPU, nor a EPYC-IBPB  CPU, so I'm not sure how you can be
> > > using them with standard Fedora packages.
> > > 
> > > Have you installed newer libvirt/qemu by chance ?
> > 
> > Very sorry about that. The server is newly installed and *FC28*.
> > libvirt/qemu are the vanilla fc28 packages.
> > 
> > (My everyday working-horse guest system running on the server is still fc27,
> > must be the reason for my confusion.)
> 
> 
> No worries, that's in fact good ! 
> 
> As Laszlo asks, could you check if you see anything in dmesg / systemd
> journal that is related to KVM, and/or any warnings in
> /var/log/libvirt/qemu/$GUEST.log when you get the crashed guest.
> 
> I've just built updates for Fedora 28 that provide the new virt-ssbd feature
> flag 
> 
>   https://bodhi.fedoraproject.org/updates/qemu-2.11.1-3.fc28
>   https://bodhi.fedoraproject.org/updates/libvirt-4.1.0-3.fc28
> 
> If you install those and ensure you're running kernel >= 4.16.10-301 then
> you should be able to add the virt-ssbd feature flag to your guest XML CPU
> config. It shouldn't require any microcode changes to use virt-ssbd.
> 
> Assuming our hypothesis is correct, virt-ssbd feature should fix the guest,
> but the ignore_msrs=1 suggestion may well also fix it. Would be good if you
> are able to confirm both.

I can confirm that on a Threadripper 1920X the ignore_msrs workaround doesn't work. The system hangs, then BSODs with a watchdog timeout.

Comment 16 Daniel Berrangé 2018-08-13 08:16:03 UTC
AMD has confirmed this MSR access is acceptable, albeit unexpected, so the KVM kernel module needs to be enhanced to handle this MSR (probably by ignoring writes to it).

Comment 17 Daniel Berrangé 2018-08-13 08:16:52 UTC
*** Bug 1615160 has been marked as a duplicate of this bug. ***

Comment 18 Bruce Campbell 2018-08-29 20:08:29 UTC
Has there been any motion on this?

Comment 19 Laura Abbott 2018-10-01 21:20:28 UTC
We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 28 kernel bugs.
 
Fedora 28 has now been rebased to 4.18.10-300.fc28.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
 
If you have moved on to Fedora 29, and are still experiencing this issue, please change the version to Fedora 29.
 
If you experience different issues, please open a new bug report for those.

Comment 20 Mikhail 2018-10-09 17:09:28 UTC
Issue still reproduced on
$ uname -r
4.19.0-0.rc6.git4.1.fc30.x86_64

Comment 21 Cole Robinson 2018-10-09 17:25:02 UTC
There's a RHEL bug tracking this as well: https://bugzilla.redhat.com/show_bug.cgi?id=1593190

Comment 22 Daniel Berrangé 2019-02-18 18:12:24 UTC
FYI this was fixed upstream in:

commit 0e1b869fff60c81b510c2d00602d778f8f59dd9a
Author: Eduardo Habkost <ehabkost>
Date:   Mon Dec 17 22:34:18 2018 -0200

    kvm: x86: Add AMD's EX_CFG to the list of ignored MSRs
    
    Some guests OSes (including Windows 10) write to MSR 0xc001102c
    on some cases (possibly while trying to apply a CPU errata).
    Make KVM ignore reads and writes to that MSR, so the guest won't
    crash.
    
    The MSR is documented as "Execution Unit Configuration (EX_CFG)",
    at AMD's "BIOS and Kernel Developer's Guide (BKDG) for AMD Family
    15h Models 00h-0Fh Processors".
    
    Cc: stable.org
    Signed-off-by: Eduardo Habkost <ehabkost>
    Signed-off-by: Paolo Bonzini <pbonzini>


which is part of the Linux v4.20 release. This version is already shipped in Fedora updates, so I'm going to mark this as closed now.
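
A rough way to tell whether a host kernel is at or past the release that carries this commit is to compare the major.minor prefix of the `uname -r` string against 4.20. This is a heuristic sketch only: the commit is Cc'ed to stable, so older kernel series may have received it as a backport, and the function names here are illustrative.

```python
import re

# First mainline release containing the EX_CFG fix, per the comment above.
FIX_RELEASE = (4, 20)


def kernel_version(release):
    """Extract (major, minor) from a `uname -r` style string."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(m.group(1)), int(m.group(2))


def at_or_after_fix_release(release):
    """True if the kernel's major.minor is 4.20 or later.

    A False result does not prove the fix is absent, because stable
    backports are not visible in the version prefix.
    """
    return kernel_version(release) >= FIX_RELEASE
```

For example, the 4.19.0-0.rc6.git4.1.fc30.x86_64 kernel from comment 20 evaluates to False, which matches the report that the issue was still reproducible there.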

