Bug 1640140 - cpu mode=host-model causes guest kernel panics for AMD hosts
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: libvirt
Version: 29
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Libvirt Maintainers
QA Contact: Fedora Extras Quality Assurance
 
Reported: 2018-10-17 12:21 UTC by Lukas Ruzicka
Modified: 2018-12-04 03:01 UTC (History)

Last Closed: 2018-12-04 03:01:36 UTC
Type: Bug


Attachments:
AMD FX journalctl (22.18 KB, text/plain), 2018-10-17 12:21 UTC, Lukas Ruzicka
AMD FX lscpu (1.33 KB, text/plain), 2018-10-17 12:21 UTC, Lukas Ruzicka
AMD FX virsh dumpxml (4.58 KB, text/plain), 2018-10-17 12:22 UTC, Lukas Ruzicka
AMD A10 journalctl (22.43 KB, text/plain), 2018-10-17 12:22 UTC, Lukas Ruzicka
AMD A10 lscpu (1.30 KB, text/plain), 2018-10-17 12:24 UTC, Lukas Ruzicka
AMD A10 virsh dumpxml (4.60 KB, text/plain), 2018-10-17 12:25 UTC, Lukas Ruzicka
AMD FX domcapabilities (4.79 KB, text/plain), 2018-10-18 09:02 UTC, Kamil Páral
AMD A10 domcapabilities (4.88 KB, text/plain), 2018-10-18 09:02 UTC, Kamil Páral
AMD FX VM panic (9.68 KB, text/plain), 2018-10-18 09:02 UTC, Kamil Páral
AMD A10 VM panic (9.68 KB, text/plain), 2018-10-18 09:02 UTC, Kamil Páral


Description Lukas Ruzicka 2018-10-17 12:21:12 UTC
Created attachment 1494816 [details]
AMD FX journalctl

Description of problem:

In Fedora 29, the CPU configuration appears to be set to "Copy host CPU configuration" by default. This setting works fine on Intel-based machines, but it fails to set a correct CPU configuration on some AMD chipsets. As a result, the guest OS kernel panics, and it is impossible to install or run it in virt-manager. On our two AMD machines, virt-manager thinks the CPU is an Opteron-G4 (one machine) or an Opteron-G5 (the other machine), although neither machine has an Opteron processor.

The workarounds for this problem are:
a) Change the CPU configuration to the correct setting, or to some generic value (kvm64).
b) Create the VM in gnome-boxes, which does not have this problem.
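
Workaround (a) can also be applied to a saved domain definition outside the GUI. The following is a minimal Python sketch, assuming the domain XML has the usual libvirt layout with `<cpu>` directly under `<domain>`. The function name `set_custom_cpu` and the sample XML are hypothetical; the `match='exact'` and `fallback='allow'` attributes are taken from the working configurations quoted later in this report.

```python
import xml.etree.ElementTree as ET


def set_custom_cpu(domain_xml, model="kvm64"):
    """Return domain XML whose <cpu> element pins a fixed CPU model.

    Hypothetical helper for illustration; it only edits the XML text and
    does not talk to libvirt.
    """
    root = ET.fromstring(domain_xml)
    old = root.find("cpu")

    # Build the replacement <cpu mode='custom'> element.
    cpu = ET.Element("cpu", {"mode": "custom", "match": "exact"})
    ET.SubElement(cpu, "model", {"fallback": "allow"}).text = model

    if old is None:
        root.append(cpu)
    else:
        # Keep the element's position and trailing whitespace.
        cpu.tail = old.tail
        index = list(root).index(old)
        root.remove(old)
        root.insert(index, cpu)
    return ET.tostring(root, encoding="unicode")


# Sample input modelled on the host-model definition in this report.
EXAMPLE = """<domain type='kvm'>
  <name>demo</name>
  <cpu mode='host-model' check='partial'>
    <model fallback='allow'/>
  </cpu>
</domain>"""

print(set_custom_cpu(EXAMPLE))
```

The result would still have to be loaded back into libvirt (for example by pasting it into `virsh edit`), which this sketch does not do.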

Version-Release number of selected component (if applicable):

Fedora 29
virt-manager 2.0.0
libvirt 4.7.0

How reproducible:

Always on selected AMD machines

Steps to Reproduce:
1. Run virt-manager.
2. Create a new virtual machine using the wizard.
3. Try to install a Fedora ISO in that machine.

Actual results:

Because of an incorrect CPU configuration, the guest OS will kernel panic in that machine.

Expected results:

Running the OS in the VM works normally.

Additional info:

The attached logs show information from two of our testing machines that use an AMD CPU, named CM and zalman.

Comment 1 Lukas Ruzicka 2018-10-17 12:21:46 UTC
Created attachment 1494817 [details]
AMD FX lscpu

Comment 2 Lukas Ruzicka 2018-10-17 12:22:13 UTC
Created attachment 1494818 [details]
AMD FX virsh dumpxml

Comment 3 Lukas Ruzicka 2018-10-17 12:22:39 UTC
Created attachment 1494819 [details]
AMD A10 journalctl

Comment 4 Lukas Ruzicka 2018-10-17 12:24:19 UTC
Created attachment 1494820 [details]
AMD A10 lscpu

Comment 5 Lukas Ruzicka 2018-10-17 12:25:21 UTC
Created attachment 1494822 [details]
AMD A10 virsh dumpxml

Comment 6 Fedora Blocker Bugs Application 2018-10-17 12:27:20 UTC
Proposed as a Blocker for 29-final by Fedora user lruzicka using the blocker tracking app because:

 I propose this bug as Fedora final blocker because of the following criteria:

The release must be able to host virtual guest instances of the same release.

Comment 7 Stephen Gallagher 2018-10-17 16:13:13 UTC
Can you define "some" chipsets? I see that it happens on Opteron-G4 and -G5... any others?

This is definitely a serious bug, but it has a workaround (manually select a different CPU in the configuration) and it can be fixed in an update. I'm +1 FE for now, but unless we discover it has a wider effect than described so far, I'd probably not block on it at Go/No-Go.

Comment 8 Cole Robinson 2018-10-17 18:55:18 UTC
virt-manager 2.0.0 is in updates-testing but not F29 updates proper... does this reproduce with the virt-manager 1.6.0 git snapshot version? We did change the CPU default in 2.0.0

Can you provide 'sudo virsh domcapabilities' from the working and non-working host, and also the actual kernel panic from inside the VM

Comment 9 Kamil Páral 2018-10-18 07:46:17 UTC
(In reply to Stephen Gallagher from comment #7)
> Can you define "some" chipsets? I see that it happens on Opteron-G4 and
> -G5... any others?

No, it happens on these two (as indicated by the lscpu attachments):

AMD FX(tm)-4100 Quad-Core Processor
AMD A10-7870K Radeon R7, 12 Compute Cores 4C+8G

and for these two, virt-manager forwards them as "Opteron" into the VM (I have no idea whether that is the right thing to do, but the VM definitely doesn't boot).

AMD FX and AMD A10 are pretty common AMD processors (only recently succeeded by Ryzen) that are most probably used by many users.

Comment 10 Kamil Páral 2018-10-18 09:00:20 UTC
(In reply to Cole Robinson from comment #8)
> virt-manager 2.0.0 is in updates-testing but not F29 updates proper... does
> this reproduce with the virt-manager 1.6.0 git snapshot version? We did
> change the CPU default in 2.0.0

Good catch. I can confirm that VMs created by virt-manager 1.6, as currently present in the F29 stable repo, work fine on both machines. New VMs have the following CPU model defined:

AMD FX:
  <cpu mode='custom' match='exact' check='partial'>
    <model fallback='allow'>Opteron_G4</model>
  </cpu>

AMD A10:
  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>Opteron_G5</model>
    <feature policy='require' name='vme'/>
    <feature policy='require' name='x2apic'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='disable' name='rdtscp'/>
    <feature policy='disable' name='svm'/>
  </cpu>

Notice that these also contain the Opteron models. But the configuration above works perfectly (even with virt-manager 2.0).

However, new VMs under virt-manager 2.0 get this CPU model defined on both machines:

  <cpu mode='host-model' check='partial'>
    <model fallback='allow'/>
  </cpu>

And that doesn't work (under either virt-manager 2.0 or virt-manager 1.6), even though, when the VMs are running, the CPU model dynamically changes in the VM properties dialog (GUI) to Opteron_G4 and Opteron_G5.

> 
> Can you provide 'sudo virsh domcapabilities' from the working and
> non-working host, and also the actual kernel panic from inside the VM

I'm attaching domcapabilities from both machines and the panic (looks the same) from both machines. 

I don't know what you mean by the "working host". So far this doesn't work on 2 out of 2 AMD machines we have, and works on all Intel machines we have. Should I pick a random Intel machine and attach the same data as the "working host"?

Comment 11 Kamil Páral 2018-10-18 09:02:17 UTC
Created attachment 1495206 [details]
AMD FX domcapabilities

Comment 12 Kamil Páral 2018-10-18 09:02:26 UTC
Created attachment 1495207 [details]
AMD A10 domcapabilities

Comment 13 Kamil Páral 2018-10-18 09:02:44 UTC
Created attachment 1495208 [details]
AMD FX VM panic

Comment 14 Kamil Páral 2018-10-18 09:02:56 UTC
Created attachment 1495209 [details]
AMD A10 VM panic

Comment 15 Kamil Páral 2018-10-18 09:20:50 UTC
I searched a bit about the CPU families: AMD FX-4100 is from Bulldozer family from 2011, AMD A10-7870K is from Godavari (Kaveri-refresh) family from 2015.

I won't be able to attend today's blocker review meeting, so I'm voting +1 blocker here, under the assumption that this probably affects many (if not all) of AMD's reasonably recent desktop processors (it would be great if somebody could test Ryzen).

Comment 16 Stephen Gallagher 2018-10-18 09:49:42 UTC
(In reply to Kamil Páral from comment #10)
> (In reply to Cole Robinson from comment #8)
> > virt-manager 2.0.0 is in updates-testing but not F29 updates proper... does
> > this reproduce with the virt-manager 1.6.0 git snapshot version? We did
> > change the CPU default in 2.0.0
> 
> Good catch. I can confirm that VMs created by virt-manager 1.6 as present in
> F29 stables repo atm works fine on both machines. New VMs have the following
> CPU model defined in there:
> 
> AMD FX:
>   <cpu mode='custom' match='exact' check='partial'>
>     <model fallback='allow'>Opteron_G4</model>
>   </cpu>
> 
> AMD A10:
>   <cpu mode='custom' match='exact' check='full'>
>     <model fallback='forbid'>Opteron_G5</model>
>     <feature policy='require' name='vme'/>
>     <feature policy='require' name='x2apic'/>
>     <feature policy='require' name='hypervisor'/>
>     <feature policy='disable' name='rdtscp'/>
>     <feature policy='disable' name='svm'/>
>   </cpu>
> 
> Notice that these also contain the Opteron models. But the configuration
> above works perfectly (even with virt-manager 2.0).
> 
> However, new VMs under virt-manager 2.0 get this CPU model defined on both
> machines:
> 
>   <cpu mode='host-model' check='partial'>
>     <model fallback='allow'/>
>   </cpu>
> 
> And that doesn't work (under virt-manager 2.0 nor virt-manager 1.6), even
> though when the VMs are running, the CPU model dynamically changes in VM
> properties dialog (GUI) to Opteron_G4 and Opteron_G5.
> 

Given this information, this cannot be a blocker because the 2.0.0 isn’t part of the repos today. We should negkarma the update to 2.0.0 and ask that it not go stable until this is fixed, of course.

Comment 17 Kamil Páral 2018-10-18 11:59:00 UTC
> Given this information, this cannot be a blocker because the 2.0.0 isn’t part of the repos today. We should negkarma the update to 2.0.0 and ask that it not go stable until this is fixed, of course.

I forgot. Yes, that's correct. I'm removing the blocker nomination. Here's the Bodhi update (that breaks AMD):
https://bodhi.fedoraproject.org/updates/FEDORA-2018-30bb9dbd67

Comment 18 Cole Robinson 2018-10-18 15:25:12 UTC
(In reply to Kamil Páral from comment #10)
> > 
> > Can you provide 'sudo virsh domcapabilities' from the working and
> > non-working host, and also the actual kernel panic from inside the VM
> 
> I'm attaching domcapabilities from both machines and the panic (looks the
> same) from both machines. 
> 
> I don't know what you mean by the "working host". So far this doesn't work
> on 2 out of 2 AMD machines we have, and works on all Intel machines we have.
> Should I pick a random Intel machine and attach the same data as the
> "working host"?

I misunderstood and thought one of the AMD hosts was working with virt-manager 2.0.0. domcapabilities for both AMD hosts is what I was after

Comment 19 Cole Robinson 2018-10-18 15:29:52 UTC
FX machine domcapabilities is reporting:

    <mode name='host-model' supported='yes'>
      <model fallback='forbid'>Opteron_G4</model>
      <vendor>AMD</vendor>
      <feature policy='require' name='vme'/>
      <feature policy='require' name='x2apic'/>
      <feature policy='require' name='tsc-deadline'/>
      <feature policy='require' name='hypervisor'/>
      <feature policy='require' name='arat'/>
      <feature policy='require' name='tsc_adjust'/>
      <feature policy='require' name='mmxext'/>
      <feature policy='require' name='fxsr_opt'/>
      <feature policy='require' name='cmp_legacy'/>
      <feature policy='require' name='cr8legacy'/>
      <feature policy='require' name='osvw'/>
      <feature policy='require' name='topoext'/>
      <feature policy='require' name='perfctr_core'/>
      <feature policy='require' name='invtsc'/>
      <feature policy='require' name='ibpb'/>
      <feature policy='require' name='virt-ssbd'/>
    </mode>

A10 machine domcapabilities is reporting:

    <mode name='host-model' supported='yes'>
      <model fallback='forbid'>Opteron_G5</model>
      <vendor>AMD</vendor>
      <feature policy='require' name='vme'/>
      <feature policy='require' name='x2apic'/>
      <feature policy='require' name='tsc-deadline'/>
      <feature policy='require' name='hypervisor'/>
      <feature policy='require' name='arat'/>
      <feature policy='require' name='fsgsbase'/>
      <feature policy='require' name='tsc_adjust'/>
      <feature policy='require' name='bmi1'/>
      <feature policy='require' name='xsaveopt'/>
      <feature policy='require' name='mmxext'/>
      <feature policy='require' name='fxsr_opt'/>
      <feature policy='require' name='cmp_legacy'/>
      <feature policy='require' name='cr8legacy'/>
      <feature policy='require' name='osvw'/>
      <feature policy='require' name='topoext'/>
      <feature policy='require' name='perfctr_core'/>
      <feature policy='require' name='invtsc'/>
      <feature policy='require' name='virt-ssbd'/>
    </mode>


These CPUs are detected/reported to us by qemu+libvirt, so moving to libvirt for further triage. I can work around this in virt-manager if needed by switching to the new behavior, but let's see if it's a simple fix.

Jirka, Eduardo, have you heard any issues with AMD and libvirt host-model?
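
For anyone comparing hosts, the host-model section of `virsh domcapabilities` output can be summarised programmatically. The following is a rough Python sketch, assuming the standard `<domainCapabilities>/<cpu>/<mode>` layout. The function name `host_model_summary` and the trimmed sample are hypothetical; the sample keeps only a few of the features from the FX listing above.

```python
import xml.etree.ElementTree as ET


def host_model_summary(domcaps_xml):
    """Return (model name, sorted required features) for the host-model CPU.

    Returns (None, []) when host-model is missing or not supported.
    """
    root = ET.fromstring(domcaps_xml)
    mode = root.find("./cpu/mode[@name='host-model']")
    if mode is None or mode.get("supported") != "yes":
        return None, []
    model = mode.findtext("model")
    required = sorted(
        feature.get("name")
        for feature in mode.findall("feature")
        if feature.get("policy") == "require"
    )
    return model, required


# Trimmed-down sample modelled on the FX output quoted above.
SAMPLE = """<domainCapabilities>
  <cpu>
    <mode name='host-passthrough' supported='yes'/>
    <mode name='host-model' supported='yes'>
      <model fallback='forbid'>Opteron_G4</model>
      <vendor>AMD</vendor>
      <feature policy='require' name='x2apic'/>
      <feature policy='require' name='topoext'/>
      <feature policy='require' name='invtsc'/>
    </mode>
  </cpu>
</domainCapabilities>"""

model, required = host_model_summary(SAMPLE)
print(model, required)
print("topoext required:", "topoext" in required)
```

Running the same function over the output from two hosts makes it easy to diff the feature sets, as was done by eye for the FX and A10 machines here.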

Comment 20 Cole Robinson 2018-10-18 15:31:30 UTC
Also can one of the reproducers check mode=host-passthrough as well? Stop the VM, sudo virsh edit $vmname, replace 'host-model' with 'host-passthrough', save+exit+start VM
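
The manual edit described above amounts to changing a single attribute. As an illustrative Python sketch (the function name `switch_cpu_mode` and the sample XML are hypothetical, and the code only transforms XML text rather than calling libvirt):

```python
import xml.etree.ElementTree as ET


def switch_cpu_mode(domain_xml, old="host-model", new="host-passthrough"):
    """Return domain XML with <cpu mode=old> changed to <cpu mode=new>.

    The input is returned unchanged when there is no <cpu> element or its
    mode is not `old`.
    """
    root = ET.fromstring(domain_xml)
    cpu = root.find("cpu")
    if cpu is None or cpu.get("mode") != old:
        return domain_xml
    cpu.set("mode", new)
    return ET.tostring(root, encoding="unicode")


# Sample input modelled on the failing definition in comment 10.
FAILING = """<domain type='kvm'>
  <name>demo</name>
  <cpu mode='host-model' check='partial'>
    <model fallback='allow'/>
  </cpu>
</domain>"""

print(switch_cpu_mode(FAILING))
```

Only the `mode` attribute is touched; whether the remaining attributes are accepted as-is by a given libvirt version is not checked here.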

Comment 21 Eduardo Habkost 2018-10-19 00:45:47 UTC
(In reply to Cole Robinson from comment #19)
> Jirka, Eduardo, have you heard any issues with AMD and libvirt host-model?

I didn't hear of issues on host-model, but it's possible that host-model has the same bug that was present in QEMU's "host" CPU model: TOPOEXT might be available on the host (and reported on CPU model "max", that is used for probing for CPU feature availability and not for running VMs), but it doesn't mean that it's safe to enable unconditionally on any CPU model (hence not enabled in CPU model "host", that is used for actually running VMs).

See QEMU commit:

commit 7210a02c58572b2686a3a8d610c6628f87864aed
Author: Eduardo Habkost <ehabkost@redhat.com>
Date:   Thu Aug 9 19:18:52 2018 -0300

    i386: Disable TOPOEXT by default on "-cpu host"
    
    Enabling TOPOEXT is always allowed, but it can't be enabled
    blindly by "-cpu host" because it may make guests crash if the
    rest of the cache topology information isn't provided or isn't
    consistent.
    
    This addresses the bug reported at:
    https://bugzilla.redhat.com/show_bug.cgi?id=1613277
    
    Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
    Message-Id: <20180809221852.15285-1-ehabkost@redhat.com>
    Tested-by: Richard W.M. Jones <rjones@redhat.com>
    Reviewed-by: Babu Moger <babu.moger@amd.com>
    Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
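
If TOPOEXT is indeed the trigger, one way to probe the hypothesis from the libvirt side might be to keep host-model but explicitly disable that one feature. The sketch below is an untested assumption rather than the fix that was shipped: it assumes that libvirt accepts `<feature>` elements under `mode='host-model'`, and the function name `disable_feature` and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET


def disable_feature(domain_xml, feature="topoext"):
    """Return domain XML whose <cpu> element explicitly disables `feature`."""
    root = ET.fromstring(domain_xml)
    cpu = root.find("cpu")
    if cpu is None:
        raise ValueError("domain XML has no <cpu> element")
    # Drop any existing policy for this feature, then add an explicit disable.
    for existing in cpu.findall("feature"):
        if existing.get("name") == feature:
            cpu.remove(existing)
    ET.SubElement(cpu, "feature", {"policy": "disable", "name": feature})
    return ET.tostring(root, encoding="unicode")


# Sample input modelled on the failing host-model definition in this report.
HOST_MODEL = """<domain type='kvm'>
  <name>demo</name>
  <cpu mode='host-model' check='partial'>
    <model fallback='allow'/>
  </cpu>
</domain>"""

print(disable_feature(HOST_MODEL))
```

Whether this avoids the guest panic on the affected hosts was not tested in this report.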

Comment 22 Cole Robinson 2018-10-19 16:07:18 UTC
Thanks, I'll put that in a scratch build once builds are completing... hitting unbound dep issues right now

Comment 23 Lukas Ruzicka 2018-10-22 12:21:30 UTC
Hello,
so, on both machines, I tried the workaround as suggested by Comment 20. The virtual machine booted without any problems and there was no kernel panic seen. 

So the "host-passthrough" option did the trick.

Comment 24 Cole Robinson 2018-10-23 08:00:51 UTC
Lukas or another reproducer, please try this scratch build and check to see if host-model works or not

https://koji.fedoraproject.org/koji/taskinfo?taskID=30398028

Comment 25 Kamil Páral 2018-10-23 14:23:23 UTC
Jlanda tested for us that this bug doesn't affect AMD Ryzen machines (even using the original builds, not considering the new scratch build).

Comment 26 František Zatloukal 2018-11-05 10:41:25 UTC
(In reply to Cole Robinson from comment #24)
> Lukas or another reproducer, please try this scratch build and check to see
> if host-model works or not
> 
> https://koji.fedoraproject.org/koji/taskinfo?taskID=30398028

It fixes the issue (at least on AMD FX(tm)-4100 Quad-Core Processor ). Thanks!

Comment 27 Kamil Páral 2018-11-07 12:37:46 UTC
(In reply to František Zatloukal from comment #26)
> It fixes the issue (at least on AMD FX(tm)-4100 Quad-Core Processor ).
> Thanks!

Can you please test the other machine (A10) as well?

Comment 28 Lukas Ruzicka 2018-11-08 15:30:45 UTC
The packages for this scratch build have been deleted, so now we do not have a way to test on the other machine. Perhaps, when the fix appears as an update in Bodhi?

Comment 29 Cole Robinson 2018-11-15 17:54:42 UTC
Yeah I'll do an official build with the patch soon and then we can follow up

Comment 30 Fedora Update System 2018-11-17 13:44:22 UTC
qemu-3.0.0-2.fc29 has been submitted as an update to Fedora 29. https://bodhi.fedoraproject.org/updates/FEDORA-2018-87f2ace20d

Comment 31 Fedora Update System 2018-11-18 05:21:07 UTC
qemu-3.0.0-2.fc29 has been pushed to the Fedora 29 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-87f2ace20d

Comment 32 Lukas Ruzicka 2018-11-19 15:29:50 UTC
Hello, I am able to run VM installations on the reported AMD machines, so I believe that this bug can be considered verified.

Comment 33 Fedora Update System 2018-12-04 03:01:36 UTC
qemu-3.0.0-2.fc29 has been pushed to the Fedora 29 stable repository. If problems still persist, please make note of it in this bug report.

