Bug 1683471 - getDomainCapabilities claims SEV is supported for pc-i440fx-1.4 machine type
Summary: getDomainCapabilities claims SEV is supported for pc-i440fx-1.4 machine type
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Virtualization Tools
Classification: Community
Component: libvirt
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Libvirt Maintainers
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-02-26 23:12 UTC by Adam Spiers
Modified: 2019-04-26 15:04 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-07 11:44:38 UTC
Embargoed:



Description Adam Spiers 2019-02-26 23:12:31 UTC
Description of problem:

virsh domcapabilities claims that SEV is supported for legacy i440FX machine types, but in reality q35 is required.  It has been suggested that this might be a symptom of a wider problem with the logic in getDomainCapabilities().

Version-Release number of selected component (if applicable):

# rpm -q libvirt
libvirt-4.0.0-15.3.x86_64

How reproducible:

Run:

    virsh domcapabilities --virttype kvm --emulatorbin /usr/bin/qemu-kvm \
        --arch x86_64 --machine pc-i440fx-1.4 \
        | xq /domainCapabilities/features/sev/@supported

Actual results:

<results>
  <result>yes</result>
</results>

Expected results:

<results>
  <result>no</result>
</results>

Additional info:

Tested on SLES 12 SP4.

Comment 1 Adam Spiers 2019-02-26 23:35:43 UTC
IIRC, the experts told me that SEV does not have a *direct* dependency on q35, only an indirect one: SEV requires iommu_platform, which was introduced in virtio 1.0, and q35 guarantees virtio 1.0 rather than 0.9.  So maybe this explains the observed behaviour.

Comment 2 Adam Spiers 2019-02-27 10:07:11 UTC
Since this was tested on SLES 12 SP4 which is a stable release, I've submitted a corresponding downstream bug: https://bugzilla.suse.com/show_bug.cgi?id=1127139

Comment 3 Daniel Berrangé 2019-02-27 10:19:25 UTC
The i440fx machine type is capable of supporting virtio-1.0; however, in PCI slots the devices will operate in transitional mode (both 1.0 and 0.9 enabled in parallel, and the guest driver decides which to use - modern guests will use 1.0).  Only in PCIe slots do devices get forced to virtio 1.0.

Comment 4 Erik Skultety 2019-02-27 14:12:36 UTC
In any case, the thing with capabilities is that we probe them via QMP, and QEMU doesn't report feature availability according to the machine type - or better said, it doesn't consider the machine type at all. We call a bunch of QMP commands, like getting the QAPI schema, which libvirt then traverses to enable/disable specific features.

Comment 5 Daniel Berrangé 2019-02-27 14:19:13 UTC
Yes, QMP feature probing wrt machine type specific features is largely unfixable from QEMU's pov. Probing takes place with a "none" machine type, QEMU can't introspect specific machine types without instantiating them, and we don't want to run QEMU once for every machine as that would explode the probing time.

Comment 6 Adam Spiers 2019-02-27 14:58:53 UTC
(In reply to Erik Skultety from comment #4)
> In any case, the thing with capabilities is that we probe them via QMP, and
> QEMU doesn't report feature availability according to the machine type - or
> better said, it doesn't consider the machine type at all. We call a bunch of
> QMP commands, like getting the QAPI schema, which libvirt then traverses to
> enable/disable specific features.

Is there any way to roughly summarise the expected variations in the features returned from getDomainCapabilities which would depend on machine type?

Let me provide some more context to clarify the motivation for this question ...

I suspect that the getDomainCapabilities API was designed with the expectation that it would be called just before defining a domain from XML, at which point the machine type would be known.  Unfortunately this paradigm does not work with OpenStack nova, where the hypervisor's capabilities need to be detected well in advance, so that for example the OpenStack placement service knows which compute nodes out of a cloud of (say) 500 machines are capable of SEV and can ensure that security-sensitive workloads are only launched within that subset of machines.

This means that nova needs to call virConnectGetDomainCapabilities() during the nova-compute service's initialization phase.  At first sight it seems that to do this correctly, the API should be called once for each of the possible (arch, machinetype, virttype) tuples, and the results stored.  (I excluded the emulatorbin parameter since it can be deduced from the others, and also excluded the flags parameter since it's currently unused.)  But that's a lot of API calls, and maybe it's overkill?  Since there are many machine types, it would be helpful if it was possible to make some generalizations to reduce the number of API calls, but maybe that depends on which features we need to detect?
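
To make that concrete, the brute-force approach I have in mind would look roughly like this - an untested sketch using the libvirt Python bindings, with an obviously abbreviated tuple list (the real one would have to be derived from the host capabilities):

    import xml.etree.ElementTree as ET
    import libvirt

    conn = libvirt.open('qemu:///system')

    # Illustrative subset only; in reality this would be every supported
    # (arch, machinetype, virttype) combination reported by the host.
    tuples = [
        ('x86_64', 'pc-i440fx-1.4', 'kvm'),
        ('x86_64', 'q35', 'kvm'),
    ]

    for arch, machine, virttype in tuples:
        # emulatorbin=None lets libvirt pick the default binary for the arch
        caps_xml = conn.getDomainCapabilities(None, arch, machine, virttype, 0)
        sev = ET.fromstring(caps_xml).find('./features/sev')
        supported = sev is not None and sev.get('supported') == 'yes'
        print(arch, machine, virttype, 'SEV:', supported)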

Any advice on how to approach this pragmatically would be much appreciated - thanks a lot!

Comment 7 Daniel Berrangé 2019-02-27 15:20:07 UTC
(In reply to Adam Spiers from comment #6)
> (In reply to Erik Skultety from comment #4)
> > In any case, the thing with capabilities is that we probe them via QMP, and
> > QEMU doesn't report feature availability according to the machine type - or
> > better said, it doesn't consider the machine type at all. We call a bunch of
> > QMP commands, like getting the QAPI schema, which libvirt then traverses to
> > enable/disable specific features.
> 
> Is there any way to roughly summarise the expected variations in the
> features returned from getDomainCapabilities which would depend on machine
> type?

Very little of the data reported here varies per machine. Looking at the /current/ code, only the max vcpu count and the type of disks vary (i.e. we don't report floppy or IDE support with some machines).  Apps shouldn't rely on the current impl though - we're continually making this data more fine-grained as problems are reported.  That said, we are limited by what QEMU can report, so some of this stuff is basically hand-written in libvirt.

> I suspect that the getDomainCapabilities API was designed with the
> expectation that it would be called just before defining a domain from XML,
> at which point the machine type would be known.  Unfortunately this paradigm
> does not work with OpenStack nova, where the hypervisor's capabilities need
> to be detected well in advance, so that for example the OpenStack placement
> service knows which compute nodes out of a cloud of (say) 500 machines are
> capable of SEV and can ensure that security-sensitive workloads are only
> launched within that subset of machines.
> 
> This means that nova needs to call virConnectGetDomainCapabilities() during
> the nova-compute service's initialization phase.  At first sight it seems
> that to do this correctly, the API should be called once for each of the
> possible (arch, machinetype, virttype) tuples, and the results stored.  (I
> excluded the emulatorbin parameter since it can be deduced from the others,
> and also excluded the flags parameter since it's currently unused.)  But
> that's a lot of API calls, and maybe it's overkill?  Since there are many
> machine types, it would be helpful if it was possible to make some
> generalizations to reduce the number of API calls, but maybe that depends on
> which features we need to detect?

Since SEV support isn't reported per machine type, calling domainCapabilities for every machine type is overkill.

The caveat, though, is that we don't guarantee this: future domain capabilities might report SEV in a more fine-grained way.

In theory you can just look for /dev/sev existing, but there's no strong guarantee that the QEMU binary will actually then support it.
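
E.g. the crude check is nothing more than this (an illustrative sketch; as said, it only tells you the kernel side is there, not that the QEMU binary can use it):

    import os

    # /dev/sev is exposed by the kernel's AMD secure processor (ccp) driver.
    # Its presence is necessary for SEV but says nothing about QEMU support,
    # so domcapabilities should still be consulted as well.
    host_kernel_has_sev = os.path.exists('/dev/sev')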

> Any advice on how to approach this pragmatically would be much appreciated -
> thanks a lot!

Comment 8 Jim Fehlig 2019-02-27 21:56:03 UTC
(In reply to Adam Spiers from comment #6)
> This means that nova needs to call virConnectGetDomainCapabilities() during
> the nova-compute service's initialization phase.  At first sight it seems
> that to do this correctly, the API should be called once for each of the
> possible (arch, machinetype, virttype) tuples, and the results stored.  (I
> excluded the emulatorbin parameter since it can be deduced from the others,
> and also excluded the flags parameter since it's currently unused.)  But
> that's a lot of API calls, and maybe it's overkill?  Since there are many
> machine types, it would be helpful if it was possible to make some
> generalizations to reduce the number of API calls, but maybe that depends on
> which features we need to detect?

Daniel already mentioned that a call for each machine type is overkill. I think a call for each arch is sufficient. And in practice I suspect the only interesting arch is the host one.
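
In code that would boil down to roughly the following (an untested sketch with the libvirt Python bindings; the XPath expressions are assumptions based on the capabilities / domcapabilities schemas):

    import xml.etree.ElementTree as ET
    import libvirt

    conn = libvirt.open('qemu:///system')

    # One call to discover the host architecture...
    host_arch = ET.fromstring(conn.getCapabilities()).findtext('./host/cpu/arch')

    # ...and a single domainCapabilities call for that arch, letting libvirt
    # pick the default emulator binary and machine type.
    caps_xml = conn.getDomainCapabilities(None, host_arch, None, 'kvm', 0)
    sev = ET.fromstring(caps_xml).find('./features/sev')
    print('SEV supported:', sev is not None and sev.get('supported') == 'yes')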

Comment 9 Erik Skultety 2019-03-07 11:44:38 UTC
(In reply to Daniel Berrange from comment #3)
> The i440fx machine type is capable of supporting virtio-1.0; however, in PCI
> slots the devices will operate in transitional mode (both 1.0 and 0.9 enabled
> in parallel, and the guest driver decides which to use - modern guests will
> use 1.0).  Only in PCIe slots do devices get forced to virtio 1.0.

I verified that setting model virtio-non-transitional on the virtio devices (libvirt 5.2.0 needed) indeed allowed me to start an i440fx SEV machine. My first attempt certainly wasn't as smooth as with a Q35 machine, though (I might have forgotten to tweak something, and the machine froze during shutdown, no idea why). Having booted a SEV VM successfully, I'm closing this as NOTABUG.

Comment 10 Adam Spiers 2019-03-07 15:38:46 UTC
Thanks for all the info and advice folks - very helpful!

Comment 11 Michal Privoznik 2019-04-26 09:37:26 UTC
Please note that caching domain capabilities at startup is not desirable. Domain capabilities (including SEV) depend on the qemu binary installed, and even if all five arguments of getDomainCapabilities() stay the same, the returned set of capabilities may vary depending on which binary is installed. For instance, upgrading the qemu binary to a newer one (where the older did not support SEV and the newer one does) might result in 'SEV' being reported in the domain capabilities XML even though none of the arguments changed. Or vice versa - downgrading qemu might result in losing some features. That is the reason why you should never ever cache domain capabilities, especially if you are not in charge of the package update process (which IIUC Nova is not).

I've seen Adam's patch here:

https://review.opendev.org/#/c/655268/2

but I don't have an account to comment there.

Comment 12 Adam Spiers 2019-04-26 15:04:40 UTC
(In reply to Michal Privoznik from comment #11)
> Please note that caching domain capabilities at startup is not desirable.
> Domain capabilities (including SEV) depend on the qemu binary installed, and
> even if all five arguments of getDomainCapabilities() stay the same, the
> returned set of capabilities may vary depending on which binary is installed.
> For instance, upgrading the qemu binary to a newer one (where the older did
> not support SEV and the newer one does) might result in 'SEV' being reported
> in the domain capabilities XML even though none of the arguments changed. Or
> vice versa - downgrading qemu might result in losing some features. That is
> the reason why you should never ever cache domain capabilities, especially if
> you are not in charge of the package update process (which IIUC Nova is not).

This should not be an issue in nova-compute: even though nova is not in charge of package updates itself, it is well understood[1] that if the underlying hypervisor stack is changed, whichever deployment solution *is* in charge of package updates needs to restart nova-compute anyway, which will wipe out the memoised capabilities returned from the API.  I've just uploaded a new patch set which explicitly documents this:

    https://review.opendev.org/#/c/655268/3/nova/virt/libvirt/host.py@722

Of course persisting the capabilities to a disk cache would have been problematic, but I'm not proposing that.
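
To be explicit, by "memoised" I just mean in-process caching along these lines (an illustrative sketch, not the actual nova code; the class and method names here are made up):

    import libvirt

    class Host(object):
        """Illustrative only - not the real nova.virt.libvirt.host.Host."""

        def __init__(self, uri='qemu:///system'):
            self._conn = libvirt.open(uri)
            self._domcaps = {}  # lives only as long as the nova-compute process

        def get_domain_capabilities(self, arch, machine, virttype):
            key = (arch, machine, virttype)
            # Memoise per (arch, machine, virttype); restarting nova-compute
            # (e.g. after upgrading the hypervisor stack) discards this cache.
            if key not in self._domcaps:
                self._domcaps[key] = self._conn.getDomainCapabilities(
                    None, arch, machine, virttype, 0)
            return self._domcaps[key]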

> I've seen Adam's patch here:
> 
> https://review.opendev.org/#/c/655268/2
> 
> but I don't have an account to comment there.

Ah, probably a good idea to register then:

    https://docs.openstack.org/contributors/common/setup-gerrit.html

Certainly your comments would always be welcome :)

[1] At least it seems to be understood by nova developers.  I haven't checked the operator docs or polled the deployment tooling communities to see the level of understanding there, but if it's lower, it needs to be fixed via better docs / communication, and would not justify an attempt to support an unsupportable scenario (i.e. upgrading the hypervisor stack but not restarting nova-compute).

