Bug 1375571

Summary: KVM_SET_DEVICE_ATTR / KVM_ARM_SET_DEVICE_ADDR fails on ThunderX for the GIC, triggered by virtio-net-pci
Product: Red Hat Enterprise Linux 7 Reporter: Andrea Bolognani <abologna>
Component: qemu-kvm-rhevAssignee: Eric Auger <eauger>
Status: CLOSED NOTABUG QA Contact: Chao Yang <chayang>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.3CC: abologna, chayang, drjones, eric.auger, jcm, juzhang, knoel, lersek, virt-maint, wei
Target Milestone: rcKeywords: OtherQA
Target Release: ---   
Hardware: aarch64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-09-20 12:03:11 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1173755    

Description Andrea Bolognani 2016-09-13 13:01:21 UTC
I have a guest configured to use virtio-net-pci as follows:

  <interface type='network'>
    <mac address='52:54:00:57:76:ad'/>
    <source network='default'/>
    <model type='virtio'/>
    <rom bar='off'/>
    <address type='pci' domain='0x0000' bus='0x02'
                        slot='0x00' function='0x0'/>
  </interface>

The guest works fine on Moonshot, but fails to start on
ThunderX with the following error:

  error: Failed to start domain abologna-rhel73
  error: internal error: qemu unexpectedly closed the monitor:
  Could not open option rom 'pxe-virtio.rom': No such file or
  directory
  Failed to set device address: No such device

Minimal reproducers that doesn't involve libvirt:

  /usr/libexec/qemu-kvm \
    -M virt,accel=kvm -cpu host \
    -nographic -nodefaults -no-user-config \
    -device virtio-net-pci

  /usr/libexec/qemu-kvm \
    -M virt,accel=kvm -cpu host \
    -nographic -nodefaults -no-user-config \
    -device virtio-net-pci,rombar=0

The first command will fail (as expected, see Bug 1337510)
on both Moonshot and ThunderX with the following error:

  qemu-kvm: -device virtio-net-pci: failed to find romfile
  "pxe-virtio.rom"

The second command will work, with warnings, on Moonshot:

  Could not open option rom 'pxe-virtio.rom': No such file or
  directory
  Warning: nic virtio-net-pci.0 has no peer

but will fail on ThunderX:

  Could not open option rom 'pxe-virtio.rom': No such file or
  directory
  Warning: nic virtio-net-pci.0 has no peer
  Failed to set device address: No such device
  Aborted

As a consequence of this bug, virtio-net-pci is completely
unusable on ThunderX.

kernel-4.5.0-9.el7.aarch64
AAVMF-20160608-3.git988715a.el7.noarch
qemu-kvm-rhev-2.6.0-23.el7.aarch64
libvirt-2.0.0-8.el7.aarch64

Comment 2 Laszlo Ersek 2016-09-13 16:50:02 UTC
The "Warning: nic virtio-net-pci.0 has no peer" message reproduces with TCG as well, and it is related to the fact that no "-netdev TYPE,id=IDENT" (that is, backend) was provided on the command line, for "-device virtio-net-pci,netdev=IDENT" (that is, for the frontend) to reference. So I think we can take that part out of the picture.

The "Failed to set device address" message comes from kvm_arm_set_device_addr(), file "target-arm/kvm.c". For the arm/aarch64 targets, this C source file provides a function called kvm_arm_register_device(). Board code calls this function, which sets up a memory listener. At machine-init-done time, the registered devices are passed to KVM_SET_DEVICE_ATTR / KVM_ARM_SET_DEVICE_ADDR ioctl()s, and that's what fails now.

The only thing the virt board calls kvm_arm_register_device() for is GICv2 and GICv3, indirectly, through the kvm_arm_gic_realize() / kvm_arm_gicv3_realize() functions, respectively:

- create_gic() in hw/arm/virt.c calls gicv3_class_name() vs. gic_class_name(), dependent on the requested GIC type, to find the name of the appropriate GIC class,
- create_gic() then creates an object of this class, with qdev_create(), which is when the kvm_arm_gic_realize() / kvm_arm_gicv3_realize() functions end up being invoked, registering the device for said ioctl()s at machine init time,
- the appropraite ioctl() is used at machine init time, and fails.

In other words, I don't think the issue is related to virtio-net-pci; it seems to be related to the GIC on ThunderX.

I don't know however why only virtio-net-pci triggers the problem. (Well, is virtio-net-pci the only thing that triggers it?)

--*--

As a side point, it is a separate bug (or "wart") that we get the "Could not open option rom" warning after explicitly setting rombar=0. :/ ... Ahh okay, I know where *that* comes from: see pci_add_option_rom() in "hw/pci/pci.c":

    if (!pdev->rom_bar) {
        /*
         * Load rom via fw_cfg instead of creating a rom bar,
         * for 0.11 compatibility.
         */
[...]
        if (class == 0x0300) {
            rom_add_vga(pdev->romfile);
        } else {
            rom_add_option(pdev->romfile, -1);
        }
        return;
    }

So, there's nothing to do about that warning. We can ignore it. It's down to the GIC (likely GICv3) on the ThunderX, and why virtio-net-pci triggers a problem with it.

Comment 3 Laszlo Ersek 2016-09-13 16:51:40 UTC
CC Drew and Wei.

Comment 4 Laszlo Ersek 2016-09-13 17:06:07 UTC
Alright, I suspect what's going on. In the kvm_arm_set_device_addr() function, which is being run at machine-init-done time, and passes the devices to *one* of the KVM_SET_DEVICE_ATTR / KVM_ARM_SET_DEVICE_ADDR ioctl()s, we have the following logic:

    /* If the device control API is available and we have a device fd on the
     * KVMDevice struct, let's use the newer API
     */
    if (kd->dev_fd >= 0) {
        uint64_t addr = kd->kda.addr;
        attr->addr = (uintptr_t)&addr;
        ret = kvm_device_ioctl(kd->dev_fd, KVM_SET_DEVICE_ATTR, attr);
    } else {
        ret = kvm_vm_ioctl(kvm_state, KVM_ARM_SET_DEVICE_ADDR, &kd->kda);
    }

The KVM_SET_DEVICE_ATTR ioctl -- the more modern one -- propagates through the kernel like this:

kvm_device_ioctl()                 [virt/kvm/kvm_main.c]
  kvm_arch_vcpu_ioctl              [arch/arm/kvm/arm.c]
    kvm_arm_vcpu_set_attr()        [arch/arm/kvm/arm.c]
      kvm_arm_vcpu_arch_set_attr() [arch/arm64/kvm/guest.c]

The implementation of this function looks like:

int kvm_arm_vcpu_arch_set_attr(struct kvm_vcpu *vcpu,
                               struct kvm_device_attr *attr)
{
        int ret;

        switch (attr->group) {
        case KVM_ARM_VCPU_PMU_V3_CTRL:
                ret = kvm_arm_pmu_v3_set_attr(vcpu, attr);
                break;
        default:
                ret = -ENXIO;
                break;
        }

        return ret;
}

Now let's look at the deprecated ioctl, KVM_ARM_SET_DEVICE_ADDR:

kvm_arch_vm_ioctl()              [arch/arm/kvm/arm.c]
  kvm_vm_ioctl_set_device_addr() [arch/arm/kvm/arm.c]

And from the implementation:

        switch (dev_id) {
        case KVM_ARM_DEVICE_VGIC_V2:
                if (!vgic_present)
                        return -ENXIO;
                return kvm_vgic_addr(kvm, type, &dev_addr->addr, true);
        default:
                return -ENODEV;
        }

The "No such device" message corresponds to ENODEV (ENXIO would be "No such device or address"), therefore the failure comes from the deprecated ioctl.

Summary: on the ThunderX, we must have a GICv3, but for some reason, we pass it to the host with KVM_ARM_SET_DEVICE_ADDR, but that ioctl() only supports GICv2.
(Which is BTW documented in "Documentation/virtual/kvm/api.txt".)

Comment 5 Jon Masters 2016-09-13 21:30:42 UTC
Sounds about right. And it's correct that there's a GICv3 on ThunderX (and all other future systems) but only a v2 on Seattle and X-Gene (HPE ProLiant m400).

Comment 6 Laszlo Ersek 2016-09-14 07:36:47 UTC
More digging: in kvm_arm_set_device_addr(), we choose between KVM_SET_DEVICE_ATTR / KVM_ARM_SET_DEVICE_ADDR based on (dev_fd >= 0) -- if that evals to true, we pick KVM_SET_DEVICE_ATTR.

So, since KVM_ARM_SET_DEVICE_ADDR is taken (see comment 4), we know (dev_fd < 0). Where is dev_fd set?

It is set in:

(a) kvm_arm_gicv3_realize(), but that function doesn't actually succeed with (dev_fd < 0):

    /* Try to create the device via the device control API */
    s->dev_fd = kvm_create_device(kvm_state, KVM_DEV_TYPE_ARM_VGIC_V3, false);
    if (s->dev_fd < 0) {
        error_setg_errno(errp, -s->dev_fd, "error creating in-kernel VGIC");
        return;
    }

(b) kvm_arm_gic_realize():

    /* Try to create the device via the device control API */
    s->dev_fd = -1;
    ret = kvm_create_device(kvm_state, KVM_DEV_TYPE_ARM_VGIC_V2, false);
    if (ret >= 0) {
        s->dev_fd = ret;
[...]
    } else if (ret != -ENODEV && ret != -ENOTSUP) {
        error_setg_errno(errp, -ret, "error creating in-kernel VGIC");
        return;
    }

IOW, kvm_arm_gic_realize(), and only kvm_arm_gic_realize(), makes it possible to reach kvm_arm_set_device_addr() with (dev_fd < 0).

So, this clearly means that we're trying to create a guest with GICv2 when the host has GICv3. I googled this a bit more, and found:

https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1566564

> ARM systems can only boot KVM guests that have the same GIC (Generic
> Interrupt Controller) version as the host. GICv2 is the QEMU default but,
> if your system does not have a GICv2, you need to know your GIC version
> and pass that down to the QEMU command line (e.g. -M virt,gic_version=3).
> If you (or your tools) don't know to do that, your guests will just fail
> to boot on non-GICv2 hosts with the obscure error "Failed to set device
> address: No such device".
> [...]
> Recent changes to QEMU and libvirt have improved this situation. QEMU now
> exposes a "query-gic-capabilities" QMP interface that can let the caller
> ask what GIC types are available for guests to use on that host. libvirt
> can now make use of this QEMU interface, and expose that information to
> users via the domcapabilities interface. Further, the user can specify a
> gic version of "host" to have libvirt choose a detected GIC version, or
> the user can omit the <gic> feature altogether, and libvirt will choose a
> GIC version supported by the host and update the guest XML appropriately.
> This allows tools like virt-install and nova to generate GIC-agnostic XML
> that can boot on any arm64 host.

This seems to be an accurate / educative account of the status quo. The minimal reproducer

  /usr/libexec/qemu-kvm \
    -M virt,accel=kvm -cpu host \
    -nographic -nodefaults -no-user-config \
    -device virtio-net-pci,rombar=0

from comment 0 is simply invalid on the ThunderX, it should say

    -M virt,accel=kvm,gic_version=3

However, I think our libvirt *already* handles this (when virtio-net-pci is not used, anyway) -- after all "normally" libvirt guests do start on ThunderX, right?

Which makes me believe that using virtio-net-pci somehow prevents libvirt from generating the ",gic_version=3" part.

Andrea, when you encounter this problem, can you please capture the QEMU command line from under /var/log/libvirt/qemu, and also the QMP traffic?

(For the latter I put

log_filters="1:qemu_monitor_ 1:qemu_agent"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"

in "/etc/libvirt/libvirtd.conf", although I obviously don't need to tell you that :))

Thanks.

Comment 8 Laszlo Ersek 2016-09-14 10:36:48 UTC
So, I reserved a ThunderX, and I cannot reproduce the issue. Andrea, can you please provide all version numbers and the full domain XML?

Here's what I tried:
- first I tried to plug virtio-net-pci directly into the PCI Express root bus (pcie.0), as an integrated device. Libvirt didn't accept that.

- Then I tried to do the topology seen on Q35: dmi-to-pci bridge into pcie.0, then a pci-bridge into dmi-to-pci. This works on upstream QEMU, but not on downstream qemu, where the dmi-to-pci bridge is not built into the binary.

- Finally I tried to plug a PCIe root port into pcie.0, and then place virtio-net-pci on the root port. This fails because the root port (== ioh3420) device cannot be initialized. The reason is this (checked it with gdb):

ioh3420_initfn() [hw/pci-bridge/ioh3420.c]
  msi_init()     [hw/pci/msi.c]

    if (!msi_nonbroken) {
        error_setg(errp, "MSI is not supported by interrupt controller");
        return -ENOTSUP;
    }

And "msi_nonbroken" is false here, because only "hw/intc/arm_gicv2m.c" sets it to true -- MSI is apparently known to be broken on GICv3.

Summary: I cannot reproduce the problem, simply because I'm unable to create a PCI Express / PCI topology with downstream libvirt and QEMU where I could have *any* virtio-net-pci device.

NB: the above error message ("MSI is not supported...") is upstream-only at the moment, but the logic is the same in downstream. See upstream commit 1108b2f8a939f "pci: Convert msi_init() to Error and fix callers to check it"; part of v2.7.0.

Comment 9 Andrea Bolognani 2016-09-14 10:54:47 UTC
(In reply to Laszlo Ersek from comment #6)
> More digging: in kvm_arm_set_device_addr(), we choose between
> KVM_SET_DEVICE_ATTR / KVM_ARM_SET_DEVICE_ADDR based on (dev_fd >= 0) -- if
> that evals to true, we pick KVM_SET_DEVICE_ATTR.
> 
> So, since KVM_ARM_SET_DEVICE_ADDR is taken (see comment 4), we know (dev_fd
> < 0). Where is dev_fd set?
> 
> It is set in:
> 
> (a) kvm_arm_gicv3_realize(), but that function doesn't actually succeed with
> (dev_fd < 0):
> 
>     /* Try to create the device via the device control API */
>     s->dev_fd = kvm_create_device(kvm_state, KVM_DEV_TYPE_ARM_VGIC_V3,
> false);
>     if (s->dev_fd < 0) {
>         error_setg_errno(errp, -s->dev_fd, "error creating in-kernel VGIC");
>         return;
>     }
> 
> (b) kvm_arm_gic_realize():
> 
>     /* Try to create the device via the device control API */
>     s->dev_fd = -1;
>     ret = kvm_create_device(kvm_state, KVM_DEV_TYPE_ARM_VGIC_V2, false);
>     if (ret >= 0) {
>         s->dev_fd = ret;
> [...]
>     } else if (ret != -ENODEV && ret != -ENOTSUP) {
>         error_setg_errno(errp, -ret, "error creating in-kernel VGIC");
>         return;
>     }
> 
> IOW, kvm_arm_gic_realize(), and only kvm_arm_gic_realize(), makes it
> possible to reach kvm_arm_set_device_addr() with (dev_fd < 0).
> 
> So, this clearly means that we're trying to create a guest with GICv2 when
> the host has GICv3. I googled this a bit more, and found:
> 
> https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1566564
> 
> > ARM systems can only boot KVM guests that have the same GIC (Generic
> > Interrupt Controller) version as the host. GICv2 is the QEMU default but,
> > if your system does not have a GICv2, you need to know your GIC version
> > and pass that down to the QEMU command line (e.g. -M virt,gic_version=3).
> > If you (or your tools) don't know to do that, your guests will just fail
> > to boot on non-GICv2 hosts with the obscure error "Failed to set device
> > address: No such device".
> > [...]
> > Recent changes to QEMU and libvirt have improved this situation. QEMU now
> > exposes a "query-gic-capabilities" QMP interface that can let the caller
> > ask what GIC types are available for guests to use on that host. libvirt
> > can now make use of this QEMU interface, and expose that information to
> > users via the domcapabilities interface. Further, the user can specify a
> > gic version of "host" to have libvirt choose a detected GIC version, or
> > the user can omit the <gic> feature altogether, and libvirt will choose a
> > GIC version supported by the host and update the guest XML appropriately.
> > This allows tools like virt-install and nova to generate GIC-agnostic XML
> > that can boot on any arm64 host.
> 
> This seems to be an accurate / educative account of the status quo.

It is, except for

> > [...] the user can specify a
> > gic version of "host" to have libvirt choose a detected GIC version

because

  <gic version='host'/>

is translated directly into the QEMU counterpart

  gic-version=host

with no further detection (detective?) work performed by
libvirt. It's a very minor detail anyway :)

> The
> minimal reproducer
> 
>   /usr/libexec/qemu-kvm \
>     -M virt,accel=kvm -cpu host \
>     -nographic -nodefaults -no-user-config \
>     -device virtio-net-pci,rombar=0
> 
> from comment 0 is simply invalid on the ThunderX, it should say
> 
>     -M virt,accel=kvm,gic_version=3

Yup, you're absolutely right.

> However, I think our libvirt *already* handles this (when virtio-net-pci is
> not used, anyway) -- after all "normally" libvirt guests do start on
> ThunderX, right?

They do, and yes, libvirt will automatically add

  <gic version='3'/>

to the guest configuration on ThunderX based on the output
of QEMU's query-gic-capabilities QMP command.

> Which makes me believe that using virtio-net-pci somehow prevents libvirt
> from generating the ",gic_version=3" part.

That's actually not the case: turns out that I simply had

  <gic version='2'/>

in my guest configuration for some reason. How embarassing :S

So I guess GICv2 emulation is working now? I'm pretty sure
a few months back this was not the case, and guest configured
to use GICv2 would fail to start at all.

... And that's still the case, actually. I changed one of my
virtio-mmio guests to GICv2 and it fails to start with the
expected

  Failed to set device address: No such device

error. It's just that the virtio-net-pci error was being
triggered earlier, and ended up shadowing this one.

Now that I've changed my virtio-pci guest to use GICv3
instead of GICv2 I've run into another issue with ioh3420,
but as usual you've been faster than me and you've already
described it in Comment 8 :)

So if MSI is known to be broken on GICv3, and ioh3420
requires it, we can conclude that virtio-pci guests are
simply not possible on ThunderX, a known limitation we'll
have to overcome for 7.4.

> Andrea, when you encounter this problem, can you please capture the QEMU
> command line from under /var/log/libvirt/qemu, and also the QMP traffic?
> 
> (For the latter I put
> 
> log_filters="1:qemu_monitor_ 1:qemu_agent"
> log_outputs="1:file:/var/log/libvirt/libvirtd.log"
> 
> in "/etc/libvirt/libvirtd.conf", although I obviously don't need to tell you
> that :))
> 
> Thanks.

Comment 11 Laszlo Ersek 2016-09-14 12:14:41 UTC
(In reply to Andrea Bolognani from comment #9)

> So if MSI is known to be broken on GICv3, and ioh3420
> requires it, we can conclude that virtio-pci guests are
> simply not possible on ThunderX, a known limitation we'll
> have to overcome for 7.4.

I agree.

Shall we repurpose this BZ for "fix MSI on GICv3"? That seems to be a serious enough gap in functionality that I suspect it must already be tracked somewhere.

If we don't repurpose this bug, then we should close it as NOTABUG.

Thanks!

Comment 12 Laszlo Ersek 2016-09-14 12:17:32 UTC
(In reply to Laszlo Ersek from comment #11)
> (In reply to Andrea Bolognani from comment #9)
> 
> > So if MSI is known to be broken on GICv3, and ioh3420
> > requires it, we can conclude that virtio-pci guests are
> > simply not possible on ThunderX, a known limitation we'll
> > have to overcome for 7.4.
> 
> I agree.
> 
> Shall we repurpose this BZ for "fix MSI on GICv3"? That seems to be a
> serious enough gap in functionality that I suspect it must already be
> tracked somewhere.

Eric, do you know anything about this? What's the status of MSI-on-GICv3? Thank you.

Comment 13 Andrew Jones 2016-09-20 12:03:11 UTC
Thanks for the accurate, educative account of the status quo :-) As we're already tracking gicv3 ITS support with other bugs (assigned to Eric for 7.4), then we can close this one as not-a-bug.

Comment 14 Eric Auger 2016-09-28 09:36:11 UTC
Status of vITS support is:
[PATCH v7 0/8] vITS support
https://lists.gnu.org/archive/html/qemu-devel/2016-09/msg06134.html

It should be close to upstream gate.

for MSI support on ACPI guest, we also need the IORT generation linking
root complex and ITS node. I prototyped this and it is now functional. I am currently waiting for Prem Malappa since he was the first to write with a minimalist IORT support in the context of his vSMMU series. Depending on him I may send an independent series with RC and ITS node only (no smmu).

Comment 15 Eric Auger 2022-10-21 09:12:25 UTC
This work is complete for a while and suprisingly the needInfo has just started to nag me few days ago. Clear it.