Bug 1450524 - qemu-kvm: VQ 0 size 0x100 Guest index 0x2010 inconsistent with Host index 0x2171: delta 0xfe9f
Summary: qemu-kvm: VQ 0 size 0x100 Guest index 0x2010 inconsistent with Host index 0x2171: delta 0xfe9f
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.3
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: pre-dev-freeze
Target Release: 7.4
Assignee: jason wang
QA Contact: xiywang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-05-12 21:59 UTC by Jack Waterworth
Modified: 2020-12-14 08:39 UTC
CC List: 32 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-06-19 13:34:05 UTC
Target Upstream Version:
Embargoed:


Attachments
systemtap debug script (653 bytes, text/plain), 2017-06-02 12:00 UTC, jason wang
new systemtap debug script (487 bytes, text/plain), 2017-06-06 06:02 UTC, jason wang

Description Jack Waterworth 2017-05-12 21:59:57 UTC
Description of problem:

When we migrate KVM instances from one compute node to another (the scheduler is not in play), a small number of the instances go into an error/shutoff state on the destination node. The virsh process continues to run on the source node, but no libvirt.xml file is present on the source. Nova indicates that the VM is on the destination node, but no libvirt.xml file is present there either.

nova logs show the following error:

2017-05-11 09:06:06.924 344980 ERROR nova.virt.libvirt.driver [req-df99cebc-5072-4667-a5b4-70b4434a68e2 Z013PX2 7072da2726e04c3687b53954234e412f - - -] [instance: c668bbbb-3e60-47ac-8c6e-4449107022ec] Live Migration failure: internal error: qemu unexpectedly closed the monitor: 2017-05-11T14:06:06.706004Z qemu-kvm: VQ 0 size 0x100 Guest index 0x2010 inconsistent with Host index 0x2171: delta 0xfe9f

Version-Release number of selected component (if applicable):
osp7
qemu-kvm-rhev-2.6.0-28.el7_3.6.x86_64
openstack-nova-compute-2015.1.4-32.el7ost.noarch

How reproducible:
sometimes
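
For orientation, the numbers in the error line are self-consistent if the check is "guest avail index minus the host's saved last_avail index, modulo 2^16, must not exceed the ring size". The following minimal standalone C program only illustrates that 16-bit arithmetic; it is not QEMU source, the variable names are made up, and the exact form of the check is an assumption inferred from the message text.

    /* Toy reproduction of the arithmetic behind:
     *   VQ 0 size 0x100 Guest index 0x2010 inconsistent with Host index 0x2171: delta 0xfe9f
     * Assumption: delta is guest_avail_idx - host_last_avail_idx taken modulo 2^16,
     * and the check fails when delta exceeds the ring size. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint16_t ring_size       = 0x100;   /* "VQ 0 size 0x100"    */
        uint16_t guest_avail_idx = 0x2010;  /* "Guest index 0x2010" */
        uint16_t host_last_avail = 0x2171;  /* "Host index 0x2171"  */

        uint16_t delta = guest_avail_idx - host_last_avail;  /* wraps mod 2^16 */
        printf("delta = 0x%x\n", delta);                      /* prints 0xfe9f  */

        if (delta > ring_size) {
            printf("inconsistent: the guest index moved further than the ring "
                   "size past the index the destination restored\n");
        }
        return 0;
    }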

Comment 2 Kashyap Chamarthy 2017-05-15 11:47:58 UTC
Can you please also provide the complete QEMU log, so we could examine which specific Virtio device is being affected?

Comment 3 Kashyap Chamarthy 2017-05-15 11:59:08 UTC
(In reply to Kashyap Chamarthy from comment #2)
> Can you please also provide the complete QEMU log, so we could examine which
> specific Virtio device is being affected?

Err, the logs are in comment #1.  Please disregard me.

Comment 5 Daniel Berrangé 2017-05-17 08:39:47 UTC
Given the QEMU version 2.6.0 and the machine type pc-i440fx-rhel7.2.0, I presume this migration was run from a RHEL-7.2 (src) host to a RHEL-7.3 (dst) host.

Comment 6 Daniel Berrangé 2017-05-17 08:58:38 UTC
The QEMU version on the source host is qemu-kvm-common-rhev-2.3.0-31.el7_2.13.x86_64, which is quite a few releases behind the latest available errata.

There is a bug fix in qemu-kvm-rhev-2.3.0-31.el7_2.25 (bug 1400551) with a superficially similar error message:

  2016-12-02T05:20:27.182466Z qemu-kvm: VQ 2 size 0x80 < last_avail_idx 0x1 - used_idx 0x2

but 

  - bug 1400551 is wrt virtio-balloon, whereas this failure is hitting virtio-net
  - bug 1400551 should only hit if they had the CVE-2016-5403 / bug 1359731 fix applied from rhev-2.3.0-31.el7_2.20, which they don't appear to have

so I think on balance this is probably a different bug.
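
The two messages also appear to come from two different sanity checks: the bug 1400551 message compares the ring size against last_avail_idx - used_idx (the number of in-flight descriptors), while the message in this bug compares the guest's avail index against the host's saved index, as illustrated in the description above. Below is a standalone C illustration of the 1400551-style arithmetic using the numbers from the quoted message; again this is not QEMU source, and the exact form of the check is an assumption inferred from the message text.

    /* Toy reproduction of the arithmetic behind the bug 1400551 message:
     *   VQ 2 size 0x80 < last_avail_idx 0x1 - used_idx 0x2
     * Assumption: the check fails when (last_avail_idx - used_idx) mod 2^16,
     * i.e. the number of in-flight descriptors, exceeds the ring size. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint16_t ring_size      = 0x80;  /* "VQ 2 size 0x80" */
        uint16_t last_avail_idx = 0x1;
        uint16_t used_idx       = 0x2;

        uint16_t inuse = last_avail_idx - used_idx;  /* wraps to 0xffff */
        if (inuse > ring_size) {
            printf("VQ 2 size 0x%x < last_avail_idx 0x%x - used_idx 0x%x\n",
                   ring_size, last_avail_idx, used_idx);
        }
        return 0;
    }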

Comment 7 Dr. David Alan Gilbert 2017-05-17 09:01:01 UTC
lprosek: Is this actually the same bug? I see:
  https://bugzilla.redhat.com/show_bug.cgi?id=1388465#c15
suggest it might be ?

Comment 12 Ladi Prosek 2017-05-17 11:22:57 UTC
(In reply to Dr. David Alan Gilbert from comment #7)
> lprosek: Is this actually the same bug? I see:
>   https://bugzilla.redhat.com/show_bug.cgi?id=1388465#c15
> suggest it might be ?

I don't think so. The only suspicious virtqueue operation in virtio-net is virtqueue_discard (which was actually also fixed as part of the CVE follow-up last year) but it's not included in qemu-kvm-rhev-2.3.0. It was added in:

commit 0cf33fb6b49a19de32859e2cdc6021334f448fb3
Author: Jason Wang <jasowang>
Date:   Fri Sep 25 13:21:30 2015 +0800

    virtio-net: correctly drop truncated packets


Is it possible that the VM had been migrated from another host running a different version of qemu-kvm-rhev or did it start on qemu-kvm-common-rhev-2.3.0-31.el7_2.13.x86_64 fresh?

Comment 13 Daniel Berrangé 2017-05-17 11:26:47 UTC
(In reply to Ladi Prosek from comment #12)
> (In reply to Dr. David Alan Gilbert from comment #7)
> > lprosek: Is this actually the same bug? I see:
> >   https://bugzilla.redhat.com/show_bug.cgi?id=1388465#c15
> > suggest it might be ?
> 
> I don't think so. The only suspicious virtqueue operation in virtio-net is
> virtqueue_discard (which was actually also fixed as part of the CVE
> follow-up last year) but it's not included in qemu-kvm-rhev-2.3.0. It was
> added in:
> 
> commit 0cf33fb6b49a19de32859e2cdc6021334f448fb3
> Author: Jason Wang <jasowang>
> Date:   Fri Sep 25 13:21:30 2015 +0800
> 
>     virtio-net: correctly drop truncated packets
> 
> 
> Is it possible that the VM had been migrated from another host running a
> different version of qemu-kvm-rhev or did it start on
> qemu-kvm-common-rhev-2.3.0-31.el7_2.13.x86_64 fresh?

The libvirt logs on the source host do not contain any -incoming arg, so it must have been a cold boot with qemu-kvm-common-rhev-2.3.0-31.el7_2.13.x86_64.

Comment 14 Ladi Prosek 2017-05-17 12:20:59 UTC
(In reply to Daniel Berrange from comment #13)
> The libvirt logs on the source host do not contain any -incoming arg, so
> it must have been a cold boot with
> qemu-kvm-common-rhev-2.3.0-31.el7_2.13.x86_64

Thanks, I did an experiment to see what happens if the rx queue fills up - the problem that 0cf33fb addresses - and it just fills up, no migration problems. So this was likely a false lead :/
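
A rough way to picture what the 0cf33fb change discussed in comment 12 guards against: if a truncated packet is dropped without its descriptor being returned to the queue (the role virtqueue_discard plays), each such packet leaks one rx descriptor and the ring eventually stalls. The toy model below is plain C with no QEMU involved, and the failure mode it shows is an assumption based on the commit title and comment 12, not something demonstrated in this bug.

    /* Toy model (not QEMU code) of dropping truncated packets with and
     * without returning the descriptor.  Assumption: without the fix the
     * popped descriptor is simply forgotten, so every truncated packet
     * costs one rx descriptor until the ring has none left. */
    #include <stdio.h>

    #define RING_SIZE 4

    static int run(int return_descriptor_on_drop)
    {
        int free_descriptors = RING_SIZE;
        int pkt;

        for (pkt = 0; pkt < 8; pkt++) {          /* every packet is truncated  */
            if (free_descriptors == 0) {
                return pkt;                      /* rx stalled                 */
            }
            free_descriptors--;                  /* pop a descriptor           */
            if (return_descriptor_on_drop) {
                free_descriptors++;              /* "discard" it back (fixed)  */
            }
        }
        return pkt;                              /* handled all packets        */
    }

    int main(void)
    {
        printf("without fix: stopped after %d packets\n", run(0));
        printf("with fix:    stopped after %d packets\n", run(1));
        return 0;
    }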

Comment 21 Dr. David Alan Gilbert 2017-05-19 17:17:29 UTC
I'm not having any luck replicating this here; it's run ~230 iterations of a migration from:
   qemu-kvm-rhev-2.3.0-31.el7_2.13.x86_64
to
   qemu-kvm-rhev-2.6.0-28.el7_3.6.x86_64

with a 7.3 guest doing network IO without any problems.

Note I'm running 3.10.0-514.16.1.el7.x86_64 -> 3.10.0-663.el7.test.x86_64 for the kernels.

Comment 49 jason wang 2017-06-02 12:00:43 UTC
Created attachment 1284416 [details]
systemtap debug script

Comment 54 jason wang 2017-06-06 06:02:36 UTC
Created attachment 1285206 [details]
new systemtap debug script

Comment 62 Matt Riedemann 2018-04-06 16:28:17 UTC
Seeing the same thing in OpenStack's live migration CI job runs in Queens and Rocky:

http://logs.openstack.org/37/522537/20/check/legacy-tempest-dsvm-multinode-live-migration/8de6e74/logs/subnode-2/libvirt/qemu/instance-00000002.txt.gz

http://logs.openstack.org/37/522537/20/check/legacy-tempest-dsvm-multinode-live-migration/8de6e74/logs/screen-n-cpu.txt.gz?level=TRACE#_Apr_05_21_48_43_258043

Those are at present using the Pike Ubuntu Cloud Archive, so:

ii  libvirt-bin                         3.6.0-1ubuntu6.2~cloud0

ii  qemu-system-x86                     1:2.10+dfsg-0ubuntu3.5~cloud0

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22VQ%200%20size%5C%22%20AND%20message%3A%5C%22Guest%20index%5C%22%20AND%20message%3A%5C%22inconsistent%20with%20Host%20index%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22&from=7d

Note that these CI jobs are running tests in parallel, so there are concurrently running live migrations on the same hosts.

Comment 63 Matt Riedemann 2018-04-06 16:38:35 UTC
Tracking this in OpenStack with this bug:

https://bugs.launchpad.net/nova/+bug/1761798

