Bug 1450524 - qemu-kvm: VQ 0 size 0x100 Guest index 0x2010 inconsistent with Host index 0x2171: delta 0xfe9f
Summary: qemu-kvm: VQ 0 size 0x100 Guest index 0x2010 inconsistent with Host index 0x2171: delta 0xfe9f
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.3
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: pre-dev-freeze
Target Release: 7.4
Assignee: jason wang
QA Contact: xiywang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-05-12 21:59 UTC by Jack Waterworth
Modified: 2020-12-14 08:39 UTC
CC List: 32 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-06-19 13:34:05 UTC
Target Upstream Version:
Embargoed:


Attachments
systemtap debug script (653 bytes, text/plain), 2017-06-02 12:00 UTC, jason wang
new systemtap debug script (487 bytes, text/plain), 2017-06-06 06:02 UTC, jason wang

Description Jack Waterworth 2017-05-12 21:59:57 UTC
Description of problem:

When we migrate KVM instances from one compute node to another (the scheduler is not in play), a small number of the instances go into an error/shutoff state on the destination node. The virsh process continues to run on the source node, but no libvirt.xml file is present on the source. Nova indicates that the VM is on the destination node, but no libvirt.xml file is present there either.

nova logs show the following error:

2017-05-11 09:06:06.924 344980 ERROR nova.virt.libvirt.driver [req-df99cebc-5072-4667-a5b4-70b4434a68e2 Z013PX2 7072da2726e04c3687b53954234e412f - - -] [instance: c668bbbb-3e60-47ac-8c6e-4449107022ec] Live Migration failure: internal error: qemu unexpectedly closed the monitor: 2017-05-11T14:06:06.706004Z qemu-kvm: VQ 0 size 0x100 Guest index 0x2010 inconsistent with Host index 0x2171: delta 0xfe9f

Version-Release number of selected component (if applicable):
osp7
qemu-kvm-rhev-2.6.0-28.el7_3.6.x86_64
openstack-nova-compute-2015.1.4-32.el7ost.noarch

How reproducible:
sometimes
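
For orientation, the numbers in the error line are self-consistent if the check is "guest avail index minus the host's saved last_avail index, modulo 2^16, must not exceed the ring size". The following minimal standalone C program only illustrates that 16-bit arithmetic; it is not QEMU source, the variable names are made up, and the exact form of the check is an assumption inferred from the message text.

    /* Toy reproduction of the arithmetic behind:
     *   VQ 0 size 0x100 Guest index 0x2010 inconsistent with Host index 0x2171: delta 0xfe9f
     * Assumption: delta is guest_avail_idx - host_last_avail_idx taken modulo 2^16,
     * and the check fails when delta exceeds the ring size. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint16_t ring_size       = 0x100;   /* "VQ 0 size 0x100"    */
        uint16_t guest_avail_idx = 0x2010;  /* "Guest index 0x2010" */
        uint16_t host_last_avail = 0x2171;  /* "Host index 0x2171"  */

        uint16_t delta = guest_avail_idx - host_last_avail;  /* wraps mod 2^16 */
        printf("delta = 0x%x\n", delta);                      /* prints 0xfe9f  */

        if (delta > ring_size) {
            printf("inconsistent: the guest index moved further than the ring "
                   "size past the index the destination restored\n");
        }
        return 0;
    }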

Comment 2 Kashyap Chamarthy 2017-05-15 11:47:58 UTC
Can you please also provide the complete QEMU log, so we could examine which specific Virtio device is being affected?

Comment 3 Kashyap Chamarthy 2017-05-15 11:59:08 UTC
(In reply to Kashyap Chamarthy from comment #2)
> Can you please also provide the complete QEMU log, so we could examine which
> specific Virtio device is being affected?

Err, the logs are in comment #1.  Please disregard me.

Comment 5 Daniel Berrangé 2017-05-17 08:39:47 UTC
Given the QEMU version 2.6.0 and the machine type pc-i440fx-rhel7.2.0, I presume this migration was run from a RHEL-7.2 (src) host to a RHEL-7.3 (dst) host.

Comment 6 Daniel Berrangé 2017-05-17 08:58:38 UTC
The QEMU version on the source host is qemu-kvm-common-rhev-2.3.0-31.el7_2.13.x86_64, which is quite a few releases behind the latest available errata.

There is a bug fix in qemu-kvm-rhev-2.3.0-31.el7_2.25 (bug 1400551) with a superficially similar error message:

  2016-12-02T05:20:27.182466Z qemu-kvm: VQ 2 size 0x80 < last_avail_idx 0x1 - used_idx 0x2

but 

  - bug 1400551 is wrt virtio-balloon, whereas this failure is hitting virtio-net
  - bug 1400551 should only hit if they had the CVE-2016-5403 / bug 1359731 fix applied from rhev-2.3.0-31.el7_2.20, which they don't appear to have

so I think on balance this is probably a different bug.
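
The two messages also appear to come from two different sanity checks: the bug 1400551 message compares the ring size against last_avail_idx - used_idx (the number of in-flight descriptors), while the message in this bug compares the guest's avail index against the host's saved index, as illustrated in the description above. Below is a standalone C illustration of the 1400551-style arithmetic using the numbers from the quoted message; again this is not QEMU source, and the exact form of the check is an assumption inferred from the message text.

    /* Toy reproduction of the arithmetic behind the bug 1400551 message:
     *   VQ 2 size 0x80 < last_avail_idx 0x1 - used_idx 0x2
     * Assumption: the check fails when (last_avail_idx - used_idx) mod 2^16,
     * i.e. the number of in-flight descriptors, exceeds the ring size. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint16_t ring_size      = 0x80;  /* "VQ 2 size 0x80" */
        uint16_t last_avail_idx = 0x1;
        uint16_t used_idx       = 0x2;

        uint16_t inuse = last_avail_idx - used_idx;  /* wraps to 0xffff */
        if (inuse > ring_size) {
            printf("VQ 2 size 0x%x < last_avail_idx 0x%x - used_idx 0x%x\n",
                   ring_size, last_avail_idx, used_idx);
        }
        return 0;
    }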

Comment 7 Dr. David Alan Gilbert 2017-05-17 09:01:01 UTC
lprosek: Is this actually the same bug? I see:
  https://bugzilla.redhat.com/show_bug.cgi?id=1388465#c15
suggest it might be ?

Comment 12 Ladi Prosek 2017-05-17 11:22:57 UTC
(In reply to Dr. David Alan Gilbert from comment #7)
> lprosek: Is this actually the same bug? I see:
>   https://bugzilla.redhat.com/show_bug.cgi?id=1388465#c15
> suggest it might be ?

I don't think so. The only suspicious virtqueue operation in virtio-net is virtqueue_discard (which was actually also fixed as part of the CVE follow-up last year) but it's not included in qemu-kvm-rhev-2.3.0. It was added in:

commit 0cf33fb6b49a19de32859e2cdc6021334f448fb3
Author: Jason Wang <jasowang>
Date:   Fri Sep 25 13:21:30 2015 +0800

    virtio-net: correctly drop truncated packets


Is it possible that the VM had been migrated from another host running a different version of qemu-kvm-rhev or did it start on qemu-kvm-common-rhev-2.3.0-31.el7_2.13.x86_64 fresh?

Comment 13 Daniel Berrangé 2017-05-17 11:26:47 UTC
(In reply to Ladi Prosek from comment #12)
> (In reply to Dr. David Alan Gilbert from comment #7)
> > lprosek: Is this actually the same bug? I see:
> >   https://bugzilla.redhat.com/show_bug.cgi?id=1388465#c15
> > suggest it might be ?
> 
> I don't think so. The only suspicious virtqueue operation in virtio-net is
> virtqueue_discard (which was actually also fixed as part of the CVE
> follow-up last year) but it's not included in qemu-kvm-rhev-2.3.0. It was
> added in:
> 
> commit 0cf33fb6b49a19de32859e2cdc6021334f448fb3
> Author: Jason Wang <jasowang>
> Date:   Fri Sep 25 13:21:30 2015 +0800
> 
>     virtio-net: correctly drop truncated packets
> 
> 
> Is it possible that the VM had been migrated from another host running a
> different version of qemu-kvm-rhev or did it start on
> qemu-kvm-common-rhev-2.3.0-31.el7_2.13.x86_64 fresh?

The libvirt logs on the source host do not contain any -incoming arg, so it must have been a cold boot with qemu-kvm-common-rhev-2.3.0-31.el7_2.13.x86_64.

Comment 14 Ladi Prosek 2017-05-17 12:20:59 UTC
(In reply to Daniel Berrange from comment #13)
> The libvirt logs on the source host do not contain any -incoming arg, so
> it must have been a cold boot with
> qemu-kvm-common-rhev-2.3.0-31.el7_2.13.x86_64

Thanks, I did an experiment to see what happens if the rx queue fills up - the problem that 0cf33fb addresses - and it just fills up, no migration problems. So this was likely a false lead :/
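
A rough way to picture what the 0cf33fb change discussed in comment 12 guards against: if a truncated packet is dropped without its descriptor being returned to the queue (the role virtqueue_discard plays), each such packet leaks one rx descriptor and the ring eventually stalls. The toy model below is plain C with no QEMU involved, and the failure mode it shows is an assumption based on the commit title and comment 12, not something demonstrated in this bug.

    /* Toy model (not QEMU code) of dropping truncated packets with and
     * without returning the descriptor.  Assumption: without the fix the
     * popped descriptor is simply forgotten, so every truncated packet
     * costs one rx descriptor until the ring has none left. */
    #include <stdio.h>

    #define RING_SIZE 4

    static int run(int return_descriptor_on_drop)
    {
        int free_descriptors = RING_SIZE;
        int pkt;

        for (pkt = 0; pkt < 8; pkt++) {          /* every packet is truncated  */
            if (free_descriptors == 0) {
                return pkt;                      /* rx stalled                 */
            }
            free_descriptors--;                  /* pop a descriptor           */
            if (return_descriptor_on_drop) {
                free_descriptors++;              /* "discard" it back (fixed)  */
            }
        }
        return pkt;                              /* handled all packets        */
    }

    int main(void)
    {
        printf("without fix: stopped after %d packets\n", run(0));
        printf("with fix:    stopped after %d packets\n", run(1));
        return 0;
    }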

Comment 21 Dr. David Alan Gilbert 2017-05-19 17:17:29 UTC
I'm not having any luck replicating this here; it's run ~230 iterations of a migration from:
   qemu-kvm-rhev-2.3.0-31.el7_2.13.x86_64
to
   qemu-kvm-rhev-2.6.0-28.el7_3.6.x86_64

with a 7.3 guest doing network IO without any problems.

Note I'm running 3.10.0-514.16.1.el7.x86_64 -> 3.10.0-663.el7.test.x86_64 for the kernels.

Comment 49 jason wang 2017-06-02 12:00:43 UTC
Created attachment 1284416 [details]
systemtap debug script

Comment 54 jason wang 2017-06-06 06:02:36 UTC
Created attachment 1285206 [details]
new systemtap debug script

Comment 62 Matt Riedemann 2018-04-06 16:28:17 UTC
Seeing the same thing in OpenStack's live migration CI job runs in Queens and Rocky:

http://logs.openstack.org/37/522537/20/check/legacy-tempest-dsvm-multinode-live-migration/8de6e74/logs/subnode-2/libvirt/qemu/instance-00000002.txt.gz

http://logs.openstack.org/37/522537/20/check/legacy-tempest-dsvm-multinode-live-migration/8de6e74/logs/screen-n-cpu.txt.gz?level=TRACE#_Apr_05_21_48_43_258043

Those are at present using the Pike Ubuntu Cloud Archive, so:

ii  libvirt-bin                         3.6.0-1ubuntu6.2~cloud0

ii  qemu-system-x86                     1:2.10+dfsg-0ubuntu3.5~cloud0

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22VQ%200%20size%5C%22%20AND%20message%3A%5C%22Guest%20index%5C%22%20AND%20message%3A%5C%22inconsistent%20with%20Host%20index%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22&from=7d

Note that these CI jobs are running tests in parallel, so there are concurrently running live migrations on the same hosts.

Comment 63 Matt Riedemann 2018-04-06 16:38:35 UTC
Tracking this in OpenStack with this bug:

https://bugs.launchpad.net/nova/+bug/1761798

