Bug 1450524

Summary: qemu-kvm: VQ 0 size 0x100 Guest index 0x2010 inconsistent with Host index 0x2171: delta 0xfe9f
Product: Red Hat Enterprise Linux 7
Reporter: Jack Waterworth <jwaterwo>
Component: qemu-kvm-rhev
Assignee: jason wang <jasowang>
Status: CLOSED WORKSFORME
QA Contact: xiywang
Severity: high
Docs Contact:
Priority: unspecified
Version: 7.3
CC: ailan, berrange, chayang, dasmith, dgilbert, eglynn, hhuang, huding, jasowang, jdenemar, juzhang, jwaterwo, kchamart, knoel, laine, lprosek, michen, mriedem, mschuppe, mst, pezhang, qzhang, rbryant, sbauza, sferdjao, sgordon, srevivo, virt-bugs, virt-maint, vromanso, wexu, xianwang
Target Milestone: pre-dev-freeze
Keywords: Unconfirmed
Target Release: 7.4
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-06-19 13:34:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  systemtap debug script (flags: none)
  new systemtap debug script (flags: none)

Description Jack Waterworth 2017-05-12 21:59:57 UTC
Description of problem:

When we migrate KVM instances from one compute node to another (the scheduler is not in play), a small number of instances go into an error/shutoff state on the destination node. The virsh process continues to run on the source node, but no libvirt.xml file exists on the source. Nova indicates that the VM is on the destination node, but no libvirt.xml file is present there either.

nova logs show the following error:

2017-05-11 09:06:06.924 344980 ERROR nova.virt.libvirt.driver [req-df99cebc-5072-4667-a5b4-70b4434a68e2 Z013PX2 7072da2726e04c3687b53954234e412f - - -] [instance: c668bbbb-3e60-47ac-8c6e-4449107022ec] Live Migration failure: internal error: qemu unexpectedly closed the monitor: 2017-05-11T14:06:06.706004Z qemu-kvm: VQ 0 size 0x100 Guest index 0x2010 inconsistent with Host index 0x2171: delta 0xfe9f
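For context, this message is produced by a sanity check on the migration destination: the number of outstanding avail-ring entries, computed as a 16-bit wrapping difference between the guest-visible avail index and the host's saved last_avail index, must not exceed the ring size. The following Python sketch illustrates that arithmetic with the values from this bug; the function name and message wording are chosen to mirror the error text, and this is an illustration of the check, not QEMU's actual C code:

```python
RING_INDEX_MASK = 0xFFFF  # virtio ring indices are free-running 16-bit counters


def check_vq_state(vq_size, guest_avail_idx, host_last_avail_idx):
    """Sketch of the destination-side check that raises
    'Guest index ... inconsistent with Host index ...'."""
    # Descriptors the guest has made available but the host has not yet
    # consumed, computed modulo 2^16 because the indices wrap.
    delta = (guest_avail_idx - host_last_avail_idx) & RING_INDEX_MASK
    if delta > vq_size:
        raise ValueError(
            "VQ size 0x%x Guest index 0x%x inconsistent with "
            "Host index 0x%x: delta 0x%x"
            % (vq_size, guest_avail_idx, host_last_avail_idx, delta))
    return delta


# The values from this bug: size 0x100, guest 0x2010, host 0x2171.
# (0x2010 - 0x2171) & 0xFFFF == 0xFE9F, far larger than the ring size,
# i.e. the host index restored from the migration stream is *ahead of*
# the guest's avail index, which a consistent stream should never show.
```

The huge delta (0xfe9f) is the tell-tale of a 16-bit underflow: the restored host index is slightly larger than the guest's, not 65k entries behind it.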

Version-Release number of selected component (if applicable):
osp7
qemu-kvm-rhev-2.6.0-28.el7_3.6.x86_64
openstack-nova-compute-2015.1.4-32.el7ost.noarch

How reproducible:
sometimes

Comment 2 Kashyap Chamarthy 2017-05-15 11:47:58 UTC
Can you please also provide the complete QEMU log, so we could examine which specific Virtio device is being affected?

Comment 3 Kashyap Chamarthy 2017-05-15 11:59:08 UTC
(In reply to Kashyap Chamarthy from comment #2)
> Can you please also provide the complete QEMU log, so we could examine which
> specific Virtio device is being affected?

Err, the logs are in comment #1.  Please disregard me.

Comment 5 Daniel Berrangé 2017-05-17 08:39:47 UTC
Given the QEMU version 2.6.0 and the machine type pc-i440fx-rhel7.2.0, I presume this migration was run from a RHEL-7.2 host (src) to a RHEL-7.3 host (dst).

Comment 6 Daniel Berrangé 2017-05-17 08:58:38 UTC
The QEMU version on the source host is qemu-kvm-common-rhev-2.3.0-31.el7_2.13.x86_64, which is quite a few releases behind the latest available errata.

There is a bug fix in qemu-kvm-rhev-2.3.0-31.el7_2.25 (bug 1400551) with a superficially similar error message:

  2016-12-02T05:20:27.182466Z qemu-kvm: VQ 2 size 0x80 < last_avail_idx 0x1 - used_idx 0x2

but 

  - bug 1400551 is wrt virtio-balloon, whereas this failure is hitting virtio-net
  - bug 1400551 should only hit if they had the CVE-2016-5403 / bug 1359731 fix applied from rhev-2.3.0-31.el7_2.20, which they don't appear to have

so I think on balance this is probably a different bug.
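The bug 1400551 message quoted above comes from a different sanity check than the one in this report: there, the count of in-flight descriptors, last_avail_idx minus used_idx as a 16-bit wrapping difference, must fit in the ring. A sketch of that second check, again with names mirroring the error text rather than QEMU's actual code:

```python
RING_INDEX_MASK = 0xFFFF  # virtio ring indices wrap at 2^16


def check_inuse(vq_size, last_avail_idx, used_idx):
    """Sketch of the check behind 'VQ n size 0x.. < last_avail_idx ..
    - used_idx ..' (the bug 1400551 signature)."""
    # Descriptors the device has popped from the avail ring but not yet
    # marked used; must never exceed the ring size.
    inuse = (last_avail_idx - used_idx) & RING_INDEX_MASK
    if inuse > vq_size:
        raise ValueError(
            "VQ size 0x%x < last_avail_idx 0x%x - used_idx 0x%x"
            % (vq_size, last_avail_idx, used_idx))
    return inuse


# The values from bug 1400551: size 0x80, last_avail_idx 0x1, used_idx 0x2.
# (0x1 - 0x2) & 0xFFFF == 0xFFFF: used_idx running ahead of last_avail_idx,
# the virtio-balloon signature, whereas this report trips the avail-index
# check instead.
```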

Comment 7 Dr. David Alan Gilbert 2017-05-17 09:01:01 UTC
lprosek: Is this actually the same bug? I see:
  https://bugzilla.redhat.com/show_bug.cgi?id=1388465#c15
which suggests it might be.

Comment 12 Ladi Prosek 2017-05-17 11:22:57 UTC
(In reply to Dr. David Alan Gilbert from comment #7)
> lprosek: Is this actually the same bug? I see:
>   https://bugzilla.redhat.com/show_bug.cgi?id=1388465#c15
> suggest it might be ?

I don't think so. The only suspicious virtqueue operation in virtio-net is virtqueue_discard (which was actually also fixed as part of the CVE follow-up last year) but it's not included in qemu-kvm-rhev-2.3.0. It was added in:

commit 0cf33fb6b49a19de32859e2cdc6021334f448fb3
Author: Jason Wang <jasowang>
Date:   Fri Sep 25 13:21:30 2015 +0800

    virtio-net: correctly drop truncated packets


Is it possible that the VM had been migrated from another host running a different version of qemu-kvm-rhev or did it start on qemu-kvm-common-rhev-2.3.0-31.el7_2.13.x86_64 fresh?

Comment 13 Daniel Berrangé 2017-05-17 11:26:47 UTC
(In reply to Ladi Prosek from comment #12)
> (In reply to Dr. David Alan Gilbert from comment #7)
> > lprosek: Is this actually the same bug? I see:
> >   https://bugzilla.redhat.com/show_bug.cgi?id=1388465#c15
> > suggest it might be ?
> 
> I don't think so. The only suspicious virtqueue operation in virtio-net is
> virtqueue_discard (which was actually also fixed as part of the CVE
> follow-up last year) but it's not included in qemu-kvm-rhev-2.3.0. It was
> added in:
> 
> commit 0cf33fb6b49a19de32859e2cdc6021334f448fb3
> Author: Jason Wang <jasowang>
> Date:   Fri Sep 25 13:21:30 2015 +0800
> 
>     virtio-net: correctly drop truncated packets
> 
> 
> Is it possible that the VM had been migrated from another host running a
> different version of qemu-kvm-rhev or did it start on
> qemu-kvm-common-rhev-2.3.0-31.el7_2.13.x86_64 fresh?

The libvirt logs on the source host do not contain any -incoming arg, so it must have been a cold boot with qemu-kvm-common-rhev-2.3.0-31.el7_2.13.x86_64.

Comment 14 Ladi Prosek 2017-05-17 12:20:59 UTC
(In reply to Daniel Berrange from comment #13)
> The libvirt logs on the source host do not container any -incoming arg, so
> it must have been a cold boot with
> qemu-kvm-common-rhev-2.3.0-31.el7_2.13.x86_64

Thanks, I did an experiment to see what happens if the rx queue fills up - the problem that 0cf33fb addresses - and it just fills up, no migration problems. So this was likely a false lead :/

Comment 21 Dr. David Alan Gilbert 2017-05-19 17:17:29 UTC
I'm not having any luck replicating this here; I've run ~230 iterations of a migration from:
   qemu-kvm-rhev-2.3.0-31.el7_2.13.x86_64
to
   qemu-kvm-rhev-2.6.0-28.el7_3.6.x86_64

with a 7.3 guest doing network IO without any problems.

Note I'm running 3.10.0-514.16.1.el7.x86_64->3.10.0-663.el7.test.x86_64 for the kernels.

Comment 49 jason wang 2017-06-02 12:00:43 UTC
Created attachment 1284416 [details]
systemtap debug script

Comment 54 jason wang 2017-06-06 06:02:36 UTC
Created attachment 1285206 [details]
new systemtap debug script

Comment 62 Matt Riedemann 2018-04-06 16:28:17 UTC
Seeing the same thing in OpenStack's live migration CI job runs in Queens and Rocky:

http://logs.openstack.org/37/522537/20/check/legacy-tempest-dsvm-multinode-live-migration/8de6e74/logs/subnode-2/libvirt/qemu/instance-00000002.txt.gz

http://logs.openstack.org/37/522537/20/check/legacy-tempest-dsvm-multinode-live-migration/8de6e74/logs/screen-n-cpu.txt.gz?level=TRACE#_Apr_05_21_48_43_258043

Those runs are currently using the Pike Ubuntu Cloud Archive, so:

ii  libvirt-bin                         3.6.0-1ubuntu6.2~cloud0

ii  qemu-system-x86                     1:2.10+dfsg-0ubuntu3.5~cloud0

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22VQ%200%20size%5C%22%20AND%20message%3A%5C%22Guest%20index%5C%22%20AND%20message%3A%5C%22inconsistent%20with%20Host%20index%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22&from=7d

Note that these CI jobs are running tests in parallel, so there are concurrently running live migrations on the same hosts.

Comment 63 Matt Riedemann 2018-04-06 16:38:35 UTC
Tracking this in OpenStack with this bug:

https://bugs.launchpad.net/nova/+bug/1761798