Bug 1450524
Summary: | qemu-kvm: VQ 0 size 0x100 Guest index 0x2010 inconsistent with Host index 0x2171: delta 0xfe9f | |
---|---|---|---
Product: | Red Hat Enterprise Linux 7 | Reporter: | Jack Waterworth <jwaterwo>
Component: | qemu-kvm-rhev | Assignee: | jason wang <jasowang>
Status: | CLOSED WORKSFORME | QA Contact: | xiywang
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 7.3 | CC: | ailan, berrange, chayang, dasmith, dgilbert, eglynn, hhuang, huding, jasowang, jdenemar, juzhang, jwaterwo, kchamart, knoel, laine, lprosek, michen, mriedem, mschuppe, mst, pezhang, qzhang, rbryant, sbauza, sferdjao, sgordon, srevivo, virt-bugs, virt-maint, vromanso, wexu, xianwang
Target Milestone: | pre-dev-freeze | Keywords: | Unconfirmed
Target Release: | 7.4 | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2017-06-19 13:34:05 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | | |
Description
Jack Waterworth
2017-05-12 21:59:57 UTC
Can you please also provide the complete QEMU log, so we could examine which specific Virtio device is being affected?

(In reply to Kashyap Chamarthy from comment #2)
> Can you please also provide the complete QEMU log, so we could examine which
> specific Virtio device is being affected?

Err, the logs are in comment #1. Please disregard me.

Given the QEMU version 2.6.0 and the machine type pc-i440fx-rhel7.2.0, I presume this migration was run from a RHEL-7.2 (src) -> RHEL-7.3 (dst) host.

The QEMU version on the source host is qemu-kvm-common-rhev-2.3.0-31.el7_2.13.x86_64, which is quite a few releases behind the latest errata available. There is a bug fix in qemu-kvm-rhev-2.3.0-31.el7_2.25 (bug 1400551) with a superficially similar error message:

2016-12-02T05:20:27.182466Z qemu-kvm: VQ 2 size 0x80 < last_avail_idx 0x1 - used_idx 0x2

but:

- bug 1400551 is wrt virtio-balloon, whereas this failure is hitting virtio-net
- bug 1400551 should only hit if they had the CVE-2016-5403 / bug 1359731 fix applied from rhev-2.3.0-31.el7_2.20, which they don't appear to have

so I think on balance this is probably a different bug.

lprosek: Is this actually the same bug? I see:
https://bugzilla.redhat.com/show_bug.cgi?id=1388465#c15
suggest it might be?

(In reply to Dr. David Alan Gilbert from comment #7)
> lprosek: Is this actually the same bug? I see:
> https://bugzilla.redhat.com/show_bug.cgi?id=1388465#c15
> suggest it might be?

I don't think so. The only suspicious virtqueue operation in virtio-net is virtqueue_discard (which was actually also fixed as part of the CVE follow-up last year), but it's not included in qemu-kvm-rhev-2.3.0.
It was added in:

commit 0cf33fb6b49a19de32859e2cdc6021334f448fb3
Author: Jason Wang <jasowang>
Date:   Fri Sep 25 13:21:30 2015 +0800

    virtio-net: correctly drop truncated packets

Is it possible that the VM had been migrated from another host running a different version of qemu-kvm-rhev, or did it start fresh on qemu-kvm-common-rhev-2.3.0-31.el7_2.13.x86_64?

(In reply to Ladi Prosek from comment #12)
> I don't think so. The only suspicious virtqueue operation in virtio-net is
> virtqueue_discard (which was actually also fixed as part of the CVE
> follow-up last year) but it's not included in qemu-kvm-rhev-2.3.0. It was
> added in:
>
> commit 0cf33fb6b49a19de32859e2cdc6021334f448fb3
> Author: Jason Wang <jasowang>
> Date:   Fri Sep 25 13:21:30 2015 +0800
>
>     virtio-net: correctly drop truncated packets
>
> Is it possible that the VM had been migrated from another host running a
> different version of qemu-kvm-rhev or did it start on
> qemu-kvm-common-rhev-2.3.0-31.el7_2.13.x86_64 fresh?

The libvirt logs on the source host do not contain any -incoming arg, so it must have been a cold boot with qemu-kvm-common-rhev-2.3.0-31.el7_2.13.x86_64.

(In reply to Daniel Berrange from comment #13)
> The libvirt logs on the source host do not contain any -incoming arg, so
> it must have been a cold boot with
> qemu-kvm-common-rhev-2.3.0-31.el7_2.13.x86_64

Thanks. I did an experiment to see what happens if the rx queue fills up - the problem that 0cf33fb addresses - and it just fills up, with no migration problems. So this was likely a false lead :/

I'm not having any luck replicating this here; I've run ~230 iterations of a migration from qemu-kvm-rhev-2.3.0-31.el7_2.13.x86_64 to qemu-kvm-rhev-2.6.0-28.el7_3.6.x86_64 with a 7.3 guest doing network IO, without any problems.

Note I'm running 3.10.0-514.16.1.el7.x86_64 -> 3.10.0-663.el7.test.x86_64 for the kernels.

Created attachment 1284416 [details]
systemtap debug script
Created attachment 1285206 [details]
new systemtap debug script
Seeing the same thing in OpenStack's live migration CI job runs in Queens and Rocky:

http://logs.openstack.org/37/522537/20/check/legacy-tempest-dsvm-multinode-live-migration/8de6e74/logs/subnode-2/libvirt/qemu/instance-00000002.txt.gz
http://logs.openstack.org/37/522537/20/check/legacy-tempest-dsvm-multinode-live-migration/8de6e74/logs/screen-n-cpu.txt.gz?level=TRACE#_Apr_05_21_48_43_258043

Those are at present using the Pike Ubuntu cloud archive, so:

ii libvirt-bin 3.6.0-1ubuntu6.2~cloud0
ii qemu-system-x86 1:2.10+dfsg-0ubuntu3.5~cloud0

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22VQ%200%20size%5C%22%20AND%20message%3A%5C%22Guest%20index%5C%22%20AND%20message%3A%5C%22inconsistent%20with%20Host%20index%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22&from=7d

Note that these CI jobs are running tests in parallel, so there are concurrently running live migrations on the same hosts.

Tracking this in OpenStack with this bug: https://bugs.launchpad.net/nova/+bug/1761798