Bug 1775251

Summary: qemu-kvm crashes when Windows VM is migrated with multiqueue
Product: Red Hat Enterprise Linux 7
Component: qemu-kvm-rhev
Version: 7.7
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: high
Keywords: Regression, TestBlocker, ZStream
Target Milestone: rc
Fixed In Version: qemu-kvm-rhev-2.12.0-40.el7
Reporter: Radek Duda <rduda>
Assignee: Dr. David Alan Gilbert <dgilbert>
QA Contact: Li Xiaohui <xiaohli>
CC: hhuang, jinzhao, juzhang, mrezanin, mtessun, rh-spice-bugs, virt-maint
Type: Bug
Last Closed: 2020-03-31 14:34:56 UTC
Bug Blocks: 1776808, 1779100
Attachments:
qemu.log
vdsm.log
qemu-kvm full backtrace
qemu-kvm_debug.log

Description Radek Duda 2019-11-21 15:38:52 UTC
Created attachment 1638503 [details]
qemu.log

Description of problem:
When I migrate a Windows VM, qemu-kvm crashes with this error:

(process:63218): GLib-ERROR **: 15:21:00.329: gmem.c:135: failed to allocate 562949953421312 bytes
2019-11-21 14:21:10.152+0000: shutting down, reason=crashed

Version-Release number of selected component (if applicable):
hosts rhel7.7:
qemu-kvm-rhev-2.12.0-33.el7_7.6.x86_64
vdsm-4.30.38-1.el7ev.x86_64
spice-server-0.14.0-7.el7.x86_64

guest:
Win10 
rhv-guest-tools 4.3-10.el7 installed

Use RHV-4.3.7.2-0.1.el7 to migrate.

How reproducible:
always

Steps to Reproduce:
1. Create a Windows VM in RHV
2. Migrate it to another host (press Migrate)

Actual results:
qemu-kvm crashes with this error in qemu.log:

(process:63218): Spice-DEBUG: 15:21:00.329: reds.c:4308:spice_server_migrate_start: trace
tcmalloc: large alloc 562949953421312 bytes == (nil) @  0x7f5ab473a35f 0x7f5ab475a1e7 0x7f5acc23e6e6 0x563dafbcbac8 0x563dafbcc8e5 0x563dafb7cbae 0x563dafb837b6 0x563dafd4bde5 0x563dafd47bdf 0x7f5ab36e2ea5 0x7f5ab340b8cd

(process:63218): GLib-ERROR **: 15:21:00.329: gmem.c:135: failed to allocate 562949953421312 bytes
2019-11-21 14:21:10.152+0000: shutting down, reason=crashed

Expected results:
no crash, successful migration


Additional info:
I did not reproduce this with RHEL VMs.

Comment 2 Radek Duda 2019-11-21 15:45:28 UTC
Created attachment 1638509 [details]
vdsm.log

Comment 5 Dr. David Alan Gilbert 2019-11-21 18:49:01 UTC
Hi Radek,
  Any chance you can get a full backtrace from the destination?
This smells like a spice bug somewhere; the:

failed to allocate 562949953421312 bytes

is something I've not seen before; that's 2^49!

Did this happen previously, or is it something new - perhaps a new guest driver, a new qemu version, or new spice?

Comment 6 Radek Duda 2019-11-25 11:39:45 UTC
Created attachment 1639445 [details]
qemu-kvm full backtrace

Comment 7 Dr. David Alan Gilbert 2019-11-25 11:48:32 UTC
Huh, that's fun; this is on the saving side, in vhost (not spice related):

#4  0x00007fb24fde171c in g_malloc0 (n_bytes=n_bytes@entry=562949953421312) at gmem.c:134
        mem = <optimized out>
#5  0x00005650c5ef9ac8 in vhost_log_alloc (share=false, size=70368744177664)
    at /usr/src/debug/qemu-2.12.0/hw/virtio/vhost.c:196
        err = 0x0
        log = 0x5650c951da80
        logsize = 562949953421312
        fd = -1
#6  vhost_log_get (size=size@entry=70368744177664, share=<optimized out>)
    at /usr/src/debug/qemu-2.12.0/hw/virtio/vhost.c:211
        log = <optimized out>
#7  0x00005650c5efa8e5 in vhost_dev_log_resize (size=70368744177664, dev=0x5650c8593500)
    at /usr/src/debug/qemu-2.12.0/hw/virtio/vhost.c:263
        log = <optimized out>
        log_base = <optimized out>
        r = <optimized out>
#8  vhost_migration_log (enable=1, listener=0x5650c8593508)
    at /usr/src/debug/qemu-2.12.0/hw/virtio/vhost.c:791
        dev = 0x5650c8593500
        r = <optimized out>
#9  vhost_log_global_start (listener=0x5650c8593508) at /usr/src/debug/qemu-2.12.0/hw/virtio/vhost.c:805
No locals.
#10 0x00005650c5eaabae in memory_global_dirty_log_start () at /usr/src/debug/qemu-2.12.0/memory.c:2541
        _listener = 0x5650c8593508
#11 0x00005650c5eb17b6 in ram_init_bitmaps (rs=<optimized out>)
    at /usr/src/debug/qemu-2.12.0/migration/ram.c:2221
No locals.
#12 ram_init_all (rsp=0x5650c6a0bdc0 <ram_state>) at /usr/src/debug/qemu-2.12.0/migration/ram.c:2240
No locals.
#13 ram_save_setup (f=0x5650c8b1a000, opaque=0x5650c6a0bdc0 <ram_state>)
    at /usr/src/debug/qemu-2.12.0/migration/ram.c:2271
        rsp = 0x5650c6a0bdc0 <ram_state>
        block = <optimized out>
#14 0x00005650c6079de5 in qemu_savevm_state_setup (f=0x5650c8b1a000) at migration/savevm.c:1004
        se = 0x5650c853f760
        ret = <optimized out>
#15 0x00005650c6075bdf in migration_thread (opaque=0x5650c853a280) at migration/migration.c:2407
        s = 0x5650c853a280
        setup_start = 1574678886573
#16 0x00007fb237285ea5 in start_thread (arg=0x7fb1559ce700) at pthread_create.c:307
        __res = <optimized out>
        pd = 0x7fb1559ce700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140399622285056, -382482930179187553, 0, 8392704, 3, 
                140399622285056, 419836370933308575, 420325280375037087}, mask_was_saved = 0}}, priv = {
            pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>

Comment 8 Dr. David Alan Gilbert 2019-11-25 11:57:53 UTC
My guess here is that something somewhere has a size of 2^64-1, and it's getting divided down by the log page size (4k = 12 bits, so 64-12 = 52) and then divided down by 8, possibly from vhost_log_bits - taking it from 52 down to 49, which gives the 2^49 we're seeing.
I'll build a debug build for you to try.
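
Spelling that guess out as arithmetic - a standalone sketch, not the actual vhost code; the 4 KiB log page and one-dirty-bit-per-page figures are assumptions taken from the reasoning above:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t bad_size = UINT64_MAX;     /* the suspected 2^64-1 input */
    uint64_t pages    = bad_size >> 12; /* 4 KiB log pages: ~2^52 of them */
    uint64_t bytes    = pages >> 3;     /* 1 dirty bit per page, 8 bits/byte: ~2^49 */

    /* prints 562949953421311, i.e. just under 2^49; the real code rounds
     * up, which would give the exact 562949953421312 from the error */
    printf("%" PRIu64 "\n", bytes);
    return 0;
}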

Comment 9 Radek Duda 2019-11-25 14:59:47 UTC
| Did this happen previously, or is it something new - perhaps a new guest driver, a new qemu version, or new spice?

I do not know - the last time I successfully migrated a Windows VM was 26.9.2019, and I have not migrated any VMs between that date and now.

Comment 10 Dr. David Alan Gilbert 2019-11-25 18:19:45 UTC
Hi Radek,
  Can you please try: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=24946820
It should print some debug output like:

vhost_get_log_size: virtio-net
vhost_get_log_size: region 0 last=655359 gpa=0 pre-log_size=0
vhost_get_log_size: region 1 last=3221225471 gpa=c0000 pre-log_size=3
vhost_get_log_size: region 2 last=9663676415 gpa=100000000 pre-log_size=12288
vhost_get_log_size: vq 0 used_phys=9417298496 used_size=2054 last=9417300549 pre-log_size=36864
vhost_get_log_size: vq 1 used_phys=9416602176 used_size=2054 last=9416604229 pre-log_size=36864
vhost_dev_log_resize: virtio-net: size=36864
vhost_log_get: share=0 size=36864
vhost_log_alloc: size=36864 logsize=294912

Please give me a full copy of that (probably from the source side).

Dave

Comment 11 Radek Duda 2019-11-26 10:08:49 UTC
Created attachment 1639764 [details]
qemu-kvm_debug.log

I tried your build. Here is the output.

Comment 12 Dr. David Alan Gilbert 2019-11-26 10:11:46 UTC
vhost_get_log_size: virtio-net
vhost_get_log_size: region 0 last=655359 gpa=0 pre-log_size=0
vhost_get_log_size: region 1 last=3145727999 gpa=c0000 pre-log_size=3
vhost_get_log_size: region 2 last=4093640703 gpa=f0000000 pre-log_size=12000
vhost_get_log_size: region 3 last=4160749567 gpa=f4000000 pre-log_size=15616
vhost_get_log_size: region 4 last=4227858431 gpa=f8000000 pre-log_size=15872
vhost_get_log_size: region 5 last=4236247039 gpa=fc000000 pre-log_size=16128
vhost_get_log_size: region 6 last=4244635647 gpa=fc800000 pre-log_size=16160
vhost_get_log_size: region 7 last=4253024255 gpa=fd000000 pre-log_size=16192
vhost_get_log_size: region 8 last=4261412863 gpa=fd800000 pre-log_size=16224
vhost_get_log_size: region 9 last=4261765119 gpa=fe054000 pre-log_size=16256
vhost_get_log_size: region 10 last=4261773311 gpa=fe056000 pre-log_size=16258
vhost_get_log_size: region 11 last=4261781503 gpa=fe058000 pre-log_size=16258
vhost_get_log_size: region 12 last=4261789695 gpa=fe05a000 pre-log_size=16258
vhost_get_log_size: vq 0 used_phys=449200704 used_size=2052 last=449202755 pre-log_size=16258
vhost_get_log_size: vq 1 used_phys=446005824 used_size=2052 last=446007875 pre-log_size=16258
vhost_dev_log_resize: virtio-net: size=16258
vhost_log_get: share=0 size=16258
vhost_log_alloc: size=16258 logsize=130064
vhost_get_log_size: virtio-net
vhost_get_log_size: region 0 last=655359 gpa=0 pre-log_size=0
vhost_get_log_size: region 1 last=3145727999 gpa=c0000 pre-log_size=3
vhost_get_log_size: region 2 last=4093640703 gpa=f0000000 pre-log_size=12000
vhost_get_log_size: region 3 last=4160749567 gpa=f4000000 pre-log_size=15616
vhost_get_log_size: region 4 last=4227858431 gpa=f8000000 pre-log_size=15872
vhost_get_log_size: region 5 last=4236247039 gpa=fc000000 pre-log_size=16128
vhost_get_log_size: region 6 last=4244635647 gpa=fc800000 pre-log_size=16160
vhost_get_log_size: region 7 last=4253024255 gpa=fd000000 pre-log_size=16192
vhost_get_log_size: region 8 last=4261412863 gpa=fd800000 pre-log_size=16224
vhost_get_log_size: region 9 last=4261765119 gpa=fe054000 pre-log_size=16256
vhost_get_log_size: region 10 last=4261773311 gpa=fe056000 pre-log_size=16258
vhost_get_log_size: region 11 last=4261781503 gpa=fe058000 pre-log_size=16258
vhost_get_log_size: region 12 last=4261789695 gpa=fe05a000 pre-log_size=16258
vhost_get_log_size: vq 0 used_phys=448209472 used_size=2052 last=448211523 pre-log_size=16258
vhost_get_log_size: vq 1 used_phys=538899008 used_size=2052 last=538901059 pre-log_size=16258
vhost_dev_log_resize: virtio-net: size=16258
vhost_log_get: share=0 size=16258
vhost_get_log_size: virtio-net
vhost_get_log_size: region 0 last=655359 gpa=0 pre-log_size=0
vhost_get_log_size: region 1 last=3145727999 gpa=c0000 pre-log_size=3
vhost_get_log_size: region 2 last=4093640703 gpa=f0000000 pre-log_size=12000
vhost_get_log_size: region 3 last=4160749567 gpa=f4000000 pre-log_size=15616
vhost_get_log_size: region 4 last=4227858431 gpa=f8000000 pre-log_size=15872
vhost_get_log_size: region 5 last=4236247039 gpa=fc000000 pre-log_size=16128
vhost_get_log_size: region 6 last=4244635647 gpa=fc800000 pre-log_size=16160
vhost_get_log_size: region 7 last=4253024255 gpa=fd000000 pre-log_size=16192
vhost_get_log_size: region 8 last=4261412863 gpa=fd800000 pre-log_size=16224
vhost_get_log_size: region 9 last=4261765119 gpa=fe054000 pre-log_size=16256
vhost_get_log_size: region 10 last=4261773311 gpa=fe056000 pre-log_size=16258
vhost_get_log_size: region 11 last=4261781503 gpa=fe058000 pre-log_size=16258
vhost_get_log_size: region 12 last=4261789695 gpa=fe05a000 pre-log_size=16258
vhost_get_log_size: vq 0 used_phys=0 used_size=0 last=-1 pre-log_size=16258                  <<<<<<<<<<<
vhost_get_log_size: vq 1 used_phys=0 used_size=0 last=-1 pre-log_size=70368744177664
vhost_dev_log_resize: virtio-net: size=70368744177664
vhost_log_get: share=0 size=70368744177664
vhost_log_alloc: size=70368744177664 logsize=562949953421312
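
The last=-1 for both vqs is the tell: with used_phys=0 and used_size=0 (presumably virtqueues the guest never started), an end-of-range computed as phys + size - 1 wraps around to 2^64-1. A minimal standalone sketch that reproduces the exact numbers above - the VHOST_LOG_* constants mirror the ones in qemu-2.12's hw/virtio/vhost.c, but this is an illustration, not the real code:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define VHOST_LOG_PAGE  0x1000ULL                   /* 4 KiB per log page */
#define VHOST_LOG_BITS  (8 * sizeof(uint64_t))      /* 64 dirty bits per chunk */
#define VHOST_LOG_CHUNK (VHOST_LOG_PAGE * VHOST_LOG_BITS)

int main(void)
{
    uint64_t used_phys = 0, used_size = 0;          /* a vq that was never started */
    uint64_t last = used_phys + used_size - 1;      /* wraps to UINT64_MAX */
    uint64_t size = last / VHOST_LOG_CHUNK + 1;     /* 2^46 chunks */
    uint64_t logsize = size * sizeof(uint64_t);     /* 2^49 bytes for g_malloc0 */

    /* prints: last=-1 size=70368744177664 logsize=562949953421312 */
    printf("last=%" PRId64 " size=%" PRIu64 " logsize=%" PRIu64 "\n",
           (int64_t)last, size, logsize);
    return 0;
}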

Comment 13 Dr. David Alan Gilbert 2019-11-26 10:29:34 UTC
This looks like it should be fixed (mostly) by upstream commit 240e647:

'vhost: fix vhost_log size overflow during migration'
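
The shape of that fix is to make vhost_get_log_size() skip virtqueues that were never set up, so their zeroed used_phys/used_size can't wrap. A paraphrased sketch of the guarded loop, using the identifiers from vhost_get_log_size() in hw/virtio/vhost.c (consult the actual commit for the authoritative diff):

    for (i = 0; i < dev->nvqs; ++i) {
        struct vhost_virtqueue *vq = dev->vqs + i;

        /* vqs the guest never started have used_phys == used_size == 0;
         * without this guard, used_phys + used_size - 1 wraps to 2^64-1
         * and the log balloons to 2^46 chunks, as seen in comment 12 */
        if (!vq->used_phys && !vq->used_size) {
            continue;
        }
        uint64_t last = vq->used_phys + vq->used_size - 1;
        log_size = MAX(log_size, last / VHOST_LOG_CHUNK + 1);
    }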

Comment 21 Li Xiaohui 2019-12-09 10:56:08 UTC
Reproduced the bz on hosts (kernel-3.10.0-1118.el7.x86_64 & qemu-img-rhev-2.12.0-39.el7.x86_64),
and verified the bz on the same hosts with qemu-img-rhev-2.12.0-40.el7.x86_64: migration finished successfully, and the guest on the dst host ran/rebooted/shut down OK.

Comment 22 Li Xiaohui 2019-12-12 10:01:28 UTC
Adding some details to Comment 21:
when verifying this bz, I tested with win7/win10/win2019/rhel8.2 guests; they all migrated successfully and rebooted/shut down OK.
So this bz is marked verified per Comments 21 & 22.

Comment 24 errata-xmlrpc 2020-03-31 14:34:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:1216