Bug 1775251 - qemu-kvm crashes when Windows VM is migrated with multiqueue
Summary: qemu-kvm crashes when Windows VM is migrated with multiqueue
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Dr. David Alan Gilbert
QA Contact: Li Xiaohui
URL:
Whiteboard:
Depends On:
Blocks: 1776808 1779100
 
Reported: 2019-11-21 15:38 UTC by Radek Duda
Modified: 2020-03-31 14:36 UTC (History)
7 users

Fixed In Version: qemu-kvm-rhev-2.12.0-40.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1776808 1779100 (view as bug list)
Environment:
Last Closed: 2020-03-31 14:34:56 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
qemu.log (26.79 KB, text/plain)
2019-11-21 15:38 UTC, Radek Duda
no flags Details
vdsm.log (101.03 KB, text/plain)
2019-11-21 15:45 UTC, Radek Duda
no flags Details
qemu-kvm full backtrace (32.24 KB, text/plain)
2019-11-25 11:39 UTC, Radek Duda
no flags Details
qemu-kvm_debug.log (46.29 KB, text/plain)
2019-11-26 10:08 UTC, Radek Duda
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2020:1216 0 None None None 2020-03-31 14:36:59 UTC

Description Radek Duda 2019-11-21 15:38:52 UTC
Created attachment 1638503 [details]
qemu.log

Description of problem:
If I migrate a Windows VM, qemu-kvm crashes with the error:

(process:63218): GLib-ERROR **: 15:21:00.329: gmem.c:135: failed to allocate 562949953421312 bytes
2019-11-21 14:21:10.152+0000: shutting down, reason=crashed

Version-Release number of selected component (if applicable):
hosts rhel7.7:
qemu-kvm-rhev-2.12.0-33.el7_7.6.x86_64
vdsm-4.30.38-1.el7ev.x86_64
spice-server-0.14.0-7.el7.x86_64

guest:
Win10 
rhv-guest-tools 4.3-10.el7 installed

Used RHV-4.3.7.2-0.1.el7 to migrate.

How reproducible:
always

Steps to Reproduce:
1. Create a Windows VM in RHV
2. Migrate it to another host (press Migrate)
3.

Actual results:
qemu-kvm crashes with error in qemu.log:

(process:63218): Spice-DEBUG: 15:21:00.329: reds.c:4308:spice_server_migrate_start: trace
tcmalloc: large alloc 562949953421312 bytes == (nil) @  0x7f5ab473a35f 0x7f5ab475a1e7 0x7f5acc23e6e6 0x563dafbcbac8 0x563dafbcc8e5 0x563dafb7cbae 0x563dafb837b6 0x563dafd4bde5 0x563dafd47bdf 0x7f5ab36e2ea5 0x7f5ab340b8cd

(process:63218): GLib-ERROR **: 15:21:00.329: gmem.c:135: failed to allocate 562949953421312 bytes
2019-11-21 14:21:10.152+0000: shutting down, reason=crashed




Expected results:
no crash, successful migration


Additional info:
I did not reproduce this with RHEL VMs.

Comment 2 Radek Duda 2019-11-21 15:45:28 UTC
Created attachment 1638509 [details]
vdsm.log

Comment 5 Dr. David Alan Gilbert 2019-11-21 18:49:01 UTC
Hi Radek,
  Any chance you can get a full backtrace from the destination?
This smells like it's a spice bug somewhere; the:

failed to allocate 562949953421312 bytes

is something I've not seen before; that's 2^49!

Did this happen previously - is it something new with perhaps a new guest driver, or new qemu version or new spice?

Comment 6 Radek Duda 2019-11-25 11:39:45 UTC
Created attachment 1639445 [details]
qemu-kvm full backtrace

Comment 7 Dr. David Alan Gilbert 2019-11-25 11:48:32 UTC
Huh that's fun; this is on the saving side in vhost (not spice related)

#4  0x00007fb24fde171c in g_malloc0 (n_bytes=n_bytes@entry=562949953421312) at gmem.c:134
        mem = <optimized out>
#5  0x00005650c5ef9ac8 in vhost_log_alloc (share=false, size=70368744177664)
    at /usr/src/debug/qemu-2.12.0/hw/virtio/vhost.c:196
        err = 0x0
        log = 0x5650c951da80
        logsize = 562949953421312
        fd = -1
#6  vhost_log_get (size=size@entry=70368744177664, share=<optimized out>)
    at /usr/src/debug/qemu-2.12.0/hw/virtio/vhost.c:211
        log = <optimized out>
#7  0x00005650c5efa8e5 in vhost_dev_log_resize (size=70368744177664, dev=0x5650c8593500)
    at /usr/src/debug/qemu-2.12.0/hw/virtio/vhost.c:263
        log = <optimized out>
        log_base = <optimized out>
        r = <optimized out>
#8  vhost_migration_log (enable=1, listener=0x5650c8593508)
    at /usr/src/debug/qemu-2.12.0/hw/virtio/vhost.c:791
        dev = 0x5650c8593500
        r = <optimized out>
#9  vhost_log_global_start (listener=0x5650c8593508) at /usr/src/debug/qemu-2.12.0/hw/virtio/vhost.c:805
No locals.
#10 0x00005650c5eaabae in memory_global_dirty_log_start () at /usr/src/debug/qemu-2.12.0/memory.c:2541
        _listener = 0x5650c8593508
#11 0x00005650c5eb17b6 in ram_init_bitmaps (rs=<optimized out>)
    at /usr/src/debug/qemu-2.12.0/migration/ram.c:2221
No locals.
#12 ram_init_all (rsp=0x5650c6a0bdc0 <ram_state>) at /usr/src/debug/qemu-2.12.0/migration/ram.c:2240
No locals.
#13 ram_save_setup (f=0x5650c8b1a000, opaque=0x5650c6a0bdc0 <ram_state>)
    at /usr/src/debug/qemu-2.12.0/migration/ram.c:2271
        rsp = 0x5650c6a0bdc0 <ram_state>
        block = <optimized out>
#14 0x00005650c6079de5 in qemu_savevm_state_setup (f=0x5650c8b1a000) at migration/savevm.c:1004
        se = 0x5650c853f760
        ret = <optimized out>
#15 0x00005650c6075bdf in migration_thread (opaque=0x5650c853a280) at migration/migration.c:2407
        s = 0x5650c853a280
        setup_start = 1574678886573
#16 0x00007fb237285ea5 in start_thread (arg=0x7fb1559ce700) at pthread_create.c:307
        __res = <optimized out>
        pd = 0x7fb1559ce700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140399622285056, -382482930179187553, 0, 8392704, 3, 
                140399622285056, 419836370933308575, 420325280375037087}, mask_was_saved = 0}}, priv = {
            pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>

Comment 8 Dr. David Alan Gilbert 2019-11-25 11:57:53 UTC
My guess here is that something somewhere has a size of 2^64-1 and it's getting divided down by the log page size (4k = 12 bits, 64-12 = 52),
and then divided down by 8, possibly from vhost_log_bits, so going from 52 to 49 and the 2^49 we're seeing.
I'll build a debug build for you to try.

Comment 9 Radek Duda 2019-11-25 14:59:47 UTC
| Did this happen previously - is it something new with perhaps a new guest driver, or new qemu version or new spice?

I do not know - the last time I successfully migrated a Windows VM was on 26.9.2019, and I have not migrated any VMs between then and now.

Comment 10 Dr. David Alan Gilbert 2019-11-25 18:19:45 UTC
Hi Radek,
  Can you please try: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=24946820  
it should give some debug like:

vhost_get_log_size: virtio-net
vhost_get_log_size: region 0 last=655359 gpa=0 pre-log_size=0
vhost_get_log_size: region 1 last=3221225471 gpa=c0000 pre-log_size=3
vhost_get_log_size: region 2 last=9663676415 gpa=100000000 pre-log_size=12288
vhost_get_log_size: vq 0 used_phys=9417298496 used_size=2054 last=9417300549 pre-log_size=36864
vhost_get_log_size: vq 1 used_phys=9416602176 used_size=2054 last=9416604229 pre-log_size=36864
vhost_dev_log_resize: virtio-net: size=36864
vhost_log_get: share=0 size=36864
vhost_log_alloc: size=36864 logsize=294912

please give me a full copy of that (probably from the source side).

Dave

Comment 11 Radek Duda 2019-11-26 10:08:49 UTC
Created attachment 1639764 [details]
qemu-kvm_debug.log

I tried your build. Here is the output.

Comment 12 Dr. David Alan Gilbert 2019-11-26 10:11:46 UTC
vhost_get_log_size: virtio-net
vhost_get_log_size: region 0 last=655359 gpa=0 pre-log_size=0
vhost_get_log_size: region 1 last=3145727999 gpa=c0000 pre-log_size=3
vhost_get_log_size: region 2 last=4093640703 gpa=f0000000 pre-log_size=12000
vhost_get_log_size: region 3 last=4160749567 gpa=f4000000 pre-log_size=15616
vhost_get_log_size: region 4 last=4227858431 gpa=f8000000 pre-log_size=15872
vhost_get_log_size: region 5 last=4236247039 gpa=fc000000 pre-log_size=16128
vhost_get_log_size: region 6 last=4244635647 gpa=fc800000 pre-log_size=16160
vhost_get_log_size: region 7 last=4253024255 gpa=fd000000 pre-log_size=16192
vhost_get_log_size: region 8 last=4261412863 gpa=fd800000 pre-log_size=16224
vhost_get_log_size: region 9 last=4261765119 gpa=fe054000 pre-log_size=16256
vhost_get_log_size: region 10 last=4261773311 gpa=fe056000 pre-log_size=16258
vhost_get_log_size: region 11 last=4261781503 gpa=fe058000 pre-log_size=16258
vhost_get_log_size: region 12 last=4261789695 gpa=fe05a000 pre-log_size=16258
vhost_get_log_size: vq 0 used_phys=449200704 used_size=2052 last=449202755 pre-log_size=16258
vhost_get_log_size: vq 1 used_phys=446005824 used_size=2052 last=446007875 pre-log_size=16258
vhost_dev_log_resize: virtio-net: size=16258
vhost_log_get: share=0 size=16258
vhost_log_alloc: size=16258 logsize=130064
vhost_get_log_size: virtio-net
vhost_get_log_size: region 0 last=655359 gpa=0 pre-log_size=0
vhost_get_log_size: region 1 last=3145727999 gpa=c0000 pre-log_size=3
vhost_get_log_size: region 2 last=4093640703 gpa=f0000000 pre-log_size=12000
vhost_get_log_size: region 3 last=4160749567 gpa=f4000000 pre-log_size=15616
vhost_get_log_size: region 4 last=4227858431 gpa=f8000000 pre-log_size=15872
vhost_get_log_size: region 5 last=4236247039 gpa=fc000000 pre-log_size=16128
vhost_get_log_size: region 6 last=4244635647 gpa=fc800000 pre-log_size=16160
vhost_get_log_size: region 7 last=4253024255 gpa=fd000000 pre-log_size=16192
vhost_get_log_size: region 8 last=4261412863 gpa=fd800000 pre-log_size=16224
vhost_get_log_size: region 9 last=4261765119 gpa=fe054000 pre-log_size=16256
vhost_get_log_size: region 10 last=4261773311 gpa=fe056000 pre-log_size=16258
vhost_get_log_size: region 11 last=4261781503 gpa=fe058000 pre-log_size=16258
vhost_get_log_size: region 12 last=4261789695 gpa=fe05a000 pre-log_size=16258
vhost_get_log_size: vq 0 used_phys=448209472 used_size=2052 last=448211523 pre-log_size=16258
vhost_get_log_size: vq 1 used_phys=538899008 used_size=2052 last=538901059 pre-log_size=16258
vhost_dev_log_resize: virtio-net: size=16258
vhost_log_get: share=0 size=16258
vhost_get_log_size: virtio-net
vhost_get_log_size: region 0 last=655359 gpa=0 pre-log_size=0
vhost_get_log_size: region 1 last=3145727999 gpa=c0000 pre-log_size=3
vhost_get_log_size: region 2 last=4093640703 gpa=f0000000 pre-log_size=12000
vhost_get_log_size: region 3 last=4160749567 gpa=f4000000 pre-log_size=15616
vhost_get_log_size: region 4 last=4227858431 gpa=f8000000 pre-log_size=15872
vhost_get_log_size: region 5 last=4236247039 gpa=fc000000 pre-log_size=16128
vhost_get_log_size: region 6 last=4244635647 gpa=fc800000 pre-log_size=16160
vhost_get_log_size: region 7 last=4253024255 gpa=fd000000 pre-log_size=16192
vhost_get_log_size: region 8 last=4261412863 gpa=fd800000 pre-log_size=16224
vhost_get_log_size: region 9 last=4261765119 gpa=fe054000 pre-log_size=16256
vhost_get_log_size: region 10 last=4261773311 gpa=fe056000 pre-log_size=16258
vhost_get_log_size: region 11 last=4261781503 gpa=fe058000 pre-log_size=16258
vhost_get_log_size: region 12 last=4261789695 gpa=fe05a000 pre-log_size=16258
vhost_get_log_size: vq 0 used_phys=0 used_size=0 last=-1 pre-log_size=16258                  <<<<<<<<<<<
vhost_get_log_size: vq 1 used_phys=0 used_size=0 last=-1 pre-log_size=70368744177664
vhost_dev_log_resize: virtio-net: size=70368744177664
vhost_log_get: share=0 size=70368744177664
vhost_log_alloc: size=70368744177664 logsize=562949953421312

Comment 13 Dr. David Alan Gilbert 2019-11-26 10:29:34 UTC
This looks like it should be fixed (mostly) by upstream: 240e647

'vhost: fix vhost_log size overflow during migration'

Comment 21 Li Xiaohui 2019-12-09 10:56:08 UTC
Reproduced the bug on hosts with kernel-3.10.0-1118.el7.x86_64 and qemu-img-rhev-2.12.0-39.el7.x86_64,
and verified it on the same hosts with qemu-img-rhev-2.12.0-40.el7.x86_64: migration finished successfully, and the guest on the destination host runs, reboots, and shuts down fine.

Comment 22 Li Xiaohui 2019-12-12 10:01:28 UTC
Adding some details to Comment 21:
when verifying this bug, I tested with win7/win10/win2019/rhel8.2 guests; they all migrated successfully and rebooted/shut down fine.
So this bug is marked verified per Comments 21 and 22.

Comment 24 errata-xmlrpc 2020-03-31 14:34:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:1216

