Bug 1953283 - qemu process is terminated while trying to migrate VM with sr-iov failover device after hotunplug/hotplug
Summary: qemu process is terminated while trying to migrate VM with sr-iov failover device after hotunplug/hotplug
Keywords:
Status: CLOSED DUPLICATE of bug 1953045
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: qemu-kvm
Version: 8.4
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: 8.4
Assignee: Virtualization Maintenance
QA Contact: Yanghang Liu
URL:
Whiteboard:
Depends On:
Blocks: 1688177
 
Reported: 2021-04-25 07:58 UTC by Michael Burman
Modified: 2021-04-27 02:41 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-04-27 02:40:35 UTC
Type: Bug
Target Upstream Version:
Embargoed:



Description Michael Burman 2021-04-25 07:58:31 UTC
Description of problem:
qemu process is terminated while trying to migrate VM with sr-iov failover device after hotunplug/hotplug.

RHV 4.4.6 introduced the SR-IOV failover device feature (bz 1688177), and while testing it we hit a bug that is blocking us badly.

When trying to migrate a VM with an SR-IOV failover device after performing a hot-unplug + hot-plug of the failover device, the migration is killed immediately with:
" VM Vm2 is down with error. Exit message: Lost connection with qemu process."

Apr 17 11:59:13 <hostname> kernel: qemu-kvm[15186]: segfault at 0 ip 0000000000000000 sp 00007ffca581b248 error 14 in qemu-kvm[55daaec72000+b13000]

Here is the stack trace that we were able to catch:

Thread 1 "qemu-kvm" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x0000560bf72d4f04 in notifier_list_notify (list=list@entry=0x560bf7b3f4c8 <migration_state_notifiers>, data=data@entry=0x560bf993efd0) at ../util/notify.c:39
#2  0x0000560bf70208e2 in migrate_fd_connect (s=s@entry=0x560bf993efd0, error_in=<optimized out>) at ../migration/migration.c:3636
#3  0x0000560bf6fb2eaa in migration_channel_connect (s=s@entry=0x560bf993efd0, ioc=ioc@entry=0x560bf9dc9810, hostname=hostname@entry=0x0, error=<optimized out>, error@entry=0x0) at ../migration/channel.c:92
#4  0x0000560bf6f7262e in fd_start_outgoing_migration (s=0x560bf993efd0, fdname=<optimized out>, errp=<optimized out>) at ../migration/fd.c:42
#5  0x0000560bf701f056 in qmp_migrate (uri=0x560bf9d6ade0 "fd:migrate", has_blk=<optimized out>, blk=<optimized out>, has_inc=<optimized out>, inc=<optimized out>, has_detach=<optimized out>, detach=true, has_resume=false, resume=false, 
    errp=0x7ffc3006c718) at ../migration/migration.c:2177
#6  0x0000560bf72b4a3e in qmp_marshal_migrate (args=<optimized out>, ret=<optimized out>, errp=0x7f14890bdec0) at qapi/qapi-commands-migration.c:533
#7  0x0000560bf72f87fd in do_qmp_dispatch_bh (opaque=0x7f14890bded0) at ../qapi/qmp-dispatch.c:110
#8  0x0000560bf72c9a8d in aio_bh_call (bh=0x7f13e4006080) at ../util/async.c:164
#9  aio_bh_poll (ctx=ctx@entry=0x560bf98dd340) at ../util/async.c:164
#10 0x0000560bf72cf772 in aio_dispatch (ctx=0x560bf98dd340) at ../util/aio-posix.c:381
#11 0x0000560bf72c9972 in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at ../util/async.c:306
#12 0x00007f1487f1877d in g_main_context_dispatch () from target:/lib64/libglib-2.0.so.0
#13 0x0000560bf72ca9f0 in glib_pollfds_poll () at ../util/main-loop.c:221
#14 os_host_main_loop_wait (timeout=<optimized out>) at ../util/main-loop.c:244
#15 main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:520
#16 0x0000560bf71b2251 in qemu_main_loop () at ../softmmu/vl.c:1679
#17 0x0000560bf6f33942 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at ../softmmu/main.c:50

Version-Release number of selected component (if applicable):
qemu-kvm-5.2.0-14.module+el8.4.0+10425+ad586fa5.x86_64

How reproducible:
100% all the time

Steps to Reproduce:
From the RHV point of view, I have no idea how to reproduce this with the qemu command line directly; also see the discussion in bz 1946981 from comment 32 and below. A rough qemu/QMP sketch of what each RHV step might map to is included after each step group:

RHV step 1. (Start a VM with sriov + failover vm nics) should roughly translate to:
1) Create bridge for failover network
2) Add VF on PF and unbind it
3) Start VM with VF and failover virtio network attached
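
On a plain qemu host, steps 1-3 might look roughly like the sketch below. All names here are assumptions for illustration (PF ens1f0, VF at 0000:3b:02.0, bridge failover-br0, the MAC address, device ids net0/hostdev0), not values from the actual reproducer:

# 1) bridge for the failover (virtio) network
ip link add name failover-br0 type bridge
ip link set failover-br0 up

# 2) create a VF on the PF, give it the MAC the guest will use,
#    and detach it from the host driver in favour of vfio-pci
echo 1 > /sys/class/net/ens1f0/device/sriov_numvfs
ip link set ens1f0 vf 0 mac 52:54:00:11:22:33
echo 0000:3b:02.0 > /sys/bus/pci/devices/0000:3b:02.0/driver/unbind
echo vfio-pci     > /sys/bus/pci/devices/0000:3b:02.0/driver_override
echo 0000:3b:02.0 > /sys/bus/pci/drivers_probe

# 3) start the guest with the standby virtio-net (failover=on) and the
#    primary VF (failover_pair_id pointing at the virtio-net device id)
/usr/libexec/qemu-kvm ... \
  -netdev bridge,id=hostnet0,br=failover-br0 \
  -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:11:22:33,failover=on \
  -device vfio-pci,host=0000:3b:02.0,id=hostdev0,failover_pair_id=net0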
 
RHV step 2. (Unplug sriov + failover):
4) Unplug VF
5) Rebind VF back to network driver
6) Unplug virtio failover network
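
In QMP terms (using the ids assumed in the sketch above), the unplug half would be roughly the following; the JSON lines go to the QMP monitor, the echo lines run in the host shell:

# 4) hot-unplug the VF
{"execute": "device_del", "arguments": {"id": "hostdev0"}}

# 5) return the VF to its host network driver
echo 0000:3b:02.0 > /sys/bus/pci/drivers/vfio-pci/unbind
echo              > /sys/bus/pci/devices/0000:3b:02.0/driver_override
echo 0000:3b:02.0 > /sys/bus/pci/drivers_probe

# 6) hot-unplug the virtio failover NIC and its backend
{"execute": "device_del", "arguments": {"id": "net0"}}
{"execute": "netdev_del", "arguments": {"id": "hostnet0"}}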

RHV step 3. (Plug sriov + failover) 
7) Unbind VF
8) Plug VF
9) Plug virtio failover network 
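
And the re-plug half, again with the assumed ids; the ordering simply mirrors the RHV steps above:

# 7) detach the VF from the host driver again
echo 0000:3b:02.0 > /sys/bus/pci/devices/0000:3b:02.0/driver/unbind
echo vfio-pci     > /sys/bus/pci/devices/0000:3b:02.0/driver_override
echo 0000:3b:02.0 > /sys/bus/pci/drivers_probe

# 8) hot-plug the VF (primary)
{"execute": "device_add", "arguments": {"driver": "vfio-pci", "host": "0000:3b:02.0", "id": "hostdev0", "failover_pair_id": "net0"}}

# 9) hot-plug the virtio failover NIC (standby) and its backend
{"execute": "netdev_add", "arguments": {"type": "bridge", "id": "hostnet0", "br": "failover-br0"}}
{"execute": "device_add", "arguments": {"driver": "virtio-net-pci", "netdev": "hostnet0", "id": "net0", "mac": "52:54:00:11:22:33", "failover": "on"}}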

RHV step 4. (Migrate VM to a different host)
10) Unplug VF
11) Rebind VF back to network driver
12) Start migration 

Steps 10 and 11 mean that you *manually* hot-unplug the failover VF before migrating the VM.
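
The final step then comes down to issuing the migrate command on the source. In the backtrace above libvirt hands qemu an fd: URI ("fd:migrate"); a tcp: URI with a placeholder destination is shown here purely for illustration:

# 10-11) same as steps 4-5: device_del the VF and rebind it to the host driver

# 12) start the migration; this is the point where qemu segfaults
{"execute": "migrate", "arguments": {"uri": "tcp:<destination-host>:4444"}}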

Actual results:
The qemu process is terminated right away; the VM dies and becomes unusable.

Expected results:
Migration should work as expected and succeed.

Additional info:
See also bz 1946981

Comment 2 Chao Yang 2021-04-25 08:28:18 UTC
Should be a dup of Bug 1953045

Comment 3 Chao Yang 2021-04-26 01:23:19 UTC
Hi Igor,

Could you please confirm if this bug should be closed as dup of Bug 1953045? Thanks.

Comment 4 Igor Mammedov 2021-04-26 09:34:34 UTC
(In reply to Chao Yang from comment #3)
> Hi Igor,
> 
> Could you please confirm if this bug should be closed as dup of Bug 1953045?
> Thanks.

Yes, it's a duplicate.
(I didn't notice that Michael had created a BZ for it before 1953045 was submitted, and the issue is not limited to SR-IOV.)

Comment 5 Chao Yang 2021-04-27 02:40:35 UTC

*** This bug has been marked as a duplicate of bug 1953045 ***

