Description of problem:

qemu crashes with:

Thread 1 "qemu-kvm" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x0000560bf72d4f04 in notifier_list_notify (list=list@entry=0x560bf7b3f4c8 <migration_state_notifiers>, data=data@entry=0x560bf993efd0) at ../util/notify.c:39
#2  0x0000560bf70208e2 in migrate_fd_connect (s=s@entry=0x560bf993efd0, error_in=<optimized out>) at ../migration/migration.c:3636
#3  0x0000560bf6fb2eaa in migration_channel_connect (s=s@entry=0x560bf993efd0, ioc=ioc@entry=0x560bf9dc9810, hostname=hostname@entry=0x0, error=<optimized out>, error@entry=0x0) at ../migration/channel.c:92
#4  0x0000560bf6f7262e in fd_start_outgoing_migration (s=0x560bf993efd0, fdname=<optimized out>, errp=<optimized out>) at ../migration/fd.c:42
#5  0x0000560bf701f056 in qmp_migrate (uri=0x560bf9d6ade0 "fd:migrate", has_blk=<optimized out>, blk=<optimized out>, has_inc=<optimized out>, inc=<optimized out>, has_detach=<optimized out>, detach=true, has_resume=false, resume=false, errp=0x7ffc3006c718) at ../migration/migration.c:2177
#6  0x0000560bf72b4a3e in qmp_marshal_migrate (args=<optimized out>, ret=<optimized out>, errp=0x7f14890bdec0) at qapi/qapi-commands-migration.c:533
#7  0x0000560bf72f87fd in do_qmp_dispatch_bh (opaque=0x7f14890bded0) at ../qapi/qmp-dispatch.c:110
#8  0x0000560bf72c9a8d in aio_bh_call (bh=0x7f13e4006080) at ../util/async.c:164
#9  aio_bh_poll (ctx=ctx@entry=0x560bf98dd340) at ../util/async.c:164
#10 0x0000560bf72cf772 in aio_dispatch (ctx=0x560bf98dd340) at ../util/aio-posix.c:381
#11 0x0000560bf72c9972 in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at ../util/async.c:306
#12 0x00007f1487f1877d in g_main_context_dispatch () from target:/lib64/libglib-2.0.so.0
#13 0x0000560bf72ca9f0 in glib_pollfds_poll () at ../util/main-loop.c:221
#14 os_host_main_loop_wait (timeout=<optimized out>) at ../util/main-loop.c:244
#15 main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:520
#16 0x0000560bf71b2251 in qemu_main_loop () at ../softmmu/vl.c:1679
#17 0x0000560bf6f33942 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at ../softmmu/main.c:50

Version-Release number of selected component (if applicable):
qemu-kvm-5.2.0-14.module+el8.4.0+10425+ad586fa5.x86_6

How reproducible:
100%

Steps to Reproduce:
1. simplified CLI to reproduce:

qemu-kvm -enable-kvm -m 1g -M q35 \
 -device pcie-root-port,slot=4,id=root1 -device pcie-root-port,slot=5,id=root2 \
 -device virtio-net-pci,id=net1,mac=52:54:00:6f:55:cc,failover=on,bus=root1 \
 -device e1000e,id=net2,mac=52:54:00:6f:55:cc,bus=root2,addr=0x0,failover_pair_id=net1 \
 -monitor stdio rhel84.qcow2

(qemu) migrate "exec:gzip -c > STATEFILE.gz"

2. hot-unplug both NICs:
(qemu) device_del net2
wait till net2 is unplugged
(qemu) device_del net1
wait till net1 is unplugged

3. (qemu) migrate "exec:gzip -c > STATEFILE.gz"

Actual results:
Segmentation fault (core dumped)

Expected results:
migration completes

Additional info:
originally reported at https://bugzilla.redhat.com/show_bug.cgi?id=1946981#c36
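The 0x0 address in frame #0 fits a stale notifier: migration walks the global migration_state_notifiers list and calls every registered notify callback, so if a device freed its Notifier without unregistering it, the walk ends up calling through memory that no longer holds a valid function pointer. For reference, frame #1 (../util/notify.c:39) is roughly the following loop; this is paraphrased from QEMU's util/notify.c, so treat it as a sketch rather than the exact code of this build:

void notifier_list_notify(NotifierList *list, void *data)
{
    Notifier *notifier, *next;

    /* Visit every registered notifier; a Notifier that was freed but never
     * removed from the list is still visited here, hence the wild call. */
    QLIST_FOREACH_SAFE(notifier, &list->notifiers, node, next) {
        notifier->notify(notifier, data);
    }
}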
The bug is also present in current upstream (to be released as 6.0).
I can use the method Igor mentioned in the description to reproduce this problem:

Test env:
host:
4.18.0-304.el8.x86_64
qemu-kvm-5.2.0-15.module+el8.4.0+10650+50781ca0.x86_64
guest:
4.18.0-304.el8.x86_64

Test steps:
(1) Start a VM with the following qemu command line:

/usr/libexec/qemu-kvm -enable-kvm -m 1g -M q35 \
 -device pcie-root-port,slot=4,id=root1 -device pcie-root-port,slot=5,id=root2 \
 -device virtio-net-pci,id=net1,mac=52:54:00:6f:55:cc,failover=on,bus=root1 \
 -device e1000e,id=net2,mac=52:54:00:6f:55:cc,bus=root2,addr=0x0,failover_pair_id=net1 \
 -monitor stdio \
 -vnc :0 \
 /home/images/RHEL84.qcow2 \

(2) Hot-unplug the NICs:
(qemu) device_del net2
(qemu) device_del net1

(3) Do the offline migration:
(qemu) migrate "exec:gzip -c > STATEFILE.gz"

(4) Check the test result:

qemu-kvm crashes:

bug_1953045.sh: line 8: 75095 Segmentation fault      (core dumped) /usr/libexec/qemu-kvm -enable-kvm -m 1g -M q35 -device pcie-root-port,slot=4,id=root1 -device pcie-root-port,slot=5,id=root2 -device virtio-net-pci,id=net1,mac=52:54:00:6f:55:cc,failover=on,bus=root1 -device e1000e,id=net2,mac=52:54:00:6f:55:cc,bus=root2,addr=0x0,failover_pair_id=net1 -monitor stdio -vnc :0 /home/images/RHEL84.qcow2

# dmesg
[253143.862201] qemu-kvm[75095]: segfault at 0 ip 0000000000000000 sp 00007ffda54a0b58 error 14 in qemu-kvm[55ebee0ef000+b13000]
[253143.874838] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.

(gdb) bt
#0  0x0000000000000000 in ()
#1  0x000055ebee799ea4 in notifier_list_notify (list=list@entry=0x55ebef00e7a8 <migration_state_notifiers>, data=data@entry=0x55ebf09e79c0) at ../util/notify.c:39
#2  0x000055ebee438022 in migrate_fd_cleanup (s=s@entry=0x55ebf09e79c0) at ../migration/migration.c:1753
#3  0x000055ebee4380bd in migrate_fd_cleanup_bh (opaque=0x55ebf09e79c0) at ../migration/migration.c:1770
#4  0x000055ebee7b8ebd in aio_bh_call (bh=0x55ebf0a372f0) at ../util/async.c:164
#5  0x000055ebee7b8ebd in aio_bh_poll (ctx=ctx@entry=0x55ebf09cb2b0) at ../util/async.c:164
#6  0x000055ebee7c7b62 in aio_dispatch (ctx=0x55ebf09cb2b0) at ../util/aio-posix.c:381
#7  0x000055ebee7b8da2 in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at ../util/async.c:306
#8  0x00007fb9fe43977d in g_main_dispatch (context=0x55ebf09cc020) at gmain.c:3176
#9  0x00007fb9fe43977d in g_main_context_dispatch (context=context@entry=0x55ebf09cc020) at gmain.c:3829
#10 0x000055ebee798c90 in glib_pollfds_poll () at ../util/main-loop.c:221
#11 0x000055ebee798c90 in os_host_main_loop_wait (timeout=<optimized out>) at ../util/main-loop.c:244
#12 0x000055ebee798c90 in main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:520
#13 0x000055ebee5ef3c1 in qemu_main_loop () at ../softmmu/vl.c:1679
#14 0x000055ebee414942 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at ../softmmu/main.c:50
It looks to me as if hw/net/virtio-net.c calls add_migration_state_change_notifier() but never calls remove_migration_state_change_notifier().
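For context, the add/remove pair in the migration code just links and unlinks the caller's Notifier in the global migration_state_notifiers list. A simplified sketch (the exact code may differ slightly between versions):

/* migration/migration.c -- simplified sketch */
void add_migration_state_change_notifier(Notifier *notify)
{
    notifier_list_add(&migration_state_notifiers, notify);
}

void remove_migration_state_change_notifier(Notifier *notify)
{
    notifier_remove(notify);
}

The Notifier is embedded in the VirtIONet structure (n->migration_state), so once the device is unplugged and freed without calling remove, the global list still points into freed memory; the next migration state change then calls through a dangling notify pointer, which matches the 0x0 frame in the backtraces above.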
*** Bug 1953283 has been marked as a duplicate of this bug. ***
I'm able to reproduce the problem, I'm having a look to try to fix it.
(In reply to Dr. David Alan Gilbert from comment #5)
> It looks to me as if hw/net/virtio-net.c calls
> add_migration_state_change_notifier but never calls remove

Right, there is an add_migration_state_change_notifier() in the realize function, but remove_migration_state_change_notifier() is missing in the unrealize function.

The following patch fixes the problem for me:

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 66b9ff451185..914051feb75b 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -3373,6 +3373,7 @@ static void virtio_net_device_unrealize(DeviceState *dev)
 
     if (n->failover) {
         device_listener_unregister(&n->primary_listener);
+        remove_migration_state_change_notifier(&n->migration_state);
     }
 
     max_queues = n->multiqueue ? n->max_queues : 1;
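For reference, the new unrealize line mirrors the registration done on the realize side. A simplified sketch of the failover branch in virtio_net_device_realize() (details and surrounding code vary between QEMU versions):

/* hw/net/virtio-net.c, virtio_net_device_realize() -- simplified sketch */
if (n->failover) {
    device_listener_register(&n->primary_listener);
    /* Registered here, but (before this fix) never removed in unrealize. */
    n->migration_state.notify = virtio_net_migration_state_notifier;
    add_migration_state_change_notifier(&n->migration_state);
    n->host_features |= (1ULL << VIRTIO_NET_F_STANDBY);
}

The one-line fix keeps realize/unrealize symmetric: everything the device registers against global lists in realize is unregistered in unrealize, before the device memory is freed.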
(In reply to Laurent Vivier from comment #13)
> (In reply to Dr. David Alan Gilbert from comment #5)
> > It looks to me as if hw/net/virtio-net.c calls
> > add_migration_state_change_notifier but never calls remove
>
> Right, there is an add_migration_state_change_notifier() in the realize
> function, but remove_migration_state_change_notifier() is missing in the
> unrealize function.

Patch sent upstream:
https://patchew.org/QEMU/20210427135147.111218-1-lvivier@redhat.com/

Author: Laurent Vivier <lvivier>
Date:   Tue Apr 27 15:25:29 2021 +0200

virtio-net: failover: add missing remove_migration_state_change_notifier()

In the failover case configuration, virtio_net_device_realize() uses an add_migration_state_change_notifier() to add a state notifier, but this notifier is not removed by the unrealize function when the virtio-net card is unplugged.

If the card is unplugged and a migration is started, the notifier is called and as it is not valid anymore QEMU crashes.

This patch fixes the problem by adding the remove_migration_state_change_notifier() in virtio_net_device_unrealize().

The problem can be reproduced with:

$ qemu-system-x86_64 -enable-kvm -m 1g -M q35 \
  -device pcie-root-port,slot=4,id=root1 \
  -device pcie-root-port,slot=5,id=root2 \
  -device virtio-net-pci,id=net1,mac=52:54:00:6f:55:cc,failover=on,bus=root1 \
  -monitor stdio disk.qcow2
(qemu) device_del net1
(qemu) migrate "exec:gzip -c > STATEFILE.gz"

Thread 1 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ()
#1  0x0000555555d726d7 in notifier_list_notify (...) at .../util/notify.c:39
#2  0x0000555555842c1a in migrate_fd_connect (...) at .../migration/migration.c:3975
#3  0x0000555555950f7d in migration_channel_connect (...) error@entry=0x0) at .../migration/channel.c:107
#4  0x0000555555910922 in exec_start_outgoing_migration (...) at .../migration/exec.c:42

Reported-by: Igor Mammedov <imammedo>
Signed-off-by: Laurent Vivier <lvivier>
Simplified reproduction steps:

> I can use the method Igor mentioned in the description to reproduce this problem:
>
> Test env:
> host:
> 4.18.0-304.el8.x86_64
> qemu-kvm-5.2.0-15.module+el8.4.0+10650+50781ca0.x86_64
> guest:
> 4.18.0-304.el8.x86_64
>
> Test steps:
> (1) start a vm with a failover virtio net device:

/usr/libexec/qemu-kvm -enable-kvm -m 1g -M q35 \
 -device pcie-root-port,slot=4,id=root1 -device pcie-root-port,slot=5,id=root2 \
 -device virtio-net-pci,id=net1,mac=52:54:00:6f:55:cc,failover=on,bus=root1 \
 -monitor stdio \
 -vnc :0 \
 /home/images/RHEL84.qcow2 \

> (2) hot-unplug the failover virtio nic

(qemu) device_del net1

> (3) do the offline migration

(qemu) migrate "exec:gzip -c > STATEFILE.gz"

> (4) check the test result

(gdb) bt
#0  0x0000000000000000 in ()
#1  0x000055afcc7a9ea4 in notifier_list_notify (list=list@entry=0x55afcd01e7a8 <migration_state_notifiers>, data=data@entry=0x55afce82b100) at ../util/notify.c:39
#2  0x000055afcc448022 in migrate_fd_cleanup (s=s@entry=0x55afce82b100) at ../migration/migration.c:1753
#3  0x000055afcc4480bd in migrate_fd_cleanup_bh (opaque=0x55afce82b100) at ../migration/migration.c:1770
#4  0x000055afcc7c8ebd in aio_bh_call (bh=0x55afcf1de800) at ../util/async.c:164
#5  0x000055afcc7c8ebd in aio_bh_poll (ctx=ctx@entry=0x55afce80e440) at ../util/async.c:164
#6  0x000055afcc7d7b62 in aio_dispatch (ctx=0x55afce80e440) at ../util/aio-posix.c:381
#7  0x000055afcc7c8da2 in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at ../util/async.c:306
#8  0x00007f8ef530077d in g_main_dispatch (context=0x55afce80f620) at gmain.c:3176
#9  0x00007f8ef530077d in g_main_context_dispatch (context=context@entry=0x55afce80f620) at gmain.c:3829
#10 0x000055afcc7a8c90 in glib_pollfds_poll () at ../util/main-loop.c:221
#11 0x000055afcc7a8c90 in os_host_main_loop_wait (timeout=<optimized out>) at ../util/main-loop.c:244
#12 0x000055afcc7a8c90 in main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:520
#13 0x000055afcc5ff3c1 in qemu_main_loop () at ../softmmu/vl.c:1679
#14 0x000055afcc424942 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at ../softmmu/main.c:50

bug1953045.sh: line 7: 6379 Segmentation fault      (core dumped) /usr/libexec/qemu-kvm -enable-kvm -m 1g -M q35 -device pcie-root-port,slot=4,id=root1 -device pcie-root-port,slot=5,id=root2 -device virtio-net-pci,id=net1,mac=52:54:00:6f:55:cc,failover=on,bus=root1 -monitor stdio -vnc :0 /home/images/RHEL84.qcow2

# dmesg
[ 4942.528793] qemu-kvm[6379]: segfault at 0 ip 0000000000000000 sp 00007ffd860ff2a8 error 14 in qemu-kvm[55afcc0ff000+b13000]
[ 4942.541234] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
*** Bug 1946981 has been marked as a duplicate of this bug. ***
Just adjusting DTM=12 (which is what changes the Current Deadline) - that means hopefully by 24-May we'll have 3 reviews and be able to move to MODIFIED. I see 0 now, and the next DTM is 17-May, which seems unlikely to be met... I'll let QE adjust the ITM if they feel it's necessary.
Set Verified:Tested,SanityOnly as the gating/tier1 tests pass.
> I have repeated the same tests as in comment 18/19 using a different qemu-kvm version.
>
> My test result is as follows:
>
> (1) qemu-kvm-6.0.0-16.module+el8.5.0+10848+2dccc46d.x86_64
>
> I can still reproduce this problem.
>
> The vm *will crash* after hot-unplugging the failover virtio net device and
> doing offline migration.

Test with qemu-kvm-6.0.0-18.module+el8.5.0+11243+5269aaa1.x86_64:

This problem has been fixed. The vm *will not crash* after hot-unplugging the failover virtio net device and doing offline migration.
According to comment 39 and comment 40, moving the bug status to VERIFIED.
The test also passes on RHEL 9.0.0.

Packages:
qemu-kvm-6.0.0-12.el9.x86_64
kernel-5.14.0-0.rc7.54.el9.x86_64 (both host and guest)

Steps are the same as comment 3.

Test results: no crash.

QEMU 6.0.0 monitor - type 'help' for more information
(qemu) device_del net2
(qemu) device_del net1
(qemu) info status
VM status: running
(qemu) migrate "exec:gzip -c > STATEFILE.gz"
(qemu) info status
VM status: paused (postmigrate)
(qemu)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4684