Bug 1953045
| Summary: | qemu-kvm NULL pointer de-reference during migration at migrate_fd_connect -> ... -> notifier_list_notify |
|---|---|
| Product: | Red Hat Enterprise Linux Advanced Virtualization |
| Reporter: | Igor Mammedov <imammedo> |
| Component: | qemu-kvm |
| Assignee: | Laurent Vivier <lvivier> |
| qemu-kvm sub component: | General |
| QA Contact: | Yanhui Ma <yama> |
| Status: | CLOSED ERRATA |
| Severity: | urgent |
| Priority: | unspecified |
| CC: | aadam, ailan, amusil, dgilbert, jasowang, jinzhao, juzhang, laine, lvivier, mburman, mperina, mrezanin, mtessun, virt-maint, yama, yanghliu, yfu, ymankad |
| Version: | 8.4 |
| Keywords: | Triaged, ZStream |
| Target Milestone: | rc |
| Flags: | pm-rhel: mirror+ |
| Target Release: | 8.4 |
| Hardware: | Unspecified |
| OS: | Linux |
| Fixed In Version: | qemu-kvm-6.0.0-18.module+el8.5.0+11243+5269aaa1 |
| Doc Type: | If docs needed, set a value |
| | 1955666 (view as bug list) |
| Last Closed: | 2021-11-16 07:52:40 UTC |
| Type: | Bug |
| Target Upstream Version: | qemu-6.1.0 |
| Bug Blocks: | 1688177, 1955666, 1957194, 1964261 |
Bug also present in current upstream (to be released as 6.0).

I can use the method Igor mentioned in the description to reproduce this problem:
Test env:
host:
4.18.0-304.el8.x86_64
qemu-kvm-5.2.0-15.module+el8.4.0+10650+50781ca0.x86_64
guest:
4.18.0-304.el8.x86_64
Test step:
(1) start a vm with the following qemu cmd line:
/usr/libexec/qemu-kvm -enable-kvm -m 1g -M q35 \
-device pcie-root-port,slot=4,id=root1 -device pcie-root-port,slot=5,id=root2 \
-device virtio-net-pci,id=net1,mac=52:54:00:6f:55:cc,failover=on,bus=root1 \
-device e1000e,id=net2,mac=52:54:00:6f:55:cc,bus=root2,addr=0x0,failover_pair_id=net1 \
-monitor stdio \
-vnc :0 \
/home/images/RHEL84.qcow2
(2) hot-unplug the nic
(qemu) device_del net2
(qemu) device_del net1
(3) do the offline migration
(qemu) migrate "exec:gzip -c > STATEFILE.gz"
(4) check the test result
the qemu-kvm crashes:
bug_1953045.sh: line 8: 75095 Segmentation fault (core dumped) /usr/libexec/qemu-kvm -enable-kvm -m 1g -M q35 -device pcie-root-port,slot=4,id=root1 -device pcie-root-port,slot=5,id=root2 -device virtio-net-pci,id=net1,mac=52:54:00:6f:55:cc,failover=on,bus=root1 -device e1000e,id=net2,mac=52:54:00:6f:55:cc,bus=root2,addr=0x0,failover_pair_id=net1 -monitor stdio -vnc :0 /home/images/RHEL84.qcow2
# dmesg
[253143.862201] qemu-kvm[75095]: segfault at 0 ip 0000000000000000 sp 00007ffda54a0b58 error 14 in qemu-kvm[55ebee0ef000+b13000]
[253143.874838] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
(gdb) bt
#0 0x0000000000000000 in ()
#1 0x000055ebee799ea4 in notifier_list_notify
(list=list@entry=0x55ebef00e7a8 <migration_state_notifiers>, data=data@entry=0x55ebf09e79c0)
at ../util/notify.c:39
#2 0x000055ebee438022 in migrate_fd_cleanup (s=s@entry=0x55ebf09e79c0) at ../migration/migration.c:1753
#3 0x000055ebee4380bd in migrate_fd_cleanup_bh (opaque=0x55ebf09e79c0) at ../migration/migration.c:1770
#4 0x000055ebee7b8ebd in aio_bh_call (bh=0x55ebf0a372f0) at ../util/async.c:164
#5 0x000055ebee7b8ebd in aio_bh_poll (ctx=ctx@entry=0x55ebf09cb2b0) at ../util/async.c:164
#6 0x000055ebee7c7b62 in aio_dispatch (ctx=0x55ebf09cb2b0) at ../util/aio-posix.c:381
#7 0x000055ebee7b8da2 in aio_ctx_dispatch
(source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at ../util/async.c:306
#8 0x00007fb9fe43977d in g_main_dispatch (context=0x55ebf09cc020) at gmain.c:3176
#9 0x00007fb9fe43977d in g_main_context_dispatch (context=context@entry=0x55ebf09cc020) at gmain.c:3829
#10 0x000055ebee798c90 in glib_pollfds_poll () at ../util/main-loop.c:221
#11 0x000055ebee798c90 in os_host_main_loop_wait (timeout=<optimized out>) at ../util/main-loop.c:244
#12 0x000055ebee798c90 in main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:520
#13 0x000055ebee5ef3c1 in qemu_main_loop () at ../softmmu/vl.c:1679
#14 0x000055ebee414942 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
at ../softmmu/main.c:50
It looks to me as if hw/net/virtio-net.c calls add_migration_state_change_notifier but never calls remove.

*** Bug 1953283 has been marked as a duplicate of this bug. ***

I'm able to reproduce the problem, I'm having a look to try to fix it.

(In reply to Dr. David Alan Gilbert from comment #5)
> It looks to me as if hw/net/virtio-net.c calls
> add_migration_state_change_notifier but never calls remove

Right, there is an add_migration_state_change_notifier() in the realize function, but remove_migration_state_change_notifier() is missing in the unrealize function.

The following patch fixes the problem for me:

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 66b9ff451185..914051feb75b 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -3373,6 +3373,7 @@ static void virtio_net_device_unrealize(DeviceState *dev)
     if (n->failover) {
         device_listener_unregister(&n->primary_listener);
+        remove_migration_state_change_notifier(&n->migration_state);
     }

     max_queues = n->multiqueue ? n->max_queues : 1;

(In reply to Laurent Vivier from comment #13)
> (In reply to Dr. David Alan Gilbert from comment #5)
> > It looks to me as if hw/net/virtio-net.c calls
> > add_migration_state_change_notifier but never calls remove
>
> Right, there is an add_migration_state_change_notifier() in the realize
> function, but remove_migration_state_change_notifier() is missing in the
> unrealize function.

Patch sent upstream:
https://patchew.org/QEMU/20210427135147.111218-1-lvivier@redhat.com/

Author: Laurent Vivier <lvivier>
Date:   Tue Apr 27 15:25:29 2021 +0200

    virtio-net: failover: add missing remove_migration_state_change_notifier()

    In the failover configuration, virtio_net_device_realize() uses
    add_migration_state_change_notifier() to add a state notifier, but this
    notifier is not removed by the unrealize function when the virtio-net
    card is unplugged.

    If the card is unplugged and a migration is started, the notifier is
    called and, as it is not valid anymore, QEMU crashes.

    This patch fixes the problem by adding the
    remove_migration_state_change_notifier() call in
    virtio_net_device_unrealize().

    The problem can be reproduced with:

      $ qemu-system-x86_64 -enable-kvm -m 1g -M q35 \
        -device pcie-root-port,slot=4,id=root1 \
        -device pcie-root-port,slot=5,id=root2 \
        -device virtio-net-pci,id=net1,mac=52:54:00:6f:55:cc,failover=on,bus=root1 \
        -monitor stdio disk.qcow2
      (qemu) device_del net1
      (qemu) migrate "exec:gzip -c > STATEFILE.gz"

      Thread 1 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
      0x0000000000000000 in ?? ()
      (gdb) bt
      #0 0x0000000000000000 in ()
      #1 0x0000555555d726d7 in notifier_list_notify (...) at .../util/notify.c:39
      #2 0x0000555555842c1a in migrate_fd_connect (...) at .../migration/migration.c:3975
      #3 0x0000555555950f7d in migration_channel_connect (...) error@entry=0x0) at .../migration/channel.c:107
      #4 0x0000555555910922 in exec_start_outgoing_migration (...) at .../migration/exec.c:42

    Reported-by: Igor Mammedov <imammedo>
    Signed-off-by: Laurent Vivier <lvivier>

Simplify the reproduction steps:

> I can use the method Igor mentioned in the description to reproduce this problem:
>
> Test env:
> host:
> 4.18.0-304.el8.x86_64
> qemu-kvm-5.2.0-15.module+el8.4.0+10650+50781ca0.x86_64
> guest:
> 4.18.0-304.el8.x86_64
> Test step:

(1) start a vm with a failover virtio net device:
/usr/libexec/qemu-kvm -enable-kvm -m 1g -M q35 \
-device pcie-root-port,slot=4,id=root1 -device pcie-root-port,slot=5,id=root2 \
-device virtio-net-pci,id=net1,mac=52:54:00:6f:55:cc,failover=on,bus=root1 \
-monitor stdio \
-vnc :0 \
/home/images/RHEL84.qcow2

(2) hot-unplug the failover virtio nic
(qemu) device_del net1

(3) do the offline migration
(qemu) migrate "exec:gzip -c > STATEFILE.gz"

(4) check the test result
(gdb) bt
#0  0x0000000000000000 in ()
#1  0x000055afcc7a9ea4 in notifier_list_notify
    (list=list@entry=0x55afcd01e7a8 <migration_state_notifiers>, data=data@entry=0x55afce82b100)
    at ../util/notify.c:39
#2  0x000055afcc448022 in migrate_fd_cleanup (s=s@entry=0x55afce82b100) at ../migration/migration.c:1753
#3  0x000055afcc4480bd in migrate_fd_cleanup_bh (opaque=0x55afce82b100) at ../migration/migration.c:1770
#4  0x000055afcc7c8ebd in aio_bh_call (bh=0x55afcf1de800) at ../util/async.c:164
#5  0x000055afcc7c8ebd in aio_bh_poll (ctx=ctx@entry=0x55afce80e440) at ../util/async.c:164
#6  0x000055afcc7d7b62 in aio_dispatch (ctx=0x55afce80e440) at ../util/aio-posix.c:381
#7  0x000055afcc7c8da2 in aio_ctx_dispatch
    (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at ../util/async.c:306
#8  0x00007f8ef530077d in g_main_dispatch (context=0x55afce80f620) at gmain.c:3176
#9  0x00007f8ef530077d in g_main_context_dispatch (context=context@entry=0x55afce80f620) at gmain.c:3829
#10 0x000055afcc7a8c90 in glib_pollfds_poll () at ../util/main-loop.c:221
#11 0x000055afcc7a8c90 in os_host_main_loop_wait (timeout=<optimized out>) at ../util/main-loop.c:244
#12 0x000055afcc7a8c90 in main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:520
#13 0x000055afcc5ff3c1 in qemu_main_loop () at ../softmmu/vl.c:1679
#14 0x000055afcc424942 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
    at ../softmmu/main.c:50

bug1953045.sh: line 7: 6379 Segmentation fault (core dumped) /usr/libexec/qemu-kvm -enable-kvm -m 1g -M q35 -device pcie-root-port,slot=4,id=root1 -device pcie-root-port,slot=5,id=root2 -device virtio-net-pci,id=net1,mac=52:54:00:6f:55:cc,failover=on,bus=root1 -monitor stdio -vnc :0 /home/images/RHEL84.qcow2

# dmesg
[ 4942.528793] qemu-kvm[6379]: segfault at 0 ip 0000000000000000 sp 00007ffd860ff2a8 error 14 in qemu-kvm[55afcc0ff000+b13000]
[ 4942.541234] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.

*** Bug 1946981 has been marked as a duplicate of this bug. ***

Just adjusting the DTM=12 (which is what changes the Current Deadline) - that means hopefully by 24-May we'll have 3 reviews and be able to move to MODIFIED. I see 0 now and the next DTM is 17-May which feels unreasonable to occur... I'll let QE adjust ITM if they feel it's necessary.

Set Verified:Tested,SanityOnly as gating/tier1 test pass.
> I have repeated the same tests as in comment 18/19 using different qemu-kvm version.
>
> My test result is as following:
>
> (1) qemu-kvm-6.0.0-16.module+el8.5.0+10848+2dccc46d.x86_64
>
> I can still reproduce this problem.
>
> The vm *will crash* after hot-unplugging the failover virtio net device and
> doing offline migration.
Test with qemu-kvm-6.0.0-18.module+el8.5.0+11243+5269aaa1.x86_64:
This problem has been fixed.
The vm *will not crash* after hot-unplugging the failover virtio net device and doing offline migration.
According to comment 39 and comment 40, move the bug status to VERIFIED.

Test result also passes on RHEL9.0.0.

Packages:
qemu-kvm-6.0.0-12.el9.x86_64
kernel-5.14.0-0.rc7.54.el9.x86_64 (both host and guest)

Steps are the same as comment 3.

Test results: No crash.

QEMU 6.0.0 monitor - type 'help' for more information
(qemu) device_del net2
(qemu) device_del net1
(qemu) info status
VM status: running
(qemu) migrate "exec:gzip -c > STATEFILE.gz"
(qemu) info status
VM status: paused (postmigrate)
(qemu)

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4684
Description of problem:

qemu crashes with:

Thread 1 "qemu-kvm" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x0000560bf72d4f04 in notifier_list_notify
    (list=list@entry=0x560bf7b3f4c8 <migration_state_notifiers>, data=data@entry=0x560bf993efd0)
    at ../util/notify.c:39
#2  0x0000560bf70208e2 in migrate_fd_connect (s=s@entry=0x560bf993efd0, error_in=<optimized out>)
    at ../migration/migration.c:3636
#3  0x0000560bf6fb2eaa in migration_channel_connect
    (s=s@entry=0x560bf993efd0, ioc=ioc@entry=0x560bf9dc9810, hostname=hostname@entry=0x0,
    error=<optimized out>, error@entry=0x0) at ../migration/channel.c:92
#4  0x0000560bf6f7262e in fd_start_outgoing_migration
    (s=0x560bf993efd0, fdname=<optimized out>, errp=<optimized out>) at ../migration/fd.c:42
#5  0x0000560bf701f056 in qmp_migrate
    (uri=0x560bf9d6ade0 "fd:migrate", has_blk=<optimized out>, blk=<optimized out>,
    has_inc=<optimized out>, inc=<optimized out>, has_detach=<optimized out>, detach=true,
    has_resume=false, resume=false, errp=0x7ffc3006c718) at ../migration/migration.c:2177
#6  0x0000560bf72b4a3e in qmp_marshal_migrate
    (args=<optimized out>, ret=<optimized out>, errp=0x7f14890bdec0)
    at qapi/qapi-commands-migration.c:533
#7  0x0000560bf72f87fd in do_qmp_dispatch_bh (opaque=0x7f14890bded0) at ../qapi/qmp-dispatch.c:110
#8  0x0000560bf72c9a8d in aio_bh_call (bh=0x7f13e4006080) at ../util/async.c:164
#9  aio_bh_poll (ctx=ctx@entry=0x560bf98dd340) at ../util/async.c:164
#10 0x0000560bf72cf772 in aio_dispatch (ctx=0x560bf98dd340) at ../util/aio-posix.c:381
#11 0x0000560bf72c9972 in aio_ctx_dispatch
    (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at ../util/async.c:306
#12 0x00007f1487f1877d in g_main_context_dispatch () from target:/lib64/libglib-2.0.so.0
#13 0x0000560bf72ca9f0 in glib_pollfds_poll () at ../util/main-loop.c:221
#14 os_host_main_loop_wait (timeout=<optimized out>) at ../util/main-loop.c:244
#15 main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:520
#16 0x0000560bf71b2251 in qemu_main_loop () at ../softmmu/vl.c:1679
#17 0x0000560bf6f33942 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
    at ../softmmu/main.c:50

Version-Release number of selected component (if applicable):
qemu-kvm-5.2.0-14.module+el8.4.0+10425+ad586fa5.x86_64

How reproducible:
100%

Steps to Reproduce:
1. simplified CLI to reproduce:
qemu-kvm -enable-kvm -m 1g -M q35 \
-device pcie-root-port,slot=4,id=root1 -device pcie-root-port,slot=5,id=root2 \
-device virtio-net-pci,id=net1,mac=52:54:00:6f:55:cc,failover=on,bus=root1 \
-device e1000e,id=net2,mac=52:54:00:6f:55:cc,bus=root2,addr=0x0,failover_pair_id=net1 \
-monitor stdio rhel84.qcow2
(qemu) migrate "exec:gzip -c > STATEFILE.gz"
2. (qemu) device_del net2
wait till net2 is unplugged
(qemu) device_del net1
wait till net1 is unplugged
3. (qemu) migrate "exec:gzip -c > STATEFILE.gz"

Actual results:
Segmentation fault (core dumped)

Expected results:
migration completes

Additional info:
originally reported at https://bugzilla.redhat.com/show_bug.cgi?id=1946981#c36