Bug 1953045 - qemu-kvm NULL pointer de-reference during migration at migrate_fd_connect ->...-> notifier_list_notify
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: qemu-kvm
Version: 8.4
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: 8.4
Assignee: Laurent Vivier
QA Contact: Yanhui Ma
URL:
Whiteboard:
Duplicates: 1946981, 1953283
Depends On:
Blocks: 1688177 1955666 1957194 1964261
 
Reported: 2021-04-23 19:14 UTC by Igor Mammedov
Modified: 2022-01-05 08:40 UTC
CC List: 18 users

Fixed In Version: qemu-kvm-6.0.0-18.module+el8.5.0+11243+5269aaa1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1955666
Environment:
Last Closed: 2021-11-16 07:52:40 UTC
Type: Bug
Target Upstream Version: qemu-6.1.0
Embargoed:




Links
Red Hat Product Errata RHBA-2021:4684 (last updated 2021-11-16 07:53:23 UTC)

Description Igor Mammedov 2021-04-23 19:14:14 UTC
Description of problem:

qemu crashes with:

Thread 1 "qemu-kvm" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x0000560bf72d4f04 in notifier_list_notify (list=list@entry=0x560bf7b3f4c8 <migration_state_notifiers>, data=data@entry=0x560bf993efd0) at ../util/notify.c:39
#2  0x0000560bf70208e2 in migrate_fd_connect (s=s@entry=0x560bf993efd0, error_in=<optimized out>) at ../migration/migration.c:3636
#3  0x0000560bf6fb2eaa in migration_channel_connect (s=s@entry=0x560bf993efd0, ioc=ioc@entry=0x560bf9dc9810, hostname=hostname@entry=0x0, error=<optimized out>, error@entry=0x0) at ../migration/channel.c:92
#4  0x0000560bf6f7262e in fd_start_outgoing_migration (s=0x560bf993efd0, fdname=<optimized out>, errp=<optimized out>) at ../migration/fd.c:42
#5  0x0000560bf701f056 in qmp_migrate (uri=0x560bf9d6ade0 "fd:migrate", has_blk=<optimized out>, blk=<optimized out>, has_inc=<optimized out>, inc=<optimized out>, has_detach=<optimized out>, detach=true, has_resume=false, resume=false, 
    errp=0x7ffc3006c718) at ../migration/migration.c:2177
#6  0x0000560bf72b4a3e in qmp_marshal_migrate (args=<optimized out>, ret=<optimized out>, errp=0x7f14890bdec0) at qapi/qapi-commands-migration.c:533
#7  0x0000560bf72f87fd in do_qmp_dispatch_bh (opaque=0x7f14890bded0) at ../qapi/qmp-dispatch.c:110
#8  0x0000560bf72c9a8d in aio_bh_call (bh=0x7f13e4006080) at ../util/async.c:164
#9  aio_bh_poll (ctx=ctx@entry=0x560bf98dd340) at ../util/async.c:164
#10 0x0000560bf72cf772 in aio_dispatch (ctx=0x560bf98dd340) at ../util/aio-posix.c:381
#11 0x0000560bf72c9972 in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at ../util/async.c:306
#12 0x00007f1487f1877d in g_main_context_dispatch () from target:/lib64/libglib-2.0.so.0
#13 0x0000560bf72ca9f0 in glib_pollfds_poll () at ../util/main-loop.c:221
#14 os_host_main_loop_wait (timeout=<optimized out>) at ../util/main-loop.c:244
#15 main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:520
#16 0x0000560bf71b2251 in qemu_main_loop () at ../softmmu/vl.c:1679
#17 0x0000560bf6f33942 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at ../softmmu/main.c:50

Version-Release number of selected component (if applicable):
qemu-kvm-5.2.0-14.module+el8.4.0+10425+ad586fa5.x86_64

How reproducible:

100%

Steps to Reproduce:
1. Start QEMU with a simplified CLI and issue a first migration:

qemu-kvm -enable-kvm -m 1g -M q35 \
  -device pcie-root-port,slot=4,id=root1 -device pcie-root-port,slot=5,id=root2 \
  -device virtio-net-pci,id=net1,mac=52:54:00:6f:55:cc,failover=on,bus=root1 \
  -device e1000e,id=net2,mac=52:54:00:6f:55:cc,bus=root2,addr=0x0,failover_pair_id=net1 \
  -monitor stdio rhel84.qcow2


(qemu) migrate "exec:gzip -c > STATEFILE.gz"
2. Hot-unplug both NICs:
(qemu) device_del net2

wait till net2 is unplugged

(qemu) device_del net1

wait till net1 is unplugged
 
3. Start the migration again:

(qemu) migrate "exec:gzip -c > STATEFILE.gz"

Actual results:

Segmentation fault (core dumped)

Expected results:

migration completes

Additional info:

originally reported at https://bugzilla.redhat.com/show_bug.cgi?id=1946981#c36
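
For readers puzzled by frame #0 sitting at 0x0000000000000000: frame #1 is the
notifier walk in util/notify.c, which loads a callback pointer out of each
registered Notifier and calls it. In the failover setup the Notifier is embedded
in the device state, so once the device has been unplugged and freed, the walk
reads that callback pointer from stale memory and jumps to a garbage address
(here, 0). Below is a minimal, standalone C model of that pattern; it is
illustrative only (the list handling and type names are simplified and are not
QEMU's actual code), and it includes the missing remove so the program itself
is well defined:

  /* Simplified standalone model of the crash mechanism (illustration only,
   * not QEMU code). A Notifier is embedded in the device state and linked
   * into a global list; notifying the list calls each callback through a
   * pointer stored in that embedded struct. */
  #include <stdio.h>
  #include <stdlib.h>

  typedef struct Notifier Notifier;
  struct Notifier {
      void (*notify)(Notifier *n, void *data);
      Notifier *next;
  };

  /* stands in for migration_state_notifiers */
  static Notifier *migration_notifiers;

  static void notifier_add(Notifier *n)
  {
      n->next = migration_notifiers;
      migration_notifiers = n;
  }

  static void notifier_remove(Notifier *n)
  {
      Notifier **p = &migration_notifiers;
      while (*p && *p != n) {
          p = &(*p)->next;
      }
      if (*p) {
          *p = n->next;
      }
  }

  static void notifier_list_notify(void *data)
  {
      for (Notifier *n = migration_notifiers; n; n = n->next) {
          n->notify(n, data);   /* corresponds to frame #1 above */
      }
  }

  typedef struct {              /* stands in for the virtio-net device state */
      Notifier migration_state;
  } Device;

  static void on_migration_state(Notifier *n, void *data)
  {
      printf("migration state changed\n");
  }

  int main(void)
  {
      Device *dev = malloc(sizeof(*dev));   /* "realize": register notifier */
      dev->migration_state.notify = on_migration_state;
      notifier_add(&dev->migration_state);

      /* "unrealize"/unplug: the remove below is the step this bug is about.
       * Without it, the global list keeps pointing into freed memory and the
       * next notify call jumps through garbage (seen above as a call to 0). */
      notifier_remove(&dev->migration_state);
      free(dev);

      notifier_list_notify(NULL);           /* safe only because of the remove */
      return 0;
  }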

Comment 1 Igor Mammedov 2021-04-23 19:23:43 UTC
The bug is also present in current upstream (to be released as 6.0).

Comment 3 Yanghang Liu 2021-04-26 06:01:26 UTC
I can use the method Igor mentioned in the description to reproduce this problem:

Test env:
host:
4.18.0-304.el8.x86_64
qemu-kvm-5.2.0-15.module+el8.4.0+10650+50781ca0.x86_64
guest:
4.18.0-304.el8.x86_64



Test step:

(1) start a vm with the following qemu cmd line:
/usr/libexec/qemu-kvm -enable-kvm -m 1g -M q35 \
-device pcie-root-port,slot=4,id=root1 -device pcie-root-port,slot=5,id=root2 \
-device virtio-net-pci,id=net1,mac=52:54:00:6f:55:cc,failover=on,bus=root1 \
-device e1000e,id=net2,mac=52:54:00:6f:55:cc,bus=root2,addr=0x0,failover_pair_id=net1 \
-monitor stdio \
-vnc :0 \
/home/images/RHEL84.qcow2 \


(2) hot-unplug the nic

(qemu) device_del net2

(qemu) device_del net1


(3) do the offline migration

(qemu) migrate "exec:gzip -c > STATEFILE.gz"



(4) check the test result

the qemu-kvm crashes:
bug_1953045 .sh: line 8: 75095 Segmentation fault      (core dumped) /usr/libexec/qemu-kvm -enable-kvm -m 1g -M q35 -device pcie-root-port,slot=4,id=root1 -device pcie-root-port,slot=5,id=root2 -device virtio-net-pci,id=net1,mac=52:54:00:6f:55:cc,failover=on,bus=root1 -device e1000e,id=net2,mac=52:54:00:6f:55:cc,bus=root2,addr=0x0,failover_pair_id=net1 -monitor stdio -vnc :0 /home/images/RHEL84.qcow2


# dmesg 
[253143.862201] qemu-kvm[75095]: segfault at 0 ip 0000000000000000 sp 00007ffda54a0b58 error 14 in qemu-kvm[55ebee0ef000+b13000]
[253143.874838] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.



(gdb) bt
#0  0x0000000000000000 in  ()
#1  0x000055ebee799ea4 in notifier_list_notify
    (list=list@entry=0x55ebef00e7a8 <migration_state_notifiers>, data=data@entry=0x55ebf09e79c0)
    at ../util/notify.c:39
#2  0x000055ebee438022 in migrate_fd_cleanup (s=s@entry=0x55ebf09e79c0) at ../migration/migration.c:1753
#3  0x000055ebee4380bd in migrate_fd_cleanup_bh (opaque=0x55ebf09e79c0) at ../migration/migration.c:1770
#4  0x000055ebee7b8ebd in aio_bh_call (bh=0x55ebf0a372f0) at ../util/async.c:164
#5  0x000055ebee7b8ebd in aio_bh_poll (ctx=ctx@entry=0x55ebf09cb2b0) at ../util/async.c:164
#6  0x000055ebee7c7b62 in aio_dispatch (ctx=0x55ebf09cb2b0) at ../util/aio-posix.c:381
#7  0x000055ebee7b8da2 in aio_ctx_dispatch
    (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at ../util/async.c:306
#8  0x00007fb9fe43977d in g_main_dispatch (context=0x55ebf09cc020) at gmain.c:3176
#9  0x00007fb9fe43977d in g_main_context_dispatch (context=context@entry=0x55ebf09cc020) at gmain.c:3829
#10 0x000055ebee798c90 in glib_pollfds_poll () at ../util/main-loop.c:221
#11 0x000055ebee798c90 in os_host_main_loop_wait (timeout=<optimized out>) at ../util/main-loop.c:244
#12 0x000055ebee798c90 in main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:520
#13 0x000055ebee5ef3c1 in qemu_main_loop () at ../softmmu/vl.c:1679
#14 0x000055ebee414942 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
    at ../softmmu/main.c:50

Comment 5 Dr. David Alan Gilbert 2021-04-26 19:02:22 UTC
It looks to me as if hw/net/virtio-net.c calls add_migration_state_change_notifier but never calls remove

Comment 7 Chao Yang 2021-04-27 02:40:32 UTC
*** Bug 1953283 has been marked as a duplicate of this bug. ***

Comment 12 Laurent Vivier 2021-04-27 10:59:56 UTC
I'm able to reproduce the problem; I'm having a look to try to fix it.

Comment 13 Laurent Vivier 2021-04-27 13:18:45 UTC
(In reply to Dr. David Alan Gilbert from comment #5)
> It looks to me as if hw/net/virtio-net.c calls
> add_migration_state_change_notifier but never calls remove

Right, there is an add_migration_state_change_notifier() in the realize function, but remove_migration_state_change_notifier() is missing in the unrealize function.

The following patch fixes the problem for me:

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 66b9ff451185..914051feb75b 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -3373,6 +3373,7 @@ static void virtio_net_device_unrealize(DeviceState *dev)
 
     if (n->failover) {
         device_listener_unregister(&n->primary_listener);
+        remove_migration_state_change_notifier(&n->migration_state);
     }
 
     max_queues = n->multiqueue ? n->max_queues : 1;
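
The reason the missing call belongs in unrealize is that the Notifier node
(n->migration_state in the diff above) lives inside the device state that
realize links into the global migration_state_notifiers list. If the device is
finalized without unlinking it, the list keeps a pointer into freed memory, and
the next migration's notifier_list_notify() calls through a stale function
pointer, which is exactly what frames #0/#1 of the backtraces above show.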

Comment 16 Laurent Vivier 2021-04-28 09:31:22 UTC
(In reply to Laurent Vivier from comment #13)
> (In reply to Dr. David Alan Gilbert from comment #5)
> > It looks to me as if hw/net/virtio-net.c calls
> > add_migration_state_change_notifier but never calls remove
> 
> Right, there is an add_migration_state_change_notifier() in the realize
> function, but remove_migration_state_change_notifier() is missing in the
> unrealize function.
> 

Patch sent upstream:

https://patchew.org/QEMU/20210427135147.111218-1-lvivier@redhat.com/

Author: Laurent Vivier <lvivier>
Date:   Tue Apr 27 15:25:29 2021 +0200

    virtio-net: failover: add missing remove_migration_state_change_notifier()
    
    In the failover case configuration, virtio_net_device_realize() uses an
    add_migration_state_change_notifier() to add a state notifier, but this
    notifier is not removed by the unrealize function when the virtio-net
    card is unplugged.
    
    If the card is unplugged and a migration is started, the notifier is
    called and as it is not valid anymore QEMU crashes.
    
    This patch fixes the problem by adding the
    remove_migration_state_change_notifier() in virtio_net_device_unrealize().
    
    The problem can be reproduced with:
    
      $ qemu-system-x86_64 -enable-kvm -m 1g -M q35 \
        -device pcie-root-port,slot=4,id=root1 \
        -device pcie-root-port,slot=5,id=root2 \
        -device virtio-net-pci,id=net1,mac=52:54:00:6f:55:cc,failover=on,bus=root1 \
        -monitor stdio disk.qcow2
      (qemu) device_del net1
      (qemu) migrate "exec:gzip -c > STATEFILE.gz"
    
      Thread 1 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
      0x0000000000000000 in ?? ()
      (gdb) bt
      #0  0x0000000000000000 in  ()
      #1  0x0000555555d726d7 in notifier_list_notify (...)
          at .../util/notify.c:39
      #2  0x0000555555842c1a in migrate_fd_connect (...)
          at .../migration/migration.c:3975
      #3  0x0000555555950f7d in migration_channel_connect (...)
          error@entry=0x0) at .../migration/channel.c:107
      #4  0x0000555555910922 in exec_start_outgoing_migration (...)
          at .../migration/exec.c:42
    
    Reported-by: Igor Mammedov <imammedo>
    Signed-off-by: Laurent Vivier <lvivier>

Comment 18 Yanghang Liu 2021-04-28 14:38:34 UTC
Simplified reproduction steps:

> I can use the method Igor mentioned in the description to reproduce this problem:
> 
> Test env:
> host:
> 4.18.0-304.el8.x86_64
> qemu-kvm-5.2.0-15.module+el8.4.0+10650+50781ca0.x86_64
> guest:
> 4.18.0-304.el8.x86_64

 
> Test step:
> (1) start a vm with a failover virtio net device:

/usr/libexec/qemu-kvm -enable-kvm -m 1g -M q35 \
-device pcie-root-port,slot=4,id=root1 -device pcie-root-port,slot=5,id=root2 \
-device virtio-net-pci,id=net1,mac=52:54:00:6f:55:cc,failover=on,bus=root1 \
-monitor stdio \
-vnc :0 \
/home/images/RHEL84.qcow2 \

 
> (2) hot-unplug the failover virtio nic

(qemu) device_del net1

> (3) do the offline migration

(qemu) migrate "exec:gzip -c > STATEFILE.gz"

> (4) check the test result

(gdb) bt
#0  0x0000000000000000 in  ()
#1  0x000055afcc7a9ea4 in notifier_list_notify
    (list=list@entry=0x55afcd01e7a8 <migration_state_notifiers>, data=data@entry=0x55afce82b100)
    at ../util/notify.c:39
#2  0x000055afcc448022 in migrate_fd_cleanup (s=s@entry=0x55afce82b100) at ../migration/migration.c:1753
#3  0x000055afcc4480bd in migrate_fd_cleanup_bh (opaque=0x55afce82b100) at ../migration/migration.c:1770
#4  0x000055afcc7c8ebd in aio_bh_call (bh=0x55afcf1de800) at ../util/async.c:164
#5  0x000055afcc7c8ebd in aio_bh_poll (ctx=ctx@entry=0x55afce80e440) at ../util/async.c:164
#6  0x000055afcc7d7b62 in aio_dispatch (ctx=0x55afce80e440) at ../util/aio-posix.c:381
#7  0x000055afcc7c8da2 in aio_ctx_dispatch
    (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at ../util/async.c:306
#8  0x00007f8ef530077d in g_main_dispatch (context=0x55afce80f620) at gmain.c:3176
#9  0x00007f8ef530077d in g_main_context_dispatch (context=context@entry=0x55afce80f620) at gmain.c:3829
#10 0x000055afcc7a8c90 in glib_pollfds_poll () at ../util/main-loop.c:221
#11 0x000055afcc7a8c90 in os_host_main_loop_wait (timeout=<optimized out>) at ../util/main-loop.c:244
#12 0x000055afcc7a8c90 in main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:520
--Type <RET> for more, q to quit, c to continue without paging--
#13 0x000055afcc5ff3c1 in qemu_main_loop () at ../softmmu/vl.c:1679
#14 0x000055afcc424942 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
    at ../softmmu/main.c:50


bug1953045.sh: line 7:  6379 Segmentation fault      (core dumped) /usr/libexec/qemu-kvm -enable-kvm -m 1g -M q35 -device pcie-root-port,slot=4,id=root1 -device pcie-root-port,slot=5,id=root2 -device virtio-net-pci,id=net1,mac=52:54:00:6f:55:cc,failover=on,bus=root1 -monitor stdio -vnc :0 /home/images/RHEL84.qcow2

# dmesg
[ 4942.528793] qemu-kvm[6379]: segfault at 0 ip 0000000000000000 sp 00007ffd860ff2a8 error 14 in qemu-kvm[55afcc0ff000+b13000]
[ 4942.541234] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.

Comment 24 Laurent Vivier 2021-04-30 09:42:35 UTC
*** Bug 1946981 has been marked as a duplicate of this bug. ***

Comment 33 John Ferlan 2021-05-14 14:57:52 UTC
Just adjusting DTM=12 (which is what changes the Current Deadline); that means hopefully by 24-May we'll have 3 reviews and be able to move to MODIFIED. I see 0 now, and the next DTM is 17-May, which feels unreasonable to hit...

I'll let QE adjust ITM if they feel it's necessary

Comment 39 Yanan Fu 2021-06-07 02:40:40 UTC
Set Verified:Tested,SanityOnly as the gating/tier1 tests pass.

Comment 40 Yanghang Liu 2021-06-07 02:47:35 UTC


> I have repeated the same tests as in comment 18/19 using different qemu-kvm version.
> 
> My test result is as following:
> 
> (1) qemu-kvm-6.0.0-16.module+el8.5.0+10848+2dccc46d.x86_64
> 
> I can still reproduce this problem.
> 
> The vm *will crash* after hot-unplugging the failover virtio net device and doing an
> offline migration.


Test with qemu-kvm-6.0.0-18.module+el8.5.0+11243+5269aaa1.x86_64:

This problem has been fixed.

The vm *will not crash* after hot-unplugging the failover virtio net device and doing an offline migration.

Comment 48 Yanghang Liu 2021-06-10 10:29:16 UTC
According to comment 39 and comment 40, moving the bug status to VERIFIED.

Comment 50 Yanhui Ma 2021-08-25 10:06:30 UTC
The test also passes on RHEL 9.0.0.

Packages:
qemu-kvm-6.0.0-12.el9.x86_64
kernel-5.14.0-0.rc7.54.el9.x86_64 (both host and guest)

Steps are the same as in comment 3.

Test results:
No crash.

QEMU 6.0.0 monitor - type 'help' for more information
(qemu) device_del net2
(qemu) device_del net1
(qemu) info status
VM status: running
(qemu) migrate "exec:gzip -c > STATEFILE.gz"
(qemu) info status
VM status: paused (postmigrate)
(qemu)

Comment 52 errata-xmlrpc 2021-11-16 07:52:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4684

