Bug 1525899

Summary: Migrate to a wrong destination IP -> "migrate_cancel" -> "info migrate": segmentation fault
Product: Red Hat Enterprise Linux 7
Component: qemu-kvm-rhev
Version: 7.5
Hardware: All
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Reporter: xianwang <xianwang>
Assignee: Dr. David Alan Gilbert <dgilbert>
QA Contact: Yumei Huang <yuhuang>
CC: chayang, dgilbert, jinzhao, juzhang, knoel, michen, qzhang, virt-maint, xianwang
Target Milestone: rc
Type: Bug
Last Closed: 2018-05-15 08:34:54 UTC
Bug Blocks: 1558351

Description xianwang 2017-12-14 10:41:34 UTC
Description of problem:
Migrate a VM to a wrong destination IP, then run "migrate_cancel" followed by "info migrate" in HMP: QEMU hits a segmentation fault, the VM hangs, and QEMU crashes and quits. This issue exists on both x86 and ppc.

Version-Release number of selected component (if applicable):
Host:
3.10.0-823.el7.x86_64
qemu-kvm-rhev-2.10.0-12.el7.x86_64
seabios-bin-1.11.0-1.el7.noarch


How reproducible:
4/5

Steps to Reproduce:
1.Boot a guest with qemu cli:
gdb --args /usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1'  \
    -sandbox off  \
    -machine pc  \
    -nodefaults  \
    -vga std \
    -rtc base=utc,clock=host,driftfix=slew \
    -device virtio-serial-pci,id=virtio_serial_pci0,bus=pci.0,addr=03,disable-legacy=off,disable-modern=on  \
    -chardev socket,id=console0,path=/tmp/console0,server,nowait \
    -device virtserialport,chardev=console0,name=console0,id=console0,bus=virtio_serial_pci0.0  \
    -chardev socket,id=serial0,path=/tmp/serial0,server,nowait \
    -device isa-serial,chardev=serial0,id=serial0 \
    -device nec-usb-xhci,id=usb1,multifunction=on,bus=pci.0,addr=11 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=04,disable-legacy=off,disable-modern=on,iothread=iothread0 \
    -object iothread,id=iothread0 \
    -drive id=drive_image1,if=none,cache=none,format=qcow2,snapshot=off,file=/home/xianwang/rhel75-64-virtio-scsi.qcow2 \
    -device scsi-hd,id=image1,drive=drive_image1,bus=virtio_scsi_pci0.0,bootindex=0 \
    -netdev tap,vhost=on,id=idlkwV8e,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
    -device virtio-net-pci,mac=9a:7b:7c:7d:7e:7f,id=idtlLxAk,vectors=4,netdev=idlkwV8e,bus=pci.0,addr=05,disable-legacy=off,disable-modern=on  \
    -m 4G  \
    -smp 4  \
    -cpu SandyBridge \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=2  \
    -device usb-kbd,id=usb-kbd1,bus=usb1.0,port=3 \
    -device usb-mouse,id=usb-mouse1,bus=usb1.0,port=4 \
    -qmp tcp:0:6666,server,nowait \
    -vnc :9 \
    -rtc base=localtime,clock=vm,driftfix=slew  \
    -boot order=cdn,once=c,menu=off,strict=off \
    -monitor stdio \
    -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=06
2. Migrate the VM to a wrong destination IP and check the migration status:
(gdb) r
(qemu) migrate -d tcp:10.66.101.144:5801    (this IP and port do not exist)
(qemu) info migrate
globals: store-global-state=1, only_migratable=0, send-configuration=1, send-section-footer=1
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off return-path: off pause-before-switchover: off 
Migration status: setup

3. Cancel the migration and check the status again:
(qemu) migrate_cancel 
(qemu) info migrate


Actual results:
The VM hangs, QEMU hits a segmentation fault, then crashes and quits.

(qemu) migrate_cancel 
(qemu) info migrate

Program received signal SIGSEGV, Segmentation fault.
0x00005555557f27a7 in ram_bytes_remaining () at /usr/src/debug/qemu-2.10.0/migration/ram.c:207
207	    return ram_state->migration_dirty_pages * TARGET_PAGE_SIZE;

(gdb) bt
#0  0x00005555557f27a7 in ram_bytes_remaining () at /usr/src/debug/qemu-2.10.0/migration/ram.c:207
#1  0x000055555599bdf6 in populate_ram_info (info=info@entry=0x555556d150e0, s=0x555556d30280, s=0x555556d30280)
    at migration/migration.c:523
#2  0x000055555599c760 in qmp_query_migrate (errp=errp@entry=0x0) at migration/migration.c:567
#3  0x00005555558c8008 in hmp_info_migrate (mon=0x555556db0240, qdict=<optimized out>) at hmp.c:165
#4  0x00005555557ded0f in handle_hmp_command (mon=mon@entry=0x555556db0240, cmdline=0x55555715600c "")
    at /usr/src/debug/qemu-2.10.0/monitor.c:3151
#5  0x00005555557e038a in monitor_command_cb (opaque=0x555556db0240, cmdline=<optimized out>, readline_opaque=<optimized out>)
    at /usr/src/debug/qemu-2.10.0/monitor.c:3954
#6  0x0000555555ace6a8 in readline_handle_byte (rs=0x555557156000, ch=<optimized out>) at util/readline.c:393
#7  0x00005555557def12 in monitor_read (opaque=<optimized out>, buf=<optimized out>, size=<optimized out>)
    at /usr/src/debug/qemu-2.10.0/monitor.c:3937
#8  0x0000555555a6602f in fd_chr_read (chan=0x555556ce1d40, cond=<optimized out>, opaque=0x555556d14fa0) at chardev/char-fd.c:66
#9  0x00007fffef4c98f9 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#10 0x0000555555abc19c in glib_pollfds_poll () at util/main-loop.c:213
#11 os_host_main_loop_wait (timeout=<optimized out>) at util/main-loop.c:261
#12 main_loop_wait (nonblocking=nonblocking@entry=0) at util/main-loop.c:515
#13 0x000055555579d8ca in main_loop () at vl.c:1917
#14 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4805
(gdb) q
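
The faulting line dereferences the global RAMState pointer that migration/ram.c only allocates once RAM migration setup runs. Because the outgoing connect to the bad destination just hangs, that setup never happens, so the pointer is still NULL when "info migrate" queries it. A minimal sketch of the situation (simplified from the qemu-2.10 sources; treat the details as assumptions rather than the exact code):

/* Sketch, simplified from qemu-2.10 migration/ram.c */
static RAMState *ram_state;   /* allocated only once ram_save_setup() runs */

uint64_t ram_bytes_remaining(void)
{
    /* With a hanging connect the migration thread never starts, so
     * ram_save_setup() never runs and ram_state is still NULL here,
     * hence the SIGSEGV on the dereference below. */
    return ram_state->migration_dirty_pages * TARGET_PAGE_SIZE;
}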


Expected results:
The migration status is "Migration status: cancelled" and the VM keeps running on the source host.

Additional info:
The result on ppc is the same as on the x86 platform; the versions are as follows:
3.10.0-820.el7.ppc64le
qemu-kvm-rhev-2.10.0-12.el7.ppc64le
SLOF-20170724-2.git89f519f.el7.noarch

Comment 2 Dr. David Alan Gilbert 2017-12-15 09:40:57 UTC
Confirmed (and with upstream 2.11); the crucial thing is that the IP address doesn't reject the connection, but just hangs during the connect.

Status is 'cancelling'.

Comment 3 xianwang 2017-12-15 11:40:09 UTC
This bug is not a regression; it also exists on qemu-kvm-rhev-2.9.0-16.el7_4.1. The result with qemu-kvm-rhev-2.9.0-16.el7_4.1.ppc64le is not exactly the same as with qemu-kvm-rhev-2.10.0-12.el7.ppc64le: as Dave said in comment 2, instead of the segfault, the migration status stays "cancelling", as follows:

version:
Host:
3.10.0-693.el7.ppc64le
qemu-kvm-rhev-2.9.0-16.el7_4.1.ppc64le
SLOF-20170724-2.git89f519f.el7.noarch

The steps are the same as in the bug report.

result:
(qemu) migrate -d tcp:10.16.110.120:5801
(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off 
Migration status: setup
total time: 0 milliseconds
(qemu) migrate_cancel 
(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off 
Migration status: cancelling
(qemu) info migrate
Migration status: cancelling
.........
endless "cancelling"

Comment 4 Dr. David Alan Gilbert 2017-12-15 11:54:03 UTC
Yep, there are actually two bugs:
  a) The segfault, for which I've just posted a fix upstream: migration: Guard ram_bytes_remaining against early call (see the sketch after this list)
  b) The endless cancelling, which I've got an idea how to fix - it's related to the error path through the socket code.
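
For (a), a minimal sketch of the kind of guard that patch title suggests, assuming the fix simply reports 0 remaining bytes until RAM setup has allocated the state (a sketch only, not necessarily the exact upstream code):

uint64_t ram_bytes_remaining(void)
{
    /* Guard against being called before ram_save_setup() has allocated
     * ram_state, e.g. while the outgoing connect is still hanging. */
    return ram_state ? ram_state->migration_dirty_pages * TARGET_PAGE_SIZE : 0;
}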

Comment 5 Dr. David Alan Gilbert 2017-12-15 17:19:07 UTC
and posted upstream fixes for (b) (sketched roughly below):
[PATCH 1/2] migration: Allow migrate_fd_connect to take an Error *
[PATCH 2/2] migration: Route errors down through
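
Roughly, the shape of the change those titles imply (a simplified sketch under that assumption, not the actual patches): the socket connect callback routes any connection error down into migrate_fd_connect(), which can then fail the migration cleanly instead of leaving it stuck in "cancelling".

void migrate_fd_connect(MigrationState *s, Error *error_in)
{
    if (error_in) {
        /* The outgoing connection failed or never came up: record the
         * error and move the migration to a failed state rather than
         * starting the migration thread. */
        migrate_fd_error(s, error_in);
        return;
    }

    /* ... normal path: set up capabilities and spawn migration_thread ... */
}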

Comment 6 Dr. David Alan Gilbert 2018-01-15 16:42:36 UTC
a) just got merged upstream as bae416e5ba65701d3c5238164517158066d615e5

Comment 7 Dr. David Alan Gilbert 2018-02-01 15:04:25 UTC
bumped to 7.6

Comment 8 Dr. David Alan Gilbert 2018-02-07 18:06:23 UTC
b) got merged upstream as:
688a3dcba980bf01344a
cce8040bb0ea6ff56d88

Comment 9 Dr. David Alan Gilbert 2018-02-13 09:08:02 UTC
Also needs:
  migration: Fix early failure cleanup
posted 2018-02-12, together with the
  tests/migration: Add test for migration to bad destination
included with it.

Comment 10 Dr. David Alan Gilbert 2018-04-30 19:01:39 UTC
Also needs:
Migration+TLS: Fix crash due to double cleanup