Bug 1731038 - guest on src host gets stuck after executing migrate_cancel for RDMA migration
Summary: guest on src host gets stuck after executing migrate_cancel for RDMA migration
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: qemu-kvm
Version: 8.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Dr. David Alan Gilbert
QA Contact: Li Xiaohui
URL:
Whiteboard:
Depends On:
Blocks: 1771318 1758964
 
Reported: 2019-07-18 07:45 UTC by Li Xiaohui
Modified: 2020-02-05 23:00 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Target Upstream Version:



Description Li Xiaohui 2019-07-18 07:45:10 UTC
Description of problem:
The guest on the src host gets stuck after executing migrate_cancel during RDMA migration.


Version-Release number of selected component (if applicable):
src & dst host info: kernel-4.18.0-117.el8.x86_64 & qemu-img-4.0.0-5.module+el8.1.0+3622+5812d9bf.x86_64
guest info: kernel-4.18.0-113.el8.x86_64

Mellanox card:
# lspci
01:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
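
For step 1 below, a minimal verification sketch of the RDMA setup (illustration only: it assumes the IPoIB interface is named ib0 and reuses, on the dst host, the address that the source later dials with "migrate rdma:192.168.10.21:5555"; adjust device, interface, and address to the actual hosts):
# ibv_devinfo | grep -E 'hca_id|state'
(expect the mlx4_0 HCA with port state PORT_ACTIVE)
# ip link set ib0 up
# ip addr add 192.168.10.21/24 dev ib0
(the src host needs a reachable address on the same RDMA network)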


How reproducible:
2/2


Steps to Reproduce:
1. Configure the Mellanox card (see the RDMA verification sketch above).
2. Boot the guest on the src host with the following command line:
/usr/libexec/qemu-kvm \
-enable-kvm \
-machine q35  \
-m 8G \
-smp 8 \
-cpu Skylake-Client \
-name debug-threads=on \
-device pcie-root-port,id=pcie.0-root-port-2,slot=2,chassis=2,addr=0x2,bus=pcie.0 \
-device pcie-root-port,id=pcie.0-root-port-3,slot=3,chassis=3,addr=0x3,bus=pcie.0 \
-device pcie-root-port,id=pcie.0-root-port-4,slot=4,chassis=4,addr=0x4,bus=pcie.0 \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pcie.0-root-port-2,addr=0x0 \
-blockdev driver=file,cache.direct=off,cache.no-flush=on,filename=/mnt/nfs/rhel810-64-virtio-scsi-3.qcow2,node-name=my_file \
-blockdev driver=qcow2,node-name=my_disk,file=my_file \
-device scsi-hd,drive=my_disk,bus=virtio_scsi_pci0.0 \
-netdev tap,id=hostnet0,vhost=on,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown,queues=4 \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=70:5a:0f:38:cd:1c,bus=pcie.0-root-port-3,vectors=10,mq=on \
-vnc :0 \
-device VGA \
-monitor stdio \
-qmp tcp:0:1234,server,nowait \
3. Run stressapptest in the guest:
# stressapptest -M 1000 -s 10000
4. Boot the guest on the dst host with the following command line:
/usr/libexec/qemu-kvm \
-enable-kvm \
-machine q35  \
-m 8G \
-smp 8 \
-cpu Skylake-Client \
-name debug-threads=on \
-device pcie-root-port,id=pcie.0-root-port-2,slot=2,chassis=2,addr=0x2,bus=pcie.0 \
-device pcie-root-port,id=pcie.0-root-port-3,slot=3,chassis=3,addr=0x3,bus=pcie.0 \
-device pcie-root-port,id=pcie.0-root-port-4,slot=4,chassis=4,addr=0x4,bus=pcie.0 \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pcie.0-root-port-2,addr=0x0 \
-blockdev driver=file,cache.direct=off,cache.no-flush=on,filename=/mnt/nfs/rhel810-64-virtio-scsi-3.qcow2,node-name=my_file \
-blockdev driver=qcow2,node-name=my_disk,file=my_file \
-device scsi-hd,drive=my_disk,bus=virtio_scsi_pci0.0 \
-netdev tap,id=hostnet0,vhost=on,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown,queues=4 \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=70:5a:0f:38:cd:1c,bus=pcie.0-root-port-3,vectors=10,mq=on \
-vnc :0 \
-device VGA \
-monitor stdio \
-qmp tcp:0:1234,server,nowait \
-incoming rdma:0:4444 \
5. Set the migration transfer speed and enable rdma-pin-all:
(qemu) migrate_set_speed 10G
(qemu) migrate_set_capability rdma-pin-all on
6. Migrate through the RDMA protocol:
(qemu) migrate rdma:192.168.10.21:5555
7. Cancel the migration via QMP before it completes (a QMP sketch of steps 5-7 follows the transcript below):
# telnet 127.0.0.1 1234
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
{"QMP": {"version": {"qemu": {"micro": 0, "minor": 0, "major": 4}, "package": "qemu-kvm-4.0.0-5.module+el8.1.0+3622+5812d9bf"}, "capabilities": ["oob"]}}
{"execute":"qmp_capabilities"}
{"return": {}}
{"timestamp": {"seconds": 1563433312, "microseconds": 731190}, "event": "NIC_RX_FILTER_CHANGED", "data": {"name": "net0", "path": "/machine/peripheral/net0/virtio-backend"}}
{"execute":"migrate_cancel"}
{"return": {}}


Actual results:
The guest on the src host gets stuck after executing migrate_cancel.
(1) The src QEMU hangs here and the HMP can no longer be operated:
(qemu) migrate rdma:192.168.0.21:4444
source_resolve_host RDMA Device opened: kernel name mlx4_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/mlx4_0, transport: (1) Infiniband
(2) On the dst QEMU, query the migration status:
(qemu) dest_init RDMA Device opened: kernel name mlx4_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/mlx4_0, transport: (1) Infiniband
(qemu) info status 
VM status: paused (inmigrate)
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off return-path: off pause-before-switchover: off multifd: off dirty-bitmaps: off postcopy-blocktime: off late-block-activate: off x-ignore-shared: off 
Migration status: active
total time: 0 milliseconds
(3) The guest is stuck: it cannot be operated with the mouse in the remote-viewer console,
and it cannot be pinged via its IP.
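
If it helps debugging, a backtrace of the stuck source process can be grabbed with gdb (a sketch; it assumes gdb is installed and only one qemu-kvm process is running on the src host):
# gdb -p $(pidof qemu-kvm) -batch -ex 'set pagination off' -ex 'thread apply all bt'
This should show where the migration thread is blocked and why the monitor is unresponsive.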


Expected results:
The guest still runs normally on the src host after migrate_cancel.


Additional info:
The guest works well after RDMA migration without migrate_cancel.

Comment 2 Li Xiaohui 2019-07-19 09:30:33 UTC
Hi all,
I also tested this case on the rhel8.1.0 fast train with win10 (q35+seabios), win8-32 (pc+seabios), rhel8.1.0 (q35+seabios), rhel7.7 (pc+seabios), and rhel8.0.1 (q35+ovmf) guests:
1. rhel8.1.0 and win10 guests hit the same issue as described in comment 0.
2. rhel8.0.1, rhel7.7, and win8-32 guests get output like the following after migrate_cancel, but I think the message may not be right (ibv_poll_cq wc.status=13 RNR retry counter exceeded!...). What do you think?
(1) On the src host QEMU:
(qemu) migrate rdma:192.168.0.21:4444
source_resolve_host RDMA Device opened: kernel name mlx4_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/mlx4_0, transport: (1) Infiniband
qemu-kvm: Early error. Sending error.
ibv_poll_cq wc.status=13 RNR retry counter exceeded!
ibv_poll_cq wrid=CONTROL SEND!
qemu-kvm: rdma migration: send polling control error
(qemu) info status 
VM status: running
(qemu) info migr
migrate               migrate_cache_size    migrate_capabilities  
migrate_parameters    
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off return-path: off pause-before-switchover: off multifd: off dirty-bitmaps: off postcopy-blocktime: off late-block-activate: off x-ignore-shared: off 
Migration status: cancelled
total time: 0 milliseconds
(2) On the dst host QEMU:
(qemu) info status 
VM status: paused (inmigrate)
(qemu) dest_init RDMA Device opened: kernel name mlx4_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/mlx4_0, transport: (1) Infiniband
qemu-kvm: receive cm event, cm event is 10
qemu-kvm: rdma migration: send polling control error
qemu-kvm: Failed to send control buffer!
qemu-kvm: load of migration failed: Input/output error
qemu-kvm: Early error. Sending error.
qemu-kvm: rdma migration: send polling control error


What's more, I tested this case on the rhel8.1.0 slow train with win10 (q35+seabios) and rhel8.1.0 (pc+seabios) guests: the guest runs normally on the src host after migrate_cancel, and the output looks right on both the src and dst QEMU:
(1) On the src host QEMU:
(qemu) migrate rdma:192.168.0.21:4444
source_resolve_host RDMA Device opened: kernel name mlx4_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/mlx4_0, transport: (1) Infiniband
qemu-kvm: migration_iteration_finish: Unknown ending state 2
qemu-kvm: Early error. Sending error.
(qemu) info status 
VM status: running
(qemu) info migr
migrate               migrate_cache_size    migrate_capabilities  
migrate_parameters    
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
capabilities: xbzrle: off rdma-pin-all: on auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off late-block-activate: off 
Migration status: cancelled
total time: 0 milliseconds

(2) On the dst host QEMU:
QEMU 2.12.0 monitor - type 'help' for more information
(qemu) dest_init RDMA Device opened: kernel name mlx4_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/mlx4_0, transport: (1) Infiniband
qemu-kvm: Was expecting a QEMU FILE (3) control message, but got: ERROR (1), length: 0
qemu-kvm: load of migration failed: Input/output error

Comment 4 Ademar Reis 2020-02-05 23:00:59 UTC
QEMU has recently been split into sub-components, and as a one-time operation to avoid breaking tools we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks.

