Description of problem:
The guest on the src host gets stuck after migrate_cancel is executed during an RDMA migration.

Version-Release number of selected component (if applicable):
src & dst host info: kernel-4.18.0-117.el8.x86_64 & qemu-img-4.0.0-5.module+el8.1.0+3622+5812d9bf.x86_64
guest info: kernel-4.18.0-113.el8.x86_64
Mellanox card:
# lspci
01:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

How reproducible:
2/2

Steps to Reproduce:
1. Configure the Mellanox card.

2. Boot the guest on the src host with this command line:
/usr/libexec/qemu-kvm \
    -enable-kvm \
    -machine q35 \
    -m 8G \
    -smp 8 \
    -cpu Skylake-Client \
    -name debug-threads=on \
    -device pcie-root-port,id=pcie.0-root-port-2,slot=2,chassis=2,addr=0x2,bus=pcie.0 \
    -device pcie-root-port,id=pcie.0-root-port-3,slot=3,chassis=3,addr=0x3,bus=pcie.0 \
    -device pcie-root-port,id=pcie.0-root-port-4,slot=4,chassis=4,addr=0x4,bus=pcie.0 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pcie.0-root-port-2,addr=0x0 \
    -blockdev driver=file,cache.direct=off,cache.no-flush=on,filename=/mnt/nfs/rhel810-64-virtio-scsi-3.qcow2,node-name=my_file \
    -blockdev driver=qcow2,node-name=my_disk,file=my_file \
    -device scsi-hd,drive=my_disk,bus=virtio_scsi_pci0.0 \
    -netdev tap,id=hostnet0,vhost=on,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown,queues=4 \
    -device virtio-net-pci,netdev=hostnet0,id=net0,mac=70:5a:0f:38:cd:1c,bus=pcie.0-root-port-3,vectors=10,mq=on \
    -vnc :0 \
    -device VGA \
    -monitor stdio \
    -qmp tcp:0:1234,server,nowait

3. Run stressapptest in the guest:
# stressapptest -M 1000 -s 10000

4. Boot the guest on the dst host with this command line:
/usr/libexec/qemu-kvm \
    -enable-kvm \
    -machine q35 \
    -m 8G \
    -smp 8 \
    -cpu Skylake-Client \
    -name debug-threads=on \
    -device pcie-root-port,id=pcie.0-root-port-2,slot=2,chassis=2,addr=0x2,bus=pcie.0 \
    -device pcie-root-port,id=pcie.0-root-port-3,slot=3,chassis=3,addr=0x3,bus=pcie.0 \
    -device pcie-root-port,id=pcie.0-root-port-4,slot=4,chassis=4,addr=0x4,bus=pcie.0 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pcie.0-root-port-2,addr=0x0 \
    -blockdev driver=file,cache.direct=off,cache.no-flush=on,filename=/mnt/nfs/rhel810-64-virtio-scsi-3.qcow2,node-name=my_file \
    -blockdev driver=qcow2,node-name=my_disk,file=my_file \
    -device scsi-hd,drive=my_disk,bus=virtio_scsi_pci0.0 \
    -netdev tap,id=hostnet0,vhost=on,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown,queues=4 \
    -device virtio-net-pci,netdev=hostnet0,id=net0,mac=70:5a:0f:38:cd:1c,bus=pcie.0-root-port-3,vectors=10,mq=on \
    -vnc :0 \
    -device VGA \
    -monitor stdio \
    -qmp tcp:0:1234,server,nowait \
    -incoming rdma:0:4444

5. Set the migration transfer speed and enable rdma-pin-all:
(qemu) migrate_set_speed 10G
(qemu) migrate_set_capability rdma-pin-all on

6. Start the migration over the RDMA protocol:
(qemu) migrate rdma:192.168.10.21:5555

7. Cancel the migration before it completes:
# telnet 127.0.0.1 1234
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
{"QMP": {"version": {"qemu": {"micro": 0, "minor": 0, "major": 4}, "package": "qemu-kvm-4.0.0-5.module+el8.1.0+3622+5812d9bf"}, "capabilities": ["oob"]}}
{"execute":"qmp_capabilities"}
{"return": {}}
{"timestamp": {"seconds": 1563433312, "microseconds": 731190}, "event": "NIC_RX_FILTER_CHANGED", "data": {"name": "net0", "path": "/machine/peripheral/net0/virtio-backend"}}
{"execute":"migrate_cancel"}
{"return": {}}

Actual results:
The guest on the src host gets stuck after migrate_cancel is executed.

(1) On src qemu, the monitor stays at the output below and HMP no longer accepts commands:
(qemu) migrate rdma:192.168.0.21:4444
source_resolve_host RDMA Device opened: kernel name mlx4_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/mlx4_0, transport: (1) Infiniband

(2) On dst qemu, check the migration status:
(qemu) dest_init RDMA Device opened: kernel name mlx4_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/mlx4_0, transport: (1) Infiniband
(qemu) info status
VM status: paused (inmigrate)
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off return-path: off pause-before-switchover: off multifd: off dirty-bitmaps: off postcopy-blocktime: off late-block-activate: off x-ignore-shared: off
Migration status: active
total time: 0 milliseconds

(3) The guest is stuck: it does not respond to the mouse in the remote-viewer window, and it cannot be pinged at its guest IP.

Expected results:
The guest keeps running normally on the src host after migrate_cancel.

Additional info:
The guest works well after an RDMA migration when migrate_cancel is not used.
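Step 7 of the reproducer can also be scripted instead of typed into telnet. The following is only a minimal sketch, assuming the QMP socket from the command line above (tcp 127.0.0.1:1234); the helper names qmp_command and cancel_and_check are made up for illustration. It sends qmp_capabilities and migrate_cancel as in step 7, then asks for query-status and query-migrate so the result can be compared with the expected "guest still running" outcome.

```python
import json
import socket


def qmp_command(stream, name):
    """Send one QMP command and return its reply, skipping async events."""
    stream.write(json.dumps({"execute": name}).encode() + b"\r\n")
    stream.flush()
    while True:
        line = stream.readline()
        if not line:
            raise ConnectionError("QMP connection closed")
        msg = json.loads(line)
        if "event" in msg:
            # Asynchronous event, e.g. NIC_RX_FILTER_CHANGED in step 7.
            continue
        if "error" in msg:
            raise RuntimeError(msg["error"])
        return msg["return"]


def cancel_and_check(host="127.0.0.1", port=1234, timeout=10.0):
    """Cancel the migration, then report (VM status, migration status).

    host, port and timeout are assumptions taken from the reproducer;
    adjust them to the actual -qmp option.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        stream = sock.makefile("rwb")
        json.loads(stream.readline())  # QMP greeting banner
        qmp_command(stream, "qmp_capabilities")
        qmp_command(stream, "migrate_cancel")
        status = qmp_command(stream, "query-status")
        migration = qmp_command(stream, "query-migrate")
        return status.get("status"), migration.get("status")


if __name__ == "__main__":
    print(cancel_and_check())
```

If the src qemu hangs the way it does in this report, the reads would be expected to hit the socket timeout rather than return, which may be a convenient way to detect the problem from a test script.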
Hi all,

I also tested this case on the RHEL 8.1.0 fast train with these guests: win10 (q35+seabios), win8-32 (pc+seabios), rhel8.1.0 (q35+seabios), rhel7.7 (pc+seabios), rhel8.0.1 (q35+ovmf).

1. The rhel8.1.0 and win10 guests hit the same issue as in comment 0.
2. The rhel8.0.1, rhel7.7, and win8-32 guests print the messages below after migrate_cancel. I think these messages may not be right (ibv_poll_cq wc.status=13 RNR retry counter exceeded!...). What do you think?

(1) On src host qemu:
(qemu) migrate rdma:192.168.0.21:4444
source_resolve_host RDMA Device opened: kernel name mlx4_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/mlx4_0, transport: (1) Infiniband
qemu-kvm: Early error. Sending error.
ibv_poll_cq wc.status=13 RNR retry counter exceeded!
ibv_poll_cq wrid=CONTROL SEND!
qemu-kvm: rdma migration: send polling control error
(qemu) info status
VM status: running
(qemu) info migr
migrate migrate_cache_size migrate_capabilities migrate_parameters
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off return-path: off pause-before-switchover: off multifd: off dirty-bitmaps: off postcopy-blocktime: off late-block-activate: off x-ignore-shared: off
Migration status: cancelled
total time: 0 milliseconds

(2) On dst host qemu:
(qemu) info status
VM status: paused (inmigrate)
(qemu) dest_init RDMA Device opened: kernel name mlx4_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/mlx4_0, transport: (1) Infiniband
qemu-kvm: receive cm event, cm event is 10
qemu-kvm: rdma migration: send polling control error
qemu-kvm: Failed to send control buffer!
qemu-kvm: load of migration failed: Input/output error
qemu-kvm: Early error. Sending error.
qemu-kvm: rdma migration: send polling control error

In addition, I tested this case on the RHEL 8.1.0 slow train with guests win10 (q35+seabios) and rhel8.1.0 (pc+seabios). The guest runs normally on the src host after migrate_cancel, and the messages are right on both src and dst qemu:

(1) On src host qemu:
(qemu) migrate rdma:192.168.0.21:4444
source_resolve_host RDMA Device opened: kernel name mlx4_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/mlx4_0, transport: (1) Infiniband
qemu-kvm: migration_iteration_finish: Unknown ending state 2
qemu-kvm: Early error. Sending error.
(qemu) info status
VM status: running
(qemu) info migr
migrate migrate_cache_size migrate_capabilities migrate_parameters
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
capabilities: xbzrle: off rdma-pin-all: on auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off late-block-activate: off
Migration status: cancelled
total time: 0 milliseconds

(2) On dst host qemu:
QEMU 2.12.0 monitor - type 'help' for more information
(qemu) dest_init RDMA Device opened: kernel name mlx4_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/mlx4_0, transport: (1) Infiniband
qemu-kvm: Was expecting a QEMU FILE (3) control message, but got: ERROR (1), length: 0
qemu-kvm: load of migration failed: Input/output error
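The numeric codes in the fast-train logs above (wc.status=13, cm event is 10) are easier to discuss with their symbolic names next to them. The sketch below is a hypothetical log helper, not part of QEMU: the function name annotate is invented, and the two tables are small subsets written from memory of enum ibv_wc_status (libibverbs) and enum rdma_cm_event_type (librdmacm), so they should be checked against infiniband/verbs.h and rdma/rdma_cma.h on the test host before relying on them.

```python
import re

# Assumed subset of enum ibv_wc_status (libibverbs), keyed by numeric value.
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    5: "IBV_WC_WR_FLUSH_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}

# Assumed subset of enum rdma_cm_event_type (librdmacm).
RDMA_CM_EVENT = {
    9: "RDMA_CM_EVENT_ESTABLISHED",
    10: "RDMA_CM_EVENT_DISCONNECTED",
    11: "RDMA_CM_EVENT_DEVICE_REMOVAL",
}

# Log fragments as they appear in the qemu-kvm output quoted above.
PATTERNS = [
    (re.compile(r"wc\.status=(\d+)"), "wc.status", IBV_WC_STATUS),
    (re.compile(r"cm event is (\d+)"), "cm event", RDMA_CM_EVENT),
]


def annotate(line):
    """Return symbolic names for any known numeric codes found in a log line."""
    notes = []
    for pattern, label, table in PATTERNS:
        for match in pattern.finditer(line):
            code = int(match.group(1))
            notes.append(f"{label} {code} = {table.get(code, 'unknown')}")
    return notes


if __name__ == "__main__":
    for log_line in (
        "ibv_poll_cq wc.status=13 RNR retry counter exceeded!",
        "qemu-kvm: receive cm event, cm event is 10",
    ):
        print(log_line, "->", annotate(log_line))
```

Under these assumed tables, the src side is reporting a receiver-not-ready retry failure on the control send, and the dst side is reporting a disconnect event. Whether those are the right messages to show after a user-requested cancel is the open question raised in this comment.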
QEMU has been recently split into sub-components and as a one-time operation to avoid breakage of tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.
Could not reproduce this BZ with the steps from comment 0 on RHEL-AV 8.4.0 (kernel-4.18.0-304.el8.x86_64 & qemu-kvm-5.2.0-14.module+el8.4.0+10425+ad586fa5.x86_64). Closing this BZ as CURRENTRELEASE.