Bug 1989415
| Field | Value |
|---|---|
| Summary | Fail to cancel blockjob when cancelling copy-storage migration in libvirt |
| Product | Red Hat Enterprise Linux 9 |
| Reporter | Han Han <hhan> |
| Component | qemu-kvm |
| Assignee | Hanna Czenczek <hreitz> |
| qemu-kvm sub component | Block Jobs |
| QA Contact | aihua liang <aliang> |
| Status | CLOSED WORKSFORME |
| Docs Contact | |
| Severity | medium |
| Priority | medium |
| CC | aliang, coli, fjin, hhan, hreitz, jinzhao, juzhang, kkiwi, pkrempa, timao, virt-maint |
| Version | 9.0 |
| Keywords | Triaged |
| Target Milestone | rc |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | |
| Fixed In Version | |
| Doc Type | If docs needed, set a value |
| Doc Text | |
| Story Points | --- |
| Clone Of | |
| Environment | |
| Last Closed | 2022-03-29 10:18:53 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Attachments | |
Description
Han Han
2021-08-03 07:53:30 UTC
Max, can this be related to some of the things you've been investigating upstream recently?

Yes, this sounds exactly like https://gitlab.com/qemu-project/qemu/-/issues/462.

If this is what I think it is (the issue linked in comment 2; but I think I'll have to try reproducing comment 0, too, to see whether it really is), then this here should work as a reproducer:

$ qemu-img create -f qcow2 src.qcow2 64M
$ qemu-img create -f qcow2 tgt.qcow2 64M
$ qemu-nbd tgt.qcow2 &
$ qemu-system-x86_64 \
    -qmp stdio \
    -blockdev qcow2,node-name=src,file.driver=file,file.filename=src.qcow2 \
    -blockdev nbd,node-name=tgt,server.type=inet,server.host=localhost,server.port=10809

{"execute": "qmp_capabilities"}
{"return": {}}
{"execute": "blockdev-mirror", "arguments": {
    "job-id": "mirror",
    "filter-node-name": "mirror-node",
    "device": "src",
    "target": "tgt",
    "sync": "full",
    "on-target-error": "stop"
} }
{"timestamp": {"seconds": 1628665650, "microseconds": 570910}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "mirror"}}
{"timestamp": {"seconds": 1628665650, "microseconds": 570981}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "mirror"}}
{"return": {}}
{"timestamp": {"seconds": 1628665650, "microseconds": 573887}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "mirror"}}
{"timestamp": {"seconds": 1628665650, "microseconds": 573916}, "event": "BLOCK_JOB_READY", "data": {"device": "mirror", "len": 67108864, "offset": 67108864, "speed": 0, "type": "mirror"}}

Then kill the NBD server (e.g. by using ^Z to bring qemu into the background):

^Z
$ kill %1  # qemu-nbd shell job ID
$ fg

And cancel the job:

{"execute": "block-job-cancel", "arguments": { "device": "mirror" } }

For me, this leads to qemu emitting endless BLOCK_JOB_ERROR messages without being able to quit it (other than killing the process).
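For convenience, the reproducer above could be scripted roughly as follows. This is only a sketch: the sleep values are guesses that may need tuning to hit the window after BLOCK_JOB_READY, and `-display none` is an addition for headless runs. In the buggy case the final quit never takes effect, so the process has to be killed by hand.

```
#!/usr/bin/env bash
# Sketch of the reproducer above; sleep values are guesses and may need tuning.
set -e

qemu-img create -f qcow2 src.qcow2 64M
qemu-img create -f qcow2 tgt.qcow2 64M

# Export the target over NBD (qemu-nbd's default port is 10809).
qemu-nbd tgt.qcow2 &
NBD_PID=$!
sleep 1

# Drive QEMU over QMP on stdio: start the mirror, wait for it to become ready,
# break the target by killing the NBD server, then try to cancel the job.
{
    echo '{"execute": "qmp_capabilities"}'
    sleep 1
    echo '{"execute": "blockdev-mirror", "arguments": {"job-id": "mirror", "filter-node-name": "mirror-node", "device": "src", "target": "tgt", "sync": "full", "on-target-error": "stop"}}'
    sleep 5
    kill "$NBD_PID"
    sleep 1
    echo '{"execute": "block-job-cancel", "arguments": {"device": "mirror"}}'
    sleep 10
    # If the bug is present, QEMU keeps emitting BLOCK_JOB_ERROR and never quits.
    echo '{"execute": "quit"}'
} | qemu-system-x86_64 \
        -display none \
        -qmp stdio \
        -blockdev qcow2,node-name=src,file.driver=file,file.filename=src.qcow2 \
        -blockdev nbd,node-name=tgt,server.type=inet,server.host=localhost,server.port=10809
```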
Tried via libvirt; still can't reproduce it.

Test Env:
kernel version: 4.18.0-325.el8.x86_64
qemu-kvm version: qemu-kvm-6.0.0-26.module+el8.5.0+12044+525f0ebc
libvirt version: libvirt-7.6.0-1.module+el8.5.0+12097+2c77910b.x86_64

Test Steps:

1. Create test.xml:

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <qemu:capabilities>
    <qemu:add capability='blockdev'/>
    <qemu:del capability='drive'/>
  </qemu:capabilities>
  <name>rhel9</name>
  <uuid>73684f97-5e70-49fa-83f5-e1b8365d4447</uuid>
  <memory unit="KiB">2097152</memory>
  <currentMemory unit="KiB">2097152</currentMemory>
  <vcpu placement="static">1</vcpu>
  <iothreads>4</iothreads>
  <os>
    <type arch="x86_64" machine="pc">hvm</type>
    <boot dev="hd"/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <vmport state="off"/>
  </features>
  <clock offset="utc"/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <pm>
    <suspend-to-mem enabled="no"/>
    <suspend-to-disk enabled="no"/>
  </pm>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type="file" device="disk">
      <driver name="qemu" type="qcow2"/>
      <source file="/var/lib/avocado/data/avocado-vt/images/rhel850-64-virtio-scsi.qcow2"/>
      <target dev="vda" bus="virtio"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x0a" function="0x0"/>
    </disk>
    <controller type="usb" index="0" model="ich9-ehci1">
      <address type="pci" domain="0x0000" bus="0x00" slot="0x05" function="0x7"/>
    </controller>
    <controller type="usb" index="0" model="ich9-uhci1">
      <master startport="0"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x05" function="0x0" multifunction="on"/>
    </controller>
    <controller type="usb" index="0" model="ich9-uhci2">
      <master startport="2"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x05" function="0x1"/>
    </controller>
    <controller type="usb" index="0" model="ich9-uhci3">
      <master startport="4"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x05" function="0x2"/>
    </controller>
    <controller type="pci" index="0" model="pci-root"/>
    <controller type="ide" index="0">
      <address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x1"/>
    </controller>
    <controller type="virtio-serial" index="0">
      <address type="pci" domain="0x0000" bus="0x00" slot="0x06" function="0x0"/>
    </controller>
    <controller type="scsi" index="0" model="virtio-scsi">
      <address type="pci" domain="0x0000" bus="0x00" slot="0x08" function="0x0"/>
    </controller>
    <interface type="network">
      <mac address="52:54:00:84:71:f7"/>
      <source network="default"/>
      <model type="e1000"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x03" function="0x0"/>
    </interface>
    <interface type="network">
      <mac address="52:54:00:c0:59:5a"/>
      <source network="default"/>
      <model type="rtl8139"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x09" function="0x0"/>
    </interface>
    <serial type="pty">
      <target type="isa-serial" port="0">
        <model name="isa-serial"/>
      </target>
    </serial>
    <console type="pty">
      <target type="serial" port="0"/>
    </console>
    <input type="tablet" bus="usb">
      <address type="usb" bus="0" port="1"/>
    </input>
    <input type="mouse" bus="ps2"/>
    <input type="keyboard" bus="ps2"/>
    <graphics type="vnc" port="-1" autoport="yes">
      <listen type="address"/>
    </graphics>
    <sound model="ich6">
      <address type="pci" domain="0x0000" bus="0x00" slot="0x04" function="0x0"/>
    </sound>
    <audio id="1" type="none"/>
    <video>
      <model type="virtio" heads="1" primary="yes"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x0"/>
    </video>
    <memballoon model="virtio">
      <address type="pci" domain="0x0000" bus="0x00" slot="0x07" function="0x0"/>
    </memballoon>
  </devices>
  <qemu:commandline>
    <qemu:env name="MALLOC_CHECK_" value="1"/>
  </qemu:commandline>
</domain>

2. Start the VM with test.xml:

# virsh create test.xml
setlocale: No such file or directory
Domain 'rhel9' created from test.xml

3. Check virsh list:

# virsh list
setlocale: No such file or directory
 Id   Name    State
-----------------------
 1    rhel9   running

4. Create the target pool on the destination:

# virsh pool-create-as --name default --type dir --target /var/lib/avocado/data/avocado-vt/images/

5. Do storage VM migration and cancel it during migration (run 200 times):

# sh run.sh

run.sh:
DOM=rhel9
for i in {1..200}; do
    virsh migrate $DOM qemu+ssh://root.196.25/system --live --copy-storage-all --p2p --verbose &
    echo $i
    sleep_time=$((1 + $RANDOM % 45))
    echo "It will sleep for $sleep_time"
    sleep $sleep_time
    virsh domjobabort $DOM
    wait
done

Actual result:
All 200 migrations were cancelled successfully. Details: see attachment 200_migration_cancel.txt.

Additional info:
QEMU VM command line: see attachment qemu_vm_cmdline.txt
QEMU and libvirt logs: see attachment libvirtd-migration_cancel_200.log

Created attachment 1816735 [details]
VM xml and reproducing script
Reproduced on libvirt-7.3.0-1.fc35.x86_64 and qemu-kvm-5.2.0-8.fc34.x86_64.
Key points for reproducing:
The key point is to find out the minimal migration time of your environment and assign it to the variable "min_mig_time" in the script run.sh (a sketch of such a loop follows after these steps).
I found a way to get it:
1. Migrate the VM to the destination host with `time virsh migrate $DOM qemu+ssh://root.79.60/system --live --copy-storage-all --p2p --verbose` and record the total time of the migration.
2. Assign the recorded time to "min_mig_time" and run run.sh.
3. Once the first error "error: Requested operation is not valid: domain is not running" appears in the script output, the actual minimal migration time is less than "min_mig_time". Assign the last sleep value to "min_mig_time" and cancel the running script.
For example, given this script output:
```
It will sleep for 19.97
error: End of file while reading data: : Input/output error
It will sleep for 19.27
error: Requested operation is not valid: domain is not running
```
Assign 19.27 to "min_mig_time" in run.sh.
4. Migrate the VM back, then rerun run.sh. If the "domain is not running" error appears again, redo steps 3 and 4.
5. If the script neither outputs anything nor quits, the bug is reproduced. You can check the blockjob of the VM by:
➜ ~ virsh blockjob rhel9 vda
Block Copy: [ 36 %] Bandwidth limit: 9223372036853727232 bytes/s (8.000 EiB/s)
You can also try cancelling the blockjob:
➜ ~ virsh blockjob rhel9 vda --abort
It will hang.
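As referenced in the key points above, here is a hypothetical sketch of the tuned loop. It is not the attached run.sh: the destination URI is a placeholder, and the way the sleep is jittered just below min_mig_time is an assumption.

```
#!/usr/bin/env bash
# Hypothetical sketch of the tuned abort loop, not the attached run.sh.
# DEST_HOST is a placeholder; min_mig_time comes from step 1; the jitter is a guess.
DOM=rhel9
DEST="qemu+ssh://DEST_HOST/system"
min_mig_time=19.27

for i in {1..200}; do
    virsh migrate "$DOM" "$DEST" --live --copy-storage-all --p2p --verbose &
    echo "$i"
    # Sleep slightly less than the minimal migration time so the abort lands
    # while the mirror block job is still copying.
    sleep_time=$(awk -v m="$min_mig_time" 'BEGIN { srand(); printf "%.2f", m - rand() }')
    echo "It will sleep for $sleep_time"
    sleep "$sleep_time"
    virsh domjobabort "$DOM"
    wait
done
```

While the loop runs, the migration job and the block copy can be watched from another terminal, e.g. with `watch -n1 'virsh domjobinfo rhel9; virsh blockjob rhel9 vda'`.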
Versions in comment 9 should be libvirt-7.6.0-2.module+el8.5.0+12219+a5ea13d2.x86_64 and qemu-kvm-6.0.0-28.module+el8.5.0+12271+fffa967b.x86_64.

Hi Hanna,

Do we plan to fix this issue in RHEL 8.5?

BR, Aliang

Hi Aliang,

The upstream patches have not been merged yet, so I can't say for sure. Another problem is that I can't reproduce the problem from the libvirt side, even with the instructions given in comment 9. Depending on the exact sleep time, the migration is either cancelled successfully or finishes without fail.

I guess I'll create an RPM that includes the currently proposed upstream patches, and then we can see whether this is even the bug I linked in comment 2.

Hanna

Hi Han Han,

Can you test a Brew build with the patches that are currently proposed upstream (for the bug linked in comment 2)?

It's here: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=39263761
The x86-64 RPMs are here: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=39263974
(e.g. qemu-kvm-core should be here: http://download.eng.bos.redhat.com/brewroot/work/tasks/3974/39263974/qemu-kvm-core-6.0.0-28.el8.hreitz202108251725.x86_64.rpm)

Thanks!

(After thinking for a bit, I don't think we need to force this into 8.5. If we don't make it, 8.6 should be fine, too.)

Hanna

Tried more than 200 times with libvirt-7.6.0-2.module+el8.5.0+12219+a5ea13d2.x86_64 and qemu-kvm-6.0.0-28.module+el8.5.0+12271+fffa967b.x86_64 using the method in comment 9; still can't reproduce the issue, and all block jobs are cancelled successfully. Will ping Han Han to try in my test env to check whether his nested env is a key point.
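Before retesting with the scratch build Hanna linked above, it may be worth confirming that the build is actually installed and in use; a small sketch (the package name and emulator path follow this bug's setup):

```
# Sketch: confirm the scratch build is installed before rerunning the loop.
rpm -q qemu-kvm-core              # should report the hreitz202108251725 build
/usr/libexec/qemu-kvm --version   # the emulator path used in the domain XML above
# Note: an already-running domain keeps the old binary; destroy and recreate it
# so the test actually exercises the new qemu-kvm.
```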
(In reply to Hanna Reitz from comment #13)
> Hi Han Han,
>
> Can you test a Brew build with the patches that are currently proposed
> upstream (for the bug linked in comment 2)?
>
> It's here: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=39263761
> The x86-64 RPMs are here: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=39263974
> (e.g. qemu-kvm-core should be here:
> http://download.eng.bos.redhat.com/brewroot/work/tasks/3974/39263974/qemu-kvm-core-6.0.0-28.el8.hreitz202108251725.x86_64.rpm)
>
> Thanks!
>
> (After thinking for a bit, I don't think we need to force this into 8.5. If
> we don't make it, 8.6 should be fine, too.)
>
> Hanna

Tried on qemu-kvm-6.0.0-28.el8.hreitz202108251725.x86_64 and libvirt-7.6.0-2.module+el8.5.0+12219+a5ea13d2.x86_64.

It looks like something more serious has happened.

The script results:

# bash ./run.sh
It will sleep for 17.29
error: End of file while reading data: : Input/output error
It will sleep for 16.02
error: operation aborted: migration out: canceled by client
It will sleep for 18.69
error: End of file while reading data: : Input/output error
It will sleep for 18.43
error: operation aborted: migration out: canceled by client
It will sleep for 17.36
error: operation aborted: migration out: canceled by client
It will sleep for 17.18
error: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePerform3Params)

It cannot get the blockjob info anymore:

# virsh blockjob rhel9 vda
error: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePerform3Params)

Even the VM itself hangs:

# virsh domifaddr rhel9
 Name     MAC address          Protocol     Address
-------------------------------------------------------------------------------
 vnet4    52:54:00:84:71:f7    ipv4         192.168.122.170/24
 vnet5    52:54:00:c0:59:5a    ipv4         192.168.122.106/24

# ping 192.168.122.170
PING 192.168.122.170 (192.168.122.170) 56(84) bytes of data.
From 192.168.122.1 icmp_seq=1 Destination Host Unreachable
From 192.168.122.1 icmp_seq=2 Destination Host Unreachable
From 192.168.122.1 icmp_seq=3 Destination Host Unreachable
From 192.168.122.1 icmp_seq=4 Destination Host Unreachable
From 192.168.122.1 icmp_seq=5 Destination Host Unreachable
From 192.168.122.1 icmp_seq=6 Destination Host Unreachable

However, the VM could be accessed before running the script.

On another run, only the migration hangs and is never aborted:

# bash run.sh
...
Thu Aug 26 02:13:55 UTC 2021
It will sleep for 18.94

➜ ~ virsh domjobinfo rhel9
Job type:         Unbounded
Operation:        Outgoing migration
Time elapsed:     2006224 ms
➜ ~ virsh domjobabort rhel9
➜ ~ virsh domjobinfo rhel9
Job type:         Unbounded
Operation:        Outgoing migration
Time elapsed:     2039306 ms
➜ ~ virsh blockjob rhel9 vda
No current block job for vda
➜ ~ virsh domifaddr rhel9
 Name      MAC address          Protocol     Address
-------------------------------------------------------------------------------
 vnet12    52:54:00:84:71:f7    ipv4         192.168.122.170/24
 vnet13    52:54:00:c0:59:5a    ipv4         192.168.122.106/24
➜ ~ ping 192.168.122.170
PING 192.168.122.170 (192.168.122.170) 56(84) bytes of data.
64 bytes from 192.168.122.170: icmp_seq=1 ttl=64 time=1.50 ms
64 bytes from 192.168.122.170: icmp_seq=2 ttl=64 time=0.896 ms

Move RHEL-AV bugs to RHEL9. If necessary to resolve in RHEL8, then clone to the current RHEL8 release.

Hi,

So I tried reproducing this, and I thought I got somewhere because I could get some migrations to hang. But comparing the QMP logs, I saw a very different situation than the one shown in the log in https://gitlab.com/libvirt/libvirt/-/issues/197.

In my logs, qemu didn't seem unresponsive, but instead listed the block job as `ready`, then `aborting`, then it reported an I/O error, and the job was then successfully aborted and entered the `concluded` state for libvirt to dismiss. libvirt however didn't dismiss it, which is why it remained. Not sure what was happening, but qemu seemed responsive.

In your logs, the block job never reached the `ready` state and consequently the migration was never started (no `"execute": "migration"` QMP calls). I tried to time my aborts such that they would hit around the time the job reached the ready state, but could not reproduce the same situation. The migration (and the block job) were always successfully and quickly aborted.

So because the job never signals to be ready, I now doubt this is really related to https://gitlab.com/qemu-project/qemu/-/issues/462 (which I said in comment 2).

Hanna
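As a side note, the job lifecycle described here (ready, aborting, concluded, then waiting to be dismissed) can be inspected by hand through libvirt's QMP passthrough. A sketch, assuming the domain from this report; the job id "mirror" below is only a placeholder for whatever query-jobs reports:

```
# Sketch: inspect block-job state over QMP via libvirt's passthrough
# (debugging only; libvirt does not expect jobs to be dismissed behind its back).
virsh qemu-monitor-command rhel9 '{"execute": "query-jobs"}'

# Older, block-job-only view:
virsh qemu-monitor-command rhel9 '{"execute": "query-block-jobs"}'

# A job that has reached "concluded" stays listed until it is dismissed;
# "mirror" is a placeholder for the id shown by query-jobs.
virsh qemu-monitor-command rhel9 '{"execute": "job-dismiss", "arguments": {"id": "mirror"}}'
```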
It seems the issue is reproduced as below. But when executing 'virsh blockjob rhel8.6 sda' manually in a terminal immediately after the script exited, the job is gone. Not sure whether we need to handle the issue?

Tested with:
qemu-kvm-6.2.0-9.el9
kernel-5.14.0-65.kpq0.el9.x86_64

Steps:

1. Boot up a source guest:

# virt-install \
    --name rhel8.6 \
    --cpu host-model \
    --machine q35 \
    --ram=8192 \
    --vcpus=2,sockets=1,cores=1,threads=2 \
    --hvm \
    --noreboot \
    --network bridge=switch \
    --os-type=linux \
    --disk path=RHEL-8.6-x86_64-latest.qcow2,bus=scsi,format='qcow2',cache=none \
    --graphics vnc,listen=0.0.0.0 \
    --console unix,path=/tmp/rhel8.6,mode=bind \
    --import

2. Execute the capture script:

# cat catch.sh
#!/bin/bash
while true
do
    virsh domjobabort rhel8.6
    virsh blockjob rhel8.6 sda &> result
    if grep -q "Block" result; then
        echo "yes!"
        exit 0
    fi
    sleep 0.1
done

3. Migrate the guest:

# virsh migrate rhel8.6 qemu+ssh://root.72.90/system --live --copy-storage-all --p2p --verbose
error: operation aborted: migration out: canceled by client

Results:
As above, the migration failed. But checking the result of the capture script, it still caught the block job info:

# cat result
Block Copy: [ 0 %] Bandwidth limit: 9223372036853727232 bytes/s (8.000 EiB/s)

# sh catch.sh
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
yes!

More info: When executing 'virsh blockjob rhel8.6 sda' manually immediately after the script exited, the job is no longer there. Not sure whether we still need to handle the issue?

# virsh blockjob rhel8.6 sda
No current block job for sda

As far as I've understood, the issue was that the block job could not be cancelled and stayed around indefinitely. It doesn't seem to do that in the case you show; it just lingers for a very short time, which I don't think is problematic. So from my point of view we don't need to handle this issue anymore.

Update: Not reproduced in libvirt-8.0.0-7.el9_0.x86_64 and qemu-kvm-6.2.0-12.el9.x86_64.
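For completeness, the short window in which the job lingers after an abort could be measured with something like the following sketch. Domain "rhel8.6" and disk "sda" follow the test above; the 0.1 s poll interval and the use of `date +%s.%N` are assumptions.

```
#!/usr/bin/env bash
# Sketch: measure how long the mirror job stays visible after domjobabort.
virsh domjobabort rhel8.6
start=$(date +%s.%N)
while virsh blockjob rhel8.6 sda 2>/dev/null | grep -q "Block Copy"; do
    sleep 0.1
done
end=$(date +%s.%N)
awk -v a="$start" -v b="$end" 'BEGIN { printf "block job gone after %.1f s\n", b - a }'
```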