Bug 1989415
| Summary: | Fail to cancel blockjob when cancelling copy-storage migration in libvirt |
|---|---|
| Product: | Red Hat Enterprise Linux 9 |
| Component: | qemu-kvm |
| qemu-kvm sub component: | Block Jobs |
| Version: | 9.0 |
| Status: | CLOSED WORKSFORME |
| Severity: | medium |
| Priority: | medium |
| Keywords: | Triaged |
| Reporter: | Han Han <hhan> |
| Assignee: | Hanna Czenczek <hreitz> |
| QA Contact: | aihua liang <aliang> |
| CC: | aliang, coli, fjin, hhan, hreitz, jinzhao, juzhang, kkiwi, pkrempa, timao, virt-maint |
| Flags: | pm-rhel: mirror+ |
| Target Milestone: | rc |
| Target Release: | --- |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Doc Type: | If docs needed, set a value |
| Type: | Bug |
| Last Closed: | 2022-03-29 10:18:53 UTC |
Description
Han Han
2021-08-03 07:53:30 UTC

> Max, can this be related to some of the things you've been investigating upstream recently?

Yes, this sounds exactly like https://gitlab.com/qemu-project/qemu/-/issues/462.

If this is what I think it is (the issue linked in comment 2; but I think I’ll have to try reproducing comment 0, too, to see whether it really is), then this here should work as a reproducer:

```
$ qemu-img create -f qcow2 src.qcow2 64M
$ qemu-img create -f qcow2 tgt.qcow2 64M
$ qemu-nbd tgt.qcow2 &
$ qemu-system-x86_64 \
    -qmp stdio \
    -blockdev qcow2,node-name=src,file.driver=file,file.filename=src.qcow2 \
    -blockdev nbd,node-name=tgt,server.type=inet,server.host=localhost,server.port=10809
{"execute": "qmp_capabilities"}
{"return": {}}
{"execute": "blockdev-mirror", "arguments": {
    "job-id": "mirror",
    "filter-node-name": "mirror-node",
    "device": "src",
    "target": "tgt",
    "sync": "full",
    "on-target-error": "stop" } }
{"timestamp": {"seconds": 1628665650, "microseconds": 570910}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "mirror"}}
{"timestamp": {"seconds": 1628665650, "microseconds": 570981}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "mirror"}}
{"return": {}}
{"timestamp": {"seconds": 1628665650, "microseconds": 573887}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "mirror"}}
{"timestamp": {"seconds": 1628665650, "microseconds": 573916}, "event": "BLOCK_JOB_READY", "data": {"device": "mirror", "len": 67108864, "offset": 67108864, "speed": 0, "type": "mirror"}}
```

Then kill the NBD server (e.g. by using ^Z to bring qemu into the background):

```
^Z
$ kill %1  # qemu-nbd shell job ID
$ fg
```

And cancel the job:

```
{"execute": "block-job-cancel", "arguments": { "device": "mirror" } }
```

For me, this leads to qemu emitting endless BLOCK_JOB_ERROR messages without being able to quit it (other than killing the process).

Tried by libvirt, still can't reproduce it.
Test Env:
kernel version:4.18.0-325.el8.x86_64
qemu-kvm version:qemu-kvm-6.0.0-26.module+el8.5.0+12044+525f0ebc
libvirt version:libvirt-7.6.0-1.module+el8.5.0+12097+2c77910b.x86_64
Test Steps:
1. Create test.xml
<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
<qemu:capabilities>
<qemu:add capability='blockdev'/>
<qemu:del capability='drive'/>
</qemu:capabilities>
<name>rhel9</name>
<uuid>73684f97-5e70-49fa-83f5-e1b8365d4447</uuid>
<memory unit="KiB">2097152</memory>
<currentMemory unit="KiB">2097152</currentMemory>
<vcpu placement="static">1</vcpu>
<iothreads>4</iothreads>
<os>
<type arch="x86_64" machine="pc">hvm</type>
<boot dev="hd"/>
</os>
<features>
<acpi/>
<apic/>
<vmport state="off"/>
</features>
<clock offset="utc"/>
<on_poweroff>destroy</on_poweroff>
<on_reboot>restart</on_reboot>
<on_crash>destroy</on_crash>
<pm>
<suspend-to-mem enabled="no"/>
<suspend-to-disk enabled="no"/>
</pm>
<devices>
<emulator>/usr/libexec/qemu-kvm</emulator>
<disk type="file" device="disk">
<driver name="qemu" type="qcow2"/>
<source file="/var/lib/avocado/data/avocado-vt/images/rhel850-64-virtio-scsi.qcow2"/>
<target dev="vda" bus="virtio"/>
<address type="pci" domain="0x0000" bus="0x00" slot="0x0a" function="0x0"/>
</disk>
<controller type="usb" index="0" model="ich9-ehci1">
<address type="pci" domain="0x0000" bus="0x00" slot="0x05" function="0x7"/>
</controller>
<controller type="usb" index="0" model="ich9-uhci1">
<master startport="0"/>
<address type="pci" domain="0x0000" bus="0x00" slot="0x05" function="0x0" multifunction="on"/>
</controller>
<controller type="usb" index="0" model="ich9-uhci2">
<master startport="2"/>
<address type="pci" domain="0x0000" bus="0x00" slot="0x05" function="0x1"/>
</controller>
<controller type="usb" index="0" model="ich9-uhci3">
<master startport="4"/>
<address type="pci" domain="0x0000" bus="0x00" slot="0x05" function="0x2"/>
</controller>
<controller type="pci" index="0" model="pci-root"/>
<controller type="ide" index="0">
<address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x1"/>
</controller>
<controller type="virtio-serial" index="0">
<address type="pci" domain="0x0000" bus="0x00" slot="0x06" function="0x0"/>
</controller>
<controller type="scsi" index="0" model="virtio-scsi">
<address type="pci" domain="0x0000" bus="0x00" slot="0x08" function="0x0"/>
</controller>
<interface type="network">
<mac address="52:54:00:84:71:f7"/>
<source network="default"/>
<model type="e1000"/>
<address type="pci" domain="0x0000" bus="0x00" slot="0x03" function="0x0"/>
</interface>
<interface type="network">
<mac address="52:54:00:c0:59:5a"/>
<source network="default"/>
<model type="rtl8139"/>
<address type="pci" domain="0x0000" bus="0x00" slot="0x09" function="0x0"/>
</interface>
<serial type="pty">
<target type="isa-serial" port="0">
<model name="isa-serial"/>
</target>
</serial>
<console type="pty">
<target type="serial" port="0"/>
</console>
<input type="tablet" bus="usb">
<address type="usb" bus="0" port="1"/>
</input>
<input type="mouse" bus="ps2"/>
<input type="keyboard" bus="ps2"/>
<graphics type="vnc" port="-1" autoport="yes">
<listen type="address"/>
</graphics>
<sound model="ich6">
<address type="pci" domain="0x0000" bus="0x00" slot="0x04" function="0x0"/>
</sound>
<audio id="1" type="none"/>
<video>
<model type="virtio" heads="1" primary="yes"/>
<address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x0"/>
</video>
<memballoon model="virtio">
<address type="pci" domain="0x0000" bus="0x00" slot="0x07" function="0x0"/>
</memballoon>
</devices>
<qemu:commandline>
<qemu:env name="MALLOC_CHECK_" value="1"/>
</qemu:commandline>
</domain>
2. Start the VM with test.xml
# virsh create test.xml
setlocale: No such file or directory
Domain 'rhel9' created from test.xml
3. Check virsh list
# virsh list
setlocale: No such file or directory
Id Name State
-----------------------
1 rhel9 running
4. Create the target pool on the destination
# virsh pool-create-as --name default --type dir --target /var/lib/avocado/data/avocado-vt/images/
5. Do storage VM migration and cancel during migration (run 200 times).
# sh run.sh

run.sh:
```shell
DOM=rhel9
for i in {1..200}; do
    virsh migrate $DOM qemu+ssh://root.196.25/system --live --copy-storage-all --p2p --verbose &
    echo $i
    sleep_time=$((1 + $RANDOM % 45))
    echo "It will sleep for $sleep_time"
    sleep $sleep_time
    virsh domjobabort $DOM
    wait
done
```
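Since run.sh drives the whole test, the abort window is worth a closer look: bash's `$RANDOM` is an integer in 0–32767, so `1 + RANDOM % 45` always yields a whole-second delay between 1 and 45, placing every abort somewhere in the first 45 seconds of the migration. A quick standalone sanity check (run with bash, not a plain POSIX sh, since `$RANDOM` is a bashism):

```shell
#!/bin/bash
# Sample the same expression run.sh uses and report the observed range.
min=45; max=1
for i in $(seq 1000); do
    sleep_time=$((1 + RANDOM % 45))
    [ "$sleep_time" -lt "$min" ] && min=$sleep_time
    [ "$sleep_time" -gt "$max" ] && max=$sleep_time
done
echo "min=$min max=$max"
```

With 1000 samples over 45 possible values, the printed range is almost always the full 1–45.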
Actual result:
The migration was cancelled successfully in all 200 runs.
Details: see attachment: 200_migration_cancel.txt
Additional info:
qemu vm cmdline, see attachment: qemu_vm_cmdline.txt
qemu and libvirt log, see attachment: libvirtd-migration_cancel_200.log
Created attachment 1816735 [details]
VM xml and reproducing script
Reproduced on libvirt-7.3.0-1.fc35.x86_64 qemu-kvm-5.2.0-8.fc34.x86_64
Key points for reproducing:
The key point is to find out the minimal migration time of your environment and assign it to the variable "min_mig_time" in the script run.sh.
I found a way to get it:
1. Migrate the VM to dst host with command `time virsh migrate $DOM qemu+ssh://root.79.60/system --live --copy-storage-all --p2p --verbose` then record the total time of the migration
2. Assign the recorded time to the var "min_mig_time" and run the run.sh
3. Once the first error "error: Requested operation is not valid: domain is not running" appears in the script output, the actual minimal migration time is less than "min_mig_time". Assign the last sleep value to "min_mig_time" and cancel the running script.
e.g: for the script output:
```
It will sleep for 19.97
error: End of file while reading data: : Input/output error
It will sleep for 19.27
error: Requested operation is not valid: domain is not running
```
Assign 19.27 to "min_mig_time" in run.sh.
4. Migrate the VM back, then rerun run.sh. If the "domain is not running" error happens again, redo steps 3 and 4.
5. If the script stops producing output without quitting, the bug is reproduced. You can check the block job of the VM by:
➜ ~ virsh blockjob rhel9 vda
Block Copy: [ 36 %] Bandwidth limit: 9223372036853727232 bytes/s (8.000 EiB/s)
You can also try cancelling the blockjob:
➜ ~ virsh blockjob rhel9 vda --abort
It will hang.
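The calibration procedure above can be sketched as one script. This is a hypothetical sketch, not the reporter's actual run.sh: the destination URI, the `next_sleep` helper, and the shrink-on-failure strategy are my assumptions; the idea is simply to lower "min_mig_time" whenever the abort arrives after the migration has already completed.

```shell
#!/bin/bash
# Hypothetical calibration sketch -- DOM and DEST are placeholders, not from the report.
DOM=rhel9
DEST=qemu+ssh://dst.example.com/system

# Pick a random abort delay in [m/2, m] seconds, printed with two decimals.
next_sleep() {
    awk -v m="$1" 'BEGIN { srand(); printf "%.2f", m/2 + rand() * m/2 }'
}

calibrate_loop() {
    min_mig_time=$1   # seed with a measured full-migration time (step 1)
    while true; do
        virsh migrate "$DOM" "$DEST" --live --copy-storage-all --p2p --verbose &
        sleep_time=$(next_sleep "$min_mig_time")
        echo "It will sleep for $sleep_time"
        sleep "$sleep_time"
        if ! virsh domjobabort "$DOM"; then
            # "domain is not running" means the migration finished before the
            # abort, so the real minimal migration time is below our last sleep.
            min_mig_time=$sleep_time
            echo "Lowering min_mig_time to $min_mig_time"
        fi
        wait
    done
}

# Only start the loop when a seed time is given, e.g.: bash calibrate.sh 20
if [ $# -ge 1 ]; then
    calibrate_loop "$1"
fi
```

Run this way, the loop keeps tightening the abort window on its own; the point of interest is where output stops entirely, which is the hang described above.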
Versions in comment 9 should be libvirt-7.6.0-2.module+el8.5.0+12219+a5ea13d2.x86_64 qemu-kvm-6.0.0-28.module+el8.5.0+12271+fffa967b.x86_64

Hi Hanna,
Do we plan to fix this issue in RHEL 8.5?
BR, Aliang

Hi Aliang,

The upstream patches have not been merged yet, so I can’t say for sure. Another problem is that I can’t reproduce the problem from the libvirt side, even with the instructions given in comment 9. Depending on the exact sleep time, the migration is either cancelled successfully, or finishes without fail.

I guess I’ll create an RPM that includes the currently proposed upstream patches, and then we can see whether this is even the bug I linked in comment 2.

Hanna

Hi Han Han,

Can you test a Brew build with the patches that are currently proposed upstream (for the bug linked in comment 2)?

It’s here: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=39263761
The x86-64 RPMs are here: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=39263974
(e.g. qemu-kvm-core should be here: http://download.eng.bos.redhat.com/brewroot/work/tasks/3974/39263974/qemu-kvm-core-6.0.0-28.el8.hreitz202108251725.x86_64.rpm)

Thanks!

(After thinking for a bit, I don’t think we need to force this into 8.5. If we don’t make it, 8.6 should be fine, too.)

Hanna

Tried more than 200 times with libvirt-7.6.0-2.module+el8.5.0+12219+a5ea13d2.x86_64 qemu-kvm-6.0.0-28.module+el8.5.0+12271+fffa967b.x86_64 with the method in comment 9, still can't reproduce the issue, and all block jobs were cancelled successfully. Will ping Han Han to try in my test env to check whether his nested env is a key point.

(In reply to Hanna Reitz from comment #13)
> Can you test a Brew build with the patches that are currently proposed
> upstream (for the bug linked in comment 2)?

Tried on qemu-kvm-6.0.0-28.el8.hreitz202108251725.x86_64 libvirt-7.6.0-2.module+el8.5.0+12219+a5ea13d2.x86_64. It looks like something more serious has happened.

The script results:

```
# bash ./run.sh
It will sleep for 17.29
error: End of file while reading data: : Input/output error
It will sleep for 16.02
error: operation aborted: migration out: canceled by client
It will sleep for 18.69
error: End of file while reading data: : Input/output error
It will sleep for 18.43
error: operation aborted: migration out: canceled by client
It will sleep for 17.36
error: operation aborted: migration out: canceled by client
It will sleep for 17.18
error: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePerform3Params)
```

It cannot get the block job info anymore:

```
# virsh blockjob rhel9 vda
error: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePerform3Params)
```

Even the VM itself hangs:

```
# virsh domifaddr rhel9
 Name    MAC address          Protocol    Address
-------------------------------------------------------------------------------
 vnet4   52:54:00:84:71:f7    ipv4        192.168.122.170/24
 vnet5   52:54:00:c0:59:5a    ipv4        192.168.122.106/24

# ping 192.168.122.170
PING 192.168.122.170 (192.168.122.170) 56(84) bytes of data.
From 192.168.122.1 icmp_seq=1 Destination Host Unreachable
From 192.168.122.1 icmp_seq=2 Destination Host Unreachable
From 192.168.122.1 icmp_seq=3 Destination Host Unreachable
From 192.168.122.1 icmp_seq=4 Destination Host Unreachable
From 192.168.122.1 icmp_seq=5 Destination Host Unreachable
From 192.168.122.1 icmp_seq=6 Destination Host Unreachable
```

However, the VM could be accessed before running the script.

On another run, only the migration hangs and is never aborted:

```
# bash run.sh
...
Thu Aug 26 02:13:55 UTC 2021
It will sleep for 18.94

➜ ~ virsh domjobinfo rhel9
Job type:         Unbounded
Operation:        Outgoing migration
Time elapsed:     2006224 ms
➜ ~ virsh domjobabort rhel9
➜ ~ virsh domjobinfo rhel9
Job type:         Unbounded
Operation:        Outgoing migration
Time elapsed:     2039306 ms
➜ ~ virsh blockjob rhel9 vda
No current block job for vda
➜ ~ virsh domifaddr rhel9
 Name     MAC address          Protocol    Address
-------------------------------------------------------------------------------
 vnet12   52:54:00:84:71:f7    ipv4        192.168.122.170/24
 vnet13   52:54:00:c0:59:5a    ipv4        192.168.122.106/24
➜ ~ ping 192.168.122.170
PING 192.168.122.170 (192.168.122.170) 56(84) bytes of data.
64 bytes from 192.168.122.170: icmp_seq=1 ttl=64 time=1.50 ms
64 bytes from 192.168.122.170: icmp_seq=2 ttl=64 time=0.896 ms
```

Move RHEL-AV bugs to RHEL9. If necessary to resolve in RHEL8, then clone to the current RHEL8 release.

Hi,

So I tried reproducing this, and I thought I got somewhere because I could get some migration to hang. But comparing the QMP logs, I saw a very different situation than shown in the log in https://gitlab.com/libvirt/libvirt/-/issues/197. In my logs, qemu didn’t seem unresponsive, but instead listed the block job as `ready`, then `aborting`, then it reported an I/O error, and the job was then successfully aborted and entered the `concluded` state for libvirt to dismiss. libvirt however didn’t dismiss it, which is why it remained. Not sure what was happening, but qemu seemed responsive.

In your logs, the block job never reached the `ready` state and consequently the migration was never started (no `"execute": "migration"` QMP calls). I tried to time my aborts such that they would hit around the time the job reached the ready state, but could not reproduce the same situation. The migration (and the block job) were always successfully and quickly aborted.

So because the job never signals that it is ready, I now doubt this is really related to https://gitlab.com/qemu-project/qemu/-/issues/462 (which I said in comment 2).

Hanna

Seems I reproduced the issue as below.
But when executing 'virsh blockjob rhel8.6 sda' again manually in a terminal immediately after the script exited, the job is already gone. Not sure whether we need to handle the issue?
Tested with:
qemu-kvm-6.2.0-9.el9
kernel-5.14.0-65.kpq0.el9.x86_64
Steps:
1. Boot up a source guest
```shell
# virt-install \
    --name rhel8.6 \
    --cpu host-model \
    --machine q35 \
    --ram=8192 \
    --vcpus=2,sockets=1,cores=1,threads=2 \
    --hvm \
    --noreboot \
    --network bridge=switch \
    --os-type=linux \
    --disk path=RHEL-8.6-x86_64-latest.qcow2,bus=scsi,format='qcow2',cache=none \
    --graphics vnc,listen=0.0.0.0 \
    --console unix,path=/tmp/rhel8.6,mode=bind \
    --import
```
2. Execute the capture script
# cat catch.sh
```shell
#!/bin/bash
while true; do
    virsh domjobabort rhel8.6
    virsh blockjob rhel8.6 sda &> result
    if grep -q "Block" result; then
        echo "yes!"
        exit 0
    fi
    sleep 0.1
done
```
3. Migrate the guest
# virsh migrate rhel8.6 qemu+ssh://root.72.90/system --live --copy-storage-all --p2p --verbose
error: operation aborted: migration out: canceled by client
Results:
As above, the migration was aborted. But checking the result file of the capture script, it still caught the block job info.
# cat result
Block Copy: [ 0 %] Bandwidth limit: 9223372036853727232 bytes/s (8.000 EiB/s)
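The huge bandwidth limit shown above is how libvirt reports an effectively unlimited copy job; the number appears to be LLONG_MAX rounded down to a MiB boundary (2^63 − 2^20 bytes/s), which rounds to the displayed 8.000 EiB/s. A quick check of the arithmetic (1 EiB = 2^60 bytes):

```shell
# Express 9223372036853727232 B/s in EiB/s.
awk 'BEGIN { printf "%.3f EiB/s\n", 9223372036853727232 / 2^60 }'
# prints "8.000 EiB/s"
```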
# sh catch.sh
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
yes!
More info:
While executing 'virsh blockjob rhel8.6 sda' manually immediately after the script exited, the job is gone. Not sure whether we still need to handle the issue?
# virsh blockjob rhel8.6 sda
No current block job for sda
As far as I’ve understood, the issue was that the block job could not be cancelled and stayed around indefinitely. It doesn’t seem to do that in the case you show; it just lingers for a very short time, which I don’t think is problematic. So from my point of view we don’t need to handle this issue anymore.

Update: Not reproduced in libvirt-8.0.0-7.el9_0.x86_64 qemu-kvm-6.2.0-12.el9.x86_64