Bug 1989415
| Field | Value |
|---|---|
| Summary | Fail to cancel blockjob when cancelling copy-storage migration in libvirt |
| Product | Red Hat Enterprise Linux 9 |
| Reporter | Han Han <hhan> |
| Component | qemu-kvm |
| Assignee | Hanna Czenczek <hreitz> |
| qemu-kvm sub component | Block Jobs |
| QA Contact | aihua liang <aliang> |
| Status | CLOSED WORKSFORME |
| Docs Contact | |
| Severity | medium |
| Priority | medium |
| CC | aliang, coli, fjin, hhan, hreitz, jinzhao, juzhang, kkiwi, pkrempa, timao, virt-maint |
| Version | 9.0 |
| Keywords | Triaged |
| Target Milestone | rc |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | |
| Fixed In Version | |
| Doc Type | If docs needed, set a value |
| Doc Text | |
| Story Points | --- |
| Clone Of | |
| Environment | |
| Last Closed | 2022-03-29 10:18:53 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Attachments | |
Description
Han Han
2021-08-03 07:53:30 UTC
Max, can this be related to some of the things you've been investigating upstream recently?

Yes, this sounds exactly like https://gitlab.com/qemu-project/qemu/-/issues/462.

If this is what I think it is (the issue linked in comment 2; but I think I'll have to try reproducing comment 0, too, to see whether it really is), then this here should work as a reproducer:

$ qemu-img create -f qcow2 src.qcow2 64M
$ qemu-img create -f qcow2 tgt.qcow2 64M
$ qemu-nbd tgt.qcow2 &
$ qemu-system-x86_64 \
    -qmp stdio \
    -blockdev qcow2,node-name=src,file.driver=file,file.filename=src.qcow2 \
    -blockdev nbd,node-name=tgt,server.type=inet,server.host=localhost,server.port=10809

{"execute": "qmp_capabilities"}
{"return": {}}
{"execute": "blockdev-mirror", "arguments": {
    "job-id": "mirror",
    "filter-node-name": "mirror-node",
    "device": "src",
    "target": "tgt",
    "sync": "full",
    "on-target-error": "stop"
} }
{"timestamp": {"seconds": 1628665650, "microseconds": 570910}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "mirror"}}
{"timestamp": {"seconds": 1628665650, "microseconds": 570981}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "mirror"}}
{"return": {}}
{"timestamp": {"seconds": 1628665650, "microseconds": 573887}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "mirror"}}
{"timestamp": {"seconds": 1628665650, "microseconds": 573916}, "event": "BLOCK_JOB_READY", "data": {"device": "mirror", "len": 67108864, "offset": 67108864, "speed": 0, "type": "mirror"}}

Then kill the NBD server (e.g. by using ^Z to bring qemu into the background):

^Z
$ kill %1  # qemu-nbd shell job ID
$ fg

And cancel the job:

{"execute": "block-job-cancel", "arguments": { "device": "mirror" } }

For me, this leads to qemu emitting endless BLOCK_JOB_ERROR messages without being able to quit it (other than killing the process).
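For convenience, the reproducer above could be scripted roughly as follows. This is only a sketch: the sleep values are guesses that may need tuning to hit the window after BLOCK_JOB_READY, and `-display none` is an addition for headless runs. In the buggy case the final quit never takes effect, so the process has to be killed by hand.

```
#!/usr/bin/env bash
# Sketch of the reproducer above; sleep values are guesses and may need tuning.
set -e

qemu-img create -f qcow2 src.qcow2 64M
qemu-img create -f qcow2 tgt.qcow2 64M

# Export the target over NBD (qemu-nbd's default port is 10809).
qemu-nbd tgt.qcow2 &
NBD_PID=$!
sleep 1

# Drive QEMU over QMP on stdio: start the mirror, wait for it to become ready,
# break the target by killing the NBD server, then try to cancel the job.
{
    echo '{"execute": "qmp_capabilities"}'
    sleep 1
    echo '{"execute": "blockdev-mirror", "arguments": {"job-id": "mirror", "filter-node-name": "mirror-node", "device": "src", "target": "tgt", "sync": "full", "on-target-error": "stop"}}'
    sleep 5
    kill "$NBD_PID"
    sleep 1
    echo '{"execute": "block-job-cancel", "arguments": {"device": "mirror"}}'
    sleep 10
    # If the bug is present, QEMU keeps emitting BLOCK_JOB_ERROR and never quits.
    echo '{"execute": "quit"}'
} | qemu-system-x86_64 \
        -display none \
        -qmp stdio \
        -blockdev qcow2,node-name=src,file.driver=file,file.filename=src.qcow2 \
        -blockdev nbd,node-name=tgt,server.type=inet,server.host=localhost,server.port=10809
```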
Tried via libvirt; still can't reproduce it.

Test Env:
kernel version: 4.18.0-325.el8.x86_64
qemu-kvm version: qemu-kvm-6.0.0-26.module+el8.5.0+12044+525f0ebc
libvirt version: libvirt-7.6.0-1.module+el8.5.0+12097+2c77910b.x86_64

Test Steps:

1. Create test.xml:

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <qemu:capabilities>
    <qemu:add capability='blockdev'/>
    <qemu:del capability='drive'/>
  </qemu:capabilities>
  <name>rhel9</name>
  <uuid>73684f97-5e70-49fa-83f5-e1b8365d4447</uuid>
  <memory unit="KiB">2097152</memory>
  <currentMemory unit="KiB">2097152</currentMemory>
  <vcpu placement="static">1</vcpu>
  <iothreads>4</iothreads>
  <os>
    <type arch="x86_64" machine="pc">hvm</type>
    <boot dev="hd"/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <vmport state="off"/>
  </features>
  <clock offset="utc"/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <pm>
    <suspend-to-mem enabled="no"/>
    <suspend-to-disk enabled="no"/>
  </pm>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type="file" device="disk">
      <driver name="qemu" type="qcow2"/>
      <source file="/var/lib/avocado/data/avocado-vt/images/rhel850-64-virtio-scsi.qcow2"/>
      <target dev="vda" bus="virtio"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x0a" function="0x0"/>
    </disk>
    <controller type="usb" index="0" model="ich9-ehci1">
      <address type="pci" domain="0x0000" bus="0x00" slot="0x05" function="0x7"/>
    </controller>
    <controller type="usb" index="0" model="ich9-uhci1">
      <master startport="0"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x05" function="0x0" multifunction="on"/>
    </controller>
    <controller type="usb" index="0" model="ich9-uhci2">
      <master startport="2"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x05" function="0x1"/>
    </controller>
    <controller type="usb" index="0" model="ich9-uhci3">
      <master startport="4"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x05" function="0x2"/>
    </controller>
    <controller type="pci" index="0" model="pci-root"/>
    <controller type="ide" index="0">
      <address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x1"/>
    </controller>
    <controller type="virtio-serial" index="0">
      <address type="pci" domain="0x0000" bus="0x00" slot="0x06" function="0x0"/>
    </controller>
    <controller type="scsi" index="0" model="virtio-scsi">
      <address type="pci" domain="0x0000" bus="0x00" slot="0x08" function="0x0"/>
    </controller>
    <interface type="network">
      <mac address="52:54:00:84:71:f7"/>
      <source network="default"/>
      <model type="e1000"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x03" function="0x0"/>
    </interface>
    <interface type="network">
      <mac address="52:54:00:c0:59:5a"/>
      <source network="default"/>
      <model type="rtl8139"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x09" function="0x0"/>
    </interface>
    <serial type="pty">
      <target type="isa-serial" port="0">
        <model name="isa-serial"/>
      </target>
    </serial>
    <console type="pty">
      <target type="serial" port="0"/>
    </console>
    <input type="tablet" bus="usb">
      <address type="usb" bus="0" port="1"/>
    </input>
    <input type="mouse" bus="ps2"/>
    <input type="keyboard" bus="ps2"/>
    <graphics type="vnc" port="-1" autoport="yes">
      <listen type="address"/>
    </graphics>
    <sound model="ich6">
      <address type="pci" domain="0x0000" bus="0x00" slot="0x04" function="0x0"/>
    </sound>
    <audio id="1" type="none"/>
    <video>
      <model type="virtio" heads="1" primary="yes"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x0"/>
    </video>
    <memballoon model="virtio">
      <address type="pci" domain="0x0000" bus="0x00" slot="0x07" function="0x0"/>
    </memballoon>
  </devices>
  <qemu:commandline>
    <qemu:env name="MALLOC_CHECK_" value="1"/>
  </qemu:commandline>
</domain>

2. Start the VM with test.xml:

# virsh create test.xml
setlocale: No such file or directory
Domain 'rhel9' created from test.xml

3. Check virsh list:

# virsh list
setlocale: No such file or directory
 Id   Name    State
-----------------------
 1    rhel9   running

4. Create the target pool on the destination:

# virsh pool-create-as --name default --type dir --target /var/lib/avocado/data/avocado-vt/images/

5. Do storage VM migration and cancel it during migration (run 200 times):

# sh run.sh

run.sh:
DOM=rhel9
for i in {1..200}; do
    virsh migrate $DOM qemu+ssh://root.196.25/system --live --copy-storage-all --p2p --verbose &
    echo $i
    sleep_time=$((1 + $RANDOM % 45))
    echo "It will sleep for $sleep_time"
    sleep $sleep_time
    virsh domjobabort $DOM
    wait
done

Actual result:
All 200 migrations were cancelled successfully. Details: see attachment 200_migration_cancel.txt.

Additional info:
QEMU VM command line: see attachment qemu_vm_cmdline.txt
QEMU and libvirt logs: see attachment libvirtd-migration_cancel_200.log

Created attachment 1816735 [details]
VM xml and reproducing script
Reproduced on libvirt-7.3.0-1.fc35.x86_64 and qemu-kvm-5.2.0-8.fc34.x86_64.
Key points for reproducing:
The key point is to find out the minimal migration time of your environment and assign it to the variable "min_mig_time" in the script run.sh (a sketch of such a loop follows after these steps).
I found a way to get it:
1. Migrate the VM to the destination host with `time virsh migrate $DOM qemu+ssh://root.79.60/system --live --copy-storage-all --p2p --verbose` and record the total time of the migration.
2. Assign the recorded time to "min_mig_time" and run run.sh.
3. Once the first error "error: Requested operation is not valid: domain is not running" appears in the script output, the actual minimal migration time is less than "min_mig_time". Assign the last sleep value to "min_mig_time" and cancel the running script.
For example, given this script output:
```
It will sleep for 19.97
error: End of file while reading data: : Input/output error
It will sleep for 19.27
error: Requested operation is not valid: domain is not running
```
Assign 19.27 to "min_mig_time" in run.sh.
4. Migrate the VM back, then rerun run.sh. If the "domain is not running" error appears again, redo steps 3 and 4.
5. If the script neither outputs anything nor quits, the bug is reproduced. You can check the blockjob of the VM by:
➜ ~ virsh blockjob rhel9 vda
Block Copy: [ 36 %] Bandwidth limit: 9223372036853727232 bytes/s (8.000 EiB/s)
You can also try cancelling the blockjob:
➜ ~ virsh blockjob rhel9 vda --abort
It will hang.
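As referenced in the key points above, here is a hypothetical sketch of the tuned loop. It is not the attached run.sh: the destination URI is a placeholder, and the way the sleep is jittered just below min_mig_time is an assumption.

```
#!/usr/bin/env bash
# Hypothetical sketch of the tuned abort loop, not the attached run.sh.
# DEST_HOST is a placeholder; min_mig_time comes from step 1; the jitter is a guess.
DOM=rhel9
DEST="qemu+ssh://DEST_HOST/system"
min_mig_time=19.27

for i in {1..200}; do
    virsh migrate "$DOM" "$DEST" --live --copy-storage-all --p2p --verbose &
    echo "$i"
    # Sleep slightly less than the minimal migration time so the abort lands
    # while the mirror block job is still copying.
    sleep_time=$(awk -v m="$min_mig_time" 'BEGIN { srand(); printf "%.2f", m - rand() }')
    echo "It will sleep for $sleep_time"
    sleep "$sleep_time"
    virsh domjobabort "$DOM"
    wait
done
```

While the loop runs, the migration job and the block copy can be watched from another terminal, e.g. with `watch -n1 'virsh domjobinfo rhel9; virsh blockjob rhel9 vda'`.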
Versions in comment 9 should be libvirt-7.6.0-2.module+el8.5.0+12219+a5ea13d2.x86_64 and qemu-kvm-6.0.0-28.module+el8.5.0+12271+fffa967b.x86_64.

Hi Hanna,

Do we plan to fix this issue in RHEL 8.5?

BR, Aliang

Hi Aliang,

The upstream patches have not been merged yet, so I can't say for sure. Another problem is that I can't reproduce the problem from the libvirt side, even with the instructions given in comment 9. Depending on the exact sleep time, the migration is either cancelled successfully or finishes without fail.

I guess I'll create an RPM that includes the currently proposed upstream patches, and then we can see whether this is even the bug I linked in comment 2.

Hanna

Hi Han Han,

Can you test a Brew build with the patches that are currently proposed upstream (for the bug linked in comment 2)?

It's here: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=39263761
The x86-64 RPMs are here: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=39263974
(e.g. qemu-kvm-core should be here: http://download.eng.bos.redhat.com/brewroot/work/tasks/3974/39263974/qemu-kvm-core-6.0.0-28.el8.hreitz202108251725.x86_64.rpm)

Thanks!

(After thinking for a bit, I don't think we need to force this into 8.5. If we don't make it, 8.6 should be fine, too.)

Hanna

Tried more than 200 times with libvirt-7.6.0-2.module+el8.5.0+12219+a5ea13d2.x86_64 and qemu-kvm-6.0.0-28.module+el8.5.0+12271+fffa967b.x86_64 using the method in comment 9; still can't reproduce the issue, and all block jobs are cancelled successfully. Will ping Han Han to try in my test env to check whether his nested env is a key point.
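Before retesting with the scratch build Hanna linked above, it may be worth confirming that the build is actually installed and in use; a small sketch (the package name and emulator path follow this bug's setup):

```
# Sketch: confirm the scratch build is installed before rerunning the loop.
rpm -q qemu-kvm-core              # should report the hreitz202108251725 build
/usr/libexec/qemu-kvm --version   # the emulator path used in the domain XML above
# Note: an already-running domain keeps the old binary; destroy and recreate it
# so the test actually exercises the new qemu-kvm.
```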
(In reply to Hanna Reitz from comment #13)
> Hi Han Han,
>
> Can you test a Brew build with the patches that are currently proposed
> upstream (for the bug linked in comment 2)?
>
> It's here: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=39263761
> The x86-64 RPMs are here: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=39263974
> (e.g. qemu-kvm-core should be here:
> http://download.eng.bos.redhat.com/brewroot/work/tasks/3974/39263974/qemu-kvm-core-6.0.0-28.el8.hreitz202108251725.x86_64.rpm)
>
> Thanks!
>
> (After thinking for a bit, I don't think we need to force this into 8.5. If
> we don't make it, 8.6 should be fine, too.)
>
> Hanna

Tried on qemu-kvm-6.0.0-28.el8.hreitz202108251725.x86_64 and libvirt-7.6.0-2.module+el8.5.0+12219+a5ea13d2.x86_64.

It looks like something more serious has happened.

The script results:

# bash ./run.sh
It will sleep for 17.29
error: End of file while reading data: : Input/output error
It will sleep for 16.02
error: operation aborted: migration out: canceled by client
It will sleep for 18.69
error: End of file while reading data: : Input/output error
It will sleep for 18.43
error: operation aborted: migration out: canceled by client
It will sleep for 17.36
error: operation aborted: migration out: canceled by client
It will sleep for 17.18
error: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePerform3Params)

It cannot get the blockjob info anymore:

# virsh blockjob rhel9 vda
error: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePerform3Params)

Even the VM itself hangs:

# virsh domifaddr rhel9
 Name     MAC address          Protocol     Address
-------------------------------------------------------------------------------
 vnet4    52:54:00:84:71:f7    ipv4         192.168.122.170/24
 vnet5    52:54:00:c0:59:5a    ipv4         192.168.122.106/24

# ping 192.168.122.170
PING 192.168.122.170 (192.168.122.170) 56(84) bytes of data.
From 192.168.122.1 icmp_seq=1 Destination Host Unreachable
From 192.168.122.1 icmp_seq=2 Destination Host Unreachable
From 192.168.122.1 icmp_seq=3 Destination Host Unreachable
From 192.168.122.1 icmp_seq=4 Destination Host Unreachable
From 192.168.122.1 icmp_seq=5 Destination Host Unreachable
From 192.168.122.1 icmp_seq=6 Destination Host Unreachable

However, the VM could be accessed before running the script.

On another run, only the migration hangs and is never aborted:

# bash run.sh
...
Thu Aug 26 02:13:55 UTC 2021
It will sleep for 18.94

➜ ~ virsh domjobinfo rhel9
Job type:         Unbounded
Operation:        Outgoing migration
Time elapsed:     2006224 ms
➜ ~ virsh domjobabort rhel9
➜ ~ virsh domjobinfo rhel9
Job type:         Unbounded
Operation:        Outgoing migration
Time elapsed:     2039306 ms
➜ ~ virsh blockjob rhel9 vda
No current block job for vda
➜ ~ virsh domifaddr rhel9
 Name      MAC address          Protocol     Address
-------------------------------------------------------------------------------
 vnet12    52:54:00:84:71:f7    ipv4         192.168.122.170/24
 vnet13    52:54:00:c0:59:5a    ipv4         192.168.122.106/24
➜ ~ ping 192.168.122.170
PING 192.168.122.170 (192.168.122.170) 56(84) bytes of data.
64 bytes from 192.168.122.170: icmp_seq=1 ttl=64 time=1.50 ms
64 bytes from 192.168.122.170: icmp_seq=2 ttl=64 time=0.896 ms

Move RHEL-AV bugs to RHEL9. If necessary to resolve in RHEL8, then clone to the current RHEL8 release.

Hi,

So I tried reproducing this, and I thought I got somewhere because I could get some migrations to hang. But comparing the QMP logs, I saw a very different situation than the one shown in the log in https://gitlab.com/libvirt/libvirt/-/issues/197.

In my logs, qemu didn't seem unresponsive, but instead listed the block job as `ready`, then `aborting`, then it reported an I/O error, and the job was then successfully aborted and entered the `concluded` state for libvirt to dismiss. libvirt however didn't dismiss it, which is why it remained. Not sure what was happening, but qemu seemed responsive.

In your logs, the block job never reached the `ready` state and consequently the migration was never started (no `"execute": "migration"` QMP calls). I tried to time my aborts such that they would hit around the time the job reached the ready state, but could not reproduce the same situation. The migration (and the block job) were always successfully and quickly aborted.

So because the job never signals to be ready, I now doubt this is really related to https://gitlab.com/qemu-project/qemu/-/issues/462 (which I said in comment 2).

Hanna
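As a side note, the job lifecycle described here (ready, aborting, concluded, then waiting to be dismissed) can be inspected by hand through libvirt's QMP passthrough. A sketch, assuming the domain from this report; the job id "mirror" below is only a placeholder for whatever query-jobs reports:

```
# Sketch: inspect block-job state over QMP via libvirt's passthrough
# (debugging only; libvirt does not expect jobs to be dismissed behind its back).
virsh qemu-monitor-command rhel9 '{"execute": "query-jobs"}'

# Older, block-job-only view:
virsh qemu-monitor-command rhel9 '{"execute": "query-block-jobs"}'

# A job that has reached "concluded" stays listed until it is dismissed;
# "mirror" is a placeholder for the id shown by query-jobs.
virsh qemu-monitor-command rhel9 '{"execute": "job-dismiss", "arguments": {"id": "mirror"}}'
```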
It seems the issue is reproduced as below. But when executing 'virsh blockjob rhel8.6 sda' manually in a terminal immediately after the script exited, the job is gone. Not sure whether we need to handle the issue?

Tested with:
qemu-kvm-6.2.0-9.el9
kernel-5.14.0-65.kpq0.el9.x86_64

Steps:

1. Boot up a source guest:

# virt-install \
    --name rhel8.6 \
    --cpu host-model \
    --machine q35 \
    --ram=8192 \
    --vcpus=2,sockets=1,cores=1,threads=2 \
    --hvm \
    --noreboot \
    --network bridge=switch \
    --os-type=linux \
    --disk path=RHEL-8.6-x86_64-latest.qcow2,bus=scsi,format='qcow2',cache=none \
    --graphics vnc,listen=0.0.0.0 \
    --console unix,path=/tmp/rhel8.6,mode=bind \
    --import

2. Execute the capture script:

# cat catch.sh
#!/bin/bash
while true
do
    virsh domjobabort rhel8.6
    virsh blockjob rhel8.6 sda &> result
    if grep -q "Block" result; then
        echo "yes!"
        exit 0
    fi
    sleep 0.1
done

3. Migrate the guest:

# virsh migrate rhel8.6 qemu+ssh://root.72.90/system --live --copy-storage-all --p2p --verbose
error: operation aborted: migration out: canceled by client

Results:
As above, the migration failed. But checking the result of the capture script, it still caught the block job info:

# cat result
Block Copy: [ 0 %] Bandwidth limit: 9223372036853727232 bytes/s (8.000 EiB/s)

# sh catch.sh
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
error: Requested operation is not valid: no job is active on the domain
yes!

More info: When executing 'virsh blockjob rhel8.6 sda' manually immediately after the script exited, the job is no longer there. Not sure whether we still need to handle the issue?

# virsh blockjob rhel8.6 sda
No current block job for sda

As far as I've understood, the issue was that the block job could not be cancelled and stayed around indefinitely. It doesn't seem to do that in the case you show; it just lingers for a very short time, which I don't think is problematic. So from my point of view we don't need to handle this issue anymore.

Update: Not reproduced in libvirt-8.0.0-7.el9_0.x86_64 and qemu-kvm-6.2.0-12.el9.x86_64.
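For completeness, the short window in which the job lingers after an abort could be measured with something like the following sketch. Domain "rhel8.6" and disk "sda" follow the test above; the 0.1 s poll interval and the use of `date +%s.%N` are assumptions.

```
#!/usr/bin/env bash
# Sketch: measure how long the mirror job stays visible after domjobabort.
virsh domjobabort rhel8.6
start=$(date +%s.%N)
while virsh blockjob rhel8.6 sda 2>/dev/null | grep -q "Block Copy"; do
    sleep 0.1
done
end=$(date +%s.%N)
awk -v a="$start" -v b="$end" 'BEGIN { printf "block job gone after %.1f s\n", b - a }'
```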