Bug 1822538 - activeNVMeHostdevs is not reset when error happens on blockcopy
Summary: activeNVMeHostdevs is not reset when error happens on blockcopy
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: libvirt
Version: 8.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 8.3
Assignee: Michal Privoznik
QA Contact: gaojianan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-04-09 09:54 UTC by Han Han
Modified: 2021-03-23 09:15 UTC
CC List: 13 users

Fixed In Version: libvirt-6.3.0-1.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-17 17:48:08 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
libvirtd log, domain xml, disk xml (112.93 KB, application/gzip)
2020-04-09 09:54 UTC, Han Han
domain xml, disk xml, libvirtd log, qemu log, qemu events (405.46 KB, application/gzip)
2020-04-20 09:03 UTC, Han Han
qemu dtrace log (3.39 MB, application/x-xz)
2020-04-21 03:19 UTC, Han Han

Description Han Han 2020-04-09 09:54:42 UTC
Created attachment 1677495 [details]
libvirtd log, domain xml, disk xml

Description of problem:
As the summary says: libvirt's activeNVMeHostdevs list is not reset when a blockcopy to an NVMe destination fails.

Version-Release number of selected component (if applicable):
libvirt-6.0.0-17.module+el8.2.0+6257+0d066c28.x86_64
qemu-kvm-4.2.0-17.module+el8.2.0+6141+0f540f16.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Prepare a running VM. 
Prepare a host with nvme disk:
44:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)

Prepare the invalid and valid nvme dest xml:
invalid:
<disk type="nvme" device="disk">
      <driver name="qemu" type="qcow2"/>
      <source type="pci" managed="yes" namespace="1" index="1">
        <address domain="0x0000" bus="0x44" slot="0x00" function="0x0"/>
        <slices>
          <slice type="storage" offset="0" size="10737418240"/>
        </slices>
      </source>
      <target dev="sda" bus="scsi"/>
    </disk>

valid:
<disk type="nvme" device="disk">
      <driver name="qemu" type="raw"/>
      <source type="pci" managed="yes" namespace="1" index="1">
        <address domain="0x0000" bus="0x44" slot="0x00" function="0x0"/>
      </source>
      <target dev="sda" bus="scsi"/>
    </disk>

2. Blockcopy to the invalid xml:
~ virsh blockcopy pc sda --xml /tmp/nvme.xml --wait --verbose --transient-job
error: Operation not supported: creation of images with slice type='storage' is not supported

Blockcopy to the valid xml:
➜  ~ virsh blockcopy pc sda --xml /tmp/nvme-raw.xml --wait --verbose --transient-job
error: Requested operation is not valid: NVMe device 0000:44:00.0 already in use by driver QEMU domain pc

3. Restart libvirtd and blockcopy to the valid XML again; it now succeeds:
➜  ~ systemctl restart libvirtd                                                     
➜  ~ virsh blockcopy pc sda --xml /tmp/nvme-raw.xml --wait --verbose --transient-job
Block Copy: [100 %]
Now in mirroring phase

Actual results:
As above

Expected results:
Reset activeNVMeHostdevs after a failed blockcopy.

Additional info:
Also, raise this message from debug to error level:
2020-04-09 09:49:26.752+0000: 439842: debug : virHostdevPrepareOneNVMeDevice:2318 : Requested operation is not valid: NVMe device 0000:44:00.0 already in use by driver QEMU domain pc
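
Until that message is raised above debug level, it only appears with debug logging enabled. A minimal libvirtd.conf sketch to capture it (the "util.hostdev" filter name for virhostdev.c is my assumption, not taken from the log above):

# /etc/libvirt/libvirtd.conf
log_filters="1:util.hostdev 1:qemu"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"

Restart libvirtd afterwards for the settings to take effect.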

Comment 1 Michal Privoznik 2020-04-14 12:16:44 UTC
Patch proposed upstream:

https://www.redhat.com/archives/libvir-list/2020-April/msg00652.html

Comment 3 Michal Privoznik 2020-04-16 14:17:52 UTC
Pushed upstream as:

67e19fc99a qemu: Revoke access to mirror on failed blockcopy

v6.2.0-159-g67e19fc99a

Comment 4 Han Han 2020-04-20 09:03:01 UTC
Created attachment 1680207 [details]
domain xml, disk xml, libvirtd log, qemu log, qemu events

Fails with about 50% probability when testing on qemu-kvm-5.0.0-0.scrmod+el8.3.0+6312+cee4f348.x86_64, libvirt v6.2.0-173-gee62b98e01, 4.18.0-193.el8.x86_64

Steps:
Loop over the steps from the bug report with this script:
#!/bin/bash
INVALID=invalid.xml
VALID=valid.xml
VM=pc
DISK=vda
VFIO_ADM=./vfio_adm
NVME_ADDR=0000:44:00.0
LOG=libvirtd.log

function setup(){
        virsh start $VM && sleep 30
}
function cleanup(){
        virsh destroy $VM
        $VFIO_ADM --detach --device $NVME_ADDR
}

while true;do
        setup
        tail -f /var/log/libvirt/libvirtd.log > $LOG &
        virsh blockcopy $VM $DISK --xml $INVALID --transient-job --wait --verbose
        virsh blockcopy $VM $DISK --xml $VALID --transient-job --wait --verbose
        if [ $? -ne 0 ];then
                echo bug
                break
        fi
        kill %1
        cleanup
done


Failed at:
~ virsh blockcopy pc vda --xml valid.xml --transient-job --wait --verbose                                                                                                                              
Block Copy: [ 61 %]error:
Copy failed


No error in the libvirtd log, but there is a QEMU error "VFIO_MAP_DMA failed: Cannot allocate memory", which looks similar to https://bugs.launchpad.net/qemu/+bug/1863333

I tried to gather more QEMU traces using the events listed in the file vfio-events and the following arguments:
  <qemu:commandline>
    <qemu:arg value='-trace'/>
    <qemu:arg value='log,events=/tmp/vfio-events'/>
  </qemu:commandline>
But no trace events were produced by QEMU...

Comment 5 Han Han 2020-04-20 09:05:27 UTC
Michal, could you please help check whether the issue in comment 4 is a QEMU bug?
And could you share an example of QEMU event tracing for downstream qemu-kvm? :)

Comment 6 Han Han 2020-04-20 09:43:50 UTC
For comment 4, the following dmesg messages are found:
> [1570608.440452] device vnet0 left promiscuous mode
> [1570608.445115] virbr0: port 2(vnet0) entered disabled state
> [1570610.544704] virbr0: port 2(vnet0) entered blocking state
> [1570610.550238] virbr0: port 2(vnet0) entered disabled state
> [1570610.555875] device vnet0 entered promiscuous mode
> [1570610.561009] virbr0: port 2(vnet0) entered blocking state
> [1570610.566538] virbr0: port 2(vnet0) entered listening state
> [1570610.630260] scsi host5: iSCSI Initiator over TCP/IP
> [1570610.647187]  connection150:0: detected conn error (1020)
> [1570612.594286] virbr0: port 2(vnet0) entered learning state
> [1570614.643289] virbr0: port 2(vnet0) entered forwarding state
> [1570614.643291] virbr0: topology change detected, propagating
> [1570642.099384] vfio-pci 0000:44:00.0: enabling device (0400 -> 0402)
> [1570642.211525] vfio_ecap_init: 0000:44:00.0 hiding ecap 0x19@0x2a0
> [1570645.137327] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570645.144273] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570645.151145] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570645.158022] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570645.164888] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570645.171850] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570645.178720] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570645.185584] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570645.192449] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570645.199316] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570645.206178] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570645.213041] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570645.219908] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570645.226780] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570645.233641] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570646.128029] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570646.134977] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570646.141850] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570646.148714] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570646.155580] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570646.162454] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570646.169317] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570646.176184] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570646.183051] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570646.189916] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570646.196779] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570646.203639] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570646.210501] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570646.217363] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570646.224225] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570647.436204] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570647.443140] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570647.450029] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570647.456894] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570647.463757] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570647.470619] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570647.477484] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570647.484347] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570647.491211] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570647.498072] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570647.504933] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570647.511789] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570647.518676] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570648.420958] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570648.427885] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570648.434764] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570648.441623] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570648.448480] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570648.455339] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570648.462199] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570648.469063] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570648.475920] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570648.482780] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570648.489642] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570648.496503] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570648.503362] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570648.510250] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570648.517113] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570649.566156] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> [1570649.631453] vfio-pci 0000:44:00.0: Event logged [IO_PAGE_FAULT domain=0x007d address=0x7fc81e1000 flags=0x0000]
> [1570649.641805] vfio-pci 0000:44:00.0: Event logged [IO_PAGE_FAULT domain=0x007d address=0x7fc81e0000 flags=0x0000]
> [1570649.652154] vfio-pci 0000:44:00.0: Event logged [IO_PAGE_FAULT domain=0x007d address=0x7fc81e1700 flags=0x0000]
> [1570649.662504] vfio-pci 0000:44:00.0: Event logged [IO_PAGE_FAULT domain=0x007d address=0x7fc81e2000 flags=0x0000]
> [1570649.672850] vfio-pci 0000:44:00.0: Event logged [IO_PAGE_FAULT domain=0x007d address=0x7fc81e3000 flags=0x0000]
> [1570649.683203] vfio-pci 0000:44:00.0: Event logged [IO_PAGE_FAULT domain=0x007d address=0x7fc81e4000 flags=0x0000]

Comment 7 Michal Privoznik 2020-04-20 13:17:11 UTC
(In reply to Han Han from comment #4)
> > [1570642.099384] vfio-pci 0000:44:00.0: enabling device (0400 -> 0402)
> > [1570642.211525] vfio_ecap_init: 0000:44:00.0 hiding ecap 0x19@0x2a0
> > [1570645.137327] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded

I remember seeing these when developing NVMe feature. And the limit of 2GB shows that libvirt added 1GB to how much memory your domain has (1GB). Unfortunately, I don't remember how I addressed it. I vaguely recall asking Alex Williamson, but I can't find the conversation anywhere, so I better ask him now. Alex, do you remember these errors (or me talking to you about them)?
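
A quick sketch to double-check that limit on the running guest (assuming a single qemu-kvm process on the host, so pidof returns one PID), reading the locked-memory limit actually applied to the QEMU process:

# limit libvirt set for the QEMU process
prlimit --memlock --pid "$(pidof qemu-kvm)"
# the same information straight from procfs
grep 'locked memory' /proc/"$(pidof qemu-kvm)"/limits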

Comment 8 Han Han 2020-04-21 03:19:53 UTC
Created attachment 1680402 [details]
qemu dtrace log

I found the way to trace downstream qemu:
Trace the blockcopy by 
~ qemu-trace-stap -v run --pid `pidof qemu-kvm` /usr/libexec/qemu-kvm 'qio*' 'nvme*' 'vfio*'

And the issue of 'vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded' can be triggered directly by a single blockcopy:
~ virsh blockcopy pc vda --xml valid.xml --transient-job --wait --verbose

Comment 9 Alex Williamson 2020-04-21 23:05:24 UTC
(In reply to Michal Privoznik from comment #7)
> (In reply to Han Han from comment #4)
> > > [1570642.099384] vfio-pci 0000:44:00.0: enabling device (0400 -> 0402)
> > > [1570642.211525] vfio_ecap_init: 0000:44:00.0 hiding ecap 0x19@0x2a0
> > > [1570645.137327] vfio_pin_pages_remote: RLIMIT_MEMLOCK (2147483648) exceeded
> 
> I remember seeing these when developing NVMe feature. And the limit of 2GB
> shows that libvirt added 1GB to how much memory your domain has (1GB).
> Unfortunately, I don't remember how I addressed it. I vaguely recall asking
> Alex Williamson, but I can't find the conversation anywhere, so I better ask
> him now. Alex, do you remember these errors (or me talking to you about
> them)?

Sorry, I don't recall that conversation either, and to be honest I don't claim maintainership, or even familiarity really, for the vfio-based nvme device, if I understand what we're testing here.  This was all Fam's work and I'm afraid it's somewhat orphaned now.  As you've identified, we're clearly hitting the locked memory limit, which suggests the driver is either exceeding what we expect or leaking mappings, but I don't know off hand how this driver manages its mappings to understand its locked memory requirements.
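
One way to tell whether mappings are leaking, rather than simply exceeding the expected amount, would be to watch the QEMU process's locked memory grow during the copy. A sketch, again assuming a single qemu-kvm process:

# sample locked memory (VmLck) once per second while the blockcopy runs
watch -n1 "grep VmLck /proc/$(pidof qemu-kvm)/status"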

Comment 10 Han Han 2020-04-22 08:44:43 UTC
Hi all, for the issue in comment4, let's track it at https://bugzilla.redhat.com/show_bug.cgi?id=1826647

Comment 13 gaojianan 2020-08-26 03:40:42 UTC
There is another problem, tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1825785; after that problem is resolved I will verify this bug.

Comment 14 gaojianan 2020-09-03 03:19:57 UTC
Tried to verify this bug with libvirt version:
libvirt-6.6.0-4.module+el8.3.0+7883+3d717aa8.x86_64

1. Clear the NVMe environment:
# ./vfio_adm --attach --device 0000:44:00.0                                
Attaching device 0000:44:00.0 to VFIO/UIO

# ./vfio_adm --detach --device 0000:44:00.0                                
Detaching device 0000:44:00.0 from VFIO/UIO

2. Prepare a valid XML:
<disk type="nvme" device="disk">
      <driver name="qemu" type="raw"/>
      <source type="pci" managed="yes" namespace="1" index="1">
        <address domain="0x0000" bus="0x44" slot="0x00" function="0x0"/>
      </source>
      <target dev="sda" bus="scsi"/>
    </disk>

3. Blockcopy to the valid XML:
# virsh blockcopy nvme vda --xml valid.xml --wait --verbose --transient-job

Copy failed

Check libvirtd.log; the error "Source and target image have different sizes" is found:
2020-09-03 02:57:23.598+0000: 11394: debug : virThreadJobSet:94 : Thread 11394 (rpc-worker) is now running job remoteDispatchDomainGetBlockJobInfo
2020-09-03 02:57:23.598+0000: 11394: debug : virDomainGetBlockJobInfo:9959 : dom=0x55a219f2bb80, (VM: name=nvme, uuid=bca87ea8-7499-4c1d-81fd-8ecd71162b91), disk=vda, info=0x7f94f4f1f7e0, flags=0x0
2020-09-03 02:57:23.598+0000: 11394: debug : qemuDomainObjBeginJobInternal:840 : Starting job: job=query agentJob=none asyncJob=none (vm=0x55a219ee46a0 name=nvme, current job=modify agentJob=none async=none)
2020-09-03 02:57:23.598+0000: 11394: debug : qemuDomainObjBeginJobInternal:868 : Waiting for job (vm=0x55a219ee46a0 name=nvme)
2020-09-03 02:57:23.598+0000: 11476: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"return": [{"current-progress": 0, "status": "concluded", "total-progress": 0, "type": "mirror", "id": "copy-vda-libvirt-1-format", "error": "Source and target image have different sizes"}], "id": "libvirt-25"}]
2020-09-03 02:57:23.598+0000: 11476: info : qemuMonitorJSONIOProcessLine:240 : QEMU_MONITOR_RECV_REPLY: mon=0x7f9488004010 reply={"return": [{"current-progress": 0, "status": "concluded", "total-progress": 0, "type": "mirror", "id": "copy-vda-libvirt-1-format", "error": "Source and target image have different sizes"}], "id": "libvirt-25"}
2020-09-03 02:57:23.598+0000: 11494: debug : qemuMonitorJobDismiss:3330 : jobname=copy-vda-libvirt-1-format
2020-09-03 02:57:23.598+0000: 11494: debug : qemuMonitorJobDismiss:3332 : mon:0x7f9488004010 vm:0x55a219ee46a0 fd:31
2020-09-03 02:57:23.598+0000: 11494: info : qemuMonitorSend:946 : QEMU_MONITOR_SEND_MSG: mon=0x7f9488004010 msg={"execute":"job-dismiss","arguments":{"id":"copy-vda-libvirt-1-format"},"id":"libvirt-26"}^M
 fd=-1
2020-09-03 02:57:23.598+0000: 11476: info : qemuMonitorIOWrite:433 : QEMU_MONITOR_IO_WRITE: mon=0x7f9488004010 buf={"execute":"job-dismiss","arguments":{"id":"copy-vda-libvirt-1-format"},"id":"libvirt-26"}^M
 len=92 ret=92 errno=0
2020-09-03 02:57:23.599+0000: 11476: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 1599101843, "microseconds": 599599}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "copy-vda-libvirt-1-format"}}]
2020-09-03 02:57:23.599+0000: 11476: info : qemuMonitorJSONIOProcessLine:235 : QEMU_MONITOR_RECV_EVENT: mon=0x7f9488004010 event={"timestamp": {"seconds": 1599101843, "microseconds": 599599}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "copy-vda-libvirt-1-format"}}

and in dmesg log we can see:
[ 5374.981746] nvme nvme5: pci function 0000:44:00.0
[ 5375.093669] nvme nvme5: 16/0/0 default/read/poll queues
[ 5375.102037]  nvme5n1: p1 p2
[ 5515.125756] vfio_ecap_init: 0000:44:00.0 hiding ecap 0x19@0x2a0
[ 5518.030206] nvme nvme5: pci function 0000:44:00.0
[ 5538.629870] nvme nvme5: Device not ready; aborting initialisation, CSTS=0x0
[ 5538.636887] nvme nvme5: Removing after probe failure status: -19

Additional info:
Guest image:
# qemu-img info /var/lib/libvirt/images/test -U
image: /var/lib/libvirt/images/test
file format: qcow2
virtual size: 10 GiB (10737418240 bytes)
disk size: 1.58 GiB
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    refcount bits: 16
    corrupt: false

nvme disk:
nvme6n1                                                                                             259:0    0 372.6G  0 disk  
├─nvme.8086-43564654373338343030513734303042474e-494e54454c205353445045444d443430304734-00000001    253:2    0 372.6G  0 mpath 
│ ├─nvme.8086-43564654373338343030513734303042474e-494e54454c205353445045444d443430304734-00000001p1
│ │                                                                                                 253:4    0     1G  0 part  
│ └─nvme.8086-43564654373338343030513734303042474e-494e54454c205353445045444d443430304734-00000001p2
│                                                                                                   253:5    0     9G  0 part  
├─nvme6n1p1                                                                                         259:1    0     1G  0 part  
└─nvme6n1p2                                                                                         259:2    0     9G  0 part  

I think this is another libvirt bug. Do you think so, @Michal?

Comment 15 Michal Privoznik 2020-09-04 10:22:21 UTC
(In reply to gaojianan from comment #14)
> Try to verified this bug at libvirt version:
> libvirt-6.6.0-4.module+el8.3.0+7883+3d717aa8.x86_64
> 
> 1.Clear the nvme env:
> # ./vfio_adm --attach --device 0000:44:00.0                                
> Attaching device 0000:44:00.0 to VFIO/UIO
> 
> # ./vfio_adm --detach --device 0000:44:00.0                                
> Detaching device 0000:44:00.0 from VFIO/UIO
> 
> 2.Prepare a valid xml:
> <disk type="nvme" device="disk">
>       <driver name="qemu" type="raw"/>
>       <source type="pci" managed="yes" namespace="1" index="1">
>         <address domain="0x0000" bus="0x44" slot="0x00" function="0x0"/>
>       </source>
>       <target dev="sda" bus="scsi"/>
>     </disk>
> 
> 2. Blockcopy to the valid xml:
> # virsh blockcopy nvme vda --xml valid.xml --wait --verbose --transient-job
> 
> Copy failed
> 
> Check the libvirtd.log and found the error ""error": "Source and target
> image have different sizes"":
> 2020-09-03 02:57:23.598+0000: 11394: debug : virThreadJobSet:94 : Thread
> 11394 (rpc-worker) is now running job remoteDispatchDomainGetBlockJobInfo
> 2020-09-03 02:57:23.598+0000: 11394: debug : virDomainGetBlockJobInfo:9959 :
> dom=0x55a219f2bb80, (VM: name=nvme,
> uuid=bca87ea8-7499-4c1d-81fd-8ecd71162b91), disk=vda, info=0x7f94f4f1f7e0,
> flags=0x0
> 2020-09-03 02:57:23.598+0000: 11394: debug :
> qemuDomainObjBeginJobInternal:840 : Starting job: job=query agentJob=none
> asyncJob=none (vm=0x55a219ee46a0 name=nvme, current job=modify agentJob=none
> async=none)
> 2020-09-03 02:57:23.598+0000: 11394: debug :
> qemuDomainObjBeginJobInternal:868 : Waiting for job (vm=0x55a219ee46a0
> name=nvme)
> 2020-09-03 02:57:23.598+0000: 11476: debug :
> qemuMonitorJSONIOProcessLine:220 : Line [{"return": [{"current-progress": 0,
> "status": "concluded", "total-progress": 0, "type": "mirror", "id":
> "copy-vda-libvirt-1-format", "error": "Source and target image have
> different sizes"}], "id": "libvirt-25"}]
> 2020-09-03 02:57:23.598+0000: 11476: info : qemuMonitorJSONIOProcessLine:240
> : QEMU_MONITOR_RECV_REPLY: mon=0x7f9488004010 reply={"return":
> [{"current-progress": 0, "status": "concluded", "total-progress": 0, "type":
> "mirror", "id": "copy-vda-libvirt-1-format", "error": "Source and target
> image have different sizes"}], "id": "libvirt-25"}

This error message comes from QEMU. I have no idea why that is a problem for QEMU. Stefan, maybe you know?

Comment 16 Stefan Hajnoczi 2020-09-10 15:35:31 UTC
(In reply to gaojianan from comment #14)
> 2. Blockcopy to the valid xml:
> # virsh blockcopy nvme vda --xml valid.xml --wait --verbose --transient-job
> 
> Copy failed
> 
> Check the libvirtd.log and found the error ""error": "Source and target
> image have different sizes"":

Did you confirm that the source and target images have the exact same size?

The output posted does not seem to contain this information.
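
For reference, a sketch of how the two sizes could be compared; the image path matches comment 14, and /dev/nvme5n1 is assumed to be the destination namespace, checked while it is still bound to the host nvme driver (i.e. before the blockcopy starts):

# virtual size of the source disk image, in bytes
qemu-img info --output=json -U /var/lib/libvirt/images/test | grep virtual-size
# size of the destination NVMe namespace, in bytes
blockdev --getsize64 /dev/nvme5n1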

Comment 17 Stefan Hajnoczi 2020-09-10 15:38:29 UTC
Adding Philippe, who is working on the nvme driver in QEMU.

Comment 18 Peter Krempa 2020-09-10 15:39:29 UTC
That might be hard to achieve when the destination of the copy is a NVMe drive. I'd suggest that qemu doesn't cause a fatal error if the destination is larger than source.

Comment 19 Stefan Hajnoczi 2020-09-11 08:56:06 UTC
(In reply to Peter Krempa from comment #18)
> That might be hard to achieve when the destination of the copy is a NVMe
> drive. I'd suggest that qemu doesn't cause a fatal error if the destination
> is larger than source.

NVMe Namespaces or traditional partitions/volumes can be used to divide up NVMe drives into correctly-sized pieces. That way free blocks can still be used for other purposes later. I think this should be the recommended approach to migrating/copying to a bigger NVMe drive.

About your suggestion:
How will the destination keep track of the correct virtual disk size? There isn't enough information to answer this in QEMU. At best a parameter could be added to disable the size check and the user needs to ensure that disk size is preserved.

Comment 20 gaojianan 2020-09-28 02:46:55 UTC
(In reply to Stefan Hajnoczi from comment #16)
> (In reply to gaojianan from comment #14)
> > 2. Blockcopy to the valid xml:
> > # virsh blockcopy nvme vda --xml valid.xml --wait --verbose --transient-job
> > 
> > Copy failed
> > 
> > Check the libvirtd.log and found the error ""error": "Source and target
> > image have different sizes"":
> 
> Did you confirm that the source and target images have the exact same size?
> 
> The output posted does not seem to contain this information.

The target size is bigger than the source size, so I think the blockcopy should succeed, but it failed.

Comment 21 gaojianan 2020-10-13 11:21:11 UTC
Tried to verify at
libvirt-6.6.0-6.module+el8.3.0+8125+aefcf088.x86_64
qemu-kvm-5.1.0-13.module+el8.3.0+8382+afc3bbea.x86_64
but there is still a problem.

Scenario 1: Invalid XML
1.# cat nvme.xml
<disk type="nvme" device="disk">
      <driver name="qemu" type="qcow2"/>
      <source type="pci" managed="yes" namespace="1" index="1">
        <address domain="0x0000" bus="0x44" slot="0x00" function="0x0"/>
        <slices>
          <slice type="storage" offset="0" size="10737418240"/>
        </slices>
      </source>
      <target dev="sda" bus="scsi"/>
    </disk>

2. Do blockcopy:
#  virsh blockcopy test vda --xml nvme.xml --wait --verbose --transient-job 
error: Operation not supported: creation of images with slice type='storage' is not supported

3. Check the NVMe device status:
# lsblk                                     
NAME                           MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                              8:0    0 447.1G  0 disk  
├─sda1                           8:1    0     1G  0 part  /boot
└─sda2                           8:2    0 446.1G  0 part  
  ├─rhel_hp--dl385g10--15-root 253:0    0   250G  0 lvm   /
  └─rhel_hp--dl385g10--15-swap 253:1    0  15.7G  0 lvm   [SWAP]
nvme5n1                        259:0    0 372.6G  0 disk  
├─mpathr                       253:2    0 372.6G  0 mpath 
│ └─mpathr1                    253:3    0    10G  0 part  
└─nvme5n1p1                    259:1    0    10G  0 part  

The NVMe device still exists, so this scenario passes.

Scenario 2: Valid XML, but the blockcopy operation fails
1.# cat nvme1.xml
  <disk type='nvme' device='disk'>
    <driver name='qemu' type='raw'/>
    <source type='pci' managed='yes' namespace='1'>
      <address domain='0x0000' bus='0x44' slot='0x00' function='0x0'/>
    </source>
    <target dev='vde' bus='virtio'/>
  </disk>

2. Do blockcopy:
virsh blockcopy test vda --xml nvme1.xml --wait --verbose --transient-job

Copy failed
It failed here because the source image size is not the same as the target size.

3. Check the NVMe device status:
# lsblk
NAME                           MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                              8:0    0 447.1G  0 disk 
├─sda1                           8:1    0     1G  0 part /boot
└─sda2                           8:2    0 446.1G  0 part 
  ├─rhel_hp--dl385g10--15-root 253:0    0   250G  0 lvm  /
  └─rhel_hp--dl385g10--15-swap 253:1    0  15.7G  0 lvm  [SWAP]
The NVMe device is gone and cannot be restored.

So I think there is still a problem in scenario 2; the NVMe device should be restored when the blockcopy fails.

Comment 23 Michal Privoznik 2020-10-13 12:04:41 UTC
(In reply to gaojianan from comment #21)

> Scenario2: Valid xml but blockcopy operation failed
> 1.# cat nvme1.xml
>   <disk type='nvme' device='disk'>
>     <driver name='qemu' type='raw'/>
>     <source type='pci' managed='yes' namespace='1'>
>       <address domain='0x0000' bus='0x44' slot='0x00' function='0x0'/>
>     </source>
>     <target dev='vde' bus='virtio'/>
>   </disk>
> 
> 2.Do blockcopy
> virsh blockcopy test vda --xml nvme1.xml --wait --verbose --transient-job
> 
> Copy failed
> Here failed because the source image size is not same as target size
> 
> 3.Check the NVME device status
> # lsblk
> NAME                           MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
> sda                              8:0    0 447.1G  0 disk 
> ├─sda1                           8:1    0     1G  0 part /boot
> └─sda2                           8:2    0 446.1G  0 part 
>   ├─rhel_hp--dl385g10--15-root 253:0    0   250G  0 lvm  /
>   └─rhel_hp--dl385g10--15-swap 253:1    0  15.7G  0 lvm  [SWAP]
> The nvme device gone and can't restore
> 
> So i think there is still some problems in scenario2,the nvme device should
> restore when the blockcopy failed.

Hard to tell. What driver is the NVMe device attached to? I remember the host kernel not always re-initializing an NVMe device when it is returned from a guest. Hence, lsblk won't display it even though libvirt has reset the device and returned it to the host. If the NVMe device is still attached to vfio, can you please share debug logs and the host's dmesg?
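
A quick way to answer the driver question, assuming the standard sysfs layout (the PCI address is the one from this report):

# prints the driver currently bound to the function, e.g. .../drivers/vfio-pci or .../drivers/nvme
readlink -f /sys/bus/pci/devices/0000:44:00.0/driver
# lspci shows the same information
lspci -nnk -s 0000:44:00.0 | grep 'Kernel driver in use'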

Comment 24 gaojianan 2020-10-14 01:40:25 UTC
(In reply to Michal Privoznik from comment #23)
> (In reply to gaojianan from comment #21)
> 
> > Scenario2: Valid xml but blockcopy operation failed
> > 1.# cat nvme1.xml
> >   <disk type='nvme' device='disk'>
> >     <driver name='qemu' type='raw'/>
> >     <source type='pci' managed='yes' namespace='1'>
> >       <address domain='0x0000' bus='0x44' slot='0x00' function='0x0'/>
> >     </source>
> >     <target dev='vde' bus='virtio'/>
> >   </disk>
> > 
> > 2.Do blockcopy
> > virsh blockcopy test vda --xml nvme1.xml --wait --verbose --transient-job
> > 
> > Copy failed
> > Here failed because the source image size is not same as target size
> > 
> > 3.Check the NVME device status
> > # lsblk
> > NAME                           MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
> > sda                              8:0    0 447.1G  0 disk 
> > ├─sda1                           8:1    0     1G  0 part /boot
> > └─sda2                           8:2    0 446.1G  0 part 
> >   ├─rhel_hp--dl385g10--15-root 253:0    0   250G  0 lvm  /
> >   └─rhel_hp--dl385g10--15-swap 253:1    0  15.7G  0 lvm  [SWAP]
> > The nvme device gone and can't restore
> > 
> > So i think there is still some problems in scenario2,the nvme device should
> > restore when the blockcopy failed.
> 
> Hard to tell. What driver is the nvme device attached to? I remember host
> kernel not initializing NVMe always when returning from a guest. Hence,
> lsblk won't display it despite libvirt reseting the device and returning it
> back to the host. If the nvme is still attached to vfio can you please share
> debug logs and the hosts dmesg?

I don't think the NVMe device is still attached to vfio, because trying to detach it fails with "Device is not VFIO bound":
# ./vfio_adm --detach --device 0000:44:00.0
Device is not VFIO bound
However, if I first attach it ("./vfio_adm --attach --device 0000:44:00.0") and then detach it again ("./vfio_adm --detach --device 0000:44:00.0"), the NVMe device comes back in 'lsblk'.
From dmesg I can see:
[1273279.118469] nvme nvme7: pci function 0000:44:00.0
[1273279.233548] nvme nvme7: 16/0/0 default/read/poll queues
[1273279.242960]  nvme7n1: p1
[1273333.129854] vfio_ecap_init: 0000:44:00.0 hiding ecap 0x19@0x2a0
[1273335.823750] nvme nvme7: pci function 0000:44:00.0
[1273356.426563] nvme nvme7: Device not ready; aborting initialisation, CSTS=0x0
[1273356.433753] nvme nvme7: Removing after probe failure status: -19

Is this information useful for you?

Comment 25 Michal Privoznik 2020-10-14 06:11:29 UTC
(In reply to gaojianan from comment #24)
> (In reply to Michal Privoznik from comment #23)
> > (In reply to gaojianan from comment #21)
> > 
> > > Scenario2: Valid xml but blockcopy operation failed
> > > 1.# cat nvme1.xml
> > >   <disk type='nvme' device='disk'>
> > >     <driver name='qemu' type='raw'/>
> > >     <source type='pci' managed='yes' namespace='1'>
> > >       <address domain='0x0000' bus='0x44' slot='0x00' function='0x0'/>
> > >     </source>
> > >     <target dev='vde' bus='virtio'/>
> > >   </disk>
> > > 
> > > 2.Do blockcopy
> > > virsh blockcopy test vda --xml nvme1.xml --wait --verbose --transient-job
> > > 
> > > Copy failed
> > > Here failed because the source image size is not same as target size
> > > 
> > > 3.Check the NVME device status
> > > # lsblk
> > > NAME                           MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
> > > sda                              8:0    0 447.1G  0 disk 
> > > ├─sda1                           8:1    0     1G  0 part /boot
> > > └─sda2                           8:2    0 446.1G  0 part 
> > >   ├─rhel_hp--dl385g10--15-root 253:0    0   250G  0 lvm  /
> > >   └─rhel_hp--dl385g10--15-swap 253:1    0  15.7G  0 lvm  [SWAP]
> > > The nvme device gone and can't restore
> > > 
> > > So i think there is still some problems in scenario2,the nvme device should
> > > restore when the blockcopy failed.
> > 
> > Hard to tell. What driver is the nvme device attached to? I remember host
> > kernel not initializing NVMe always when returning from a guest. Hence,
> > lsblk won't display it despite libvirt reseting the device and returning it
> > back to the host. If the nvme is still attached to vfio can you please share
> > debug logs and the hosts dmesg?
> 
> I don't think the NVMe device is still attached to vfio,because i can't use
> "./vfio_adm --detach --device 0000:44:00.0 
> Device is not VFIO bound" to detach it,but i can use it attach first "
> ./vfio_adm --attach --device 0000:44:00.0 

I don't know what vfio_adm does, but it reports that the device is not bound to VFIO, so libvirt has detached it.

> Attaching device 0000:44:00.0 to VFIO/UIO" and then detach it by "./vfio_adm
> --detach --device 0000:44:00.0",the nvme will back when 
> check by 'lsblk'.
> Through demsg i can get:
> [1273279.118469] nvme nvme7: pci function 0000:44:00.0
> [1273279.233548] nvme nvme7: 16/0/0 default/read/poll queues
> [1273279.242960]  nvme7n1: p1
> [1273333.129854] vfio_ecap_init: 0000:44:00.0 hiding ecap 0x19@0x2a0
> [1273335.823750] nvme nvme7: pci function 0000:44:00.0
> [1273356.426563] nvme nvme7: Device not ready; aborting initialisation,
> CSTS=0x0
> [1273356.433753] nvme nvme7: Removing after probe failure status: -19
> 
> Is this information useful for you?

Yes, this is the info I was after. It shows that the problem does not lie within libvirt. I'd blame the kernel, but it may be somewhere between the hardware and the kernel. We should open a kernel bug so that somebody who understands the kernel can look into it. Anyway, in my book this is enough proof that libvirt has unbound the NVMe disk from VFIO => this bug is verified IMO.

Comment 26 gaojianan 2020-10-14 06:16:02 UTC
(In reply to Michal Privoznik from comment #25)
> (In reply to gaojianan from comment #24)
> > (In reply to Michal Privoznik from comment #23)
> > > (In reply to gaojianan from comment #21)
> > > 
> > > > Scenario2: Valid xml but blockcopy operation failed
> > > > 1.# cat nvme1.xml
> > > >   <disk type='nvme' device='disk'>
> > > >     <driver name='qemu' type='raw'/>
> > > >     <source type='pci' managed='yes' namespace='1'>
> > > >       <address domain='0x0000' bus='0x44' slot='0x00' function='0x0'/>
> > > >     </source>
> > > >     <target dev='vde' bus='virtio'/>
> > > >   </disk>
> > > > 
> > > > 2.Do blockcopy
> > > > virsh blockcopy test vda --xml nvme1.xml --wait --verbose --transient-job
> > > > 
> > > > Copy failed
> > > > Here failed because the source image size is not same as target size
> > > > 
> > > > 3.Check the NVME device status
> > > > # lsblk
> > > > NAME                           MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
> > > > sda                              8:0    0 447.1G  0 disk 
> > > > ├─sda1                           8:1    0     1G  0 part /boot
> > > > └─sda2                           8:2    0 446.1G  0 part 
> > > >   ├─rhel_hp--dl385g10--15-root 253:0    0   250G  0 lvm  /
> > > >   └─rhel_hp--dl385g10--15-swap 253:1    0  15.7G  0 lvm  [SWAP]
> > > > The nvme device gone and can't restore
> > > > 
> > > > So i think there is still some problems in scenario2,the nvme device should
> > > > restore when the blockcopy failed.
> > > 
> > > Hard to tell. What driver is the nvme device attached to? I remember host
> > > kernel not initializing NVMe always when returning from a guest. Hence,
> > > lsblk won't display it despite libvirt reseting the device and returning it
> > > back to the host. If the nvme is still attached to vfio can you please share
> > > debug logs and the hosts dmesg?
> > 
> > I don't think the NVMe device is still attached to vfio,because i can't use
> > "./vfio_adm --detach --device 0000:44:00.0 
> > Device is not VFIO bound" to detach it,but i can use it attach first "
> > ./vfio_adm --attach --device 0000:44:00.0 
> 
> I don't know what vfio_adm does, but it reports that the device is not bound
> to VFIO, so libvirt has detached it.
> 
> > Attaching device 0000:44:00.0 to VFIO/UIO" and then detach it by "./vfio_adm
> > --detach --device 0000:44:00.0",the nvme will back when 
> > check by 'lsblk'.
> > Through demsg i can get:
> > [1273279.118469] nvme nvme7: pci function 0000:44:00.0
> > [1273279.233548] nvme nvme7: 16/0/0 default/read/poll queues
> > [1273279.242960]  nvme7n1: p1
> > [1273333.129854] vfio_ecap_init: 0000:44:00.0 hiding ecap 0x19@0x2a0
> > [1273335.823750] nvme nvme7: pci function 0000:44:00.0
> > [1273356.426563] nvme nvme7: Device not ready; aborting initialisation,
> > CSTS=0x0
> > [1273356.433753] nvme nvme7: Removing after probe failure status: -19
> > 
> > Is this information useful for you?
> 
> Yes, this is the info I was after. It shows that the problem does not lie
> within libvirt. I'd blame kernel but it may be somewhere in between HW and
> kernel. We should open a kernel bug for this to have somebody who
> understands kernel look into it. Anyway, in my book this is enough proof
> that libvirt has unbound the NVMe disk from VFIO => this bug is verified IMO.

Verified this bug per comment 21, since the remaining problem is in the kernel.

Comment 28 errata-xmlrpc 2020-11-17 17:48:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:8.3 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5137

Comment 29 Peter Krempa 2021-03-23 09:15:58 UTC
(In reply to Stefan Hajnoczi from comment #19)
> (In reply to Peter Krempa from comment #18)
> > That might be hard to achieve when the destination of the copy is a NVMe
> > drive. I'd suggest that qemu doesn't cause a fatal error if the destination
> > is larger than source.
> 
> NVMe Namespaces or traditional partitions/volumes can be used to divide up
> NVMe drives into correctly-sized pieces. That way free blocks can still be
> used for other purposes later. I think this should be the recommended
> approach to migrating/copying to a bigger NVMe drive.
> 
> About your suggestion:
> How will the destination keep track of the correct virtual disk size? There
> isn't enough information to answer this in QEMU. At best a parameter could
> be added to disable the size check and the user needs to ensure that disk
> size is preserved.

Sorry I've missed the needinfo.

I wanted to just point out that it may be perceived as a regression in certain scenarios, but I'm not sure how real those scenarios are. We can always suggest a workaround with a qcow2 image which has size info stored internally.

