To improve performance, we should use the fastest method for zeroing a range on the storage. This RFE tracks the performance improvement of using fast zeroing.
Since the first version of imageio, we had no support for efficient zeroing. Uploading a sparse file used to convert holes to actual zeroes, sent over the wire and written to storage.

In 1.3.0, we added a zero API (see PATCH/zero): http://ovirt.github.io/ovirt-imageio/random-io.html#patch
This avoids sending zeros on the wire, but was implemented by writing actual zeros to storage. While pretty fast (we can write 720 MiB/s with fast FC storage), this is much slower compared with proper APIs like fallocate() and ioctl(BLKZEROOUT). It also creates unnecessary I/O and consumes a huge amount of network bandwidth when using iSCSI storage.

In 1.4.3, we re-implemented the zero API using the proper APIs, choosing the fastest method for the underlying storage (a minimal sketch of both paths follows below).

For file based storage, we use:

1. fallocate(FALLOC_FL_ZERO_RANGE)

2. If not supported (even NFS 4.2 does not support this yet), we fall back to combining hole punching and fallocate:
       fallocate(FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE)
       fallocate(0)
   Using 2 syscalls is slower but much faster compared with manual zeroing.

3. If we are writing after the end of the file, just fallocate(0).

4. If everything failed, we fall back to manually writing zeros.

For block based storage we use:

1. fallocate(FALLOC_FL_ZERO_RANGE) - supported for block storage since kernel 4.9, but not supported yet on RHEL 7.

2. ioctl(BLKZEROOUT) - well supported; this is the same method used in vdsm to wipe disks since 4.2 (via the blkdiscard command). There is no fallback for BLKZEROOUT, since the kernel already implements a fallback to manual zeroing if the storage does not support efficient zeroing.
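To make the fallback order above concrete, here is a minimal sketch (not the actual imageio code) of the file and block zeroing paths. Constants are taken from linux/falloc.h and linux/fs.h; the ctypes wrapper, error handling, and the "write past end of file" case (step 3) are simplified.

    # Sketch of the zeroing fallback chain; assumes 64-bit Linux and glibc.
    import ctypes, errno, fcntl, os, struct

    libc = ctypes.CDLL("libc.so.6", use_errno=True)
    libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_long, ctypes.c_long]

    FALLOC_FL_KEEP_SIZE = 0x01
    FALLOC_FL_PUNCH_HOLE = 0x02
    FALLOC_FL_ZERO_RANGE = 0x10
    BLKZEROOUT = 0x127f  # _IO(0x12, 127) from linux/fs.h

    def fallocate(fd, mode, offset, length):
        if libc.fallocate(fd, mode, offset, length) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))

    def zero_file_range(fd, offset, length):
        """Zero a range on file storage, fastest method first."""
        try:
            # 1. Zero the range without sending any data.
            fallocate(fd, FALLOC_FL_ZERO_RANGE, offset, length)
            return
        except OSError as e:
            if e.errno != errno.EOPNOTSUPP:
                raise
        try:
            # 2. Punch a hole, then allocate the range again so it stays allocated.
            fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, length)
            fallocate(fd, 0, offset, length)
            return
        except OSError as e:
            if e.errno != errno.EOPNOTSUPP:
                raise
        # 4. Last resort: write actual zeros (step 3, past-EOF fallocate(0), omitted).
        os.pwrite(fd, b"\0" * length, offset)

    def zero_block_range(fd, offset, length):
        """Zero a range on a block device; the kernel falls back to manual
        zeroing if the device has no efficient zero-out support."""
        fcntl.ioctl(fd, BLKZEROOUT, struct.pack("QQ", offset, length))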
Here are some performance results with the fast-zero patches. Tested on an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz server with 40 cores, connected to XtremIO storage via 4G FC HBAs, and 4 paths to storage.

1. Created fedora 27 image

# virt-builder fedora-27 -o fedora-27.img
...
Output file: fedora-27.img
Output size: 6.0G
Output format: raw
Total usable space: 5.3G
Free space: 4.4G (81%)

2. Copy image using dd (for reference)

# time dd if=fedora-27.img \
    of=/dev/vgname/lvname \
    bs=8M \
    oflag=direct \
    conv=fsync
6442450944 bytes (6.4 GB) copied, 11.2388 s, 573 MB/s

real    0m11.243s
user    0m0.003s
sys     0m3.469s

3. Upload using examples/upload script (a sketch of the sparse upload logic follows after these results)

# time examples/upload fedora-27.img https://server:54322/images/test

real    0m3.294s
user    0m0.269s
sys     0m0.424s

4. Copy image using qemu-img

# time qemu-img convert -f raw -O raw -t none -T none fedora-27.img \
    /dev/vgname/lvname

real    0m13.528s
user    0m0.608s
sys     0m1.525s

5. 4 concurrent uploads

# for n in $(seq 4); do
    (time ./upload fedora-27.img https://server:54322/images/fedora-27-0$n &)
  done

real    0m6.795s
user    0m0.301s
sys     0m0.505s

real    0m6.800s
user    0m0.289s
sys     0m0.510s

real    0m6.823s
user    0m0.326s
sys     0m0.463s

real    0m6.831s
user    0m0.340s
sys     0m0.452s
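A minimal sketch (not the actual examples/upload script) of why the upload beats dd: holes in the sparse image are found with SEEK_DATA/SEEK_HOLE and turned into PATCH/zero requests, so only real data crosses the wire. The Content-Range and JSON field names follow the random-io docs linked above and are illustrative only; the real script also sets up TLS with the CA certificate, chunks large segments, and checks response status.

    import json
    import os
    from http import client
    from urllib.parse import urlparse

    def segments(fd, size):
        """Yield ("data"|"zero", start, length) covering the whole file."""
        offset = 0
        while offset < size:
            try:
                data = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError:  # ENXIO: nothing but a hole until end of file
                yield "zero", offset, size - offset
                return
            if data > offset:
                yield "zero", offset, data - offset
            hole = os.lseek(fd, data, os.SEEK_HOLE)
            yield "data", data, hole - data
            offset = hole

    def upload(path, url):
        u = urlparse(url)
        con = client.HTTPSConnection(u.netloc)  # plus an ssl context in a real client
        with open(path, "rb") as f:
            size = os.fstat(f.fileno()).st_size
            for kind, start, length in segments(f.fileno(), size):
                if kind == "data":
                    f.seek(start)
                    con.request("PUT", u.path, body=f.read(length), headers={
                        "Content-Range": "bytes %d-%d/*" % (start, start + length - 1)})
                else:
                    con.request("PATCH", u.path, body=json.dumps(
                        {"op": "zero", "offset": start, "size": length}), headers={
                        "Content-Type": "application/json"})
                con.getresponse().read()
        # A real client would finish with a PATCH flush request.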
Here are performance results with concurrent virt-v2v imports. Tested on the same server and storage as in comment 2.

## image info

The image was created by installing a fedora 28 server on a 100g block based preallocated disk. To populate the image with data, I pulled the current kernel tree using git, and built a kernel using "make olddefconfig". This generated 19G of data in the image. Then I duplicated most of the linux directory to get a 33G used image. Finally, I shut down the vm and copied the disk to NFS storage using qemu-img convert.

The NFS server is another server with the same spec, exporting a LUN from XtremIO formatted with xfs, over a single 10G nic. The export is mounted using NFS 4.2.

# ls -lhs /var/tmp/nfs/fedora-28-33g.img
33G -rw-r--r--. 1 root root 100G Aug 12 01:19 /var/tmp/nfs/fedora-28-33g.img

total segments: 21180
data segments: 10590
  min data: 4096
  max data: 1503703040
  avg data: 3254189
zero segments: 10590
  min zero: 4096
  max zero: 12722929664
  avg zero: 6885015

## virt-v2v command

I ran this script in parallel:

# cat v2v-33g-nfs.sh
virt-v2v \
    -i disk /var/tmp/nfs/fedora-28-19g.img \
    -o rhv-upload \
    -oc https://engine/ovirt-engine/api \
    -os nsoffer-fc1 \
    -on v2v-33g-nfs-$1 \
    -op /var/tmp/password \
    -of raw \
    -oa preallocated \
    -oo rhv-cafile=ca.pem \
    -oo rhv-cluster=nsoffer-fc-el7 \
    -oo rhv-direct=true

I ran it in parallel like this:

# for n in $(seq 10); do
    (sh v2v-33g-nfs.sh $n >v2v-33g-10-from-nfs/1/$n.log 2>&1 &)
  done

I also tried to upload from FC and from a local file with similar results. Looking at iostat on the server, there are almost no reads on the nfs LUN, so I guess most of the data is cached on the client side (this server has 500G RAM).

## import stats

time: 1323 seconds
rate (total): 773 MiB/s
rate (data): 255 MiB/s

I based the calculation on the slowest import (the derivation of these rates is sketched at the end of this comment):

[  66.9] Assigning disks to buses
[  66.9] Copying disk 1/1 to qemu URI json:{ "file.driver": "nbd", "file.path": "/var/tmp/rhvupload.XQl9Wg/nbdkit0.sock", "file.export": "/" } (raw) (100.00/100%)
[1303.4] Creating output metadata
[1323.0] Finishing off

## Upload stats

The longest part of the import is the actual transfer:

time: 1216 seconds
rate (total): 842 MiB/s
rate (data): 277 MiB/s
requests: 663486
requests/s: 545 req/s
avg request time: 1.8 milliseconds

Based on the imageio daemon logs - from the first OPTIONS request to the last FLUSH request:

# grep OPTIONS daemon.log | head -1
2018-08-12 01:27:32,959 INFO    (Thread-1931) [images] [10.35.68.25] OPTIONS ticket=8f30a6b0-acea-4e3c-b030-50dba49c1a14

# grep FLUSH daemon.log | tail -1
2018-08-12 01:47:48,416 INFO    (Thread-1977) [images] [local] FLUSH ticket=b8649b05-2cdd-4710-9d49-c7845d36bb3a

# wc -l daemon.log
663486 daemon.log
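For reference, the rates above can be reproduced from the numbers in this comment, assuming "total" counts the 100G virtual size and "data" the 33G of used space, across all 10 concurrent imports:

    # Rates for 10 concurrent imports of a 100G image with 33G of data.
    vms = 10
    virtual_gib = 100
    data_gib = 33

    import_seconds = 1323    # slowest import, from "Finishing off"
    transfer_seconds = 1216  # first OPTIONS to last FLUSH in the daemon log

    print(vms * virtual_gib * 1024 / import_seconds)    # ~774 MiB/s (import, total)
    print(vms * data_gib * 1024 / import_seconds)       # ~255 MiB/s (import, data)
    print(vms * virtual_gib * 1024 / transfer_seconds)  # ~842 MiB/s (upload, total)
    print(vms * data_gib * 1024 / transfer_seconds)     # ~278 MiB/s (upload, data)
    print(663486 / transfer_seconds)                    # ~545 requests/s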
We're releasing 4.2.6 RC2 today, including v1.4.3 which references this bug. Can you please check this bug's status?
(In reply to Sandro Bonazzola from comment #4) The bug should be fixed in 1.4.3, but it has not been tested by QE yet.
We have a downstream build, moving to ON_QA
From a load run on 19.8 with ovirt-imageio-daemon-1.4.3, with V2V migration of 10 VMs with 100GB disks to FC, times were greatly improved following the upgrade with the zero code. Case 8 (disk 33% full) was reduced from 50 minutes to 27 minutes, and case 8a (disk 66% full) from 75 minutes to 42 minutes.