Bug 1615144 - [v2v] fast-zero - to improve performance
Summary: [v2v] fast-zero - to improve performance
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-imageio
Classification: oVirt
Component: Common
Version: 1.4.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.2.6
: ---
Assignee: Nir Soffer
QA Contact: guy chen
URL:
Whiteboard:
Depends On:
Blocks: 1612841
 
Reported: 2018-08-12 16:46 UTC by Daniel Erez
Modified: 2018-09-03 15:10 UTC (History)
6 users

Fixed In Version: ovirt-imageio-{common,daemon}-1.4.3
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-03 15:10:00 UTC
oVirt Team: Scale
Embargoed:
rule-engine: ovirt-4.2?
rule-engine: planning_ack?
rule-engine: devel_ack+
rule-engine: testing_ack+




Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 85512 0 'None' MERGED ioutil: Add fallocate() 2020-08-25 05:01:49 UTC
oVirt gerrit 85537 0 'None' MERGED ioutil: Add ioutil extension module 2020-08-25 05:01:48 UTC
oVirt gerrit 92719 0 'None' MERGED directio: Implement fast zero for block storage 2020-08-25 05:01:48 UTC
oVirt gerrit 92870 0 'None' MERGED directio: Add GenericIO backend 2020-08-25 05:01:48 UTC
oVirt gerrit 92871 0 'None' MERGED directio: Implement fast zero for preallocated files 2020-08-25 05:01:48 UTC
oVirt gerrit 92872 0 'None' MERGED tests: Add fallocate test script 2020-08-25 05:01:48 UTC
oVirt gerrit 92873 0 'None' MERGED directio: Use fallocate() for block devices 2020-08-25 05:01:48 UTC

Description Daniel Erez 2018-08-12 16:46:40 UTC
To improve performance, we should use the fastest method available for zeroing a range on the storage. This RFE tracks the performance improvement of using fast zero.

Comment 1 Nir Soffer 2018-08-12 18:25:49 UTC
Since the first version of imageio, we had no support for efficient zeroing.
Uploading a sparse file used to convert holes to actual zeros, sent over the
wire and written to storage.

In 1.3.0, we added a zero API (see PATCH/zero):
http://ovirt.github.io/ovirt-imageio/random-io.html#patch
This avoids sending zeros over the wire, but was implemented by writing actual
zeros to storage.

While pretty fast (we can write 720 MiB/s with fast FC storage), this is much
slower than proper APIs like fallocate() and ioctl(BLKZEROOUT). It also
creates unnecessary I/O and consumes a huge amount of network bandwidth when
using iSCSI storage.

In 1.4.3, we re-implemented the zero API on top of these syscalls, using the
fastest method available for the underlying storage.

For file based storage, we use:

1. fallocate(FALLOC_FL_ZERO_RANGE)

2. If that is not supported (even NFS 4.2 does not support it yet), we fall
   back to combining hole punching and fallocate:

   fallocate(FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE)
   fallocate(0)

   Using 2 syscalls is slower, but still much faster than manual zeroing.

3. If we are writing past the end of the file, just fallocate(0).

4. If everything fails, we fall back to manually writing zeros.
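
The fallback chain above can be sketched in Python. This is a simplified
illustration, not the actual ovirt-imageio code: the helper names are made up,
the flag values are from linux/falloc.h, and the special case for writing past
the end of the file (plain fallocate(0)) is omitted.

```python
import ctypes
import ctypes.util
import os

# Flag values from linux/falloc.h.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02
FALLOC_FL_ZERO_RANGE = 0x10

_libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
_libc.fallocate.argtypes = [
    ctypes.c_int, ctypes.c_int, ctypes.c_int64, ctypes.c_int64]


def _fallocate(fd, mode, offset, length):
    if _libc.fallocate(fd, mode, offset, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def zero_range(fd, offset, length):
    """Zero a range in a file, trying the fastest method first."""
    try:
        # 1. Ask the file system to zero the range directly.
        _fallocate(fd, FALLOC_FL_ZERO_RANGE, offset, length)
        return
    except OSError:
        pass
    try:
        # 2. Punch a hole, then reallocate the range.
        _fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                   offset, length)
        _fallocate(fd, 0, offset, length)
        return
    except OSError:
        pass
    # 4. Last resort: manually write zeros.
    os.pwrite(fd, b"\0" * length, offset)
```

Whatever branch is taken, the result is the same: the range reads back as
zeros; only the amount of actual I/O differs.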

For block based storage we use:

1. fallocate(FALLOC_FL_ZERO_RANGE) - This is supported for block storage since
   kernel 4.9, but not yet supported on RHEL 7.

2. ioctl(BLKZEROOUT) - This is well supported; it is the same method vdsm has
   used to wipe disks since 4.2 (via the blkdiscard command).

There is no fallback for BLKZEROOUT, since the kernel already falls back to
manual zeroing if the storage does not support efficient zeroing.
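
For reference, the BLKZEROOUT call is a single ioctl. This is a hedged sketch
(the function name is made up, not the ovirt-imageio code); the request number
is _IO(0x12, 127) from linux/fs.h, and the argument is a pair of 64-bit values
(start, length).

```python
import fcntl
import struct

# BLKZEROOUT = _IO(0x12, 127) from linux/fs.h, i.e. 0x127f.
BLKZEROOUT = (0x12 << 8) | 127


def zero_block_range(fd, offset, length):
    """Zero a range on an open block device. The kernel falls back to
    writing zeros if the device has no efficient zeroing support."""
    # The ioctl argument is struct { uint64_t start; uint64_t len; }.
    fcntl.ioctl(fd, BLKZEROOUT, struct.pack("@QQ", offset, length))
```

Actually running this requires a read-write file descriptor on a real block
device (e.g. /dev/vgname/lvname), so it needs root privileges.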

Comment 2 Nir Soffer 2018-08-12 18:34:22 UTC
Here are some performance results with fast-zero patches.

Tested on Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz server with 40 cores, connected
to XtremIO storage via 4G FC HBAs, and 4 paths to storage.

1. Created fedora 27 image

# virt-builder fedora-27 -o fedora-27.img
...
                   Output file: fedora-27.img
                   Output size: 6.0G
                 Output format: raw
            Total usable space: 5.3G
                    Free space: 4.4G (81%)

2. Copy image using dd (for reference)

# time dd if=fedora-27.img \
    of=/dev/vgname/lvname \
    bs=8M \
    oflag=direct \
    conv=fsync

6442450944 bytes (6.4 GB) copied, 11.2388 s, 573 MB/s

real	0m11.243s
user	0m0.003s
sys	0m3.469s

3. Upload using examples/upload script

# time examples/upload fedora-27.img https://server:54322/images/test

real	0m3.294s
user	0m0.269s
sys	0m0.424s

4. Copy image using qemu-img

# time qemu-img convert -f raw -O raw -t none -T none fedora-27.img \
    /dev/vgname/lvname

real	0m13.528s
user	0m0.608s
sys	0m1.525s

5. 4 concurrent uploads

# for n in $(seq 4); do
      (time ./upload fedora-27.img https://server:54322/images/fedora-27-0$n &)
  done

real	0m6.795s
user	0m0.301s
sys	0m0.505s

real	0m6.800s
user	0m0.289s
sys	0m0.510s

real	0m6.823s
user	0m0.326s
sys	0m0.463s

real	0m6.831s
user	0m0.340s
sys	0m0.452s

Comment 3 Nir Soffer 2018-08-12 19:01:08 UTC
Here are performance results with concurrent virt-v2v import. Tested on the same
server and storage as in comment 2.

## image info

The image was created by installing a Fedora 28 server on a 100g block based
preallocated disk.

To populate the image with data, I pulled the current kernel tree using git
and built a kernel using "make olddefconfig". This generated 19G of data in
the image. Then I duplicated most of the linux directory to get a 33G used
image.

Finally, I shut down the VM and copied the disk to NFS storage using qemu-img
convert.

The NFS server is another server with the same spec, exporting a LUN from
XtremIO formatted with xfs, over a single 10G NIC. The export is mounted using
NFS 4.2.

# ls -lhs /var/tmp/nfs/fedora-28-33g.img 
33G -rw-r--r--. 1 root root 100G Aug 12 01:19 /var/tmp/nfs/fedora-28-33g.img

total segments: 21180

data segments:  10590
min data:       4096
max data:       1503703040
avg data:       3254189

zero segments: 10590
min zero:       4096
max zero:       12722929664
avg zero:       6885015
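
The script that produced the segment statistics above is not attached to the
bug; the following hypothetical helper shows one way to collect such numbers
on Linux, using lseek() with SEEK_DATA/SEEK_HOLE to walk the allocated
extents of a sparse file.

```python
import os


def image_segments(path):
    """Yield (kind, length) tuples, where kind is "data" or "zero",
    by walking the file's extents with SEEK_DATA/SEEK_HOLE."""
    with open(path, "rb") as f:
        fd = f.fileno()
        end = os.fstat(fd).st_size
        offset = 0
        while offset < end:
            try:
                data = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError:
                # No more data extents; the rest of the file is a hole.
                yield ("zero", end - offset)
                return
            if data > offset:
                yield ("zero", data - offset)
            hole = os.lseek(fd, data, os.SEEK_HOLE)
            yield ("data", hole - data)
            offset = hole
```

On file systems without hole reporting, the whole file is reported as a
single data segment, so the totals still add up to the file size.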

## virt-v2v command

I used the following script:

# cat v2v-33g-nfs.sh 
virt-v2v \
    -i disk /var/tmp/nfs/fedora-28-19g.img \
    -o rhv-upload \
    -oc https://engine/ovirt-engine/api \
    -os nsoffer-fc1 \
    -on v2v-33g-nfs-$1 \
    -op /var/tmp/password \
    -of raw \
    -oa preallocated \
    -oo rhv-cafile=ca.pem \
    -oo rhv-cluster=nsoffer-fc-el7 \
    -oo rhv-direct=true

I ran it in parallel like this:

# for n in $(seq 10); do
    (sh v2v-33g-nfs.sh $n >v2v-33g-10-from-nfs/1/$n.log 2>&1 &)
done

I also tried uploading from FC and from a local file, with similar results.
Looking at iostat on the server, there are almost no reads on the NFS LUN, so
I guess most of the data is cached on the client side (this server has 500G
RAM).

## import stats

time: 1323 seconds
rate (total): 773 MiB/s
rate (data): 255 MiB/s

I based the calculation on the slowest import:

[  66.9] Assigning disks to buses
[  66.9] Copying disk 1/1 to qemu URI json:{ "file.driver": "nbd", "file.path": "/var/tmp/rhvupload.XQl9Wg/nbdkit0.sock", "file.export": "/" } (raw)
    (100.00/100%)
[1303.4] Creating output metadata
[1323.0] Finishing off

## Upload stats

The longest part of the import is the actual transfer:

time: 1216 seconds
rate (total): 842 MiB/s
rate (data): 277 MiB/s
requests: 663486
requests/s: 545 req/s
avg request time: 1.8 milliseconds

Based on imageio daemon logs - from first OPTIONS request to last FLUSH request:

# grep OPTIONS daemon.log | head -1
2018-08-12 01:27:32,959 INFO    (Thread-1931) [images] [10.35.68.25] OPTIONS ticket=8f30a6b0-acea-4e3c-b030-50dba49c1a14

# grep FLUSH daemon.log | tail -1
2018-08-12 01:47:48,416 INFO    (Thread-1977) [images] [local] FLUSH ticket=b8649b05-2cdd-4710-9d49-c7845d36bb3a

# wc -l daemon.log 
663486 daemon.log
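
The rates above can be reproduced from the raw numbers in this comment: 10
concurrent imports of a 100G preallocated disk holding 33G of data each. This
is just a back-of-the-envelope check, not the output of any tool.

```python
GIB = 1024  # MiB per GiB

disks = 10
virtual_mib = disks * 100 * GIB  # 100G preallocated disk each
data_mib = disks * 33 * GIB      # 33G of data each

import_time = 1323  # seconds, slowest import
upload_time = 1216  # seconds, first OPTIONS to last FLUSH
requests = 663486   # lines in daemon.log

print("import rate (total): %d MiB/s" % (virtual_mib // import_time))
print("import rate (data):  %d MiB/s" % (data_mib // import_time))
print("upload rate (total): %d MiB/s" % (virtual_mib // upload_time))
print("upload rate (data):  %d MiB/s" % (data_mib // upload_time))
print("requests/s: %d" % (requests // upload_time))
print("avg request time: %.1f ms" % (upload_time * 1000.0 / requests))
```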

Comment 4 Sandro Bonazzola 2018-08-14 19:14:10 UTC
We're releasing 4.2.6 RC2 today, including v1.4.3, which references this bug. Can you please check this bug's status?

Comment 5 Nir Soffer 2018-08-14 19:36:03 UTC
(In reply to Sandro Bonazzola from comment #4)
Bug should be fixed in 1.4.3, but not tested by QE yet.

Comment 6 Nir Soffer 2018-08-14 23:49:25 UTC
We have a downstream build, moving to ON_QA

Comment 7 guy chen 2018-08-28 07:27:04 UTC
In a load run on 19.8 with ovirt-imageio-daemon-1.4.3, V2V migration times for 10 VMs with 100GB disks to FC were greatly improved following the upgrade with the zero code.
Case 8 (disk 33% full) was reduced from 50 minutes to 27 minutes, and case 8a (disk 66% full) from 75 minutes to 42 minutes.

