Bug 1565467 - [RHHI] Creation of VMs taking a long time with VDO device
Summary: [RHHI] Creation of VMs taking a long time with VDO device
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: unspecified
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ovirt-4.3.1
Target Release: 4.3.0
Assignee: Sahina Bose
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On: 1613425
Blocks: 1565462
 
Reported: 2018-04-10 05:08 UTC by bipin
Modified: 2019-04-28 10:03 UTC
CC List: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1565462
Environment:
Last Closed: 2019-02-08 01:02:42 UTC
oVirt Team: Gluster
Target Upstream Version:
Embargoed:


Attachments
before qemu-img-convert (6.92 KB, text/plain) - 2018-06-22 05:45 UTC, Krutika Dhananjay
after qemu-img-convert (6.92 KB, text/plain) - 2018-06-22 05:49 UTC, Krutika Dhananjay

Description bipin 2018-04-10 05:08:59 UTC
+++ This bug was initially created as a clone of Bug #1565462 +++

Description of problem:

Creating VMs via RHV Manager takes a long time. I was creating 20+ RHEL 7.5 VMs and they stayed in the "Image locked" state for about 45 minutes.


Version-Release number of selected component (if applicable):
redhat-release-virtualization-host-4.2-2.1.el7.x86_64
rhvm-4.2.2.6-0.1.el7.noarch
ovirt-hosted-engine-setup-2.2.15-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.9-1.el7ev.noarch
glusterfs-3.12.2-7.el7rhgs.x86_64

How reproducible:
100%


Steps to Reproduce:
1. Login to the HE
2. Create a RHEL 7.5 VM template, then create 20+ VMs from it in parallel
3. The VMs stay frozen in the "Image locked" state for a long time

Actual results:
VM image creation takes a long time.


Expected results:
VM creation should not take this long.

Additional info:

Cluster info: 3 node

Storage type : Glusterfs

Comment 2 Yaniv Kaul 2018-04-10 05:45:32 UTC
1. Missing logs (engine, VDSM)
2. Why is it cloned? It's either an RHHI bug or an RHV one; there's no need for both.
3. How long does it take for a single VM? 
4. What's the NIC speed between the nodes?

(missing a lot of details for this BZ!)

Comment 10 Sahina Bose 2018-04-23 12:31:11 UTC
Bipin, are you able to reproduce this now that the gluster network has been moved to a 10GbE network?

Comment 13 Yaniv Kaul 2018-04-24 06:19:05 UTC
I assume the disk is a bit large (200G?) - then if so, it's the preallocation that takes that long, isn't it?
In 4.2 I believe we've switched to using fallocate, which is far faster?

Comment 14 Sahina Bose 2018-04-25 09:51:03 UTC
(In reply to Yaniv Kaul from comment #13)
> I assume the disk is a bit large (200G?) - then if so, it's the
> preallocation that takes that long, isn't it?
> In 4.2 I believe we've switched to using fallocate, which is far faster?

This is VM creation from template. I don't think there's pre-allocation here.

Bipin,

To analyse performance issues, you will need to provide gluster volume profile stats and iostats too.

Also, is VDO enabled in the stack?
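
For reference, a rough sketch of how those stats are usually gathered (the volume name "vmstore" is assumed from the mount paths seen later in this bug; substitute the actual data volume):

gluster volume profile vmstore start
iostat -xm 5 > /tmp/iostat.txt &      # extended per-device stats, sampled every 5s
# ... run the VM creation workload here ...
gluster volume profile vmstore info > /tmp/vmstore-profile.txt
kill %1                               # stop iostat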

Comment 16 Sahina Bose 2018-04-27 08:23:20 UTC
Krutika, could you look at the stats provided and see if anything stands out?
If it's VDO related, we can ask Dennis to look at it.

Comment 30 Sandro Bonazzola 2018-06-15 12:14:20 UTC
This bug is not marked as blocker and we entered blocker only phase for 4.2.4.
Please consider re-targeting to 4.2.5.

Comment 31 Krutika Dhananjay 2018-06-20 07:24:10 UTC
Did some experiments with dd as suggested by Dennis (thanks Sahina for helping with the setup!).

Plain distribute single-brick volume
====================================
Command used:
dd if=/dev/urandom of=/mnt/foo bs=4096 count=10 oflag=direct

Before 4K dd:
============
[root@rhsdev-grafton3 ~]# vdostats /dev/mapper/vdo_sdc --verbose | grep partial
  bios in partial read                : 1
  bios in partial write               : 1043596
  bios in partial discard             : 0
  bios in partial flush               : 0
  bios in partial fua                 : 0
  bios acknowledged partial read      : 10
  bios acknowledged partial write     : 1043596
  bios acknowledged partial discard   : 0
  bios acknowledged partial flush     : 0
  bios acknowledged partial fua       : 0

After 4K dd:
===========

[root@rhsdev-grafton3 ~]# vdostats /dev/mapper/vdo_sdc --verbose | grep partial
  bios in partial read                : 1
  bios in partial write               : 1043596
  bios in partial discard             : 0
  bios in partial flush               : 0
  bios in partial fua                 : 0
  bios acknowledged partial read      : 10
  bios acknowledged partial write     : 1043596
  bios acknowledged partial discard   : 0
  bios acknowledged partial flush     : 0
  bios acknowledged partial fua       : 0

But after dd with bs=512 (count=80):
===============================

[root@rhsdev-grafton3 ~]# vdostats /dev/mapper/vdo_sdc --verbose | grep partial
  bios in partial read                : 1
  bios in partial write               : 1043666
  bios in partial discard             : 0
  bios in partial flush               : 0
  bios in partial fua                 : 0
  bios acknowledged partial read      : 10
  bios acknowledged partial write     : 1043666
  bios acknowledged partial discard   : 0
  bios acknowledged partial flush     : 0
  bios acknowledged partial fua       : 0

The number of partial writes didn't change with 4K writes on the gluster volume.

Next, to see whether shard's metadata update operations through setxattr were leading to sub-4K writes, I created a plain distribute but sharded volume with a 4MB shard block size and ran the same test:

Plain distribute but sharded volume:
====================================
dd if=/dev/urandom of=/mnt/foo bs=4096 count=10240 oflag=direct

Before 4K dd:
============

[root@rhsdev-grafton3 ~]# vdostats /dev/mapper/vdo_sdc --verbose | grep partial
  bios in partial read                : 1
  bios in partial write               : 1043736
  bios in partial discard             : 0
  bios in partial flush               : 0
  bios in partial fua                 : 0
  bios acknowledged partial read      : 10
  bios acknowledged partial write     : 1043736
  bios acknowledged partial discard   : 0
  bios acknowledged partial flush     : 0
  bios acknowledged partial fua       : 0

After 4K dd:
=========== 
[root@rhsdev-grafton3 ~]# vdostats /dev/mapper/vdo_sdc --verbose | grep partial
  bios in partial read                : 1
  bios in partial write               : 1043736
  bios in partial discard             : 0
  bios in partial flush               : 0
  bios in partial fua                 : 0
  bios acknowledged partial read      : 10
  bios acknowledged partial write     : 1043736
  bios acknowledged partial discard   : 0
  bios acknowledged partial flush     : 0
  bios acknowledged partial fua       : 0


And from the brick profile:
      6.02      23.97 us      18.00 us      84.00 us          10240    FXATTROP

There were as many fxattrops (= fgetxattr + modify in-memory + fsetxattr) as there were writes, but still no partial writes were seen.

Interval 0 Stats:
   Block Size:               4096b+ 
 No. of Reads:                    0 
No. of Writes:                20490 

At least so far it doesn't seem like the sub-4K writes are originating from gluster. With direct I/O, the gluster stack uses the same block size that it receives from the application.

Will also profile the actual vmstore volume on the client side to see what write block sizes it receives, and get back.
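
For reference, a minimal sketch of the before/after counter check used above (the VDO device and mount point are the ones from this setup; the temp file names are placeholders):

vdostats /dev/mapper/vdo_sdc --verbose | grep partial > /tmp/partial.before
dd if=/dev/urandom of=/mnt/foo bs=4096 count=10 oflag=direct
vdostats /dev/mapper/vdo_sdc --verbose | grep partial > /tmp/partial.after
diff /tmp/partial.before /tmp/partial.after   # empty diff => no new sub-4K bios from the workload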

Comment 32 Yaniv Kaul 2018-06-20 08:27:30 UTC
(In reply to Krutika Dhananjay from comment #31)
> Did some experiments with dd as suggested by Dennis (thanks Sahina for
> helping with the setup!).
> 
> Plain distribute single-brick volume
> ====================================
> Command used:
> dd if=/dev/urandom of=/mnt/foo bs=4096 count=10 oflag=direct

That's not a valid test, for 2 reasons:
1. VDO is completely ineffective here - it cannot compress nor dedup this data, coming from urandom.
2. The use of 4K is also artificial. Some writes are smaller than that, and many are bigger. I strongly suggest testing with a more realistic workload, or at least with multiple block sizes.

Example - https://blog.purestorage.com/an-analysis-of-io-size-modalities-on-pure-storage-flasharrays/

Comment 33 Sahina Bose 2018-06-20 08:49:53 UTC
(In reply to Yaniv Kaul from comment #32)
> (In reply to Krutika Dhananjay from comment #31)
> > Did some experiments with dd as suggested by Dennis (thanks Sahina for
> > helping with the setup!).
> > 
> > Plain distribute single-brick volume
> > ====================================
> > Command used:
> > dd if=/dev/urandom of=/mnt/foo bs=4096 count=10 oflag=direct
> 
> That's not a valid test, for 2 reasons:
> 1. VDO is completely ineffective here - it cannot compress nor dedup this
> data, coming from urandom.
> 2. The use of 4K is also artificial. Some writes are smaller than that, many
> are bigger. I highly suggest testing with some more realistic workload, or
> at least multiple block sizes.
> 
> Example -
> https://blog.purestorage.com/an-analysis-of-io-size-modalities-on-pure-
> storage-flasharrays/

Yaniv, this was merely a test to check where the sub-4K writes in the clone operation were originating from. We wanted to see whether a 4K write to the gluster volume gets translated into sub-4K writes, either due to metadata operations that gluster performs or due to alignment issues.

Comment 34 Dennis Keefe 2018-06-20 14:35:39 UTC
Thank you for testing this.
Would it be possible to use the same setup to clone a VM and provide the VDO stats?

This is what I found in the ps output when cloning a VM. Could the same command be used?

vdsm     23460  1.1  0.0 392240 32776 ?        S<l  00:03   0:00 /usr/bin/qemu-img convert -p -t none -T none -f qcow2 /rhev/data-center/mnt/glusterSD/rhsdev-grafton2.lab.eng.blr.redhat.com:_vmstore/288cfe58-913e-48f9-82a3-bb966e00ba15/images/bf2fa147-9f92-4a0d-9d15-30d94c3c5e36/c6ae0472-1147-4eb7-8c7f-32efb8523ea2 -O qcow2 -o compat=1.1 /rhev/data-center/mnt/glusterSD/rhsdev-grafton2.lab.eng.blr.redhat.com:_vmstore/288cfe58-913e-48f9-82a3-bb966e00ba15/images/8d88d52f-3091-402a-80aa-37a927ea26fc/57f85eef-1051-4a58-9e7a-350b7ccbfb4d

qemu-img convert -p -t none -T none -f qcow2 < source image> -O qcow2 -o compat=1.1 <destination image>

-p = progress
-t = cache
-T = source cache
-f = format
-O = output file
-o = option (in this case "compat=1.1")
compat=1.1 - use the qcow2 v1.1 format
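
For illustration, a hedged sketch of timing such a clone by hand with the same flags (the image paths below are placeholders, not paths from this setup):

time qemu-img convert -p -t none -T none -f qcow2 /path/to/src.qcow2 \
     -O qcow2 -o compat=1.1 /path/to/dst.qcow2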


Also, would using a cache in the cloning process help with performance and avoid sub-4K writes?

-t cache
           specifies the cache mode that should be used with the (destination) file. See the documentation of the emulator's "-drive
           cache=..." option for allowed values.

-T src_cache
           specifies the cache mode that should be used with the source file(s). See the documentation of the emulator's "-drive
           cache=..." option for allowed values.

Comment 35 Yaniv Kaul 2018-06-20 14:40:19 UTC
(In reply to Dennis Keefe from comment #34)
> Thank you for testing this.
> Would it be possible to use the same setup to clone a VM and provide the
> VDO stats?
> 
> This is what I found in the ps output when cloning a VM. Could the same
> command be used?
> 
> vdsm     23460  1.1  0.0 392240 32776 ?        S<l  00:03   0:00
> /usr/bin/qemu-img convert -p -t none -T none -f qcow2
> /rhev/data-center/mnt/glusterSD/rhsdev-grafton2.lab.eng.blr.redhat.com:
> _vmstore/288cfe58-913e-48f9-82a3-bb966e00ba15/images/bf2fa147-9f92-4a0d-9d15-
> 30d94c3c5e36/c6ae0472-1147-4eb7-8c7f-32efb8523ea2 -O qcow2 -o compat=1.1
> /rhev/data-center/mnt/glusterSD/rhsdev-grafton2.lab.eng.blr.redhat.com:
> _vmstore/288cfe58-913e-48f9-82a3-bb966e00ba15/images/8d88d52f-3091-402a-80aa-
> 37a927ea26fc/57f85eef-1051-4a58-9e7a-350b7ccbfb4d
> 

Sounds like the command we use indeed.

> qemu-img convert -p -t none -T none -f qcow2 < source image> -O qcow2 -o
> compat=1.1 <destination image>
> 
> -p = progress
> -t = cache
> -T = source cache
> -f = format
> -O = output file
> -o = option (in this case "compat=1.1")
> compat=1.1 - use the qcow2 v 1.1 format
> 
> 
> Also, would using a cache in the cloning process help with performance and
> avoid sub-4K writes?

No, for a few reasons:
1. It will occupy memory on the host, which might otherwise be used for VMs.
2. At the end of the process, you might expect *another* host to use that image, so we really want to use direct I/O.
3. We really want to know that the command completed end to end (we could have used caching plus a sync at the end? Not sure it's worth the effort).

> 
> -t cache
>            specifies the cache mode that should be used with the
> (destination) file. See the documentation of the emulator's "-drive
>            cache=..." option for allowed values.
> 
> -T src_cache
>            specifies the cache mode that should be used with the source
> file(s). See the documentation of the emulator's "-drive
>            cache=..." option for allowed values.

Source cache is somewhat interesting, if you imagine that you'll be reading that image again. I doubt we can anticipate this.

Comment 36 Dennis Keefe 2018-06-20 15:05:01 UTC
How is direct I/O forced? Is qemu-img doing direct I/O? I believe that qemu-img does I/O in 64K chunks. Or is it doing direct I/O at the sector level (sub-4K)? Is there a way to increase the size to, say, 256K?

What about using -W
man qemu-img -> convert
"Out of order writes can be enabled with "-W" to improve performance.  This is only recommended for preallocated devices like host devices or other raw block devices. Out of order write does not work in combination with creating compressed images."

This comment leads me to believe that qemu-img is doing direct I/O, which would be the only way to enforce ordering. The block layer does not guarantee ordering of writes.
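
If someone wants to try it, a minimal sketch with -W added (placeholder paths; the flag is documented in qemu-img(1), but whether it actually helps on this stack is exactly the open question):

qemu-img convert -p -W -t none -T none -f qcow2 /path/to/src.qcow2 \
    -O qcow2 -o compat=1.1 /path/to/dst.qcow2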

Comment 37 Yaniv Kaul 2018-06-20 15:20:31 UTC
(In reply to Dennis Keefe from comment #36)
> How is direct I/O forced? Is qemu-img doing direct I/O? I believe that
> qemu-img does I/O in 64K chunks. Or is it doing direct I/O at the sector
> level (sub-4K)? Is there a way to increase the size to, say, 256K?

I believe the default cache is none.
I believe disks in the VM always have 512b sectors, and are read sector by sector. Kevin?
(Somewhere around https://github.com/qemu/qemu/blob/master/qemu-img.c#L1880 ?)

> 
> What about using -W
> man qemu-img -> convert
> "Out of order writes can be enabled with "-W" to improve performance.  This
> is only recommended for preallocated devices like host devices or other raw
> block devices. Out of order write does not work in combination with creating
> compressed images."

Yes, there was a discussion about it some time ago, and it wasn't clear this actually helped in any way.

> 
> This comment leads me to believe that qemu-img is doing direct IO, which
> would be the only way to enforce ordering.  The block layer does not
> guarantee ordering of
> writes.

Comment 38 Krutika Dhananjay 2018-06-21 06:07:05 UTC
(In reply to Dennis Keefe from comment #34)
> Thank you for testing this.
> Would it be possible to use the same setup to clone a VM and provide the
> VDO stats?
> 
> This is what I found in the ps output when cloning a VM. Could the same
> command be used?
> 
> vdsm     23460  1.1  0.0 392240 32776 ?        S<l  00:03   0:00
> /usr/bin/qemu-img convert -p -t none -T none -f qcow2
> /rhev/data-center/mnt/glusterSD/rhsdev-grafton2.lab.eng.blr.redhat.com:
> _vmstore/288cfe58-913e-48f9-82a3-bb966e00ba15/images/bf2fa147-9f92-4a0d-9d15-
> 30d94c3c5e36/c6ae0472-1147-4eb7-8c7f-32efb8523ea2 -O qcow2 -o compat=1.1
> /rhev/data-center/mnt/glusterSD/rhsdev-grafton2.lab.eng.blr.redhat.com:
> _vmstore/288cfe58-913e-48f9-82a3-bb966e00ba15/images/8d88d52f-3091-402a-80aa-
> 37a927ea26fc/57f85eef-1051-4a58-9e7a-350b7ccbfb4d
> 
> qemu-img convert -p -t none -T none -f qcow2 < source image> -O qcow2 -o
> compat=1.1 <destination image>
> 
> -p = progress
> -t = cache
> -T = source cache
> -f = format
> -O = output file
> -o = option (in this case "compat=1.1")
> compat=1.1 - use the qcow2 v1.1 format
> 
> 
> Also, would using a cache in the cloning process help with performance and
> avoid sub-4K writes?
> 
> -t cache
>            specifies the cache mode that should be used with the
> (destination) file. See the documentation of the emulator's "-drive
>            cache=..." option for allowed values.
> 
> -T src_cache
>            specifies the cache mode that should be used with the source
> file(s). See the documentation of the emulator's "-drive
>            cache=..." option for allowed values.

Done.

Here are the numbers on a plain distribute gluster volume with src and dst files residing on the same volume:

BEFORE QEMU-IMG-CONVERT:
=======================

[root@rhsdev-grafton3 ~]# vdostats /dev/mapper/vdo_sdc --verbose | grep partial
  bios in partial read                : 30
  bios in partial write               : 1062568
  bios in partial discard             : 0
  bios in partial flush               : 0
  bios in partial fua                 : 0
  bios acknowledged partial read      : 2818445
  bios acknowledged partial write     : 1062568
  bios acknowledged partial discard   : 0
  bios acknowledged partial flush     : 0
  bios acknowledged partial fua       : 0

AFTER QEMU-IMG-CONVERT:
======================
[root@rhsdev-grafton3 tmp]# vdostats /dev/mapper/vdo_sdc --verbose | grep partial
  bios in partial read                : 47
  bios in partial write               : 1081368
  bios in partial discard             : 0
  bios in partial flush               : 0
  bios in partial fua                 : 0
  bios acknowledged partial read      : 5606063
  bios acknowledged partial write     : 1081368
  bios acknowledged partial discard   : 0
  bios acknowledged partial flush     : 0
  bios acknowledged partial fua       : 0

I also captured strace output for the qemu-img command during this run, just to get the sizes of the writes sent by qemu-img. Let me know if you need it, and I'll attach it.
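
In case it is useful, a rough sketch of that capture (placeholder paths; this assumes the destination writes show up as pwrite64/pwritev on this platform):

strace -f -T -o /tmp/qemu-img.strace \
    qemu-img convert -p -t none -T none -f qcow2 /path/to/src.qcow2 -O qcow2 -o compat=1.1 /path/to/dst.qcow2
grep -E 'pwrite64|pwritev' /tmp/qemu-img.strace | less   # inspect the write sizes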

-Krutika

Comment 39 Dennis Keefe 2018-06-22 00:16:56 UTC
Krutika,

Would it be possible for you to post or send me the full output of vdostats --verbose for both before and after? It is interesting that there is a large increase in 'bios acknowledged partial read', but essentially no increase in 'bios in partial read'.

Comment 40 Krutika Dhananjay 2018-06-22 05:45:10 UTC
Created attachment 1453652 [details]
before qemu-img-convert

Comment 41 Krutika Dhananjay 2018-06-22 05:49:51 UTC
Created attachment 1453653 [details]
after qemu-img-convert

Comment 42 Kevin Wolf 2018-06-22 14:14:22 UTC
(In reply to Yaniv Kaul from comment #37)
> I believe the default cache is none.
> I believe disks in the VM always have 512b sectors, and are read sector by
> sector. Kevin?
> (Somewhere around https://github.com/qemu/qemu/blob/master/qemu-img.c#L1880
> ?)

The default cache mode is 'writeback', but that doesn't seem to matter here because you override it for both source and target anyway.

The sector size of guest devices is completely unrelated, but it defaults to 512 bytes and is configurable via -device physical_block_size=...,logical_block_size=...

The buffer size that 'qemu-img convert' uses is 2 MB. Requests can be smaller than that for sparse output images, obviously. A possibly relevant option for this is -S, which controls the minimum size of a zeroed area that would leave the target image sparse instead of writing explicit zeros; it defaults to 4k.
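
For illustration only, a sketch of how -S would be passed (placeholder paths; whether raising it from the 4k default actually helps on a VDO-backed volume has not been measured here):

qemu-img convert -p -t none -T none -S 64k -f qcow2 /path/to/src.qcow2 \
    -O qcow2 -o compat=1.1 /path/to/dst.qcow2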

Comment 43 Sahina Bose 2018-07-25 04:56:25 UTC
The time taken to create VMs is under a minute (5-15 seconds) when tried on a gluster volume with VDO on SSDs. The same timing was observed even on the QE setup with an arbiter gluster volume with VDO on HDDs.

The issue seems to be seen only with a gluster volume on VDO with HDDs and an SSD lvmcache.

Comment 44 SATHEESARAN 2018-07-25 09:28:40 UTC
(In reply to Sahina Bose from comment #43)
> The time taken to create VMs is under a minute (5-15 seconds) when tried on
> a gluster volume with VDO on SSDs. The same timing was observed even on the
> QE setup with an arbiter gluster volume with VDO on HDDs.
> 
> The issue seems to be seen only with a gluster volume on VDO with HDDs and
> an SSD lvmcache.

The issue is more pronounced with the replica 3 + VDO + lvmcache combination.
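
For anyone trying to reproduce this, a minimal sketch of commands to confirm the brick's storage stack (device, VG and LV names will differ per setup):

lsblk -o NAME,TYPE,SIZE,MOUNTPOINT           # shows the cache/VDO/LV layering under the brick
lvs -a -o lv_name,vg_name,segtype,devices    # a 'cache' segtype indicates lvmcache is in use
vdostats --human-readable                    # lists the VDO volumes backing the bricks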

Comment 45 Guillaume Pavese 2018-08-24 15:13:15 UTC
I see a similar problem here:
oVirt 4.2.6 RC2 / replica 2 + 1 arbiter and VDO (no lvmcache)

Creating a VM is fast, but cloning a VM leaves the two disks locked for ~45 minutes.

Comment 46 Sahina Bose 2018-10-09 11:09:18 UTC
Bipin, can you retest based on the changes for Bug 1616270?

Comment 47 SATHEESARAN 2019-01-23 04:20:22 UTC
(In reply to Sahina Bose from comment #46)
> Bipin, can you retest based on the changes for Bug 1616270?

Yes, we can retest and provide the results.

Comment 48 Sandro Bonazzola 2019-01-28 09:41:27 UTC
This bug has not been marked as blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Comment 50 bipin 2019-01-28 11:13:34 UTC
(In reply to Sahina Bose from comment #46)
> Bipin, can you retest based on the changes for Bug 1616270?

Will test with the latest bits and update the bug soon, so I will not clear the needinfo until then.

Comment 51 bipin 2019-02-05 11:14:13 UTC
(In reply to Sahina Bose from comment #46)
> Bipin, can you retest based on the changes for Bug 1616270?

Tested with the below versions:
------------------------------
glusterfs-3.12.2-40.el7rhgs.x86_64
rhv-4.2.8-5

Attempt 1: Created a VDO-based template, then created 12 VMs; the overall time was 12 minutes.
Attempt 2: Created 12 more VMs; the total time taken was 15 minutes.

PS: These were created on the same hypervisor, which has HDD disks.

Comment 52 Yaniv Kaul 2019-02-08 01:02:42 UTC
(In reply to bipin from comment #51)
> (In reply to Sahina Bose from comment #46)
> > Bipin, can you retest based on the changes for Bug 1616270?
> 
> Tested with the below versions:
> ------------------------------
> glusterfs-3.12.2-40.el7rhgs.x86_64
> rhv-4.2.8-5
> 
> Attempt 1: Created a VDO-based template, then created 12 VMs; the overall
> time was 12 minutes.
> Attempt 2: Created 12 more VMs; the total time taken was 15 minutes.

Sounds to me like much better performance than the initial report. Closing based on the tests.
Thanks!

> 
> PS: These were created on the same hypervisor, which has HDD disks.

