+++ This bug was initially created as a clone of Bug #1565462 +++

Description of problem:
VM creation via RHV Manager takes a very long time. Created 20+ VMs (RHEL 7.5) and they stayed in the "Image Locked" state for about 45 minutes.

Version-Release number of selected component (if applicable):
redhat-release-virtualization-host-4.2-2.1.el7.x86_64
rhvm-4.2.2.6-0.1.el7.noarch
ovirt-hosted-engine-setup-2.2.15-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.9-1.el7ev.noarch
glusterfs-3.12.2-7.el7rhgs.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Log in to the hosted engine (HE)
2. Create a RHEL 7.5 VM template, then create 20+ VMs in parallel
3. The VMs stay frozen in the "Image Locked" state for a long time

Actual results:
VM image creation takes a long time

Expected results:
Should not take this long

Additional info:
Cluster info: 3 nodes
Storage type: GlusterFS
1. Missing logs (engine, VDSM).
2. Why is it cloned? It's either an RHHI bug or an RHV bug; there's no need for both.
3. How long does it take for a single VM?
4. What's the NIC speed between the nodes?

(This BZ is missing a lot of details!)
Bipin, were you able to reproduce this once the gluster network was moved to a 10GbE network?
I assume the disk is a bit large (200G?). If so, it's the preallocation that takes that long, isn't it? In 4.2 I believe we've switched to using fallocate, which is far faster?
(In reply to Yaniv Kaul from comment #13)
> I assume the disk is a bit large (200G?). If so, it's the preallocation
> that takes that long, isn't it?
> In 4.2 I believe we've switched to using fallocate, which is far faster?

This is VM creation from a template; I don't think there's any preallocation here.

Bipin, to analyse performance issues you will need to provide gluster volume profile stats and iostat output as well. Also, is VDO enabled in the stack?
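For reference, a minimal way to collect those stats while the VMs are being created might look like the following (a sketch only; the volume name "vmstore" and the VDO device path are assumptions based on this setup):

# Start accumulating per-brick FOP/latency stats on the gluster volume (volume name assumed)
gluster volume profile vmstore start

# ... create the VMs from the template ...

# Dump the accumulated profile and stop profiling
gluster volume profile vmstore info > /tmp/vmstore-profile.txt
gluster volume profile vmstore stop

# In parallel on each host, sample device-level I/O every 5 seconds (device path assumed)
iostat -xm 5 /dev/mapper/vdo_sdc > /tmp/iostat-vdo.txt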
Krutika, could you look at the stats provided and see if anything stands out? If it's VDO-related, we can ask Dennis to look at it.
This bug is not marked as blocker and we entered blocker only phase for 4.2.4. Please consider re-targeting to 4.2.5.
Did some experiments with dd as suggested by Dennis (thanks Sahina for helping with the setup!).

Plain distribute single-brick volume
====================================
Command used:
dd if=/dev/urandom of=/mnt/foo bs=4096 count=10 oflag=direct

Before 4K dd:
============
[root@rhsdev-grafton3 ~]# vdostats /dev/mapper/vdo_sdc --verbose | grep partial
bios in partial read : 1
bios in partial write : 1043596
bios in partial discard : 0
bios in partial flush : 0
bios in partial fua : 0
bios acknowledged partial read : 10
bios acknowledged partial write : 1043596
bios acknowledged partial discard : 0
bios acknowledged partial flush : 0
bios acknowledged partial fua : 0

After 4K dd:
===========
[root@rhsdev-grafton3 ~]# vdostats /dev/mapper/vdo_sdc --verbose | grep partial
bios in partial read : 1
bios in partial write : 1043596
bios in partial discard : 0
bios in partial flush : 0
bios in partial fua : 0
bios acknowledged partial read : 10
bios acknowledged partial write : 1043596
bios acknowledged partial discard : 0
bios acknowledged partial flush : 0
bios acknowledged partial fua : 0

But after dd with bs=512 (count=80):
===============================
[root@rhsdev-grafton3 ~]# vdostats /dev/mapper/vdo_sdc --verbose | grep partial
bios in partial read : 1
bios in partial write : 1043666
bios in partial discard : 0
bios in partial flush : 0
bios in partial fua : 0
bios acknowledged partial read : 10
bios acknowledged partial write : 1043666
bios acknowledged partial discard : 0
bios acknowledged partial flush : 0
bios acknowledged partial fua : 0

The number of partial writes didn't change with 4K writes on the gluster volume (whereas the 512-byte dd did increase it, from 1043596 to 1043666).

Next, to see whether shard's metadata updates via setxattr were leading to sub-4K writes, I created a plain distribute but sharded volume with a 4MB shard block size and ran the same test:

Plain distribute but sharded volume:
====================================
dd if=/dev/urandom of=/mnt/foo bs=4096 count=10240 oflag=direct

Before 4K dd:
============
[root@rhsdev-grafton3 ~]# vdostats /dev/mapper/vdo_sdc --verbose | grep partial
bios in partial read : 1
bios in partial write : 1043736
bios in partial discard : 0
bios in partial flush : 0
bios in partial fua : 0
bios acknowledged partial read : 10
bios acknowledged partial write : 1043736
bios acknowledged partial discard : 0
bios acknowledged partial flush : 0
bios acknowledged partial fua : 0

After 4K dd:
===========
[root@rhsdev-grafton3 ~]# vdostats /dev/mapper/vdo_sdc --verbose | grep partial
bios in partial read : 1
bios in partial write : 1043736
bios in partial discard : 0
bios in partial flush : 0
bios in partial fua : 0
bios acknowledged partial read : 10
bios acknowledged partial write : 1043736
bios acknowledged partial discard : 0
bios acknowledged partial flush : 0
bios acknowledged partial fua : 0

And from the brick profile:
6.02  23.97 us  18.00 us  84.00 us  10240  FXATTROP

There were as many FXATTROPs (= fgetxattr + modify in-memory + fsetxattr) as there were writes, but still no partial writes were seen.

Interval 0 Stats:
Block Size: 4096b+
No. of Reads: 0
No. of Writes: 20490

At least so far it doesn't seem like the sub-4K writes are originating from gluster. With direct I/O, the gluster stack uses the same block size that it receives from the application.

I will also profile the actual vmstore volume on the client side to see what block sizes it receives writes in, and get back.
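For anyone repeating this comparison, the before/after capture can be scripted along these lines (same device as above; the output file names are arbitrary):

# Snapshot the partial-bio counters before the test
vdostats /dev/mapper/vdo_sdc --verbose | grep partial > /tmp/vdo-partial.before

# Run the workload under test, e.g. the sharded-volume dd from above
dd if=/dev/urandom of=/mnt/foo bs=4096 count=10240 oflag=direct

# Snapshot again and diff; any delta means sub-4K writes reached VDO during the run
vdostats /dev/mapper/vdo_sdc --verbose | grep partial > /tmp/vdo-partial.after
diff /tmp/vdo-partial.before /tmp/vdo-partial.after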
(In reply to Krutika Dhananjay from comment #31)
> Did some experiments with dd as suggested by Dennis (thanks Sahina for
> helping with the setup!).
>
> Plain distribute single-brick volume
> ====================================
> Command used:
> dd if=/dev/urandom of=/mnt/foo bs=4096 count=10 oflag=direct

That's not a valid test, for two reasons:
1. VDO is completely ineffective here - it cannot compress or dedup this data, since it comes from urandom.
2. The use of 4K is also artificial. Some writes are smaller than that, many are bigger. I highly suggest testing with a more realistic workload, or at least with multiple block sizes.

Example:
https://blog.purestorage.com/an-analysis-of-io-size-modalities-on-pure-storage-flasharrays/
(In reply to Yaniv Kaul from comment #32)
> (In reply to Krutika Dhananjay from comment #31)
> > Did some experiments with dd as suggested by Dennis (thanks Sahina for
> > helping with the setup!).
> >
> > Plain distribute single-brick volume
> > ====================================
> > Command used:
> > dd if=/dev/urandom of=/mnt/foo bs=4096 count=10 oflag=direct
>
> That's not a valid test, for two reasons:
> 1. VDO is completely ineffective here - it cannot compress or dedup this
> data, since it comes from urandom.
> 2. The use of 4K is also artificial. Some writes are smaller than that, many
> are bigger. I highly suggest testing with a more realistic workload, or at
> least with multiple block sizes.
>
> Example:
> https://blog.purestorage.com/an-analysis-of-io-size-modalities-on-pure-storage-flasharrays/

Yaniv, this was merely a test to check where the sub-4K writes in the clone operation were originating from. We wanted to see whether any 4K write to the gluster volume gets translated into sub-4K writes, either due to metadata operations that gluster does or due to alignment issues.
Thank you for testing this.
Would it be possible to use the same setup to clone a VM and provide the VDO stats?

This is what I found in the ps output when cloning a VM. Could the same command be used?

vdsm 23460 1.1 0.0 392240 32776 ? S<l 00:03 0:00 /usr/bin/qemu-img convert -p -t none -T none -f qcow2 /rhev/data-center/mnt/glusterSD/rhsdev-grafton2.lab.eng.blr.redhat.com:_vmstore/288cfe58-913e-48f9-82a3-bb966e00ba15/images/bf2fa147-9f92-4a0d-9d15-30d94c3c5e36/c6ae0472-1147-4eb7-8c7f-32efb8523ea2 -O qcow2 -o compat=1.1 /rhev/data-center/mnt/glusterSD/rhsdev-grafton2.lab.eng.blr.redhat.com:_vmstore/288cfe58-913e-48f9-82a3-bb966e00ba15/images/8d88d52f-3091-402a-80aa-37a927ea26fc/57f85eef-1051-4a58-9e7a-350b7ccbfb4d

qemu-img convert -p -t none -T none -f qcow2 <source image> -O qcow2 -o compat=1.1 <destination image>

-p = progress
-t = destination cache mode
-T = source cache mode
-f = source format
-O = output format
-o = options (in this case "compat=1.1")
compat=1.1 - use the qcow2 v1.1 format

Also, would using a cache in the cloning process help with performance and with avoiding sub-4K writes?

-t cache
    specifies the cache mode that should be used with the (destination) file. See the documentation of the emulator's "-drive cache=..." option for allowed values.

-T src_cache
    specifies the cache mode that should be used with the source file(s). See the documentation of the emulator's "-drive cache=..." option for allowed values.
(In reply to Dennis Keefe from comment #34)
> Thank you for testing this.
> Would it be possible to use the same setup to clone a VM and provide the
> VDO stats?
>
> This is what I found in the ps output when cloning a VM. Could the same
> command be used?
>
> vdsm 23460 1.1 0.0 392240 32776 ? S<l 00:03 0:00 /usr/bin/qemu-img convert -p -t none -T none -f qcow2 /rhev/data-center/mnt/glusterSD/rhsdev-grafton2.lab.eng.blr.redhat.com:_vmstore/288cfe58-913e-48f9-82a3-bb966e00ba15/images/bf2fa147-9f92-4a0d-9d15-30d94c3c5e36/c6ae0472-1147-4eb7-8c7f-32efb8523ea2 -O qcow2 -o compat=1.1 /rhev/data-center/mnt/glusterSD/rhsdev-grafton2.lab.eng.blr.redhat.com:_vmstore/288cfe58-913e-48f9-82a3-bb966e00ba15/images/8d88d52f-3091-402a-80aa-37a927ea26fc/57f85eef-1051-4a58-9e7a-350b7ccbfb4d

Sounds like the command we use indeed.

> qemu-img convert -p -t none -T none -f qcow2 <source image> -O qcow2 -o
> compat=1.1 <destination image>
>
> -p = progress
> -t = destination cache mode
> -T = source cache mode
> -f = source format
> -O = output format
> -o = options (in this case "compat=1.1")
> compat=1.1 - use the qcow2 v1.1 format
>
> Also, would using a cache in the cloning process help with performance and
> with avoiding sub-4K writes?

No, for two or three reasons:
1. It would occupy memory on the host, which might otherwise be used for VMs.
2. At the end of the process, you might expect *another* host to use that image, so we really wish to use direct IO.
3. We really want to know that the command completed end to end (we could have used caching plus a sync at the end? Not sure it's worth the effort).

> -t cache
>     specifies the cache mode that should be used with the (destination)
>     file. See the documentation of the emulator's "-drive cache=..."
>     option for allowed values.
>
> -T src_cache
>     specifies the cache mode that should be used with the source file(s).
>     See the documentation of the emulator's "-drive cache=..." option for
>     allowed values.

Source cache is somewhat interesting, if you imagine that you'll be reading that image again. I doubt we can anticipate this.
How is direct IO forced? Is qemu-img doing direct IO? I believe that qemu-img does IO in 64K chunks. Or is it doing direct IO at the sector level (sub-4K)? Is there a way to increase the size to, say, 256K?

What about using -W?

man qemu-img -> convert:
"Out of order writes can be enabled with "-W" to improve performance. This is only recommended for preallocated devices like host devices or other raw block devices. Out of order write does not work in combination with creating compressed images."

This comment leads me to believe that qemu-img is doing direct IO, which would be the only way to enforce ordering. The block layer does not guarantee ordering of writes.
(In reply to Dennis Keefe from comment #36)
> How is direct IO forced? Is qemu-img doing direct IO? I believe that
> qemu-img does IO in 64K chunks. Or is it doing direct IO at the sector
> level (sub-4K)? Is there a way to increase the size to, say, 256K?

I believe the default cache mode is none.
I believe disks in the VM always have 512b sectors, and are read sector by sector. Kevin?
(Somewhere around https://github.com/qemu/qemu/blob/master/qemu-img.c#L1880 ?)

> What about using -W?
>
> man qemu-img -> convert:
> "Out of order writes can be enabled with "-W" to improve performance. This
> is only recommended for preallocated devices like host devices or other raw
> block devices. Out of order write does not work in combination with creating
> compressed images."

Yes, there was a discussion about it some time ago, and it wasn't clear this actually helped in any way.

> This comment leads me to believe that qemu-img is doing direct IO, which
> would be the only way to enforce ordering. The block layer does not
> guarantee ordering of writes.
(In reply to Dennis Keefe from comment #34)
> Thank you for testing this.
> Would it be possible to use the same setup to clone a VM and provide the
> VDO stats?
>
> This is what I found in the ps output when cloning a VM. Could the same
> command be used?
>
> vdsm 23460 1.1 0.0 392240 32776 ? S<l 00:03 0:00 /usr/bin/qemu-img convert -p -t none -T none -f qcow2 /rhev/data-center/mnt/glusterSD/rhsdev-grafton2.lab.eng.blr.redhat.com:_vmstore/288cfe58-913e-48f9-82a3-bb966e00ba15/images/bf2fa147-9f92-4a0d-9d15-30d94c3c5e36/c6ae0472-1147-4eb7-8c7f-32efb8523ea2 -O qcow2 -o compat=1.1 /rhev/data-center/mnt/glusterSD/rhsdev-grafton2.lab.eng.blr.redhat.com:_vmstore/288cfe58-913e-48f9-82a3-bb966e00ba15/images/8d88d52f-3091-402a-80aa-37a927ea26fc/57f85eef-1051-4a58-9e7a-350b7ccbfb4d
>
> qemu-img convert -p -t none -T none -f qcow2 <source image> -O qcow2 -o
> compat=1.1 <destination image>
>
> -p = progress
> -t = destination cache mode
> -T = source cache mode
> -f = source format
> -O = output format
> -o = options (in this case "compat=1.1")
> compat=1.1 - use the qcow2 v1.1 format
>
> Also, would using a cache in the cloning process help with performance and
> with avoiding sub-4K writes?
>
> -t cache
>     specifies the cache mode that should be used with the (destination)
>     file. See the documentation of the emulator's "-drive cache=..."
>     option for allowed values.
>
> -T src_cache
>     specifies the cache mode that should be used with the source file(s).
>     See the documentation of the emulator's "-drive cache=..." option for
>     allowed values.

Done. Here are the numbers on a plain distribute gluster volume, with the src and dst files residing on the same volume:

BEFORE QEMU-IMG-CONVERT:
=======================
[root@rhsdev-grafton3 ~]# vdostats /dev/mapper/vdo_sdc --verbose | grep partial
bios in partial read : 30
bios in partial write : 1062568
bios in partial discard : 0
bios in partial flush : 0
bios in partial fua : 0
bios acknowledged partial read : 2818445
bios acknowledged partial write : 1062568
bios acknowledged partial discard : 0
bios acknowledged partial flush : 0
bios acknowledged partial fua : 0

AFTER QEMU-IMG-CONVERT:
======================
[root@rhsdev-grafton3 tmp]# vdostats /dev/mapper/vdo_sdc --verbose | grep partial
bios in partial read : 47
bios in partial write : 1081368
bios in partial discard : 0
bios in partial flush : 0
bios in partial fua : 0
bios acknowledged partial read : 5606063
bios acknowledged partial write : 1081368
bios acknowledged partial discard : 0
bios acknowledged partial flush : 0
bios acknowledged partial fua : 0

I also captured strace output for the qemu-img command during this run, just to get the size of the writes sent by qemu-img. Let me know if you need it and I'll attach it.

-Krutika
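For reference, one way to capture the I/O sizes qemu-img issues during such a clone is to run it under strace (a sketch only; the image paths below are placeholders, not the real ones from this setup, and depending on the AIO backend the writes may show up as io_submit rather than pwrite):

# Trace write-side syscalls of qemu-img and its worker threads
strace -f -e trace=pwrite64,pwritev,io_submit -o /tmp/qemu-img.strace \
    qemu-img convert -p -t none -T none -f qcow2 /path/to/src.qcow2 \
    -O qcow2 -o compat=1.1 /path/to/dst.qcow2

# Rough histogram of write sizes (bytes returned by pwrite calls)
awk -F' = ' '/pwrite/ {print $2}' /tmp/qemu-img.strace | sort -n | uniq -c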
Krutika,

Would it be possible for you to post or send me the full output of vdostats --verbose for both before and after? It is interesting that there is a large increase in "bios acknowledged partial read", but barely any change in "bios in partial read".
Created attachment 1453652 [details] before qemu-img-convert
Created attachment 1453653 [details] after qemu-img-convert
(In reply to Yaniv Kaul from comment #37)
> I believe the default cache mode is none.
> I believe disks in the VM always have 512b sectors, and are read sector by
> sector. Kevin?
> (Somewhere around https://github.com/qemu/qemu/blob/master/qemu-img.c#L1880 ?)

The default cache mode is 'writeback', but that doesn't seem to matter here because you override it for both source and target anyway.

The sector size of guest devices is completely unrelated, but it defaults to 512 bytes and is configurable via -device physical_block_size=...,logical_block_size=...

The buffer size that 'qemu-img convert' uses is 2 MB. Requests can be smaller than that for sparse output images, obviously. A possibly relevant option for this is -S, which controls the minimum size of a zeroed area that would leave the target image sparse instead of writing explicit zeros; it defaults to 4k.
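If it helps the investigation, the sparse-detection granularity can be varied on the same command line to see whether it changes the write pattern reaching VDO (a sketch; the image paths are placeholders, and whether a larger -S actually improves anything here is an open question):

# Same convert that VDSM runs, but with a 64k minimum zeroed-area size instead of the 4k default
qemu-img convert -p -t none -T none -S 64k -f qcow2 /path/to/src.qcow2 \
    -O qcow2 -o compat=1.1 /path/to/dst.qcow2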
The time taken to create VMs is under a minute (5-15 seconds) when tried on a gluster volume with VDO on SSDs. The same timing was observed on the QE setup with an arbiter gluster volume with VDO on HDDs.

The issue seems to be seen only with a gluster volume on VDO with HDDs and an SSD lvmcache.
(In reply to Sahina Bose from comment #43)
> The time taken to create VMs is under a minute (5-15 seconds) when tried on
> a gluster volume with VDO on SSDs. The same timing was observed on the QE
> setup with an arbiter gluster volume with VDO on HDDs.
>
> The issue seems to be seen only with a gluster volume on VDO with HDDs and
> an SSD lvmcache.

The issue is more pronounced with the replica 3 + VDO + lvmcache combination.
I see a similar problem here: oVirt 4.2.6 RC2 / replica 2 + 1 arbiter and VDO (no lvmcache).
Creating a VM is fast, but cloning a VM leaves the two disks locked for ~45 min.
Bipin, can you retest based on the changes for Bug 1616270?
(In reply to Sahina Bose from comment #46)
> Bipin, can you retest based on the changes for Bug 1616270?

Yes, we can retest and provide the results.
This bug has not been marked as blocker for oVirt 4.3.0. Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.
(In reply to Sahina Bose from comment #46)
> Bipin, can you retest based on the changes for Bug 1616270?

Will test with the latest bits and update the bug soon, so I will not clear the needinfo until then.
(In reply to Sahina Bose from comment #46)
> Bipin, can you retest based on the changes for Bug 1616270?

Tested with the below versions:
------------------------------
glusterfs-3.12.2-40.el7rhgs.x86_64
rhv-4.2.8-5

Attempt 1: Created a VDO-based template, then created 12 VMs; the overall time was 12 min.
Attempt 2: Created 12 more VMs; the total time taken was 15 min.

PS: This was done on the same hypervisor, with HDD disks.
(In reply to bipin from comment #51)
> (In reply to Sahina Bose from comment #46)
> > Bipin, can you retest based on the changes for Bug 1616270?
>
> Tested with the below versions:
> ------------------------------
> glusterfs-3.12.2-40.el7rhgs.x86_64
> rhv-4.2.8-5
>
> Attempt 1: Created a VDO-based template, then created 12 VMs; the overall
> time was 12 min.
> Attempt 2: Created 12 more VMs; the total time taken was 15 min.

Sounds to me like much better performance than the initial report. Closing based on these tests. Thanks!

> PS: This was done on the same hypervisor, with HDD disks.