Description of problem:

The current defaults for thin provisioning (volume_utilization_percent and volume_utilization_chunk_mb) are too low for modern workloads (e.g. Kubernetes). Please update the VDSM default values. If possible, please also preallocate at least 5-10% of the disk to avoid VM pauses on first start.
This goes against other ideas aimed at minimizing storage consumption (e.g., bz 2041352).

I think we need to distinguish between OCP on RHV and traditional virtualization workloads.
At one extreme, that of OCP on RHV, we don't expect high overcommit, and the virtualization workloads are supposed to run a relatively high number of nested workloads.
At the other extreme, that of a VDI solution, the ratio of VMs to hosts is higher and the virtualization workloads are less IO intensive.

Nir, what do you think?
How about simplifying the tuning instead of changing the defaults?
(In reply to Arik from comment #2)
> This goes against other ideas aimed at minimizing storage consumption
> (e.g., bz 2041352).
> I think we need to distinguish between OCP on RHV and traditional
> virtualization workloads.
> At one extreme, that of OCP on RHV, we don't expect high overcommit, and
> the virtualization workloads are supposed to run a relatively high number
> of nested workloads.
> At the other extreme, that of a VDI solution, the ratio of VMs to hosts is
> higher and the virtualization workloads are less IO intensive.
>
> Nir, what do you think?
> How about simplifying the tuning instead of changing the defaults?

I agree that the defaults are too low.

Extending a disk takes 2-6 seconds in normal conditions, and we extend a disk only when it has 512 MiB of free space. Fast storage can easily write 1 GiB in one second, so a VM is likely to pause for several seconds before the extend request is completed.

When we create a VM we don't know how the VM will access storage, but maybe we need different defaults for different kinds of VMs. Currently we recommend using preallocated disks for VMs that need the best performance and reliability. Thin disks are best used for VMs with low storage needs (e.g. the VDI use case).

The current tunables for thin block storage are in vdsm, and they are global. A better way would be to have defaults per:
- system
- cluster
- vm
- disk

For example, if I have a DB VM, the DB disk(s) should use different tuning compared with a mostly idle desktop VM that never writes to storage. So the tunables should move to engine, and be passed to vdsm when starting a VM. Vdsm can use the provided values to monitor disks.

Regarding pre-extending a disk when creating disks - this is possible in the engine API, and already used in lots of example scripts like:
https://github.com/oVirt/python-ovirt-engine-sdk4/blob/76dc3e7405b9acfefb34147a63172d164d8d7016/examples/upload_disk.py#L230

This example allocates the exact size needed to write the disk to storage. If users want to allocate additional space to avoid pausing right after the VM starts, they can add additional size when creating the disks.

As a quick way to improve the situation, we can change the vdsm defaults to use a 2 GiB chunk size instead of 1 GiB. This will mitigate most cases with minimal additional storage space. But this change should also be done in engine, which keeps the same default.
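For reference, this is how the two tunables combine into the 512 MiB free-space limit mentioned above (a sketch based on my reading of the vdsm configuration documentation; the helper name is illustrative):

def free_space_limit_mb(chunk_mb, utilization_percent):
    # vdsm extends a thin block volume when its free space drops
    # below this limit.
    return chunk_mb * (100 - utilization_percent) / 100

free_space_limit_mb(1024, 50)  # 512.0 MiB - current defaults
free_space_limit_mb(2048, 50)  # 1024.0 MiB - with a 2 GiB chunk size

With a 2 GiB chunk and the same 50% limit, the VM gets a 1 GiB buffer instead of 512 MiB, doubling the time available to complete an extend at a given write rate.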
Regarding bz 2041352 - the issue there is using 10% extra allocation when the possible allocation is more like 1%. So this bug does contradict the requirement for a more aggressive chunk size.

For example, we use the same chunk size (1 GiB) for a 5 GiB disk and a 500 GiB disk. We can adapt the chunk size to the disk size, and/or to the current allocation.

Ideally we could adapt the chunk size to the recent write rate, so when you start writing 10 GiB of data, instead of extending 10 times, and pausing many times during the write, we can use bigger chunks and avoid most pauses. This is possible by using an adaptive block threshold - instead of registering the threshold at the point where you want to extend, register a threshold that triggers an event early, which we can use to collect the current write rate. Based on the write rate, we can adapt the next extend size.
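A hypothetical sketch of that adaptive idea; the function name and the rate-to-chunk mapping are made up for illustration, this is not vdsm code:

def next_chunk_mb(write_rate_mb_s, min_chunk_mb=1024):
    # Extending takes 2-6 seconds in normal conditions (see above),
    # so allocate enough to absorb the worst case, with some slack.
    worst_case_extend_sec = 6
    needed = write_rate_mb_s * worst_case_extend_sec * 2
    return max(min_chunk_mb, int(needed))

next_chunk_mb(50)    # 1024 - mostly idle desktop VM
next_chunk_mb(500)   # 6000 - busy DB VM writing at 500 MiB/s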
> When we create a VM we don't know how the VM will access storage, but maybe
> we need different defaults for different kinds of VMs.

If by "kinds of VMs" you mean the VM type (server/desktop/high-performance) then yes, it also crossed my mind - to configure higher tuning properties for high performance (assuming the master nodes are set as high-performance VMs).

> As a quick way to improve the situation, we can change the vdsm defaults to
> use a 2 GiB chunk size instead of 1 GiB. This will mitigate most cases with
> minimal additional storage space. But this change should also be done in
> engine, which keeps the same default.

Yes, but then we really need to reproduce this issue and see if this change is enough.
If we go with bigger changes, then it conflicts with efficient storage consumption.
If we go with smaller changes like this, they may not be enough.

Janos, were you able to reproduce this issue?
We have VM pauses on startup in our CI system for every installation since the disk starts at 0. Other than that, OCS seems to do a pretty good job of pausing the VM under high load on the OSDs.

I agree that being able to tune it to the workload would be ideal, but that may call for a larger code change. Bumping the values to match most use cases would be very welcome.
(In reply to Janos Bonic from comment #6)
> We have VM pauses on startup in our CI system for every installation since
> the disk starts at 0.

What do you mean by "disk starts at 0"? If the disk is a snapshot, it starts as an empty 1 GiB logical volume. It will be extended to 2 GiB when the allocation reaches 512 MiB. If you want to allocate more than 1 GiB, you can use a larger initial_size when you create the snapshot.

> Other than that, OCS seems to do a pretty good job of pausing the VM under
> high load on the OSDs.

Do you mean Ceph storage? Do we run Ceph nodes in RHV VMs, using thin disks on block storage?

> I agree that being able to tune it to the workload would be ideal, but that
> may call for a larger code change. Bumping the values to match most use
> cases would be very welcome.

We cannot optimize the defaults for one use case when this is unwanted for other use cases. We can improve the defaults so they work for most use cases, but they will never be the optimal values for a particular use case. For best results you should optimize the specific system.

For example, "our CI" is a system you fully control, so you can use preallocated disks, or a large initial size, when provisioning this system.

If the issue is not "our CI" but all OCP on RHV installations, the best solution is to fix this in the installer. Using your own defaults is very easy - you install your own defaults in:

$ cat /etc/vdsm/vdsm.conf.d/99-ocp.conf
# Settings optimized for running OCP on RHV.

[irs]

# Size of extension chunk in megabytes.
# default value:
# volume_utilization_chunk_mb = 1024
volume_utilization_chunk_mb = 2048

And restart the vdsm service if you installed the file after vdsm was started.
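For example (assuming the standard vdsm service name):

# systemctl restart vdsmd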
> If the issue is not "our CI" but all OCP on RHV installations, the best
> solution is to fix this in the installer.

+1

I'd like to add that the assumption that Kubernetes represents the modern workloads that run in traditional data centers is debatable (it's more of an exception), and unless we come up with a magic number that fits (almost) all scenarios, I would prefer to make the parameters easier to tune.

I don't see a reason to rush with this though, because specifically for OCP on RHV, the installer can change to provision VMs with preallocated disks, or users can be advised to tune it as explained in comment 7.
It's not only Kubernetes use cases; we have known about the same issue for the general "database" use case for ages. It really is that nowadays most storage is very much capable of writing more than 512 MB in a few seconds. Same as the default RAM size of 1 GB is not really usable for anything today. We need to match the hardware capabilities in general.

Instead of fancy profiles, I think this RFE should track just an update of our defaults to keep the right balance between performance and over-allocation for current hardware.
To keep this simple enough I'd like to focus only on a simple change and decide if we should:
- just change volume_utilization_chunk_mb to another predefined value
- change volume_utilization_percent as well
- and/or make them proportional to volume size
(In reply to Michal Skrivanek from comment #11)
> It's not only Kubernetes use cases; we have known about the same issue for
> the general "database" use case for ages. It really is that nowadays most
> storage is very much capable of writing more than 512 MB in a few seconds.
> Same as the default RAM size of 1 GB is not really usable for anything
> today. We need to match the hardware capabilities in general.
>
> Instead of fancy profiles, I think this RFE should track just an update of
> our defaults to keep the right balance between performance and
> over-allocation for current hardware.

What blocks us from preallocating the disks for k8s and such database workloads?

Is there real demand for running these workloads with thin provisioning?
(In reply to Arik from comment #14)
> (In reply to Michal Skrivanek from comment #11)
> > It's not only Kubernetes use cases; we have known about the same issue
> > for the general "database" use case for ages. It really is that nowadays
> > most storage is very much capable of writing more than 512 MB in a few
> > seconds. Same as the default RAM size of 1 GB is not really usable for
> > anything today. We need to match the hardware capabilities in general.
> >
> > Instead of fancy profiles, I think this RFE should track just an update
> > of our defaults to keep the right balance between performance and
> > over-allocation for current hardware.
>
> What blocks us from preallocating the disks for k8s and such database
> workloads?
>
> Is there real demand for running these workloads with thin provisioning?

Storage overprovisioning is important; it's the whole reason why thin provisioning even exists. I do believe it has high value and people widely use it and will continue to do so.
(In reply to Michal Skrivanek from comment #16)
> Storage overprovisioning is important; it's the whole reason why thin
> provisioning even exists. I do believe it has high value and people widely
> use it and will continue to do so.

+1

Exactly, and the importance of over-provisioning is why I think we should be careful about changing the settings in a way that may lead to over-allocation.

My point is whether there really are users that have IO-intensive operations and expect to take advantage of thin provisioning, or whether it is rather just an incorrect configuration that the OCP on RHV installer applies.
Discussed offline, we can go with minimal changes that won't necessarily fix any real-world issue but will also not (significantly) hinder over-commitment.

About making it configurable - that requires a separate RFE.
A previous bug for reference https://bugzilla.redhat.com/show_bug.cgi?id=1408594#c8
Created attachment 1862381 [details]
Extract extend stats from vdsm log

Usage: python3 extend-stats.py < /var/log/vdsm/vdsm.log
I posted https://github.com/oVirt/vdsm/pull/80, improving extend info timing. With this change we log the time from when we received a block threshold event until the new size of the disk is visible to the guest.

I collected extend info for a VM with a 50 GiB disk, extending the disk 50 times:
https://github.com/oVirt/vdsm/pull/80#issuecomment-1046745262

From this log we can extract the total extend time using the attached script (attachment 1862381 [details]):

$ python3 extend-stats.py <vdsm.log
min=2.270 avg=3.668 max=6.240

With this data we can adapt the defaults so we don't pause in the common case, with a common write rate, on common servers.
I posted a PR with new defaults:
https://github.com/oVirt/vdsm/pull/82

Test results with the new defaults show that we can now cope with a 4x faster write rate before the VM starts to pause.

Before:

write rate   extends   pauses
-----------------------------
 75 MiB/s         50        0
100 MiB/s         50        4
125 MiB/s         50        4
150 MiB/s         53       24

After:

write rate   extends   pauses
-----------------------------
200 MiB/s         20        0
250 MiB/s         20        0
300 MiB/s         20        0
350 MiB/s         21        0
400 MiB/s         20        1
450 MiB/s         20        2
500 MiB/s         22        7
550 MiB/s         23        7
Created attachment 1862576 [details]
Extend script for testing pauses during the extend flow

The attached script, run inside the guest, causes a 50 GiB disk to be fully extended by a very busy guest.

To use the script:

1. Create a VM with a 50 GiB thin data disk on fast iSCSI/FC storage

2. If the disk in the guest is not /dev/sdb, update the script

   For example, testing disk /dev/sdc:

   PATH = "/dev/sdc"

3. Set the required write rate (RATE)

   For example, testing a write rate of 500 MiB/s:

   RATE = 500 * 1024**2

4. Clear the vdsm log

   # echo -n >/var/log/vdsm/vdsm.log

5. Run the script

   # python3 extend.py
   100.00% 50.00 GiB 340.81 seconds 150.23 MiB/s

6. Copy and analyze the vdsm log:

   # cp /var/log/vdsm/vdsm.log 150mbs.log
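For reference, a minimal sketch of what such a rate-limited writer can look like, reconstructed from the usage notes above (the attached script is the authoritative version; the chunk size here is an assumption):

import mmap
import os
import time

PATH = "/dev/sdb"        # test disk inside the guest
RATE = 150 * 1024**2     # target write rate, bytes per second
CHUNK = 8 * 1024**2      # write unit (assumed)
SIZE = 50 * 1024**3      # write the entire 50 GiB disk

# O_DIRECT bypasses the guest page cache so every write reaches the
# thin volume and triggers extends on the host.
fd = os.open(PATH, os.O_WRONLY | os.O_DIRECT)
buf = mmap.mmap(-1, CHUNK)  # page-aligned buffer, required by O_DIRECT
buf.write(b"x" * CHUNK)

start = time.monotonic()
written = 0
try:
    while written < SIZE:
        os.write(fd, buf)
        written += CHUNK
        # Throttle: sleep until we are back at or below the target rate.
        ahead = written / RATE - (time.monotonic() - start)
        if ahead > 0:
            time.sleep(ahead)
finally:
    os.close(fd)

elapsed = time.monotonic() - start
print(f"{written / 1024**3:.2f} GiB {elapsed:.2f} seconds "
      f"{written / elapsed / 1024**2:.2f} MiB/s")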
Created attachment 1862578 [details]
Logs from testing various configurations

The tarball includes logs from testing the old and new configurations:

$ tree .
.
├── 1024-50
│   ├── 100mbs.log
│   ├── 125mbs.log
│   ├── 150mbs.log
│   └── 75mbs.log
├── 2560-20
│   ├── 200mbs.log
│   ├── 250mbs.log
│   ├── 300mbs.log
│   ├── 350mbs.log
│   ├── 400mbs.log
│   ├── 450mbs.log
│   ├── 500mbs.log
│   └── 550mbs.log
└── extend-stats.py

2 directories, 13 files

The script extend-stats.py extracts extend stats from a vdsm log.

The directory 1024-50 contains logs from testing the old configuration:

[irs]
volume_utilization_chunk_mb = 1024
volume_utilization_percent = 50

The directory 2560-20 contains logs from testing the new configuration:

[irs]
volume_utilization_chunk_mb = 2560
volume_utilization_percent = 20

Examples from the attached logs:

$ for n in */*.log; do echo $n; python3 extend-stats.py <$n; echo; done
1024-50/100mbs.log
min=2.320 avg=3.427 max=6.400

1024-50/125mbs.log
min=2.320 avg=3.569 max=5.960

1024-50/150mbs.log
min=2.350 avg=3.597 max=7.880

1024-50/75mbs.log
min=2.430 avg=3.599 max=6.320

2560-20/200mbs.log
min=2.660 avg=3.640 max=5.860

2560-20/250mbs.log
min=2.310 avg=3.341 max=4.800

2560-20/300mbs.log
min=2.500 avg=3.604 max=5.970

2560-20/350mbs.log
min=2.290 avg=3.664 max=6.970

2560-20/400mbs.log
min=2.600 avg=3.696 max=5.610

2560-20/450mbs.log
min=2.410 avg=3.598 max=6.040

2560-20/500mbs.log
min=2.720 avg=4.332 max=8.580

2560-20/550mbs.log
min=2.510 avg=4.199 max=8.350

In general the total extend time is stable regardless of the write rate, but with a high write rate the maximum extend time increases.

How is the time spent during extend?

$ grep 'completed <Clock' 2560-20/400mbs.log | head -1
2022-02-22 10:18:27,125+0200 INFO (mailbox-hsm/0) [virt.vm] (vmId='d0730833-98aa-4138-94d1-8497763807c7') Extend volume 3a69ed39-055a-4a30-bd74-b1f51a5ed5cc completed <Clock(total=2.99, wait=0.62, extend-volume=2.11, refresh-volume=0.25)> (thinp:567)

total=2.99          - total time since we received the block threshold event
                      until the new size is available to the guest
wait=0.62           - time from receiving the event until we start the extend
                      attempt
extend-volume=2.11  - time from sending the extend request until we get the
                      extend reply
refresh-volume=0.25 - time to refresh the volume to update the size on the
                      host
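For reference, a rough equivalent of what extend-stats.py does with the Clock lines shown above (a sketch; the attached script is the authoritative version):

import re
import sys

# Matches e.g. "... completed <Clock(total=2.99, wait=0.62, ...)>"
PATTERN = re.compile(r"completed <Clock\(total=([\d.]+)")

totals = [float(m.group(1)) for line in sys.stdin
          if (m := PATTERN.search(line))]

if totals:
    print(f"min={min(totals):.3f} avg={sum(totals) / len(totals):.3f} "
          f"max={max(totals):.3f}")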
How to get extend and pause stats from the logs:

Number of completed extends:

$ grep 'completed <Clock(' 2560-20/400mbs.log | wc -l
20

Number of pauses:

$ grep 'onResume' 2560-20/400mbs.log | wc -l
1

We check the number of resumes, since we get multiple pause events for every pause, but only one resume event after a VM is resumed.
This change improved the maximum write rate from 75 MiB/s to 350 MiB/s.

We have an upstream PR improving the maximum write rate to 610 MiB/s:
https://github.com/oVirt/vdsm/pull/103

See the upstream issue describing why and how this works:
https://github.com/oVirt/vdsm/issues/102

oVirt 4.5 will be 8 times better compared to oVirt 4.4, reducing the chance of pausing VMs when writing quickly to fast storage.

Example stats from 100 extensions:

# ./extend-stats <vdsm.log

Total time
min=0.860 avg=2.001 max=3.270

Wait time
min=0.050 avg=1.005 max=1.990

Extend time
min=0.530 avg=0.819 max=2.080

Refresh time
min=0.150 avg=0.175 max=0.210

Note that the wait time (0.05-1.99 seconds) is completely unneeded. Eliminating it will improve the write rate by a factor of 2. This is tracked upstream in https://github.com/oVirt/vdsm/issues/85
(In reply to Shir Fishbain from comment #29)

Comment 22 and comment 23 explain how to reproduce and test, but since they were written we have better tools for testing, and the system was improved to support a higher write rate (using mailbox events).

## Setup

1. Download the thinp.py tool from the vdsm stress tests in the guest:

   wget https://raw.githubusercontent.com/oVirt/vdsm/master/tests/storage/stress/thinp.py

2. Download the extend-stats tool to the host running the VM:

   wget https://raw.githubusercontent.com/oVirt/vdsm/master/contrib/extend-stats
   chmod +x extend-stats

## How to test write rate

1. Add a 50g thin disk on block storage to a running VM

2. Clear the vdsm logs on the host running the VM:

   echo -n >/var/log/vdsm/vdsm.log

3. Run in the guest:

   python3 thinp.py --rate XXX

   This writes 50g to the disk, triggering 20 extensions per run.

4. Copy the vdsm log for inspection:

   cp /var/log/vdsm/vdsm.log 75m.log

5. Extract stats from the vdsm log:

   extend-stats <XXX.log

6. Check if the VM paused during the test

   If the VM paused, we will have multiple onIOError logs when the VM pauses
   with ENOSPC, and a single onResume log when the VM is resumed after the
   extension. On the engine events page, we will see an error event about the
   VM pausing because of no space, for each pause.

7. Deactivate and remove the test disk

   You can test only once with the same disk. Once the disk has been
   extended, the VM will not pause when writing to the disk, since the disk
   was extended to the maximum size (53 GiB).

## Test variants

1. Multiple hosts - running the VM on one host, the SPM on the other host
2. Single host - running the VM on the SPM host
3. Test disk on the master storage domain
4. Test disk on another storage domain

## Expected results

RHV 4.4:
- Writing at 75 MiB/s: the VM does not pause during the run
- Writing at 100 MiB/s: the VM pauses at least once during the test
- Writing at 150 MiB/s: the VM pauses many times during the test

RHV 4.4sp1:
- Writing at 600 MiB/s: the VM does not pause during the run
- Writing at 700 MiB/s: the VM pauses at least once during the test
- Writing at 1200 MiB/s: the VM pauses many times during the test

To find the limits of RHV 4.4sp1 you need to use fast storage. I tested using local storage on the hosts, exposed as FC devices.

Please attach the output from the thinp.py tool, extend-stats, and vdsm logs from all runs.
According to the doc text, "The minimum allocation size was increased from 1 GiB to 2.5G". I don't see this change: according to the logs and the REST API, the disk is still 1 GiB in actual size when choosing thin provisioning.
(In reply to sshmulev from comment #31)
> According to the doc text, "The minimum allocation size was increased from
> 1 GiB to 2.5G".
> I don't see this change: according to the logs and the REST API, the disk
> is still 1 GiB in actual size when choosing thin provisioning.

It is not clear what "the disk" is. If you create a 1 GiB disk, we create a 1 GiB disk. If you create a 10 GiB disk, the initial size of the disk is 2.5 GiB.
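For illustration, the rule as described behaves like this (a hypothetical sketch consistent with the two examples above, not the actual engine code):

def initial_size_gib(virtual_size_gib):
    # The initial allocation is the minimum allocation size (2.5 GiB),
    # capped by the disk's virtual size.
    return min(virtual_size_gib, 2.5)

initial_size_gib(1)   # 1 -> a 1 GiB disk is created at its full size
initial_size_gib(10)  # 2.5 -> a 10 GiB disk starts at 2.5 GiB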
Verified.

I couldn't make the VM pause even at the highest rate of 1200 MB/s. I checked writing at this rate with an older version (4.4.10-7) and could immediately reproduce the issue there: the VM paused as soon as it started writing, and several more times after that.

Versions tested:
rhv-4.5.0-8
ovirt-engine-4.5.0.5-0.7
vdsm-4.50.0.13-1

Results:

With iSCSI, SPM host, disk on another SD:

python3 thinp.py /dev/sda --rate 600
50.00 GiB, 121.64 s, 420.9 MiB/s

# ./extend-stats < spm-host_600.log

Total time
min=1.300 avg=2.394 max=3.720

Wait time
min=0.110 avg=0.908 max=1.990

Extend time
min=0.590 avg=0.945 max=1.160

Refresh time
min=0.330 avg=0.540 max=1.200

The VM didn't pause at all.
---------------------------------------------------------
python3 thinp.py /dev/sda --rate 700
50.00 GiB, 108.56 s, 471.6 MiB/s

./extend-stats < spm-host_700.log

Total time
min=1.330 avg=2.333 max=3.460

Wait time
min=0.170 avg=1.033 max=2.010

Extend time
min=0.560 avg=0.929 max=1.140

Refresh time
min=0.290 avg=0.369 max=0.760

The VM didn't pause at all.
---------------------------------------------------------
python3 thinp.py /dev/sda --rate 1200
50.00 GiB, 110.53 s, 463.2 MiB/s

./extend-stats < spm-host_1200.log

Total time
min=1.180 avg=2.891 max=4.090

Wait time
min=0.210 avg=1.091 max=1.930

Extend time
min=0.570 avg=0.932 max=1.150

Refresh time
min=0.390 avg=0.864 max=1.650

#####################################################################
With FCP, non-SPM host, disk on another SD:

# python3 thinp.py /dev/sda --rate 600
50.00 GiB, 106.01 s, 483.0 MiB/s

# ./extend-stats < non-spm-host_600.log

Total time
min=1.700 avg=2.635 max=4.290

Wait time
min=0.130 avg=0.969 max=1.970

Extend time
min=0.560 avg=0.927 max=1.150

Refresh time
min=0.190 avg=0.737 max=1.820

---------------------------------------------------------
# python3 thinp.py /dev/sda --rate 700
50.00 GiB, 78.98 s, 648.3 MiB/s

# ./extend-stats < non-spm-host_700.log

Total time
min=0.940 avg=1.899 max=2.430

Wait time
min=0.170 avg=0.762 max=1.090

Extend time
min=0.570 avg=0.931 max=1.140

Refresh time
min=0.180 avg=0.208 max=0.240

---------------------------------------------------------
# python3 thinp.py /dev/sda --rate 1200
50.00 GiB, 103.55 s, 494.5 MiB/s

# ./extend-stats < non-spm-host_1200.log

Total time
min=1.870 avg=2.514 max=3.380

Wait time
min=0.060 avg=0.935 max=1.870

Extend time
min=0.580 avg=0.952 max=1.590

Refresh time
min=0.210 avg=0.627 max=1.230

------------------------------------------------------------
python3 thinp.py /dev/sda --rate 2500
50.00 GiB, 78.87 s, 649.2 MiB/s

# ./extend-stats < non-spm-host_2500.log

Total time
min=1.530 avg=2.442 max=3.030

Wait time
min=0.660 avg=1.344 max=1.720

Extend time
min=0.550 avg=0.892 max=1.130

Refresh time
min=0.190 avg=0.205 max=0.280
(In reply to sshmulev from comment #33)
> Verified.
> I couldn't make the VM pause even at the highest rate of 1200 MB/s.

The actual output from the command shows that in most tests you could not write more than 480 MiB/s.

> [...]

Looks good!
This bugzilla is included in oVirt 4.5.0 release, published on April 20th 2022.

Since the problem described in this bug report should be resolved in oVirt 4.5.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.