Description of problem:

When starting a live VM backup, a scratch disk is created for each disk that participates in the backup. When the backed-up disk resides on a block-based storage domain, the scratch disk is created with an initial size that can be equal to the backed-up disk's actual size. So when a 1 TB block-based disk is backed up while the VM is up, the storage domain that holds that disk must have 1 TB of free space.

This will be fixed when the scratch disk can be created as thin (bug 1913389). But until then, the initial size of the scratch disk should be configurable according to the backed-up disk size.

Version-Release number of selected component (if applicable):
4.5 master

How reproducible:
100%

Steps to Reproduce:
1. Create a VM with a RAW block-based disk
2. Run the VM
3. Start a live backup

Actual results:
The scratch disk is created with the same initial size as the backed-up disk.

Expected results:
The scratch disk's initial size can be determined in the engine configuration as a percentage of the backed-up disk size (BackupBlockScratchDiskInitialSizePercents).

Additional info:
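As a rough illustration of the requested behavior, the initial size would be a percentage of the backed-up disk's size. The variable names and the plain percentage formula below are illustrative only; the engine's exact calculation and rounding are not shown in this bug.

```shell
# Illustrative sketch only - not engine code.
disk_size=$((10 * 1024 * 1024 * 1024))   # 10 GiB backed-up disk
percent=20                                # e.g. BackupBlockScratchDiskInitialSizePercents
initial_size=$((disk_size * percent / 100))
echo "$initial_size"                      # 2147483648 (2 GiB)
```

With a 1 TB disk this would require only a fifth of the space that allocating the full disk size demands.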
This is not only about raw disks; the same issue exists for a qcow2 disk with or without a template. For example, we can have a qcow2 chain for which qemu-img measure reports 700G. If we don't have 700G free in the storage domain, the backup fails. In practice, the guest will never write 700G of data during the backup, so the backup would complete successfully even if we allocated only 70g for the scratch disk.

Backing up 700G is likely to take a lot of time, so the guest has more time to write data during the backup. When the guest writes data during the backup, qemu copies the old data from the disk to the scratch disk before writing the new data to the disk. So the size of the scratch disk is a function of the allocated size of the original disk.

So the test cases should be:
- raw disk
- qcow2 disk based on a template with a lot of data
- qcow2 disk with a snapshot, with a lot of data in the snapshot

For testing:
1. Create a VM with a second data disk (don't mount it or create a file system)
2. Start writing data in the guest to the data disk (e.g. /dev/sdb)
3. Run a full backup
4. Run an incremental backup

We want to test:
- the default configuration (10%?)
- double the default configuration (20%)
- the maximum size configuration (100%)

For writing data in the guest, we want to simulate a typical situation, for example writing 10G in a day: ~416 MiB per hour, ~6.9 MiB per minute. We can simulate this using:

    for i in $(seq 416); do
        dd if=/dev/zero bs=1M count=1 seek=$i of=/dev/sdb oflag=direct conv=fsync
        sleep 8
    done

We can play with the sleep to test different write rates, and check how much data we can write with the default configuration, then double the configuration and test again. This should be tested with a real server and storage, something close to what a customer would have.

If the scratch disk becomes full during the backup, the VM will pause. The VM should resume when the backup completes.
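The simulated rate above works out as follows (a quick sanity check, assuming 10G here means 10,000 MiB; integer shell arithmetic, so the per-minute rate is computed in tenths of a MiB):

```shell
# Verify the write-rate figures quoted above.
mib_per_day=10000
mib_per_hour=$((mib_per_day / 24))                     # ~416 MiB per hour
mib_per_min_tenths=$((mib_per_day * 10 / (24 * 60)))   # 69 -> ~6.9 MiB per minute
echo "$mib_per_hour $mib_per_min_tenths"
```

This matches the 1 MiB write every 8 seconds in the loop: 60 / 8 = 7.5 writes per minute, in the same ballpark as 6.9 MiB per minute.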
Changed the configuration value to 100% (first checked it with 50%, and it was as expected - half the size of the original disk), and it seems like the actual value still remains half.

Versions:
4.4.10-0.17.el8ev
vdsm-4.40.100.1-1.el8ev

Steps:
1) Created a VM from a template, attached 2 disks:
   - raw 10G
   - qcow2 10G

2) On the engine:

# engine-config -g "MaxBackupBlockScratchDiskInitialSizePercents"
Picked up JAVA_TOOL_OPTIONS: -Dcom.redhat.fips=false
MaxBackupBlockScratchDiskInitialSizePercents: 100 version: general

3) Performed a full backup of the VM:

# ./backup_vm.py -c engine full e7e354f9-d3a0-4427-a5c6-7a4c6da6f8a4
[ 0.0 ] Starting full backup for VM 'e7e354f9-d3a0-4427-a5c6-7a4c6da6f8a4'
[ 1.4 ] Waiting until backup '8cb32883-48c2-4e85-a3ff-e85ef42a007a' is ready
[ 17.6 ] Created checkpoint '111a64c2-3550-4d01-b1a8-069046980580' (to use in --from-checkpoint-uuid for the next incremental backup)
[ 17.7 ] Downloading full backup for disk 'e7aca0b1-f363-48ea-a493-0f7b28df7a10'
[ 18.8 ] Image transfer 'fc111217-ab52-4fb3-9ada-c5c5f5c9a799' is ready
[ 100.00% ] 10.00 GiB, 0.20 seconds, 50.98 GiB/s
[ 19.0 ] Finalizing image transfer
[ 22.1 ] Download completed successfully
[ 22.1 ] Downloading full backup for disk '0590cda2-06c5-4ddb-83eb-dc93813632be'
[ 23.2 ] Image transfer '5e38b668-021d-4aa2-a88c-8dca64364e25' is ready
[ 100.00% ] 10.00 GiB, 32.86 seconds, 311.66 MiB/s

While the backup was in the ready state, got the scratch disks' details:

<name>VM vm_test backup 8cb32883-48c2-4e85-a3ff-e85ef42a007a scratch disk for qcow2_vitio-scsi</name>
<description>Backup 8cb32883-48c2-4e85-a3ff-e85ef42a007a scratch disk</description>
<link href="/ovirt-engine/api/disks/4fad70a8-4e1a-469e-a826-c7559a0b20d3/permissions" rel="permissions"/>
<link href="/ovirt-engine/api/disks/4fad70a8-4e1a-469e-a826-c7559a0b20d3/disksnapshots" rel="disksnapshots"/>
<link href="/ovirt-engine/api/disks/4fad70a8-4e1a-469e-a826-c7559a0b20d3/statistics" rel="statistics"/>
<actual_size>4831838208</actual_size>
<alias>VM vm_test backup 8cb32883-48c2-4e85-a3ff-e85ef42a007a scratch disk for qcow2_vitio-scsi</alias>
<backup>none</backup>
<content_type>backup_scratch</content_type>
<format>cow</format>
<image_id>b42d9429-39b0-4b26-832b-822d8d00b088</image_id>
<propagate_errors>false</propagate_errors>
<provisioned_size>10737418240</provisioned_size>
<qcow_version>qcow2_v3</qcow_version>

<name>VM vm_test backup 8cb32883-48c2-4e85-a3ff-e85ef42a007a scratch disk for raw_vitio-scsi</name>
<description>Backup 8cb32883-48c2-4e85-a3ff-e85ef42a007a scratch disk</description>
<link href="/ovirt-engine/api/disks/ff4896bb-88cb-4526-b084-877aefe45b20/permissions" rel="permissions"/>
<link href="/ovirt-engine/api/disks/ff4896bb-88cb-4526-b084-877aefe45b20/disksnapshots" rel="disksnapshots"/>
<link href="/ovirt-engine/api/disks/ff4896bb-88cb-4526-b084-877aefe45b20/statistics" rel="statistics"/>
<actual_size>4831838208</actual_size>
<alias>VM vm_test backup 8cb32883-48c2-4e85-a3ff-e85ef42a007a scratch disk for raw_vitio-scsi</alias>
<backup>none</backup>
<content_type>backup_scratch</content_type>
<format>cow</format>
<image_id>1db5a769-8d55-4f72-a871-3b79849212cd</image_id>
<propagate_errors>false</propagate_errors>
<provisioned_size>10737418240</provisioned_size>
<qcow_version>qcow2_v3</qcow_version>

[ 56.1 ] Finalizing image transfer
[ 58.1 ] Download completed successfully
[ 58.1 ] Downloading full backup for disk '76e3d0c2-f1d0-43d6-a141-3fcacea6808c'
[ 59.2 ] Image transfer 'ecd2d945-5f02-4122-8cd0-1c5557c0a91b' is ready
[ 100.00% ] 10.00 GiB, 12.21 seconds, 838.58 MiB/s
[ 71.4 ] Finalizing image transfer
[ 74.5 ] Download completed successfully
[ 74.5 ] Finalizing backup
[ 81.7 ] Full backup '8cb32883-48c2-4e85-a3ff-e85ef42a007a' completed successfully

Expected results:
When configuring MaxBackupBlockScratchDiskInitialSizePercents=100, the scratch disks should have the same size as the provisioned disks being backed up.

Actual results:
The actual size is half of each backed-up disk.
After changing an engine configuration value, the engine should be restarted. Did you restart the engine after changing the percentage value?
Verified with two disks, one at a time, on the 3 main configurations (20% - the default, 50%, and 100%). The scratch disks are created according to the configuration. Note that the VM needed to be restarted in order to make it work (each time the value is configured).

Versions:
4.4.10-0.17.el8ev
vdsm-4.40.100.1-1.el8ev

Test cases:

1) raw disk:
<name>raw_disk</name>
<actual_size>10737418240</actual_size>
<alias>raw_disk</alias>
<format>raw</format>

2) qcow disk with the disk full of data:
<name>qcow_disk</name>
<actual_size>10737418240</actual_size>
<alias>qcow_disk</alias>

Results:

100%:

# engine-config -g "MaxBackupBlockScratchDiskInitialSizePercents"
Picked up JAVA_TOOL_OPTIONS: -Dcom.redhat.fips=false
MaxBackupBlockScratchDiskInitialSizePercents: 100 version: general

<name>VM vm1 backup e7cb4598-5696-490c-8f31-50f017d260f6 scratch disk for raw_disk</name>
<description>Backup e7cb4598-5696-490c-8f31-50f017d260f6 scratch disk</description>
<actual_size>11811160064</actual_size>
<alias>VM vm1 backup e7cb4598-5696-490c-8f31-50f017d260f6 scratch disk for raw_disk</alias>
<backup>none</backup>
<content_type>backup_scratch</content_type>
<format>cow</format>
<image_id>8838d5eb-d2b2-4999-8c2c-b601aaf70c44</image_id>
<propagate_errors>false</propagate_errors>
<provisioned_size>10737418240</provisioned_size>

<name>VM vm1 backup 942a8b9e-8bcb-43c2-b4b0-5e90c0177062 scratch disk for qcow_disk</name>
<description>Backup 942a8b9e-8bcb-43c2-b4b0-5e90c0177062 scratch disk</description>
<actual_size>11811160064</actual_size>
<alias>VM vm1 backup 942a8b9e-8bcb-43c2-b4b0-5e90c0177062 scratch disk for qcow_disk</alias>
<backup>none</backup>
<content_type>backup_scratch</content_type>
<format>cow</format>
<image_id>da99c094-c7bc-4b0c-8fb9-26a36f29a286</image_id>
<propagate_errors>false</propagate_errors>
<provisioned_size>10737418240</provisioned_size>

50%:

# engine-config -g "MaxBackupBlockScratchDiskInitialSizePercents"
Picked up JAVA_TOOL_OPTIONS: -Dcom.redhat.fips=false
MaxBackupBlockScratchDiskInitialSizePercents: 50 version: general

<name>VM vm1 backup eeb03c41-92ff-4bfd-9d3e-2acd41255c1a scratch disk for raw_disk</name>
<description>Backup eeb03c41-92ff-4bfd-9d3e-2acd41255c1a scratch disk</description>
<actual_size>5905580032</actual_size>
<alias>VM vm1 backup eeb03c41-92ff-4bfd-9d3e-2acd41255c1a scratch disk for raw_disk</alias>
<backup>none</backup>
<content_type>backup_scratch</content_type>
<format>cow</format>
<image_id>068eca0d-95ed-455e-8b99-348b29c242b4</image_id>
<propagate_errors>false</propagate_errors>
<provisioned_size>10737418240</provisioned_size>

<name>VM vm1 backup 31204c4b-cf40-440f-b99a-4e8ce520c38b scratch disk for qcow_disk</name>
<description>Backup 31204c4b-cf40-440f-b99a-4e8ce520c38b scratch disk</description>
<alias>VM vm1 backup 31204c4b-cf40-440f-b99a-4e8ce520c38b scratch disk for qcow_disk</alias>
<backup>none</backup>
<content_type>backup_scratch</content_type>
<format>cow</format>
<image_id>1b6f4dfb-eb12-497b-9cc5-b7e4865af597</image_id>
<propagate_errors>false</propagate_errors>
<provisioned_size>10737418240</provisioned_size>

20%:

# engine-config -g "MaxBackupBlockScratchDiskInitialSizePercents"
Picked up JAVA_TOOL_OPTIONS: -Dcom.redhat.fips=false
MaxBackupBlockScratchDiskInitialSizePercents: 20 version: general

<name>VM vm1 backup 3d6a5fe2-3844-49d5-950e-7fffc5023162 scratch disk for raw_disk</name>
<description>Backup 3d6a5fe2-3844-49d5-950e-7fffc5023162 scratch disk</description>
<actual_size>4831838208</actual_size>
<alias>VM vm1 backup 3d6a5fe2-3844-49d5-950e-7fffc5023162 scratch disk for raw_disk</alias>
<backup>none</backup>
<content_type>backup_scratch</content_type>
<format>cow</format>
<image_id>9d4b7636-d9a7-4da4-ada2-41a43e3bd9e6</image_id>
<propagate_errors>false</propagate_errors>
<provisioned_size>10737418240</provisioned_size>

<name>VM vm1 backup 54e12ef5-40d5-4f2e-929c-dfdd2b462c2c scratch disk for qcow_disk</name>
<description>Backup 54e12ef5-40d5-4f2e-929c-dfdd2b462c2c scratch disk</description>
<actual_size>4831838208</actual_size>
<alias>VM vm1 backup 54e12ef5-40d5-4f2e-929c-dfdd2b462c2c scratch disk for qcow_disk</alias>
<backup>none</backup>
<content_type>backup_scratch</content_type>
<format>cow</format>
<image_id>0c8a40a6-9006-49b7-834f-4394437b4cb3</image_id>
<propagate_errors>false</propagate_errors>
<provisioned_size>10737418240</provisioned_size>
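Converting the <actual_size> values reported in these runs to GiB makes them easier to compare: 11811160064, 5905580032 and 4831838208 bytes are 11, 5.5 and 4.5 GiB for the 100%, 50% and 20% runs respectively, against a 10 GiB provisioned size. A quick conversion (integer shell arithmetic, in tenths of a GiB):

```shell
# Convert the reported <actual_size> bytes to tenths of a GiB
# (1 GiB = 1073741824 bytes).
for size in 11811160064 5905580032 4831838208; do
    echo $((size * 10 / 1073741824))
done
# Prints 110, 55, 45 -> 11.0, 5.5 and 4.5 GiB.
```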
More info on how to test the case when the guest writes a large amount of data during the backup, and qemu copies the overwritten data to the scratch disk.

The most interesting test is a raw disk with a full backup, since in this case we must back up the entire disk. The backup will take longer, and the guest will have more time to write data to the disk during the backup.

1. Setup

Create a VM with:
- OS disk (e.g. thin disk on a template) (sda)
- Data disk - raw preallocated 50G (sdb)

Start the VM, and fill the data disk with data:

    dd if=/dev/zero bs=1M count=$((50*1024)) | tr "\0" "\1" > /dev/sdb
    sync

2. Start a full backup of the data disk

    ./backup_vm.py -c engine full \
        --disk-uuid sdb-disk-id \
        --backup-dir /path/to/backups \
        vm-id

Assuming that you have a 1 Gbit network to the VM, and you back up via the network (as some backup applications do), the backup will take about 500 seconds. Note that the backup will create a ~51 GiB qcow2 image in /path/to/backups/. You need to have enough space there.

3. Write data in the guest during the backup

Let's start with writing 20% of the disk during the backup - writing less data cannot be an issue with a scratch disk sized at 20% of the disk. So we want to write 10 GiB in 500 seconds. Writing 20 MiB should be quick, so with a 1 second sleep we will write ~20 MiB/s. We may need to tweak the numbers.

    for i in $(seq 500); do
        dd if=/dev/zero bs=20M count=1 seek=$i of=/dev/sdb oflag=direct conv=fsync
        sleep 1
    done
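The "about 500 seconds" estimate can be sanity-checked. Assuming a 1 Gbit/s link delivers roughly 110 MiB/s of usable payload (an assumption; real throughput varies with protocol overhead and storage speed):

```shell
# Rough backup-duration estimate for a 50 GiB disk over ~1 Gbit/s.
disk_gib=50
net_mib_s=110                       # assumed usable payload rate
seconds=$((disk_gib * 1024 / net_mib_s))
echo "${seconds}s"                  # ~465s, in line with "about 500 seconds"
```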
(In reply to Nir Soffer from comment #6)
> Adding some results of the TC Nir suggested to test:

Setting up the env:
1. Create a VM from a template
2. Attach a new raw disk, 20G
3. Run the VM
4. Set two hosts to maintenance (we need just one active host in order to slow down the backup; it is faster on our envs due to fast NIC capacity) - optional on different envs
5. On the engine: engine-config -s "MaxBackupBlockScratchDiskInitialSizePercents=20" - this is the default size (the engine needs a restart if the value is changed)

Test case:
1. Run a backup of the VM's disk from the active host (used --max-workers 1 to slow down the backup):

# ./backup_vm.py -c engine full --disk-uuid a029376c-72bf-4e0b-9ddd-df8d5a938449 --backup-dir /var/tmp/backups be2514cb-10d2-457e-86f1-cc77dc99cf83 --max-workers 1

2. Run the "writing.sh" script just after you see the message "Downloading full backup for disk 'ec0f8535-4d76-4a8f-b245-399114c5f0f0'"

Expected results:
Since the initial size of the scratch disk is 4G (as we configured 20% of 20G), we want to verify that writing 20% of the data during the backup does not pause the VM. When we increase the written size beyond the scratch disk's capacity (5G), the VM should pause; after the backup is done, the VM needs to be resumed manually (run the VM) if it's a raw disk.

The next step was to increase MaxBackupBlockScratchDiskInitialSizePercents to 30 and verify that the same write size no longer pauses the VM.

A qcow disk behaves differently: used the same raw disk, created a snapshot, and repeated the same flow. This time, after the VM paused, it resumed automatically.

Actual results:

Raw disk, MaxBackupBlockScratchDiskInitialSizePercents: 20%, initial scratch disk size: 4G

1. Writing 4G while backing up a full 20G disk of data:

# ./writing.sh
total=4090 MiB time=58 sec

# ./backup_vm.py -c engine full --disk-uuid a029376c-72bf-4e0b-9ddd-df8d5a938449 --backup-dir /var/tmp/backups be2514cb-10d2-457e-86f1-cc77dc99cf83 --max-workers 1
[ 0.0 ] Starting full backup for VM 'be2514cb-10d2-457e-86f1-cc77dc99cf83'
[ 0.3 ] Waiting until backup '9ea98fd6-c183-49e0-a585-dfb3a6ea26ef' is ready
[ 23.7 ] Downloading full backup for disk 'a029376c-72bf-4e0b-9ddd-df8d5a938449'
[ 24.8 ] Image transfer 'a42ca3d9-5d1b-41e9-93fa-ea1540e7c14c' is ready
[ 100.00% ] 20.00 GiB, 165.74 seconds, 123.57 MiB/s
[ 190.5 ] Finalizing image transfer
[ 201.7 ] Download completed successfully
[ 201.7 ] Finalizing backup
[ 208.9 ] Full backup '9ea98fd6-c183-49e0-a585-dfb3a6ea26ef' completed successfully

All passed successfully without pausing the VM.

2. Writing 5G while backing up a full 20G disk of data:

# ./writing.sh
total=5000 MiB

# ./backup_vm.py -c engine full --disk-uuid a029376c-72bf-4e0b-9ddd-df8d5a938449 --backup-dir /var/tmp/backups be2514cb-10d2-457e-86f1-cc77dc99cf83 --max-workers 1
[ 0.0 ] Starting full backup for VM 'be2514cb-10d2-457e-86f1-cc77dc99cf83'
[ 0.3 ] Waiting until backup 'd8cbdf4f-19f0-4e60-98c7-196866b147e0' is ready
[ 17.6 ] Downloading full backup for disk 'a029376c-72bf-4e0b-9ddd-df8d5a938449'
[ 18.7 ] Image transfer 'e979e0be-90ca-474e-a23b-6435a3cce7e5' is ready
[ 100.00% ] 20.00 GiB, 136.46 seconds, 150.08 MiB/s
[ 155.2 ] Finalizing image transfer
[ 164.4 ] Download completed successfully
[ 164.4 ] Finalizing backup
[ 171.6 ] Full backup 'd8cbdf4f-19f0-4e60-98c7-196866b147e0' completed successfully

In this case, the VM paused due to a "no Storage space" error. The VM wouldn't come back up on its own - needed to "run" the VM to resume it.

3. Raw disk, MaxBackupBlockScratchDiskInitialSizePercents: 30%, initial scratch disk size: 6G. Writing 5G:

# ./writing.sh
total=5000 MiB time=63 sec

# ./backup_vm.py -c engine full --disk-uuid a029376c-72bf-4e0b-9ddd-df8d5a938449 --backup-dir /var/tmp/backups be2514cb-10d2-457e-86f1-cc77dc99cf83 --max-workers 1
[ 0.0 ] Starting full backup for VM 'be2514cb-10d2-457e-86f1-cc77dc99cf83'
[ 0.9 ] Waiting until backup '0342b7e1-c532-4818-b665-df806ade4c8a' is ready
[ 12.3 ] Downloading full backup for disk 'a029376c-72bf-4e0b-9ddd-df8d5a938449'
[ 13.6 ] Image transfer '8d8fc4e3-68cf-49ff-9c2c-563aeea96163' is ready
[ 100.00% ] 20.00 GiB, 148.51 seconds, 137.90 MiB/s
[ 162.1 ] Finalizing image transfer
[ 170.2 ] Download completed successfully
[ 170.2 ] Finalizing backup
[ 176.6 ] Full backup '0342b7e1-c532-4818-b665-df806ade4c8a' completed successfully

Passed successfully - the VM didn't pause.

Moving on to testing a thin disk:

Thin disk, MaxBackupBlockScratchDiskInitialSizePercents: 20%, initial scratch disk size: 4G

4. Writing 4G while backing up a full 20G disk of data:

# ./writing.sh
total=4090 MiB time=95 sec

# ./backup_vm.py -c engine full --disk-uuid a029376c-72bf-4e0b-9ddd-df8d5a938449 --backup-dir /var/tmp/backups be2514cb-10d2-457e-86f1-cc77dc99cf83
[ 0.0 ] Starting full backup for VM 'be2514cb-10d2-457e-86f1-cc77dc99cf83'
[ 0.3 ] Waiting until backup 'ff94809c-aee8-40ea-9403-f30d75a4427d' is ready
[ 10.4 ] Created checkpoint 'ec901e0f-8bed-436f-9ca2-e04795fe6a43' (to use in --from-checkpoint-uuid for the next incremental backup)
[ 10.5 ] Downloading full backup for disk 'a029376c-72bf-4e0b-9ddd-df8d5a938449'
[ 11.6 ] Image transfer 'ea77c26d-d41a-430c-97c4-0191ac1f5241' is ready
[ 100.00% ] 20.00 GiB, 78.29 seconds, 261.59 MiB/s
[ 89.9 ] Finalizing image transfer
[ 97.0 ] Download completed successfully
[ 97.0 ] Finalizing backup
[ 104.3 ] Full backup 'ff94809c-aee8-40ea-9403-f30d75a4427d' completed successfully

Passed successfully - the VM didn't pause.

5. Writing 5G:

# ./writing.sh
total=5000 MiB time=96 sec

# ./backup_vm.py -c engine full --disk-uuid a029376c-72bf-4e0b-9ddd-df8d5a938449 --backup-dir /var/tmp/backups be2514cb-10d2-457e-86f1-cc77dc99cf83
[ 0.0 ] Starting full backup for VM 'be2514cb-10d2-457e-86f1-cc77dc99cf83'
[ 0.3 ] Waiting until backup '6fbc8125-fe1b-4a78-a3e8-c40cc8d03ccc' is ready
[ 17.5 ] Created checkpoint '02fe9dd1-5f78-44e2-bb9d-1dccf5a56da0' (to use in --from-checkpoint-uuid for the next incremental backup)
[ 17.6 ] Downloading full backup for disk 'a029376c-72bf-4e0b-9ddd-df8d5a938449'
[ 18.7 ] Image transfer '1f499698-4f36-4606-a641-59bd4a22f580' is ready
[ 100.00% ] 20.00 GiB, 81.54 seconds, 251.15 MiB/s
[ 100.2 ] Finalizing image transfer
[ 104.3 ] Download completed successfully
[ 104.3 ] Finalizing backup
[ 111.6 ] Full backup '6fbc8125-fe1b-4a78-a3e8-c40cc8d03ccc' completed successfully

The VM didn't pause because the disk had already been extended, so needed to increase the write size for the next run.

6. Writing 8G:

# ./writing.sh
total=8000 MiB time=112 sec

# ./backup_vm.py -c engine full --disk-uuid a029376c-72bf-4e0b-9ddd-df8d5a938449 --backup-dir /var/tmp/backups be2514cb-10d2-457e-86f1-cc77dc99cf83
[ 0.0 ] Starting full backup for VM 'be2514cb-10d2-457e-86f1-cc77dc99cf83'
[ 0.4 ] Waiting until backup '274c3878-c941-4bba-a57f-0fa0a737b218' is ready
[ 23.7 ] Created checkpoint '26eca85e-8066-4c23-b7d4-e88d6119f851' (to use in --from-checkpoint-uuid for the next incremental backup)
[ 23.8 ] Downloading full backup for disk 'a029376c-72bf-4e0b-9ddd-df8d5a938449'
[ 24.9 ] Image transfer 'c6d4444c-d520-4f78-af82-84c92263c81c' is ready
[ 100.00% ] 20.00 GiB, 83.48 seconds, 245.32 MiB/s
[ 108.4 ] Finalizing image transfer
[ 110.4 ] Download completed successfully
[ 110.4 ] Finalizing backup
[ 116.7 ] Full backup '274c3878-c941-4bba-a57f-0fa0a737b218' completed successfully

Audit log:
VM vm1 has been paused due to no Storage space error
VM vm1 has recovered from paused back to up.
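The source of the "writing.sh" helper used in the runs above is not shown in this bug. A hypothetical reconstruction, based on its output lines and the dd loops from the earlier comments (chunk size, dd options and the demo target are all assumptions):

```shell
#!/bin/sh
# Hypothetical sketch of writing.sh: write a given number of MiB to a
# target in 20 MiB chunks and report the total and elapsed time.
write_mib() {
    total=$1
    target=$2
    start=$(date +%s)
    i=0
    while [ $((i * 20)) -lt "$total" ]; do
        # Write 20 MiB at offset i*20 MiB; fsync so the data hits storage.
        dd if=/dev/zero bs=1M count=20 seek=$((i * 20)) of="$target" \
            conv=notrunc,fsync status=none
        i=$((i + 1))
    done
    echo "total=${total} MiB time=$(($(date +%s) - start)) sec"
}

# Demo against a regular file; in the guest the target would be /dev/sdb.
rm -f /tmp/writing-demo.img
write_mib 40 /tmp/writing-demo.img
```

In the guest this would be run as a script against the raw data disk, with oflag=direct added to bypass the page cache as in the earlier loops.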
This bugzilla is included in the oVirt 4.4.10 release, published on January 18th 2022.

Since the problem described in this bug report should be resolved in the oVirt 4.4.10 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.