Description of problem:

While a VM is being backed up, its power state cannot be changed. Hypervisor administrators usually cannot control what users do with their VMs, and users, in turn, have no access to the hypervisor administration. This means the backup administrator cannot ask users not to stop or start their VMs during the backup cycle, and users do not know when a backup is running. Ideally, users should not have to care about the backup process at all; it should be transparent to them.

The current implementation creates the following cases:

1) VM is on -> backup starts -> user powers off the VM -> the backup fails. This case is not terrible, but it has several problems:
   a. The user cannot power off the VM from the hypervisor console.
   b. The backup can be interrupted by a user action, so the administrator cannot control it.
   c. The backup flow can lose a large amount of time just because of a careless user action.

2) VM is off -> backup starts -> the user cannot power on the VM until the backup finishes. This is the worse case:
   a. Before powering on the VM, the user must either ask the backup administrator to stop the backup or wait until the backup finishes. Either action takes time, and if the VM holds critical business data, the downtime can be very expensive.

These problems may not be serious for small deployments and basic systems, but they are severe for large customers. In such environments a backup of a single VM can take hours, sometimes days. For example, terminating a 5-hour VM backup by a user action is really bad and can completely break the backup window schedule. On the other hand, a VM that stays locked for hours looks bad to the user. A large customer may back up thousands of VMs in a single backup flow, and coordinating with users about their VMs would be a nightmare for the backup administrator.

How reproducible:
always
Trying to extract functional requirements from comment 0.

1. Online backup - when a VM is undergoing an online backup, the user should be able to power off the VM without interrupting the backup.
2. Offline backup - when a VM is undergoing an offline backup, the user should be able to power on the VM without interrupting the backup.
3. Powering off from within the guest during an online backup should not interrupt the backup.

Additional requirements not mentioned in comment 0:

4. Migration - when a VM is undergoing an online backup, the system or the user should be able to migrate the VM to another host. An example use case is an HA VM that the system tries to keep available.
5. HA VM termination - when an HA VM loses its storage lease, sanlock will terminate the VM. If the VM was running a backup, the backup should not be interrupted.

Yuri, do you have anything to add to these requirements?
Hello Nir,

> Yuri, do you have anything to add to these requirements?

Thank you, there isn't anything to add from me.
Most of the work is in engine, but to enable this we need a small API change in vdsm, allowing creation of a snapshot with a new bitmap.

https://github.com/oVirt/vdsm/pull/86
*** Bug 1994663 has been marked as a duplicate of this bug. ***
The only disadvantage that I see here is that we have a snapshot involved again, which causes I/O to commit the snapshot at the end of the backup.

With the scratch disk method there was no commit at the end (the scratch disk was just wiped), which could be an advantage over snapshots for disks with a lot of changes during the backup frame.
(In reply to Jean-Louis Dupond from comment #5)
> The only disadvantage that I see here is that we have a snapshot involved
> again, which causes I/O to commit the snapshot at the end of the backup.
> With the scratch disk method there was no commit at the end (the scratch
> disk was just wiped), which could be an advantage over snapshots for disks
> with a lot of changes during the backup frame.

True, the new way introduces a possibly slow snapshot deletion at the end of the backup. But with this disadvantage we get a lot of advantages:

- Can start, stop, migrate, and snapshot a VM during backup
- Can start a backup in most VM states
- Have only one kind of backup
- Backup I/O does not affect guest I/O
- Guest I/O does not affect backup I/O
- No scratch disks, no pauses
- Simpler flow on the engine side
- Does not interfere with user snapshots like the old snapshot-based backup did

We have a stress test for the new backup mode here:
https://gitlab.com/nirs/ovirt-stress/-/tree/master/backup

We did many runs in the last week, doing around 15,000 backups without any issue in the actual backup.

The engine API should allow the user to disable the snapshot-based backup and fall back to the previous snapshot-less way, with the risk of pausing VMs during the backup if a scratch disk becomes full.

Benny, can you explain how the snapshot is disabled in the current API?
We have a config value that can be toggled:

$ engine-config -s UseHybridBackup=false

can be used to switch to the existing backup mechanism that does not use snapshots.
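For reference, the current value can be checked the same way before toggling it (assuming a standard engine-config setup; a restart of the ovirt-engine service is typically needed after changing a config value):

$ engine-config -g UseHybridBackup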
(In reply to Benny Zlotnik from comment #7)
> We have a config value that can be toggled:
>
> $ engine-config -s UseHybridBackup=false

This is good for globally disabling the feature by the system admin, but it does not give enough power to the backup application. I think we need a way to disable the mechanism per backup call.

We discussed an option like:

POST /ovirt-engine/api/vms/vm-id/backups

<backup>
    <from_checkpoint_id>checkpoint-id</from_checkpoint_id>
    <use_snapshot>true</use_snapshot>
    <disks>
        <disk id="disk-id" />
        ...
    </disks>
</backup>

If the backup was started with the use_snapshot option, it will report the snapshot during the backup:

GET /ovirt-engine/api/vms/vm-id/backups/backup-id

<backup>
    <from_checkpoint_id>checkpoint-id</from_checkpoint_id>
    <use_snapshot>true</use_snapshot>
    <snapshot id="snapshot-id"/>
    <disks>
        <disk id="disk-id" />
        ...
    </disks>
</backup>

Yuri, what do you think?
Hello Nir,

I think it's a good idea to have the possibility to change the backup type in the backup request. But let's keep the new backup as the default, so the app doesn't have to pass any option to use it (<use_snapshot> is always true if the app doesn't change it).

If some backup app would like to use the old method, it must pass something like <use_snapshot>false</use_snapshot>.

Thanks
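A minimal sketch of how a backup application might opt out of the snapshot-based backup per call, using the REST API shape proposed in comment 8. Note that the <use_snapshot> element was still under discussion at this point, and the engine URL, credentials, VM/disk/checkpoint ids, and CA path below are placeholders, not values from this bug:

# Sketch: start a backup over the oVirt REST API, opting out of the
# snapshot-based mechanism for this one call via the proposed element.
import requests

ENGINE = "https://engine.example.com/ovirt-engine/api"   # placeholder engine URL
VM_ID = "vm-id"                                          # placeholder VM id

# Request body using the proposed elements; use_snapshot=false selects the
# previous snapshot-less backup for this call only.
body = """\
<backup>
    <from_checkpoint_id>checkpoint-id</from_checkpoint_id>
    <use_snapshot>false</use_snapshot>
    <disks>
        <disk id="disk-id"/>
    </disks>
</backup>
"""

resp = requests.post(
    f"{ENGINE}/vms/{VM_ID}/backups",
    data=body,
    headers={"Content-Type": "application/xml"},
    auth=("admin@internal", "password"),                 # placeholder credentials
    verify="/etc/pki/ovirt-engine/ca.pem",               # engine CA certificate
)
resp.raise_for_status()
# The response includes the backup id (and the snapshot id when snapshot-based).
print(resp.text)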
Verified on engine-4.5.0-0.237.el8ev
Can you please update doctext?
This bugzilla is included in oVirt 4.5.0 release, published on April 20th 2022. Since the problem described in this bug report should be resolved in oVirt 4.5.0 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.