Bug 2053669 - [RFE] Allow changing vm powerstate during backup operation without interrupting the backup
Summary: [RFE] Allow changing vm powerstate during backup operation without interrupting the backup
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.4.9.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.5.0
Target Release: 4.5.0
Assignee: Nir Soffer
QA Contact: Evelina Shames
URL:
Whiteboard:
Duplicates: 1994663
Depends On:
Blocks:
 
Reported: 2022-02-11 17:46 UTC by Yury.Panchenko
Modified: 2022-08-23 19:41 UTC
CC List: 7 users

Fixed In Version: ovirt-engine-4.5.0
Clone Of:
Environment:
Last Closed: 2022-04-20 06:33:59 UTC
oVirt Team: Storage
Embargoed:
sbonazzo: ovirt-4.5+
eshames: testing_plan_complete+
pm-rhel: planning_ack?
pm-rhel: devel_ack+
pm-rhel: testing_ack?




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github oVirt ovirt-engine pull | 107 | 0 | None | Merged | core: hybrid backup | 2022-03-15 13:02:48 UTC
Github oVirt vdsm pull | 86 | 0 | None | Merged | Support adding a bitmap to new volume | 2022-03-09 12:41:57 UTC
Red Hat Bugzilla | 1994663 | 1 | high | CLOSED | [RFE][CBT][UX] Let's allow to change VM power state during a backup operation and cancel the current running backup with... | 2022-03-13 15:13:13 UTC
Red Hat Issue Tracker | RHV-44685 | 0 | None | None | None | 2022-02-11 17:51:49 UTC

Description Yury.Panchenko 2022-02-11 17:46:04 UTC
Description of problem:
If a VM is in the backup state, its power state cannot be changed.

Usually, hypervisor administrators can't control what users do with their VMs. Conversely, users don't have access to the hypervisor administration.
So a backup administrator can't ask users not to stop or start their VMs during the backup cycle, and users don't know when a backup is running. Ideally, users should not have to care about the backup process; it should be transparent to them.


The current implementation leads to the following cases:
1)	VM is on -> backup -> user powers off the VM -> the backup fails. This case isn't bad, but there are a few problems:
a.	The user can't power off the VM from the hypervisor console
b.	The backup can be interrupted by the user, so the administrator can't control it
c.	A large amount of backup time can be lost through a single careless user action
2)	VM is off -> backup -> user can't power on the VM until the backup finishes. This is the worse case:
a.	Before powering on the VM, the user must either ask the backup administrator to stop the backup or wait until it finishes. Either option takes time, and if the VM holds business-critical data, the downtime can be very costly.

These problems might not be serious for small businesses and basic systems, but they are significant for large customers.
In such systems, backing up a single VM can take hours or even days.
For example, having a 5-hour VM backup terminated by a user action is very bad; it can completely break the backup window schedule.
On the other hand, a VM that is locked for hours looks bad to the user.
Large customers may back up thousands of VMs during one backup flow; communicating with users about their VMs would be a nightmare for the backup administrator.



How reproducible:
always

Comment 1 Nir Soffer 2022-02-11 19:46:30 UTC
Trying to extract functional requirements from comment 0.

1. Online backup - when VM is during online backup, user should be able to power
   off the VM without interrupting the backup.

2. Offline backup - when VM is during offline backup, user should be able to power
   on the VM without interrupting the backup.

3. Power off within the guest during online backup should not interrupt the backup.

Additional requirement not mentioned in comment 0:

4. Migration - when the VM is during online backup, the system or the user should
   be able to migrate the VM to another host. An example use case is an HA VM that
   the system tries to keep available.

5. HA VM termination - when an HA VM loses its storage lease, sanlock will terminate
   the VM. If the VM was running a backup, the backup should not be interrupted.

Yuri, do you have anything to add to these requirements?

Comment 2 Yury.Panchenko 2022-02-14 12:23:19 UTC
Hello Nir

> Yuri, do you have anything to add to these requirements?
Thank you, there is nothing to add from me.

Comment 3 Nir Soffer 2022-02-28 14:55:54 UTC
Most of the work is in the engine, but to enable this we need a small API
change in vdsm that allows creating a snapshot with a new bitmap.
https://github.com/oVirt/vdsm/pull/86

Comment 4 Arik 2022-03-13 12:52:13 UTC
*** Bug 1994663 has been marked as a duplicate of this bug. ***

Comment 5 Jean-Louis Dupond 2022-03-16 13:06:07 UTC
The only disadvantage I see here is that snapshots are involved again, which causes I/O to commit the snapshot at the end.
With the scratch disk method there was no commit at the end (the scratch disk was just wiped), which could be an advantage over snapshots for disks with a lot of changes during the backup window.

Comment 6 Nir Soffer 2022-03-16 13:48:55 UTC
(In reply to Jean-Louis Dupond from comment #5)
> The only disadvantage I see here is that snapshots are involved
> again, which causes I/O to commit the snapshot at the end.
> With the scratch disk method there was no commit at the end (the
> scratch disk was just wiped), which could be an advantage over snapshots
> for disks with a lot of changes during the backup window.

True, the new way introduces a possibly slow snapshot deletion at the end of the
backup.

But in exchange for this disadvantage we get a lot of advantages:

- Can start, stop, migrate, snapshot a VM during backup
- Can start a backup in most VM states
- Have only one kind of backup
- Backup I/O does not affect guest I/O
- Guest I/O does not affect backup I/O
- No scratch disks, no pauses
- Simpler flow on engine side
- Does not interfere with user snapshots, unlike the old snapshot-based backup

We have a stress test for the new backup mode here:
https://gitlab.com/nirs/ovirt-stress/-/tree/master/backup

We did many runs in the last week, completing around 15,000 backups without any
issues in the actual backup.

The engine API should allow the user to disable the snapshot-based backup and use
the previous snapshot-less way, with the risk of pausing VMs during backup
if the scratch disk becomes full.

Benny, can you explain how the snapshot is disabled in current API?

Comment 7 Benny Zlotnik 2022-03-17 12:18:42 UTC
We have a config value that can be toggled:

   $ engine-config -s UseHybridBackup=false

This can be used to switch to the existing backup mechanism, which does not use snapshots.

Comment 8 Nir Soffer 2022-03-17 12:45:53 UTC
(In reply to Benny Zlotnik from comment #7)
> We have a config value that can be toggled:
> 
>    $ engine-config -s UseHybridBackup=false

This is good for a system admin who wants to disable the feature globally,
but it does not give enough control to the backup application.

I think we need a way to disable the mechanism per backup call.

We discussed an option like:

POST /ovirt-engine/api/vms/vm-id/backups

<backup>
  <from_checkpoint_id>checkpoint-id</from_checkpoint_id>
  <use_snapshot>true</use_snapshot>
  <disks>
      <disk id="disk-id" />
      ...
  </disks>
</backup>

If the backup was started with the use_snapshot option, it will report
the snapshot during the backup:

GET /ovirt-engine/api/vms/vm-id/backups/backup-id

<backup>
  <from_checkpoint_id>checkpoint-id</from_checkpoint_id>
  <use_snapshot>true</use_snapshot>
  <snapshot id="snapshot-id"/>
  <disks>
      <disk id="disk-id" />
      ...
  </disks>
</backup>


Yuri, what do you think?
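The request and response shapes proposed in comment 8 can be exercised with a short sketch using Python's standard-library ElementTree: one helper builds the proposed POST body, another reads the snapshot id back from a GET response. The element names (`from_checkpoint_id`, `use_snapshot`, `snapshot`, `disks`) are taken from the proposal in that comment, not from a confirmed engine API; the helper names and ids are hypothetical, and no HTTP calls are made.

```python
# Hypothetical helpers; element names follow the proposal in comment 8,
# not a confirmed engine API.
import xml.etree.ElementTree as ET
from typing import List, Optional


def build_backup_request(
    disk_ids: List[str],
    from_checkpoint_id: Optional[str] = None,
    use_snapshot: Optional[bool] = None,
) -> str:
    """Build a <backup> body for POST /ovirt-engine/api/vms/vm-id/backups."""
    backup = ET.Element("backup")
    if from_checkpoint_id is not None:
        # Incremental backup: start from an existing checkpoint.
        ET.SubElement(backup, "from_checkpoint_id").text = from_checkpoint_id
    if use_snapshot is not None:
        # Sent only when the caller explicitly picks a backup mode.
        ET.SubElement(backup, "use_snapshot").text = "true" if use_snapshot else "false"
    disks = ET.SubElement(backup, "disks")
    for disk_id in disk_ids:
        ET.SubElement(disks, "disk", id=disk_id)
    return ET.tostring(backup, encoding="unicode")


def snapshot_id_from_response(xml_text: str) -> Optional[str]:
    """Return the id of <snapshot> in a GET backup response, or None if absent."""
    backup = ET.fromstring(xml_text)
    snapshot = backup.find("snapshot")
    return snapshot.get("id") if snapshot is not None else None
```

For example, `build_backup_request(["disk-id"], from_checkpoint_id="checkpoint-id", use_snapshot=True)` yields the POST body shown above (without the `...` placeholder). Leaving `use_snapshot` out unless the caller sets it keeps the request minimal, which fits the default Yury asks for in comment 11.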

Comment 11 Yury.Panchenko 2022-03-30 09:36:40 UTC
Hello Nir,
I think it's a good idea to be able to change the backup type in the backup request.
But let's keep the new backup as the default, so the app doesn't have to pass any option to use it (the <use_snapshot> option is always true unless the app changes it).
If a backup app wants to use the old method, it must pass something like <use_snapshot>false</use_snapshot>.
thanks
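The defaulting rule suggested in comment 11 (snapshot-based backup unless the request explicitly opts out) can be sketched as follows. This illustrates the proposal only, not the engine's implementation; the function name is hypothetical, and the sketch deliberately ignores the global `UseHybridBackup` setting from comment 7, since the thread does not say how the two would interact.

```python
import xml.etree.ElementTree as ET

# XML Schema boolean lexical forms.
_TRUE = {"true", "1"}
_FALSE = {"false", "0"}


def effective_use_snapshot(request_xml: str) -> bool:
    """Resolve the backup mode requested by a <backup> body.

    Hypothetical helper: a missing <use_snapshot> means the new
    snapshot-based backup; only an explicit false value selects the old
    snapshot-less (scratch disk) backup. UseHybridBackup is not modeled.
    """
    value = ET.fromstring(request_xml).findtext("use_snapshot")
    if value is None:
        return True  # option omitted: keep the new backup as the default
    normalized = value.strip()
    if normalized in _TRUE:
        return True
    if normalized in _FALSE:
        return False
    raise ValueError(f"invalid use_snapshot value: {value!r}")
```

Rejecting unknown values, instead of silently treating them as true, keeps a typo in a backup application from quietly selecting a mode it did not ask for.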

Comment 12 Evelina Shames 2022-04-12 11:17:28 UTC
Verified on engine-4.5.0-0.237.el8ev

Comment 13 Sandro Bonazzola 2022-04-15 09:33:54 UTC
Can you please update doctext?

Comment 14 Sandro Bonazzola 2022-04-20 06:33:59 UTC
This bugzilla is included in oVirt 4.5.0 release, published on April 20th 2022.

Since the problem described in this bug report should be resolved in oVirt 4.5.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

