Bug 1952577 - [CBT] Preview to older snapshot breaks vm backup [NEEDINFO]
Summary: [CBT] Preview to older snapshot breaks vm backup
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Backup-Restore.VMs
Version: 4.4.5.11
Hardware: Unspecified
OS: Unspecified
unspecified
high vote
Target Milestone: ovirt-4.4.7
: ---
Assignee: Eyal Shenitzky
QA Contact: sshmulev
bugs@ovirt.org
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-04-22 15:13 UTC by Yury.Panchenko
Modified: 2021-08-02 08:19 UTC (History)
9 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When the VM snapshot was restored the VM backup chain got broken, which might lead to backup corruption. Fix: When a VM snapshot restored, all the VM backups/checkpoints become invalid and removed from the VM and the engine database. Result: When a VM snapshot restored a full backup should be taken for the VM to start new VM backup chain.
Clone Of:
Environment:
Last Closed: 2021-07-06 07:28:31 UTC
oVirt Team: Storage
nsoffer: needinfo? (oliel)
pm-rhel: ovirt-4.4+
sshmulev: testing_plan_complete+


Attachments (Terms of Use)
Backup logs (3.78 MB, application/zip)
2021-04-22 15:13 UTC, Yury.Panchenko
no flags Details
Logs for the subcase (1.45 MB, application/zip)
2021-05-11 10:46 UTC, Yury.Panchenko
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 114885 0 master MERGED core: block VM backup for VM in preview mode 2021-05-25 10:24:16 UTC
oVirt gerrit 114892 0 master MERGED Checkpoint: add 'state' field for a checkpoint 2021-05-25 08:05:34 UTC
oVirt gerrit 114893 0 master MERGED core: add checkpoint states 2021-06-22 14:50:51 UTC
oVirt gerrit 114895 0 master MERGED webadmin: add a warning for invalidating VM checkpoints 2021-06-22 14:50:55 UTC
oVirt gerrit 114913 0 master MERGED remove_bitmap_test.py: add tests for bitmap removal 2021-06-16 11:22:06 UTC
oVirt gerrit 114914 0 master ABANDONED remove_bitmap.py: remove the bitmap from all the volumes in the chain 2021-06-07 04:32:00 UTC
oVirt gerrit 114915 0 master MERGED caps.py: add support for bitmap removal 2021-06-16 11:22:14 UTC
oVirt gerrit 114943 0 master MERGED core: add clear_bitmaps_enabled field to VDS 2021-06-23 11:55:18 UTC
oVirt gerrit 114946 0 master MERGED core: allow removing a VM checkpoint for non-running VMs 2021-06-22 14:51:09 UTC
oVirt gerrit 115010 0 master MERGED core: introduce DeleteAllVmCheckpointsCommand 2021-06-23 11:55:28 UTC
oVirt gerrit 115011 0 master MERGED core: remove all VM checkpoints when restoring a snapshot. 2021-06-23 11:55:31 UTC
oVirt gerrit 115015 0 master MERGED upgrade to model 4.4.30 2021-06-01 13:21:35 UTC
oVirt gerrit 115053 0 master ABANDONED volume_info.py: allow lock all volume chain 2021-06-07 04:31:53 UTC
oVirt gerrit 115093 0 master MERGED bitmaps.py: add clear all volume bitmaps operation 2021-06-16 11:22:09 UTC
oVirt gerrit 115094 0 master MERGED clear_bitmaps.py: introduce clear_bitmaps job 2021-06-16 11:22:11 UTC
oVirt gerrit 115097 0 master MERGED core: introduce ClearVolumeBitmapsVDSCommand 2021-06-23 11:55:20 UTC
oVirt gerrit 115098 0 master MERGED core: introduce ClearVolumeBitmapsCommand 2021-06-23 11:55:23 UTC
oVirt gerrit 115241 0 master MERGED upgrade to model 4.4.31 2021-06-14 10:00:05 UTC

Description Yury.Panchenko 2021-04-22 15:13:40 UTC
Created attachment 1774523 [details]
Backup logs

Description of problem:
VM backup will always fail if user starts preview to older snapshot

Version-Release number of selected component (if applicable):
Ovirt-4.4.5
vdsm-4.40.50.10-1.el8ev.x86_64
ovirt-imageio-daemon-2.1.1-1.el8ev.x86_64
libvirt-daemon-6.6.0-13.2.module+el8.3.1+10483+85317cf0.x86_64
qemu-kvm-5.1.0-21.module+el8.3.1+10464+8ad18d1a.x86_64


How reproducible:


Steps to Reproduce:
1.Make full vm backup
2.Create snapshot snap1
3.Make incremental backup
4.Create snapshot snap2
5.Make incremental backup
6.Preview snapshot snap1
7.Start vm backup

Actual results:
Vm backup failed on every next run

Expected results:
Full vm backup started

Additional info:
[]VDSError:{code='null', message='checkpoint inconsistent: missing or broken bitmap 'd62d5072-4ef0-4397-9802-e9d6c91f819e' for disk 'sda''}, log id: 64ac0bd5

2021-04-22 12:55:00,858+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-231) [ed1d3e40-cd6c-4c2c-a258-ff322f31e798] EVENT_ID: VM_INCREMENTAL_BACKUP_FAILED_FULL_VM_BACKUP_NEEDED(10,791), Incremental backup 29c5900a-4a23-40b5-b7f3-c96b714f687c for VM Cent8-t1_restored2 failed. Full backup is now needed (User: admin@internal-authz).
2021-04-22 12:55:00,874+02 WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (default task-231) [ed1d3e40-cd6c-4c2c-a258-ff322f31e798] Trying to release a shared lock for key: '16a6c58b-4fbf-4862-8fd7-e71b4e70d5e9VM' , but lock does not exist
2021-04-22 12:55:00,874+02 WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (default task-231) [ed1d3e40-cd6c-4c2c-a258-ff322f31e798] Trying to release exclusive lock which does not exist, lock key: 'e432953d-3ba9-4b71-bf6a-51390d281b83DISK'
2021-04-22 12:55:00,874+02 INFO

Comment 1 Eyal Shenitzky 2021-04-25 11:32:05 UTC
Hi Yury,

Can you please elaborate on what is the specific flow.

It seems to me that you are trying to start an incremental backup while the VM is in 'preview' that is based on a checkpoint that was take only in a later snapshot and basically doesn't exists.

For example:

Steps to Reproduce:
1.Make full VM backup - backup_1
2.Create snapshot snap1
3.Make an incremental backup - backup_2 (based on backup_1)
4.Create snapshot snap2
5.Make an incremental backup - backup_3 (based on backup_2)
6.Preview snapshot snap1 -> contains only backup_1 checkpoint
7.Start VM incremental backup - backup_4 (based on backup_3) ==> Failed, which is expected since the checkpoint of backup_3 doesn't exists when previewing snap1, only backup_1 checkpoint.

Comment 2 Yury.Panchenko 2021-04-27 12:23:53 UTC
Hello, Eyal.

> It seems to me that you are trying to start an incremental backup while the VM is in 'preview' that is based on a checkpoint that was take  > only in a later snapshot and basically doesn't exists.
You're right. But, backup application can't control user actions, and it doesn't know anything about what snapshot does user select.

I don't expect error in this way. if vm can't be backed incremnetal, let's do a full backup

Comment 3 Eyal Shenitzky 2021-05-06 07:04:28 UTC
(In reply to Yury.Panchenko from comment #2)
> I don't expect error in this way. if vm can't be backed incremnetal, let's
> do a full backup

We cannot determine for the user / backup application the type of the backup that will be taken in case of a failure.

In this case, you should identify the specific error (VM_INCREMENTAL_BACKUP_FAILED_FULL_VM_BACKUP_NEEDED(10,791)) and retry to take a full backup for the VM.

There is one problem that I am thinking of, In case of a VM in preview and a failure to redefine a checkpoint as you mentioned, the entire checkpoints chain will be removed.
I am not sure if we need to avoid it or not in case of a VM in preview mode, failure can be due to a real issue also.

Nir, what do you think?

Comment 4 Yury.Panchenko 2021-05-07 15:12:23 UTC
Hello, Eyal.
In this case, i've had another problem with error output.
That event: VM_INCREMENTAL_BACKUP_FAILED_FULL_VM_BACKUP_NEEDED(10,791), doesn't return in the web response.
In our app it looks like 
[2021-05-05] [16:11:13.861] [162498] [Info ] [RestCli] Response data (400 Bad Request): 
[2021-05-05] [16:11:13.862] [162498] [Debug] [JsonReporter] Event: 2, Text: Creating incremental backup failed: Reason: 'Operation Failed', Detail: '[]'
Can you check that please?


> We cannot determine for the user / backup application the type of the backup that will be taken in case of a failure.
What option does user have? He want's to create backup. 
Of course, he will restart backup operation and make full. I don't see any other option.

Comment 5 Eyal Shenitzky 2021-05-09 06:04:53 UTC
(In reply to Yury.Panchenko from comment #4)
> Hello, Eyal.
> In this case, i've had another problem with error output.
> That event: VM_INCREMENTAL_BACKUP_FAILED_FULL_VM_BACKUP_NEEDED(10,791),
> doesn't return in the web response.
> In our app it looks like 
> [2021-05-05] [16:11:13.861] [162498] [Info ] [RestCli] Response data (400
> Bad Request): 
> [2021-05-05] [16:11:13.862] [162498] [Debug] [JsonReporter] Event: 2, Text:
> Creating incremental backup failed: Reason: 'Operation Failed', Detail: '[]'
> Can you check that please?

The error will not return in the response, it should be fetched from the engine events.
The events will include the created backup ID and can be filtered according to it.

You can see and en example for it using the Python SDK - 
https://github.com/oVirt/ovirt-engine-sdk/blob/master/sdk/examples/backup_vm.py#L437

> 
> 
> > We cannot determine for the user / backup application the type of the backup that will be taken in case of a failure.
> What option does user have? He want's to create backup. 
> Of course, he will restart backup operation and make full. I don't see any
> other option.

Right, but the user needs to initiate it, we cannot add those assumptions.

Comment 6 Nir Soffer 2021-05-10 17:53:41 UTC
(In reply to Eyal Shenitzky from comment #3)
Preview is a very special case. Basically we have 2 backup systems managing
the same vm at the same time.

Assuming we have vm with these snapshots:

  A < B < C

During preview the chain may be:

  A < D
    < B < C

Creating incremental backup at this point is likey to create corrupted
backup.

Creating full backup in this state is correct, but this backup cannot be
used if the user undo the preview operation and the VM go back to the
original state:

  A < B < C

Preview is a temporary state. It can end in 2 ways:

1. User commit the operation - we drop the rest of the chain:

  A < D

In this case the backup application must create new full backup.

2. User undo the operation - we drop the preview volume

  A < B < C

In this case the backup application can continue to do incremental
backup.

So I think the right way is to block backup in preview state. Full or
incremental backup will fail with special error (VM_IN_PREVIEW).

This can be implemented in 4.4.7. Until we implement this the backup
application can check using the API if the vm is in preview mode.

Comment 7 Yury.Panchenko 2021-05-11 10:45:00 UTC
Hello Nir and Eyal,

> So I think the right way is to block backup in preview state. Full or
> incremental backup will fail with special error (VM_IN_PREVIEW).
> This can be implemented in 4.4.7. Until we implement this the backup
> application can check using the API if the vm is in preview mode.

I can't agree with that. 
Backup application can't resolve that situation. 
So, it adds abbility to lost critical data when vm is in preview mode.

App can do a full backup in preview mode. if user reverts snapshot - ok, then app will do new full backup.
If user commits snapshot the app just continue with incremental backup.


I've got an additional subcase -
So, I repeated the orignal case but in the end, vm left in poweroff state.
After that, i start next incremental backup
The backup created and start NBD server failed
2021-05-11 11:45:21,360+02 INFO  [org.ovirt.engine.core.bll.storage.backup.StartVmBackupCommand] (default task-1442) [22ea70bb-b1e6-4896-8044-7c74199e2eac] Running command: StartVmBackupCommand internal: false. Entities affected :  ID: 323c918c-ceb5-47f2-aaea-26a1249fd2bd Type: VMAction group BACKUP_DISK with role type ADMIN,  ID: 8113d4f9-7792-4e02-bfc9-0fccfb1cd1f3 Type: DiskAction group BACKUP_DISK with role type ADMIN

2021-05-11 11:45:40,022+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-1443) [92c2ba48-72c0-4e6a-afdb-7e98d10f490e] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM pan-rhv44rel1 command StartNbdServerVDS failed: Bitmap does not exist: "{'bitmap': '1ccb732b-a7fe-4c97-afe6-67f4f33e5cc8'}"

Comment 8 Yury.Panchenko 2021-05-11 10:46:08 UTC
Created attachment 1781972 [details]
Logs for the subcase

Comment 9 Nir Soffer 2021-05-11 12:31:52 UTC
(In reply to Yury.Panchenko from comment #7)
> Hello Nir and Eyal,
> 
> > So I think the right way is to block backup in preview state. Full or
> > incremental backup will fail with special error (VM_IN_PREVIEW).
> > This can be implemented in 4.4.7. Until we implement this the backup
> > application can check using the API if the vm is in preview mode.
> 
> I can't agree with that. 
> Backup application can't resolve that situation. 
> So, it adds abbility to lost critical data when vm is in preview mode.

VM in preview mode is like a *new* vm being created. No critical data
is added in preview mode. When a user commit, the vm becomes a normal
vm and at this point you can backup the vm.

> App can do a full backup in preview mode. if user reverts snapshot - ok,
> then app will do new full backup.
> If user commits snapshot the app just continue with incremental backup.

I don't think it is useful to try to backup vm in temporary state, and
the way preview mode is implemented will make this hard.

> I've got an additional subcase -
> So, I repeated the orignal case but in the end, vm left in poweroff state.
> After that, i start next incremental backup
> The backup created and start NBD server failed
> 2021-05-11 11:45:21,360+02 INFO 
> [org.ovirt.engine.core.bll.storage.backup.StartVmBackupCommand] (default
> task-1442) [22ea70bb-b1e6-4896-8044-7c74199e2eac] Running command:
> StartVmBackupCommand internal: false. Entities affected :  ID:
> 323c918c-ceb5-47f2-aaea-26a1249fd2bd Type: VMAction group BACKUP_DISK with
> role type ADMIN,  ID: 8113d4f9-7792-4e02-bfc9-0fccfb1cd1f3 Type: DiskAction
> group BACKUP_DISK with role type ADMIN
> 
> 2021-05-11 11:45:40,022+02 ERROR
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (default task-1443) [92c2ba48-72c0-4e6a-afdb-7e98d10f490e] EVENT_ID:
> VDS_BROKER_COMMAND_FAILURE(10,802), VDSM pan-rhv44rel1 command
> StartNbdServerVDS failed: Bitmap does not exist: "{'bitmap':
> '1ccb732b-a7fe-4c97-afe6-67f4f33e5cc8'}"

I'm not sure what is the case, you need to provide detailed instructions
so we can reproduce your flow. But the error means that the requested bitmap
does not exist. This is likely result of starting vm in preview mode.

For example:

1. create snapshot 1

   A < B

2. full backup (creating checkpoint 1)

   A < B (checkpoint 1) 

3. create snapshot 2 (checkpoint 1 continue to track new snapshot)

   A <- B (checkpoint 1) <- C (checkpoint 1, checkpoint 2)

4. incremental backup (creating checkpoint 2, exist only in image C)

   A <- B (checkpoint 1) <- C (checkpoint 1, checkpoint 2)

5. user preview reverting to snapshot 1

   A <- D

We still have this chain on storage:

   B (checkpoint 1) <- C (checkpoint 1, checkpoint 2)
   
But it is not used by the vm.

Note that D does cannot have checkpoint 1 or 2.
  
At this point incremental backup is not possible (cold or live).

If you want to treat preview state as new vm, you will have to do full backup
at this point. If a user undo the preview mode, you will have to drop the full
backup and continue to do incremental backup using the old vm state.

Comment 10 Yury.Panchenko 2021-05-14 11:24:36 UTC
Hello, Nir.
> VM in preview mode is like a *new* vm being created. No critical data
> is added in preview mode. When a user commit, the vm becomes a normal
> vm and at this point you can backup the vm.
So, ok. I agree with you. If vm in temporal stage we can skip backup.

Comment 15 sshmulev 2021-06-30 16:58:13 UTC
Verified.


Steps to Reproduce:
1.Make full VM backup
2.Create snapshot snap1
3.Make incremental backup
4.Create snapshot snap2
5.Make incremental backup
6.Preview snapshot snap1
7.Start VM backup(Full/incremental) -> Action should be blocked
"[Cannot backup VM. VM is previewing a Snapshot.]

8)commit

9)Incremental backup to see that the checkpoint was removed:
2021-06-30 19:06:11,715+03 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-72) [] Operation Failed: [Cannot backup VM. Checkpoint ID 9591d7d7-0c3b-43cc-819b-350501352180 doesn't exist.]

10)Start Full backup of the VM - > Should succeed.
2021-06-30 19:44:28,817+03 INFO  [org.ovirt.engine.core.bll.storage.backup.StartVmBackupCommand] (default task-77) [6cfed041-b7ba-4524-972a-855ed757d6a7] Running command: StartVmBackupCommand internal: false. Entities affected :  ID: 7943cec4-0af0-4912-9d86-a1714742018a Type: VMAction group BACKUP_DISK with role type ADMIN,  ID: fe0d84bb-f605-4bdd-8694-18b2ecb27728 Type: DiskAction group BACKUP_DISK with role type ADMIN

2021-06-30 19:44:28,837+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-77) [6cfed041-b7ba-4524-972a-855ed757d6a7] EVENT_ID: VM_BACKUP_STARTED(10,790), Backup b2ddfbd2-90f1-4dbc-ada4-700cf0946a82 for VM New_VM started


<creation_date>2021-06-30T19:44:28.817+03:00</creation_date>
<phase>ready</phase>
<to_checkpoint_id>9eac6c09-7a81-411a-91b2-c94c6b5a4d22</to_checkpoint_id>

Comment 16 Sandro Bonazzola 2021-07-06 07:28:31 UTC
This bugzilla is included in oVirt 4.4.7 release, published on July 6th 2021.

Since the problem described in this bug report should be resolved in oVirt 4.4.7 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.


Note You need to log in before you can comment on or make changes to this bug.