Bug 1919248 - [CBT] Race condition in deleting checkpoints causes inconsistency and failed backups
Summary: [CBT] Race condition in deleting checkpoints causes inconsistency and failed ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.4.4.5
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ovirt-4.4.6
: 4.4.6.4
Assignee: Eyal Shenitzky
QA Contact: Ilan Zuckerman
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-01-22 12:50 UTC by Jean-Louis Dupond
Modified: 2021-05-05 05:36 UTC (History)
5 users (show)

Fixed In Version: ovirt-engine-4.4.6.4
Clone Of:
Environment:
Last Closed: 2021-05-05 05:36:08 UTC
oVirt Team: Storage
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 113777 0 master MERGED core: refactor delete VM checkpoint to not use CoCo framework 2021-03-10 09:26:36 UTC
oVirt gerrit 113778 0 master MERGED core: take an exclusive lock on the VM during backup operations 2021-03-10 09:26:43 UTC

Description Jean-Louis Dupond 2021-01-22 12:50:56 UTC
We noticed backups started failing after some tests on 1 VM.
Fails with:
2021-01-22 12:29:03,496+01 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.StartVmBackupVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-51) [ce60eb94-2579-4d50-a6f3-21f0ef0a431f] Command 'StartVmBackupVDSCommand(HostName = ovirt-ovn003, VmBackupVDSParameters:{hostId='e8facc41-7088-4c5a-9a69-680d1d5e6dee', backupId='ca6dbdc9-d067-40c0-804d-0da802cc6b6d', requireConsistency='false'})' execution failed: VDSGenericException: VDSErrorException: Failed to StartVmBackupVDS, error = Checkpoint Error: {'parent_checkpoint_id': None, 'leaf_checkpoint_id': '0e9cedb2-dcd9-4736-bdb1-72654e74b3ba', 'vm_id': '011fc4cd-4831-49b3-8456-ebdcdf9a310b', 'reason': 'Parent checkpoint ID does not match the actual leaf checkpoint'}, code = 1610

After digging in the logs what caused this, it seems to be some race condition:

1. Deleting the checkpoint:
2021-01-21 18:48:43,016+01 INFO  [org.ovirt.engine.core.bll.DeleteVmCheckpointCommand] (default task-41) [2478aaec-eccb-4360-ad5a-b0574bb8dc15] Lock Acquired to object 'EngineLock:{exclusiveLocks='', sharedLocks='[011fc4cd-4831-49b3-8456-ebdcdf9a310b=VM]'}'
2021-01-21 18:48:43,021+01 INFO  [org.ovirt.engine.core.bll.DeleteVmCheckpointCommand] (default task-41) [2478aaec-eccb-4360-ad5a-b0574bb8dc15] Running command: DeleteVmCheckpointCommand internal: false. Entities affected :  ID: 011fc4cd-4831-49b3-8456-ebdcdf9a310b Type: VMAction group BACKUP_DISK with role type ADMIN,  ID: a12c4bf3-e6a4-4d0e-9994-beb5ebc0cfb8 Type: DiskAction group BACKUP_DISK with role type ADMIN
2021-01-21 18:48:43,032+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-41) [2478aaec-eccb-4360-ad5a-b0574bb8dc15] EVENT_ID: VM_CHECKPOINT_DELETED(10,797), Checkpoint 0e9cedb2-dcd9-4736-bdb1-72654e74b3ba for VM vm-test1 deleted (User: admin@internal-authz).
2021-01-21 18:48:43,978+01 INFO  [org.ovirt.engine.core.bll.DeleteVmCheckpointCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-73) [2478aaec-eccb-4360-ad5a-b0574bb8dc15] Redefine previous VM checkpoints for VM '011fc4cd-4831-49b3-8456-ebdcdf9a310b'
2021-01-21 18:48:45,991+01 INFO  [org.ovirt.engine.core.bll.DeleteVmCheckpointCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-70) [2478aaec-eccb-4360-ad5a-b0574bb8dc15] Deleting VmCheckpoint '0e9cedb2-dcd9-4736-bdb1-72654e74b3ba'
2021-01-21 18:48:45,991+01 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DeleteVmCheckpointsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-70) [2478aaec-eccb-4360-ad5a-b0574bb8dc15] START, DeleteVmCheckpointsVDSCommand(HostName = ovirt-ovn003, VmCheckpointsVDSParameters:{hostId='e8facc41-7088-4c5a-9a69-680d1d5e6dee', vmId='011fc4cd-4831-49b3-8456-ebdcdf9a310b'}), log id: 66cfbf75
2021-01-21 18:48:45,997+01 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DeleteVmCheckpointsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-70) [2478aaec-eccb-4360-ad5a-b0574bb8dc15] FINISH, DeleteVmCheckpointsVDSCommand, return: 
2021-01-21 18:48:59,077+01 INFO  [org.ovirt.engine.core.bll.DeleteVmCheckpointCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-29) [2478aaec-eccb-4360-ad5a-b0574bb8dc15] Ending command 'org.ovirt.engine.core.bll.DeleteVmCheckpointCommand' successfully.
2021-01-21 18:48:59,081+01 WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-29) [2478aaec-eccb-4360-ad5a-b0574bb8dc15] Trying to release a shared lock for key: '011fc4cd-4831-49b3-8456-ebdcdf9a310bVM' , but lock does not exist
2021-01-21 18:48:59,081+01 INFO  [org.ovirt.engine.core.bll.DeleteVmCheckpointCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-29) [2478aaec-eccb-4360-ad5a-b0574bb8dc15] Lock freed to object 'EngineLock:{exclusiveLocks='', sharedLocks='[011fc4cd-4831-49b3-8456-ebdcdf9a310b=VM]'}'
2021-01-21 18:48:59,088+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-29) [2478aaec-eccb-4360-ad5a-b0574bb8dc15] EVENT_ID: VM_CHECKPOINT_DELETED(10,797), Checkpoint 0e9cedb2-dcd9-4736-bdb1-72654e74b3ba for VM vm-test1 deleted (User: admin@internal-authz).

2. Deleting the same checkpoint, at the same moment:
2021-01-21 18:48:48,916+01 INFO  [org.ovirt.engine.core.bll.DeleteVmCheckpointCommand] (default task-41) [18fca750-21f0-4c3e-98f9-98e1fdcb0f45] Lock Acquired to object 'EngineLock:{exclusiveLocks='', sharedLocks='[011fc4cd-4831-49b3-8456-ebdcdf9a310b=VM]'}'
2021-01-21 18:48:48,922+01 INFO  [org.ovirt.engine.core.bll.DeleteVmCheckpointCommand] (default task-41) [18fca750-21f0-4c3e-98f9-98e1fdcb0f45] Running command: DeleteVmCheckpointCommand internal: false. Entities affected :  ID: 011fc4cd-4831-49b3-8456-ebdcdf9a310b Type: VMAction group BACKUP_DISK with role type ADMIN,  ID: a12c4bf3-e6a4-4d0e-9994-beb5ebc0cfb8 Type: DiskAction group BACKUP_DISK with role type ADMIN
2021-01-21 18:48:48,930+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-41) [18fca750-21f0-4c3e-98f9-98e1fdcb0f45] EVENT_ID: VM_CHECKPOINT_DELETED(10,797), Checkpoint 0e9cedb2-dcd9-4736-bdb1-72654e74b3ba for VM vm-test1 deleted (User: admin@internal-authz).
2021-01-21 18:48:49,002+01 INFO  [org.ovirt.engine.core.bll.DeleteVmCheckpointCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-78) [18fca750-21f0-4c3e-98f9-98e1fdcb0f45] Redefine previous VM checkpoints for VM '011fc4cd-4831-49b3-8456-ebdcdf9a310b'
2021-01-21 18:48:51,053+01 ERROR [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-22) [18fca750-21f0-4c3e-98f9-98e1fdcb0f45] Command 'DeleteVmCheckpoint' id: 'f13aa032-9ed6-4d2d-987a-9ea4ff381752' with children [] failed when attempting to perform the next operation, marking as 'ACTIVE'
2021-01-21 18:48:51,053+01 ERROR [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-22) [18fca750-21f0-4c3e-98f9-98e1fdcb0f45] null: java.lang.NullPointerException
2021-01-21 18:48:51,053+01 ERROR [org.ovirt.engine.core.bll.DeleteVmCheckpointCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-22) [18fca750-21f0-4c3e-98f9-98e1fdcb0f45] Command 'DeleteVmCheckpoint' id 'f13aa032-9ed6-4d2d-987a-9ea4ff381752' failed executing step 'DELETE_CHECKPOINT'
2021-01-21 18:48:51,053+01 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-22) [18fca750-21f0-4c3e-98f9-98e1fdcb0f45] Command 'DeleteVmCheckpoint' id: 'f13aa032-9ed6-4d2d-987a-9ea4ff381752' child commands '[]' executions were completed, status 'FAILED'
2021-01-21 18:48:52,057+01 ERROR [org.ovirt.engine.core.bll.DeleteVmCheckpointCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-25) [18fca750-21f0-4c3e-98f9-98e1fdcb0f45] Ending command 'org.ovirt.engine.core.bll.DeleteVmCheckpointCommand' with failure.
2021-01-21 18:48:52,060+01 WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-25) [18fca750-21f0-4c3e-98f9-98e1fdcb0f45] Trying to release a shared lock for key: '011fc4cd-4831-49b3-8456-ebdcdf9a310bVM' , but lock does not exist
2021-01-21 18:48:52,060+01 INFO  [org.ovirt.engine.core.bll.DeleteVmCheckpointCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-25) [18fca750-21f0-4c3e-98f9-98e1fdcb0f45] Lock freed to object 'EngineLock:{exclusiveLocks='', sharedLocks='[011fc4cd-4831-49b3-8456-ebdcdf9a310b=VM]'}'
2021-01-21 18:48:52,066+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-25) [18fca750-21f0-4c3e-98f9-98e1fdcb0f45] EVENT_ID: VM_CHECKPOINT_DELETED(10,797), Checkpoint 0e9cedb2-dcd9-4736-bdb1-72654e74b3ba for VM vm-test1 deleted (User: admin@internal-authz).


The redefine triggered by 18fca750-21f0-4c3e-98f9-98e1fdcb0f45
2021-01-21 18:48:49,005+01 INFO  [org.ovirt.engine.core.bll.RedefineVmCheckpointCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-78) [70691619] Running command: RedefineVmCheckpointCommand internal: true. Entities affected :  ID: 011fc4cd-4831-49b3-8456-ebdcdf9a310b Type: VMAction group BACKUP_DISK with role type ADMIN,  ID: a12c4bf3-e6a4-4d0e-9994-beb5ebc0cfb8 Type: DiskAction group BACKUP_DISK with role type ADMIN
2021-01-21 18:48:49,006+01 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ListVmCheckpointsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-78) [70691619] START, ListVmCheckpointsVDSCommand(HostName = ovirt-ovn003, VdsAndVmIDVDSParametersBase:{hostId='e8facc41-7088-4c5a-9a69-680d1d5e6dee', vmId='011fc4cd-4831-49b3-8456-ebdcdf9a310b'}), log id: 1906e285
2021-01-21 18:48:49,009+01 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ListVmCheckpointsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-78) [70691619] FINISH, ListVmCheckpointsVDSCommand, return: [], log id: 1906e285
2021-01-21 18:48:49,009+01 INFO  [org.ovirt.engine.core.bll.RedefineVmCheckpointCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-78) [70691619] Redefine VM '011fc4cd-4831-49b3-8456-ebdcdf9a310b' checkpoint '0e9cedb2-dcd9-4736-bdb1-72654e74b3ba'
2021-01-21 18:48:49,011+01 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RedefineVmCheckpointsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-78) [70691619] START, RedefineVmCheckpointsVDSCommand(HostName = ovirt-ovn003, VmCheckpointsVDSParameters:{hostId='e8facc41-7088-4c5a-9a69-680d1d5e6dee', vmId='011fc4cd-4831-49b3-8456-ebdcdf9a310b'}), log id: 1f052f79
2021-01-21 18:48:49,040+01 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RedefineVmCheckpointsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-78) [70691619] FINISH, RedefineVmCheckpointsVDSCommand, return: 
2021-01-21 18:48:49,041+01 INFO  [org.ovirt.engine.core.bll.RedefineVmCheckpointCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-78) [70691619] Lock freed to object 'EngineLock:{exclusiveLocks='', sharedLocks='[011fc4cd-4831-49b3-8456-ebdcdf9a310b=VM]'}'

On VDSM:
2021-01-21 18:48:52,432+0100 INFO  (jsonrpc/4) [api.virt] START delete_checkpoints(checkpoint_ids=['0e9cedb2-dcd9-4736-bdb1-72654e74b3ba']) from=::ffff:x.91,42386, flow_id=2478aaec-eccb-4360-ad5a-b0574bb8dc15, vmId=011fc4cd-4831-49b3-8456-ebdcdf9a310b (api:48)
2021-01-21 18:48:52,436+0100 INFO  (jsonrpc/4) [api.virt] FINISH delete_checkpoints return={'result': {'checkpoint_ids': ['0e9cedb2-dcd9-4736-bdb1-72654e74b3ba']}, 'status': {'code': 0, 'message': 'Done'}} from=::ffff:x.91,42386, flow_id=2478aaec-eccb-4360-ad5a-b0574bb8dc15, vmId=011fc4cd-4831-49b3-8456-ebdcdf9a310b (api:54)


2021-01-21 18:48:55,447+0100 INFO  (jsonrpc/7) [api.virt] START list_checkpoints() from=::ffff:x.91,42386, flow_id=70691619, vmId=011fc4cd-4831-49b3-8456-ebdcdf9a310b (api:48)
2021-01-21 18:48:55,448+0100 INFO  (jsonrpc/7) [api.virt] FINISH list_checkpoints return={'result': [], 'status': {'code': 0, 'message': 'Done'}} from=::ffff:x.91,42386, flow_id=70691619, vmId=011fc4cd-4831-49b3-8456-ebdcdf9a310b (api:54)
2021-01-21 18:48:55,453+0100 INFO  (jsonrpc/1) [api.virt] START redefine_checkpoints(checkpoints=[{'xml': '<domaincheckpoint>\n  <name>0e9cedb2-dcd9-4736-bdb1-72654e74b3ba</name>\n  <description>checkpoint for backup &apos;5c9f4ff5-67db-48cd-83b1-9633eab6f219&apos;</description>\n  <creationTime>1611251255</........
]) from=::ffff:x.91,42386, flow_id=70691619, vmId=011fc4cd-4831-49b3-8456-ebdcdf9a310b (api:48)
2021-01-21 18:48:55,479+0100 INFO  (jsonrpc/1) [api.virt] FINISH redefine_checkpoints return={'result': {'checkpoint_ids': ['0e9cedb2-dcd9-4736-bdb1-72654e74b3ba']}, 'status': {'code': 0, 'message': 'Done'}} from=::ffff:x.91,42386, flow_id=70691619, vmId=011fc4cd-4831-49b3-8456-ebdcdf9a310b (api:54)

So it seems like after the checkpoint was deleted, the redefine caused it to get added again.
But it was completely removed from the oVirt database.

I guess some more locking is needed to make both deleted do not cause issues.
The checkpoint is gone from the database, but its still active if you check via libvirt.

Comment 1 RHEL Program Management 2021-01-26 12:52:50 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 2 Ilan Zuckerman 2021-04-08 09:42:40 UTC
Please describe steps to verify this. thanks.

Comment 3 Eyal Shenitzky 2021-04-08 10:59:18 UTC
Steps to reproduce:
1) Create a VM with disk enabled for backup
2) Run the VM
3) start a full backup
4) finalize the backup
5) start and incremental backup
6) finalize the incremental backup
7) Turn off the VM
8) Start the VM
9) When the VM is UP remove the root checkpoint and immediately, remove it again from a different shell/client.

Expected results - 
The first call to remove the checkpoint should succeed, the second should fail.

Comment 4 Ilan Zuckerman 2021-04-19 08:26:18 UTC
Verified on rhv-4.4.6-4 according steps from comment #3

Second (immediate) attempt of removing the root checkpoint failed as expected.

9) When the VM is UP remove the root checkpoint and immediately, remove it again from a different shell/client.

First shell:

[root@storage-ge13-vdsm2 examples]# python3 remove_checkpoint.py  0d688b5a-799d-4f83-956c-0c999afa3461 -c engine
[   0.0 ] Removing root checkpoint for VM '0d688b5a-799d-4f83-956c-0c999afa3461'
[   0.0 ] Looking up checkpoints 0d688b5a-799d-4f83-956c-0c999afa3461
[   2.6 ] Removing root checkpoint 'b7c5a5a6-8199-4ccb-8877-6fa264d85be6'
[   4.6 ] Root checkpoint removed successfully


Second shell:

[root@storage-ge13-vdsm2 examples]# python3 remove_checkpoint.py  0d688b5a-799d-4f83-956c-0c999afa3461 -c engine
[   0.0 ] Removing root checkpoint for VM '0d688b5a-799d-4f83-956c-0c999afa3461'
[   0.0 ] Looking up checkpoints 0d688b5a-799d-4f83-956c-0c999afa3461
[   1.6 ] Removing root checkpoint 'b7c5a5a6-8199-4ccb-8877-6fa264d85be6'
Traceback (most recent call last):
  File "remove_checkpoint.py", line 63, in <module>
    checkpoint_service.remove()
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/services.py", line 34479, in remove
    self._internal_remove(headers, query, wait)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 271, in _internal_remove
    return future.wait() if wait else future
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 55, in wait
    return self._code(response)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 268, in callback
    self._check_fault(response)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 132, in _check_fault
    self._raise_error(response, body)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 118, in _raise_error
    raise error
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Cannot remove ${type}. The VM is during a checkpoint removal operation.]". HTTP response code is 409.



Engine log:

2021-04-19 11:14:40,994+03 INFO  [org.ovirt.engine.core.bll.storage.backup.DeleteVmCheckpointCommand] (default task-86) [fe7dca46-912e-49c0-b664-3d38464d5b10] Redefine previous VM checkpoints for VM '0d688b5a-799d-4f83-956c-0c999afa3461'
2021-04-19 11:14:41,008+03 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-85) [] Operation Failed: [Cannot remove ${type}. The VM is during a checkpoint removal operation.]

Comment 5 Sandro Bonazzola 2021-05-05 05:36:08 UTC
This bugzilla is included in oVirt 4.4.6 release, published on May 4th 2021.

Since the problem described in this bug report should be resolved in oVirt 4.4.6 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.


Note You need to log in before you can comment on or make changes to this bug.