Bug 1919248

Summary: [CBT] Race condition in deleting checkpoints causes inconsistency and failed backups
Product: [oVirt] ovirt-engine Reporter: Jean-Louis Dupond <jean-louis>
Component: BLL.StorageAssignee: Eyal Shenitzky <eshenitz>
Status: CLOSED CURRENTRELEASE QA Contact: Ilan Zuckerman <izuckerm>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.4.4.5CC: ahadas, bugs, dfodor, eshenitz, sfishbai
Target Milestone: ovirt-4.4.6   
Target Release: 4.4.6.4   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ovirt-engine-4.4.6.4 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-05-05 05:36:08 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Jean-Louis Dupond 2021-01-22 12:50:56 UTC
We noticed backups started failing after some tests on 1 VM.
Fails with:
2021-01-22 12:29:03,496+01 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.StartVmBackupVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-51) [ce60eb94-2579-4d50-a6f3-21f0ef0a431f] Command 'StartVmBackupVDSCommand(HostName = ovirt-ovn003, VmBackupVDSParameters:{hostId='e8facc41-7088-4c5a-9a69-680d1d5e6dee', backupId='ca6dbdc9-d067-40c0-804d-0da802cc6b6d', requireConsistency='false'})' execution failed: VDSGenericException: VDSErrorException: Failed to StartVmBackupVDS, error = Checkpoint Error: {'parent_checkpoint_id': None, 'leaf_checkpoint_id': '0e9cedb2-dcd9-4736-bdb1-72654e74b3ba', 'vm_id': '011fc4cd-4831-49b3-8456-ebdcdf9a310b', 'reason': 'Parent checkpoint ID does not match the actual leaf checkpoint'}, code = 1610

After digging in the logs what caused this, it seems to be some race condition:

1. Deleting the checkpoint:
2021-01-21 18:48:43,016+01 INFO  [org.ovirt.engine.core.bll.DeleteVmCheckpointCommand] (default task-41) [2478aaec-eccb-4360-ad5a-b0574bb8dc15] Lock Acquired to object 'EngineLock:{exclusiveLocks='', sharedLocks='[011fc4cd-4831-49b3-8456-ebdcdf9a310b=VM]'}'
2021-01-21 18:48:43,021+01 INFO  [org.ovirt.engine.core.bll.DeleteVmCheckpointCommand] (default task-41) [2478aaec-eccb-4360-ad5a-b0574bb8dc15] Running command: DeleteVmCheckpointCommand internal: false. Entities affected :  ID: 011fc4cd-4831-49b3-8456-ebdcdf9a310b Type: VMAction group BACKUP_DISK with role type ADMIN,  ID: a12c4bf3-e6a4-4d0e-9994-beb5ebc0cfb8 Type: DiskAction group BACKUP_DISK with role type ADMIN
2021-01-21 18:48:43,032+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-41) [2478aaec-eccb-4360-ad5a-b0574bb8dc15] EVENT_ID: VM_CHECKPOINT_DELETED(10,797), Checkpoint 0e9cedb2-dcd9-4736-bdb1-72654e74b3ba for VM vm-test1 deleted (User: admin@internal-authz).
2021-01-21 18:48:43,978+01 INFO  [org.ovirt.engine.core.bll.DeleteVmCheckpointCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-73) [2478aaec-eccb-4360-ad5a-b0574bb8dc15] Redefine previous VM checkpoints for VM '011fc4cd-4831-49b3-8456-ebdcdf9a310b'
2021-01-21 18:48:45,991+01 INFO  [org.ovirt.engine.core.bll.DeleteVmCheckpointCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-70) [2478aaec-eccb-4360-ad5a-b0574bb8dc15] Deleting VmCheckpoint '0e9cedb2-dcd9-4736-bdb1-72654e74b3ba'
2021-01-21 18:48:45,991+01 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DeleteVmCheckpointsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-70) [2478aaec-eccb-4360-ad5a-b0574bb8dc15] START, DeleteVmCheckpointsVDSCommand(HostName = ovirt-ovn003, VmCheckpointsVDSParameters:{hostId='e8facc41-7088-4c5a-9a69-680d1d5e6dee', vmId='011fc4cd-4831-49b3-8456-ebdcdf9a310b'}), log id: 66cfbf75
2021-01-21 18:48:45,997+01 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DeleteVmCheckpointsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-70) [2478aaec-eccb-4360-ad5a-b0574bb8dc15] FINISH, DeleteVmCheckpointsVDSCommand, return: 
2021-01-21 18:48:59,077+01 INFO  [org.ovirt.engine.core.bll.DeleteVmCheckpointCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-29) [2478aaec-eccb-4360-ad5a-b0574bb8dc15] Ending command 'org.ovirt.engine.core.bll.DeleteVmCheckpointCommand' successfully.
2021-01-21 18:48:59,081+01 WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-29) [2478aaec-eccb-4360-ad5a-b0574bb8dc15] Trying to release a shared lock for key: '011fc4cd-4831-49b3-8456-ebdcdf9a310bVM' , but lock does not exist
2021-01-21 18:48:59,081+01 INFO  [org.ovirt.engine.core.bll.DeleteVmCheckpointCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-29) [2478aaec-eccb-4360-ad5a-b0574bb8dc15] Lock freed to object 'EngineLock:{exclusiveLocks='', sharedLocks='[011fc4cd-4831-49b3-8456-ebdcdf9a310b=VM]'}'
2021-01-21 18:48:59,088+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-29) [2478aaec-eccb-4360-ad5a-b0574bb8dc15] EVENT_ID: VM_CHECKPOINT_DELETED(10,797), Checkpoint 0e9cedb2-dcd9-4736-bdb1-72654e74b3ba for VM vm-test1 deleted (User: admin@internal-authz).

2. Deleting the same checkpoint, at the same moment:
2021-01-21 18:48:48,916+01 INFO  [org.ovirt.engine.core.bll.DeleteVmCheckpointCommand] (default task-41) [18fca750-21f0-4c3e-98f9-98e1fdcb0f45] Lock Acquired to object 'EngineLock:{exclusiveLocks='', sharedLocks='[011fc4cd-4831-49b3-8456-ebdcdf9a310b=VM]'}'
2021-01-21 18:48:48,922+01 INFO  [org.ovirt.engine.core.bll.DeleteVmCheckpointCommand] (default task-41) [18fca750-21f0-4c3e-98f9-98e1fdcb0f45] Running command: DeleteVmCheckpointCommand internal: false. Entities affected :  ID: 011fc4cd-4831-49b3-8456-ebdcdf9a310b Type: VMAction group BACKUP_DISK with role type ADMIN,  ID: a12c4bf3-e6a4-4d0e-9994-beb5ebc0cfb8 Type: DiskAction group BACKUP_DISK with role type ADMIN
2021-01-21 18:48:48,930+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-41) [18fca750-21f0-4c3e-98f9-98e1fdcb0f45] EVENT_ID: VM_CHECKPOINT_DELETED(10,797), Checkpoint 0e9cedb2-dcd9-4736-bdb1-72654e74b3ba for VM vm-test1 deleted (User: admin@internal-authz).
2021-01-21 18:48:49,002+01 INFO  [org.ovirt.engine.core.bll.DeleteVmCheckpointCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-78) [18fca750-21f0-4c3e-98f9-98e1fdcb0f45] Redefine previous VM checkpoints for VM '011fc4cd-4831-49b3-8456-ebdcdf9a310b'
2021-01-21 18:48:51,053+01 ERROR [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-22) [18fca750-21f0-4c3e-98f9-98e1fdcb0f45] Command 'DeleteVmCheckpoint' id: 'f13aa032-9ed6-4d2d-987a-9ea4ff381752' with children [] failed when attempting to perform the next operation, marking as 'ACTIVE'
2021-01-21 18:48:51,053+01 ERROR [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-22) [18fca750-21f0-4c3e-98f9-98e1fdcb0f45] null: java.lang.NullPointerException
2021-01-21 18:48:51,053+01 ERROR [org.ovirt.engine.core.bll.DeleteVmCheckpointCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-22) [18fca750-21f0-4c3e-98f9-98e1fdcb0f45] Command 'DeleteVmCheckpoint' id 'f13aa032-9ed6-4d2d-987a-9ea4ff381752' failed executing step 'DELETE_CHECKPOINT'
2021-01-21 18:48:51,053+01 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-22) [18fca750-21f0-4c3e-98f9-98e1fdcb0f45] Command 'DeleteVmCheckpoint' id: 'f13aa032-9ed6-4d2d-987a-9ea4ff381752' child commands '[]' executions were completed, status 'FAILED'
2021-01-21 18:48:52,057+01 ERROR [org.ovirt.engine.core.bll.DeleteVmCheckpointCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-25) [18fca750-21f0-4c3e-98f9-98e1fdcb0f45] Ending command 'org.ovirt.engine.core.bll.DeleteVmCheckpointCommand' with failure.
2021-01-21 18:48:52,060+01 WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-25) [18fca750-21f0-4c3e-98f9-98e1fdcb0f45] Trying to release a shared lock for key: '011fc4cd-4831-49b3-8456-ebdcdf9a310bVM' , but lock does not exist
2021-01-21 18:48:52,060+01 INFO  [org.ovirt.engine.core.bll.DeleteVmCheckpointCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-25) [18fca750-21f0-4c3e-98f9-98e1fdcb0f45] Lock freed to object 'EngineLock:{exclusiveLocks='', sharedLocks='[011fc4cd-4831-49b3-8456-ebdcdf9a310b=VM]'}'
2021-01-21 18:48:52,066+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-25) [18fca750-21f0-4c3e-98f9-98e1fdcb0f45] EVENT_ID: VM_CHECKPOINT_DELETED(10,797), Checkpoint 0e9cedb2-dcd9-4736-bdb1-72654e74b3ba for VM vm-test1 deleted (User: admin@internal-authz).


The redefine triggered by 18fca750-21f0-4c3e-98f9-98e1fdcb0f45
2021-01-21 18:48:49,005+01 INFO  [org.ovirt.engine.core.bll.RedefineVmCheckpointCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-78) [70691619] Running command: RedefineVmCheckpointCommand internal: true. Entities affected :  ID: 011fc4cd-4831-49b3-8456-ebdcdf9a310b Type: VMAction group BACKUP_DISK with role type ADMIN,  ID: a12c4bf3-e6a4-4d0e-9994-beb5ebc0cfb8 Type: DiskAction group BACKUP_DISK with role type ADMIN
2021-01-21 18:48:49,006+01 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ListVmCheckpointsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-78) [70691619] START, ListVmCheckpointsVDSCommand(HostName = ovirt-ovn003, VdsAndVmIDVDSParametersBase:{hostId='e8facc41-7088-4c5a-9a69-680d1d5e6dee', vmId='011fc4cd-4831-49b3-8456-ebdcdf9a310b'}), log id: 1906e285
2021-01-21 18:48:49,009+01 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ListVmCheckpointsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-78) [70691619] FINISH, ListVmCheckpointsVDSCommand, return: [], log id: 1906e285
2021-01-21 18:48:49,009+01 INFO  [org.ovirt.engine.core.bll.RedefineVmCheckpointCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-78) [70691619] Redefine VM '011fc4cd-4831-49b3-8456-ebdcdf9a310b' checkpoint '0e9cedb2-dcd9-4736-bdb1-72654e74b3ba'
2021-01-21 18:48:49,011+01 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RedefineVmCheckpointsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-78) [70691619] START, RedefineVmCheckpointsVDSCommand(HostName = ovirt-ovn003, VmCheckpointsVDSParameters:{hostId='e8facc41-7088-4c5a-9a69-680d1d5e6dee', vmId='011fc4cd-4831-49b3-8456-ebdcdf9a310b'}), log id: 1f052f79
2021-01-21 18:48:49,040+01 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RedefineVmCheckpointsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-78) [70691619] FINISH, RedefineVmCheckpointsVDSCommand, return: 
2021-01-21 18:48:49,041+01 INFO  [org.ovirt.engine.core.bll.RedefineVmCheckpointCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-78) [70691619] Lock freed to object 'EngineLock:{exclusiveLocks='', sharedLocks='[011fc4cd-4831-49b3-8456-ebdcdf9a310b=VM]'}'

On VDSM:
2021-01-21 18:48:52,432+0100 INFO  (jsonrpc/4) [api.virt] START delete_checkpoints(checkpoint_ids=['0e9cedb2-dcd9-4736-bdb1-72654e74b3ba']) from=::ffff:x.91,42386, flow_id=2478aaec-eccb-4360-ad5a-b0574bb8dc15, vmId=011fc4cd-4831-49b3-8456-ebdcdf9a310b (api:48)
2021-01-21 18:48:52,436+0100 INFO  (jsonrpc/4) [api.virt] FINISH delete_checkpoints return={'result': {'checkpoint_ids': ['0e9cedb2-dcd9-4736-bdb1-72654e74b3ba']}, 'status': {'code': 0, 'message': 'Done'}} from=::ffff:x.91,42386, flow_id=2478aaec-eccb-4360-ad5a-b0574bb8dc15, vmId=011fc4cd-4831-49b3-8456-ebdcdf9a310b (api:54)


2021-01-21 18:48:55,447+0100 INFO  (jsonrpc/7) [api.virt] START list_checkpoints() from=::ffff:x.91,42386, flow_id=70691619, vmId=011fc4cd-4831-49b3-8456-ebdcdf9a310b (api:48)
2021-01-21 18:48:55,448+0100 INFO  (jsonrpc/7) [api.virt] FINISH list_checkpoints return={'result': [], 'status': {'code': 0, 'message': 'Done'}} from=::ffff:x.91,42386, flow_id=70691619, vmId=011fc4cd-4831-49b3-8456-ebdcdf9a310b (api:54)
2021-01-21 18:48:55,453+0100 INFO  (jsonrpc/1) [api.virt] START redefine_checkpoints(checkpoints=[{'xml': '<domaincheckpoint>\n  <name>0e9cedb2-dcd9-4736-bdb1-72654e74b3ba</name>\n  <description>checkpoint for backup &apos;5c9f4ff5-67db-48cd-83b1-9633eab6f219&apos;</description>\n  <creationTime>1611251255</........
]) from=::ffff:x.91,42386, flow_id=70691619, vmId=011fc4cd-4831-49b3-8456-ebdcdf9a310b (api:48)
2021-01-21 18:48:55,479+0100 INFO  (jsonrpc/1) [api.virt] FINISH redefine_checkpoints return={'result': {'checkpoint_ids': ['0e9cedb2-dcd9-4736-bdb1-72654e74b3ba']}, 'status': {'code': 0, 'message': 'Done'}} from=::ffff:x.91,42386, flow_id=70691619, vmId=011fc4cd-4831-49b3-8456-ebdcdf9a310b (api:54)

So it seems like after the checkpoint was deleted, the redefine caused it to get added again.
But it was completely removed from the oVirt database.

I guess some more locking is needed to make both deleted do not cause issues.
The checkpoint is gone from the database, but its still active if you check via libvirt.

Comment 1 RHEL Program Management 2021-01-26 12:52:50 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 2 Ilan Zuckerman 2021-04-08 09:42:40 UTC
Please describe steps to verify this. thanks.

Comment 3 Eyal Shenitzky 2021-04-08 10:59:18 UTC
Steps to reproduce:
1) Create a VM with disk enabled for backup
2) Run the VM
3) start a full backup
4) finalize the backup
5) start and incremental backup
6) finalize the incremental backup
7) Turn off the VM
8) Start the VM
9) When the VM is UP remove the root checkpoint and immediately, remove it again from a different shell/client.

Expected results - 
The first call to remove the checkpoint should succeed, the second should fail.

Comment 4 Ilan Zuckerman 2021-04-19 08:26:18 UTC
Verified on rhv-4.4.6-4 according steps from comment #3

Second (immediate) attempt of removing the root checkpoint failed as expected.

9) When the VM is UP remove the root checkpoint and immediately, remove it again from a different shell/client.

First shell:

[root@storage-ge13-vdsm2 examples]# python3 remove_checkpoint.py  0d688b5a-799d-4f83-956c-0c999afa3461 -c engine
[   0.0 ] Removing root checkpoint for VM '0d688b5a-799d-4f83-956c-0c999afa3461'
[   0.0 ] Looking up checkpoints 0d688b5a-799d-4f83-956c-0c999afa3461
[   2.6 ] Removing root checkpoint 'b7c5a5a6-8199-4ccb-8877-6fa264d85be6'
[   4.6 ] Root checkpoint removed successfully


Second shell:

[root@storage-ge13-vdsm2 examples]# python3 remove_checkpoint.py  0d688b5a-799d-4f83-956c-0c999afa3461 -c engine
[   0.0 ] Removing root checkpoint for VM '0d688b5a-799d-4f83-956c-0c999afa3461'
[   0.0 ] Looking up checkpoints 0d688b5a-799d-4f83-956c-0c999afa3461
[   1.6 ] Removing root checkpoint 'b7c5a5a6-8199-4ccb-8877-6fa264d85be6'
Traceback (most recent call last):
  File "remove_checkpoint.py", line 63, in <module>
    checkpoint_service.remove()
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/services.py", line 34479, in remove
    self._internal_remove(headers, query, wait)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 271, in _internal_remove
    return future.wait() if wait else future
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 55, in wait
    return self._code(response)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 268, in callback
    self._check_fault(response)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 132, in _check_fault
    self._raise_error(response, body)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 118, in _raise_error
    raise error
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Cannot remove ${type}. The VM is during a checkpoint removal operation.]". HTTP response code is 409.



Engine log:

2021-04-19 11:14:40,994+03 INFO  [org.ovirt.engine.core.bll.storage.backup.DeleteVmCheckpointCommand] (default task-86) [fe7dca46-912e-49c0-b664-3d38464d5b10] Redefine previous VM checkpoints for VM '0d688b5a-799d-4f83-956c-0c999afa3461'
2021-04-19 11:14:41,008+03 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-85) [] Operation Failed: [Cannot remove ${type}. The VM is during a checkpoint removal operation.]

Comment 5 Sandro Bonazzola 2021-05-05 05:36:08 UTC
This bugzilla is included in oVirt 4.4.6 release, published on May 4th 2021.

Since the problem described in this bug report should be resolved in oVirt 4.4.6 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.