Description of problem: When trying to delete a snapshot the engine fails with a null pointer exception after the merge is finished and the engine attempts to destroy the merged volume. Version-Release number of selected component (if applicable): 4.4.5 How reproducible: So far ones. Steps to Reproduce: N/A Actual results: The null pointer is raised and further merges fail as the DB still contains the already merged volume. The VM is successfully merged. 2021-05-01 03:35:35,034+02 INFO [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [2ea4228e-f87a-4426-a3bc-897938b38dd1] Command 'RemoveSnapshot' (id: '4d0f3411-cd69-4fbb-b1d9-885ab4f9dbad') waiting on child command id: 'e0e728f1-e7bf-4d21-b0f9-41809776c052' type:'RemoveSnapshotSingleDiskLive' to complete 2021-05-01 03:35:35,077+02 INFO [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [3c38e773-cddb-44af-8e71-32a1b5d8ff31] Executing Live Merge command step 'DESTROY_IMAGE' 2021-05-01 03:35:35,077+02 ERROR [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [3c38e773-cddb-44af-8e71-32a1b5d8ff31] Command 'RemoveSnapshotSingleDiskLive' id: '1e8d04ed-e6d1-4a3f-b57a-f2738ca1b0fe' with children [bff72d6e-1e0f-4699-8100-ec2e7bc7f1f5, 280d65b3-a186-4db1-9eee-884a2167177e, 289d17ea-144e-4a85-a660-d838f224147e] failed when attempting to perform the next operation, marking as 'ACTIVE' 2021-05-01 03:35:35,077+02 ERROR [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [3c38e773-cddb-44af-8e71-32a1b5d8ff31] null: java.lang.NullPointerException at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand.performNextOperation(RemoveSnapshotSingleDiskLiveCommand.java:132) at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback.childCommandsExecutionEnded(SerialChildCommandsExecutionCallback.java:32) at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.ChildCommandsCallbackBase.doPolling(ChildCommandsCallbackBase.java:80) at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethodsImpl(CommandCallbacksPoller.java:175) at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods(CommandCallbacksPoller.java:109) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) at org.glassfish.javax.enterprise.concurrent.0.redhat-1//org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.access$201(ManagedScheduledThreadPoolExecutor.java:383) at org.glassfish.javax.enterprise.concurrent.0.redhat-1//org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.run(ManagedScheduledThreadPoolExecutor.java:534) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:834) at org.glassfish.javax.enterprise.concurrent.0.redhat-1//org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250) 2021-05-01 03:35:35,077+02 ERROR [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [3c38e773-cddb-44af-8e71-32a1b5d8ff31] Command id: '1e8d04ed-e6d1-4a3f-b57a-f2738ca1b0fe failed child command status for step 'DESTROY_IMAGE' Expected results: Volume is destroyed.
Created attachment 1779282 [details] Engien logs
Benny, can you please have a look?
Roman, please add all relevant info (VDSM logs, DB dump, versions)
Since the issue occurs at around 03:35:35, which is when the CommandEntityCleanupManager runs, this shouldn't remove anything from the command_entities table but it does cause a reload from the table, which causes this method[1] to behave incorrectly: List<Guid> childCommandIds = commandCoordinatorUtil.getChildCommandIds(getCommandId()); if (childCommandIds.size() != parameters.getChildCommands().size()) { parameters.getChildCommands().put(parameters.getCommandStep(), childCommandIds.get(childCommandIds.size()-1)); } In a normal run where everything is read and written in the memory, the order will be kept correctly, but when it's reloaded from the database the order changes, which creates an incorrect CommandStep -> commandId mapping [1] https://github.com/oVirt/ovirt-engine/blob/b7bc2ee76e48f3373f3a2d584f36f22efed1daed/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/snapshots/RemoveSnapshotSingleDiskCommandBase.java#L131
Verified: 1. Add forced cache refresh on each live merge step [1] to simulate CommandEntityCleanupManager 2. Run live merge The flow was tested both locally and on multiple OST runs [1] https://gerrit.ovirt.org/c/ovirt-engine/+/114635/1/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/snapshots/RemoveSnapshotSingleDiskLiveCommand.java#93
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (0-day RHV Manager (ovirt-engine) [ovirt-4.4.6]), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:2182
*** Bug 1984000 has been marked as a duplicate of this bug. ***