Bug 1956677

Summary: Snapshot merge fails with null pointer exception.
Product: Red Hat Enterprise Virtualization Manager Reporter: Roman Hodain <rhodain>
Component: ovirt-engineAssignee: Benny Zlotnik <bzlotnik>
Status: CLOSED ERRATA QA Contact: Shir Fishbain <sfishbai>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 4.4.5CC: aefrat, bzlotnik, eshenitz, joboyer, michal.skrivanek, mjankula
Target Milestone: ovirt-4.4.6-1Keywords: ZStream
Target Release: 4.4.6   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ovirt-engine-4.4.6.7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-06-01 13:25:23 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Engien logs none

Description Roman Hodain 2021-05-04 08:24:36 UTC
Description of problem:
When trying to delete a snapshot the engine fails with a null pointer exception after the merge is finished and the engine attempts to destroy the merged volume. 

Version-Release number of selected component (if applicable):
4.4.5

How reproducible:
So far ones.

Steps to Reproduce:
N/A

Actual results:
The null pointer is raised and further merges fail as the DB still contains the already merged volume. The VM is successfully merged. 

2021-05-01 03:35:35,034+02 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [2ea4228e-f87a-4426-a3bc-897938b38dd1] Command 'RemoveSnapshot' (id: '4d0f3411-cd69-4fbb-b1d9-885ab4f9dbad') waiting on child command id: 'e0e728f1-e7bf-4d21-b0f9-41809776c052' type:'RemoveSnapshotSingleDiskLive' to complete
2021-05-01 03:35:35,077+02 INFO  [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [3c38e773-cddb-44af-8e71-32a1b5d8ff31] Executing Live Merge command step 'DESTROY_IMAGE'
2021-05-01 03:35:35,077+02 ERROR [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [3c38e773-cddb-44af-8e71-32a1b5d8ff31] Command 'RemoveSnapshotSingleDiskLive' id: '1e8d04ed-e6d1-4a3f-b57a-f2738ca1b0fe' with children [bff72d6e-1e0f-4699-8100-ec2e7bc7f1f5, 280d65b3-a186-4db1-9eee-884a2167177e, 289d17ea-144e-4a85-a660-d838f224147e] failed when attempting to perform the next operation, marking as 'ACTIVE'
2021-05-01 03:35:35,077+02 ERROR [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [3c38e773-cddb-44af-8e71-32a1b5d8ff31] null: java.lang.NullPointerException
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand.performNextOperation(RemoveSnapshotSingleDiskLiveCommand.java:132)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback.childCommandsExecutionEnded(SerialChildCommandsExecutionCallback.java:32)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.ChildCommandsCallbackBase.doPolling(ChildCommandsCallbackBase.java:80)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethodsImpl(CommandCallbacksPoller.java:175)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods(CommandCallbacksPoller.java:109)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
        at org.glassfish.javax.enterprise.concurrent.0.redhat-1//org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.access$201(ManagedScheduledThreadPoolExecutor.java:383)
        at org.glassfish.javax.enterprise.concurrent.0.redhat-1//org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.run(ManagedScheduledThreadPoolExecutor.java:534)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
        at org.glassfish.javax.enterprise.concurrent.0.redhat-1//org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250)

2021-05-01 03:35:35,077+02 ERROR [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [3c38e773-cddb-44af-8e71-32a1b5d8ff31] Command id: '1e8d04ed-e6d1-4a3f-b57a-f2738ca1b0fe failed child command status for step 'DESTROY_IMAGE'


Expected results:
Volume is destroyed.

Comment 1 Roman Hodain 2021-05-04 08:25:15 UTC
Created attachment 1779282 [details]
Engien logs

Comment 2 Eyal Shenitzky 2021-05-04 08:44:57 UTC
Benny, can you please have a look?

Comment 3 Eyal Shenitzky 2021-05-04 08:47:15 UTC
Roman, please add all relevant info (VDSM logs, DB dump, versions)

Comment 12 Benny Zlotnik 2021-05-04 16:51:05 UTC
Since the issue occurs at around 03:35:35, which is when the CommandEntityCleanupManager runs, this shouldn't remove anything from the command_entities table but it does cause a reload from the table, which causes this method[1] to behave incorrectly:

        List<Guid> childCommandIds = commandCoordinatorUtil.getChildCommandIds(getCommandId());
        if (childCommandIds.size() != parameters.getChildCommands().size()) {
            parameters.getChildCommands().put(parameters.getCommandStep(), childCommandIds.get(childCommandIds.size()-1));
        }

In a normal run where everything is read and written in the memory, the order will be kept correctly, but when it's reloaded from the database the order changes, which creates an incorrect CommandStep -> commandId mapping



[1] https://github.com/oVirt/ovirt-engine/blob/b7bc2ee76e48f3373f3a2d584f36f22efed1daed/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/snapshots/RemoveSnapshotSingleDiskCommandBase.java#L131

Comment 19 Benny Zlotnik 2021-05-09 08:15:32 UTC
Verified:
1. Add forced cache refresh on each live merge step [1] to simulate CommandEntityCleanupManager
2. Run live merge

The flow was tested both locally and on multiple OST runs



[1] https://gerrit.ovirt.org/c/ovirt-engine/+/114635/1/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/snapshots/RemoveSnapshotSingleDiskLiveCommand.java#93

Comment 23 errata-xmlrpc 2021-06-01 13:25:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (0-day RHV Manager (ovirt-engine) [ovirt-4.4.6]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2182

Comment 24 Benny Zlotnik 2021-07-21 14:20:04 UTC
*** Bug 1984000 has been marked as a duplicate of this bug. ***