Bug 1956677 - Snapshot merge fails with null pointer exception.
Summary: Snapshot merge fails with null pointer exception.
Keywords:
Status: VERIFIED
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.4.5
Hardware: Unspecified
OS: Unspecified
unspecified
urgent
Target Milestone: ovirt-4.4.6-1
: 4.4.6
Assignee: Benny Zlotnik
QA Contact: Shir Fishbain
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-05-04 08:24 UTC by Roman Hodain
Modified: 2021-05-09 08:15 UTC (History)
5 users (show)

Fixed In Version: ovirt-engine-4.4.6.7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
oVirt Team: Storage
Target Upstream Version:


Attachments (Terms of Use)
Engien logs (4.31 MB, application/gzip)
2021-05-04 08:25 UTC, Roman Hodain
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 114635 0 master MERGED core: sort fetched commandIds 2021-05-06 11:10:39 UTC
oVirt gerrit 114677 0 ovirt-engine-4.4.6.z MERGED core: sort fetched commandIds 2021-05-06 13:18:53 UTC

Description Roman Hodain 2021-05-04 08:24:36 UTC
Description of problem:
When trying to delete a snapshot the engine fails with a null pointer exception after the merge is finished and the engine attempts to destroy the merged volume. 

Version-Release number of selected component (if applicable):
4.4.5

How reproducible:
So far ones.

Steps to Reproduce:
N/A

Actual results:
The null pointer is raised and further merges fail as the DB still contains the already merged volume. The VM is successfully merged. 

2021-05-01 03:35:35,034+02 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [2ea4228e-f87a-4426-a3bc-897938b38dd1] Command 'RemoveSnapshot' (id: '4d0f3411-cd69-4fbb-b1d9-885ab4f9dbad') waiting on child command id: 'e0e728f1-e7bf-4d21-b0f9-41809776c052' type:'RemoveSnapshotSingleDiskLive' to complete
2021-05-01 03:35:35,077+02 INFO  [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [3c38e773-cddb-44af-8e71-32a1b5d8ff31] Executing Live Merge command step 'DESTROY_IMAGE'
2021-05-01 03:35:35,077+02 ERROR [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [3c38e773-cddb-44af-8e71-32a1b5d8ff31] Command 'RemoveSnapshotSingleDiskLive' id: '1e8d04ed-e6d1-4a3f-b57a-f2738ca1b0fe' with children [bff72d6e-1e0f-4699-8100-ec2e7bc7f1f5, 280d65b3-a186-4db1-9eee-884a2167177e, 289d17ea-144e-4a85-a660-d838f224147e] failed when attempting to perform the next operation, marking as 'ACTIVE'
2021-05-01 03:35:35,077+02 ERROR [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [3c38e773-cddb-44af-8e71-32a1b5d8ff31] null: java.lang.NullPointerException
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand.performNextOperation(RemoveSnapshotSingleDiskLiveCommand.java:132)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback.childCommandsExecutionEnded(SerialChildCommandsExecutionCallback.java:32)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.ChildCommandsCallbackBase.doPolling(ChildCommandsCallbackBase.java:80)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethodsImpl(CommandCallbacksPoller.java:175)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods(CommandCallbacksPoller.java:109)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
        at org.glassfish.javax.enterprise.concurrent@1.0.0.redhat-1//org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.access$201(ManagedScheduledThreadPoolExecutor.java:383)
        at org.glassfish.javax.enterprise.concurrent@1.0.0.redhat-1//org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.run(ManagedScheduledThreadPoolExecutor.java:534)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
        at org.glassfish.javax.enterprise.concurrent@1.0.0.redhat-1//org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250)

2021-05-01 03:35:35,077+02 ERROR [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [3c38e773-cddb-44af-8e71-32a1b5d8ff31] Command id: '1e8d04ed-e6d1-4a3f-b57a-f2738ca1b0fe failed child command status for step 'DESTROY_IMAGE'


Expected results:
Volume is destroyed.

Comment 1 Roman Hodain 2021-05-04 08:25:15 UTC
Created attachment 1779282 [details]
Engien logs

Comment 2 Eyal Shenitzky 2021-05-04 08:44:57 UTC
Benny, can you please have a look?

Comment 3 Eyal Shenitzky 2021-05-04 08:47:15 UTC
Roman, please add all relevant info (VDSM logs, DB dump, versions)

Comment 12 Benny Zlotnik 2021-05-04 16:51:05 UTC
Since the issue occurs at around 03:35:35, which is when the CommandEntityCleanupManager runs, this shouldn't remove anything from the command_entities table but it does cause a reload from the table, which causes this method[1] to behave incorrectly:

        List<Guid> childCommandIds = commandCoordinatorUtil.getChildCommandIds(getCommandId());
        if (childCommandIds.size() != parameters.getChildCommands().size()) {
            parameters.getChildCommands().put(parameters.getCommandStep(), childCommandIds.get(childCommandIds.size()-1));
        }

In a normal run where everything is read and written in the memory, the order will be kept correctly, but when it's reloaded from the database the order changes, which creates an incorrect CommandStep -> commandId mapping



[1] https://github.com/oVirt/ovirt-engine/blob/b7bc2ee76e48f3373f3a2d584f36f22efed1daed/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/snapshots/RemoveSnapshotSingleDiskCommandBase.java#L131

Comment 19 Benny Zlotnik 2021-05-09 08:15:32 UTC
Verified:
1. Add forced cache refresh on each live merge step [1] to simulate CommandEntityCleanupManager
2. Run live merge

The flow was tested both locally and on multiple OST runs



[1] https://gerrit.ovirt.org/c/ovirt-engine/+/114635/1/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/snapshots/RemoveSnapshotSingleDiskLiveCommand.java#93


Note You need to log in before you can comment on or make changes to this bug.