Bug 1956677 - Snapshot merge fails with null pointer exception.
Summary: Snapshot merge fails with null pointer exception.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.4.5
Hardware: Unspecified
OS: Unspecified
unspecified
urgent
Target Milestone: ovirt-4.4.6-1
: 4.4.6
Assignee: Benny Zlotnik
QA Contact: Shir Fishbain
URL:
Whiteboard:
: 1984000 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-05-04 08:24 UTC by Roman Hodain
Modified: 2022-02-21 07:24 UTC (History)
6 users (show)

Fixed In Version: ovirt-engine-4.4.6.7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-01 13:25:23 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Engien logs (4.31 MB, application/gzip)
2021-05-04 08:25 UTC, Roman Hodain
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2021:2182 0 None None None 2021-06-01 13:25:29 UTC
oVirt gerrit 114635 0 master MERGED core: sort fetched commandIds 2021-05-06 11:10:39 UTC
oVirt gerrit 114677 0 ovirt-engine-4.4.6.z MERGED core: sort fetched commandIds 2021-05-06 13:18:53 UTC

Description Roman Hodain 2021-05-04 08:24:36 UTC
Description of problem:
When trying to delete a snapshot the engine fails with a null pointer exception after the merge is finished and the engine attempts to destroy the merged volume. 

Version-Release number of selected component (if applicable):
4.4.5

How reproducible:
So far ones.

Steps to Reproduce:
N/A

Actual results:
The null pointer is raised and further merges fail as the DB still contains the already merged volume. The VM is successfully merged. 

2021-05-01 03:35:35,034+02 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [2ea4228e-f87a-4426-a3bc-897938b38dd1] Command 'RemoveSnapshot' (id: '4d0f3411-cd69-4fbb-b1d9-885ab4f9dbad') waiting on child command id: 'e0e728f1-e7bf-4d21-b0f9-41809776c052' type:'RemoveSnapshotSingleDiskLive' to complete
2021-05-01 03:35:35,077+02 INFO  [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [3c38e773-cddb-44af-8e71-32a1b5d8ff31] Executing Live Merge command step 'DESTROY_IMAGE'
2021-05-01 03:35:35,077+02 ERROR [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [3c38e773-cddb-44af-8e71-32a1b5d8ff31] Command 'RemoveSnapshotSingleDiskLive' id: '1e8d04ed-e6d1-4a3f-b57a-f2738ca1b0fe' with children [bff72d6e-1e0f-4699-8100-ec2e7bc7f1f5, 280d65b3-a186-4db1-9eee-884a2167177e, 289d17ea-144e-4a85-a660-d838f224147e] failed when attempting to perform the next operation, marking as 'ACTIVE'
2021-05-01 03:35:35,077+02 ERROR [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [3c38e773-cddb-44af-8e71-32a1b5d8ff31] null: java.lang.NullPointerException
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand.performNextOperation(RemoveSnapshotSingleDiskLiveCommand.java:132)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback.childCommandsExecutionEnded(SerialChildCommandsExecutionCallback.java:32)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.ChildCommandsCallbackBase.doPolling(ChildCommandsCallbackBase.java:80)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethodsImpl(CommandCallbacksPoller.java:175)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods(CommandCallbacksPoller.java:109)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
        at org.glassfish.javax.enterprise.concurrent.0.redhat-1//org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.access$201(ManagedScheduledThreadPoolExecutor.java:383)
        at org.glassfish.javax.enterprise.concurrent.0.redhat-1//org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.run(ManagedScheduledThreadPoolExecutor.java:534)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
        at org.glassfish.javax.enterprise.concurrent.0.redhat-1//org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250)

2021-05-01 03:35:35,077+02 ERROR [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [3c38e773-cddb-44af-8e71-32a1b5d8ff31] Command id: '1e8d04ed-e6d1-4a3f-b57a-f2738ca1b0fe failed child command status for step 'DESTROY_IMAGE'


Expected results:
Volume is destroyed.

Comment 1 Roman Hodain 2021-05-04 08:25:15 UTC
Created attachment 1779282 [details]
Engien logs

Comment 2 Eyal Shenitzky 2021-05-04 08:44:57 UTC
Benny, can you please have a look?

Comment 3 Eyal Shenitzky 2021-05-04 08:47:15 UTC
Roman, please add all relevant info (VDSM logs, DB dump, versions)

Comment 12 Benny Zlotnik 2021-05-04 16:51:05 UTC
Since the issue occurs at around 03:35:35, which is when the CommandEntityCleanupManager runs, this shouldn't remove anything from the command_entities table but it does cause a reload from the table, which causes this method[1] to behave incorrectly:

        List<Guid> childCommandIds = commandCoordinatorUtil.getChildCommandIds(getCommandId());
        if (childCommandIds.size() != parameters.getChildCommands().size()) {
            parameters.getChildCommands().put(parameters.getCommandStep(), childCommandIds.get(childCommandIds.size()-1));
        }

In a normal run where everything is read and written in the memory, the order will be kept correctly, but when it's reloaded from the database the order changes, which creates an incorrect CommandStep -> commandId mapping



[1] https://github.com/oVirt/ovirt-engine/blob/b7bc2ee76e48f3373f3a2d584f36f22efed1daed/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/snapshots/RemoveSnapshotSingleDiskCommandBase.java#L131

Comment 19 Benny Zlotnik 2021-05-09 08:15:32 UTC
Verified:
1. Add forced cache refresh on each live merge step [1] to simulate CommandEntityCleanupManager
2. Run live merge

The flow was tested both locally and on multiple OST runs



[1] https://gerrit.ovirt.org/c/ovirt-engine/+/114635/1/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/snapshots/RemoveSnapshotSingleDiskLiveCommand.java#93

Comment 23 errata-xmlrpc 2021-06-01 13:25:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (0-day RHV Manager (ovirt-engine) [ovirt-4.4.6]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2182

Comment 24 Benny Zlotnik 2021-07-21 14:20:04 UTC
*** Bug 1984000 has been marked as a duplicate of this bug. ***


Note You need to log in before you can comment on or make changes to this bug.