Bug 2001923

Summary: NPE during RemoveSnapshotSingleDisk command
Product: Red Hat Enterprise Virtualization Manager Reporter: Martin Perina <mperina>
Component: ovirt-engineAssignee: Benny Zlotnik <bzlotnik>
Status: CLOSED ERRATA QA Contact: Shir Fishbain <sfishbai>
Severity: high Docs Contact:
Priority: medium    
Version: 4.4.7CC: ahadas, apinnick, bzlotnik, dfodor, josgutie, mavital, sfishbai
Target Milestone: ovirt-4.5.1Keywords: ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ovirt-engine-4.5.1.2 Doc Type: Bug Fix
Doc Text:
Previously, when a failed VM snapshot was removed from the Manager database while the volume remained on the storage, subsequent operations failed because there was a discrepancy between the storage and the database. Now, the VM snapshot is retained if the volume is not removed from the storage.
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-07-14 12:54:31 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2081294    
Bug Blocks:    

Description Martin Perina 2021-09-07 13:11:51 UTC
Following exception appeared in the engine.log, details are in attached customer case:

2021-09-04 03:18:19,019+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-37) [c49799a3-82bd-459a-a9cb-385c960dd366] FINISH, GetVolumeInfoVDSCommand, return: org.ovirt.engine.core.common.businessentities.storage.DiskImage@f684c2bf, log id: 1342fe90                         
2021-09-04 03:18:19,019+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-37) [c49799a3-82bd-459a-a9cb-385c960dd366] FINISH, GetImageInfoVDSCommand, return: org.ovirt.engine.core.common.businessentities.storage.DiskImage@f684c2bf, log id: 5fb14bf5                           
2021-09-04 03:18:19,044+02 INFO  [org.ovirt.engine.core.utils.transaction.TransactionSupport] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-37) [c49799a3-82bd-459a-a9cb-385c960dd366] transaction rolled back                                                                                                                                            
2021-09-04 03:18:19,044+02 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-37) [c49799a3-82bd-459a-a9cb-385c960dd366] Failed invoking callback end method 'onSucceeded' for command 'f9c566d0-963c-4c19-8db3-3cfafa0b447a' with exception 'null', the callback is marked for end method retries 
2021-09-04 03:18:19,054+02 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-37) [c49799a3-82bd-459a-a9cb-385c960dd366] Command 'RemoveSnapshot' (id: 'e1013db3-7c37-4ad8-a44d-5054be2c6e2f') waiting on child command id: 'f9c566d0-963c-4c19-8db3-3cfafa0b447a' type:'RemoveSnapshotSingleDiskLive' to complete

2021-09-04 03:19:38,932+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.GetImageInfoVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-88) [c49799a3-82bd-459a-a9cb-385c960dd366] FINISH, GetImageInfoVDSCommand, return: org.ovirt.engine.core.common.businessentities.storage.DiskImage@f684c2bf, log id: 784329b1                           
2021-09-04 03:19:38,956+02 INFO  [org.ovirt.engine.core.utils.transaction.TransactionSupport] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-88) [c49799a3-82bd-459a-a9cb-385c960dd366] transaction rolled back                                                                                                                                            
2021-09-04 03:19:38,965+02 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-88) [c49799a3-82bd-459a-a9cb-385c960dd366] Failed invoking callback end method 'onSucceeded' for command 'f9c566d0-963c-4c19-8db3-3cfafa0b447a' with exception 'null', the callback is marked for end method retries but max number of retries have been attempted. The command will be marked as Failed.
2021-09-04 03:19:38,966+02 INFO  [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-88) [c49799a3-82bd-459a-a9cb-385c960dd366] Exception in invoking callback of command RemoveSnapshotSingleDiskLive (f9c566d0-963c-4c19-8db3-3cfafa0b447a): NullPointerException:␠                                  
2021-09-04 03:19:38,966+02 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-88) [c49799a3-82bd-459a-a9cb-385c960dd366] Error invoking callback method 'onSucceeded' for 'SUCCEEDED' command 'f9c566d0-963c-4c19-8db3-3cfafa0b447a'                                                            
2021-09-04 03:19:38,966+02 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-88) [c49799a3-82bd-459a-a9cb-385c960dd366] Exception: java.lang.NullPointerException                                                                                                                              
␉·······at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskCommandBase.isRemoveTopImageMemoryNeeded(RemoveSnapshotSingleDiskCommandBase.java:267)                                                                                                                                                                                     
␉·······at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskCommandBase.removeTopImageMemoryIfNeeded(RemoveSnapshotSingleDiskCommandBase.java:290)                                                                                                                                                                                     
␉·······at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskCommandBase.handleBackwardMerge(RemoveSnapshotSingleDiskCommandBase.java:257)     
␉·······at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskCommandBase.lambda$syncDbRecords$1(RemoveSnapshotSingleDiskCommandBase.java:182)  
␉·······at org.ovirt.engine.core.utils//org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:181)                                 
␉·······at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskCommandBase.syncDbRecords(RemoveSnapshotSingleDiskCommandBase.java:172)           
␉·······at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand.onSucceeded(RemoveSnapshotSingleDiskLiveCommand.java:232)             
␉·······at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommandCallback.onSucceeded(RemoveSnapshotSingleDiskLiveCommandCallback.java:27)                                                                                                                                                                                       
␉·······at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.endCallback(CommandCallbacksPoller.java:69)                                            
␉·······at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethodsImpl(CommandCallbacksPoller.java:166)                             
␉·······at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods(CommandCallbacksPoller.java:109)                                 
␉·······at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)                                                                                            
␉·······at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)                                                                                                   
␉·······at org.glassfish.javax.enterprise.concurrent.0.redhat-1//org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.access$201(ManagedScheduledThreadPoolExecutor.java:383)                                                                                                                                          
␉·······at org.glassfish.javax.enterprise.concurrent.0.redhat-1//org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.run(ManagedScheduledThreadPoolExecutor.java:534)                                                                                                                                                 
␉·······at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)                                                                                    
␉·······at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)                                                                                    
␉·······at java.base/java.lang.Thread.run(Thread.java:829)                                                                                                                              
␉·······at org.glassfish.javax.enterprise.concurrent.0.redhat-1//org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250)

Comment 2 Eyal Shenitzky 2021-09-13 14:10:59 UTC
Martin,
Can you provide engine and VDSM logs?

Comment 7 José Enrique 2022-02-01 17:21:11 UTC
Hello I've attached another case that is suffering the same issue and produce db snapshot data inconsistencies that interrupt backup operations and create case with a huge amount of VMs affected.

Comment 8 Arik 2022-02-01 20:37:41 UTC
Fixing this should be fairly easy but we need to figure out the flow this happens in, seems to be related to backward merge

Comment 13 Benny Zlotnik 2022-06-21 07:48:09 UTC
1. Edit /usr/lib/python3.6/site-packages/vdsm/virt/jobs/snapshot.py
Add a two minute sleep in line 493:
                time.sleep(120)
                self._vm.run_dom_snapshot(snapxml, snap_flags)

2. Create a snapshot on a VM
3. Run the VM on the edited host
4. Create a live snapshot, wait a minute and shutdown the VM

The operation should fail, but there should be a snapshot that should be removed without issues

Comment 14 Arik 2022-06-21 08:12:43 UTC
(In reply to Benny Zlotnik from comment #13)
> The operation should fail, but there should be a snapshot that should be
> removed without issues

to be more concrete: ... but there should be a snapshot that can be removed manually without issues

Comment 20 Shir Fishbain 2022-06-23 11:33:47 UTC
Verified - RemoveSnapshotSingleDisk command ran without issues and the live snapshot 's2_live' deletion for VM 'new_vm' was completed successfully.

ovirt-engine-4.5.1.2-0.11.el8ev.noarch
vdsm-4.50.1.3-1.el8ev.x86_64

Comment 24 errata-xmlrpc 2022-07-14 12:54:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHV Manager (ovirt-engine) [ovirt-4.5.1] security, bug fix and update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5555