1899768 – Live merge fails on invoking callback end method 'onSucceeded' for a VM with Cluster Chipset/Firmware Type "Cluster default" or "Legacy".

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-engine
Sub Component:
Version:	4.4.2
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	ovirt-4.4.4
Target Release:	4.4.4
Assignee:	Arik
QA Contact:	meital avital
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-11-20 00:03 UTC by Bimal Chollera
Modified:	2024-03-25 17:10 UTC
CC List:	6 users
Fixed In Version:	ovirt-engine-4.4.4.2
Doc Type:	Bug Fix
Doc Text:	Previously, live-merge failed on snapshots of virtual machines that are set with bios-type = CLUSTER-DEFAULT. In this release, live-merge works on snapshots of virtual machines that are set with bios-type = CLUSTER-DEFAULT.
Clone Of:
Environment:
Last Closed:	2021-02-02 13:58:29 UTC
oVirt Team:	Virt
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	5588251	None	None	None	2020-11-20 16:56:15 UTC
Red Hat Knowledge Base (Solution)	5595601	None	None	None	2020-11-23 21:49:58 UTC
Red Hat Product Errata	RHBA-2021:0312	None	None	None	2021-02-02 13:58:35 UTC
oVirt gerrit	112315	master	MERGED	core: preserve bios-type setting on live-merge	2021-02-15 19:56:07 UTC

Description Bimal Chollera 2020-11-20 00:03:27 UTC

Description of problem:

Live merge (snapshot delete) fails with an engine NPE (NullPointerException) on a VM with Cluster Chipset/Firmware Type "Cluster default" or "Legacy".
The merge operation completes on the host side and the images are removed.
However, this leaves the VM in an inconsistent state: the disk images are marked illegal and the snapshot is not removed on the engine side.

In addition, if the VM is stopped it will fail to start, because the leaf image or other images in the disk chain no longer exist on the storage domain.


~~~
2020-11-19 21:54:43,110Z ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [addad639-cb2e-49b8-9ffe-b34ad2139ab0] Failed invoking callback end method 'onSucceeded' for command '910303a8-5eda-49b4-b99e-b36a32c52afa' with exception 'null', the callback is marked for end method retries but max number of retries have been attempted. The command will be marked as Failed.
2020-11-19 21:54:43,110Z INFO  [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [addad639-cb2e-49b8-9ffe-b34ad2139ab0] Exception in invoking callback of command RemoveSnapshotSingleDiskLive (910303a8-5eda-49b4-b99e-b36a32c52afa): NullPointerException: 
2020-11-19 21:54:43,110Z ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [addad639-cb2e-49b8-9ffe-b34ad2139ab0] Error invoking callback method 'onSucceeded' for 'SUCCEEDED' command '910303a8-5eda-49b4-b99e-b36a32c52afa'
2020-11-19 21:54:43,110Z ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [addad639-cb2e-49b8-9ffe-b34ad2139ab0] Exception: java.lang.NullPointerException
        at org.ovirt.engine.core.utils//org.ovirt.engine.core.utils.ovf.OvfWriter.writeBiosType(OvfWriter.java:337)
        at org.ovirt.engine.core.utils//org.ovirt.engine.core.utils.ovf.OvfWriter.writeGeneralData(OvfWriter.java:309)
        at org.ovirt.engine.core.utils//org.ovirt.engine.core.utils.ovf.OvfOvirtWriter.writeGeneralData(OvfOvirtWriter.java:188)
        at org.ovirt.engine.core.utils//org.ovirt.engine.core.utils.ovf.OvfVmWriter.writeGeneralData(OvfVmWriter.java:40)
        at org.ovirt.engine.core.utils//org.ovirt.engine.core.utils.ovf.OvfWriter.buildVirtualSystem(OvfWriter.java:192)
        at org.ovirt.engine.core.utils//org.ovirt.engine.core.utils.ovf.IOvfBuilder.build(IOvfBuilder.java:34)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.utils.ovf.OvfManager.exportVm(OvfManager.java:77)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.ImagesHandler.prepareSnapshotConfigWithAlternateImage(ImagesHandler.java:803)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.ImagesHandler.prepareSnapshotConfigWithoutImageSingleImage(ImagesHandler.java:759)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskCommandBase.lambda$updateVmConfigurationForImageRemoval$2(RemoveSnapshotSingleDiskCommandBase.java:318)
        at org.ovirt.engine.core.utils//org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:181)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskCommandBase.updateVmConfigurationForImageRemoval(RemoveSnapshotSingleDiskCommandBase.java:316)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskCommandBase.handleBackwardMerge(RemoveSnapshotSingleDiskCommandBase.java:253)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskCommandBase.lambda$syncDbRecords$0(RemoveSnapshotSingleDiskCommandBase.java:173)
        at org.ovirt.engine.core.utils//org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:181)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskCommandBase.syncDbRecords(RemoveSnapshotSingleDiskCommandBase.java:163)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand.onSucceeded(RemoveSnapshotSingleDiskLiveCommand.java:232)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommandCallback.onSucceeded(RemoveSnapshotSingleDiskLiveCommandCallback.java:27)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.endCallback(CommandCallbacksPoller.java:69)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethodsImpl(CommandCallbacksPoller.java:166)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods(CommandCallbacksPoller.java:109)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
        at org.glassfish.javax.enterprise.concurrent.0.redhat-1//org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.access$201(ManagedScheduledThreadPoolExecutor.java:383)
        at org.glassfish.javax.enterprise.concurrent.0.redhat-1//org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.run(ManagedScheduledThreadPoolExecutor.java:534)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
        at org.glassfish.javax.enterprise.concurrent.0.redhat-1//org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250)
~~~

Version-Release number of selected component (if applicable):

4.4.2.6

How reproducible:
100%

Steps to Reproduce:
1.  Use a cluster whose Chipset/Firmware Type setting is "Cluster Default".
2.  Create a VM with 2 disks and start the VM.
3.  Create a snapshot.
4.  Delete the snapshot; the deletion fails with an NPE.

Actual results:

Engine faults with an NPE, leaving the VM disk images in an illegal state.

Expected results:

Engine shouldn't fault with an NPE.


Additional info:

Comment 3 Michal Skrivanek 2020-11-20 10:33:43 UTC

This is a major flow with serious impact; raising to Urgent.

Comment 4 Arik 2020-11-22 11:56:56 UTC

Reducing back to high severity, since it is rather unlikely to have a cluster in which live-merge can be executed and that is still set with bios-type=CLUSTER-DEFAULT.
This is most likely solved already by other changes in this area, but that needs to be verified.

Comment 5 Arik 2020-11-22 21:17:01 UTC

Changing back to urgent since this could happen also when the cluster's bios type is set to 'Legacy'.
It is kind of fixed already in 4.4.3 but not entirely -
there is no NPE anymore so the live-merge operation succeeds but the VM configuration within the snapshot that is modified during the live-merge operation would lack the original cluster's bios-type. The posted patch would address that.
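
The engine code is not quoted in this bug, so the following is only a hypothetical illustration of the two behaviours described above, not the actual OvfWriter logic. It assumes the written bios-type falls back to a cluster-derived value when the VM is set to Cluster Default. If that value is missing, an unguarded write throws a NullPointerException, while a guarded write succeeds but omits the element. The class, enum, method and element names are all invented for the sketch.

```java
import java.util.Optional;

// Hypothetical sketch; none of these names come from ovirt-engine.
final class EffectiveBiosTypeSketch {
    enum BiosType { CLUSTER_DEFAULT, LEGACY, Q35_SEA_BIOS }

    // A VM-level custom value wins; otherwise fall back to the cluster-derived value,
    // which may be null if it was never loaded or preserved.
    private static BiosType effective(BiosType vmBiosType, BiosType clusterBiosType) {
        return vmBiosType == BiosType.CLUSTER_DEFAULT ? clusterBiosType : vmBiosType;
    }

    // Unguarded write: dereferences the effective value and throws
    // NullPointerException when it is null.
    static String writeUnguarded(BiosType vmBiosType, BiosType clusterBiosType) {
        BiosType type = effective(vmBiosType, clusterBiosType);
        return "<BiosType>" + type.name() + "</BiosType>";
    }

    // Guarded write: no exception, but the element is simply missing
    // when the effective value is null.
    static Optional<String> writeGuarded(BiosType vmBiosType, BiosType clusterBiosType) {
        return Optional.ofNullable(effective(vmBiosType, clusterBiosType))
                .map(type -> "<BiosType>" + type.name() + "</BiosType>");
    }

    public static void main(String[] args) {
        // VM with a custom bios-type: the missing cluster value does not matter.
        System.out.println(writeUnguarded(BiosType.LEGACY, null));
        // VM set to Cluster Default with no cluster value: nothing is written.
        System.out.println(writeGuarded(BiosType.CLUSTER_DEFAULT, null).isPresent());
    }
}
```

Under these assumptions, a VM with a custom bios-type never reaches the null value, and the guarded variant avoids the exception at the cost of losing the cluster's bios-type in the written configuration.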

Comment 6 Germano Veit Michel 2020-11-22 23:15:04 UTC

Looking for a workaround, I noticed that a CL 4.4 with 'Cluster Default' setting will set the VMs to Legacy (i440fx+SeaBIOS).
Is this right? I was expecting Q35+SeaBIOS.

So to workaround this one needs to change the cluster to Q35+Bios and then power cycle all VMs?

Comment 7 Arik 2020-11-23 08:22:41 UTC

(In reply to Germano Veit Michel from comment #6)
> Looking for a workaround, I noticed that a CL 4.4 with 'Cluster Default'
> setting will set the VMs to Legacy (i440fx+SeaBIOS).
> Is this right? I was expecting Q35+SeaBIOS.

That 'Cluster Default' setting on the cluster is a bit confusing - what it actually means is something like 'Auto Detect' and once there's an active host in the cluster this bios-type field should change to a different value. We blocked the ability to change a cluster that is set with bios-type!='Cluster Default' back to 'Cluster Default' (IIRC, in 4.4.3) so it should not be possible anymore to start VMs in a cluster which is set with 'Cluster Default'.
It could be that in 4.4.2 it caused VMs to be set with the legacy bios but this scenario of a cluster with an active host that is set with 'Cluster Default' should really be avoided.

> 
> So to workaround this one needs to change the cluster to Q35+Bios and then
> power cycle all VMs?

Actually this one can happen regardless of the cluster's bios-type so even changing it to Q35+BIOS wouldn't help here.
Setting the VMs with custom bios-type can prevent this.

Comment 10 Michal Skrivanek 2020-11-23 15:33:04 UTC

keeping open for comment #5, it's still worth a fix. But the original issue shouldn't happen anymore in 4.4.3

Comment 13 Marina Kalinin 2020-11-23 18:39:00 UTC

How do I know in each setup what Cluster Default equals to?
If it was upgraded from RHV 4.3, would Cluster Default mean Legacy? 
Where do we set Cluster Default and when is it changed, if it is changed automatically with upgrade at any point?

Comment 19 Arik 2020-11-23 19:57:22 UTC

(In reply to Marina Kalinin from comment #13)
> How do I know in each setup what Cluster Default equals to?

Think of Cluster-Default as 'Auto-Detect' with the following logic:
on PPC, it will become Legacy
on pre-4.4 clusters, it will become Legacy
otherwise, it will become Q35+SeaBIOS

> If it was upgraded from RHV 4.3, would Cluster Default mean Legacy? 

it will actually be set to Legacy, there's no auto-detection in this case.


> Where do we set Cluster Default and when is it changed, if it is changed
> automatically with upgrade at any point?

it is the default value when creating new clusters.
it changes when the first host is activated in the cluster.
it doesn't change on upgrade.
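
The auto-detect rules listed in this comment can be sketched roughly as follows. This is a minimal illustration, not the engine's implementation: the class, enum and method names are hypothetical, and the cluster compatibility version is assumed to be available as a major/minor pair.

```java
// Hypothetical sketch of the 'Auto-Detect' rules; names are not the ovirt-engine API.
final class BiosTypeAutoDetectSketch {
    enum Architecture { X86_64, PPC64 }
    enum BiosType { CLUSTER_DEFAULT, LEGACY, Q35_SEA_BIOS }

    // True when the compatibility version major.minor is lower than 4.4.
    static boolean isPre44(int major, int minor) {
        return major < 4 || (major == 4 && minor < 4);
    }

    // Resolves the cluster bios-type at the point where the first host is activated.
    // A cluster that already has a concrete value (for example Legacy after an
    // upgrade from 4.3) is returned unchanged.
    static BiosType resolve(BiosType current, Architecture arch, int major, int minor) {
        if (current != BiosType.CLUSTER_DEFAULT) {
            return current;
        }
        if (arch == Architecture.PPC64) {
            return BiosType.LEGACY;
        }
        if (isPre44(major, minor)) {
            return BiosType.LEGACY;
        }
        return BiosType.Q35_SEA_BIOS;
    }

    public static void main(String[] args) {
        // New 4.4 x86_64 cluster: prints Q35_SEA_BIOS
        System.out.println(resolve(BiosType.CLUSTER_DEFAULT, Architecture.X86_64, 4, 4));
        // New 4.3 x86_64 cluster: prints LEGACY
        System.out.println(resolve(BiosType.CLUSTER_DEFAULT, Architecture.X86_64, 4, 3));
    }
}
```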

Comment 27 Beni Pelled 2020-11-30 09:12:18 UTC

Verified with:
- ovirt-engine-4.4.4.2-0.1.el8ev.noarch
- vdsm-4.40.38-1.el8ev.x86_64
- libvirt-6.6.0-7.module+el8.3.0+8424+5ea525c5.x86_64

Verification steps:
1. Create a VM with 2 disks
2. Make sure the VM's 'Custom Chipset/Firmware Type' is 'Cluster Default' (the cluster default in this case is 'Q35 Chipset with BIOS')
3. Start the VM and create a snapshot (with memory)
4. Delete the snapshot created in step 3

Result:
- Snapshot deleted successfully and no NPE appears on engine.log

Comment 28 Arik 2020-12-02 08:59:28 UTC

Beni, can you please also check the following:
1. Create a VM with a single disk
2. Run the VM
3. Create a snapshot - snapshot1
4. Create a snapshot - snapshot2
5. Remove 'snapshot1'
6. Preview 'snapshot2'
7. Run the VM in preview mode

Comment 29 Beni Pelled 2020-12-03 08:08:55 UTC

Verified by automation (same env. as on comment 27)

Comment 35 errata-xmlrpc 2021-02-02 13:58:29 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV Engine and Host Common Packages 4.4.z [ovirt-4.4.4]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0312
