Description of problem:
Live merge (snapshot delete) fails with an engine NPE on a VM whose Cluster Chipset/Firmware Type is "Cluster default" or "Legacy".
The merge operation completes on the host side and images are removed.
However, this leaves the VM in an inconsistent state: on the engine side the disk images are left in an illegal state and the snapshot is not removed.
The VM will also fail to start if stopped, because the leaf image or images in the disk chain no longer exist on the storage domain.
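To illustrate the resulting inconsistency, here is a minimal sketch (an illustrative model only, not actual engine/VDSM code; the function and volume names are invented for the example): the host-side merge removes a volume from storage, but the engine NPEs before updating its recorded chain, so the engine still references a volume that is gone.

```python
# Illustrative model of the post-failure state (not actual engine/VDSM code).
# The VM can only start if every volume the engine's chain references
# actually exists on the storage domain.

def can_start(engine_chain, storage_volumes):
    """Return True if every image the engine expects is present on storage."""
    return all(vol in storage_volumes for vol in engine_chain)

# Engine's recorded chain before the failed 'onSucceeded' callback:
engine_chain = ["base", "snap", "leaf"]

# Host side: the live merge completed and removed the merged volume,
# but the engine NPE'd before it could update its own view of the chain.
storage_volumes = {"base", "leaf"}

print(can_start(engine_chain, storage_volumes))  # False: 'snap' no longer exists
```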
2020-11-19 21:54:43,110Z ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [addad639-cb2e-49b8-9ffe-b34ad2139ab0] Failed invoking callback end method 'onSucceeded' for command '910303a8-5eda-49b4-b99e-b36a32c52afa' with exception 'null', the callback is marked for end method retries but max number of retries have been attempted. The command will be marked as Failed.
2020-11-19 21:54:43,110Z INFO [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [addad639-cb2e-49b8-9ffe-b34ad2139ab0] Exception in invoking callback of command RemoveSnapshotSingleDiskLive (910303a8-5eda-49b4-b99e-b36a32c52afa): NullPointerException:
2020-11-19 21:54:43,110Z ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [addad639-cb2e-49b8-9ffe-b34ad2139ab0] Error invoking callback method 'onSucceeded' for 'SUCCEEDED' command '910303a8-5eda-49b4-b99e-b36a32c52afa'
2020-11-19 21:54:43,110Z ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [addad639-cb2e-49b8-9ffe-b34ad2139ab0] Exception: java.lang.NullPointerException
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Use a cluster whose Chipset/Firmware Type is set to "Cluster Default".
2. Create a VM with 2 disks and start the VM.
3. Create a snapshot.
4. Delete the snapshot.

Actual results:
Delete of the snapshot fails with an NPE; the engine faults, leaving the VM disk images in an illegal state.

Expected results:
The engine shouldn't fault with an NPE.
This is a major flow with serious impact; raising to Urgent.
Reducing back to High severity, since the likelihood of a cluster in which live merge can be executed while bios-type is still set to CLUSTER_DEFAULT is rather low.
This is most likely already solved by other changes in this area, but we need to make sure.
Changing back to urgent since this could happen also when the cluster's bios type is set to 'Legacy'.
It is partially fixed already in 4.4.3, but not entirely:
there is no NPE anymore, so the live-merge operation succeeds, but the VM configuration within the snapshot that is modified during the live merge would lack the original cluster's bios-type. The posted patch addresses that.
Looking for a workaround, I noticed that a CL 4.4 with 'Cluster Default' setting will set the VMs to Legacy (i440fx+SeaBIOS).
Is this right? I was expecting Q35+SeaBIOS.
So to workaround this one needs to change the cluster to Q35+Bios and then power cycle all VMs?
(In reply to Germano Veit Michel from comment #6)
> Looking for a workaround, I noticed that a CL 4.4 with 'Cluster Default'
> setting will set the VMs to Legacy (i440fx+SeaBIOS).
> Is this right? I was expecting Q35+SeaBIOS.
That 'Cluster Default' setting on the cluster is a bit confusing. What it actually means is something like 'Auto Detect': once there is an active host in the cluster, this bios-type field should change to a different value. We blocked the ability to change a cluster with bios-type != 'Cluster Default' back to 'Cluster Default' (IIRC, in 4.4.3), so it should no longer be possible to start VMs in a cluster that is set with 'Cluster Default'.
It could be that in 4.4.2 this caused VMs to be set with the legacy BIOS, but the scenario of a cluster that has an active host and is still set with 'Cluster Default' should really be avoided.
> So to workaround this one needs to change the cluster to Q35+Bios and then
> power cycle all VMs?
Actually, this one can happen regardless of the cluster's bios-type, so even changing it to Q35+BIOS wouldn't help here.
Setting the VMs with a custom bios-type can prevent this.
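One way to set a custom bios-type per VM (so it no longer inherits the cluster's value) is through the REST API. A sketch, assuming the standard VM update endpoint; `{vm_id}` is a placeholder and the enum value shown is one of the documented `BiosType` values:

```xml
<!-- PUT /ovirt-engine/api/vms/{vm_id}
     Sets an explicit bios type on the VM instead of inheriting
     the cluster's (possibly 'cluster_default') setting.
     {vm_id} is a placeholder. -->
<vm>
  <bios>
    <type>q35_sea_bios</type>
  </bios>
</vm>
```

The VM needs a power cycle for the new chipset/firmware type to take effect.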
Keeping open for comment #5; it's still worth a fix. But the original issue shouldn't happen anymore in 4.4.3.
How do I know in each setup what Cluster Default equals to?
If it was upgraded from RHV 4.3, would Cluster Default mean Legacy?
Where do we set Cluster Default and when is it changed, if it is changed automatically with upgrade at any point?
(In reply to Marina Kalinin from comment #13)
> How do I know in each setup what Cluster Default equals to?
Think of 'Cluster Default' as 'Auto-Detect' with the following logic:
- on PPC, it will become Legacy
- on pre-4.4 clusters, it will become Legacy
- otherwise, it will become Q35+SeaBIOS
> If it was upgraded from RHV 4.3, would Cluster Default mean Legacy?
It will actually be set to Legacy; there's no auto-detection in this case.
> Where do we set Cluster Default and when is it changed, if it is changed
> automatically with upgrade at any point?
it is the default value when creating new clusters.
it changes when the first host is activated in the cluster.
it doesn't change on upgrade.
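The auto-detect logic described above can be sketched as follows (an illustrative sketch, not the actual engine code; the function name and the architecture/version strings are assumptions for the example):

```python
def effective_bios_type(architecture, cluster_version):
    """Resolve what 'Cluster Default' auto-detects to when the first
    host is activated in the cluster (illustrative sketch only)."""
    if architecture == "ppc64":
        return "Legacy"            # PPC always gets the legacy type
    major, minor = (int(p) for p in cluster_version.split("."))
    if (major, minor) < (4, 4):
        return "Legacy"            # pre-4.4 compatibility levels
    return "Q35+SeaBIOS"           # 4.4+ x86_64 clusters

print(effective_bios_type("x86_64", "4.4"))  # Q35+SeaBIOS
print(effective_bios_type("x86_64", "4.3"))  # Legacy
print(effective_bios_type("ppc64", "4.4"))   # Legacy
```

Note that this resolution happens only when the first host is activated; a cluster upgraded from 4.3 keeps Legacy rather than re-running the detection.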
1. Create a VM with 2 disks
2. Make sure the VM's 'Custom Chipset/Firmware Type' is 'Cluster Default' (the cluster default in this case is 'Q35 Chipset with BIOS')
3. Start the VM and create a snapshot (with memory)
4. Delete the snapshot created in step 3
- Snapshot deleted successfully and no NPE appears in engine.log
Beni, can you please also check the following:
1. Create a VM with a single disk
2. Run the VM
3. Create a snapshot - snapshot1
4. Create a snapshot - snapshot2
5. Remove 'snapshot1'
6. Preview 'snapshot2'
7. Run the VM in preview mode
Verified by automation (same env. as on comment 27)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (RHV Engine and Host Common Packages 4.4.z [ovirt-4.4.4]), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.