Bug 2094729 - Cluster Compatibility Version upgrade breaks VMs with a pending next-run config
Summary: Cluster Compatibility Version upgrade breaks VMs with a pending next-run config
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 4.5.0.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ovirt-4.5.2
Target Release: ---
Assignee: Jean-Louis Dupond
QA Contact: Tamir
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-06-08 09:09 UTC by Jean-Louis Dupond
Modified: 2022-08-30 08:49 UTC
CC: 4 users

Fixed In Version: ovirt-engine-4.5.2
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-08 08:17:16 UTC
oVirt Team: Virt
Embargoed:
pm-rhel: ovirt-4.5?


Attachments: none


Links
GitHub oVirt ovirt-engine pull 509 (open): "Don't set a CustomCompatibilityVersion on ClusterUpgrade" (last updated 2022-06-30 14:36:51 UTC)
Red Hat Issue Tracker RHV-46373 (last updated 2022-06-08 09:19:52 UTC)

Description Jean-Louis Dupond 2022-06-08 09:09:06 UTC
Description of problem:

We have some oVirt clusters that went from 4.4.0 to 4.5.0, mostly without any issues.
After each update, the global cluster compatibility version was also raised (starting from 4.4(?), currently at 4.7).

Most VMs reboot often (for updates etc.) and are running the newest config version, except for some VMs that never reboot and have been up since they were booted at the 4.4/4.5/4.6 compat level.

But now we notice that after the latest upgrade to the 4.7 level, those VMs no longer reboot without manual intervention:

2022-06-08 10:02:29,881+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-11) [4035c052] VM '21180680-cba3-4230-9552-5a5bfc9db8b0'(oldvm) moved from 'Up' --> 'RebootInProgress'
2022-06-08 10:02:29,921+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-11) [4035c052] EVENT_ID: USER_REBOOT_VM(157), VM oldvm is rebooting. Rebooted by: Guest OS
2022-06-08 10:02:32,920+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-29) [4035c052] VM '21180680-cba3-4230-9552-5a5bfc9db8b0' was reported as Down on VDS '651716d4-92bb-4f93-9251-0f5a75ec3743'(ovn003)
2022-06-08 10:02:32,923+02 INFO  [org.ovirt.engine.core.bll.SaveVmExternalDataCommand] (ForkJoinPool-1-worker-29) [5bb6140] Running command: SaveVmExternalDataCommand internal: true. Entities affected :  ID: 21180680-cba3-4230-9552-5a5bfc9db8b0 Type: VM
2022-06-08 10:02:32,925+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-29) [5bb6140] START, DestroyVDSCommand(HostName = ovn003, DestroyVmVDSCommandParameters:{hostId='651716d4-92bb-4f93-9251-0f5a75ec3743', vmId='21180680-cba3-4230-9552-5a5bfc9db8b0', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='true'}), log id: 75a2523
2022-06-08 10:02:32,930+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-29) [5bb6140] FINISH, DestroyVDSCommand, return: , log id: 75a2523
2022-06-08 10:02:32,930+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-29) [5bb6140] VM '21180680-cba3-4230-9552-5a5bfc9db8b0'(oldvm) moved from 'RebootInProgress' --> 'Down'
2022-06-08 10:02:32,936+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-29) [5bb6140] EVENT_ID: VM_DOWN(61), VM oldvm is down. Exit message: Down as a part of the reboot process
2022-06-08 10:02:32,937+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-29) [5bb6140] add VM '21180680-cba3-4230-9552-5a5bfc9db8b0'(oldvm) to cold reboot treatment
2022-06-08 10:02:32,942+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-29) [5bb6140] EVENT_ID: COLD_REBOOT_VM_DOWN(9,611), VM oldvm is down as a part of cold reboot process
2022-06-08 10:02:32,942+02 INFO  [org.ovirt.engine.core.bll.VdsEventListener] (ForkJoinPool-1-worker-29) [5bb6140] VM is down as a part of cold reboot process. Attempting to restart. VM Name 'oldvm', VM Id '21180680-cba3-4230-9552-5a5bfc9db8b0
2022-06-08 10:02:32,943+02 INFO  [org.ovirt.engine.core.bll.ProcessDownVmCommand] (EE-ManagedThreadFactory-engine-Thread-825484) [1905222f] Running command: ProcessDownVmCommand internal: true.
2022-06-08 10:02:33,019+02 INFO  [org.ovirt.engine.core.bll.UpdateVmCommand] (EE-ManagedThreadFactory-engine-Thread-825484) [2f7262a7] Running command: UpdateVmCommand internal: true. Entities affected :  ID: 21180680-cba3-4230-9552-5a5bfc9db8b0 Type: VMAction group EDIT_VM_PROPERTIES with role type USER
2022-06-08 10:02:33,040+02 INFO  [org.ovirt.engine.core.bll.UpdateRngDeviceCommand] (EE-ManagedThreadFactory-engine-Thread-825484) [949c26a] Running command: UpdateRngDeviceCommand internal: true. Entities affected :  ID: 21180680-cba3-4230-9552-5a5bfc9db8b0 Type: VMAction group EDIT_VM_PROPERTIES with role type USER
2022-06-08 10:02:33,045+02 INFO  [org.ovirt.engine.core.bll.UpdateGraphicsDeviceCommand] (EE-ManagedThreadFactory-engine-Thread-825484) [2cf57b4b] Running command: UpdateGraphicsDeviceCommand internal: true. Entities affected :  ID: 21180680-cba3-4230-9552-5a5bfc9db8b0 Type: VMAction group EDIT_VM_PROPERTIES with role type USER
2022-06-08 10:02:33,048+02 INFO  [org.ovirt.engine.core.bll.UpdateGraphicsDeviceCommand] (EE-ManagedThreadFactory-engine-Thread-825484) [6f812870] Running command: UpdateGraphicsDeviceCommand internal: true. Entities affected :  ID: 21180680-cba3-4230-9552-5a5bfc9db8b0 Type: VMAction group EDIT_VM_PROPERTIES with role type USER
2022-06-08 10:02:33,052+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-825484) [6f812870] EVENT_ID: SYSTEM_UPDATE_VM(253), VM oldvm configuration was updated by system.
2022-06-08 10:02:33,054+02 INFO  [org.ovirt.engine.core.bll.UpdateVmCommand] (EE-ManagedThreadFactory-engine-Thread-825484) [6f812870] Lock freed to object 'EngineLock:{exclusiveLocks='[oldvm=VM_NAME]', sharedLocks='[21180680-cba3-4230-9552-5a5bfc9db8b0=VM]'}'
2022-06-08 10:02:33,435+02 WARN  [org.ovirt.engine.core.bll.RunVmCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [1525939] Validation of action 'RunVm' failed for user SYSTEM. Reasons: VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VM_COMPATIBILITY_VERSION_NOT_SUPPORTED,$VmName oldvm,$VmVersion 4.4,$DcVersion 4.7
2022-06-08 10:02:33,435+02 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [1525939] Lock freed to object 'EngineLock:{exclusiveLocks='[21180680-cba3-4230-9552-5a5bfc9db8b0=VM]', sharedLocks=''}'
2022-06-08 10:02:33,439+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [1525939] EVENT_ID: COLD_REBOOT_FAILED(9,612), Cold reboot of VM oldvm failed

While these VMs never had a custom compatibility version set, it seems the upgrade process sets one, leaving the VM unable to boot.


I don't know if this is an easy fix? :) But at the very least, the compat version that gets set should never be a value that is not compatible with the cluster, afaik.
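
For illustration only - this is a hypothetical, self-contained sketch and not the actual ovirt-engine code - the RunVm validation that fails above behaves roughly like this: the VM's effective compatibility version falls back to its custom value when one is set, and the run is rejected when that version is no longer accepted by the data center. The class, method names, and supported-version list below are assumptions.

import java.util.List;

// Hypothetical sketch (names are made up) of the check behind
// ACTION_TYPE_FAILED_VM_COMPATIBILITY_VERSION_NOT_SUPPORTED.
public class CompatCheckSketch {

    record Version(int major, int minor) {
        @Override
        public String toString() {
            return major + "." + minor;
        }
    }

    // The VM runs with its custom compatibility version when one is set,
    // otherwise it follows the cluster's compatibility version.
    static Version effectiveVersion(Version customCompatVersion, Version clusterVersion) {
        return customCompatVersion != null ? customCompatVersion : clusterVersion;
    }

    // RunVm is refused when the effective version is not among the versions
    // the data center still supports (the exact list here is an assumption).
    static boolean canRun(Version effective, List<Version> supportedVersions) {
        return supportedVersions.contains(effective);
    }

    public static void main(String[] args) {
        List<Version> supported = List.of(new Version(4, 6), new Version(4, 7));
        // The stale CustomCompatibilityVersion 4.4 left behind by the upgrade:
        Version effective = effectiveVersion(new Version(4, 4), new Version(4, 7));
        // Prints "4.4 supported: false" - the VM cannot be cold-rebooted.
        System.out.println(effective + " supported: " + canRun(effective, supported));
    }
}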

Comment 1 Jean-Louis Dupond 2022-06-08 10:19:06 UTC
Did some more checks, and it seems like something in the cluster upgrade process sets the CustomCompatibilityVersion during the upgrade.
The next-run config does not contain a CustomCompatibilityVersion.

But on the next cluster upgrade, it uses the CustomCompatibilityVersion set in the VDSStatic and does not notice that it's not set in the next-run config.
And you end up with a next-run config with an old/incorrect CustomCompatibilityVersion.

I couldn't find out where in the code the CustomCompatibilityVersion is initially set during the upgrade.

Comment 2 Jean-Louis Dupond 2022-06-08 11:45:25 UTC
This sets the CustomCompatibilityVersion of the running VM:
https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/UpdateVmCommand.java#L320

Now I just need to find out what takes that value and writes it into the next-run config :)

Comment 3 RHEL Program Management 2022-06-08 11:51:58 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 4 Michal Skrivanek 2022-06-08 11:54:24 UTC
(In reply to Jean-Louis Dupond from comment #1)
> Did some more checks, and it seems like something in the cluster upgrade
> process sets the CustomCompatibilityVersion during the upgrade.
> The next-run config does not contain a CustomCompatibilityVersion.

Correct, that's the way we handle the required reboot after a CL change. The current entry is pinned to the previous CL, and the next-run config clears it out so that after a reboot the VM boots normally with the new CL configuration.
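
To illustrate that mechanism, here is a minimal, self-contained sketch with made-up field and method names - not the actual UpdateVmCommand/ClusterUpgrade code: the running config is pinned to the old CL via CustomCompatibilityVersion, while the next-run config is written without it so a reboot picks up the new CL.

// Hypothetical sketch of the intended behavior on a cluster level (CL) change;
// VmConfig is a stand-in for the real VM configuration objects.
public class ClusterUpgradeSketch {

    static class VmConfig {
        String clusterCompatibilityVersion;
        String customCompatibilityVersion;   // null means "follow the cluster"

        VmConfig copy() {
            VmConfig c = new VmConfig();
            c.clusterCompatibilityVersion = clusterCompatibilityVersion;
            c.customCompatibilityVersion = customCompatibilityVersion;
            return c;
        }
    }

    static VmConfig pinRunningConfig(VmConfig current, String oldCl, String newCl) {
        // The running VM keeps the configuration it was actually started with,
        // so its current entry is pinned to the previous cluster level.
        VmConfig running = current.copy();
        running.clusterCompatibilityVersion = newCl;
        if (running.customCompatibilityVersion == null) {
            running.customCompatibilityVersion = oldCl;
        }
        return running;
    }

    static VmConfig buildNextRunConfig(VmConfig current, String newCl) {
        // The next-run config must not carry that pin: it only records the new
        // cluster level so the VM boots normally with it after a reboot.
        // In this simplified sketch any CCV is cleared; distinguishing an
        // intentionally set CCV is exactly the open problem discussed below.
        VmConfig nextRun = current.copy();
        nextRun.clusterCompatibilityVersion = newCl;
        nextRun.customCompatibilityVersion = null;
        return nextRun;
    }

    public static void main(String[] args) {
        VmConfig vm = new VmConfig();
        vm.clusterCompatibilityVersion = "4.6";
        VmConfig running = pinRunningConfig(vm, "4.6", "4.7");
        VmConfig nextRun = buildNextRunConfig(vm, "4.7");
        System.out.println("running CCV  = " + running.customCompatibilityVersion); // 4.6
        System.out.println("next-run CCV = " + nextRun.customCompatibilityVersion); // null
    }
}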

> But on the next cluster upgrade, it uses the CustomCompatibilityVersion set
> in the VDSStatic, but does not notice its not set in the next-run config.
> And you end up with a Next-Run config with a old/incorrect
> CustomCompatibilityVersion.

so it's rather the other way around - it shouldn't copy the CCV into the next-run at all. But that's the problem we had - we cannot determine if that was done during a previous CL upgrade (so it should be cleared in next-run) or if it is an intentional change (someone wants to run this VM with such a compat version) that we do need to keep set to the same.

Comment 5 Jean-Louis Dupond 2022-06-08 12:25:17 UTC
(In reply to Michal Skrivanek from comment #4)
> > But on the next cluster upgrade, it uses the CustomCompatibilityVersion set
> > in the VDSStatic, but does not notice its not set in the next-run config.
> > And you end up with a Next-Run config with a old/incorrect
> > CustomCompatibilityVersion.
> 
> so it's rather the other way around - it shouldn't copy the CCV into the
> next-run at all. But that's the problem we had - we cannot determine if that
> was done during previous CL upgrade(so it should be cleared in next-run) or
> if it is an intentional change(someone wants to run this VM with such compat
> version) that we do need to keep set to the same.


But it seems the CCV is set in the NEXT_RUN snapshot by the cluster upgrade process.

When there is no NEXT_RUN snapshot during the cluster upgrade, the NEXT_RUN snapshot XML contains:
<ClusterCompatibilityVersion>4.7</ClusterCompatibilityVersion> (and no CustomCV)


But when there was already a NEXT_RUN snapshot from a previous cluster upgrade, the NEXT_RUN snapshot XML contains the following after the upgrade:
<CustomCompatibilityVersion>4.5</CustomCompatibilityVersion><ClusterCompatibilityVersion>4.7</ClusterCompatibilityVersion>


While just before the cluster upgrade it only contained:
<ClusterCompatibilityVersion>4.6</ClusterCompatibilityVersion>


So for some reason (I could not find where in the code), the cluster upgrade process incorrectly sets a CustomCompatibilityVersion.

Comment 6 Arik 2022-06-13 07:43:45 UTC
(In reply to Jean-Louis Dupond from comment #5)
> (In reply to Michal Skrivanek from comment #4)
> > > But on the next cluster upgrade, it uses the CustomCompatibilityVersion set
> > > in the VDSStatic, but does not notice its not set in the next-run config.
> > > And you end up with a Next-Run config with a old/incorrect
> > > CustomCompatibilityVersion.
> > 
> > so it's rather the other way around - it shouldn't copy the CCV into the
> > next-run at all. But that's the problem we had - we cannot determine if that
> > was done during previous CL upgrade(so it should be cleared in next-run) or
> > if it is an intentional change(someone wants to run this VM with such compat
> > version) that we do need to keep set to the same.
> 
> 
> But the CCV info is set into the NEXT_RUN snapshot by the cluster upgrade
> process it seems.
> 
> When there is no NEXT_RUN snapshot during cluster upgrade, the NEXT_RUN
> snapshot XML contains;
> <ClusterCompatibilityVersion>4.7</ClusterCompatibilityVersion> (and no
> CustomCV)
> 
> 
> But when there was already a NEXT_RUN snapshot by a previous cluster update,
> the NEXT_RUN snapshot XML contains the following after the upgrade:
> <CustomCompatibilityVersion>4.5</
> CustomCompatibilityVersion><ClusterCompatibilityVersion>4.7</
> ClusterCompatibilityVersion>
> 
> 
> While just before the cluster upgrade it only contained:
> <ClusterCompatibilityVersion>4.6</ClusterCompatibilityVersion>
> 
> 
> So for some reason (I could not find where in the code), the cluster upgrade
> process sets a CustomCompatibilityVersion incorrectly.

Yeah, it shouldn't, as ovirt-engine tries to use the existing next-run configuration when one exists, and that is not supposed to include a custom compatibility level (at least after the fix for bz 1650505).

It seems the original report was different from what you wrote above - the custom compatibility level was set to 4.4 when upgrading to 4.7, right?
Did you reproduce the flow above, or is it what you suspect would happen?

Comment 7 Jean-Louis Dupond 2022-06-13 07:52:46 UTC
When upgrading from level 4.4 to 4.5, it seems to create a correct next-run config without a CustomCompatibilityVersion.
But if you don't reboot the VM in the meantime and then upgrade to 4.6, it creates a next-run config with CustomCompatibilityVersion 4.4 included.

This is what I see on my cluster that was upgraded 4.4 -> 4.5 -> 4.6 -> 4.7.
Some VMs now have a next-run config with CustomCompatibilityVersion 4.4 or 4.5.

Unfortunately I don't have an older oVirt test cluster to run through the upgrade flow.
But I guess it should be easily reproducible by installing, for example, an oVirt cluster at version 4.4.1, then upgrading to a version with the 4.5 compat level, then to 4.6, and then to 4.7.

Comment 8 Jean-Louis Dupond 2022-06-30 10:26:53 UTC
(In reply to Michal Skrivanek from comment #4)
> (In reply to Jean-Louis Dupond from comment #1)
> 
> > But on the next cluster upgrade, it uses the CustomCompatibilityVersion set
> > in the VDSStatic, but does not notice its not set in the next-run config.
> > And you end up with a Next-Run config with a old/incorrect
> > CustomCompatibilityVersion.
> 
> so it's rather the other way around - it shouldn't copy the CCV into the
> next-run at all. But that's the problem we had - we cannot determine if that
> was done during previous CL upgrade(so it should be cleared in next-run) or
> if it is an intentional change(someone wants to run this VM with such compat
> version) that we do need to keep set to the same.

I was checking the issue some more, and it seems like we can?
Case 1 - Cluster upgrade:
If you do a cluster upgrade, the CCV is set on the VMStatic in the following line:
https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/UpdateVmCommand.java#L320
But creating/updating the next-run config is handled by ClusterUpgrade, and that does NOT add the CCV to the next-run snapshot XML.

Case 2 - CCV manual change (when no next-run config exists):
The code creates a next-run config with the chosen CCV.

Case 3 - CCV manual change (with an existing next-run config):
The code updates the next-run config with the chosen CCV.



So that means: when we do a cluster upgrade and read the next-run config, if that next-run config does NOT contain a CCV, we should NOT copy one into it, even if the current VM has a CCV set in its VMStatic.
This is because, if it had been set manually, it would already exist in the next-run config anyway!


I guess this can easily be fixed? I just could not find the code that sets the CCV in the next-run config on cluster upgrade.
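
A minimal sketch of that rule, with hypothetical names (this is not the actual code in ovirt-engine pull 509): when a cluster upgrade refreshes an existing next-run config, the CustomCompatibilityVersion is taken only from that next-run config and is never copied from the running VM's VMStatic.

import java.util.Optional;

// Hypothetical sketch of the rule argued for above; NextRunConfig is a made-up
// stand-in for the stored next-run snapshot.
public class NextRunUpgradeSketch {

    record NextRunConfig(String clusterCompatibilityVersion,
                         Optional<String> customCompatibilityVersion) {}

    static NextRunConfig refreshForClusterUpgrade(NextRunConfig existing,
                                                  Optional<String> ccvOnRunningVm,
                                                  String newClusterLevel) {
        // An intentionally chosen CCV already lives in the existing next-run
        // config; the CCV on the running VM (VMStatic) is only the pin to the
        // previous cluster level, so it is deliberately ignored here.
        return new NextRunConfig(newClusterLevel, existing.customCompatibilityVersion());
    }

    public static void main(String[] args) {
        // Pending next-run config from the previous upgrade: level 4.6, no CCV.
        NextRunConfig pending = new NextRunConfig("4.6", Optional.empty());
        // The running VM still carries CCV 4.4 as its pin from an earlier level.
        NextRunConfig refreshed =
                refreshForClusterUpgrade(pending, Optional.of("4.4"), "4.7");
        // Prints "4.7 / Optional.empty" - the stale 4.4 pin is not propagated.
        System.out.println(refreshed.clusterCompatibilityVersion()
                + " / " + refreshed.customCompatibilityVersion());
    }
}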

Comment 9 Jean-Louis Dupond 2022-07-06 07:33:34 UTC
After more debugging, it seems this was caused by a failed ClusterUpgrade.

During the latest cluster upgrade, the first attempt to upgrade the cluster failed.
One VM could not be updated, with the error:
Message: Cannot Update VM. Q35 chipset is not supported by the guest OS Red Hat Enterprise Linux 5.x.

This caused all the newly created next-run snapshots to be removed by the compensation command:
Compensating NEW_ENTITY_ID of org.ovirt.engine.core.common.businessentities.Snapshot; snapshot: 4a78068d-2d63-43f3-af14-4cd426846aec.

But as the removal of the existing next-run snapshot was not handled in the compensation context, it was not re-added.
So we ended up with all VMs missing their next-run config.
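
For illustration, a minimal self-contained sketch of the missing compensation step (hypothetical code, not the real ovirt-engine CompensationContext): the removal of a pre-existing next-run snapshot has to be recorded as well, so that a rollback both deletes the newly created snapshots and restores the old ones.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of compensation handling: every change made while
// upgrading the cluster is recorded as an undo action so a failed upgrade
// can roll everything back.
public class CompensationSketch {

    // In-memory stand-in for the next-run snapshot table, keyed by VM id.
    static final Map<String, String> nextRunSnapshots = new HashMap<>();
    static final Deque<Runnable> compensation = new ArrayDeque<>();

    static void replaceNextRun(String vmId, String newXml) {
        String oldXml = nextRunSnapshots.remove(vmId);
        if (oldXml != null) {
            // The step that was missing: record the removal of the existing
            // next-run snapshot so a rollback re-inserts it.
            compensation.push(() -> nextRunSnapshots.put(vmId, oldXml));
        }
        nextRunSnapshots.put(vmId, newXml);
        // Record the creation of the new snapshot so a rollback deletes it.
        compensation.push(() -> nextRunSnapshots.remove(vmId));
    }

    static void rollback() {
        while (!compensation.isEmpty()) {
            compensation.pop().run();
        }
    }

    public static void main(String[] args) {
        nextRunSnapshots.put("vm-1", "<ClusterCompatibilityVersion>4.6</ClusterCompatibilityVersion>");
        replaceNextRun("vm-1", "<ClusterCompatibilityVersion>4.7</ClusterCompatibilityVersion>");
        rollback();   // the upgrade failed on another VM
        // Prints the original 4.6 next-run config: it is restored, not lost.
        System.out.println(nextRunSnapshots.get("vm-1"));
    }
}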

The following ClusterUpgrade attempt worked, but added a CustomCompatibilityVersion to all the VMs.

For QA:
- Have a cluster with some VMs running level 4.5, for example
- Upgrade to level 4.6; all VMs should get a next-run config
- Break a VM (for example, set the Q35 chipset on a RHEL 5 VM)
- Try to upgrade to cluster level 4.7
- The upgrade fails

Without the patch, all VMs end up losing their next-run config.
With the patch, the next-run config should be restored.

Comment 11 Sandro Bonazzola 2022-08-30 08:49:07 UTC
This bugzilla is included in the oVirt 4.5.2 release, published on August 10th 2022.
Since the problem described in this bug report should be resolved in the oVirt 4.5.2 release, it has been closed with a resolution of CURRENT RELEASE.
If the solution does not work for you, please open a new bug report.

