Description of problem:

We have some oVirt clusters that went from 4.4.0 to 4.5.0, mostly without any issue. After each update, the global cluster compatibility level was upgraded (starting from 4.4? until 4.7 currently). Most VMs reboot often for updates etc., and those are running the newest config version. Except some VMs never reboot, and have been live since they were booted on the 4.4/4.5/4.6 compat level. But now we notice that after the latest upgrade to the 4.7 level, those VMs no longer reboot without manual intervention:

2022-06-08 10:02:29,881+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-11) [4035c052] VM '21180680-cba3-4230-9552-5a5bfc9db8b0'(oldvm) moved from 'Up' --> 'RebootInProgress'
2022-06-08 10:02:29,921+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-11) [4035c052] EVENT_ID: USER_REBOOT_VM(157), VM oldvm is rebooting. Rebooted by: Guest OS
2022-06-08 10:02:32,920+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-29) [4035c052] VM '21180680-cba3-4230-9552-5a5bfc9db8b0' was reported as Down on VDS '651716d4-92bb-4f93-9251-0f5a75ec3743'(ovn003)
2022-06-08 10:02:32,923+02 INFO [org.ovirt.engine.core.bll.SaveVmExternalDataCommand] (ForkJoinPool-1-worker-29) [5bb6140] Running command: SaveVmExternalDataCommand internal: true. Entities affected : ID: 21180680-cba3-4230-9552-5a5bfc9db8b0 Type: VM
2022-06-08 10:02:32,925+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-29) [5bb6140] START, DestroyVDSCommand(HostName = ovn003, DestroyVmVDSCommandParameters:{hostId='651716d4-92bb-4f93-9251-0f5a75ec3743', vmId='21180680-cba3-4230-9552-5a5bfc9db8b0', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='true'}), log id: 75a2523
2022-06-08 10:02:32,930+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-29) [5bb6140] FINISH, DestroyVDSCommand, return: , log id: 75a2523
2022-06-08 10:02:32,930+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-29) [5bb6140] VM '21180680-cba3-4230-9552-5a5bfc9db8b0'(oldvm) moved from 'RebootInProgress' --> 'Down'
2022-06-08 10:02:32,936+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-29) [5bb6140] EVENT_ID: VM_DOWN(61), VM oldvm is down. Exit message: Down as a part of the reboot process
2022-06-08 10:02:32,937+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-29) [5bb6140] add VM '21180680-cba3-4230-9552-5a5bfc9db8b0'(oldvm) to cold reboot treatment
2022-06-08 10:02:32,942+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-29) [5bb6140] EVENT_ID: COLD_REBOOT_VM_DOWN(9,611), VM oldvm is down as a part of cold reboot process
2022-06-08 10:02:32,942+02 INFO [org.ovirt.engine.core.bll.VdsEventListener] (ForkJoinPool-1-worker-29) [5bb6140] VM is down as a part of cold reboot process. Attempting to restart. VM Name 'oldvm', VM Id '21180680-cba3-4230-9552-5a5bfc9db8b0
2022-06-08 10:02:32,943+02 INFO [org.ovirt.engine.core.bll.ProcessDownVmCommand] (EE-ManagedThreadFactory-engine-Thread-825484) [1905222f] Running command: ProcessDownVmCommand internal: true.
2022-06-08 10:02:33,019+02 INFO [org.ovirt.engine.core.bll.UpdateVmCommand] (EE-ManagedThreadFactory-engine-Thread-825484) [2f7262a7] Running command: UpdateVmCommand internal: true. Entities affected : ID: 21180680-cba3-4230-9552-5a5bfc9db8b0 Type: VMAction group EDIT_VM_PROPERTIES with role type USER
2022-06-08 10:02:33,040+02 INFO [org.ovirt.engine.core.bll.UpdateRngDeviceCommand] (EE-ManagedThreadFactory-engine-Thread-825484) [949c26a] Running command: UpdateRngDeviceCommand internal: true. Entities affected : ID: 21180680-cba3-4230-9552-5a5bfc9db8b0 Type: VMAction group EDIT_VM_PROPERTIES with role type USER
2022-06-08 10:02:33,045+02 INFO [org.ovirt.engine.core.bll.UpdateGraphicsDeviceCommand] (EE-ManagedThreadFactory-engine-Thread-825484) [2cf57b4b] Running command: UpdateGraphicsDeviceCommand internal: true. Entities affected : ID: 21180680-cba3-4230-9552-5a5bfc9db8b0 Type: VMAction group EDIT_VM_PROPERTIES with role type USER
2022-06-08 10:02:33,048+02 INFO [org.ovirt.engine.core.bll.UpdateGraphicsDeviceCommand] (EE-ManagedThreadFactory-engine-Thread-825484) [6f812870] Running command: UpdateGraphicsDeviceCommand internal: true. Entities affected : ID: 21180680-cba3-4230-9552-5a5bfc9db8b0 Type: VMAction group EDIT_VM_PROPERTIES with role type USER
2022-06-08 10:02:33,052+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-825484) [6f812870] EVENT_ID: SYSTEM_UPDATE_VM(253), VM oldvm configuration was updated by system.
2022-06-08 10:02:33,054+02 INFO [org.ovirt.engine.core.bll.UpdateVmCommand] (EE-ManagedThreadFactory-engine-Thread-825484) [6f812870] Lock freed to object 'EngineLock:{exclusiveLocks='[oldvm=VM_NAME]', sharedLocks='[21180680-cba3-4230-9552-5a5bfc9db8b0=VM]'}'
2022-06-08 10:02:33,435+02 WARN [org.ovirt.engine.core.bll.RunVmCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [1525939] Validation of action 'RunVm' failed for user SYSTEM.
Reasons: VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VM_COMPATIBILITY_VERSION_NOT_SUPPORTED,$VmName oldvm,$VmVersion 4.4,$DcVersion 4.7
2022-06-08 10:02:33,435+02 INFO [org.ovirt.engine.core.bll.RunVmCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [1525939] Lock freed to object 'EngineLock:{exclusiveLocks='[21180680-cba3-4230-9552-5a5bfc9db8b0=VM]', sharedLocks=''}'
2022-06-08 10:02:33,439+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [1525939] EVENT_ID: COLD_REBOOT_FAILED(9,612), Cold reboot of VM oldvm failed

While these VMs never had a custom compatibility level set, it seems the upgrade process sets a new compat level, causing the VM to be unable to boot. I don't know if this is an easy fix? :) At least the compat version should never be set to a value that is not compatible with the cluster, afaik.
Did some more checks, and it seems like something in the cluster upgrade process sets the CustomCompatibilityVersion during the upgrade. The next-run config does not contain a CustomCompatibilityVersion. But on the next cluster upgrade, it uses the CustomCompatibilityVersion set in the VDSStatic, and does not notice that it's not set in the next-run config. So you end up with a next-run config with an old/incorrect CustomCompatibilityVersion. I couldn't find out where in the code the CustomCompatibilityVersion is initially set during the upgrade.
This sets the CustomCompatibilityVersion of the running VM: https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/UpdateVmCommand.java#L320 Just need to find out what takes that value to update the next-run config :)
(In reply to Jean-Louis Dupond from comment #1)
> Did some more checks, and it seems like something in the cluster upgrade
> process sets the CustomCompatibilityVersion during the upgrade.
> The next-run config does not contain a CustomCompatibilityVersion.

Correct, that's the way we handle the required reboots after a CL change. The current entry is changed to the previous CL, and next-run clears it out so that after a reboot the VM boots normally with the new CL configuration.

> But on the next cluster upgrade, it uses the CustomCompatibilityVersion set
> in the VDSStatic, but does not notice its not set in the next-run config.
> And you end up with a Next-Run config with a old/incorrect
> CustomCompatibilityVersion.

So it's rather the other way around - it shouldn't copy the CCV into the next-run at all. But that's the problem we had - we cannot determine whether it was set during the previous CL upgrade (so it should be cleared in next-run) or whether it is an intentional change (someone wants to run this VM with such a compat version) that we do need to keep.
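The two-config mechanism described above can be sketched in a few lines. This is an illustrative model only (the class and method names are invented, not the ovirt-engine API): the effective compatibility version of a config is its CCV when set, otherwise the cluster level, which is why the upgrade pins the running config with the old CL and clears CCV in next-run.

```java
// Hypothetical sketch of the "current config pinned / next-run cleared"
// mechanism; names are illustrative, not real ovirt-engine code.
public class EffectiveVersion {
    // A set CCV overrides the cluster compatibility level.
    static String effective(String clusterLevel, String ccv) {
        return ccv != null ? ccv : clusterLevel;
    }

    public static void main(String[] args) {
        String newClusterLevel = "4.7";
        String runningCcv = "4.6"; // written by the upgrade so the live VM keeps its old config
        String nextRunCcv = null;  // cleared in next-run: a reboot picks up the new CL

        assert effective(newClusterLevel, runningCcv).equals("4.6");
        assert effective(newClusterLevel, nextRunCcv).equals("4.7");
        System.out.println("ok");
    }
}
```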
(In reply to Michal Skrivanek from comment #4)
> > But on the next cluster upgrade, it uses the CustomCompatibilityVersion set
> > in the VDSStatic, but does not notice its not set in the next-run config.
> > And you end up with a Next-Run config with a old/incorrect
> > CustomCompatibilityVersion.
>
> so it's rather the other way around - it shouldn't copy the CCV into the
> next-run at all. But that's the problem we had - we cannot determine if that
> was done during previous CL upgrade (so it should be cleared in next-run) or
> if it is an intentional change (someone wants to run this VM with such compat
> version) that we do need to keep set to the same.

But the CCV info is set into the NEXT_RUN snapshot by the cluster upgrade process, it seems.

When there is no NEXT_RUN snapshot during the cluster upgrade, the NEXT_RUN snapshot XML contains:
<ClusterCompatibilityVersion>4.7</ClusterCompatibilityVersion> (and no CustomCV)

But when there was already a NEXT_RUN snapshot from a previous cluster update, the NEXT_RUN snapshot XML contains the following after the upgrade:
<CustomCompatibilityVersion>4.5</CustomCompatibilityVersion><ClusterCompatibilityVersion>4.7</ClusterCompatibilityVersion>

While just before the cluster upgrade it only contained:
<ClusterCompatibilityVersion>4.6</ClusterCompatibilityVersion>

So for some reason (I could not find where in the code), the cluster upgrade process sets a CustomCompatibilityVersion incorrectly.
(In reply to Jean-Louis Dupond from comment #5)
> (In reply to Michal Skrivanek from comment #4)
> > > But on the next cluster upgrade, it uses the CustomCompatibilityVersion set
> > > in the VDSStatic, but does not notice its not set in the next-run config.
> > > And you end up with a Next-Run config with a old/incorrect
> > > CustomCompatibilityVersion.
> >
> > so it's rather the other way around - it shouldn't copy the CCV into the
> > next-run at all. But that's the problem we had - we cannot determine if that
> > was done during previous CL upgrade (so it should be cleared in next-run) or
> > if it is an intentional change (someone wants to run this VM with such compat
> > version) that we do need to keep set to the same.
>
> But the CCV info is set into the NEXT_RUN snapshot by the cluster upgrade
> process it seems.
>
> When there is no NEXT_RUN snapshot during cluster upgrade, the NEXT_RUN
> snapshot XML contains:
> <ClusterCompatibilityVersion>4.7</ClusterCompatibilityVersion> (and no CustomCV)
>
> But when there was already a NEXT_RUN snapshot by a previous cluster update,
> the NEXT_RUN snapshot XML contains the following after the upgrade:
> <CustomCompatibilityVersion>4.5</CustomCompatibilityVersion><ClusterCompatibilityVersion>4.7</ClusterCompatibilityVersion>
>
> While just before the cluster upgrade it only contained:
> <ClusterCompatibilityVersion>4.6</ClusterCompatibilityVersion>
>
> So for some reason (I could not find where in the code), the cluster upgrade
> process sets a CustomCompatibilityVersion incorrectly.

Yeah, it shouldn't - ovirt-engine tries to reuse the existing next-run configuration when one exists, and that is not supposed to include a custom compatibility level (at least after the fix to bz 1650505).

It seems the original report was different from what you wrote above - the custom compatibility level was set to 4.4 when upgrading to 4.7, right? Did you reproduce the flow above, or is it what you suspect would happen?
When upgrading from level 4.4 to 4.5, it seems to create a correct next-run config without the CustomCompatibilityVersion. But if you don't reboot the VM in the meantime and then upgrade to 4.6, it creates a next-run config with CustomCompatibilityVersion 4.4 included.

This is what I see on my cluster that was upgraded 4.4 -> 4.5 -> 4.6 -> 4.7. Some VMs now have a next-run config with CustomCompatibilityVersion 4.4 or 4.5.

Unfortunately I don't have an older oVirt test cluster to run the upgrade flow on. But I guess it should be easily reproducible by installing oVirt cluster version 4.4.1, for example, then upgrading to a version with the 4.5 compat level, then to 4.6, and then to 4.7.
(In reply to Michal Skrivanek from comment #4)
> (In reply to Jean-Louis Dupond from comment #1)
>
> > But on the next cluster upgrade, it uses the CustomCompatibilityVersion set
> > in the VDSStatic, but does not notice its not set in the next-run config.
> > And you end up with a Next-Run config with a old/incorrect
> > CustomCompatibilityVersion.
>
> so it's rather the other way around - it shouldn't copy the CCV into the
> next-run at all. But that's the problem we had - we cannot determine if that
> was done during previous CL upgrade (so it should be cleared in next-run) or
> if it is an intentional change (someone wants to run this VM with such compat
> version) that we do need to keep set to the same.

I was checking some more on the issue, but it seems like we can?

Case 1 - Cluster upgrade:
If you do a cluster upgrade, the CCV is set on the VMStatic in the following line:
https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/UpdateVmCommand.java#L320
But creating/updating the next-run is handled by the ClusterUpgrade, and this one does NOT add the CCV into the next-run snapshot XML.

Case 2 - Manual CCV change (when no next-run exists):
The code creates a next-run with the chosen CCV.

Case 3 - Manual CCV change (with an existing next-run config):
The code updates the next-run config with the chosen CCV.

So that means that when we do a cluster upgrade and read the next-run config, if that next-run does NOT contain a CCV - even if the current VM has a CCV set in its VMStatic - we should NOT copy it to the next-run config. Because if it was ever set manually, it must already exist in the next-run config anyway!

Guess this can easily be fixed? I just could not find the code which sets the CCV in the next-run on cluster upgrade.
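The rule derived from the three cases above could be sketched as follows. This is a hypothetical illustration (the class, method, and parameter names are invented, not the actual ovirt-engine code): only a CCV that already exists in the old next-run config - i.e. one that was set manually - is carried into the new next-run; the CCV that the upgrade wrote into the VM's static config is never copied.

```java
// Hypothetical sketch of the proposed rule; not real ovirt-engine API.
import java.util.Optional;

public class NextRunCcvRule {
    /**
     * Decide which CustomCompatibilityVersion (CCV) a new next-run snapshot
     * should carry after a cluster upgrade.
     *
     * @param vmStaticCcv        CCV currently on the running VM's static config
     *                           (set to the old cluster level by the upgrade itself)
     * @param existingNextRunCcv CCV found in a pre-existing next-run snapshot,
     *                           if any (present only when set manually)
     */
    static Optional<String> ccvForNewNextRun(Optional<String> vmStaticCcv,
                                             Optional<String> existingNextRunCcv) {
        // Only an explicitly chosen CCV survives into the next-run config.
        // The CCV the upgrade wrote into VMStatic is deliberately ignored,
        // otherwise the VM stays pinned to an old, eventually unsupported level.
        return existingNextRunCcv;
    }

    public static void main(String[] args) {
        // VM upgraded without reboot: VMStatic carries CCV 4.4, but the old
        // next-run had no CCV -> the new next-run must not have one either.
        assert ccvForNewNextRun(Optional.of("4.4"), Optional.empty()).isEmpty();

        // User manually pinned 4.5: the pin is in the next-run and is kept.
        assert ccvForNewNextRun(Optional.of("4.5"), Optional.of("4.5"))
                .equals(Optional.of("4.5"));
        System.out.println("ok");
    }
}
```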
After more debugging, it seems like it was caused by a failed ClusterUpgrade.

During the latest cluster upgrade, the first try to upgrade the cluster failed. One VM was unable to be upgraded, with the error:
Message: Cannot Update VM. Q35 chipset is not supported by the guest OS Red Hat Enterprise Linux 5.x.

This caused all the newly created next-run snapshots to be removed by the compensation command:
Compensating NEW_ENTITY_ID of org.ovirt.engine.core.common.businessentities.Snapshot; snapshot: 4a78068d-2d63-43f3-af14-4cd426846aec.

But as the removal of the existing next-run snapshot was not handled in the compensation context, it was not re-added. So we ended up with all VMs missing their next-run config. The following cluster upgrade worked, but added a CustomCompatibilityVersion to all the VMs.

For QA:
- Have a cluster with some VMs running level 4.5, for example
- Upgrade to level 4.6; all VMs should have a next-run config
- Break a VM (for example, set the Q35 chipset on RHEL 5)
- Try to upgrade to cluster level 4.7
- The update fails

Without the patch, you end up with all VMs losing their next-run config. With the patch, the next-run config should have been restored.
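The compensation gap described above can be shown with a minimal model. This is purely illustrative and does not use the real ovirt-engine CompensationContext API: a "buggy" replace only logs the newly created snapshot for undo, so rolling back loses the old next-run entirely, while the "fixed" variant also logs the deleted snapshot so compensation restores it.

```java
// Minimal, hypothetical model of the compensation bug; invented names,
// not the real org.ovirt.engine compensation classes.
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class CompensationSketch {
    final Map<String, String> snapshots = new HashMap<>(); // vmId -> next-run XML
    final Deque<Runnable> undoLog = new ArrayDeque<>();    // compensation entries (LIFO)

    // Buggy variant: deletes the old next-run without logging it for undo.
    void replaceNextRunBuggy(String vmId, String newXml) {
        snapshots.remove(vmId);                        // old snapshot lost forever
        snapshots.put(vmId, newXml);
        undoLog.push(() -> snapshots.remove(vmId));    // only the NEW entity is compensated
    }

    // Fixed variant: also logs the removed snapshot so compensation restores it.
    void replaceNextRunFixed(String vmId, String newXml) {
        String old = snapshots.remove(vmId);
        if (old != null) {
            undoLog.push(() -> snapshots.put(vmId, old)); // re-add deleted entity on rollback
        }
        snapshots.put(vmId, newXml);
        undoLog.push(() -> snapshots.remove(vmId));
    }

    void compensate() {
        while (!undoLog.isEmpty()) undoLog.pop().run(); // undo in reverse order
    }

    public static void main(String[] args) {
        CompensationSketch buggy = new CompensationSketch();
        buggy.snapshots.put("vm1", "<CCV>4.5</CCV>");   // next-run from a previous upgrade
        buggy.replaceNextRunBuggy("vm1", "<CCV/>");
        buggy.compensate();                             // upgrade failed on another VM
        assert !buggy.snapshots.containsKey("vm1");     // old next-run is gone

        CompensationSketch fixed = new CompensationSketch();
        fixed.snapshots.put("vm1", "<CCV>4.5</CCV>");
        fixed.replaceNextRunFixed("vm1", "<CCV/>");
        fixed.compensate();
        assert "<CCV>4.5</CCV>".equals(fixed.snapshots.get("vm1")); // restored
        System.out.println("ok");
    }
}
```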
This bugzilla is included in oVirt 4.5.2 release, published on August 10th 2022. Since the problem described in this bug report should be resolved in oVirt 4.5.2 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.