Bug 2167244 - Windows VM with Reenlightenment flag becomes non-migratable after upgrade CNV from 4.10 to 4.11
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 4.11.3
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.10.8
Assignee: Antonio Cardace
QA Contact: Denys Shchedrivyi
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-02-06 02:22 UTC by Denys Shchedrivyi
Modified: 2023-08-08 10:27 UTC

Fixed In Version: virt-operator-container-v4.10.8-9 hco-bundle-registry-container-v4.10.8-37
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-08-08 10:27:24 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | kubevirt kubevirt pull 9201 | 0 | None | open | [release-0.49] Fix Live Migration for reenlightenment VMIs | 2023-02-09 10:41:38 UTC
Red Hat Issue Tracker | CNV-25081 | 0 | None | None | None | 2023-02-06 02:24:17 UTC

Description Denys Shchedrivyi 2023-02-06 02:22:28 UTC
Description of problem:
 A Windows VM was migratable in CNV 4.10.7, but after the upgrade to 4.11.3 it became non-migratable. Here are the events from the VMI:

# VM Created and successfully migrated for the first time:
>  Normal   SuccessfulCreate  49m                  disruptionbudget-controller  Created PodDisruptionBudget kubevirt-disruption-budget-56ncf
>  Normal   SuccessfulCreate  49m                  virtualmachine-controller    Created virtual machine pod virt-launcher-windows-vm-1675645478-4397972-dsmmn
>  Normal   Created           48m                  virt-handler                 VirtualMachineInstance defined.
>  Normal   Started           48m                  virt-handler                 VirtualMachineInstance started.
>  Normal   SuccessfulCreate  14m                  disruptionbudget-controller  Created Migration kubevirt-evacuation-7frcx
>  Normal   SuccessfulUpdate  14m                  virtualmachine-controller    Expanded PodDisruptionBudget kubevirt-disruption-budget-56ncf
>  Normal   PreparingTarget   11m (x2 over 11m)    virt-handler                 VirtualMachineInstance Migration Target Prepared.
>  Normal   PreparingTarget   11m                  virt-handler                 Migration Target is listening at 10.131.0.119, on ports: 35055,36327
>  Normal   Migrating         11m (x5 over 11m)    virt-handler                 VirtualMachineInstance is migrating.
>  Normal   Migrated          11m                  virt-handler                 The VirtualMachineInstance migrated to node cnv-qe-infra-07.cnvqe2.lab.eng.rdu2.redhat.com.
>  Normal   Deleted           11m                  virt-handler                 Signaled Deletion
>  Normal   SuccessfulUpdate  11m                  disruptionbudget-controller  shrank PodDisruptionBudget%!(EXTRA string=kubevirt-disruption-budget-56ncf)

# After CNV was upgraded, this message appeared:
>  Warning  Migrated          5m32s                virt-handler                 EvictionStrategy is set but vmi is not migratable; HyperV Reenlightenment VMIs cannot migrate when TSC Frequency is not exposed on the cluster: guest timers might be inconsistent
>  Warning  Migrated          47s (x8 over 5m32s)  virt-handler                 EvictionStrategy is set but vmi is not migratable; HyperV Reenlightenment VMIs cannot migrate when TSC Frequency is not exposed on the cluster: guest timers might be inconsistent

 The VM is no longer migratable.
 


Version-Release number of selected component (if applicable):
CNV 4.11.3


Steps to Reproduce:
1. Install CNV 4.10.7 
2. Create Windows VM with HyperV Reenlightenment flag enabled
3. Upgrade CNV to 4.11.3


Actual results:
 VM is not migratable after upgrade

Expected results:
 VM should be migratable 

Additional info:
 Restarting the VM after the upgrade (`virtctl restart`) makes it migratable again.

Comment 1 Antonio Cardace 2023-02-06 09:11:09 UTC
This is behaving as expected. If the cluster does not expose the TSC frequency, migration of re-enlightenment Windows VMs is not supported because of changes introduced by QEMU as a safety measure.

This will happen on virtualized nodes if the invtsc CPU model feature is not set (PSI clusters).
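One way to see whether nodes expose a TSC frequency is to look at their `cpu-timer.node.kubevirt.io/*` labels. The following is a minimal sketch, assuming the input has the shape of parsed `oc get nodes -o json` output. The function name `tsc_labels` and the sample nodes `node-a` and `node-b` are hypothetical and not part of KubeVirt.

```python
TSC_FREQUENCY = "cpu-timer.node.kubevirt.io/tsc-frequency"
TSC_SCALABLE = "cpu-timer.node.kubevirt.io/tsc-scalable"


def tsc_labels(node_list):
    """Map node name -> (TSC frequency in Hz or None, tsc-scalable flag)."""
    result = {}
    for node in node_list.get("items", []):
        meta = node.get("metadata", {})
        labels = meta.get("labels", {})
        freq = labels.get(TSC_FREQUENCY)
        result[meta.get("name")] = (
            int(freq) if freq is not None else None,  # None: label not exposed
            labels.get(TSC_SCALABLE) == "true",
        )
    return result


# Hypothetical sample shaped like a node list; node-b exposes no TSC labels.
SAMPLE = {
    "items": [
        {"metadata": {"name": "node-a", "labels": {
            TSC_FREQUENCY: "2099998000", TSC_SCALABLE: "false"}}},
        {"metadata": {"name": "node-b", "labels": {}}},
    ]
}

# Nodes whose TSC frequency is not exposed.
missing = [name for name, (freq, _) in tsc_labels(SAMPLE).items() if freq is None]
```

Against a real cluster, the parsed output of `oc get nodes -o json` could be passed in place of `SAMPLE`.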

Comment 2 Peter Lauterbach 2023-02-06 15:23:51 UTC
Restarting VMs is a disruptive and unexpected operation during an OCP upgrade. This blocks completion of the OCP cluster upgrade.
How can we identify or warn against this problem BEFORE a customer starts an OCP upgrade?

Comment 3 Fabian Deutsch 2023-02-06 19:15:52 UTC
Did this happen on bare metal nodes or virtualized nodes?

Comment 4 Denys Shchedrivyi 2023-02-06 19:19:01 UTC
 Bare-metal (BM) node.

Comment 8 Peter Lauterbach 2023-02-08 22:03:28 UTC
> This will happen on virtualized nodes if the invtsc CPU model feature is not set (PSI clusters)
Please advise how a customer can determine which physical hosts will have this problem and which will not.

Comment 9 Antonio Cardace 2023-02-09 11:09:57 UTC
To clarify concerns, this is indeed a bug caused by breaking changes introduced by QEMU and KubeVirt in the 4.11.1 release.

@pelauter When a reenlightenment VMI is created, it gets a node selector that schedules the VM only on nodes that support the lowest TSC frequency available on the cluster, or on nodes that have the 'cpu-timer.node.kubevirt.io/tsc-scalable' label set to true, since those support TSC frequency scaling.

In practice, these are the nodes on the cluster where Denys found the bug:

  name: monster01.lab.eng.tlv2.redhat.com
    cpu-timer.node.kubevirt.io/tsc-frequency: '2099998000'
    cpu-timer.node.kubevirt.io/tsc-scalable: 'false'

  name: monster02.lab.eng.tlv2.redhat.com
    cpu-timer.node.kubevirt.io/tsc-frequency: '2099998000'
    cpu-timer.node.kubevirt.io/tsc-scalable: 'false'

  name: monster04.lab.eng.tlv2.redhat.com
    cpu-timer.node.kubevirt.io/tsc-frequency: '1699998000'
    cpu-timer.node.kubevirt.io/tsc-scalable: 'false'

  name: zeus08.lab.eng.tlv2.redhat.com
    cpu-timer.node.kubevirt.io/tsc-frequency: '1699998000'
    cpu-timer.node.kubevirt.io/tsc-scalable: 'false'

  name: zeus10.lab.eng.tlv2.redhat.com
    cpu-timer.node.kubevirt.io/tsc-frequency: '2095078000'
    cpu-timer.node.kubevirt.io/tsc-scalable: 'true'

  name: zeus11.lab.eng.tlv2.redhat.com
    cpu-timer.node.kubevirt.io/tsc-frequency: '2095077000'
    cpu-timer.node.kubevirt.io/tsc-scalable: 'true'

The VMI will have a node selector that points to the lowest frequency, which is 1699998000, so this VMI will not be able to run on the first two nodes (monster01 and monster02) because they have a TSC frequency of 2099998000 and are non-scalable.
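The selection rule described above can be sketched as follows. This is an illustration of the rule as stated in this comment, not KubeVirt's actual scheduler code: the function names are hypothetical, and the exact-equality frequency comparison is a simplifying assumption.

```python
# Labels as listed above: node name -> (TSC frequency in Hz, tsc-scalable).
NODES = {
    "monster01.lab.eng.tlv2.redhat.com": (2099998000, False),
    "monster02.lab.eng.tlv2.redhat.com": (2099998000, False),
    "monster04.lab.eng.tlv2.redhat.com": (1699998000, False),
    "zeus08.lab.eng.tlv2.redhat.com": (1699998000, False),
    "zeus10.lab.eng.tlv2.redhat.com": (2095078000, True),
    "zeus11.lab.eng.tlv2.redhat.com": (2095077000, True),
}


def selector_frequency(nodes):
    """Lowest TSC frequency on the cluster, which the node selector targets."""
    return min(freq for freq, _ in nodes.values())


def eligible_nodes(nodes):
    """Nodes that match the target frequency or support TSC scaling."""
    target = selector_frequency(nodes)
    return sorted(
        name
        for name, (freq, scalable) in nodes.items()
        if freq == target or scalable  # assumption: exact match, no tolerance
    )


target = selector_frequency(NODES)
eligible = eligible_nodes(NODES)
excluded = sorted(set(NODES) - set(eligible))
```

With the labels above, `excluded` contains only monster01 and monster02, which matches the explanation in this comment.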

With all this said, the Live-Migration condition turning to false is a real bug and will be fixed in 4.10.8.
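To spot affected VMIs, the live-migration condition can be read from the VMI status. Below is a minimal sketch, assuming the VMI object is available as parsed JSON (for example from `oc get vmi <name> -o json`) and that the condition type is `LiveMigratable`. The helper name `live_migratable` and the sample object are hypothetical, and the sample message is modeled on the event text quoted in the description.

```python
def live_migratable(vmi):
    """Return (status, message) of the LiveMigratable condition.

    status is True or False when the condition is present, None otherwise.
    """
    for cond in vmi.get("status", {}).get("conditions", []):
        if cond.get("type") == "LiveMigratable":
            return cond.get("status") == "True", cond.get("message", "")
    return None, ""


# Hypothetical VMI fragment; only the fields read by the helper are included.
SAMPLE_VMI = {
    "status": {
        "conditions": [
            {"type": "Ready", "status": "True"},
            {
                "type": "LiveMigratable",
                "status": "False",
                "message": "HyperV Reenlightenment VMIs cannot migrate when "
                           "TSC Frequency is not exposed on the cluster: "
                           "guest timers might be inconsistent",
            },
        ]
    }
}

ok, message = live_migratable(SAMPLE_VMI)
```

Running such a check over all VMIs after an upgrade would list the ones that need attention. It does not predict the problem before the upgrade starts.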

Comment 11 Denys Shchedrivyi 2023-03-14 02:09:37 UTC
Verified: a Windows VM with Reenlightenment can be migrated after upgrading from 4.10.8 to 4.11.3.

