Bug 2010318

Summary: with sysprep config data as a ConfigMap volume attached as a CDROM disk, a Windows 10 VMI fails to LiveMigrate
Product: Container Native Virtualization (CNV)
Reporter: Kedar Bidarkar <kbidarka>
Component: Virtualization
Assignee: Antonio Cardace <acardace>
Status: CLOSED ERRATA
QA Contact: zhe peng <zpeng>
Severity: medium
Priority: medium
Version: 4.9.0
CC: cnv-qe-bugs, dholler, jlejosne, lpivarc, sgott, ycui, zpeng
Target Milestone: ---
Keywords: TestOnly
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Last Closed: 2022-09-14 19:28:21 UTC
Type: Bug

Description Kedar Bidarkar 2021-10-04 13:00:23 UTC
Description of problem:
With sysprep config data provided as a ConfigMap volume and attached as a CDROM disk, a Windows 10 VMI fails to LiveMigrate.

Version-Release number of selected component (if applicable):
CNV-4.9.0

How reproducible:


Steps to Reproduce:
1. Create a Windows 10 VM via the Templates provided
2. Attach the sysprep-config ConfigMap as a volume and as a CDROM disk by editing the VM
3.
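For reference, the sysprep attachment in step 2 looks roughly like the VM spec excerpt below. This is a sketch: the disk bus and ordering are assumptions; the `sysprep` volume backed by the `sysprep-config` ConfigMap matches the grep output later in this bug.

```yaml
# Sketch of the relevant disk/volume entries (bus is an assumption)
spec:
  template:
    spec:
      domain:
        devices:
          disks:
          - name: sysprep
            cdrom:
              bus: sata
      volumes:
      - name: sysprep
        sysprep:
          configMap:
            name: sysprep-config
```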

Actual results:

The target VMI pod just shows as "Completed", as seen in "Additional info" below.
The VMI remained on node-15, with the same source VMI pod.

LiveMigration fails, as seen from the VMIM object info.

The target VMI pod shows the error below.
 
 
[kbidarka@localhost sysprep]$ oc describe pods virt-launcher-vm-ocs-win10-8m789 | grep -A 20 "Events"
Events:
  Type     Reason                        Age                 From                     Message
  ----     ------                        ----                ----                     -------
  Normal   Scheduled                     73m                 default-scheduler        Successfully assigned default/virt-launcher-vm-ocs-win10-8m789 to node-13.redhat.com
  Normal   SuccessfulAttachVolume        73m                 attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-5100b562-7ad4-489f-8210-40c2b0714d0a"
  Normal   SuccessfulMountVolume         73m                 kubelet                  MapVolume.MapPodDevice succeeded for volume "pvc-5100b562-7ad4-489f-8210-40c2b0714d0a" globalMapPath "/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-5100b562-7ad4-489f-8210-40c2b0714d0a/dev"
  Normal   SuccessfulMountVolume         73m                 kubelet                  MapVolume.MapPodDevice succeeded for volume "pvc-5100b562-7ad4-489f-8210-40c2b0714d0a" volumeMapPath "/var/lib/kubelet/pods/f6641b87-fd64-4289-87bb-fc91d0ea78cc/volumeDevices/kubernetes.io~csi"
  Normal   AddedInterface                73m                 multus                   Add eth0 [10.131.1.11/23] from openshift-sdn
  Normal   Pulled                        73m                 kubelet                  Container image "registry.redhat.io/container-native-virtualization/virt-launcher@sha256:1158e1c0c282ca2ba6af6e9b71a79811d37ac4b51922f989f8f3da68a5800264" already present on machine
  Normal   Created                       73m                 kubelet                  Created container compute
  Normal   Started                       73m                 kubelet                  Started container compute
  Warning  DNSConfigForming              48m (x24 over 73m)  kubelet                  Search Line limits were exceeded, some search paths have been omitted, the applied search line is: default.svc.cluster.local svc.cluster.local cluster.local 
  Warning  MultiplePodDisruptionBudgets  48m (x8 over 73m)   controllermanager        Pod "default"/"virt-launcher-vm-ocs-win10-8m789" matches multiple PodDisruptionBudgets.  Chose "kubevirt-disruption-budget-s6dvh" arbitrarily.
  Warning  FailedMount                   48m                 kubelet                  MountVolume.SetUp failed for volume "sysprep" : object "default"/"sysprep-config" not registered
 
 
 
[kbidarka@localhost sysprep]$ oc get vmim vm-ocs-win10-migration-nhcxg -o yaml | grep phase
        f:phase: {}
  phase: Failed



Expected results:
LiveMigration should succeed

Additional info:

[kbidarka@localhost sysprep]$ oc get pods
NAME                               READY   STATUS      RESTARTS   AGE
virt-launcher-vm-ocs-win10-8m789   0/1     Completed   0          72m
virt-launcher-vm-ocs-win10-q88ml   1/1     Running     0          75m

[kbidarka@localhost sysprep]$ oc get vmi
NAME           AGE   PHASE     IP            NODENAME                           READY
vm-ocs-win10   75m   Running   10.xxx.y.aaa   node-15.redhat.com   True
 
[kbidarka@localhost sysprep]$ oc get vmi vm-ocs-win10 -o yaml | grep sysprep
        name: sysprep
  - name: sysprep
    sysprep:
        name: sysprep-config
  - name: sysprep

Comment 2 Jed Lejosne 2021-10-04 13:21:57 UTC
@

Comment 5 Jed Lejosne 2021-10-04 13:24:38 UTC
@kbidarka Just to confirm, there is no CNV/OCP upgrade involved here, correct? Asking because I would expect a similar error going from CNV 2.6 to CNV 4.8/4.9, due to #2008511...

Comment 6 Kedar Bidarkar 2021-10-04 13:26:25 UTC
Yes, no upgrades involved here. This is seen on a scratch install of 4.9.0 setup.

Comment 7 Jed Lejosne 2021-10-04 13:57:03 UTC
@kbidarka Not sure if it's a copy-paste error, but at the end of the bug description, "- name: sysprep" appears twice in the grep output; that could be an issue!
Even having 2 configmaps with different names could cause issues.
Could we have the full output of `oc get vmi vm-ocs-win10 -o yaml` please?
Thanks!

Comment 9 Jed Lejosne 2021-10-04 19:27:57 UTC
It all looks like the sysprep-config ConfigMap object got deleted between VM creation and migration...
Did you check if it was still there before migrating? What about after? (oc get configmap sysprep-config)

Comment 10 Kedar Bidarkar 2021-10-05 13:19:56 UTC
The sysprep-config ConfigMap still exists on the setup, so it existed all along.

]$ oc get cm 
NAME                       DATA   AGE
kube-root-ca.crt           1      46h
openshift-service-ca.crt   1      46h
sysprep-config             2      27h

Comment 11 Kedar Bidarkar 2021-10-05 13:23:33 UTC
Ah! In comment 4 above, we see no sysprep disks/volumes in the VM yaml file because I updated the VM later, after creating the VM and the ConfigMap.

Comment 12 Kedar Bidarkar 2021-10-05 14:21:36 UTC
$ oc annotate --overwrite -n openshift-cnv hyperconverged kubevirt-hyperconverged kubevirt.kubevirt.io/jsonpatch='[{
       "op": "add",
       "path": "/spec/configuration/migrations",
       "value": {
             "progressTimeout": 1800,
             "bandwidthPerMigration": "128Mi"
        }
   }]'
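For clarity, once HCO applies the jsonpatch annotation above, the KubeVirt CR should end up with a migrations stanza roughly like the following. This is a sketch derived only from the patch path and value above, not a dump from the cluster.

```yaml
# Sketch: expected effect of the jsonpatch on the KubeVirt CR
spec:
  configuration:
    migrations:
      progressTimeout: 1800
      bandwidthPerMigration: 128Mi
```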



[kbidarka@localhost auth]$ oc get pods -o wide 
NAME                                           READY   STATUS      RESTARTS   AGE     IP             NODE                               NOMINATED NODE   READINESS GATES
virt-launcher-vm-ocs-win10-6xxrn               1/1     Running     0          54m     xxx   node-14.redhat.com   <none>           <none>
virt-launcher-vm-ocs-win10-8m789               0/1     Completed   0          27h     xxx    node-13.redhat.com   <none>           <none>
virt-launcher-vm-ocs-win10-q88ml               0/1     Completed   0          27h     xxx    node-15.redhat.com   <none>           <none>
virt-launcher-win10-mathematical-guppy-7cp6s   1/1     Running     0          3h42m   xxx    node-13.redhat.com   <none>           <none>
virt-launcher-win10-mathematical-guppy-j6wgz   0/1     Completed   0          3h52m   xxx   node-14.redhat.com   <none>           <none>
virt-launcher-win10-mathematical-guppy-ktt7h   0/1     Completed   0          3h50m   xxx    node-13.redhat.com   <none>           <none>
virt-launcher-win10-mathematical-guppy-vr7v7   0/1     Completed   0          3h46m   xxx   node-14.redhat.com   <none>           <none>
[kbidarka@localhost auth]$ oc get vmi 
NAME                       AGE     PHASE     IP             NODENAME                           READY
vm-ocs-win10               27h     Running   aa.xx.yy.a2   node-14.redhat.com   True
win10-mathematical-guppy   3h52m   Running   aa.xx.yy.a1   node-13.redhat.com   True



With the above jsonpatch annotation, the LiveMigration succeeded.
I updated:
"progressTimeout": 1800,
"bandwidthPerMigration": "128Mi"


The plan is to restore the defaults and try to reproduce this issue.

Comment 13 Kedar Bidarkar 2021-10-06 08:48:02 UTC
Turns out, even "without updating" or after "restoring to defaults" the migration values in the HCO CR, the VMI LiveMigration works, but only from the 2nd attempt at LiveMigrating it.

[kbidarka@localhost sysprep]$ oc get vmi 
NAME                  AGE   PHASE     IP             NODENAME                           READY
win10-devoted-bison   41m   Running  xx.yy.zz.aa   node-15.redhat.com   True
[kbidarka@localhost sysprep]$ oc get pods 
NAME                                      READY   STATUS      RESTARTS   AGE
virt-launcher-win10-devoted-bison-5gq2x   0/2     Init:0/2    0          6s
virt-launcher-win10-devoted-bison-q4jcw   2/2     Running     0          42m
virt-launcher-win10-devoted-bison-xgxkw   0/2     Completed   0          32m    [ Here Target VMI Pod just entered "Completed" state and the LiveMigration failed. ]  [ VMI Remained on node-15.redhat.com]

 
[kbidarka@localhost sysprep]$ oc get pods 
NAME                                      READY   STATUS      RESTARTS   AGE
virt-launcher-win10-devoted-bison-5gq2x   2/2     Running     0          25m    [ Here, Target VMI Pod is now in "Running" state. ]
virt-launcher-win10-devoted-bison-q4jcw   0/2     Completed   0          67m    [ Here, Source VMI Pod entered "Completed" state and the LiveMigration was successful. ]
virt-launcher-win10-devoted-bison-xgxkw   0/2     Completed   0          57m

[kbidarka@localhost sysprep]$ oc get vmi 
NAME                  AGE   PHASE     IP            NODENAME                           READY
win10-devoted-bison   68m   Running   xx.yy.zz.aa   node-14.redhat.com                 True      [ VMI is now running on node-14.redhat.com]

Comment 14 sgott 2021-10-06 12:06:33 UTC
Would it be possible to collect logs from the migration source and target virt-launcher pods?

Comment 16 sgott 2021-10-06 15:07:33 UTC
Moved to POST by mistake. We're still investigating the root cause.

Comment 19 Kedar Bidarkar 2021-10-06 19:40:53 UTC
Summary from comment17 and comment18:

It appears that, for every new node the VMI pod migrates to, the LiveMigration fails.
As seen in comment17, I cordoned node-15 so that the target pod would be created on node-13, which had no prior VMIs running.

As seen in comment18, the migration failed.

Attaching logs shortly.

Comment 22 sgott 2021-10-07 21:51:20 UTC
Removing blocker flag and deferring to the next point release. So far all failures in logs are timeout related. We will continue to investigate.

Comment 24 sgott 2021-12-20 15:17:17 UTC
Kedar is this still reproducible?

Comment 25 sgott 2022-01-26 19:59:37 UTC
Kedar is this still reproducible?

Comment 26 sgott 2022-01-26 21:17:39 UTC
Lowering the severity because this has not been clearly reproduced in some time.
Lowering the priority because the severity was lowered.

We suspect this has already been fixed by other migration related changes. Thus we are adding the TestOnly flag as it does not appear code changes are necessary.
Deferring to the next release as it is not clear there exists a negative effect at this time.

Kedar, please advise if you can reproduce this.

Comment 28 zhe peng 2022-06-16 06:42:10 UTC
verify with cnv 4.11 build:
$ oc version
Server Version: 4.11.0-fc.0
Kubernetes Version: v1.24.0+beaaed6
$ virtctl version
Server Version: version.Info{GitVersion:"v0.53.1-44-gfc4ffc073", GitCommit:"fc4ffc07324b46158ae55de63c253cdf2b0fe3ac", GitTreeState:"clean", BuildDate:"2022-06-14T11:24:17Z", GoVersion:"go1.17.10", Compiler:"gc", Platform:"linux/amd64"}

step:
same as in the bug description:
create a win VM with sysprep, start the VM and do a live migration;
the migration succeeded.
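The migration can be triggered either with `virtctl migrate vm-win10` or by creating a VMIM object like the following sketch (the `generateName` prefix is an assumption matching the VMIM name seen further down):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceMigration
metadata:
  generateName: vm-win10-migration-
spec:
  vmiName: vm-win10
```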

start migration:
$ oc get pods
NAME                           READY   STATUS    RESTARTS   AGE
virt-launcher-vm-win10-92sqn   2/2     Running   0          2m50s
virt-launcher-vm-win10-zgdcq   2/2     Running   0          24s

after migration:
$ oc get pods
NAME                           READY   STATUS      RESTARTS   AGE
virt-launcher-vm-win10-92sqn   0/2     Completed   0          7m23s
virt-launcher-vm-win10-zgdcq   2/2     Running     0          4m57s

$ oc get virtualmachineinstancemigrations.kubevirt.io
NAME                       PHASE       VMI
vm-win10-migration-5ul2q   Succeeded   vm-win10

$ oc get vmi vm-win10 -o yaml | grep sysprep
        name: sysprep
  - name: sysprep
    sysprep:
        name: sysprep-config
  - name: sysprep

move to verified.

Comment 30 errata-xmlrpc 2022-09-14 19:28:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.11.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6526

Comment 31 Red Hat Bugzilla 2023-09-15 01:36:28 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days