Bug 2010318
| Summary: | Windows 10 VMI with sysprep config data as ConfigMap volume and as CDROM disk fails to LiveMigrate | ||
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Kedar Bidarkar <kbidarka> |
| Component: | Virtualization | Assignee: | Antonio Cardace <acardace> |
| Status: | CLOSED ERRATA | QA Contact: | zhe peng <zpeng> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 4.9.0 | CC: | cnv-qe-bugs, dholler, jlejosne, lpivarc, sgott, ycui, zpeng |
| Target Milestone: | --- | Keywords: | TestOnly |
| Target Release: | 4.11.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2022-09-14 19:28:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
@kbidarka Just to confirm, there is no CNV/OCP upgrade involved here, correct? Asking because I would expect a similar error going from CNV 2.6 to CNV 4.8/4.9, due to #2008511...

Yes, no upgrades are involved here. This is seen on a scratch install of a 4.9.0 setup.

@kbidarka Not sure if it's a copy-paste error, but at the end of the bug description, "- name: sysprep" appears twice in the grep output; that could be an issue! Even having 2 ConfigMaps with different names could cause issues. Could we have the full output of `oc get vmi vm-ocs-win10 -o yaml` please? Thanks!

It all looks like the sysprep-config ConfigMap object got deleted between VM creation and migration... Did you check if it was still there before migrating? What about after? (oc get configmap sysprep-config)

The sysprep-config ConfigMap still exists on the setup, so it existed all along.

]$ oc get cm
NAME                       DATA   AGE
kube-root-ca.crt           1      46h
openshift-service-ca.crt   1      46h
sysprep-config             2      27h

Ah! In comment 4 above, we see no sysprep disks/volumes in the VM yaml file, because I updated the VM later, after creating the VM and creating the ConfigMap.

$ oc annotate --overwrite -n openshift-cnv hyperconverged kubevirt-hyperconverged kubevirt.kubevirt.io/jsonpatch='[{
"op": "add",
"path": "/spec/configuration/migrations",
"value": {
"progressTimeout": 1800,
"bandwidthPerMigration": "128Mi"
}
}]'
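For reference, the HCO consumes the jsonpatch annotation above and propagates it into the KubeVirt CR. The resulting migrations stanza would look roughly like the following (a sketch based on the patch contents; values are taken from the patch, the exact CR layout follows KubeVirt's MigrationConfiguration API):

```yaml
# Effect of the kubevirt.kubevirt.io/jsonpatch annotation on the KubeVirt CR
spec:
  configuration:
    migrations:
      progressTimeout: 1800          # seconds without progress before the migration is aborted
      bandwidthPerMigration: 128Mi   # bandwidth cap applied to each migration
```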
[kbidarka@localhost auth]$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
virt-launcher-vm-ocs-win10-6xxrn 1/1 Running 0 54m xxx node-14.redhat.com <none> <none>
virt-launcher-vm-ocs-win10-8m789 0/1 Completed 0 27h xxx node-13.redhat.com <none> <none>
virt-launcher-vm-ocs-win10-q88ml 0/1 Completed 0 27h xxx node-15.redhat.com <none> <none>
virt-launcher-win10-mathematical-guppy-7cp6s 1/1 Running 0 3h42m xxx node-13.redhat.com <none> <none>
virt-launcher-win10-mathematical-guppy-j6wgz 0/1 Completed 0 3h52m xxx node-14.redhat.com <none> <none>
virt-launcher-win10-mathematical-guppy-ktt7h 0/1 Completed 0 3h50m xxx node-13.redhat.com <none> <none>
virt-launcher-win10-mathematical-guppy-vr7v7 0/1 Completed 0 3h46m xxx node-14.redhat.com <none> <none>
[kbidarka@localhost auth]$ oc get vmi
NAME AGE PHASE IP NODENAME READY
vm-ocs-win10 27h Running aa.xx.yy.a2 node-14.redhat.com True
win10-mathematical-guppy 3h52m Running aa.xx.yy.a1 node-13.redhat.com True
With the above json annotation patch, the LiveMigration turned out to be successful.
I updated,
"progressTimeout": 1800,
"bandwidthPerMigration": "128Mi"
Plan is to try to restore to the defaults and reproduce this issue.
It turns out that even "without updating" or "restoring to defaults" the migration values in the HCO CR, the VMI LiveMigration works, but only from the 2nd attempt at LiveMigrating it.

[kbidarka@localhost sysprep]$ oc get vmi
NAME                  AGE   PHASE     IP            NODENAME             READY
win10-devoted-bison   41m   Running   xx.yy.zz.aa   node-15.redhat.com   True

[kbidarka@localhost sysprep]$ oc get pods
NAME                                      READY   STATUS      RESTARTS   AGE
virt-launcher-win10-devoted-bison-5gq2x   0/2     Init:0/2    0          6s
virt-launcher-win10-devoted-bison-q4jcw   2/2     Running     0          42m
virt-launcher-win10-devoted-bison-xgxkw   0/2     Completed   0          32m

[ Here the Target VMI Pod just entered "Completed" state and the LiveMigration failed. ]
[ The VMI remained on node-15.redhat.com ]

[kbidarka@localhost sysprep]$ oc get pods
NAME                                      READY   STATUS      RESTARTS   AGE
virt-launcher-win10-devoted-bison-5gq2x   2/2     Running     0          25m   [ Here, the Target VMI Pod is now in "Running" state. ]
virt-launcher-win10-devoted-bison-q4jcw   0/2     Completed   0          67m   [ Here, the Source VMI Pod entered "Completed" state and the LiveMigration was successful. ]
virt-launcher-win10-devoted-bison-xgxkw   0/2     Completed   0          57m

[kbidarka@localhost sysprep]$ oc get vmi
NAME                  AGE   PHASE     IP            NODENAME             READY
win10-devoted-bison   68m   Running   xx.yy.zz.aa   node-14.redhat.com   True

[ The VMI is now running on node-14.redhat.com ]

Would it be possible to collect logs from the migration source and target virt-launcher pods?

Moved to POST by mistake. We're still investigating the root cause.

Summary from comment 17 and comment 18: It appears that, for every new node the VMI Pod migrates to, the LiveMigration fails. As seen in comment 17, I cordoned node-15 so that the Target Pod would get created on node-13, which had no prior VMIs running. As seen in comment 18, the migration failed. Attaching logs shortly.

Removing the blocker flag and deferring to the next point release. So far all failures in the logs are timeout related. We will continue to investigate.

Kedar, is this still reproducible?
Lowering the severity because this has not been clearly reproduced in some time. Lowering the priority because the severity was lowered. We suspect this has already been fixed by other migration-related changes, so we are adding the TestOnly flag, as it does not appear code changes are necessary. Deferring to the next release, as it is not clear there exists a negative effect at this time. Kedar, please advise if you can reproduce this.

Verified with CNV 4.11 build:
$ oc version
Server Version: 4.11.0-fc.0
Kubernetes Version: v1.24.0+beaaed6
$ virtctl version
Server Version: version.Info{GitVersion:"v0.53.1-44-gfc4ffc073", GitCommit:"fc4ffc07324b46158ae55de63c253cdf2b0fe3ac", GitTreeState:"clean", BuildDate:"2022-06-14T11:24:17Z", GoVersion:"go1.17.10", Compiler:"gc", Platform:"linux/amd64"}
Steps:
Same as in the bug description: create a Windows VM with sysprep, start the VM, and do a live migration. The migration succeeded.

After starting the migration:
$ oc get pods
NAME READY STATUS RESTARTS AGE
virt-launcher-vm-win10-92sqn 2/2 Running 0 2m50s
virt-launcher-vm-win10-zgdcq 2/2 Running 0 24s
After the migration:
$ oc get pods
NAME READY STATUS RESTARTS AGE
virt-launcher-vm-win10-92sqn 0/2 Completed 0 7m23s
virt-launcher-vm-win10-zgdcq 2/2 Running 0 4m57s
$ oc get virtualmachineinstancemigrations.kubevirt.io
NAME PHASE VMI
vm-win10-migration-5ul2q Succeeded vm-win10
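The migration listed above (the object name is auto-generated) can be triggered either with `virtctl migrate vm-win10` or by creating a VirtualMachineInstanceMigration object directly. A minimal sketch, assuming the kubevirt.io/v1 API and an illustrative metadata name:

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceMigration
metadata:
  name: vm-win10-migration   # illustrative; virtctl generates a suffixed name as shown above
spec:
  vmiName: vm-win10          # the running VMI to live-migrate
```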
$ oc get vmi vm-win10 -o yaml | grep sysprep
name: sysprep
- name: sysprep
sysprep:
name: sysprep-config
- name: sysprep
Moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Virtualization 4.11.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6526

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days
Description of problem:
A Windows 10 VMI with sysprep config data attached as a ConfigMap volume and as a CDROM disk fails to LiveMigrate.

Version-Release number of selected component (if applicable):
CNV-4.9.0

How reproducible:

Steps to Reproduce:
1. Create a Windows 10 VM via the Templates provided
2. Attach the sysprep-config as a ConfigMap volume and as a CDROM disk by editing the VM
3.

Actual results:
The target VMI Pod just shows as "Completed", as seen in the "additional info" below. The VMI remained on node-15, with the same source VMI Pod. LiveMigration fails, as seen from the VMIM object info.

The Target VMI Pod shows the below error:

[kbidarka@localhost sysprep]$ oc describe pods virt-launcher-vm-ocs-win10-8m789 | grep -A 20 "Events"
Events:
  Type     Reason                        Age                 From                     Message
  ----     ------                        ----                ----                     -------
  Normal   Scheduled                     73m                 default-scheduler        Successfully assigned default/virt-launcher-vm-ocs-win10-8m789 to node-13.redhat.com
  Normal   SuccessfulAttachVolume        73m                 attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-5100b562-7ad4-489f-8210-40c2b0714d0a"
  Normal   SuccessfulMountVolume         73m                 kubelet                  MapVolume.MapPodDevice succeeded for volume "pvc-5100b562-7ad4-489f-8210-40c2b0714d0a" globalMapPath "/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-5100b562-7ad4-489f-8210-40c2b0714d0a/dev"
  Normal   SuccessfulMountVolume         73m                 kubelet                  MapVolume.MapPodDevice succeeded for volume "pvc-5100b562-7ad4-489f-8210-40c2b0714d0a" volumeMapPath "/var/lib/kubelet/pods/f6641b87-fd64-4289-87bb-fc91d0ea78cc/volumeDevices/kubernetes.io~csi"
  Normal   AddedInterface                73m                 multus                   Add eth0 [10.131.1.11/23] from openshift-sdn
  Normal   Pulled                        73m                 kubelet                  Container image "registry.redhat.io/container-native-virtualization/virt-launcher@sha256:1158e1c0c282ca2ba6af6e9b71a79811d37ac4b51922f989f8f3da68a5800264" already present on machine
  Normal   Created                       73m                 kubelet                  Created container compute
  Normal   Started                       73m                 kubelet                  Started container compute
  Warning  DNSConfigForming              48m (x24 over 73m)  kubelet                  Search Line limits were exceeded, some search paths have been omitted, the applied search line is: default.svc.cluster.local svc.cluster.local cluster.local
  Warning  MultiplePodDisruptionBudgets  48m (x8 over 73m)   controllermanager        Pod "default"/"virt-launcher-vm-ocs-win10-8m789" matches multiple PodDisruptionBudgets. Chose "kubevirt-disruption-budget-s6dvh" arbitrarily.
  Warning  FailedMount                   48m                 kubelet                  MountVolume.SetUp failed for volume "sysprep" : object "default"/"sysprep-config" not registered

[kbidarka@localhost sysprep]$ oc get vmim vm-ocs-win10-migration-nhcxg -o yaml | grep phase
    f:phase: {}
  phase: Failed

Expected results:
LiveMigration should succeed

Additional info:
[kbidarka@localhost sysprep]$ oc get pods
NAME                               READY   STATUS      RESTARTS   AGE
virt-launcher-vm-ocs-win10-8m789   0/1     Completed   0          72m
virt-launcher-vm-ocs-win10-q88ml   1/1     Running     0          75m

[kbidarka@localhost sysprep]$ oc get vmi
NAME           AGE   PHASE     IP             NODENAME             READY
vm-ocs-win10   75m   Running   10.xxx.y.aaa   node-15.redhat.com   True

[kbidarka@localhost sysprep]$ oc get vmi vm-ocs-win10 -o yaml | grep sysprep
    name: sysprep
  - name: sysprep
    sysprep:
      name: sysprep-config
  - name: sysprep
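For context, the sysprep attachment described in the reproduction steps corresponds to VM resources roughly like the following. This is a sketch assuming KubeVirt's standard sysprep ConfigMap source; the answer-file contents are placeholders, and the two data keys match the `DATA 2` shown for sysprep-config earlier in this bug:

```yaml
# ConfigMap holding the Windows sysprep answer files
apiVersion: v1
kind: ConfigMap
metadata:
  name: sysprep-config
data:
  autounattend.xml: |
    <?xml version="1.0" encoding="utf-8"?>
    <!-- Windows answer file content (placeholder) -->
  unattend.xml: |
    <?xml version="1.0" encoding="utf-8"?>
    <!-- Windows answer file content (placeholder) -->
---
# Relevant fragments of the VM spec: the ConfigMap is exposed as a
# sysprep volume and attached to the guest as a CDROM disk
spec:
  template:
    spec:
      domain:
        devices:
          disks:
          - name: sysprep
            cdrom:
              bus: sata
      volumes:
      - name: sysprep
        sysprep:
          configMap:
            name: sysprep-config
```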