Description of problem:
If a node is NotReady while the pipelines are creating VMs, the VMs are not moved to a new node and we see the message:

  EvictionStrategy is set but vmi is not migratable; cannot migrate VMI: PVC windows-5n76q2-installcdrom is not shared, live migration requires that all PVCs must be shared (using ReadWriteMany access mode)

In the tekton job the PVC is RWO:
https://github.com/kubevirt/tekton-tasks-operator/blob/main/data/tekton-pipelines/okd/windows-efi-installer-pipeline.yaml#L263

$ oc get nodes
NAME                           STATUS     ROLES                         AGE    VERSION
c01-gk413ccl3-gww9w-master-0   Ready      control-plane,master,worker   5h9m   v1.26.0+9eb81c2
c01-gk413ccl3-gww9w-master-1   NotReady   control-plane,master,worker   5h9m   v1.26.0+9eb81c2
c01-gk413ccl3-gww9w-master-2   Ready      control-plane,master,worker   5h9m   v1.26.0+9eb81c2

[cloud-user@ocp-psi-executor ~]$ oc get vmi -A
NAMESPACE       NAME             AGE   PHASE     IP             NODENAME                       READY
openshift-cnv   windows-5g7u8n   22m   Running   10.129.0.182   c01-gk413ccl3-gww9w-master-1   False

Version-Release number of selected component (if applicable):
4.13.0

How reproducible:
always

Steps to Reproduce:
1. Used a compact cluster to reproduce.
2. Run the automation job.
3.

Actual results:
The VM is not migrated, and the pipeline ends with an error because the VMI task never completes.

Expected results:
The VM should be migrated to a different node.

Additional info:
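For reference, a minimal sketch of the access-mode change the error message asks for, applied to the DataVolume/PVC definition referenced above (illustrative only, the actual definition lives in the linked pipeline YAML):

  pvc:
    accessModes:
      - ReadWriteMany        # live migration needs RWX; the pipeline currently requests ReadWriteOnce
    resources:
      requests:
        storage: 9Gi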
Would it work if we use the storage API instead of a PVC, e.g. like this?

  storage:
    resources:
      requests:
        storage: 9Gi
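A hedged sketch of what that suggestion would look like as a full DataVolume (names and source are illustrative, not the pipeline's actual template): with spec.storage instead of spec.pvc, accessModes and volumeMode can be left out and CDI fills them in from the cluster's StorageProfile, which may result in ReadWriteMany where the storage class supports it:

  apiVersion: cdi.kubevirt.io/v1beta1
  kind: DataVolume
  metadata:
    name: windows-installcdrom            # hypothetical name
  spec:
    source:
      http:
        url: "..."                        # ISO source elided
    storage:                              # storage API: no explicit accessModes
      resources:
        requests:
          storage: 9Gi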
When the VM is running and the node goes down, the VM will be in an error state and the wait-for-vmi-status task will fail the whole pipeline. Because the VM is in an error state, even if it were migrated to a different node, the pipeline would not continue. So changing the access mode will not help in this case. If we want the pipeline not to fail while the VM is in an error state, we would have to change the behaviour of wait-for-vmi-status so it does not fail when an error occurs. That opens a potential issue: the VM could be stuck in an error state, unable to recover, and the pipeline would keep running instead of failing too.
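For illustration only, this is roughly what relaxing the task would look like in a pipeline step (parameter names are from memory and should be verified against the wait-for-vmi-status task definition); it shows exactly the trade-off described above, since a VM stuck in an error state would then never fail the run:

  - name: wait-for-vmi-status
    taskRef:
      kind: Task
      name: wait-for-vmi-status
    params:
      - name: vmiName
        value: "$(params.vmName)"          # hypothetical wiring
      - name: successCondition
        value: "status.phase == Succeeded"
      - name: failureCondition
        value: ""                          # assumed to mean: do not fail on error phases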
The current behavior of failing the whole pipeline on the first internal error should be kept; nevertheless, live migration of the VM should be enabled, and we will retry the scenario of running the VM on a node that changes to the NotReady state.
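For context, a minimal sketch of the VM-side requirement (illustrative; the error message above shows the eviction strategy is already set, so the missing piece is the shared storage):

  apiVersion: kubevirt.io/v1
  kind: VirtualMachine
  spec:
    template:
      spec:
        evictionStrategy: LiveMigrate   # already set according to the error message
        # in addition, every disk's backing PVC must be ReadWriteMany for the
        # migration to actually be possible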
With the rework of the example pipelines in https://github.com/kubevirt/ssp-operator/pull/550, this issue should be fixed.