Bug 2174974

Summary: Tekton: VM is not getting evicted/migrated to a new node due to PVCs accessmode
Product: Container Native Virtualization (CNV) Reporter: Geetika Kapoor <gkapoor>
Component: Infrastructure    Assignee: Karel Šimon <ksimon>
Status: ON_QA --- QA Contact: Geetika Kapoor <gkapoor>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.13.0    CC: ksimon, rsdeor
Target Milestone: ---   
Target Release: 4.14.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: kubevirt-ssp-operator-rhel9-container-v4.14.0-77 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Geetika Kapoor 2023-03-02 19:13:59 UTC
Description of problem:

If a node goes NotReady while the pipelines are creating VMs, the VMs do not move to a new node and we see the message: EvictionStrategy is set but vmi is not migratable; cannot migrate VMI: PVC windows-5n76q2-installcdrom is not shared, live migration requires that all PVCs must be shared (using ReadWriteMany access mode).

In the Tekton job the access mode is RWO: https://github.com/kubevirt/tekton-tasks-operator/blob/main/data/tekton-pipelines/okd/windows-efi-installer-pipeline.yaml#L263
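
For reference, live migration requires every PVC attached to the VMI to be shareable. A minimal sketch of a DataVolume spec that would satisfy this, assuming the storage class supports RWX (name, size and source below are illustrative, not taken from the pipeline):

  apiVersion: cdi.kubevirt.io/v1beta1
  kind: DataVolume
  metadata:
    name: installcdrom                        # illustrative name
  spec:
    pvc:
      accessModes:
        - ReadWriteMany                       # shared access mode required for live migration
      resources:
        requests:
          storage: 9Gi
    source:
      http:
        url: https://example.com/install.iso  # illustrative source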

$ oc get nodes
NAME                           STATUS     ROLES                         AGE    VERSION
c01-gk413ccl3-gww9w-master-0   Ready      control-plane,master,worker   5h9m   v1.26.0+9eb81c2
c01-gk413ccl3-gww9w-master-1   NotReady   control-plane,master,worker   5h9m   v1.26.0+9eb81c2
c01-gk413ccl3-gww9w-master-2   Ready      control-plane,master,worker   5h9m   v1.26.0+9eb81c2
[cloud-user@ocp-psi-executor ~]$ oc get vmi -A
NAMESPACE       NAME             AGE   PHASE     IP             NODENAME                       READY
openshift-cnv   windows-5g7u8n   22m   Running   10.129.0.182   c01-gk413ccl3-gww9w-master-1   False

Version-Release number of selected component (if applicable):
4.13.0

How reproducible:
always

Steps to Reproduce:
1. Use a compact cluster (the issue was reproduced on a compact cluster).
2. Run the automation job.

Actual results:

VMs are not migrated, and the pipeline ends with an error because the VMI task never completes.

Expected results:
The VM should be migrated to a different node.

Additional info:

Comment 2 Dominik Holler 2023-03-15 11:48:14 UTC
Would it work if we use the storage API instead of a PVC, e.g. like this?

  storage:
    resources:
      requests:
        storage: 9Gi
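
For context, with the storage API CDI fills in the access and volume modes from the cluster's StorageProfile, so on storage that supports RWX the resulting PVC can come out shareable without hardcoding an access mode. A hedged sketch of such a DataVolume spec (name, size and source are illustrative):

  apiVersion: cdi.kubevirt.io/v1beta1
  kind: DataVolume
  metadata:
    name: installcdrom                        # illustrative name
  spec:
    storage:                                  # access/volume modes derived from the StorageProfile
      resources:
        requests:
          storage: 9Gi
    source:
      http:
        url: https://example.com/install.iso  # illustrative source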

Comment 3 Karel Šimon 2023-03-21 08:48:13 UTC
When a VM is running and its node goes down, the VM ends up in an error state and the wait-for-vmi-status task fails the whole pipeline; even if the VM were migrated to a different node, the pipeline would not continue. So changing the access mode alone will not help in this case. If we want the pipeline not to fail when the VM is in an error state, we would have to change the behaviour of wait-for-vmi-status so it does not fail on any error. That opens a potential issue: a VM could be stuck in an unrecoverable error state while the pipeline keeps running, instead of failing as it should.
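
To illustrate the trade-off: the wait-for-vmi-status task watches the VMI and is typically parameterized with success and failure conditions over the VMI status, roughly as in the sketch below (task and parameter names follow the kubevirt-tekton-tasks conventions; the values are illustrative and not taken from the actual pipeline). Relaxing the failure condition would keep the pipeline running even when the VMI is stuck in an unrecoverable error state.

    - name: wait-for-vmi-status
      taskRef:
        kind: Task
        name: wait-for-vmi-status
      params:
        - name: vmiName
          value: windows-5g7u8n                    # illustrative VMI name
        - name: successCondition
          value: status.phase == Succeeded
        - name: failureCondition
          value: status.phase in (Failed, Unknown) # a VMI in an error state matches here and fails the whole pipeline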

Comment 4 Dominik Holler 2023-03-21 11:02:22 UTC
The current behavior of failing the whole pipeline on the first internal error should be kept, but nevertheless live migration of the VM should be enabled, and we will retry the scenario of running the VM on a node that changes to the NotReady state.

Comment 5 Karel Šimon 2023-07-03 07:25:55 UTC
With the rework of the example pipelines in https://github.com/kubevirt/ssp-operator/pull/550, this issue should be fixed.