Bug 2174974 - Tekton: VM is not getting evicted/migrated to a new node due to PVCs accessmode
Summary: Tekton: VM is not getting evicted/migrated to a new node due to PVCs accessmode
Keywords:
Status: ON_QA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Infrastructure
Version: 4.13.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.14.0
Assignee: Karel Šimon
QA Contact: Geetika Kapoor
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-03-02 19:13 UTC by Geetika Kapoor
Modified: 2023-07-03 07:25 UTC
CC List: 2 users

Fixed In Version: kubevirt-ssp-operator-rhel9-container-v4.14.0-77
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github kubevirt ssp-operator pull 550 0 None Merged feat: rework example pipelines 2023-07-03 07:25:55 UTC
Github kubevirt tekton-tasks-operator pull 139 0 None open [WIP] fix: allow VMs migration during pipelineRun 2023-03-21 13:33:33 UTC
Red Hat Issue Tracker CNV-26389 0 None None None 2023-03-02 19:15:28 UTC

Description Geetika Kapoor 2023-03-02 19:13:59 UTC
Description of problem:

If a node is NotReady and pipelines are creating VMs, the VMs do not move to a new node and we see the message: EvictionStrategy is set but vmi is not migratable; cannot migrate VMI: PVC windows-5n76q2-installcdrom is not shared, live migration requires that all PVCs must be shared (using ReadWriteMany access mode).

In the Tekton job, it's RWO: https://github.com/kubevirt/tekton-tasks-operator/blob/main/data/tekton-pipelines/okd/windows-efi-installer-pipeline.yaml#L263
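
For context, a minimal sketch of the DataVolume shape that would satisfy the migration requirement (field names follow the CDI DataVolume API; the resource name and source below are placeholders, not taken from the pipeline):

  # Illustrative DataVolume fragment: live migration requires every disk
  # backing the VMI to be shareable, i.e. its PVC must use ReadWriteMany.
  apiVersion: cdi.kubevirt.io/v1beta1
  kind: DataVolume
  metadata:
    name: windows-installcdrom        # hypothetical name
  spec:
    pvc:
      accessModes:
        - ReadWriteMany               # the pipeline currently requests ReadWriteOnce here
      resources:
        requests:
          storage: 9Gi
    source:
      http:
        url: "http://example.com/windows.iso"   # placeholder source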

$ oc get nodes
NAME                           STATUS     ROLES                         AGE    VERSION
c01-gk413ccl3-gww9w-master-0   Ready      control-plane,master,worker   5h9m   v1.26.0+9eb81c2
c01-gk413ccl3-gww9w-master-1   NotReady   control-plane,master,worker   5h9m   v1.26.0+9eb81c2
c01-gk413ccl3-gww9w-master-2   Ready      control-plane,master,worker   5h9m   v1.26.0+9eb81c2
[cloud-user@ocp-psi-executor ~]$ oc get vmi -A
NAMESPACE       NAME             AGE   PHASE     IP             NODENAME                       READY
openshift-cnv   windows-5g7u8n   22m   Running   10.129.0.182   c01-gk413ccl3-gww9w-master-1   False

Version-Release number of selected component (if applicable):
4.13.0

How reproducible:
always

Steps to Reproduce:
1. Use a compact cluster to reproduce.
2. Run the automation job.

Actual results:

VMs are not migrated; the pipeline also ends with an error because the VMI task never completes.

Expected results:
The VM should get migrated to a different node.

Additional info:

Comment 2 Dominik Holler 2023-03-15 11:48:14 UTC
Would it work if we use the storage API instead of a PVC, e.g. like this?

  storage:
    resources:
      requests:
        storage: 9Gi
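
For illustration, the same idea in a complete DataVolume, assuming CDI's storage API (spec.storage instead of spec.pvc): with this form accessModes can be left out and CDI derives them from the StorageProfile of the storage class, which on migration-capable storage can resolve to ReadWriteMany. The name and source are placeholders:

  apiVersion: cdi.kubevirt.io/v1beta1
  kind: DataVolume
  metadata:
    name: windows-installcdrom         # hypothetical name
  spec:
    storage:                           # storage API: access mode is taken from the StorageProfile
      resources:
        requests:
          storage: 9Gi
    source:
      http:
        url: "http://example.com/windows.iso"   # placeholder source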

Comment 3 Karel Šimon 2023-03-21 08:48:13 UTC
When a VM is running and its node goes down, the VM ends up in an error state and the wait-for-vmi-status task fails the whole pipeline; even if the VM were migrated to a different node, the pipeline would not continue. So changing the access mode will not help in this case. If we want the pipeline not to fail when the VM is in an error state, we would have to change the behaviour of wait-for-vmi-status so it does not fail when an error occurs. That opens a potential issue: the VM could be stuck in an error state and unable to recover, while the pipeline keeps running instead of failing too.

Comment 4 Dominik Holler 2023-03-21 11:02:22 UTC
The current behavior of failing the whole pipeline on the first internal error should be kept, but nevertheless live migration of the VM should be enabled, and we will retry the scenario of running the VM on a node that changes to the "not-ready" state.
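
For reference, the error in the description shows that EvictionStrategy is already set on the VMI, so the missing piece is migratable (ReadWriteMany-backed) storage. A rough sketch of the two VM-side pieces that have to hold together, with hypothetical names and only the relevant fields shown:

  apiVersion: kubevirt.io/v1
  kind: VirtualMachine
  metadata:
    name: windows-example              # hypothetical name
  spec:
    running: true
    template:
      spec:
        evictionStrategy: LiveMigrate  # evict the VMI via live migration instead of shutting it down
        domain:
          memory:
            guest: 8Gi
          devices:
            disks:
              - name: installcdrom
                cdrom:
                  bus: sata
        volumes:
          - name: installcdrom
            dataVolume:
              name: windows-installcdrom   # must resolve to a ReadWriteMany PVC, or the VMI is reported as not migratable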

Comment 5 Karel Šimon 2023-07-03 07:25:55 UTC
With the rework of the example pipelines in https://github.com/kubevirt/ssp-operator/pull/550, this issue should be fixed.

