Bug 2228036

Summary: Virt-Launcher Pod Node Drain stuck when HCO evictionStrategy is set "None" and VM is not restarted
Product: Container Native Virtualization (CNV)
Reporter: Akriti Gupta <akrgupta>
Component: Virtualization
Assignee: Antonio Cardace <acardace>
Status: CLOSED ERRATA
QA Contact: Kedar Bidarkar <kbidarka>
Severity: high
Docs Contact:
Priority: high
Version: 4.14.0
Flags: akrgupta: needinfo+
Target Milestone: ---
Target Release: 4.14.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: v4.14.0.rhel9-1706
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-11-08 14:06:16 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Akriti Gupta 2023-08-01 07:54:35 UTC
Description of problem: A VM is started while the HCO evictionStrategy is LiveMigrate. If the HCO is then updated to evictionStrategy: None and the node is drained without restarting the VM, the virt-launcher pod is not evicted and the node drain is stuck with the following error:

error when evicting pods/"virt-launcher-vm2-rhel88-ocs-nxl4s" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod default/virt-launcher-vm2-rhel88-ocs-nxl4s

Version-Release number of selected component (if applicable):


How reproducible:
100% on a bm cluster

Steps to Reproduce:
1. Start with HCO evictionStrategy: LiveMigrate.
2. Create a VM and make sure it is running (no evictionStrategy field in the VM spec).
3. Edit the HCO and set evictionStrategy: None.
4. Do not restart the VM.
5. Drain the node the VM is running on.


Actual results: Node drain is stuck while draining virt-launcher pod 
error when evicting pods/"virt-launcher-vm2-rhel88-ocs-nxl4s" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod default/virt-launcher-vm2-rhel88-ocs-nxl4s
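
When a drain hangs like this, it can help to pull the blocked pods out of the drain output. The following is a minimal shell sketch and not part of the original report: `pdb_blocked_pods` is a hypothetical helper name, and it assumes the drain output was saved to a file (or piped in) with error lines in the format shown above.

```shell
# Hypothetical helper: list pods whose eviction was refused by a
# PodDisruptionBudget. Reads saved `oc adm drain` output on stdin and
# prints each blocked pod once as <namespace>/<pod>.
pdb_blocked_pods() {
  grep "would violate the pod's disruption budget" |
    sed -n 's/^error when evicting pods\/"\([^"]*\)" -n "\([^"]*\)".*/\2\/\1/p' |
    sort -u
}
```

For example, with the output saved to a file (here called `drain.log`, an assumed name), `pdb_blocked_pods < drain.log` lists the affected pods, and `oc get pdb -n default` then shows the disruption budgets in that namespace.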


Expected results: Node drain succeeds and the VM is restarted on another node.


Additional info:

Comment 1 Akriti Gupta 2023-08-01 07:58:03 UTC
[akriti@fedora ~]$ oc get hco kubevirt-hyperconverged -n openshift-cnv -o yaml | grep eviction
  evictionStrategy: LiveMigrate
[akriti@fedora ~]$ oc apply -f vm_rhel_ocs.yaml 
Warning: kubevirt.io/v1alpha3 is now deprecated and will be removed in a future release.
virtualmachine.kubevirt.io/vm2-rhel88-ocs created
[akriti@fedora ~]$ oc get vm
NAME             AGE   STATUS    READY
vm2-rhel88-ocs   35s   Stopped   False
[akriti@fedora ~]$ virtctl start vm2-rhel88-ocs
VM vm2-rhel88-ocs was scheduled to start
[akriti@fedora ~]$ oc get vm vm2-rhel88-ocs -o yaml | grep eviction
[akriti@fedora ~]$ oc get vmi
NAME             AGE   PHASE     IP             NODENAME                                         READY
vm2-rhel88-ocs   38s   Running   10.128.0.173   cnv-qe-infra-03.cnvqe3.lab.eng.rdu2.redhat.com   True
[akriti@fedora ~]$ virtctl console vm2-rhel88-ocs
Successfully connected to vm2-rhel88-ocs console. The escape sequence is ^]

Red Hat Enterprise Linux 8.8 (Ootpa)
Kernel 4.18.0-477.17.1.el8_8.x86_64 on an x86_64

Activate the web console with: systemctl enable --now cockpit.socket

vm2-rhel88-ocs login: cloud-user
Password: 
[cloud-user@vm2-rhel88-ocs ~]$ [akriti@fedora ~]$
[akriti@fedora ~]$ oc edit hco kubevirt-hyperconverged -n openshift-cnv -o yaml
[akriti@fedora ~]$ oc get hco kubevirt-hyperconverged -n openshift-cnv -o yaml | grep eviction
  evictionStrategy: None
[akriti@fedora ~]$ oc adm drain cnv-qe-infra-03.cnvqe3.lab.eng.rdu2.redhat.com --force=true --ignore-daemonsets=true --delete-emptydir-data=true
node/cnv-qe-infra-03.cnvqe3.lab.eng.rdu2.redhat.com cordoned
.
evicting pod default/virt-launcher-vm2-rhel88-ocs-nxl4s
error when evicting pods/"virt-launcher-vm2-rhel88-ocs-nxl4s" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod default/virt-launcher-vm2-rhel88-ocs-nxl4s
error when evicting pods/"virt-launcher-vm2-rhel88-ocs-n
--------------------
*** Node drain stays stuck here until the VM is stopped.


*** If we restart the VM after updating the HCO, the VM restarts and the node is drained.

[akriti@fedora ~]$ virtctl stop vm2-rhel88-ocs
VM vm2-rhel88-ocs was scheduled to stop
[akriti@fedora ~]$ virtctl start vm2-rhel88-ocs
VM vm2-rhel88-ocs was scheduled to start
[akriti@fedora ~]$ oc get vm
NAME             AGE   STATUS    READY
vm2-rhel88-ocs   19m   Running   True
[akriti@fedora ~]$ oc get vmi
NAME             AGE   PHASE     IP             NODENAME                                         READY
vm2-rhel88-ocs   22s   Running   10.129.0.136   cnv-qe-infra-02.cnvqe3.lab.eng.rdu2.redhat.com   True
[akriti@fedora ~]$ virtctl console vm2-rhel88-ocs
Successfully connected to vm2-rhel88-ocs console. The escape sequence is ^]
Red Hat Enterprise Linux 8.8 (Ootpa)
Kernel 4.18.0-477.17.1.el8_8.x86_64 on an x86_64
Activate the web console with: systemctl enable --now cockpit.socket

vm2-rhel88-ocs login: cloud-user
Password: 
Last login: Mon Jul 31 06:43:47 on ttyS0
[cloud-user@vm2-rhel88-ocs ~]$ [akriti@fedora ~]$ 
[akriti@fedora ~]$ oc adm drain cnv-qe-infra-02.cnvqe3.lab.eng.rdu2.redhat.com --force=true --ignore-daemonsets=true --delete-emptydir-data=true
node/cnv-qe-infra-02.cnvqe3.lab.eng.rdu2.redhat.com cordoned
.
.
node/cnv-qe-infra-02.cnvqe3.lab.eng.rdu2.redhat.com drained
[akriti@fedora ~]$ oc get vmi
NAME             AGE   PHASE     IP             NODENAME                                         READY
vm2-rhel88-ocs   84s   Running   10.128.0.203   cnv-qe-infra-03.cnvqe3.lab.eng.rdu2.redhat.com   True
[akriti@fedora ~]$ virtctl console vm2-rhel88-ocs
Successfully connected to vm2-rhel88-ocs console. The escape sequence is ^]

Red Hat Enterprise Linux 8.8 (Ootpa)
Kernel 4.18.0-477.17.1.el8_8.x86_64 on an x86_64

Activate the web console with: systemctl enable --now cockpit.socket

vm2-rhel88-ocs login: cloud-user
Password: 
Last login: Mon Jul 31 07:01:36 on ttyS0
[cloud-user@vm2-rhel88-ocs ~]$

Comment 2 Kedar Bidarkar 2023-08-02 12:23:34 UTC
*** Bug 2228027 has been marked as a duplicate of this bug. ***

Comment 3 Antonio Cardace 2023-08-22 11:20:46 UTC
@akrgupta To verify this just make sure that the eviction strategy the VM was started with is always stored in the VMI in the `.spec.evictionStrategy` field.
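
The check described above can be sketched as a small shell helper. This is an illustration only: `vmi_eviction_strategy` and the `OC` override variable are hypothetical names, and the namespace defaults to `default` as in this report.

```shell
# OC lets the client binary be swapped (for example kubectl); a hypothetical
# convenience, not something the report uses.
OC="${OC:-oc}"

# Print the eviction strategy stored on a running VMI.
# Usage: vmi_eviction_strategy <vmi-name> [namespace]
vmi_eviction_strategy() {
  "$OC" get vmi "$1" -n "${2:-default}" -o jsonpath='{.spec.evictionStrategy}'
}
```

With the fix in place, `vmi_eviction_strategy vm2-rhel88-ocs` should print the strategy that was in effect when the VM was started rather than an empty string.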

Comment 4 Akriti Gupta 2023-08-23 12:16:34 UTC
Verified on v4.14.0.rhel9-1709.
The VMI has the eviction strategy defined under .spec.evictionStrategy, and it matches the value that was set in the HCO when the VM was started.

After the HCO is updated, the VM follows the new eviction strategy value only after a restart.

[akriti@fedora ~]$ oc get hco kubevirt-hyperconverged -n openshift-cnv -o yaml | grep eviction
  evictionStrategy: None
[akriti@fedora ~]$ oc get vmi vm-rhel88-ocs -o json | jq .spec.evictionStrategy
"None"
[akriti@fedora ~]$ oc get vm vm-rhel88-ocs -o yaml | grep eviction
[akriti@fedora ~]$ virtctl console vm-rhel88-ocs
Red Hat Enterprise Linux 8.8 (Ootpa)
Kernel 4.18.0-477.17.1.el8_8.x86_64 on an x86_64

Activate the web console with: systemctl enable --now cockpit.socket

vm-rhel88-ocs login: cloud-user
Password: 
[cloud-user@vm-rhel88-ocs ~]$ [akriti@fedora ~]$ 

[akriti@fedora ~]$ oc get hco kubevirt-hyperconverged -n openshift-cnv -o yaml | grep eviction
  evictionStrategy: LiveMigrate
[akriti@fedora ~]$ oc get vmi
NAME            AGE     PHASE     IP             NODENAME                            READY
vm-rhel88-ocs   6m47s   Running   10.131.0.192   virt-akr-414-jd9ft-worker-0-g6xl6   True
[akriti@fedora ~]$ oc adm drain virt-akr-414-jd9ft-worker-0-g6xl6 --force=true --ignore-daemonsets=true --delete-emptydir-data=true
node/virt-akr-414-jd9ft-worker-0-g6xl6 cordoned
.
.
node/virt-akr-414-jd9ft-worker-0-g6xl6 drained
[akriti@fedora ~]$ oc get vmi
NAME            AGE   PHASE       IP    NODENAME                            READY
vm-rhel88-ocs   42s   Scheduled         virt-akr-414-jd9ft-worker-0-qsgrz   False
[akriti@fedora ~]$ oc get vmi
NAME            AGE   PHASE       IP    NODENAME                            READY
vm-rhel88-ocs   48s   Scheduled         virt-akr-414-jd9ft-worker-0-qsgrz   False
[akriti@fedora ~]$ oc get vmi
NAME            AGE   PHASE     IP            NODENAME                            READY
vm-rhel88-ocs   53s   Running   10.129.2.38   virt-akr-414-jd9ft-worker-0-qsgrz   True
[akriti@fedora ~]$ virtctl restart vm-rhel88-ocs
VM vm-rhel88-ocs was scheduled to restart
[akriti@fedora ~]$ oc get vmi
NAME            AGE   PHASE     IP            NODENAME                            READY
vm-rhel88-ocs   52s   Running   10.129.2.39   virt-akr-414-jd9ft-worker-0-qsgrz   True
[akriti@fedora ~]$ oc get vmi vm-rhel88-ocs -o yaml | grep eviction
  evictionStrategy: LiveMigrate
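
The behaviour verified above (a running VMI keeps the strategy it was started with and only picks up the HCO value on restart) suggests a quick way to spot VMs that still need a restart. The sketch below is an illustration under stated assumptions: `strategy_pending_restart` and `OC` are hypothetical names, the HCO value is assumed to live at `.spec.evictionStrategy` of `kubevirt-hyperconverged`, and the comparison is only meaningful when the VM spec does not set its own evictionStrategy.

```shell
# OC lets the client binary be swapped; a hypothetical convenience.
OC="${OC:-oc}"

# Succeeds (exit 0) when the running VMI's stored strategy differs from the
# current HCO value, i.e. the VM would follow the HCO only after a restart.
# Returns 1 when they match and 2 when a lookup fails.
# Usage: strategy_pending_restart <vmi-name> [namespace]
strategy_pending_restart() {
  vmi_strategy=$("$OC" get vmi "$1" -n "${2:-default}" \
    -o jsonpath='{.spec.evictionStrategy}') || return 2
  hco_strategy=$("$OC" get hco kubevirt-hyperconverged -n openshift-cnv \
    -o jsonpath='{.spec.evictionStrategy}') || return 2
  echo "VMI: ${vmi_strategy:-<unset>}  HCO: ${hco_strategy:-<unset>}"
  [ "$vmi_strategy" != "$hco_strategy" ]
}
```

A typical use would be `strategy_pending_restart vm-rhel88-ocs && virtctl restart vm-rhel88-ocs`, which restarts the VM only when its stored strategy is out of date.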

Comment 6 errata-xmlrpc 2023-11-08 14:06:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.14.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6817