Bug 2228036

Summary: Virt-Launcher Pod Node Drain stuck when HCO evictionStrategy is set "None" and VM is not restarted
Product: Container Native Virtualization (CNV)
Reporter: Akriti Gupta <akrgupta>
Component: Virtualization
Assignee: Antonio Cardace <acardace>
Status: POST ---
QA Contact: Kedar Bidarkar <kbidarka>
Severity: high
Docs Contact:
Priority: high
Version: 4.14.0
Target Milestone: ---
Target Release: 4.14.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Akriti Gupta 2023-08-01 07:54:35 UTC
Description of problem: A VM is initially running with HCO evictionStrategy: LiveMigrate. When HCO is updated to evictionStrategy: None and the node is drained without restarting the VM, the virt-launcher pod is not evicted and the node drain is stuck with the following error:

error when evicting pods/"virt-launcher-vm2-rhel88-ocs-nxl4s" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod default/virt-launcher-vm2-rhel88-ocs-nxl4s

Version-Release number of selected component (if applicable):


How reproducible:
100% on a bare-metal (BM) cluster

Steps to Reproduce:
1. Start with HCO evictionStrategy: LiveMigrate.
2. Create a VM and start it (no evictionStrategy field in the VM spec).
3. Edit HCO to set evictionStrategy: None.
4. Do not restart the VM.
5. Drain the node the VM is running on.
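
The steps above could be scripted roughly as follows. This is only a sketch, not the exact procedure used in this report: the function names are hypothetical, the `oc patch` call stands in for the interactive `oc edit hco`, and it assumes the strategy lives at `spec.evictionStrategy` of the HCO resource. Nothing runs until `reproduce` is called.

```shell
#!/usr/bin/env bash
# Sketch of the reproduction steps. Assumptions (not confirmed by the report):
# the HCO field path spec.evictionStrategy, and `oc patch` replacing `oc edit`.

HCO_NAME="kubevirt-hyperconverged"
HCO_NS="openshift-cnv"

# Build the JSON merge patch for a given eviction strategy (pure string helper).
eviction_patch() {
  printf '{"spec":{"evictionStrategy":"%s"}}' "$1"
}

# Apply the given eviction strategy to the HCO resource.
set_hco_eviction() {
  oc patch hco "$HCO_NAME" -n "$HCO_NS" --type=merge -p "$(eviction_patch "$1")"
}

# Usage: reproduce <vm-name> <vm-manifest.yaml>
reproduce() {
  local vm="$1" manifest="$2" node
  set_hco_eviction LiveMigrate                           # step 1
  oc apply -f "$manifest"                                # step 2
  virtctl start "$vm"
  oc wait vmi "$vm" --for=condition=Ready --timeout=5m
  node="$(oc get vmi "$vm" -o jsonpath='{.status.nodeName}')"
  set_hco_eviction None                                  # step 3 (step 4: no VM restart)
  oc adm drain "$node" --force=true --ignore-daemonsets=true \
    --delete-emptydir-data=true                          # step 5
}
```

With the names from Comment 1, the call would be `reproduce vm2-rhel88-ocs vm_rhel_ocs.yaml`.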


Actual results: Node drain is stuck while evicting the virt-launcher pod:
error when evicting pods/"virt-launcher-vm2-rhel88-ocs-nxl4s" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod default/virt-launcher-vm2-rhel88-ocs-nxl4s
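
When automating the drain, the stuck state can be recognized from the error text quoted above. The helper below is a hypothetical sketch (it is not part of `oc` or `virtctl`) that matches that message in captured drain output. The commented follow-up commands are suggestions for inspecting the disruption budget and the running VMI, and they assume the VM name and namespace from this report.

```shell
# Hypothetical helper (not from the report): reads drain output on stdin and
# succeeds only if it contains the PodDisruptionBudget eviction error.
drain_blocked_by_pdb() {
  grep -q "Cannot evict pod as it would violate the pod's disruption budget"
}

# Possible follow-up on a live cluster once the helper reports a match:
#   oc get pdb -n default
#   oc get vmi vm2-rhel88-ocs -n default -o jsonpath='{.spec.evictionStrategy}'
```

For example, `oc adm drain <node> ... 2>&1 | tee drain.log` can be left running, and `drain_blocked_by_pdb < drain.log` checked from a second shell.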


Expected results: Node drain succeeds and the VM is restarted on another node.


Additional info:

Comment 1 Akriti Gupta 2023-08-01 07:58:03 UTC
[akriti@fedora ~]$ oc get hco kubevirt-hyperconverged -n openshift-cnv -o yaml | grep eviction
  evictionStrategy: LiveMigrate
[akriti@fedora ~]$ oc apply -f vm_rhel_ocs.yaml 
Warning: kubevirt.io/v1alpha3 is now deprecated and will be removed in a future release.
virtualmachine.kubevirt.io/vm2-rhel88-ocs created
[akriti@fedora ~]$ oc get vm
NAME             AGE   STATUS    READY
vm2-rhel88-ocs   35s   Stopped   False
[akriti@fedora ~]$ virtctl start vm2-rhel88-ocs
VM vm2-rhel88-ocs was scheduled to start
[akriti@fedora ~]$ oc get vm vm2-rhel88-ocs -o yaml | grep eviction
[akriti@fedora ~]$ oc get vmi
NAME             AGE   PHASE     IP             NODENAME                                         READY
vm2-rhel88-ocs   38s   Running   10.128.0.173   cnv-qe-infra-03.cnvqe3.lab.eng.rdu2.redhat.com   True
[akriti@fedora ~]$ virtctl console vm2-rhel88-ocs
Successfully connected to vm2-rhel88-ocs console. The escape sequence is ^]

Red Hat Enterprise Linux 8.8 (Ootpa)
Kernel 4.18.0-477.17.1.el8_8.x86_64 on an x86_64

Activate the web console with: systemctl enable --now cockpit.socket

vm2-rhel88-ocs login: cloud-user
Password: 
[cloud-user@vm2-rhel88-ocs ~]$ [akriti@fedora ~]$
[akriti@fedora ~]$ oc edit hco kubevirt-hyperconverged -n openshift-cnv -o yaml
[akriti@fedora ~]$ oc get hco kubevirt-hyperconverged -n openshift-cnv -o yaml | grep eviction
  evictionStrategy: None
[akriti@fedora ~]$ oc adm drain cnv-qe-infra-03.cnvqe3.lab.eng.rdu2.redhat.com --force=true --ignore-daemonsets=true --delete-emptydir-data=true
node/cnv-qe-infra-03.cnvqe3.lab.eng.rdu2.redhat.com cordoned
.
evicting pod default/virt-launcher-vm2-rhel88-ocs-nxl4s
error when evicting pods/"virt-launcher-vm2-rhel88-ocs-nxl4s" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod default/virt-launcher-vm2-rhel88-ocs-nxl4s
error when evicting pods/"virt-launcher-vm2-rhel88-ocs-n
---------------------
*** Node drain stays stuck here until the VM is stopped.


*** If the VM is restarted after updating HCO, the VM restarts and the node is drained:

[akriti@fedora ~]$ virtctl stop vm2-rhel88-ocs
VM vm2-rhel88-ocs was scheduled to stop
[akriti@fedora ~]$ virtctl start vm2-rhel88-ocs
VM vm2-rhel88-ocs was scheduled to start
[akriti@fedora ~]$ oc get vm
NAME             AGE   STATUS    READY
vm2-rhel88-ocs   19m   Running   True
[akriti@fedora ~]$ oc get vmi
NAME             AGE   PHASE     IP             NODENAME                                         READY
vm2-rhel88-ocs   22s   Running   10.129.0.136   cnv-qe-infra-02.cnvqe3.lab.eng.rdu2.redhat.com   True
[akriti@fedora ~]$ virtctl console vm2-rhel88-ocs
Successfully connected to vm2-rhel88-ocs console. The escape sequence is ^]
Red Hat Enterprise Linux 8.8 (Ootpa)
Kernel 4.18.0-477.17.1.el8_8.x86_64 on an x86_64
Activate the web console with: systemctl enable --now cockpit.socket

vm2-rhel88-ocs login: cloud-user
Password: 
Last login: Mon Jul 31 06:43:47 on ttyS0
[cloud-user@vm2-rhel88-ocs ~]$ [akriti@fedora ~]$ 
[akriti@fedora ~]$ oc adm drain cnv-qe-infra-02.cnvqe3.lab.eng.rdu2.redhat.com --force=true --ignore-daemonsets=true --delete-emptydir-data=true
node/cnv-qe-infra-02.cnvqe3.lab.eng.rdu2.redhat.com cordoned
.
.
node/cnv-qe-infra-02.cnvqe3.lab.eng.rdu2.redhat.com drained
[akriti@fedora ~]$ oc get vmi
NAME             AGE   PHASE     IP             NODENAME                                         READY
vm2-rhel88-ocs   84s   Running   10.128.0.203   cnv-qe-infra-03.cnvqe3.lab.eng.rdu2.redhat.com   True
[akriti@fedora ~]$ virtctl console vm2-rhel88-ocs
Successfully connected to vm2-rhel88-ocs console. The escape sequence is ^]

Red Hat Enterprise Linux 8.8 (Ootpa)
Kernel 4.18.0-477.17.1.el8_8.x86_64 on an x86_64

Activate the web console with: systemctl enable --now cockpit.socket

vm2-rhel88-ocs login: cloud-user
Password: 
Last login: Mon Jul 31 07:01:36 on ttyS0
[cloud-user@vm2-rhel88-ocs ~]$

Comment 2 Kedar Bidarkar 2023-08-02 12:23:34 UTC
*** Bug 2228027 has been marked as a duplicate of this bug. ***