2228036 – Virt-Launcher Pod Node Drain stuck when HCO evictionStrategy is set "None" and VM is not restarted

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Container Native Virtualization (CNV)
Classification:	Red Hat
Component:	Virtualization
Sub Component:
Version:	4.14.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.14.0
Assignee:	Antonio Cardace
QA Contact:	Kedar Bidarkar
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	2228027
Depends On:
Blocks:

Reported:	2023-08-01 07:54 UTC by Akriti Gupta
Modified:	2023-11-08 14:06 UTC
CC List:	0 users
Fixed In Version:	v4.14.0.rhel9-1706
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2023-11-08 14:06:16 UTC
Target Upstream Version:
Embargoed:
Dependent Products:
Flags:	akrgupta: needinfo+

Links
System	ID	Priority	Status	Summary	Last Updated
Github	kubevirt kubevirt pull 10255	None	Merged	virt-controller: always store evictionStrategy in Spec	2023-08-16 11:13:01 UTC
Github	kubevirt kubevirt pull 10277	None	open	[release-1.0] virt-controller: always store evictionStrategy in Spec	2023-08-21 10:25:48 UTC
Red Hat Issue Tracker	CNV-31577	None	None	None	2023-08-01 07:55:54 UTC
Red Hat Product Errata	RHSA-2023:6817	None	None	None	2023-11-08 14:06:32 UTC

Description Akriti Gupta 2023-08-01 07:54:35 UTC

Description of problem: A VM is started while the HCO evictionStrategy is LiveMigrate. When the HCO evictionStrategy is then updated to None and a node drain is performed without restarting the VM, the virt-launcher pod is not evicted and the node drain gets stuck with the following error:

error when evicting pods/"virt-launcher-vm2-rhel88-ocs-nxl4s" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod default/virt-launcher-vm2-rhel88-ocs-nxl4s

Version-Release number of selected component (if applicable):


How reproducible:
100% on a bm cluster

Steps to Reproduce:
1. Start with HCO evictionStrategy: LiveMigrate.
2. Create a VM and start it (VM is running; no evictionStrategy field in the VM spec).
3. Edit HCO and set evictionStrategy: None.
4. Do not restart the VM.
5. Drain the node the VM is running on.


Actual results: Node drain is stuck while draining the virt-launcher pod:
error when evicting pods/"virt-launcher-vm2-rhel88-ocs-nxl4s" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod default/virt-launcher-vm2-rhel88-ocs-nxl4s
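
When a drain hangs like this, it can help to pull the blocked pods out of the drain output. The following is a minimal Python sketch, not part of the original report. It assumes the error line has exactly the wording shown above, and the function name `pdb_blocked_pods` is hypothetical.

```python
import re
from typing import List, Tuple

# Matches the eviction error line as it appears in this report's drain output.
# Assumption: the wording of the message is the same as quoted above.
PDB_ERROR = re.compile(
    r'error when evicting pods/"(?P<pod>[^"]+)" -n "(?P<namespace>[^"]+)".*'
    r"violate the pod's disruption budget"
)


def pdb_blocked_pods(drain_output: str) -> List[Tuple[str, str]]:
    """Return unique (namespace, pod) pairs whose eviction was refused by a PDB."""
    blocked: List[Tuple[str, str]] = []
    for line in drain_output.splitlines():
        match = PDB_ERROR.search(line)
        if match:
            key = (match.group("namespace"), match.group("pod"))
            if key not in blocked:  # the drain retries, so the same pod repeats
                blocked.append(key)
    return blocked


if __name__ == "__main__":
    sample = "\n".join([
        "evicting pod default/virt-launcher-vm2-rhel88-ocs-nxl4s",
        'error when evicting pods/"virt-launcher-vm2-rhel88-ocs-nxl4s" -n "default" '
        "(will retry after 5s): Cannot evict pod as it would violate the pod's "
        "disruption budget.",
    ])
    print(pdb_blocked_pods(sample))
    # prints [('default', 'virt-launcher-vm2-rhel88-ocs-nxl4s')]
```

Because the drain retries every 5 seconds, the same pod appears many times in the output; the sketch reports each pod once.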


Expected results: Node drain succeeds and the VM is restarted on another node.


Additional info:

Comment 1 Akriti Gupta 2023-08-01 07:58:03 UTC

[akriti@fedora ~]$ oc get hco kubevirt-hyperconverged -n openshift-cnv -o yaml | grep eviction
  evictionStrategy: LiveMigrate
[akriti@fedora ~]$ oc apply -f vm_rhel_ocs.yaml 
Warning: kubevirt.io/v1alpha3 is now deprecated and will be removed in a future release.
virtualmachine.kubevirt.io/vm2-rhel88-ocs created
[akriti@fedora ~]$ oc get vm
NAME             AGE   STATUS    READY
vm2-rhel88-ocs   35s   Stopped   False
[akriti@fedora ~]$ virtctl start vm2-rhel88-ocs
VM vm2-rhel88-ocs was scheduled to start
[akriti@fedora ~]$ oc get vm vm2-rhel88-ocs -o yaml | grep eviction
[akriti@fedora ~]$ oc get vmi
NAME             AGE   PHASE     IP             NODENAME                                         READY
vm2-rhel88-ocs   38s   Running   10.128.0.173   cnv-qe-infra-03.cnvqe3.lab.eng.rdu2.redhat.com   True
[akriti@fedora ~]$ virtctl console vm2-rhel88-ocs
Successfully connected to vm2-rhel88-ocs console. The escape sequence is ^]

Red Hat Enterprise Linux 8.8 (Ootpa)
Kernel 4.18.0-477.17.1.el8_8.x86_64 on an x86_64

Activate the web console with: systemctl enable --now cockpit.socket

vm2-rhel88-ocs login: cloud-user
Password: 
[cloud-user@vm2-rhel88-ocs ~]$ [akriti@fedora ~]$
[akriti@fedora ~]$ oc edit hco kubevirt-hyperconverged -n openshift-cnv -o yaml
[akriti@fedora ~]$ oc get hco kubevirt-hyperconverged -n openshift-cnv -o yaml | grep eviction
  evictionStrategy: None
[akriti@fedora ~]$ oc adm drain cnv-qe-infra-03.cnvqe3.lab.eng.rdu2.redhat.com --force=true --ignore-daemonsets=true --delete-emptydir-data=true
node/cnv-qe-infra-03.cnvqe3.lab.eng.rdu2.redhat.com cordoned
.
evicting pod default/virt-launcher-vm2-rhel88-ocs-nxl4s
error when evicting pods/"virt-launcher-vm2-rhel88-ocs-nxl4s" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod default/virt-launcher-vm2-rhel88-ocs-nxl4s
error when evicting pods/"virt-launcher-vm2-rhel88-ocs-n
--------------------
*** Node drain stays stuck here until the VM is stopped.

*** If the VM is restarted after updating HCO, the VM restarts and the node is drained:

[akriti@fedora ~]$ virtctl stop vm2-rhel88-ocs
VM vm2-rhel88-ocs was scheduled to stop
[akriti@fedora ~]$ virtctl start vm2-rhel88-ocs
VM vm2-rhel88-ocs was scheduled to start
[akriti@fedora ~]$ oc get vm
NAME             AGE   STATUS    READY
vm2-rhel88-ocs   19m   Running   True
[akriti@fedora ~]$ oc get vmi
NAME             AGE   PHASE     IP             NODENAME                                         READY
vm2-rhel88-ocs   22s   Running   10.129.0.136   cnv-qe-infra-02.cnvqe3.lab.eng.rdu2.redhat.com   True
[akriti@fedora ~]$ virtctl console vm2-rhel88-ocs
Successfully connected to vm2-rhel88-ocs console. The escape sequence is ^]
Red Hat Enterprise Linux 8.8 (Ootpa)
Kernel 4.18.0-477.17.1.el8_8.x86_64 on an x86_64
Activate the web console with: systemctl enable --now cockpit.socket

vm2-rhel88-ocs login: cloud-user
Password: 
Last login: Mon Jul 31 06:43:47 on ttyS0
[cloud-user@vm2-rhel88-ocs ~]$ [akriti@fedora ~]$ 
[akriti@fedora ~]$ oc adm drain cnv-qe-infra-02.cnvqe3.lab.eng.rdu2.redhat.com --force=true --ignore-daemonsets=true --delete-emptydir-data=true
node/cnv-qe-infra-02.cnvqe3.lab.eng.rdu2.redhat.com cordoned
.
.
node/cnv-qe-infra-02.cnvqe3.lab.eng.rdu2.redhat.com drained
[akriti@fedora ~]$ oc get vmi
NAME             AGE   PHASE     IP             NODENAME                                         READY
vm2-rhel88-ocs   84s   Running   10.128.0.203   cnv-qe-infra-03.cnvqe3.lab.eng.rdu2.redhat.com   True
[akriti@fedora ~]$ virtctl console vm2-rhel88-ocs
Successfully connected to vm2-rhel88-ocs console. The escape sequence is ^]

Red Hat Enterprise Linux 8.8 (Ootpa)
Kernel 4.18.0-477.17.1.el8_8.x86_64 on an x86_64

Activate the web console with: systemctl enable --now cockpit.socket

vm2-rhel88-ocs login: cloud-user
Password: 
Last login: Mon Jul 31 07:01:36 on ttyS0
[cloud-user@vm2-rhel88-ocs ~]$

Comment 2 Kedar Bidarkar 2023-08-02 12:23:34 UTC

*** Bug 2228027 has been marked as a duplicate of this bug. ***

Comment 3 Antonio Cardace 2023-08-22 11:20:46 UTC

@akrgupta To verify this just make sure that the eviction strategy the VM was started with is always stored in the VMI in the `.spec.evictionStrategy` field.
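
The check described in this comment could be scripted roughly as follows. This is a minimal Python sketch under stated assumptions: the input is the JSON produced by `oc get vmi <name> -o json`, and the helper names `vmi_eviction_strategy` and `strategy_is_pinned` are hypothetical. Note that the strategy value `"None"` is a string, which is different from the field being absent.

```python
import json
from typing import Optional


def vmi_eviction_strategy(vmi_json: str) -> Optional[str]:
    """Return .spec.evictionStrategy from a VMI JSON document.

    Returns the strategy as a string (for example "None" or "LiveMigrate"),
    or Python None when the field is absent from the VMI spec.
    """
    vmi = json.loads(vmi_json)
    spec = vmi.get("spec") or {}
    return spec.get("evictionStrategy")


def strategy_is_pinned(vmi_json: str) -> bool:
    """True when the VMI stores an eviction strategy in its own spec."""
    return vmi_eviction_strategy(vmi_json) is not None


if __name__ == "__main__":
    # Hand-written stand-in for `oc get vmi <name> -o json` output.
    fixed_vmi = '{"spec": {"evictionStrategy": "None"}}'
    print(vmi_eviction_strategy(fixed_vmi), strategy_is_pinned(fixed_vmi))
```

With the fix in place, `strategy_is_pinned` would be expected to return `True` for every running VMI, whether or not the VM spec sets the field.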

Comment 4 Akriti Gupta 2023-08-23 12:16:34 UTC

Verified on v4.14.0.rhel9-1709.
The VMI has the eviction strategy defined under `.spec`, and it matches the value that was set in HCO when the VM was started.

After HCO is updated, the VM follows the new eviction strategy value only on restart.

[akriti@fedora ~]$ oc get hco kubevirt-hyperconverged -n openshift-cnv -o yaml | grep eviction
  evictionStrategy: None
[akriti@fedora ~]$ oc get vmi vm-rhel88-ocs -o json | jq .spec.evictionStrategy
"None"
[akriti@fedora ~]$ oc get vm vm-rhel88-ocs -o yaml | grep eviction
[akriti@fedora ~]$ virtctl console vm-rhel88-ocs
Red Hat Enterprise Linux 8.8 (Ootpa)
Kernel 4.18.0-477.17.1.el8_8.x86_64 on an x86_64

Activate the web console with: systemctl enable --now cockpit.socket

vm-rhel88-ocs login: cloud-user
Password: 
[cloud-user@vm-rhel88-ocs ~]$ [akriti@fedora ~]$ 

[akriti@fedora ~]$ oc get hco kubevirt-hyperconverged -n openshift-cnv -o yaml | grep eviction
  evictionStrategy: LiveMigrate
[akriti@fedora ~]$ oc get vmi
NAME            AGE     PHASE     IP             NODENAME                            READY
vm-rhel88-ocs   6m47s   Running   10.131.0.192   virt-akr-414-jd9ft-worker-0-g6xl6   True
[akriti@fedora ~]$ oc adm drain virt-akr-414-jd9ft-worker-0-g6xl6 --force=true --ignore-daemonsets=true --delete-emptydir-data=true
node/virt-akr-414-jd9ft-worker-0-g6xl6 cordoned
.
.
node/virt-akr-414-jd9ft-worker-0-g6xl6 drained
[akriti@fedora ~]$ oc get vmi
NAME            AGE   PHASE       IP    NODENAME                            READY
vm-rhel88-ocs   42s   Scheduled         virt-akr-414-jd9ft-worker-0-qsgrz   False
[akriti@fedora ~]$ oc get vmi
NAME            AGE   PHASE       IP    NODENAME                            READY
vm-rhel88-ocs   48s   Scheduled         virt-akr-414-jd9ft-worker-0-qsgrz   False
[akriti@fedora ~]$ oc get vmi
NAME            AGE   PHASE     IP            NODENAME                            READY
vm-rhel88-ocs   53s   Running   10.129.2.38   virt-akr-414-jd9ft-worker-0-qsgrz   True
[akriti@fedora ~]$ virtctl restart vm-rhel88-ocs
VM vm-rhel88-ocs was scheduled to restart
[akriti@fedora ~]$ oc get vmi
NAME            AGE   PHASE     IP            NODENAME                            READY
vm-rhel88-ocs   52s   Running   10.129.2.39   virt-akr-414-jd9ft-worker-0-qsgrz   True
[akriti@fedora ~]$ oc get vmi vm-rhel88-ocs -o yaml | grep eviction
  evictionStrategy: LiveMigrate
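
The behaviour verified in this comment can be summarised with a toy model. This is only an illustrative Python sketch, not KubeVirt code: the classes `Cluster`, `VM`, and `VMI` and the function `start` are hypothetical. The rule that a value in the VM spec takes precedence over the cluster-wide value is an assumption, since the report only covers a VM without that field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cluster:
    eviction_strategy: str  # cluster-wide value, set through HCO in this report


@dataclass
class VM:
    # Python None means "no evictionStrategy field in the VM spec".
    eviction_strategy: Optional[str] = None


@dataclass
class VMI:
    eviction_strategy: str  # stands in for .spec.evictionStrategy on the VMI


def start(vm: VM, cluster: Cluster) -> VMI:
    """Resolve the strategy once, at start, and store it on the VMI."""
    if vm.eviction_strategy is not None:  # assumed precedence, see lead-in
        return VMI(vm.eviction_strategy)
    return VMI(cluster.eviction_strategy)


if __name__ == "__main__":
    cluster = Cluster("None")
    vmi = start(VM(), cluster)
    cluster.eviction_strategy = "LiveMigrate"  # HCO is edited, VM keeps running
    print(vmi.eviction_strategy)               # still the value from start time
    vmi = start(VM(), cluster)                 # restart creates a new VMI
    print(vmi.eviction_strategy)               # new cluster-wide value
```

The point of the sketch is that the running VMI carries its own copy of the strategy, so editing HCO alone does not change how an already running VMI is handled during a drain.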

Comment 6 errata-xmlrpc 2023-11-08 14:06:16 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.14.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6817
