Bug 2238786 - After draining node where mtq system pods running the namespace becomes locked but ResourceQuota not updated
Summary: After draining node where mtq system pods running the namespace becomes locked but ResourceQuota not updated
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 4.14.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.14.1
Assignee: Barak
QA Contact: Kedar Bidarkar
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-09-13 15:39 UTC by Denys Shchedrivyi
Modified: 2024-04-06 04:25 UTC
CC List: 3 users

Fixed In Version: v4.14.0.rhel9-2163
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-12-07 15:00:42 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github kubevirt managed-tenant-quota pull 18 0 None Merged [release-v1.1]Update number of replicas and remove redundant log 2023-10-03 08:51:08 UTC
Github kubevirt managed-tenant-quota pull 33 0 None Merged [release-v1.1] Consider Max parallel migration 2023-10-19 07:36:15 UTC
Red Hat Issue Tracker CNV-32937 0 None None None 2023-09-13 15:40:14 UTC
Red Hat Product Errata RHSA-2023:7704 0 None None None 2023-12-07 15:00:44 UTC

Description Denys Shchedrivyi 2023-09-13 15:39:57 UTC
Description of problem:
 When I drained the node where the mtq-lock and mtq-controller pods were running, the namespace became locked, but the ResourceQuota was not increased. As a result, the migration is stuck in the Pending state:

> $ oc get vmim
> NAME                        PHASE       VMI            
> kubevirt-evacuation-s5sft   Pending     vm-fedora-1 

> $ oc get resourcequota
> NAME        AGE   REQUEST                     LIMIT
> quota-cpu   16h   requests.cpu: 1001m/1100m   limits.cpu: 1010m/1100m
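
 For context, MTQ is expected to temporarily raise the namespace quota during the migration by the amounts declared in the VirtualMachineMigrationResourceQuota (vmmrq); here the quota stayed at its original hard limits. A minimal sketch of the two objects involved (names, quantities, and the vmmrq API group/version are assumptions for illustration, not taken from this report):

# Hedged sketch: a hard ResourceQuota that blocks the migration target pod,
# plus a vmmrq whose additionalMigrationResources the mtq-controller is
# expected to add to the quota while the migration runs.
# (vmmrq API group/version assumed to be mtq.kubevirt.io/v1alpha1 here)
oc apply -f - <<'EOF'
apiVersion: v1
kind: ResourceQuota
metadata:
  name: quota-cpu
spec:
  hard:
    requests.cpu: 1100m
    limits.cpu: 1100m
---
apiVersion: mtq.kubevirt.io/v1alpha1
kind: VirtualMachineMigrationResourceQuota
metadata:
  name: my-vmmrq
spec:
  additionalMigrationResources:
    requests.cpu: 500m
    limits.cpu: 500m
EOF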


 The ValidatingWebhookConfiguration is present (the namespace is locked):
> $ oc get ValidatingWebhookConfiguration | grep lock
> lock.default.com                                     1          4m58s

 and I can't delete the vmmrq:
> $ oc delete vmmrq my-vmmrq
> Error from server: admission webhook "lock.default.com" denied the request: Migration process is currently being handled by the Managed Quota controller, and as a result, modifications,creation or deletion of virtualMachineMigrationResourceQuotas are not permitted in this namespace, please try again.


 After manually removing the lock and recreating the vmmrq, the migration started.
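
 For reference, that manual workaround amounts to roughly the following (a hedged sketch; lock.default.com and my-vmmrq match the output above, but object names can differ per cluster):

# remove the stale namespace lock left behind by mtq-lock
oc delete validatingwebhookconfiguration lock.default.com

# delete and recreate the vmmrq so the controller re-evaluates the pending
# migration (recreate it from the manifest it was originally created with)
oc delete vmmrq my-vmmrq
oc create -f my-vmmrq.yaml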

Version-Release number of selected component (if applicable):
4.14

How reproducible:


Steps to Reproduce:
1. Get the mtq-controller and mtq-lock pods running on the same node (see the sketch after these steps for locating them)
2. Create and run a VM on that node
3. Drain that node, for example:
 oc adm drain --delete-local-data --ignore-daemonsets=true --force virt-den-415-8gq52-worker-0-9wz8h
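
A sketch of locating the MTQ pods for step 1 (assuming the default openshift-cnv namespace used by CNV; pod and node names are illustrative):

# find which node the mtq-controller and mtq-lock pods are scheduled on,
# then run the test VM there and drain that node as in step 3
oc get pods -n openshift-cnv -o wide | grep -E 'mtq-(controller|lock)'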


Actual results:
 Migration does not work

Expected results:
 VM successfully migrated during the node drain

Additional info:

Comment 1 Denys Shchedrivyi 2023-10-03 20:24:07 UTC
Verified on v4.14.0.rhel9-2166

 Unfortunately, I see that in some cases during a node drain the resources are increased and the VM is migrated, but the namespace stays locked even after the migration has completed:

VM Migrated:
> $ oc get vmim
> NAME                        PHASE       VMI
> kubevirt-evacuation-6gwz2   Succeeded   vm-fedora-cpu-mem-1

The namespace is still locked:
> $  oc get ValidatingWebhookConfiguration | grep lock
> lock.default.com                                     1          12m

And the ResourceQuota is increased:
> $ oc get resourcequota
> NAME                 AGE   REQUEST                                                      LIMIT
> quota-cpu-memory-1   73m   requests.cpu: 1010m/2100m, requests.memory: 1369594368/3Gi   limits.cpu: 1010m/2100m, limits.memory: 1369594368/3Gi
 

 Not 100% sure how to reproduce it; I was able to hit it when draining a node where the mtq-controller, mtq-operator, mtq-lock, and virt-launcher pods were all running.

Comment 3 Denys Shchedrivyi 2023-11-28 00:05:40 UTC
Verified on CNV-v4.14.1.rhel9-62
Drained nodes multiple times; MTQ is working well.

Comment 10 errata-xmlrpc 2023-12-07 15:00:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.14.1 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:7704

Comment 11 Red Hat Bugzilla 2024-04-06 04:25:09 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

