Bug 2238786 - After draining node where mtq system pods running the namespace becomes locked but ResourceQuota not updated
Summary: After draining node where mtq system pods running the namespace becomes locked but ResourceQuota not updated
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 4.14.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.14.1
Assignee: Barak
QA Contact: Kedar Bidarkar
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-09-13 15:39 UTC by Denys Shchedrivyi
Modified: 2024-04-06 04:25 UTC
CC List: 3 users

Fixed In Version: v4.14.0.rhel9-2163
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-12-07 15:00:42 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github kubevirt managed-tenant-quota pull 18 0 None Merged [release-v1.1]Update number of replicas and remove redundant log 2023-10-03 08:51:08 UTC
Github kubevirt managed-tenant-quota pull 33 0 None Merged [release-v1.1] Consider Max parallel migration 2023-10-19 07:36:15 UTC
Red Hat Issue Tracker CNV-32937 0 None None None 2023-09-13 15:40:14 UTC
Red Hat Product Errata RHSA-2023:7704 0 None None None 2023-12-07 15:00:44 UTC

Description Denys Shchedrivyi 2023-09-13 15:39:57 UTC
Description of problem:
 When I drained the node where the mtq-lock and mtq-controller pods were running, the namespace became locked, but the ResourceQuota was not increased. As a result, the migration is stuck in the Pending state:

> $ oc get vmim
> NAME                        PHASE       VMI            
> kubevirt-evacuation-s5sft   Pending     vm-fedora-1 

> $ oc get resourcequota
> NAME        AGE   REQUEST                     LIMIT
> quota-cpu   16h   requests.cpu: 1001m/1100m   limits.cpu: 1010m/1100m
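
 For context, MTQ is expected to temporarily raise the namespace quota during the migration by the amounts declared in the VirtualMachineMigrationResourceQuota (vmmrq); here the quota stayed at its original hard limits. A minimal sketch of the two objects involved (names, quantities, and the vmmrq API group/version are assumptions for illustration, not taken from this report):

# Hedged sketch: a hard ResourceQuota that blocks the migration target pod,
# plus a vmmrq whose additionalMigrationResources the mtq-controller is
# expected to add to the quota while the migration runs.
# (vmmrq API group/version assumed to be mtq.kubevirt.io/v1alpha1 here)
oc apply -f - <<'EOF'
apiVersion: v1
kind: ResourceQuota
metadata:
  name: quota-cpu
spec:
  hard:
    requests.cpu: 1100m
    limits.cpu: 1100m
---
apiVersion: mtq.kubevirt.io/v1alpha1
kind: VirtualMachineMigrationResourceQuota
metadata:
  name: my-vmmrq
spec:
  additionalMigrationResources:
    requests.cpu: 500m
    limits.cpu: 500m
EOF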


 The ValidatingWebhookConfiguration is present (the namespace is locked):
> $ oc get ValidatingWebhookConfiguration | grep lock
> lock.default.com                                     1          4m58s

 and I can't delete the vmmrq:
> $ oc delete vmmrq my-vmmrq
> Error from server: admission webhook "lock.default.com" denied the request: Migration process is currently being handled by the Managed Quota controller, and as a result, modifications,creation or deletion of virtualMachineMigrationResourceQuotas are not permitted in this namespace, please try again.


 After manually removing the lock and recreating the vmmrq, the migration started.
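
 For reference, that manual workaround amounts to roughly the following (a hedged sketch; lock.default.com and my-vmmrq match the output above, but object names can differ per cluster):

# remove the stale namespace lock left behind by mtq-lock
oc delete validatingwebhookconfiguration lock.default.com

# delete and recreate the vmmrq so the controller re-evaluates the pending
# migration (recreate it from the manifest it was originally created with)
oc delete vmmrq my-vmmrq
oc create -f my-vmmrq.yaml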

Version-Release number of selected component (if applicable):
4.14

How reproducible:


Steps to Reproduce:
1. Get the mtq-controller and mtq-lock pods running on the same node (see the sketch after these steps for locating them)
2. Create and run a VM on that node
3. Drain that node, for example:
 oc adm drain --delete-local-data --ignore-daemonsets=true --force virt-den-415-8gq52-worker-0-9wz8h
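
A sketch of locating the MTQ pods for step 1 (assuming the default openshift-cnv namespace used by CNV; pod and node names are illustrative):

# find which node the mtq-controller and mtq-lock pods are scheduled on,
# then run the test VM there and drain that node as in step 3
oc get pods -n openshift-cnv -o wide | grep -E 'mtq-(controller|lock)'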


Actual results:
 Migration does not work

Expected results:
 VM successfully migrated during the node drain

Additional info:

Comment 1 Denys Shchedrivyi 2023-10-03 20:24:07 UTC
Verified on v4.14.0.rhel9-2166

 Unfortunately, I see that in some cases during a node drain the resources are increased and the VM is migrated, but the namespace stays locked even after the migration has completed:

VM Migrated:
> $ oc get vmim
> NAME                        PHASE       VMI
> kubevirt-evacuation-6gwz2   Succeeded   vm-fedora-cpu-mem-1

The namespace is still locked:
> $  oc get ValidatingWebhookConfiguration | grep lock
> lock.default.com                                     1          12m

And the ResourceQuota is increased:
> $ oc get resourcequota
> NAME                 AGE   REQUEST                                                      LIMIT
> quota-cpu-memory-1   73m   requests.cpu: 1010m/2100m, requests.memory: 1369594368/3Gi   limits.cpu: 1010m/2100m, limits.memory: 1369594368/3Gi
 

 Not 100% sure how to reproduce it; I was able to hit it when draining a node where the mtq-controller, mtq-operator, mtq-lock, and virt-launcher pods were all running.

Comment 3 Denys Shchedrivyi 2023-11-28 00:05:40 UTC
Verified on CNV-v4.14.1.rhel9-62
Drained nodes multiple times; MTQ is working well.

Comment 10 errata-xmlrpc 2023-12-07 15:00:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.14.1 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:7704

Comment 11 Red Hat Bugzilla 2024-04-06 04:25:09 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

