Bug 2238786

Summary: After draining a node where MTQ system pods are running, the namespace becomes locked but the ResourceQuota is not updated
Product: Container Native Virtualization (CNV) Reporter: Denys Shchedrivyi <dshchedr>
Component: Virtualization Assignee: Barak <bmordeha>
Status: CLOSED ERRATA QA Contact: Kedar Bidarkar <kbidarka>
Severity: high Docs Contact:
Priority: high    
Version: 4.14.0 CC: acardace, bmordeha, sgott
Target Milestone: ---   
Target Release: 4.14.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: v4.14.0.rhel9-2163 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-12-07 15:00:42 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Denys Shchedrivyi 2023-09-13 15:39:57 UTC
Description of problem:
 When I drained the node where the mtq-lock and mtq-controller pods were running, the namespace became locked, but the ResourceQuota was not increased. As a result, the migration was stuck in the Pending state:

> $ oc get vmim
> NAME                        PHASE       VMI            
> kubevirt-evacuation-s5sft   Pending     vm-fedora-1 

> $ oc get resourcequota
> NAME        AGE   REQUEST                     LIMIT
> quota-cpu   16h   requests.cpu: 1001m/1100m   limits.cpu: 1010m/1100m


 The ValidatingWebhookConfiguration is present (the namespace is locked):
> $ oc get ValidatingWebhookConfiguration | grep lock
> lock.default.com                                     1          4m58s

 and I can't delete the vmmrq:
> $ oc delete vmmrq my-vmmrq
> Error from server: admission webhook "lock.default.com" denied the request: Migration process is currently being handled by the Managed Quota controller, and as a result, modifications,creation or deletion of virtualMachineMigrationResourceQuotas are not permitted in this namespace, please try again.


 After manually removing the lock and recreating the vmmrq, the migration started.
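The manual workaround described above can be sketched as follows (the webhook name `lock.default.com` and vmmrq name `my-vmmrq` come from the outputs in this report; the manifest file `my-vmmrq.yaml` is a hypothetical placeholder for a saved copy of your VirtualMachineMigrationResourceQuota):

```
# Remove the stale lock webhook so the namespace is writable again
oc delete validatingwebhookconfiguration lock.default.com

# Recreate the vmmrq (delete now succeeds because the webhook is gone)
oc delete vmmrq my-vmmrq
oc apply -f my-vmmrq.yaml
```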

Version-Release number of selected component (if applicable):
4.14

How reproducible:


Steps to Reproduce:
1. Get the mtq-controller and mtq-lock pods running on the same node
2. Create and run a VM on that node
3. Drain the node, for example:
 oc adm drain --delete-local-data --ignore-daemonsets=true --force virt-den-415-8gq52-worker-0-9wz8h
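One way to confirm the pods are co-located before the drain (the node name is taken from the command above; the MTQ namespace, assumed here to be openshift-cnv, may differ in your deployment):

```
NODE=virt-den-415-8gq52-worker-0-9wz8h

# Confirm the mtq-lock and mtq-controller pods run on the target node
oc get pods -n openshift-cnv -o wide --field-selector spec.nodeName="$NODE" | grep mtq

# Confirm the VM's virt-launcher pod is on the same node
oc get pods -o wide --field-selector spec.nodeName="$NODE" | grep virt-launcher

# Drain the node (same command as in the reproduction steps)
oc adm drain "$NODE" --delete-local-data --ignore-daemonsets=true --force
```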


Actual results:
 The migration does not work

Expected results:
 The VM is successfully migrated during the node drain

Additional info:

Comment 1 Denys Shchedrivyi 2023-10-03 20:24:07 UTC
Verified on v4.14.0.rhel9-2166

 Unfortunately, I see that in some cases during the node drain the resources increased and the VM migrated, but the namespace stays locked even after the migration completed:

VM Migrated:
> $ oc get vmim
> NAME                        PHASE       VMI
> kubevirt-evacuation-6gwz2   Succeeded   vm-fedora-cpu-mem-1

The namespace is still locked:
> $  oc get ValidatingWebhookConfiguration | grep lock
> lock.default.com                                     1          12m

And the resource quota was increased:
> $ oc get resourcequota
> NAME                 AGE   REQUEST                                                      LIMIT
> quota-cpu-memory-1   73m   requests.cpu: 1010m/2100m, requests.memory: 1369594368/3Gi   limits.cpu: 1010m/2100m, limits.memory: 1369594368/3Gi
 

 Not 100% sure how to reproduce it; I was able to get it when draining a node where the mtq-controller, mtq-operator, mtq-lock, and virt-launcher pods were running.

Comment 3 Denys Shchedrivyi 2023-11-28 00:05:40 UTC
Verified on CNV-v4.14.1.rhel9-62
Drained nodes multiple times; MTQ is working well.

Comment 10 errata-xmlrpc 2023-12-07 15:00:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.14.1 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:7704

Comment 11 Red Hat Bugzilla 2024-04-06 04:25:09 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days