Description of problem:

When I drained the node where the mtq-lock and mtq-controller pods were running, the namespace became locked, but the ResourceQuota was not increased. As a result, the migration got stuck in the Pending state:

> $ oc get vmim
> NAME                        PHASE     VMI
> kubevirt-evacuation-s5sft   Pending   vm-fedora-1

> $ oc get resourcequota
> NAME        AGE   REQUEST                     LIMIT
> quota-cpu   16h   requests.cpu: 1001m/1100m   limits.cpu: 1010m/1100m

The ValidatingWebhookConfiguration is present (the namespace is locked):

> $ oc get ValidatingWebhookConfiguration | grep lock
> lock.default.com   1   4m58s

and I can't delete the vmmrq:

> $ oc delete vmmrq my-vmmrq
> Error from server: admission webhook "lock.default.com" denied the request: Migration process is currently being handled by the Managed Quota controller, and as a result, modifications,creation or deletion of virtualMachineMigrationResourceQuotas are not permitted in this namespace, please try again.

After manually removing the lock and recreating the vmmrq, the migration started.

Version-Release number of selected component (if applicable):
4.14

How reproducible:

Steps to Reproduce:
1. Get the mtq-controller and mtq-lock pods running on one node.
2. Create and run a VM on the same node.
3. Drain that node, for example:
   oc adm drain --delete-local-data --ignore-daemonsets=true --force virt-den-415-8gq52-worker-0-9wz8h

Actual results:
Migration does not work.

Expected results:
The VM is successfully migrated during the node drain.

Additional info:
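For reference, a sketch of the manual workaround mentioned above. The webhook and vmmrq names are taken from the output in this report, and the assumption is that the original vmmrq manifest (here called my-vmmrq.yaml) is still available for recreation; verify the actual names in your cluster before deleting anything:

> $ oc get vmmrq my-vmmrq -o yaml                              # note the current spec before removing anything
> $ oc delete validatingwebhookconfiguration lock.default.com  # remove the stale namespace lock
> $ oc delete vmmrq my-vmmrq                                   # succeeds once the lock webhook is gone
> $ oc apply -f my-vmmrq.yaml                                  # recreate the vmmrq from its original manifest so the pending migration is handled again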
Verified on v4.14.0.rhel9-2166

Unfortunately, I see that in some cases during the node drain the resources are increased and the VM migrates, but the namespace stays locked even after the migration has completed.

VM migrated:
> $ oc get vmim
> NAME                        PHASE       VMI
> kubevirt-evacuation-6gwz2   Succeeded   vm-fedora-cpu-mem-1

Namespace still locked:
> $ oc get ValidatingWebhookConfiguration | grep lock
> lock.default.com   1   12m

And the resource quota was increased:
> $ oc get resourcequota
> NAME                 AGE   REQUEST                                                       LIMIT
> quota-cpu-memory-1   73m   requests.cpu: 1010m/2100m, requests.memory: 1369594368/3Gi   limits.cpu: 1010m/2100m, limits.memory: 1369594368/3Gi

I am not 100% sure how to reproduce it; I was able to hit it when draining a node where the mtq-controller, mtq-operator, mtq-lock and virt-launcher pods were running.
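A rough sketch of how the leftover lock can be checked in this state. The mtq-controller deployment name and the openshift-cnv namespace are assumptions and may differ per install:

> $ oc get vmim                                                      # confirm no migration is still in flight
> $ oc get validatingwebhookconfiguration lock.default.com -o yaml   # the lock should have been removed once the migration finished
> $ oc logs -n openshift-cnv deployment/mtq-controller               # assumed name/namespace: look for errors around releasing the lock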
Verified on CNV-v4.14.1.rhel9-62. Drained nodes multiple times; MTQ is working well.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Virtualization 4.14.1 security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:7704
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days