Bug 2218435
| Field | Value |
| --- | --- |
| Summary | queueing multiple VM migrations causes virt-controller to hit a deadlock |
| Product | Container Native Virtualization (CNV) |
| Component | Virtualization |
| Version | 4.13.2 |
| Status | CLOSED MIGRATED |
| Severity | high |
| Priority | high |
| Keywords | Scale |
| Target Milestone | --- |
| Target Release | 4.15.0 |
| Reporter | Boaz <bbenshab> |
| Assignee | Jed Lejosne <jlejosne> |
| QA Contact | Kedar Bidarkar <kbidarka> |
| CC | acardace, dshchedr, fdeutsch, jhopper, ycui |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | If docs needed, set a value |
| Type | Bug |
| Last Closed | 2023-12-14 16:14:10 UTC |
*** Bug 2149960 has been marked as a duplicate of this bug. ***

Created attachment 1973107 [details]
1200 VM migration with migration queue <= parallelMigrationsPerCluster
Frequent migrations on large-scale clusters are not that common, hence setting the priority to High for now, with the Target Release set to 4.14.0.

@bbenshab Thank you for the detailed report! Pardon my ignorance, but where is that virt-controller queue size coming from? Is it a metric? If so, do you know its name? Thank you!

Lowering severity to high as no production workload is impacted.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.
Created attachment 1973086 [details]
virt-controller-queue

I'm running a scale regression setup on:
=========================================
OpenShift 4.13.2
OpenShift Virtualization 4.13.1
OpenShift Container Storage 4.12.4-rhodf
=========================================
This is a large-scale setup with 132 nodes running 6000 RHEL VMs on an external RHCS.

While testing migration of idle VMs in bulks (schedule 100 VM migrations, wait for completion, then schedule another 100), I noticed that the migration completion rate slowly degraded with every bulk, starting at 20 seconds per VM and reaching up to 1570 seconds per VM in the last bulks. To make the root cause easier to spot, I scheduled 800 VM migrations at once.

Ideally, the expected result is that all those migration jobs are queued and then executed at a rate of parallelMigrationsPerCluster. What actually happened is that all of them got stuck in the virt-controller migration queue. They remained there indefinitely while consuming memory and CPU, and even after the VMIMs had already failed, the queue remained unfazed. In fact, the only thing that caused a few of those queue entries to be eliminated was when nonvoluntary_ctxt_switches were triggered. I eventually killed the active virt-controller after 4.5 hours; see the attached image virt-controller-queue.

The way I found to avoid triggering this issue is to ensure, through automation, that the number of scheduled migrating VMs is always <= parallelMigrationsPerCluster. By doing that I was able to complete 1200 VM migrations in just over 12 minutes (a sketch of this batching approach is shown after the reproduction steps below).

It's important to note that this issue is exclusive to the migration flow; for example, when I mass-scheduled 6000 VMs to start, I didn't experience any issues.

Note that I was using the following debug verbosity, but the rate at which those logs were generated and overwritten made them useless at this scale:
============================================================================================================================================
Spec:
  logVerbosityConfig:
    kubevirt:
      virtController: 9
      virtHandler: 9
      virtLauncher: 9
      virtAPI: 9
============================================================================================================================================

Steps to reproduce (this issue is 100% reproducible):
1. Create a cluster with 800 VMs.
2. Initiate a large number of migrations (as easy as running a bunch of "virtctl migrate" commands; see the sketch below).
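For illustration, a minimal sketch of step 2, assuming the VMs share a common name prefix (the prefix and count here are hypothetical) and that you are working in the right namespace. It simply fires "virtctl migrate" for every VM with no throttling, which is exactly what drives the virt-controller migration queue into the stuck state described above:

```bash
# Mass-schedule migrations with no throttling (reproduces the stuck queue).
# VM names are hypothetical; adjust the prefix/count to the actual cluster.
for i in $(seq 1 800); do
  virtctl migrate "rhel-vm-${i}"
done

# The resulting VirtualMachineInstanceMigration (vmim) objects can then be
# watched piling up in the current namespace:
oc get vmim
```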
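And a rough sketch of the batching workaround mentioned above, i.e. never keeping more migrations scheduled than the cluster-wide parallel migration limit. The batch size, the VM selection, and the phase check are illustrative assumptions, not the exact automation used for the 1200-VM run; in CNV the limit itself is typically configured through the HyperConverged CR (spec.liveMigrationConfig.parallelMigrationsPerCluster), so confirm the field and its value against the docs for your version:

```bash
#!/usr/bin/env bash
# Hypothetical batching helper: schedule at most $BATCH migrations at a time
# and wait for all VirtualMachineInstanceMigration (vmim) objects to settle
# before starting the next batch, so the number of in-flight migrations
# never exceeds the cluster-wide limit.
BATCH=5   # assumed to match parallelMigrationsPerCluster

# All VM names in the current namespace.
VMS=$(oc get vm -o jsonpath='{.items[*].metadata.name}')

set -- $VMS
while [ "$#" -gt 0 ]; do
  # Kick off up to $BATCH migrations.
  for _ in $(seq 1 "$BATCH"); do
    [ "$#" -eq 0 ] && break
    virtctl migrate "$1"
    shift
  done

  # Wait until every vmim is either Succeeded or Failed before continuing.
  while oc get vmim -o jsonpath='{range .items[*]}{.status.phase}{"\n"}{end}' \
        | grep -qvE 'Succeeded|Failed'; do
    sleep 10
  done
done
```

Completed vmim objects stay in Succeeded or Failed, which keeps the wait loop simple; the only point is that no new migration is scheduled while earlier ones are still in flight.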