This bug has been migrated to another issue tracking site. It has been closed here and may no longer be being monitored.

If you would like to get updates for this issue, or to participate in it, you may do so at Red Hat Issue Tracker.
Bug 2218435 - queueing multiple VMs migration causes virt-controller to hit a deadlock.
Summary: queueing multiple VMs migration causes virt-controller to hit a deadlock.
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 4.13.2
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.15.0
Assignee: Jed Lejosne
QA Contact: Kedar Bidarkar
URL:
Whiteboard:
Duplicates: 2149960
Depends On:
Blocks:
 
Reported: 2023-06-29 06:28 UTC by Boaz
Modified: 2024-04-13 04:25 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-12-14 16:14:10 UTC
Target Upstream Version:
Embargoed:


Attachments
virt-controller-queue (73.97 KB, image/png)
2023-06-29 06:28 UTC, Boaz
1200 vm migration with migration queue <= parallelMigrationsPerCluster (67.53 KB, image/png)
2023-06-29 07:27 UTC, Boaz


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker   CNV-30386 0 None None None 2023-12-14 16:14:10 UTC

Description Boaz 2023-06-29 06:28:31 UTC
Created attachment 1973086
virt-controller-queue

I'm running a scale regression setup on:
=========================================
OpenShift 4.13.2
OpenShift Virtualization 4.13.1
OpenShift Container Storage - 4.12.4-rhodf
This is a large-scale setup with 132 nodes running 6000 RHEL VMs on an external RHCS.

While I was testing migration of idle VMs in batches (meaning I schedule 100 VM migrations, wait for completion, and then schedule another 100), I noticed that the migration completion time was slowly degrading with every batch, starting at 20 seconds per VM and reaching up to 1570 seconds per VM in the last batches.
To debug this issue, I scheduled 800 VM migrations so that the root cause would be easier to notice.
Ideally, the expected result is that all those migration jobs are queued and then executed at a rate of parallelMigrationsPerCluster.
However, what actually happened is that all those queued migrations got stuck in the virt-controller migration queue.
They remained there indefinitely while consuming memory and CPU. Even after the VMIMs had already failed, the queue remained unfazed; in fact, the only thing that caused a few of the queued items to be cleared was when nonvoluntary_ctxt_switches were triggered. I eventually killed the active virt-controller after 4.5 hours (see the attached image virt-controller-queue).

The way I found to avoid triggering this issue is to make sure, through automation, that the number of VM migrations queued at any time is always <= parallelMigrationsPerCluster.
By doing that, I was able to complete 1200 VM migrations in just over 12 minutes.
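
For illustration, the throttling workaround described above could be sketched roughly as follows. This is not the automation actually used in the report. The names migrate_throttled, submit, count_active, and limit are hypothetical. submit and count_active stand in for whatever the automation uses to start a migration and to count migrations that have not finished yet, and limit is assumed to be set to the cluster's parallelMigrationsPerCluster value.

```python
import time
from typing import Callable, Iterable


def migrate_throttled(
    vm_names: Iterable[str],
    submit: Callable[[str], None],
    count_active: Callable[[], int],
    limit: int,
    poll_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Submit one migration per VM while keeping in-flight migrations <= limit.

    submit and count_active are hypothetical hooks supplied by the caller.
    This sketch assumes count_active() already includes a migration
    immediately after submit() returns; a real cluster may lag behind.
    Returns the highest in-flight count observed right after a submit.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    peak = 0
    for name in vm_names:
        # Wait until there is room for one more migration.
        while count_active() >= limit:
            sleep(poll_seconds)
        submit(name)
        peak = max(peak, count_active())
    return peak
```

The hooks are injected so the loop itself makes no assumption about how migrations are created or counted; only the "never exceed the limit" behavior is shown.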

It's important to note that this issue is exclusive to the migration flow. For example, when I mass-scheduled 6000 VMs for starting, I didn't experience any issues.

Note that I was using the following debug log verbosity settings, but the rate at which those logs were generated and overwritten made them useless at this scale.
============================================================================================================================================

Spec:
  logVerbosityConfig:
    kubevirt:
      virtController: 9
      virtHandler: 9
      virtLauncher: 9
      virtAPI: 9
============================================================================================================================================


Steps to reproduce:
This issue is 100% reproducible.
1. Create a cluster with 800 VMs.
2. Initiate a large number of migrations (as easy as running a bunch of "virtctl migrate").
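
As a rough sketch of step 2, the following builds one "virtctl migrate" command line per VM. The helper name build_migrate_commands, the VM names, and the namespace are hypothetical, and the use of the --namespace flag is an assumption about the local virtctl setup. Nothing is executed by the helper itself.

```python
from typing import Iterable, List


def build_migrate_commands(vm_names: Iterable[str], namespace: str) -> List[List[str]]:
    """Return one `virtctl migrate` argv list per VM name.

    The VM names and namespace are supplied by the caller; this function
    only builds the command lines and does not run them.
    """
    return [
        ["virtctl", "migrate", name, "--namespace", namespace]
        for name in vm_names
    ]
```

Each returned list could then be passed to subprocess.run(argv, check=True) in a loop. Doing so for hundreds of VMs at once is the mass-scheduling pattern that the report says triggers the problem.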

Comment 1 Boaz 2023-06-29 07:19:57 UTC
*** Bug 2149960 has been marked as a duplicate of this bug. ***

Comment 3 Boaz 2023-06-29 07:27:12 UTC
Created attachment 1973107
1200 vm migration with migration queue <= parallelMigrationsPerCluster

Comment 4 Kedar Bidarkar 2023-07-05 12:23:40 UTC
Frequent migrations on large-scale clusters are not that common, hence setting the priority to High for now, but setting the Target Release to 4.14.0.

Comment 5 Jed Lejosne 2023-07-14 13:17:10 UTC
@bbenshab Thank you for the detailed report!
Pardon my ignorance, but where is that virt-controller queue size coming from? Is it a metric? If so, do you know its name?
Thank you!

Comment 7 Fabian Deutsch 2023-09-27 07:27:22 UTC
Lowering sev to high as no prod workload is impacted

Comment 8 Red Hat Bugzilla 2024-04-13 04:25:15 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

