Bug 2218435
| Field | Value |
| --- | --- |
| Summary | queueing multiple VM migrations causes virt-controller to hit a deadlock |
| Product | Container Native Virtualization (CNV) |
| Component | Virtualization |
| Version | 4.13.2 |
| Status | CLOSED MIGRATED |
| Severity | high |
| Priority | high |
| Keywords | Scale |
| Target Milestone | --- |
| Target Release | 4.15.0 |
| Reporter | Boaz <bbenshab> |
| Assignee | Jed Lejosne <jlejosne> |
| QA Contact | Kedar Bidarkar <kbidarka> |
| CC | acardace, dshchedr, fdeutsch, jhopper, ycui |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | If docs needed, set a value |
| Type | Bug |
| Last Closed | 2023-12-14 16:14:10 UTC |
*** Bug 2149960 has been marked as a duplicate of this bug. ***

Created attachment 1973107 [details]
1200 VM migration with migration queue <= parallelMigrationsPerCluster
Frequent migrations on large-scale clusters are not that common, hence setting the priority to High for now, with the Target Release set to 4.14.0.

@bbenshab Thank you for the detailed report! Pardon my ignorance, but where is that virt-controller queue size coming from? Is it a metric? If so, do you know its name? Thank you!

Lowering severity to high as no production workload is impacted.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.
Created attachment 1973086 [details]
virt-controller-queue

I'm running a scale regression setup on:
=========================================
OpenShift 4.13.2
OpenShift Virtualization 4.13.1
OpenShift Container Storage 4.12.4-rhodf
=========================================
This is a large-scale setup with 132 nodes running 6000 RHEL VMs on an external RHCS.

While testing migration of idle VMs in bulks (schedule 100 VM migrations, wait for completion, then schedule another 100), I noticed that the migration completion rate slowly degraded with every bulk, starting at 20 seconds per VM and reaching up to 1570 seconds per VM in the last bulks. To make the root cause easier to spot, I scheduled 800 VM migrations at once.

Ideally, the expected result is that all those migration jobs are queued and then executed at a rate of parallelMigrationsPerCluster. What actually happened is that all of them got stuck in the virt-controller migration queue. They remained there indefinitely while consuming memory and CPU, and even after the VMIMs had already failed, the queue remained unfazed. In fact, the only thing that caused a few of those queue entries to be eliminated was when nonvoluntary_ctxt_switches were triggered. I eventually killed the active virt-controller after 4.5 hours; see the attached image virt-controller-queue.

The way I found to avoid triggering this issue is to ensure, through automation, that the number of scheduled migrating VMs is always <= parallelMigrationsPerCluster. By doing that I was able to complete 1200 VM migrations in just over 12 minutes (a sketch of this batching approach is shown after the reproduction steps below).

It's important to note that this issue is exclusive to the migration flow; for example, when I mass-scheduled 6000 VMs to start, I didn't experience any issues.

Note that I was using the following debug verbosity, but the rate at which those logs were generated and overwritten made them useless at this scale:
============================================================================================================================================
Spec:
  logVerbosityConfig:
    kubevirt:
      virtController: 9
      virtHandler: 9
      virtLauncher: 9
      virtAPI: 9
============================================================================================================================================

Steps to reproduce (this issue is 100% reproducible):
1. Create a cluster with 800 VMs.
2. Initiate a large number of migrations (as easy as running a bunch of "virtctl migrate" commands; see the sketch below).
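For illustration, a minimal sketch of step 2, assuming the VMs share a common name prefix (the prefix and count here are hypothetical) and that you are working in the right namespace. It simply fires "virtctl migrate" for every VM with no throttling, which is exactly what drives the virt-controller migration queue into the stuck state described above:

```bash
# Mass-schedule migrations with no throttling (reproduces the stuck queue).
# VM names are hypothetical; adjust the prefix/count to the actual cluster.
for i in $(seq 1 800); do
  virtctl migrate "rhel-vm-${i}"
done

# The resulting VirtualMachineInstanceMigration (vmim) objects can then be
# watched piling up in the current namespace:
oc get vmim
```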
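And a rough sketch of the batching workaround mentioned above, i.e. never keeping more migrations scheduled than the cluster-wide parallel migration limit. The batch size, the VM selection, and the phase check are illustrative assumptions, not the exact automation used for the 1200-VM run; in CNV the limit itself is typically configured through the HyperConverged CR (spec.liveMigrationConfig.parallelMigrationsPerCluster), so confirm the field and its value against the docs for your version:

```bash
#!/usr/bin/env bash
# Hypothetical batching helper: schedule at most $BATCH migrations at a time
# and wait for all VirtualMachineInstanceMigration (vmim) objects to settle
# before starting the next batch, so the number of in-flight migrations
# never exceeds the cluster-wide limit.
BATCH=5   # assumed to match parallelMigrationsPerCluster

# All VM names in the current namespace.
VMS=$(oc get vm -o jsonpath='{.items[*].metadata.name}')

set -- $VMS
while [ "$#" -gt 0 ]; do
  # Kick off up to $BATCH migrations.
  for _ in $(seq 1 "$BATCH"); do
    [ "$#" -eq 0 ] && break
    virtctl migrate "$1"
    shift
  done

  # Wait until every vmim is either Succeeded or Failed before continuing.
  while oc get vmim -o jsonpath='{range .items[*]}{.status.phase}{"\n"}{end}' \
        | grep -qvE 'Succeeded|Failed'; do
    sleep 10
  done
done
```

Completed vmim objects stay in Succeeded or Failed, which keeps the wait loop simple; the only point is that no new migration is scheduled while earlier ones are still in flight.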