Bug 2069098 - Large scale VM migration is slow due to low migration parallelism
Summary: Large scale VM migration is slow due to low migration parallelism
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 4.9.2
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.12.0
Assignee: Igor Bezukh
QA Contact: Sarah Bennert
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-03-28 09:32 UTC by Boaz
Modified: 2023-01-24 13:37 UTC
11 users

Fixed In Version: hco-bundle-registry-container-v4.12.0-714
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-24 13:36:09 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github kubevirt kubevirt pull 8701 0 None open [virt-controller]: consider the ParallelOutboundMigrationsPerNode when evicting VMs 2022-11-02 09:43:35 UTC
Red Hat Issue Tracker CNV-17235 0 None None None 2022-11-02 10:19:38 UTC
Red Hat Product Errata RHSA-2023:0408 0 None None None 2023-01-24 13:37:30 UTC

Description Boaz 2022-03-28 09:32:44 UTC
Some background:
-------------------------
I'm running a scale OpenShift setup with 100 OpenShift nodes, as preparation for an environment requested by a customer, with 47 RHCS 5.0 hosts as an external storage cluster. Migration, however, was on the slow side, so I played around with the tweaks mentioned below to find the fastest and safest config
for completing such migrations. But no matter what I did, the total migration time from start to finish stayed very similar, which was odd.

This setup is currently running 3000 VMs:
1500 RHEL 8.5 persistent-storage VMs
500 Windows 10 persistent-storage VMs
1000 Fedora ephemeral-storage VMs

The workers are divided into 3 zones:
worker000 - worker031 = Zone0
worker032 - worker062 = Zone1
worker063 - worker096 = Zone2

I start the migration by applying an empty MachineConfig to zone-2,
which then causes those nodes to start draining.
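
(For reference, such an "empty" MachineConfig is roughly the sketch below; the name and the role label are illustrative and assume the zone-2 pool selects on the machineconfiguration.openshift.io/role label. Adding any new MachineConfig to a pool triggers a rolling drain and reboot of its nodes.)
--------------------
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-zone-2-empty          # illustrative name
  labels:
    machineconfiguration.openshift.io/role: zone-2   # assumed pool role label
spec:
  config:
    ignition:
      version: 3.2.0             # no actual content; just forces a new rendered config
--------------------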

This is my migration config:
--------------------
  liveMigrationConfig:
    completionTimeoutPerGiB: 800
    parallelMigrationsPerCluster: 20
    parallelOutboundMigrationsPerNode: 4
    progressTimeout: 150
  workloads: {}
--------------------
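
(For reference, this liveMigrationConfig block lives under the HyperConverged CR spec; a minimal sketch, assuming the default object name and namespace:)
--------------------
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  liveMigrationConfig:
    completionTimeoutPerGiB: 800
    parallelMigrationsPerCluster: 20
    parallelOutboundMigrationsPerNode: 4
    progressTimeout: 150
  workloads: {}
--------------------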


This is the zone-2 config:
--------------------
  maxUnavailable: 10
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/zone-2: ""
--------------------


Another thing worth mentioning is that I'm running a custom KubeletConfig, which is required due to the additional 21,400 pods on the cluster:
--------------------
spec:
  kubeletConfig:
    kubeAPIBurst: 200
    kubeAPIQPS: 100
    maxPods: 500
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: enabled
--------------------
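
(For completeness, the custom-kubelet: enabled label from the machineConfigPoolSelector above goes on the MachineConfigPool(s) the KubeletConfig should apply to. A rough sketch, with the pool name and machineConfigSelector shown for illustration only:)
--------------------
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: zone-2                   # example pool; label each pool that should get the kubelet tuning
  labels:
    custom-kubelet: enabled
spec:
  maxUnavailable: 10
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/zone-2: ""
  machineConfigSelector:
    matchExpressions:
    - key: machineconfiguration.openshift.io/role
      operator: In
      values: [worker, zone-2]
--------------------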


So, to sum up the above config: we are allowed to take down up to 10 nodes at a time, migrate up to 4 VMs per node, and run up to 20 migrations in parallel cluster-wide.

As far as 10 nodes at a time goes - no issues there:
--------------------
worker064        Ready,SchedulingDisabled   worker,zone-2   19d   v1.22.3+e790d7f
worker065        Ready,SchedulingDisabled   worker,zone-2   19d   v1.22.3+e790d7f
worker069        Ready,SchedulingDisabled   worker,zone-2   19d   v1.22.3+e790d7f
worker072        Ready,SchedulingDisabled   worker,zone-2   19d   v1.22.3+e790d7f
worker073        Ready,SchedulingDisabled   worker,zone-2   19d   v1.22.3+e790d7f
worker075        Ready,SchedulingDisabled   worker,zone-2   19d   v1.22.3+e790d7f
worker077        Ready,SchedulingDisabled   worker,zone-2   19d   v1.22.3+e790d7f
worker083        Ready,SchedulingDisabled   worker,zone-2   19d   v1.22.3+e790d7f
worker088        Ready,SchedulingDisabled   worker,zone-2   19d   v1.22.3+e790d7f
worker090        Ready,SchedulingDisabled   worker,zone-2   19d   v1.22.3+e790d7f
--------------------

However, when it comes to the migrations themselves, I measured how many VMs we actually migrate at any given time over the course of migrating 1000 VMs:


+----------------+------------------+--------------------------+
| Migrating      | time spent (sec) | time spent in percentage |
+----------------+------------------+--------------------------+
| 1              | 952              | 10.33%                   |
+----------------+------------------+--------------------------+
| 2              | 748              | 8.11%                    |
+----------------+------------------+--------------------------+
| 3              | 1250             | 13.56%                   |
+----------------+------------------+--------------------------+
| 4              | 874              | 9.48%                    |
+----------------+------------------+--------------------------+
| 5              | 886              | 9.61%                    |
+----------------+------------------+--------------------------+
| 6              | 413              | 4.48%                    |
+----------------+------------------+--------------------------+
| 7              | 361              | 3.92%                    |
+----------------+------------------+--------------------------+
| 8              | 189              | 2.05%                    |
+----------------+------------------+--------------------------+
| 9              | 143              | 1.55%                    |
+----------------+------------------+--------------------------+
| 10             | 109              | 1.18%                    |
+----------------+------------------+--------------------------+
| 11             | 38               | 0.41%                    |
+----------------+------------------+--------------------------+
| 12             | 7                | 0.08%                    |
+----------------+------------------+--------------------------+
| 13             | 24               | 0.26%                    |
+----------------+------------------+--------------------------+
| Doing nothing  | 3226             | 34.99%                   |
+----------------+------------------+--------------------------+

As you can see, about 55% of the time we are migrating somewhere between 1 and 6 VMs in parallel.
(~35% of the time nothing is migrating at all - mostly, but not only, because nodes are down for reboots.)


Versions of all relevant components:
CNV	4.9.2
RHCS	5.0
OCP     4.9.15


CNV must-gather:
-----------------
http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/must-gather.migration-misbehaving.tar.gz

Comment 1 sgott 2022-03-30 12:54:48 UTC
parallelOutboundMigrationsPerNode is a tunable setting. Did you adjust this at all during your testing? I believe machineConfig goes node-by-node when it evicts.

Comment 2 Boaz 2022-03-30 13:35:44 UTC
Hey Stu, 
I did, as I mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2069098#c0: parallelOutboundMigrationsPerNode is set to 4 (the default is 2).

Comment 3 sgott 2022-04-06 12:29:29 UTC
I suggest increasing it even further. parallelMigrationsPerCluster is set to 20, but that limit cannot be reached via machine config. Correct me if I'm wrong, but with machine config only one node goes down at a time, so only 4 migrations will ever happen in parallel. If you'd like 20 migrations to happen at once, parallelOutboundMigrationsPerNode also needs to be set to 20.
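In other words, something like this (a sketch, keeping the other values as they are):
--------------------
  liveMigrationConfig:
    completionTimeoutPerGiB: 800
    parallelMigrationsPerCluster: 20
    parallelOutboundMigrationsPerNode: 20
    progressTimeout: 150
  workloads: {}
--------------------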

Comment 4 Kedar Bidarkar 2022-04-13 12:19:17 UTC
Boaz, can you retest as per the suggestion in comment 3 above?

Comment 5 Boaz 2022-04-14 13:34:11 UTC
Hey @kbidarka, as I mentioned in the main thread, 10 nodes go down in parallel, and we reached up to 13 parallel VM migrations, but only for 0.26% of the entire run duration (9219 seconds). What I would like to see is 20 parallel VM migrations for the entire migration duration (with some reasonable exceptions), which would significantly shorten the overall migration time.


this is zone-2 config:
--------------------
  maxUnavailable: 10
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/zone-2: ""
--------------------



So, to sum up the above config: we are allowed to take down up to 10 nodes at a time, migrate up to 4 VMs per node, and run up to 20 migrations in parallel cluster-wide.

As far as 10 nodes at a time goes - no issues there:
--------------------
worker064        Ready,SchedulingDisabled   worker,zone-2   19d   v1.22.3+e790d7f
worker065        Ready,SchedulingDisabled   worker,zone-2   19d   v1.22.3+e790d7f
worker069        Ready,SchedulingDisabled   worker,zone-2   19d   v1.22.3+e790d7f
worker072        Ready,SchedulingDisabled   worker,zone-2   19d   v1.22.3+e790d7f
worker073        Ready,SchedulingDisabled   worker,zone-2   19d   v1.22.3+e790d7f
worker075        Ready,SchedulingDisabled   worker,zone-2   19d   v1.22.3+e790d7f
worker077        Ready,SchedulingDisabled   worker,zone-2   19d   v1.22.3+e790d7f
worker083        Ready,SchedulingDisabled   worker,zone-2   19d   v1.22.3+e790d7f
worker088        Ready,SchedulingDisabled   worker,zone-2   19d   v1.22.3+e790d7f
worker090        Ready,SchedulingDisabled   worker,zone-2   19d   v1.22.3+e790d7f
--------------------

Comment 6 Kedar Bidarkar 2022-04-20 12:33:44 UTC
We would need:
a) virt-controller logs
b) Event logs from the VMIs for the start and stop of the migrations
c) Time points for when the evictions start and stop on the VMIs
d) The virt-api logs
e) The total migration time taken currently

Comment 7 Kedar Bidarkar 2022-04-20 12:36:53 UTC
We see the must-gather in the description; we will take a look at the logs in it first.

Comment 8 Boaz 2022-08-16 18:45:57 UTC
We ran this large-scale setup again, this time with 117 nodes and 5000 persistent VMs, and I'm again seeing the exact same behavior, which impacts our migration performance.
This time we are running on:
OCP 4.10.20
OpenShift Virtualization 4.10.3a

The following table shows the number of parallel migrations we had, as a percentage of the total run time (time spent is in seconds):


+-------------------------+------------+--------------------------+
| VMS Parallel migrations | time spent | time spent in percentage |
+-------------------------+------------+--------------------------+
|            1            |    1457    |          14.12%          |
+-------------------------+------------+--------------------------+
|            2            |     922    |           8.93%          |
+-------------------------+------------+--------------------------+
|            3            |     518    |           5.02%          |
+-------------------------+------------+--------------------------+
|            4            |     351    |           3.40%          |
+-------------------------+------------+--------------------------+
|            5            |     402    |           3.90%          |
+-------------------------+------------+--------------------------+
|            6            |     407    |           3.94%          |
+-------------------------+------------+--------------------------+
|            7            |     564    |           5.47%          |
+-------------------------+------------+--------------------------+
|            8            |     531    |           5.15%          |
+-------------------------+------------+--------------------------+
|            9            |     543    |           5.26%          |
+-------------------------+------------+--------------------------+
|            10           |     643    |           6.23%          |
+-------------------------+------------+--------------------------+
|            11           |     421    |           4.08%          |
+-------------------------+------------+--------------------------+
|            12           |     118    |           1.14%          |
+-------------------------+------------+--------------------------+
|            13           |     36     |           0.35%          |
+-------------------------+------------+--------------------------+
|            14           |     74     |           0.72%          |
+-------------------------+------------+--------------------------+
|            15           |      8     |           0.08%          |
+-------------------------+------------+--------------------------+
|            16           |      8     |           0.08%          |
+-------------------------+------------+--------------------------+
|            17           |      7     |           0.07%          |
+-------------------------+------------+--------------------------+
|            18           |     23     |           0.22%          |
+-------------------------+------------+--------------------------+
|      Doing Nothing      |    3286    |          31.84%          |
+-------------------------+------------+--------------------------+

Migration config, same as before:
--------------------
  liveMigrationConfig:
    completionTimeoutPerGiB: 800
    parallelMigrationsPerCluster: 20
    parallelOutboundMigrationsPerNode: 4
    progressTimeout: 150
  workloads: {}
--------------------

This time I also collected the VMI timestamps:
http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/parallel_migration/1309vms_migration.csv

along with the must-gather:
http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/parallel_migration/must-gather-bz2069098-repo.tar.gz

Comment 9 Fabian Deutsch 2022-08-17 12:53:55 UTC
Taking a note from an off-list discussion:
Looking at the whole node-draining window: ~31% of the time, no migration was taking place; ~14% of the time, only one VM was being migrated at a time; ~9% of the time, 2 VMs were being migrated at a time.

Thus the problem is: why did the system never reach the configured limit of migrating at most 20 VMs at a time?

Comment 10 sgott 2022-08-17 13:02:01 UTC
Per comment #8, for almost 32% of the total time there are no migrations in any state of progress at all. Is this a side effect of using machine config to trigger the migrations?

Comment 11 Boaz 2022-08-17 13:58:26 UTC
(In reply to sgott from comment #10)
> Per comment #8 almost 32% of the total time, there are no migrations in any
> state of progress at all. Is this a side effect of using machine config to
> trigger the migrations?

Correct, as mentioned in comment 0.

Comment 12 Boaz 2022-08-17 14:46:16 UTC
Adding another instance - this time with 1500 VMs. As we can see, the results are very similar to those of the ~1300-VM run:

A brief explanation of the table: "time spent" is how long we spent migrating X VMs in parallel. For example, from the table below we can see that for 9.49% of the entire test duration, or 1023 seconds out of 10784 (10784 seconds being the sum of the "time spent" column, i.e. our 100%), we were migrating only 2 VMs in parallel.

+-------------------------+------------------+--------------------------+
| VMS Parallel migrations | time spent (sec) | time spent in percentage |
+-------------------------+------------------+--------------------------+
|            1            |       1475       |          13.68%          |
+-------------------------+------------------+--------------------------+
|            2            |       1023       |           9.49%          |
+-------------------------+------------------+--------------------------+
|            3            |        518       |           4.80%          |
+-------------------------+------------------+--------------------------+
|            4            |        328       |           3.04%          |
+-------------------------+------------------+--------------------------+
|            5            |        378       |           3.51%          |
+-------------------------+------------------+--------------------------+
|            6            |        394       |           3.65%          |
+-------------------------+------------------+--------------------------+
|            7            |        534       |           4.95%          |
+-------------------------+------------------+--------------------------+
|            8            |        523       |           4.85%          |
+-------------------------+------------------+--------------------------+
|            9            |        527       |           4.89%          |
+-------------------------+------------------+--------------------------+
|            10           |        604       |           5.60%          |
+-------------------------+------------------+--------------------------+
|            11           |        382       |           3.54%          |
+-------------------------+------------------+--------------------------+
|            12           |        118       |           1.09%          |
+-------------------------+------------------+--------------------------+
|            13           |        36        |           0.33%          |
+-------------------------+------------------+--------------------------+
|            14           |        74        |           0.69%          |
+-------------------------+------------------+--------------------------+
|            15           |         8        |           0.07%          |
+-------------------------+------------------+--------------------------+
|            16           |         8        |           0.07%          |
+-------------------------+------------------+--------------------------+
|            17           |         7        |           0.06%          |
+-------------------------+------------------+--------------------------+
|            18           |        23        |           0.21%          |
+-------------------------+------------------+--------------------------+
|      Doing Nothing      |       3824       |          35.46%          |
+-------------------------+------------------+--------------------------+


VMI migration detailed timestamps info:
http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/parallel_migration/1500vms_migration.csv

full VM map must-gather:
http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/parallel_migration/must-gather-bz2069098-repo_1500_vms.tar.gz

Comment 15 sgott 2022-09-07 12:54:10 UTC
Deferring to the next release because we don't have a clear diagnosis or path forward.

Comment 16 sgott 2022-09-07 13:02:00 UTC
Actually, before we defer this, I'd like to pose a question: do we see the same migration slowness during an upgrade, or is this related to machine config? Can you test that?

Comment 17 Igor Bezukh 2022-09-29 12:58:06 UTC
Hi,

I did some experiments and I can clearly see that the most efficient way to use the available migration parallelism is to use a taint.

When you taint a worker node with the default KubeVirt key "kubevirt.io/drain=true:NoSchedule", the virt-controller takes all the VMIs
that reside on that node and are eligible to be migrated, and creates migration objects for them.

With the regular drain procedure, virt-api hooks all the eviction API calls and marks the VMI objects with a patch call. That is a lot of API calls, which do not necessarily run in parallel.
We are also limited by the rate of the virt-controller control loop execution, since not all the VMIs are patched at the same time.
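
For illustration, the taint on the node spec looks roughly like this (it can equally be applied with kubectl/oc taint):
--------------------
spec:
  taints:
  - key: kubevirt.io/drain
    value: "true"
    effect: NoSchedule
--------------------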

Comment 18 Igor Bezukh 2022-10-02 07:18:57 UTC
Hi Boaz,

I need your help downloading the must-gather files; I can't access the links provided in the discussion above.

In addition, when you apply the machine config to the zone, can you configure some custom taints to be applied as part of the OCP drain procedure?

Comment 19 Igor Bezukh 2022-10-02 16:35:58 UTC
BTW, do we have any certainty regarding the parallelism of the drain procedure itself?

Comment 20 Igor Bezukh 2022-10-06 08:46:03 UTC
Hi,

I managed to download the must-gather, and while observing the virt-controller logs I noticed the following type of error, which appears at the times when the migration rate is low:

2022-08-17T10:48:58.936994585Z {"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance worker011","pos":"evacuation.go:298","reason":"Internal error occurred: failed calling webhook \"migration-create-validator.kubevirt.io\": failed to call webhook: Post \"https://virt-api.openshift-cnv.svc:443/migration-validate-create?timeout=10s\": context deadline exceeded","timestamp":"2022-08-17T10:48:58.936863Z"}

This means the virt-controller failed to create a new migration because the webhook was unresponsive. This usually happens when the API server is overloaded (although we already have 6 instances of virt-api).

Perhaps this can explain the "doing nothing" percentage in the histogram above.

I was wondering whether we could try to reproduce the issue while using a dedicated migration network. I want to rule out high network congestion as the cause of the "context deadline exceeded" errors against the API.
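
A rough sketch of what that would involve (the CNI/IPAM choice and the network field placement are assumptions here, meant only to illustrate the idea):
--------------------
# secondary network for migration traffic in the CNV namespace
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: migration-network
  namespace: openshift-cnv
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth1",
      "ipam": { "type": "whereabouts", "range": "192.168.200.0/24" }
    }
---
# then point live migration at it in the HCO spec
spec:
  liveMigrationConfig:
    network: migration-network
--------------------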

Comment 21 Boaz 2022-10-19 06:12:47 UTC
Hey, 
the "Doing nothing" percentage is mainly related to the maxUnavailable settings on the zones e.g: 

--------------------
  maxUnavailable: 11
  nodeSelector:
    matchLabels:
--------------------

We wait for the nodes to come back from their reboot before proceeding to the next node drain. And yes, we do see an occasional "failed calling webhook";
that is already a work in progress - you can catch up here: https://github.com/kubevirt/kubevirt/issues/7101#
thanks.

Comment 22 Igor Bezukh 2022-10-20 09:03:39 UTC
Hi,

To summarize my conclusion: we should set the cluster-wide maximum parallel migrations equal to the maximum parallel outbound migrations per node, and here is why:

- kubectl drain logic doesn't drain multiple nodes in parallel, but kubectl does create a goroutine per pod to be evicted.
  This introduces latency before we can proceed to the next node to drain: when kubectl tries to evict the pod of a migratable VMI, the PDB blocks
  the eviction, and kubectl then waits 5 seconds before issuing a new eviction request (see the PDB sketch at the end of this comment). That means kubectl won't proceed to the next node until all the VMIs have been migrated off the current one.

- Even if the drain itself is somehow optimized in the Machine Config Operator, there is still a chance that the migration queue gets populated with VMI pods that all belong to the same node, in which case we hit the per-node outbound limit.

I also suggest trying the dedicated migration network feature; both the control-plane traffic and the migration speed could benefit from it, IMO.
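
For reference, the PDB that blocks the eviction in the first bullet has roughly this shape per migratable VMI (the name and label values here are illustrative, not the exact objects KubeVirt creates):
--------------------
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kubevirt-disruption-budget-example
  namespace: vm-namespace
spec:
  minAvailable: 1                # eviction of the single virt-launcher pod is denied,
  selector:                      # so kubectl keeps retrying every ~5s until the VMI migrates away
    matchLabels:
      kubevirt.io/created-by: <vmi-uid>
--------------------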

Comment 23 Igor Bezukh 2022-10-20 12:42:53 UTC
Following my observations, the MCO can issue the drain procedure on multiple nodes simultaneously, up to the maxUnavailable value.

I can suggest an optimization on the KubeVirt side: in the evacuation controller we could try to organize the migration-candidate queue so that the candidates come from as many different nodes as possible.

Comment 24 Fabian Deutsch 2022-11-09 13:39:08 UTC
Igor, to be clear: the issue is that with plain kube, if multiple nodes are drained, one node must be drained completely before the next node is approached.
And how fast a node can be drained depends on how fast we can migrate the VMs off it.

IOW, per-node eviction is parallel, but node draining is serial?

Comment 25 Igor Bezukh 2022-11-09 13:43:38 UTC
Yes, exactly - with regard to plain kube, per-node eviction is parallel and node draining is serial.

Comment 28 Sarah Bennert 2022-12-12 13:41:30 UTC
VERIFIED

Comment 32 errata-xmlrpc 2023-01-24 13:36:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.12.0 Images security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:0408

