Bug 1903679 - [scale] 1K batch start VMS fail
Summary: [scale] 1K batch start VMS fail
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 2.5.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: ---
Target Release: 4.8.0
Assignee: sgott
QA Contact: Israel Pinto
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-12-02 16:01 UTC by guy chen
Modified: 2021-07-27 14:22 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 14:21:17 UTC
Target Upstream Version:
Embargoed:




Links
- GitHub: kubevirt/kubevirt pull 4772 (closed) - virt-controller: increase the number of VMI controller threads (last updated 2021-02-12 09:30:43 UTC)
- Red Hat Product Errata: RHSA-2021:2920 (last updated 2021-07-27 14:22:27 UTC)

Description guy chen 2020-12-02 16:01:58 UTC
Description of problem:

I ran a batch start of 1K VMs on the system, i.e. I ran the command virtctl start <VM> 1K times, once per VM.
Only 106 VMs started; all the rest ended up in the following states:

[root]# oc get vmi | grep -c Run
106
[root]# oc get vmi | grep -c Failed
330
[root]# oc get vmi | grep -c Scheduling
137
[root]# oc get vmi | grep -c Pending
144

The rest of the VMIs have no status at all.


Version-Release number of selected component (if applicable):

NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-rc.4   True        False         21d     Cluster version is 4.6.0-rc.4
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v2.5.0   OpenShift Virtualization   2.5.0     kubevirt-hyperconverged-operator.v2.4.3   Succeeded


How reproducible:

Always

Steps to Reproduce:

1. Create 1K PVCs
2. Create 1K VMs
3. Run virtctl start VM_NAME 1K times, once per VM (see the loop sketch below)
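A minimal sketch of step 3, assuming the VMs are named vm-1 through vm-1000 (matching the names that show up in the logs below):

for i in $(seq 1 1000); do
  virtctl start "vm-$i"
done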

Actual results:
Not all VMIs start

Expected results:
All VMIs start

Comment 1 Jed Lejosne 2020-12-02 16:28:36 UTC
Would it be possible to provide some logs for this?
Maybe pick one VMI per phase, and provide:
- The virt-handler logs for that VMI
- Its virt-launcher logs (if virt-launcher even started)
- Any other VMI-specific logs if available, or PVC logs if any; we need everything we can get to diagnose this issue

Ideally we'd be able to pull out something similar to the artifacts of failed functional tests, but it's probably much trickier at that scale... For example:
https://gcsweb.apps.ovirt.org/gcs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/4555/pull-kubevirt-e2e-k8s-1.18/1334089402359484416/artifacts/k8s-reporter/
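In the meantime, the per-VMI logs above can be grabbed with plain oc commands, for example (a sketch; assumes the VMIs run in the default namespace, CNV is installed in openshift-cnv, and virt-handler pods carry the kubevirt.io=virt-handler label):

# virt-launcher logs for one VMI (the compute container)
oc logs -n default <virt-launcher-pod> -c compute
# virt-handler logs from the node that VMI was scheduled on
oc get pods -n openshift-cnv -l kubevirt.io=virt-handler -o wide
oc logs -n openshift-cnv <virt-handler-pod-on-that-node>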

Thank you!

Comment 3 Jed Lejosne 2020-12-02 17:12:44 UTC
285 virt-launcher logs contain the panic described in https://github.com/kubevirt/kubevirt/issues/3933 (and https://github.com/kubevirt/kubevirt/issues/4556).
Unfortunately, it looks like a pretty meaningless error; it probably just means that qemu crashed.
Isn't there a VMI "debug" option to redirect the libvirtd and qemu logs to the virt-launcher pod logs? I thought there was one but I can't seem to find it.

Comment 4 Jed Lejosne 2020-12-02 17:19:18 UTC
Looking at the kubevirt code, it seems like what I'm looking for is the VMI boolean label "debugLogs".
Any chance that could be set to true in the next run?
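For reference, one way to set it on an existing VM so that the next started VMI inherits it (a sketch; assumes the label is read from the VM template when the virt-launcher pod is created):

oc patch vm <vm-name> --type=merge \
  -p '{"spec":{"template":{"metadata":{"labels":{"debugLogs":"true"}}}}}'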
Thank you!

Comment 6 Jed Lejosne 2020-12-03 16:38:42 UTC
After having a look at the cluster, it looks like qemu is actually never started on failed VMIs.
In fact, the libvirt VM XML is also never created.
On the run I looked at, the first 108 VMs successfully started, then everything else failed.

Manually creating a VM from its XML, from inside the compute container of an about-to-fail virt-launcher, errors out on a missing tap0.
Indeed, `ip addr` shows no tap0...
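For reference, the same check can be run from outside the pod (a sketch; <launcher-pod> stands in for the failing virt-launcher pod):

oc exec <launcher-pod> -c compute -- ip addr show tap0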
Could both the tap creation and libvirt VM creation have *silently* failed?...
I'll keep digging.

Comment 7 Jed Lejosne 2020-12-03 23:26:33 UTC
The previous lead was a red herring. I created a VMI without any NIC and it still failed to create.
Instead, the timestamps in the logs strongly suggest that we're simply hitting the (3-minute) timeout, just like the logs say!

On failed VMIs, the last log lines on compute are:
2020-12-02T12:52:10.974818638Z {"component":"virt-launcher","level":"info","msg":"Marked as ready","pos":"virt-launcher.go:71","timestamp":"2020-12-02T12:52:10.974746Z"}
2020-12-02T12:57:10.978575541Z panic: timed out waiting for domain to be defined

Looking for that same line ("Marked as ready") and the line that follows it in the logs of the first VMI created, we get:
2020-11-26T13:27:22.283172288Z {"component":"virt-launcher","level":"info","msg":"Marked as ready","pos":"virt-launcher.go:71","timestamp":"2020-11-26T13:27:22.283111Z"}
2020-11-26T13:27:25.041030280Z {"component":"virt-launcher","kind":"","level":"info","msg":"Executing PreStartHook on VMI pod environment","name":"vm-1","namespace":"default","pos":"manager.go:982",

That's a ~3 second delay between the two.

The 10th VMI says:
2020-11-26T13:27:36.222406947Z {"component":"virt-launcher","level":"info","msg":"Marked as ready","pos":"virt-launcher.go:71","timestamp":"2020-11-26T13:27:36.222314Z"}
2020-11-26T13:28:01.002579062Z {"component":"virt-launcher","kind":"","level":"info","msg":"Executing PreStartHook on VMI pod environment","name":"vm-10","namespace":"default","pos":"manager.go:982"

That's 25 seconds.
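For reference, those two lines can be pulled for any virt-launcher pod with something like this (a sketch):

oc logs <virt-launcher-pod> -c compute --timestamps | grep -E 'Marked as ready|Executing PreStartHook'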

Here are some more delays:
VMI 25:  1min 10 seconds
VMI 50:  2min 37 seconds
VMI 75:  3min 33 seconds (more than the 3 minutes timeout, probably because of un-logged events happening in between, or just scheduling delays)
VMI 100: 4min 41 seconds

Furthermore, when creating a new VMI on that cluster, it takes about 10 minutes for it to even be considered for creation (then it will eventually fail on the same timeout).

Some component is getting badly overwhelmed and creating a bottleneck.
Prometheus seems to be going crazy, with 150% CPU usage and GBs of memory usage; probably a good candidate for further investigation.

Comment 8 Jed Lejosne 2020-12-04 13:34:15 UTC
To add to the previous comment, the 600 remaining VMs all started successfully overnight (I killed 400 yesterday to make the cluster somewhat usable).
So the overwhelmed component eventually settled. It's interesting to note that Prometheus is now down to ~15% CPU usage.

Comment 9 guy chen 2020-12-06 10:41:56 UTC
So is the solution either to limit the number of VMs per batch or to increase the VM timeouts?

Comment 10 Jed Lejosne 2020-12-16 19:37:55 UTC
Here are a few experiment results:
- Increasing the virt-launcher timeout [0] seems to be a viable workaround, but it doesn't fix the root cause
- Similarly, adding a timeout in virt-launcher to first wait for the domain to get created allows all VMs to start [1]
  Still not the proper fix, but it would be a good idea to add that timeout anyway, since waiting for a non-existent domain to start doesn't make much sense
- Increasing the virt-controller thread count [2] helps a bit, but doesn't solve the issue.
  Still, it could be a good idea to double it, like I did in the next tests, to ensure virt-controller never gets overwhelmed
- With a higher virt-controller thread count, while starting 1000* VMs still fails, directly creating 1000* (identically spec'ed) VMIs now works!
- Creating the 1000 VMs with spec.running=true (autostart) behaves quite interestingly (see the sketch below).
  At first, virt-launcher pods are created when the VMIs are ready to be scheduled, which solves the timeout issue.
  However, while the first ~90 VMs start pretty quickly, that's followed by a major slowdown, to about 1 VM started per minute.
  Then after a while, hundreds of VMIs get started all at once, along with their virt-launcher pods, and many eventually hit the timeout and fail...
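A sketch of the autostart variant on already-created VMs, flipping spec.running to true in a batch (assumes VM names vm-1 through vm-1000):

for i in $(seq 1 1000); do
  oc patch vm "vm-$i" --type=merge -p '{"spec":{"running":true}}'
done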

So the proper fix probably has to do with the scheduling of VM startup, maybe around virt-api?

* Tests were actually performed on only 850 VMs/VMIs, not 1000, due to cluster limitations, but I expect similar results at 1000
[0] https://github.com/kubevirt/kubevirt/blob/6dd0872755456635c82ce369561eeca1a2408cd2/cmd/virt-launcher/virt-launcher.go#L60
[1] https://github.com/jean-edouard/kubevirt/commit/6ff1169338158cbdd42cf71fda159f2aea087420
[2] https://github.com/kubevirt/kubevirt/blob/606ee025ee8c80018678063aa66da586222a8bfa/pkg/virt-controller/watch/application.go#L77

Comment 11 Fabian Deutsch 2020-12-21 11:01:50 UTC
Do we have a better idea of the root cause? Or an idea how to identify it?

> to ensure virt-controller never gets overwhelmed

Could the operator automatically adjust the thread count according to the number of objects?
Or could we send an alert to detect when the controller is getting overwhelmed?

> creating directly 1000* (equally spec-ed) VMIs now works!

That's really interesting.
Do we know why VMs don't work?

> which solves the timeout issue.

Do we have an idea of why the slowdowns happen at the different stages?

@rmohr FYI

Comment 12 Jed Lejosne 2020-12-21 16:46:23 UTC
@fdeutsch I still have no clear idea what the root cause is; it's hard to pinpoint and tests take a lot of time, but knowing it happens as a result of the VM-to-VMI process should help narrow down the investigation.
The delays also seem to be correlated with the number of VMIs currently in Pending state.

Regardless, it's now pretty obvious that this is indeed a bug.
While waiting for a fix, the workaround I'd recommend to anyone running into this would be to start at most 200 VMs at once, then wait for all of them to be running before starting another batch.
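A sketch of that workaround (assumes VM names vm-1 through vm-1000 and that nothing else is running in the namespace):

for batch in $(seq 0 4); do
  for i in $(seq $((batch * 200 + 1)) $(((batch + 1) * 200))); do
    virtctl start "vm-$i"
  done
  # wait until the whole batch is Running before starting the next one
  while [ "$(oc get vmi --no-headers | grep -c Running)" -lt $(((batch + 1) * 200)) ]; do
    sleep 30
  done
done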

Comment 13 Jed Lejosne 2021-01-05 20:01:20 UTC
I found a recurring error in the logs that I hoped might be part of the problem (spoiler: not really, but still worth documenting...)

Looking at the virt-handler logs after starting a whole lot of VMs, this infamous error keeps happening, at least once per VMI:

{"component":"virt-handler","kind":"","level":"error","msg":"Updating the VirtualMachineInstance status failed.","name":"vm-20","namespace":"default","pos":"vm.go:1451","reason":"Operation cannot be fulfilled on virtualmachineinstances.kubevirt.io \"vm-20\": the object has been modified; please apply your changes to the latest version and try again","timestamp":"2021-01-05T19:29:14.449560Z","uid":"dd0dbf6b-4326-492f-88ce-1cba9f5d34f4"}

It's a pretty common kubernetes error and just means an object was updated between a Get() and an Update().
It's also mostly harmless, since it usually triggers a retry.
However, my theory was that virt-handlers were maybe getting overwhelmed with retries.
So I wrote some code to re-Get() right before Update() to minimize the chances of getting the error (https://github.com/jean-edouard/kubevirt/commit/c279b87c9be8b734ffd5c8966d32ae1369c4b39f).
Note: the proper fix for the above error is probably to use Patch() instead of Update(), but I suspect that would take a lot of work.
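To get a rough idea of how often each virt-handler hits that conflict, something like this works (a sketch; assumes the openshift-cnv namespace and the kubevirt.io=virt-handler pod label):

for p in $(oc get pods -n openshift-cnv -l kubevirt.io=virt-handler -o name); do
  echo -n "$p: "
  oc logs -n openshift-cnv "$p" | grep -c 'the object has been modified'
done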

The patch above did push the number of successfully started VMIs from ~170 to ~230 (out of 850; trying to start ~300 VMs at once is likely to succeed, it's all about the length of the queue).
The potential improvement is pretty insignificant and could just have been a fluke, so back to the drawing board.

Comment 14 Fabian Deutsch 2021-01-08 13:20:48 UTC
Interesting observation.

Do you - or @rmohr - have an idea of how Kube itself handles huge numbers of objects that need to be processed? Can we learn from them?

Comment 15 Ruth Netser 2021-03-31 11:26:32 UTC
Jed, Guy, what's the status of this bug?

Comment 16 Jed Lejosne 2021-03-31 12:14:51 UTC
The bug was fixed in kubevirt master mid-January.
Starting 1000 VMs should now work, but will still take a long time (about 1000 seconds, so 15+ minutes).
The fix was not backported to any existing kubevirt release branch, please let me know if it should be.

Comment 17 Ruth Netser 2021-03-31 14:25:32 UTC
(In reply to Jed Lejosne from comment #16)
> The bug was fixed in kubevirt master mid-January.
> Starting 1000 VMs should now work, but will still take a long time (about
> 1000 seconds, so 15+ minutes).
> The fix was not backported to any existing kubevirt release branch, please
> let me know if it should be.

Guy, can we mark this bug as verified for 4.8?
Do you know if this should be backported?

Comment 18 guy chen 2021-04-01 10:48:04 UTC
This will be tested on 4.8 once we get the performance lab I've ordered, hopefully in the next few weeks.
I do not know if this should be backported.

Comment 20 Israel Pinto 2021-05-30 18:57:27 UTC
Verified. 

More info at: https://issues.redhat.com/browse/CNV-10125

Comment 23 errata-xmlrpc 2021-07-27 14:21:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.8.0 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2920

