Description of problem:
KubeVirt currently uses very low, non-configurable QPS values for its Kubernetes client rate limiters. Any operation that leads to mass restarts of VMs can drive the CNV controllers into these QPS limits, which causes launcher pod timeouts while the pod waits for kvm to start. As a consequence, no VMs can be started anymore: each timeout leads to VMI recreation, which in turn puts more pressure on the rate limiter, and so on.

Version-Release number of selected component (if applicable):

How reproducible:
Force-delete a lot of VMIs and wait for the VM controllers to recover. It takes a very long time until at least part of the VMs are running again.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:
We should introduce higher QPS limits, additionally make them configurable, and expose the rate limiter Prometheus metric so that rate limit hits can be monitored.

Additional info:
Applicable to all CNV versions
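For context, a minimal Go sketch of what raising (and making explicit) the client rate limit looks like with client-go. The actual KubeVirt change is in the PRs below; the helper name and the way the values are plumbed in here are purely illustrative. client-go falls back to QPS=5 / Burst=10 when a rest.Config leaves them unset, which is easy to exhaust during a mass VMI restart:

```go
// Illustrative sketch only, not the actual KubeVirt patch.
package ratelimit

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/flowcontrol"
)

// newClientWithQPS builds a clientset whose rate limiter allows `qps`
// requests per second with a burst of `burst`. In a real controller the
// values would come from the (now configurable) KubeVirt configuration
// rather than being hard-coded constants.
func newClientWithQPS(kubeconfig string, qps float32, burst int) (kubernetes.Interface, error) {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return nil, err
	}
	// Either set QPS/Burst directly on the rest.Config...
	cfg.QPS = qps
	cfg.Burst = burst
	// ...or install an explicit token-bucket limiter (which takes priority
	// over QPS/Burst). Holding a reference to the limiter also makes it
	// straightforward to export its state as a Prometheus metric.
	cfg.RateLimiter = flowcontrol.NewTokenBucketRateLimiter(qps, burst)
	return kubernetes.NewForConfig(cfg)
}
```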
https://github.com/kubevirt/kubevirt/pull/5963 and https://github.com/kubevirt/kubevirt/pull/6101
To verify, repeat the scenario from the BZ description.
Verified by comparing the time spent in CNV 4.8.0 and v4.10.0-636 to create 100 VMI replicas. In 4.8, on my environment, it took around *40* seconds after updating the "replicas" value for the first pods to appear on the cluster. In 4.10, with the default qps values, it takes around *10* seconds. Decreasing the qps values in the config increases the processing time, as expected. In my opinion, we can consider this bug fixed.
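For illustration, a small self-contained Go program showing how the QPS setting alone accounts for this kind of timing difference. The token-bucket limiter is the real one from client-go; the request count of 100 is just a stand-in for the pod creations triggered by the replica fan-out, and the qps/burst pairs are example values, not the actual CNV defaults:

```go
// Rough illustration: a token bucket with qps=5/burst=10 admits the first
// 10 calls immediately and then one every 200ms, so 100 calls take roughly
// (100-10)/5 = 18s; with qps=50/burst=100 the same 100 calls are admitted
// almost instantly.
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/flowcontrol"
)

func timeBurst(qps float32, burst, requests int) time.Duration {
	limiter := flowcontrol.NewTokenBucketRateLimiter(qps, burst)
	start := time.Now()
	for i := 0; i < requests; i++ {
		limiter.Accept() // blocks until the bucket has a token
	}
	return time.Since(start)
}

func main() {
	fmt.Println("qps=5,  burst=10: ", timeBurst(5, 10, 100))
	fmt.Println("qps=50, burst=100:", timeBurst(50, 100, 100))
}
```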
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.10.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0947