Bug 1990061 - [virt] CNV Daemonsets have maxUnavailable set to 1 which leads to very slow upgrades on large clusters
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 2.6.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Antonio Cardace
QA Contact: Kedar Bidarkar
URL:
Whiteboard:
Depends On:
Blocks: 1987009 1990267
 
Reported: 2021-08-04 16:27 UTC by Simone Tiraboschi
Modified: 2022-03-16 15:53 UTC
CC List: 23 users

Fixed In Version: virt-launcher-container-v4.8.4-8 hco-bundle-registry-container-v4.8.4-35
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1987009
Environment:
Last Closed: 2022-03-16 15:51:21 UTC
Target Upstream Version:
Embargoed:




Links
Github kubevirt/kubevirt pull 6702 (open): virt-handler canary upgrade (last updated 2021-11-24 13:11:44 UTC)
Red Hat Product Errata RHSA-2022:0947 (last updated 2022-03-16 15:53:09 UTC)

Description Simone Tiraboschi 2021-08-04 16:27:58 UTC
+++ This bug was initially created as a clone of Bug #1987009 +++

*This specifically refers to virt-handler*

As per:
https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#upgrade-and-reconfiguration

please set .spec.updateStrategy.rollingUpdate.maxUnavailable = 10%
on the virt-handler daemonset.
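
For illustration, this is roughly what the requested change looks like if applied by hand with a merge patch (HCO/virt-operator would normally reconcile a manual patch away, so the actual fix has to land in the operator; the openshift-cnv namespace is assumed):

oc patch daemonset virt-handler -n openshift-cnv --type=merge \
  -p '{"spec":{"updateStrategy":{"rollingUpdate":{"maxUnavailable":"10%"}}}}'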

Description of problem:
Currently, all the daemonsets managed by the OpenShift Virtualization Operator default to a maxUnavailable of 1. This means that on large clusters the upgrade of the OpenShift virtualization operator takes a long time.

For example, on a 120 node cluster it took 5.5 hours just for the OpenShift Virtualization operator to upgrade. When customers set aside maintenance windows to upgrade their platform, the OCP platform upgrade itself takes less time than the CNV operator upgrade, so that will be a pain point.

[kni@e16-h18-b03-fc640 ~]$ for i in `oc get ds | grep 10h | awk {'print$1'}`; do echo -n $i; oc get ds/$i -o yaml | grep -i maxunavailable; done
bridge-marker            f:maxUnavailable: {}
      maxUnavailable: 1
hostpath-provisioner            f:maxUnavailable: {}
      maxUnavailable: 1
kube-cni-linux-bridge-plugin            f:maxUnavailable: {}
      maxUnavailable: 1
kubevirt-node-labeller            f:maxUnavailable: {}
      maxUnavailable: 1
nmstate-handler            f:maxUnavailable: {}
      maxUnavailable: 1
ovs-cni-amd64            f:maxUnavailable: {}
      maxUnavailable: 1
virt-handler            f:maxUnavailable: {}
      maxUnavailable: 1

Currently all cluster operators in OCP have a maxUnavailable of at least 10% set. Clayton also recommends this as per https://bugzilla.redhat.com/show_bug.cgi?id=1920209#c14

A couple of options here:
1. Bump maxUnavailable to 10%.
2. Investigate whether any pods in any of the daemonsets do not handle SIGTERM properly and as a result take a while to exit. In that case we should lower the `terminationGracePeriodSeconds` to something like 10s.
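
As a sketch of option 2, the grace period could be shortened with a patch like the one below (the 10s value is just an example, changing the pod template itself triggers a rollout, and the managing operator would likely revert a manual patch on its next reconcile):

oc patch daemonset virt-handler -n openshift-cnv --type=merge \
  -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":10}}}}'
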
Version-Release number of selected component (if applicable):
CNV 2.6.5

How reproducible:
100%

Steps to Reproduce:
1. Deploy large cluster
2. Install CNV
3. Upgrade the CNV operator

Actual results:
Upgrade of CNV on a 120 node cluster takes 5.5 hours

Expected results:
The OpenShift cluster upgrade itself takes around 3 hours on a 120 node cluster, so the CNV operator takes longer than all of OpenShift to upgrade.


Additional info:

--- Additional comment from Dan Kenigsberg on 2021-07-30 14:47:58 CEST ---

Idea for a workaround: use https://docs.openshift.com/container-platform/4.8/virt/install/virt-specifying-nodes-for-virtualization-components.html#node-placement-hco_virt-specifying-nodes-for-virtualization-components to limit cnv daemonsets to the few workers where VMs run. 

This, however, is going to disable knmstate on most nodes, so you may want to revert it after upgrade.

Maybe there's a way to use https://github.com/kubevirt/hyperconverged-cluster-operator/blob/main/docs/cluster-configuration.md#jsonpatch-annotations (requires support exception) to explicitly allow only knmstate everywhere?
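
For reference, a minimal sketch of the node-placement workaround, assuming the HyperConverged CR is kubevirt-hyperconverged in openshift-cnv and that the VM-running workers carry an example label cnv-workload=true (the label name is illustrative):

oc patch hco kubevirt-hyperconverged -n openshift-cnv --type=merge \
  -p '{"spec":{"workloads":{"nodePlacement":{"nodeSelector":{"cnv-workload":"true"}}}}}'

As noted above, this would also pull the networking daemonsets off the other nodes, so it would need to be reverted after the upgrade.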

--- Additional comment from Sai Sindhur Malleni on 2021-07-30 16:33:17 CEST ---

(In reply to Dan Kenigsberg from comment #1)
> Idea for a workaround: use
> https://docs.openshift.com/container-platform/4.8/virt/install/virt-
> specifying-nodes-for-virtualization-components.html#node-placement-hco_virt-
> specifying-nodes-for-virtualization-components to limit cnv daemonsets to
> the few workers where VMs run. 
> 
> This, however, is going to disable knmstate on most nodes, so you may want
> to revert it after upgrade.
> 
> Maybe there's a way to use
> https://github.com/kubevirt/hyperconverged-cluster-operator/blob/main/docs/
> cluster-configuration.md#jsonpatch-annotations (requires support exception)
> to explicitly allow only knmstate everywhere?

Thanks Dan. Sure, later versions of CNV/OCP do support this but when upgrading from 4.6.17 (whatever CNV operator version is on that) -> 4.7.11 this feature is missing. While the workaround will help this case, I think we all agree that we want to make sure the operator itself upgrades quickly enough when deployed at scale, if a customer really wants to use 120 nodes for CNV. So I do believe we can speed this up even when running on 120 nodes.

--- Additional comment from Adam Litke on 2021-08-02 16:18:53 CEST ---

The hostpath-provisioner DS can set maxUnavailable to infinity. The DS only needs to run when the node is running actual workloads.

--- Additional comment from Simone Tiraboschi on 2021-08-02 16:47:07 CEST ---

Currently on our daemonsets I see:

$ oc get daemonset -n openshift-cnv -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.updateStrategy.rollingUpdate.maxUnavailable}{"\n"}{end}'
bridge-marker	1
kube-cni-linux-bridge-plugin	1
nmstate-handler	1
ovs-cni-amd64	1
virt-handler	1


Personally I'm simply for statically setting 10% on each of them, as Clayton Coleman recommends here:
https://bugzilla.redhat.com/show_bug.cgi?id=1920209#c14

Adam, Stu, Petr, Dominik: do you have any specific concerns about that?

--- Additional comment from Petr Horáček on 2021-08-04 11:04:00 CEST ---

I don't have any concerns as none of these network components is critical cluster-wide.

I would be happy to share the work on this. @Simone, would you mind if I cloned this BZ and took over the network components?

--- Additional comment from Dominik Holler on 2021-08-04 13:37:05 CEST ---

SSP is not affected, because SSP does not have any daemon set.

--- Additional comment from Simone Tiraboschi on 2021-08-04 18:24:10 CEST ---

https://github.com/openshift/enhancements/pull/854 got merged, so now the agreement is officially about setting:
.spec.updateStrategy.rollingUpdate.maxUnavailable = 10%
on all of our daemonsets.

> @Simone, would you mind if I cloned this BZ and took over the network components?

Yes, please do.
I'm also going to create bugs for the other affected components.

Comment 1 Simone Tiraboschi 2021-08-05 07:16:59 UTC
Because the fix for this is trivial but the impact on upgrade time is really visible on large clusters,
I'm also asking whether we can consider backporting this down to 2.5.z.

Comment 2 Simone Tiraboschi 2021-08-09 12:20:31 UTC
Thanks, so let's commit for 2.6.7.

Comment 3 Roman Mohr 2021-08-10 12:44:43 UTC
I find 10% pretty high. That means that if virt-handler has issues, we would roll it out in this example on 12 nodes, and we would lose the ability to perform virt-related operations on all of them.
I can understand that one does not want to wait for hours, but on the other hand, thanks to automation and the generally zero-downtime update mechanism, we can play it safe and just let it take its time (it would still be interesting to see where the time is spent). Of course, if it takes too long, it may be hard for a team to perform updates within the timespan of regular shifts.

I can understand that sometimes (e.g. for testing) one would want to speed up the process. Also on very big clusters it may really be too much and having a percentage instead of `1` may make sense, but 10% sounds pretty high. Making it configurable sounds great. It may help in cases where users favour update speed over guaranteed availability.

Before just changing these values: do we have availability guarantees we want to provide, and update-time goals to compare against, so that we can weigh the two against each other?

Comment 4 Roman Mohr 2021-08-10 12:58:48 UTC
What may also speak for a tunable which CNV can potentially adjust to its own needs: since CNV targets pretty well-tested OpenShift versions, the chance of unintended errors may be smaller than on, for example, upstream KubeVirt installations, which may justify allowing higher values.

Comment 5 Simone Tiraboschi 2021-08-12 08:16:53 UTC
This is something that should be discussed at the OpenShift level because, during OCP updates, the product with the most conservative value is going to influence everything else.
It got discussed here:
https://github.com/openshift/enhancements/pull/854

For the same reason, I don't think that having this independently tunable for each specific product is going to add that much value (it will help during upgrades of the single product but not during OCP ones).

Comment 6 Dan Kenigsberg 2021-08-12 09:19:31 UTC
@stirabos the kubevirt CR can expose its own maxUnavailable configurable. Roman can then keep it at its default of `1`, and HCO can set it to `10%`.

In the future, HCO can expose it itself, to let users control this value (`1` for the cautious, `30%` for the hurried).
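
Purely as an illustration of this proposal (no such field exists at this point; the field name and the CR name are hypothetical placeholders, not an actual KubeVirt API):

# Hypothetical sketch of the proposed tunable, not an existing field
oc patch kubevirt <kubevirt-cr-name> -n openshift-cnv --type=merge \
  -p '{"spec":{"configuration":{"maxUnavailable":"10%"}}}'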

Comment 9 Roman Mohr 2021-08-17 15:42:17 UTC
(In reply to Simone Tiraboschi from comment #5)
> This is something that should discussed at Openshift level because, during
> OCP updates, the product with the most conservative value is going to
> influence everything else.
> It got discussed here:
> https://github.com/openshift/enhancements/pull/854
> 

I am not sure this applies to layered components. As far as I know we don't influence the general upgrade time of an OpenShift node.
We update completely independently, after the whole cluster has updated, but maybe I am missing something.

From a KubeVirt perspective we could probably reduce the "kubevirt downtime" per node significantly if we started a virt-handler dummy daemonset first
whose only purpose is to pre-pull the new image.


> For the same reason, I don't think that having this independently tunable
> for each specific product is going to add that much value (it will help
> during upgrades of the single product but not during OCP ones).

Comment 10 Simone Tiraboschi 2021-08-19 10:38:22 UTC
Yes, on the technical side the two upgrades are independent and the user can update OCP only or CNV only, but in the real world I'd expect customers planning maintenance windows to schedule multiple updates at a time.
So I think that having an upgrade time that is somehow comparable with the OpenShift one is still a best practice.

Comment 11 Roman Mohr 2021-08-20 09:37:00 UTC
(In reply to Simone Tiraboschi from comment #10)
> Yes, on the technical side the two upgrades are independent and the user can
> update OCP only and CNV only, but in the real word I'd expect customers
> planning maintenance windows scheduling multiple updates at a time.

I want to emphasize again that during that update window only a single node at a time will have degraded functionality (it is not unavailable at all, workloads continue to work).

> So I think that having an upgrade time that is somehow comparable with
> Openshift one is still a best practice.

Hm, I would say that a best practice would be having as-fast-as-possible and as-secure-as-possible updates. We don't do any pre-checks as far as I know. Also, I am not sure that our downgrades really work reliably. If we are, e.g., halfway through a CNV update and then a non-existing image is referenced by a component, we would render 25 nodes potentially unusable for virt, while in the other case we only talk about a single node.

Therefore I would recommend first also exploring options for reaching a reasonable speed without a potentially huge impact on VM-based services if something goes wrong. Since everything is automated, I don't think that we can expect users by default to have the skill set to immediately know how to fix an issue so that the upgrade can proceed.

Comment 12 David Vossel 2021-08-23 13:56:18 UTC
I agree with Roman that we need an update path for virt-handler that is both as fast and as safe as possible. Here's the concern: there are no pre-flight checks for virt-handler, and losing (even temporarily) 10% of the cluster is a big deal, especially given that we don't have a reliable way to automate recovery from such a situation (that I'm aware of).

Our current behavior of updating one virt-handler at a time means that we're sort of using a single canary pod to gate the update of the rest of the cluster. For example, if the first virt-handler update fails, no more updates will occur until that issue is solved. In that scenario at most one node is impacted.

To speed up updates of virt-handler at scale I'd like to maintain the confidence we get from that initial canary pod update. I see two paths here.

1. Single canary pod update + 10% batched updates afterwards

this would look like
- Set virt-handler's update strategy to "on delete"
- pick one virt-handler out of the cluster, delete it, and verify the new updated virt-handler comes back successfully
- update the rest of the cluster in 10% batches after gaining confidence from canary pod update
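
A rough shell sketch of that flow, purely for illustration (in practice this logic would live in virt-operator, which would also revert manual patches; the pod name below is a placeholder):

# Switch virt-handler to OnDelete so pods are replaced only when explicitly deleted
oc patch daemonset virt-handler -n openshift-cnv --type=json \
  -p '[{"op":"replace","path":"/spec/updateStrategy","value":{"type":"OnDelete"}}]'

# Delete a single virt-handler pod as the canary and wait for its replacement to become Ready
oc delete pod <canary-virt-handler-pod> -n openshift-cnv
oc wait pod -l kubevirt.io=virt-handler -n openshift-cnv --for=condition=Ready --timeout=10m

# Only after the canary comes back healthy, keep deleting the remaining pods in ~10% batches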

2. Force a pre-pull of new virt-handler image on all nodes then perform existing behavior of rolling over 1 handler at a time.

this would look like
- create a dummy daemonset that pre-pulls all virt-handlers across all nodes
- once dummy daemonset is finished rolling out, delete it
- update virt-handler daemonset, pre-pulled virt-handler images should make this faster.
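
A minimal sketch of such a pre-pull daemonset (the name is arbitrary, the image reference is a placeholder for the new virt-handler image, and it assumes the image provides /bin/sh):

oc apply -n openshift-cnv -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: virt-handler-prepull
spec:
  selector:
    matchLabels:
      app: virt-handler-prepull
  template:
    metadata:
      labels:
        app: virt-handler-prepull
    spec:
      containers:
      - name: prepull
        # Placeholder for the new virt-handler image to be pre-pulled on every node
        image: <new-virt-handler-image>
        command: ["/bin/sh", "-c", "sleep infinity"]
EOF

Once it has fully rolled out, it can simply be deleted again (oc delete ds virt-handler-prepull -n openshift-cnv) and the normal one-at-a-time virt-handler rollout proceeds with the images already on the nodes.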


In order to pick a path forward, I'd like to see some research done to determine what actually speeds up our update process. If it's primarily the pulling of new virt-handler images across the cluster, then option 2 is the simplest and safest path forward. If there's more to it than that, option 1 might make more sense.

Comment 13 Fabian Deutsch 2021-08-23 14:32:45 UTC
To me the second option is interesting as it seems so safe.
It also seems that other components of OpenShift (and Kubernetes) have the same issue, e.g. https://access.redhat.com/documentation/en-us/red_hat_codeready_workspaces/2.1/html/administration_guide/caching-images-for-faster-workspace-start_crw, which also recommends pre-pulling certain images.

Comment 19 Simone Tiraboschi 2021-09-02 16:10:32 UTC
On k8s 1.21 we also have .spec.updateStrategy.rollingUpdate.maxSurge in beta:
see https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/1591-daemonset-surge
but I don't know if virt-handler is already able to correctly handle the handoff.
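
For reference, using surge instead of unavailability would look roughly like the patch below. This assumes the cluster enables the DaemonSet surge feature; as far as I understand the DaemonSet API validation, only one of maxSurge/maxUnavailable may be non-zero, hence maxUnavailable is set to 0 here. Whether virt-handler tolerates two instances on one node during the handoff is exactly the open question above.

oc patch daemonset virt-handler -n openshift-cnv --type=merge \
  -p '{"spec":{"updateStrategy":{"rollingUpdate":{"maxUnavailable":0,"maxSurge":"10%"}}}}'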

Comment 23 Simone Tiraboschi 2021-09-20 13:18:48 UTC
Thinking about this again:
the placeholder daemonset to pre-pull images is for sure an optimization,
but it is not going to solve the real issue:
- with maxUnavailable=1 the upgrade time is going to scale linearly with the number of worker nodes in the cluster
- with maxUnavailable=n% the upgrade time is going to be constant regardless of the number of worker nodes in the cluster

So I'm still convinced that this is the real solution.
@Stu: what about simply exposing an API for that on the KubeVirt CR, with a default of 1, and having HCO set it to 10%?

Comment 28 Krzysztof Majcher 2021-12-03 14:11:55 UTC
Any news on the progress?

Comment 29 sgott 2021-12-03 16:03:57 UTC
This is close to being merged in main. We had an LGTM/approve, and then noticed that we might be creating a flaky test. So the architecture and approach have been agreed upon; we just need to fix one remaining detail.

Comment 30 Antonio Cardace 2021-12-13 13:58:39 UTC
Status update: https://github.com/kubevirt/kubevirt/pull/6702 has been merged in main.

Comment 32 sgott 2021-12-22 13:15:25 UTC
To verify, this will likely take a scale test setup. Perform an upgrade and observe that the maxUnavailable field changes to 10% during upgrade (after the upgrade starts).
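
One simple way to watch for this during the upgrade is a polling loop such as (namespace assumed to be openshift-cnv):

while true; do
  oc get daemonset virt-handler -n openshift-cnv \
    -o jsonpath='{.spec.updateStrategy.rollingUpdate.maxUnavailable}{"\n"}'
  sleep 10
done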

Comment 33 Fabian Deutsch 2022-01-25 10:49:14 UTC
In an up-to-20-node cluster, how many nodes are concurrently updated in 4.9 and (after this change) in 4.10?

Comment 34 Antonio Cardace 2022-01-28 14:19:18 UTC
(In reply to Fabian Deutsch from comment #33)
> In an up to 20 node cluster - How many nodes are concurrently updated in 4.9
> and (after this change) in 4.10?

In 4.9 always 1 node at a time.

In 4.10 it's 1 node in a 1-19 node cluster and 2 nodes in a 20 node cluster.

Comment 35 Fabian Deutsch 2022-01-28 16:25:23 UTC
Thanks. Thus in order to verify it we will need a cluster with at least 20 workers.

Comment 36 Kedar Bidarkar 2022-02-10 14:41:53 UTC
1) I had checked with Antonio, who mentioned that the SOURCE_CNV version needs to be 4.10.0 for verifying the bug.

2) Hence decided to make use of the nightly upgrades using the following link, https://access.redhat.com/articles/6070641

3) We used a scale setup with 20 worker nodes and 3 master nodes.

4) OCP version
[kni@f12-h17-b07-5039ms ~]$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-fc.2   True        False         44h     Cluster version is 4.10.0-fc.2

5) Why was this not tested with the very latest CNV build?

a) I used CNV build version 4.10.0-648 as the startingCSV version.

$ oc get subscription -n openshift-cnv hco-operatorhub -o yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
...
spec:
  channel: nightly-4.10
  installPlanApproval: Manual
  name: kubevirt-hyperconverged
  source: cnv-nightly-catalog-source
  sourceNamespace: openshift-marketplace
  startingCSV: kubevirt-hyperconverged-operator.4.10.0-648

b) As this upgrade testing was all about ensuring that virt-handler updates with a maxUnavailable value of 10%,
I had to choose a specific CNV build version (4.10.0-648) for which the virt-handler version would change in the next immediate CNV build version (4.10.0-654).

c) Using the latest CNV build would not have updated virt-handler, as the versions would then have been the same.

6) CNV Installation was successful.

kni@f12-h17-b07-5039ms ~]$ oc get csv -n openshift-cnv 
NAME                                          DISPLAY                    VERSION      REPLACES                                      PHASE
kubevirt-hyperconverged-operator.4.10.0-648   OpenShift Virtualization   4.10.0-648   kubevirt-hyperconverged-operator.4.10.0-646   Succeeded
[kni@f12-h17-b07-5039ms ~]$ oc get hco -n openshift-cnv kubevirt-hyperconverged -o=jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'
ReconcileComplete	True	Reconcile completed successfully
Available	True	Reconcile completed successfully
Progressing	False	Reconcile completed successfully
Degraded	False	Reconcile completed successfully
Upgradeable	True	Reconcile completed successfully

7) Approved the install plan install-ntzqh.

[kni@f12-h17-b07-5039ms ~]$ oc get ip -A
NAMESPACE       NAME            CSV                                           APPROVAL   APPROVED
openshift-cnv   install-744cd   kubevirt-hyperconverged-operator.4.10.0-648   Manual     true
openshift-cnv   install-7pts2   kubevirt-hyperconverged-operator.4.10.0-648   Manual     false
openshift-cnv   install-ntzqh   kubevirt-hyperconverged-operator.4.10.0-654   Manual     false

~]$ oc patch installplan install-ntzqh --namespace=openshift-cnv --type=merge '--patch={"spec":{"approved":true}}'
installplan.operators.coreos.com/install-ntzqh patched
[kni@f12-h17-b07-5039ms ~]$ oc get csv -n openshift-cnv 
NAME                                          DISPLAY                    VERSION      REPLACES                                      PHASE
kubevirt-hyperconverged-operator.4.10.0-648   OpenShift Virtualization   4.10.0-648   kubevirt-hyperconverged-operator.4.10.0-646   Replacing
kubevirt-hyperconverged-operator.4.10.0-654   OpenShift Virtualization   4.10.0-654   kubevirt-hyperconverged-operator.4.10.0-648   InstallReady

8) As seen below, 2 virt-handler pods were updating concurrently.
With maxUnavailable=10%, 10% of 20 nodes is 2.

virt-handler-28p6h                                              1/1     Running    0             49m
virt-handler-4qqgc                                              1/1     Running    0             49m
virt-handler-88k4w                                              1/1     Running    0             49m
virt-handler-9mkhx                                              1/1     Running    0             49m
virt-handler-bzrj5                                              1/1     Running    0             49m
virt-handler-c22bb                                              1/1     Running    0             49m
virt-handler-ckc7b                                              1/1     Running    0             49m
virt-handler-fbd85                                              1/1     Running    0             49m
virt-handler-gk5p7                                              1/1     Running    0             49m
virt-handler-l8rsn                                              1/1     Running    0             49m
virt-handler-ln7hn                                              1/1     Running    0             49m
virt-handler-pg8ww                                              1/1     Running    0             49m
virt-handler-pzhjr                                              1/1     Running    0             49m
virt-handler-trlt6                                              1/1     Running    0             49m
virt-handler-vttzv                                              1/1     Running    0             49m
virt-handler-wmnzt                                              1/1     Running    0             52s
virt-handler-wnrxp                                              1/1     Running    0             49m
virt-handler-wzcjm                                              0/1     Init:0/1   0             6s
virt-handler-x7gmv                                              0/1     Running    0             11s
virt-handler-zrgsr                                              1/1     Running    0             49m

------------------------------------------------------
virt-handler-28p6h                                              1/1     Running   0             50m
virt-handler-4qqgc                                              1/1     Running   0             50m
virt-handler-88k4w                                              1/1     Running   0             50m
virt-handler-9mkhx                                              1/1     Running   0             50m
virt-handler-b6d86                                              0/1     Running   0             26s
virt-handler-bzrj5                                              1/1     Running   0             50m
virt-handler-c22bb                                              1/1     Running   0             50m
virt-handler-ckc7b                                              1/1     Running   0             50m
virt-handler-fbd85                                              1/1     Running   0             50m
virt-handler-gk5p7                                              1/1     Running   0             50m
virt-handler-gndcx                                              0/1     Running   0             31s
virt-handler-l8rsn                                              1/1     Running   0             50m
virt-handler-ln7hn                                              1/1     Running   0             50m
virt-handler-pg8ww                                              1/1     Running   0             50m
virt-handler-vttzv                                              1/1     Running   0             50m
virt-handler-wmnzt                                              1/1     Running   0             113s
virt-handler-wnrxp                                              1/1     Running   0             50m
virt-handler-wzcjm                                              1/1     Running   0             67s
virt-handler-x7gmv                                              1/1     Running   0             72s
virt-handler-zrgsr                                              1/1     Running   0             50m



[kni@f12-h17-b07-5039ms ~]$ oc get daemonset virt-handler -o json | jq '.spec.updateStrategy.rollingUpdate.maxUnavailable'
1
[kni@f12-h17-b07-5039ms ~]$ oc get daemonset virt-handler -o json | jq '.spec.updateStrategy.rollingUpdate.maxUnavailable'
1
[kni@f12-h17-b07-5039ms ~]$ oc get daemonset virt-handler -o json | jq '.spec.updateStrategy.rollingUpdate.maxUnavailable'
"10%"
[kni@f12-h17-b07-5039ms ~]$ oc get daemonset virt-handler -o json | jq '.spec.updateStrategy.rollingUpdate.maxUnavailable'
"10%"
[kni@f12-h17-b07-5039ms ~]$ oc get daemonset virt-handler -o json | jq '.spec.updateStrategy.rollingUpdate.maxUnavailable'
1
[kni@f12-h17-b07-5039ms ~]$ oc get daemonset virt-handler -o json | jq '.spec.updateStrategy.rollingUpdate.maxUnavailable'
1


[kni@f12-h17-b07-5039ms ~]$ oc get csv -n openshift-cnv 
NAME                                          DISPLAY                    VERSION      REPLACES                                      PHASE
kubevirt-hyperconverged-operator.4.10.0-654   OpenShift Virtualization   4.10.0-654   kubevirt-hyperconverged-operator.4.10.0-648   Succeeded

-------------------------------------------------------

]$ oc get hyperconverged kubevirt-hyperconverged -n openshift-cnv -o json | jq '.status.versions'
[
  {
    "name": "operator",
    "version": "4.10.0"
  }
]

[kni@f12-h17-b07-5039ms ~]$ oc get pods -n openshift-cnv | grep hco-operator
hco-operator-7b668bfc7f-lh9hj                                   1/1     Running   0             12m

~]$ oc get hco -n openshift-cnv kubevirt-hyperconverged -o=jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'
ReconcileComplete	True	Reconcile completed successfully
Available	True	Reconcile completed successfully
Progressing	False	Reconcile completed successfully
Degraded	False	Reconcile completed successfully
Upgradeable	True	Reconcile completed successfully


9) The total time taken for the virt-handler upgrade with 20 worker nodes was approximately 7m 48s.


Summary:
1) As seen above, 2 virt-handler pods were updating concurrently.
2) The virt-handler DaemonSet's maxUnavailable value was updated to 10% during the upgrade, as we had 20 worker nodes.
3) After the upgrade, the virt-handler DaemonSet's maxUnavailable value reverted to 1.

VERIFIED upgrading from 4.10.0-648 to 4.10.0-654 (using nightly builds).

Comment 39 errata-xmlrpc 2022-03-16 15:51:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.10.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0947

