+++ This bug was initially created as a clone of Bug #1987009 +++

*This specifically refers to virt-handler*

As per https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#upgrade-and-reconfiguration, please set .spec.updateStrategy.rollingUpdate.maxUnavailable = 10% on the virt-handler daemonset.

Description of problem:
Currently, all the daemonsets managed by the OpenShift Virtualization Operator default to a maxUnavailable of 1. This means that on large clusters the upgrade of the OpenShift Virtualization operator takes a long time. For example, on a 120-node cluster it took 5.5 hours just for the OpenShift Virtualization operator to upgrade. When customers set aside maintenance windows to upgrade their platform, the OCP platform upgrade itself takes less time than the CNV operator upgrade, so that will be a pain point.

[kni@e16-h18-b03-fc640 ~]$ for i in `oc get ds | grep 10h | awk {'print$1'}`; do echo -n $i; oc get ds/$i -o yaml | grep -i maxunavailable; done
bridge-marker          f:maxUnavailable: {}    maxUnavailable: 1
hostpath-provisioner   f:maxUnavailable: {}    maxUnavailable: 1
kube-cni-linux-bridge-plugin   f:maxUnavailable: {}    maxUnavailable: 1
kubevirt-node-labeller f:maxUnavailable: {}    maxUnavailable: 1
nmstate-handler        f:maxUnavailable: {}    maxUnavailable: 1
ovs-cni-amd64          f:maxUnavailable: {}    maxUnavailable: 1
virt-handler           f:maxUnavailable: {}    maxUnavailable: 1

Currently all cluster operators in OCP have a maxUnavailable of at least 10% set. Clayton also recommends this, per https://bugzilla.redhat.com/show_bug.cgi?id=1920209#c14

A couple of options here:
1. Bump maxUnavailable to 10%.
2. Investigate whether any pods in any of the daemonsets do not handle SIGTERM properly and as a result take a while to exit. In that case we should lower the `terminationGracePeriodSeconds` to something like 10s.

Version-Release number of selected component (if applicable):
CNV 2.6.5

How reproducible:
100%

Steps to Reproduce:
1. Deploy large cluster
2. Install CNV
3.
Upgrade CNV operator

Actual results:
Upgrade of CNV on a 120-node cluster takes 5.5 hours.

Expected results:
The OpenShift cluster upgrade itself takes around 3 hours on a 120-node cluster, so the CNV operator takes longer than all of OpenShift to upgrade.

Additional info:

--- Additional comment from Dan Kenigsberg on 2021-07-30 14:47:58 CEST ---

Idea for a workaround: use https://docs.openshift.com/container-platform/4.8/virt/install/virt-specifying-nodes-for-virtualization-components.html#node-placement-hco_virt-specifying-nodes-for-virtualization-components to limit CNV daemonsets to the few workers where VMs run.

This, however, is going to disable knmstate on most nodes, so you may want to revert it after the upgrade.

Maybe there's a way to use https://github.com/kubevirt/hyperconverged-cluster-operator/blob/main/docs/cluster-configuration.md#jsonpatch-annotations (requires a support exception) to explicitly allow only knmstate everywhere?

--- Additional comment from Sai Sindhur Malleni on 2021-07-30 16:33:17 CEST ---

(In reply to Dan Kenigsberg from comment #1)
> Idea for a workaround: use
> https://docs.openshift.com/container-platform/4.8/virt/install/virt-specifying-nodes-for-virtualization-components.html#node-placement-hco_virt-specifying-nodes-for-virtualization-components
> to limit cnv daemonsets to the few workers where VMs run.
>
> This, however, is going to disable knmstate on most nodes, so you may want
> to revert it after upgrade.
>
> Maybe there's a way to use
> https://github.com/kubevirt/hyperconverged-cluster-operator/blob/main/docs/cluster-configuration.md#jsonpatch-annotations
> (requires support exception) to explicitly allow only knmstate everywhere?

Thanks, Dan. Sure, later versions of CNV/OCP do support this, but when upgrading from 4.6.17 (whatever CNV operator version is on that) -> 4.7.11 this feature is missing.
While the workaround will help in this case, I think we all agree that we want to make sure the operator itself upgrades quickly enough when deployed at scale, if a customer really wants to use 120 nodes for CNV. So I do believe we can speed this up even when running on 120 nodes.

--- Additional comment from Adam Litke on 2021-08-02 16:18:53 CEST ---

The hostpath-provisioner DS can set maxUnavailable to infinity. The DS only needs to run when the node is running actual workloads.

--- Additional comment from Simone Tiraboschi on 2021-08-02 16:47:07 CEST ---

Currently on our daemonsets I see:

$ oc get daemonset -n openshift-cnv -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.updateStrategy.rollingUpdate.maxUnavailable}{"\n"}{end}'
bridge-marker	1
kube-cni-linux-bridge-plugin	1
nmstate-handler	1
ovs-cni-amd64	1
virt-handler	1

Personally, I'm simply for statically setting 10% on each of them, as Clayton Coleman recommends here: https://bugzilla.redhat.com/show_bug.cgi?id=1920209#c14

Adam, Stu, Petr, Dominik: do you have any specific concerns about that?

--- Additional comment from Petr Horáček on 2021-08-04 11:04:00 CEST ---

I don't have any concerns, as none of these network components is critical cluster-wide. I would be happy to share the work on this. @Simone, would you mind if I cloned this BZ and took over the network components?

--- Additional comment from Dominik Holler on 2021-08-04 13:37:05 CEST ---

SSP is not affected, because SSP does not have any daemonset.

--- Additional comment from Simone Tiraboschi on 2021-08-04 18:24:10 CEST ---

https://github.com/openshift/enhancements/pull/854 got merged, so now the agreement is officially about setting:
.spec.updateStrategy.rollingUpdate.maxUnavailable = 10%
on all of our daemonsets.

> @Simone, would you mind if I cloned this BZ and took over the network components?

Yes, please do. I'm going to create bugs for the other affected components as well.
Because the fix for this is trivial but the impact on upgrade time is really visible on large clusters, I'm also asking whether we can consider backporting this down to 2.5.z.
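For reference, the agreed-upon change amounts to a one-field edit per DaemonSet. A minimal sketch of the resulting spec (only the relevant fields shown, everything else elided):

```yaml
# Sketch of the requested setting on the virt-handler DaemonSet.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: virt-handler
  namespace: openshift-cnv
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 10%   # was 1: at most ~10% of daemon pods down during a rollout
```

Note that since these DaemonSets are operator-managed, patching them by hand would most likely be reverted by the operator's reconcile loop; the change has to land in the operator itself.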
Thanks, so let's commit for 2.6.7.
I find 10% pretty high. That means that if virt-handler has issues, we would roll it out in this example on 12 nodes, and we would lose the ability to do virt-related operations on all of them.

I can understand that one does not want to wait for hours, but on the other hand, thanks to automation and the general zero-downtime update mechanism, we can play it safe and let it just take its time (it would still be interesting to see where we spend the time). Of course, if it takes too long, it may be hard for a team to perform updates within the timespan of regular shifts.

I can understand that sometimes (e.g. for testing) one would want to speed up the process. Also, on very big clusters it may really be too much, and having a percentage instead of `1` may make sense, but 10% sounds pretty high.

Making it configurable sounds great. It may help in cases where users favour update speed over guaranteed availability.

Before just changing these values, do we have some availability guarantees which we want to provide, as well as update-time goals, that we can compare to see what our goals are and how we can weigh them?
What may also speak for a tunable which CNV can change to fit its own needs: since CNV targets fairly well-tested OpenShift versions, the chance of unintended errors may be smaller than on, e.g., upstream KubeVirt installations, which may be a justification for allowing higher values.
This is something that should be discussed at the OpenShift level because, during OCP updates, the product with the most conservative value is going to influence everything else. It got discussed here: https://github.com/openshift/enhancements/pull/854

For the same reason, I don't think that having this independently tunable for each specific product is going to add that much value (it will help during upgrades of the single product, but not during OCP ones).
@stirabos The KubeVirt CR can expose its own maxUnavailable configurable. Roman can then keep its default of `1`, and HCO can set it to `10%`. In the future, HCO can expose it itself, to let users control this value (`1` for the cautious, `30%` for the hurried).
(In reply to Simone Tiraboschi from comment #5)
> This is something that should be discussed at the OpenShift level because,
> during OCP updates, the product with the most conservative value is going
> to influence everything else.
> It got discussed here:
> https://github.com/openshift/enhancements/pull/854

I am not sure this applies to layered components. As far as I know, we don't influence the general upgrade time of an OpenShift node. We update completely independently, after the whole cluster has updated, but maybe I'm missing something.

From a KubeVirt perspective, we can probably reduce the "kubevirt downtime" per node significantly if we were to first start a virt-handler dummy daemonset whose only purpose is pre-pulling the new image.

> For the same reason, I don't think that having this independently tunable
> for each specific product is going to add that much value (it will help
> during upgrades of the single product but not during OCP ones).
Yes, on the technical side the two upgrades are independent and the user can update OCP only or CNV only, but in the real world I'd expect customers planning maintenance windows to schedule multiple updates at a time. So I think that having an upgrade time that is somehow comparable with the OpenShift one is still a best practice.
(In reply to Simone Tiraboschi from comment #10)
> Yes, on the technical side the two upgrades are independent and the user can
> update OCP only and CNV only, but in the real world I'd expect customers
> planning maintenance windows scheduling multiple updates at a time.

I want to emphasize again that during that update window only a single node at a time will have degraded functionality (it is not unavailable at all; workloads continue to work).

> So I think that having an upgrade time that is somehow comparable with
> Openshift one is still a best practice.

Hm, I would say that a best practice would be having as-fast-as-possible and as-safe-as-possible updates. We don't do any pre-checks, as far as I know. Also, I am not sure that our downgrades work reliably. If we are, e.g., halfway through a CNV update and then a non-existing image is referenced by a component, we would render 25 nodes potentially unusable for virt, while in the other case we only talk about a single node.

Therefore I would recommend first also exploring options for reaching a reasonable speed without having a potentially huge impact on VM-based services if something goes wrong. Since everything is automated, I don't think that we can expect users by default to have the skill set to know immediately how to fix an issue so that the upgrade can proceed.
I agree with Roman that we need as fast and safe an update path as possible for virt-handler.

Here's the concern. There are no pre-flight checks for virt-handler, and losing (even temporarily) 10% of the cluster is a big deal, especially given that we don't have a reliable way to automate recovery from such a situation (that I'm aware of).

Our current behavior of updating one virt-handler at a time means that we're sort of using a single canary pod to gate the update of the rest of the cluster. For example, if the first virt-handler update fails, no more updates will occur until that issue is solved. In that scenario, at most one node is impacted.

To speed up updates of virt-handler at scale, I'd like to maintain the confidence we get from that initial canary pod update. I see two paths here.

1. Single canary pod update + 10% batched updates afterwards. This would look like:
- Set virt-handler's update strategy to "on delete"
- Pick one virt-handler out of the cluster, delete it, and verify the new updated virt-handler comes back successfully
- Update the rest of the cluster in 10% batches after gaining confidence from the canary pod update

2. Force a pre-pull of the new virt-handler image on all nodes, then perform the existing behavior of rolling over one handler at a time. This would look like:
- Create a dummy daemonset that pre-pulls the new virt-handler image across all nodes
- Once the dummy daemonset has finished rolling out, delete it
- Update the virt-handler daemonset; the pre-pulled virt-handler images should make this faster

In order to pick a path forward, I'd like to see some research done to determine what would actually speed up our update process. If it's primarily the pulling of new virt-handler images across the cluster, then option 2 is the simplest and safest path forward. If there's more to it than that, option 1 might make more sense.
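Option 2 above can be prototyped with a small throwaway DaemonSet. The sketch below is purely illustrative (the name and the image reference are placeholders, not what any eventual KubeVirt patch does): the init container forces the kubelet to pull the target image, and a tiny pause container keeps the pod Running so rollout completion is observable.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: virt-handler-prepull   # hypothetical name for the dummy daemonset
  namespace: openshift-cnv
spec:
  selector:
    matchLabels:
      app: virt-handler-prepull
  template:
    metadata:
      labels:
        app: virt-handler-prepull
    spec:
      initContainers:
      - name: prepull
        # Placeholder: point this at the *new* virt-handler image.
        image: TARGET_VIRT_HANDLER_IMAGE
        command: ["/bin/true"]            # exit immediately; the image pull is the whole point
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9  # keeps the pod Running so the DS reports readiness
```

Once `oc rollout status ds/virt-handler-prepull` completes, the dummy DaemonSet can be deleted and the real virt-handler rollout proceeds against warm image caches.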
To me the second option is interesting, as it seems very safe. It also seems that other components of OpenShift (and Kubernetes) have the same issue; e.g. https://access.redhat.com/documentation/en-us/red_hat_codeready_workspaces/2.1/html/administration_guide/caching-images-for-faster-workspace-start_crw also recommends pre-pulling certain images.
On k8s 1.21 we also have .spec.updateStrategy.rollingUpdate.maxSurge in beta: https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/1591-daemonset-surge but I don't know if virt-handler is already able to correctly handle the handoff.
Thinking about this again, the placeholder daemonset to pre-pull images is certainly an optimization, but it is not going to solve the real issue:
- with maxUnavailable=1, the upgrade time is going to scale linearly with the number of worker nodes in the cluster
- with maxUnavailable=n%, the upgrade time is going to be constant regardless of the number of worker nodes in the cluster

So I'm still convinced that this is the real solution.

@Stu: what about simply exposing an API for that on the KubeVirt CR, with a default of 1, and having HCO set 10% there?
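The scaling argument can be illustrated with plain shell arithmetic. This is a back-of-the-envelope model, not KubeVirt code; it assumes percentage batches round down with a floor of one node, consistent with the per-node numbers quoted later in this bug.

```shell
# rounds <nodes> <maxUnavailable> prints how many sequential update batches
# a rollout needs; maxUnavailable may be an absolute number or a percentage.
rounds() {
  nodes=$1; mu=$2
  case "$mu" in
    *%)
      batch=$(( nodes * ${mu%\%} / 100 ))    # e.g. 120 * 10 / 100 = 12
      if [ "$batch" -lt 1 ]; then batch=1; fi  # never less than one node per batch
      ;;
    *)
      batch=$mu
      ;;
  esac
  echo $(( (nodes + batch - 1) / batch ))    # ceil(nodes / batch)
}

rounds 120 1     # prints 120: batch count grows linearly with cluster size
rounds 120 10%   # prints 10: batch count stays constant as the cluster grows
rounds 500 10%   # prints 10
```

With maxUnavailable=1 the number of batches equals the number of nodes, so upgrade time scales linearly; with a percentage it stays roughly constant.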
Any news on the progress?
This is close to being merged into main. We had an LGTM/approve, and then noticed that we might be creating a flaky test. So the architecture and approach have been agreed upon; we just need to fix one remaining detail.
Status update: https://github.com/kubevirt/kubevirt/pull/6702 has been merged into main.
To verify, this will likely take a scale test setup. Perform an upgrade and observe that the maxUnavailable field changes to 10% during upgrade (after the upgrade starts).
In an up-to-20-node cluster, how many nodes are concurrently updated in 4.9 and (after this change) in 4.10?
(In reply to Fabian Deutsch from comment #33)
> In an up to 20 node cluster - How many nodes are concurrently updated in 4.9
> and (after this change) in 4.10?

In 4.9, always 1 node at a time. In 4.10, it's 1 node in a 1-19-node cluster and 2 nodes in a 20-node cluster.
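Those numbers are consistent with taking 10% of the node count, rounding down, with a floor of one node. A quick check in plain shell (a hedged reading inferred from this comment, not taken from the KubeVirt source):

```shell
# batch_size <nodes> prints how many virt-handler pods may be updated
# concurrently under maxUnavailable=10%, assuming round-down with a floor of 1.
batch_size() {
  b=$(( $1 / 10 ))
  if [ "$b" -lt 1 ]; then b=1; fi
  echo "$b"
}

batch_size 19    # prints 1: 1-19-node clusters still update one node at a time
batch_size 20    # prints 2
batch_size 120   # prints 12: the 120-node cluster from the original report
```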
Thanks. Thus in order to verify it we will need a cluster with at least 20 workers.
1) Checked with Antonio, who mentioned that the SOURCE_CNV version needs to be 4.10.0 for verifying the bug.
2) Hence decided to make use of the nightly upgrades using the following link: https://access.redhat.com/articles/6070641
3) We used a scale setup with 20 worker nodes and 3 master nodes.
4) OCP version:

[kni@f12-h17-b07-5039ms ~]$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-fc.2   True        False         44h     Cluster version is 4.10.0-fc.2

5) Why this was not tested with the very latest CNV build:
a) I used CNV build version 4.10.0-648 as the startingCSV version.

$ oc get subscription -n openshift-cnv hco-operatorhub -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  ...
spec:
  channel: nightly-4.10
  installPlanApproval: Manual
  name: kubevirt-hyperconverged
  source: cnv-nightly-catalog-source
  sourceNamespace: openshift-marketplace
  startingCSV: kubevirt-hyperconverged-operator.4.10.0-648

b) As this upgrade testing was all about ensuring virt-handler updates with a maxUnavailable value of 10%, I had to choose a specific CNV build version (4.10.0-648) for which the virt-handler version would change in the next immediate CNV build version (4.10.0-654).
c) Using the latest version of the CNV build would not have updated virt-handler, as the versions would then have been the same.
6) CNV installation was successful.
[kni@f12-h17-b07-5039ms ~]$ oc get csv -n openshift-cnv
NAME                                          DISPLAY                    VERSION      REPLACES                                      PHASE
kubevirt-hyperconverged-operator.4.10.0-648   OpenShift Virtualization   4.10.0-648   kubevirt-hyperconverged-operator.4.10.0-646   Succeeded

[kni@f12-h17-b07-5039ms ~]$ oc get hco -n openshift-cnv kubevirt-hyperconverged -o=jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'
ReconcileComplete	True	Reconcile completed successfully
Available	True	Reconcile completed successfully
Progressing	False	Reconcile completed successfully
Degraded	False	Reconcile completed successfully
Upgradeable	True	Reconcile completed successfully

7) Approved the install plan install-ntzqh:

[kni@f12-h17-b07-5039ms ~]$ oc get ip -A
NAMESPACE       NAME            CSV                                           APPROVAL   APPROVED
openshift-cnv   install-744cd   kubevirt-hyperconverged-operator.4.10.0-648   Manual     true
openshift-cnv   install-7pts2   kubevirt-hyperconverged-operator.4.10.0-648   Manual     false
openshift-cnv   install-ntzqh   kubevirt-hyperconverged-operator.4.10.0-654   Manual     false

~]$ oc patch installplan install-ntzqh --namespace=openshift-cnv --type=merge '--patch={"spec":{"approved":true}}'
installplan.operators.coreos.com/install-ntzqh patched

[kni@f12-h17-b07-5039ms ~]$ oc get csv -n openshift-cnv
NAME                                          DISPLAY                    VERSION      REPLACES                                      PHASE
kubevirt-hyperconverged-operator.4.10.0-648   OpenShift Virtualization   4.10.0-648   kubevirt-hyperconverged-operator.4.10.0-646   Replacing
kubevirt-hyperconverged-operator.4.10.0-654   OpenShift Virtualization   4.10.0-654   kubevirt-hyperconverged-operator.4.10.0-648   InstallReady

8) As seen below, 2 virt-handler pods were updating concurrently. Considering maxUnavailable=10%, for 20 nodes 10% is 2.
virt-handler-28p6h   1/1   Running    0   49m
virt-handler-4qqgc   1/1   Running    0   49m
virt-handler-88k4w   1/1   Running    0   49m
virt-handler-9mkhx   1/1   Running    0   49m
virt-handler-bzrj5   1/1   Running    0   49m
virt-handler-c22bb   1/1   Running    0   49m
virt-handler-ckc7b   1/1   Running    0   49m
virt-handler-fbd85   1/1   Running    0   49m
virt-handler-gk5p7   1/1   Running    0   49m
virt-handler-l8rsn   1/1   Running    0   49m
virt-handler-ln7hn   1/1   Running    0   49m
virt-handler-pg8ww   1/1   Running    0   49m
virt-handler-pzhjr   1/1   Running    0   49m
virt-handler-trlt6   1/1   Running    0   49m
virt-handler-vttzv   1/1   Running    0   49m
virt-handler-wmnzt   1/1   Running    0   52s
virt-handler-wnrxp   1/1   Running    0   49m
virt-handler-wzcjm   0/1   Init:0/1   0   6s
virt-handler-x7gmv   0/1   Running    0   11s
virt-handler-zrgsr   1/1   Running    0   49m
------------------------------------------------------
virt-handler-28p6h   1/1   Running    0   50m
virt-handler-4qqgc   1/1   Running    0   50m
virt-handler-88k4w   1/1   Running    0   50m
virt-handler-9mkhx   1/1   Running    0   50m
virt-handler-b6d86   0/1   Running    0   26s
virt-handler-bzrj5   1/1   Running    0   50m
virt-handler-c22bb   1/1   Running    0   50m
virt-handler-ckc7b   1/1   Running    0   50m
virt-handler-fbd85   1/1   Running    0   50m
virt-handler-gk5p7   1/1   Running    0   50m
virt-handler-gndcx   0/1   Running    0   31s
virt-handler-l8rsn   1/1   Running    0   50m
virt-handler-ln7hn   1/1   Running    0   50m
virt-handler-pg8ww   1/1   Running    0   50m
virt-handler-vttzv   1/1   Running    0   50m
virt-handler-wmnzt   1/1   Running    0   113s
virt-handler-wnrxp   1/1   Running    0   50m
virt-handler-wzcjm   1/1   Running    0   67s
virt-handler-x7gmv   1/1   Running    0   72s
virt-handler-zrgsr   1/1   Running    0   50m

[kni@f12-h17-b07-5039ms ~]$ oc get daemonset virt-handler -o json | jq '.spec.updateStrategy.rollingUpdate.maxUnavailable'
1
[kni@f12-h17-b07-5039ms ~]$ oc get daemonset virt-handler -o json | jq '.spec.updateStrategy.rollingUpdate.maxUnavailable'
1
[kni@f12-h17-b07-5039ms ~]$ oc get daemonset virt-handler -o json | jq '.spec.updateStrategy.rollingUpdate.maxUnavailable'
"10%"
[kni@f12-h17-b07-5039ms ~]$ oc get daemonset virt-handler -o json | jq
'.spec.updateStrategy.rollingUpdate.maxUnavailable'
"10%"
[kni@f12-h17-b07-5039ms ~]$ oc get daemonset virt-handler -o json | jq '.spec.updateStrategy.rollingUpdate.maxUnavailable'
1
[kni@f12-h17-b07-5039ms ~]$ oc get daemonset virt-handler -o json | jq '.spec.updateStrategy.rollingUpdate.maxUnavailable'
1
[kni@f12-h17-b07-5039ms ~]$ oc get csv -n openshift-cnv
NAME                                          DISPLAY                    VERSION      REPLACES                                      PHASE
kubevirt-hyperconverged-operator.4.10.0-654   OpenShift Virtualization   4.10.0-654   kubevirt-hyperconverged-operator.4.10.0-648   Succeeded
-------------------------------------------------------
]$ oc get hyperconverged kubevirt-hyperconverged -n openshift-cnv -o json | jq '.status.versions'
[
  {
    "name": "operator",
    "version": "4.10.0"
  }
]
[kni@f12-h17-b07-5039ms ~]$ oc get pods -n openshift-cnv | grep hco-operator
hco-operator-7b668bfc7f-lh9hj   1/1   Running   0   12m
~]$ oc get hco -n openshift-cnv kubevirt-hyperconverged -o=jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'
ReconcileComplete	True	Reconcile completed successfully
Available	True	Reconcile completed successfully
Progressing	False	Reconcile completed successfully
Degraded	False	Reconcile completed successfully
Upgradeable	True	Reconcile completed successfully

9) Total time taken for the virt-handler upgrade with 20 worker nodes is approximately 7m48s.

Summary:
1) As seen above, 2 virt-handler pods were updating concurrently.
2) The virt-handler DaemonSet value was updated to maxUnavailable=10%, as we had 20 worker nodes.
3) After the upgrade, the virt-handler DaemonSet value reverted to maxUnavailable=1.

VERIFIED upgrading from 4.10.0-648 to 4.10.0-654 (using nightly builds)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.10.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0947