Bug 1987009 - [tracker] CNV Daemonsets have maxUnavailable set to 1 which leads to very slow upgrades on large clusters
Summary: [tracker] CNV Daemonsets have maxUnavailable set to 1 which leads to very slow upgrades on large clusters
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Installation
Version: 2.6.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Simone Tiraboschi
QA Contact:
URL:
Whiteboard:
Depends On: 1990061 1990063 1990065 1991906 1994489
Blocks: 1990267
 
Reported: 2021-07-28 17:37 UTC by Sai Sindhur Malleni
Modified: 2023-09-15 01:21 UTC
CC List: 17 users

Fixed In Version: hco-bundle-registry-container-v4.10.0-519
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1990061 1990063 1990065 1990267
Environment:
Last Closed: 2022-03-16 15:51:21 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Issue Tracker CNV-13241 (last updated 2023-09-15 01:14:50 UTC)
Red Hat Product Errata RHSA-2022:0947 (last updated 2022-03-16 15:53:09 UTC)

Description Sai Sindhur Malleni 2021-07-28 17:37:36 UTC
Description of problem:
Currently, all the daemonsets managed by the OpenShift Virtualization Operator default to a maxUnavailable of 1. This means that on large clusters the upgrade of the OpenShift virtualization operator takes a long time.

For example, on a 120 node cluster it took 5.5 hours just for the OpenShift Virtualization operator to upgrade. When customers set aside maintenance windows to upgrade their platform, the OCP platform upgrade itself takes less time than the CNV operator upgrade, so that will be a pain point.

[kni@e16-h18-b03-fc640 ~]$ for i in `oc get ds | grep 10h | awk {'print$1'}`; do echo -n $i; oc get ds/$i -o yaml | grep -i maxunavailable; done
bridge-marker            f:maxUnavailable: {}
      maxUnavailable: 1
hostpath-provisioner            f:maxUnavailable: {}
      maxUnavailable: 1
kube-cni-linux-bridge-plugin            f:maxUnavailable: {}
      maxUnavailable: 1
kubevirt-node-labeller            f:maxUnavailable: {}
      maxUnavailable: 1
nmstate-handler            f:maxUnavailable: {}
      maxUnavailable: 1
ovs-cni-amd64            f:maxUnavailable: {}
      maxUnavailable: 1
virt-handler            f:maxUnavailable: {}
      maxUnavailable: 1

Currently all cluster operators in OCP have a maxUnavailable of at least 10% set. Clayton also recommends this as per https://bugzilla.redhat.com/show_bug.cgi?id=1920209#c14

A couple of options here: 
1. Bump maxUnavailable to 10% (see the sketch below)
2. Investigate whether any pods in any of the daemonsets do not handle SIGTERM properly and as a result take a while to exit. In that case we should lower the `terminationGracePeriodSeconds` to something like 10s.
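
For illustration only, a minimal sketch of what option 1 would look like on a single daemonset such as virt-handler. Note that these daemonsets are operator-managed, so a manual patch like this is likely to be reconciled away; the real fix belongs in the operators:

$ oc patch daemonset virt-handler -n openshift-cnv --type=merge \
    -p '{"spec":{"updateStrategy":{"rollingUpdate":{"maxUnavailable":"10%"}}}}'

For option 2, the current grace period of a daemonset's pod template can be checked with:

$ oc get daemonset virt-handler -n openshift-cnv -o=jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}'
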
Version-Release number of selected component (if applicable):
CNV 2.6.5

How reproducible:
100%

Steps to Reproduce:
1. Deploy large cluster
2. Install CNV
3. Upgrade the CNV operator

Actual results:
Upgrade of CNV on a 120 node cluster takes 5.5 hours

Expected results:
The OpenShift cluster upgrade itself takes around 3 hours on a 120 node cluster, so the CNV operator should not take longer to upgrade than all of OpenShift.


Additional info:

Comment 1 Dan Kenigsberg 2021-07-30 12:47:58 UTC
Idea for a workaround: use https://docs.openshift.com/container-platform/4.8/virt/install/virt-specifying-nodes-for-virtualization-components.html#node-placement-hco_virt-specifying-nodes-for-virtualization-components to limit cnv daemonsets to the few workers where VMs run. 

This, however, is going to disable knmstate on most nodes, so you may want to revert it after upgrade.

Maybe there's a way to use https://github.com/kubevirt/hyperconverged-cluster-operator/blob/main/docs/cluster-configuration.md#jsonpatch-annotations (requires support exception) to explicitly allow only knmstate everywhere?
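
A minimal sketch of the nodePlacement part of that workaround, assuming the default HyperConverged CR name and namespace and a hypothetical label applied to the few workers that actually run VMs:

apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  workloads:
    nodePlacement:
      nodeSelector:
        vm-workloads: "true"   # hypothetical label; apply it to the VM-running workers beforehand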

Comment 2 Sai Sindhur Malleni 2021-07-30 14:33:17 UTC
(In reply to Dan Kenigsberg from comment #1)
> Idea for a workaround: use
> https://docs.openshift.com/container-platform/4.8/virt/install/virt-
> specifying-nodes-for-virtualization-components.html#node-placement-hco_virt-
> specifying-nodes-for-virtualization-components to limit cnv daemonsets to
> the few workers where VMs run. 
> 
> This, however, is going to disable knmstate on most nodes, so you may want
> to revert it after upgrade.
> 
> Maybe there's a way to use
> https://github.com/kubevirt/hyperconverged-cluster-operator/blob/main/docs/
> cluster-configuration.md#jsonpatch-annotations (requires support exception)
> to explicitly allow only knmstate everywhere?

Thanks Dan. Sure, later versions of CNV/OCP do support this, but when upgrading from 4.6.17 (whatever CNV operator version ships with that) -> 4.7.11 this feature is missing. While the workaround will help in this case, I think we all agree that we want to make sure the operator itself upgrades quickly enough when deployed at scale, if a customer really wants to use 120 nodes for CNV. So I do believe we can speed this up even when running on 120 nodes.

Comment 3 Adam Litke 2021-08-02 14:18:53 UTC
The hostpath-provisioner DS can set maxUnavailable to infinity. The DS only needs to run when the node is running actual workloads.

Comment 4 Simone Tiraboschi 2021-08-02 14:47:07 UTC
Currently on our daemonsets I see:

$ oc get daemonset -n openshift-cnv -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.updateStrategy.rollingUpdate.maxUnavailable}{"\n"}{end}'
bridge-marker	1
kube-cni-linux-bridge-plugin	1
nmstate-handler	1
ovs-cni-amd64	1
virt-handler	1


Personally I'm simply for statically setting 10% on each of them, as Clayton Coleman recommends here:
https://bugzilla.redhat.com/show_bug.cgi?id=1920209#c14

Adam, Stu, Petr, Dominik: do you have any specific concerns about that?

Comment 5 Petr Horáček 2021-08-04 09:04:00 UTC
I don't have any concerns as none of these network components is critical cluster-wide.

I would be happy to share the work on this. @Simone, would you mind if I cloned this BZ and took over the network components?

Comment 6 Dominik Holler 2021-08-04 11:37:05 UTC
SSP is not affected, because SSP does not have any daemon set.

Comment 7 Simone Tiraboschi 2021-08-04 16:24:10 UTC
https://github.com/openshift/enhancements/pull/854 got merged so now the agreement is officially about setting:
.spec.updateStrategy.rollingUpdate.maxUnavailable = 10%
on all of our daemonsets.
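
For reference, the resulting update strategy stanza on each daemonset would look roughly like this (standard Kubernetes DaemonSet fields; the exact rendering by each operator may differ):

spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 10%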

> @Simone, would you mind if I cloned this BZ and took over the network components?

Yes, please do.
I'm also going to create bugs for the other affected components.

Comment 8 Simone Tiraboschi 2021-08-04 16:38:42 UTC
I just filed:

https://bugzilla.redhat.com/1990061 - [virt] CNV Daemonsets have maxUnavailable set to 1 which leads to very slow upgrades on large clusters
https://bugzilla.redhat.com/1990063 - [hpp] CNV Daemonsets have maxUnavailable set to 1 which leads to very slow upgrades on large clusters
https://bugzilla.redhat.com/1990065 - [network] CNV Daemonsets have maxUnavailable set to 1 which leads to very slow upgrades on large clusters

Keeping this as a tracker.

Comment 10 Fabian Deutsch 2021-09-27 19:26:30 UTC
Sai, is the cluster run in a disconnected setting, or does the customer use a registry proxy? Or are all images pulled from the public RH registry?

Comment 11 Simone Tiraboschi 2021-12-23 17:07:39 UTC
Moving this to ON_QA since https://bugzilla.redhat.com/show_bug.cgi?id=1990061 is now also ON_QA

Comment 12 Debarati Basu-Nag 2022-03-01 16:14:00 UTC
Marking it as verified since all the dependent bugs have been verified/closed.
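
One way to spot-check this on a cluster is to re-run the jsonpath query from comment 4; every CNV daemonset would now be expected to report 10% instead of 1:

$ oc get daemonset -n openshift-cnv -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.updateStrategy.rollingUpdate.maxUnavailable}{"\n"}{end}'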

Comment 15 errata-xmlrpc 2022-03-16 15:51:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.10.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0947

Comment 16 Red Hat Bugzilla 2023-09-15 01:12:16 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

