Description of problem:

During an upgrade from 4.5.21 to 4.6.8 on a large cluster (240 nodes on bare metal) we have observed that the network operator takes the longest time to update, and even within that it is the update of the multus daemonset that takes the longest. While several other daemonsets managed by CNO like ovnkube-node, ovs-node, and ovnkube-master also have maxUnavailable set to 1 in their rolling update strategy, it is only multus that is still rolling out even after all other daemonsets managed by CNO have completed updating. It is likely that the large number of init containers per multus pod is leading to this slow update. It is my understanding that we can't set a maxUnavailable of 10% for the networking daemonsets like some of the other daemonsets, as that will impact workload connectivity. It might make sense to reduce the number of init containers/consolidate some logic for a faster rollout of the multus daemonset. There are currently 5 init containers in the multus pod.

Version-Release number of selected component (if applicable):
4.5.21 -> 4.6.8

How reproducible:
100%

Steps to Reproduce:
1. Deploy a large cluster
2. Update it from 4.5 to 4.6
3. Observe the time taken for the network operator to complete updating

Actual results:
The multus daemonset takes an extremely long time to update.

Expected results:
Multus should not take that long, as several other daemonsets that have maxUnavailable set to 1, like multus does, finish hours before multus.

Additional info:

[kni@e16-h18-b03-fc640 kube-burner]$ oc get ds
NAME                          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
multus                        239       239       238     236          238         kubernetes.io/os=linux            16h
multus-admission-controller   3         3         3       3            3           node-role.kubernetes.io/master=   16h
network-metrics-daemon        239       239       239     239          239         kubernetes.io/os=linux            3h27m
============================================================================
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
=============================================================================
status:
  conditions:
  - lastTransitionTime: "2021-01-25T05:13:35Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2021-01-25T02:29:59Z"
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2021-01-25T15:40:31Z"
    message: DaemonSet "openshift-multus/multus" update is rolling out (236 out of 239 updated)
    reason: Deploying
    status: "True"
    type: Progressing
  - lastTransitionTime: "2021-01-25T02:35:48Z"
    status: "True"
    type: Available
  extension: null
============================================================================
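For reference, the rollout strategy above can be confirmed directly against the daemonset object rather than a pod. These are standard oc invocations; the openshift-multus namespace is assumed, matching the later comments in this bug:

```
# Show the rolling update strategy the multus daemonset is actually using
$ oc get ds multus -n openshift-multus -o jsonpath='{.spec.updateStrategy}{"\n"}'

# Watch the rollout progress while the upgrade is running
$ oc rollout status ds/multus -n openshift-multus
```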
I currently believe that this may be due to the Multus daemonset taking a long time to delete -- not necessarily to be created.

Looking at an upstream cluster, if I delete a pod from the Multus daemonset, it can take as long as 36 seconds:

```
$ time kubectl delete pod kube-multus-ds-amd64-hx582 -n kube-system
pod "kube-multus-ds-amd64-hx582" deleted

real	0m36.297s
```

However, if we set `terminationGracePeriodSeconds: 10`, the time is reduced by about half:

```
[centos@kube-singlehost-master multus-cni]$ time kubectl delete pod kube-multus-ds-amd64-g5hzq -n kube-system
pod "kube-multus-ds-amd64-g5hzq" deleted

real	0m17.937s
```

In a cluster of this size, this could potentially cause quite a slowdown.

Additionally, we have a pull request upstream @ https://github.com/intel/multus-cni/pull/600

Regarding some of the other points...

Regarding maxUnavailable, I don't believe this is part of the Multus daemonset yaml. I have checked master as well as the 4.5 branch, master here @ https://github.com/openshift/cluster-network-operator/blob/master/bindata/network/multus/multus.yaml#L96-L108

```
$ oc describe pod multus-hf4kb -n openshift-multus | grep -i maxunavail
$ oc edit pod multus-hf4kb -n openshift-multus
```

Additionally, I don't believe that the init containers are causing an undue amount of time: each of those init containers is used for copying CNI plugin binaries to disk, they do not wait, and each one just copies a few binaries to the host from the image. These images are separate because CNI plugins are often unrelated projects, and this is the kind of keystone where they come together and are copied to disk.
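For rough scale, at about 36 seconds per pod deletion with maxUnavailable: 1, a 239-pod daemonset would spend on the order of 239 × 36 s ≈ 2.4 hours on terminations alone (startup and readiness add more on top), so halving the termination time matters. A minimal sketch of where `terminationGracePeriodSeconds` sits in a DaemonSet manifest follows; the labels and image are placeholders, not the actual multus manifest:

```
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: multus
  namespace: openshift-multus
spec:
  selector:
    matchLabels:
      app: multus            # placeholder label
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app: multus          # placeholder label
    spec:
      # Give the pod at most 10s to exit after SIGTERM before SIGKILL,
      # instead of the 30s Kubernetes default.
      terminationGracePeriodSeconds: 10
      containers:
      - name: kube-multus
        image: example.com/multus-cni:latest   # placeholder image
```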
(In reply to Douglas Smith from comment #1)
> Regarding maxUnavailable, I don't believe this is part of the Multus
> daemonset yaml. I have checked master as well as the 4.5 branch, master here
> @ https://github.com/openshift/cluster-network-operator/blob/master/bindata/network/multus/multus.yaml#L96-L108

Thank you for the quick investigation. I do see the maxUnavailable field in the multus daemonset though:

[kni@e16-h12-b01-fc640 benchmark-operator]$ oc get ds/multus -o yaml | grep -i maxun
        f:maxUnavailable: {}
      maxUnavailable: 1

You seem to have checked the pod definition and not the daemonset definition?
Sai, thanks for the double check on the `maxUnavailable: 1` -- it turns out that this is the default for a daemonset: https://kubernetes.io/docs/tasks/manage-daemon/update-daemon-set/#performing-a-rolling-update

That being said, I think it could be reasonable for 1 in 10 instances of the Multus daemonset to be unavailable during an upgrade, which would considerably speed up upgrades on a larger deployment.

I have a pull request up for review @ https://github.com/openshift/cluster-network-operator/pull/962

This sets `maxUnavailable: 10%` as well as `terminationGracePeriodSeconds: 10`, which I believe will also reduce the time it takes to stop a pod in the Multus daemonset.
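For illustration (the authoritative change is the linked PR), a percentage-based maxUnavailable goes in the same rollingUpdate block. Kubernetes computes the absolute number by rounding the percentage down against the number of desired pods, so 10% of 239 nodes allows up to 23 multus pods to be updated in parallel:

```
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      # Up to 10% of scheduled pods may be unavailable at once during a rollout
      maxUnavailable: 10%
```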
(In reply to Douglas Smith from comment #3)
> I have a pull request up for review @
> https://github.com/openshift/cluster-network-operator/pull/962
>
> This sets `maxUnavailable: 10%` as well as `terminationGracePeriodSeconds:
> 10`, which I believe will also reduce the time it takes to stop a pod in
> the Multus daemonset.

Hey Doug, thanks for the patch. I do want to bring up that all the other network-related daemonsets like ovnkube-node, ovs-node, etc. also have maxUnavailable set to 1, but they don't take this long. Also, do you think setting a maxUnavailable of 10% is acceptable for multus?
I need to bring it up with some other folks to get a sense of the impact of setting maxUnavailable to 10%. Thanks for the consideration.
After some discussion, it has been determined that maxUnavailable: 10% likely does more harm than good, so I've removed it from the PR. Thanks again for double-checking this.
(In reply to Douglas Smith from comment #6)
> After some discussion, it has been determined that maxUnavailable: 10%
> likely does more harm than good, so I've removed it from the PR. Thanks
> again for double-checking this.

Great, any chance you could backport this to 4.5?
I believe that should be possible. We'll need to get the original PR reviewed and merged, and then we can work on backports.
QE did a quick test on an AWS cluster with 6 nodes; the whole upgrade process took about one hour.

[weliang@weliang ~]$ oc get ds -n openshift-multus
NAME                          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
multus                        6         6         6       6            6           kubernetes.io/os=linux            174m
multus-admission-controller   3         3         3       3            3           node-role.kubernetes.io/master=   174m
network-metrics-daemon        6         6         6       6            6           kubernetes.io/os=linux            34m
[weliang@weliang ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.8     True        False         8m5s    Cluster version is 4.6.8
[weliang@weliang ~]$
Tested AWS 200 nodes with 4.5.21:

[weliang@weliang ~]$ time oc delete pod multus-zw4ph
pod "multus-zw4ph" deleted

real	0m40.508s
user	0m0.175s
sys	0m0.033s

Tested AWS 200 nodes with 4.8.0-0.nightly-2021-02-25-092506:

[weliang@weliang ~]$ time oc delete pod multus-zltjs
pod "multus-zltjs" deleted

real	0m31.781s
user	0m0.153s
sys	0m0.029s

The test results show the real time for deleting a multus pod has decreased in v4.8.
> It is likely that the large number of init containers per multus pod is
> leading to this slow update. It is my understanding that we can't set a
> maxUnavailable of 10% for the networking daemonsets like some of the other
> daemonsets, as that will impact workload connectivity. It might make sense
> to reduce the number of init containers/consolidate some logic for a faster
> rollout of the multus daemonset. There are currently 5 init containers in
> the multus pod.

This is explicitly not true. We moved OVS to the host, therefore 10% should work for networking. Any failure that impacts a user at maxUnavailable: 10% would also impact a workload at maxUnavailable: 1, and components are not allowed to impact workloads at maxUnavailable: 1 either (it is a bug in the component if they do not maintain availability during upgrade). All daemonsets in OpenShift should be at maxUnavailable: 10%, and any bugs that block that should be prioritized.
I believe that we may have changes in the latest downstream Multus master that will cause the daemonset to exit in a timely fashion. This is relative to Clayton's point that the previous modification should not have been needed @ https://github.com/openshift/cluster-network-operator/pull/962#issuecomment-782380195

Weibin, I'm setting this back to ON_QA to get you to verify, but the steps should be to time how long it takes to kill a multus daemonset pod. The general steps are to:

1. Find the pods

```
$ oc get pods -n openshift-multus
```

And then time killing one, for example:

```
$ time oc exec -it kube-multus-ds-rjv6r -n openshift-multus -- kill 1

real	0m0.108s
user	0m0.068s
sys	0m0.014s
```

We're looking for something in the realm of "a couple of seconds max" to verify.

Thanks Weibin!
-Doug
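If it helps, here is a hedged sketch of a loop for timing the kill across every multus pod rather than one at a time. It assumes bash, the openshift-multus namespace, an app=multus pod label, and that PID 1 in the container is the multus process (as in the single-pod example above):

```
#!/usr/bin/env bash
# Time how quickly each multus pod's PID 1 handles SIGTERM.
# Assumptions: pods carry the app=multus label and PID 1 is the main process.
for pod in $(oc get pods -n openshift-multus -l app=multus -o name); do
  echo "== ${pod}"
  time oc exec -n openshift-multus "${pod#pod/}" -- kill 1
done
```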
Tested in 4.8.0-0.nightly-2021-04-08-005413:

[weliang@weliang Config]$ time oc exec -it multus-52lx6 -n openshift-multus -- kill 1

real	0m0.774s
user	0m0.235s
sys	0m0.037s
[weliang@weliang Config]$ time oc exec -it multus-l77mk -n openshift-multus -- kill 1

real	0m0.799s
user	0m0.244s
sys	0m0.032s
[weliang@weliang Config]$ time oc exec -it multus-fv6r6 -n openshift-multus -- kill 1

real	0m0.786s
user	0m0.224s
sys	0m0.045s
[weliang@weliang Config]$
Hey Doug, so was maxUnavailable raised to 10%? Or did this involve only faster handling of SIGTERM? Like Clayton said, maybe we should prioritize making maxUnavailable 10%?
Good question -- it is indeed set to maxUnavailable: 10%: https://github.com/openshift/cluster-network-operator/blob/master/bindata/network/multus/multus.yaml#L103

So, three things were tweaked as part of this BZ:

* SIGTERM handled correctly (I think this is the root cause of the problem, generally)
* Set maxUnavailable: 10%
* Set terminationGracePeriodSeconds (likely the least important given the SIGTERM handling)
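To illustrate the SIGTERM point in general terms (this is not the actual Multus entrypoint, just a generic shell pattern): an entrypoint script that does not trap SIGTERM sits out the full terminationGracePeriodSeconds before the kubelet escalates to SIGKILL, whereas one that traps the signal exits almost immediately.

```
#!/usr/bin/env bash
# Generic pattern for exiting promptly on SIGTERM; the cleanup body is a
# placeholder, not the real Multus shutdown logic.
cleanup() {
  echo "received SIGTERM, cleaning up"
  exit 0
}
trap cleanup TERM INT

# Run the long-lived work in the background and wait on it so the trap
# can fire as soon as the signal arrives.
while true; do
  sleep 3600 &
  wait $!
done
```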
The 'maxUnavailable: 10%' change went in via bug 1933159 and [1]. So while that was also recently tweaked and is in the "Multus daemonset upgrade takes the longest time" space, it's not actually "tweaked as part of this BZ".

[1]: https://github.com/openshift/cluster-network-operator/pull/992
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438