Description of problem:

During an upgrade from 4.5.21 to 4.6.8 on a large cluster (240 nodes on bare metal) we have observed that the network operator takes the longest time to update, and even within that it is the update of the multus daemonset that takes the longest. While several other daemonsets managed by CNO like ovnkube-node, ovs-node, and ovnkube-master also have maxUnavailable set to 1 in their rolling update strategy, it is only multus that is still rolling out even after all other daemonsets managed by CNO have completed updating. It is likely that the large number of init containers per multus pod is leading to this slow update. It is my understanding that we can't set a maxUnavailable of 10% for the networking daemonsets like some of the other daemonsets, as that will impact workload connectivity. It might make sense to reduce the number of init containers/consolidate some logic for a faster rollout of the multus daemonset. There are currently 5 init containers in the multus pod.

Version-Release number of selected component (if applicable):
4.5.21 -> 4.6.8

How reproducible:
100%

Steps to Reproduce:
1. Deploy a large cluster
2. Update it from 4.5 to 4.6
3. Observe the time taken for the network operator to complete updating

Actual results:
The multus daemonset takes an extremely long time to update.

Expected results:
Multus should not take that long, as several other daemonsets that have maxUnavailable set to 1, like multus does, finish hours before multus.

Additional info:

[kni@e16-h18-b03-fc640 kube-burner]$ oc get ds
NAME                          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
multus                        239       239       238     236          238         kubernetes.io/os=linux            16h
multus-admission-controller   3         3         3       3            3           node-role.kubernetes.io/master=   16h
network-metrics-daemon        239       239       239     239          239         kubernetes.io/os=linux            3h27m
============================================================================
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
=============================================================================
status:
  conditions:
  - lastTransitionTime: "2021-01-25T05:13:35Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2021-01-25T02:29:59Z"
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2021-01-25T15:40:31Z"
    message: DaemonSet "openshift-multus/multus" update is rolling out (236 out of 239 updated)
    reason: Deploying
    status: "True"
    type: Progressing
  - lastTransitionTime: "2021-01-25T02:35:48Z"
    status: "True"
    type: Available
  extension: null
============================================================================
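For reference, the rollout strategy above can be confirmed directly against the daemonset object rather than a pod. These are standard oc invocations; the openshift-multus namespace is assumed, matching the later comments in this bug:

```
# Show the rolling update strategy the multus daemonset is actually using
$ oc get ds multus -n openshift-multus -o jsonpath='{.spec.updateStrategy}{"\n"}'

# Watch the rollout progress while the upgrade is running
$ oc rollout status ds/multus -n openshift-multus
```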
I currently believe that this may be due to the Multus daemonset taking a long time to delete -- not necessarily to be created.

Looking at an upstream cluster, if I delete a pod from the Multus daemonset, it can take as long as 36 seconds:

```
$ time kubectl delete pod kube-multus-ds-amd64-hx582 -n kube-system
pod "kube-multus-ds-amd64-hx582" deleted

real	0m36.297s
```

However, if we set `terminationGracePeriodSeconds: 10`, the time is reduced by about half:

```
[centos@kube-singlehost-master multus-cni]$ time kubectl delete pod kube-multus-ds-amd64-g5hzq -n kube-system
pod "kube-multus-ds-amd64-g5hzq" deleted

real	0m17.937s
```

In a cluster of this size, this could potentially cause quite a slowdown.

Additionally, we have a pull request upstream @ https://github.com/intel/multus-cni/pull/600

Regarding some of the other points...

Regarding maxUnavailable, I don't believe this is part of the Multus daemonset yaml. I have checked master as well as the 4.5 branch, master here @ https://github.com/openshift/cluster-network-operator/blob/master/bindata/network/multus/multus.yaml#L96-L108

```
$ oc describe pod multus-hf4kb -n openshift-multus | grep -i maxunavail
$ oc edit pod multus-hf4kb -n openshift-multus
```

Additionally, I don't believe that the init containers are causing an undue amount of time: each of those init containers is used for copying CNI plugin binaries to disk, they do not wait, and each one just copies a few binaries to the host from the image. These images are separate because CNI plugins are often unrelated projects, and this is the kind of keystone where they come together and are copied to disk.
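For rough scale, at about 36 seconds per pod deletion with maxUnavailable: 1, a 239-pod daemonset would spend on the order of 239 × 36 s ≈ 2.4 hours on terminations alone (startup and readiness add more on top), so halving the termination time matters. A minimal sketch of where `terminationGracePeriodSeconds` sits in a DaemonSet manifest follows; the labels and image are placeholders, not the actual multus manifest:

```
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: multus
  namespace: openshift-multus
spec:
  selector:
    matchLabels:
      app: multus            # placeholder label
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app: multus          # placeholder label
    spec:
      # Give the pod at most 10s to exit after SIGTERM before SIGKILL,
      # instead of the 30s Kubernetes default.
      terminationGracePeriodSeconds: 10
      containers:
      - name: kube-multus
        image: example.com/multus-cni:latest   # placeholder image
```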
(In reply to Douglas Smith from comment #1)
> Regarding maxUnavailable, I don't believe this is part of the Multus
> daemonset yaml. I have checked master as well as the 4.5 branch, master here
> @ https://github.com/openshift/cluster-network-operator/blob/master/bindata/network/multus/multus.yaml#L96-L108

Thank you for the quick investigation. I do see the maxUnavailable field in the multus daemonset though:

[kni@e16-h12-b01-fc640 benchmark-operator]$ oc get ds/multus -o yaml | grep -i maxun
        f:maxUnavailable: {}
      maxUnavailable: 1

You seem to have checked the pod definition and not the daemonset definition?
Sai, thanks for the double check on the `maxUnavailable: 1` -- it turns out that this is the default for a daemonset: https://kubernetes.io/docs/tasks/manage-daemon/update-daemon-set/#performing-a-rolling-update

That being said, I think it could be reasonable for 1 in 10 instances of the Multus daemonset to be unavailable during an upgrade, which would considerably speed up upgrades on a larger deployment.

I have a pull request up for review @ https://github.com/openshift/cluster-network-operator/pull/962

This sets `maxUnavailable: 10%` as well as `terminationGracePeriodSeconds: 10`, which I believe will also reduce the time it takes to stop a pod in the Multus daemonset.
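For illustration (the authoritative change is the linked PR), a percentage-based maxUnavailable goes in the same rollingUpdate block. Kubernetes computes the absolute number by rounding the percentage down against the number of desired pods, so 10% of 239 nodes allows up to 23 multus pods to be updated in parallel:

```
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      # Up to 10% of scheduled pods may be unavailable at once during a rollout
      maxUnavailable: 10%
```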
(In reply to Douglas Smith from comment #3)
> I have a pull request up for review @
> https://github.com/openshift/cluster-network-operator/pull/962
>
> This sets `maxUnavailable: 10%` as well as `terminationGracePeriodSeconds:
> 10`, which I believe will also reduce the time it takes to stop a pod in
> the Multus daemonset.

Hey Doug, thanks for the patch. I do want to bring up that all the other network-related daemonsets like ovnkube-node, ovs-node, etc. also have maxUnavailable set to 1, but they don't take this long. Also, do you think setting a maxUnavailable of 10% is acceptable for multus?
I need to bring it up with some other folks to get a sense of the impact of setting maxUnavailable to 10%. Thanks for the consideration.
After some discussion, it has been determined that maxUnavailable: 10% likely does more harm than good, so I've removed it from the PR. Thanks again for double-checking this.
(In reply to Douglas Smith from comment #6)
> After some discussion, it has been determined that maxUnavailable: 10%
> likely does more harm than good, so I've removed it from the PR. Thanks
> again for double-checking this.

Great, any chance you could backport this to 4.5?
I believe that should be possible. We'll need to get the original PR reviewed and merged, and then we can work on backports.
QE did a quick test on an AWS cluster with 6 nodes; the whole upgrade process took about one hour.

[weliang@weliang ~]$ oc get ds -n openshift-multus
NAME                          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
multus                        6         6         6       6            6           kubernetes.io/os=linux            174m
multus-admission-controller   3         3         3       3            3           node-role.kubernetes.io/master=   174m
network-metrics-daemon        6         6         6       6            6           kubernetes.io/os=linux            34m
[weliang@weliang ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.8     True        False         8m5s    Cluster version is 4.6.8
[weliang@weliang ~]$
Tested AWS 200 nodes with 4.5.21:

[weliang@weliang ~]$ time oc delete pod multus-zw4ph
pod "multus-zw4ph" deleted

real	0m40.508s
user	0m0.175s
sys	0m0.033s

Tested AWS 200 nodes with 4.8.0-0.nightly-2021-02-25-092506:

[weliang@weliang ~]$ time oc delete pod multus-zltjs
pod "multus-zltjs" deleted

real	0m31.781s
user	0m0.153s
sys	0m0.029s

The test results show the real time for deleting a multus pod has decreased in v4.8.
> It is likely that the large number of init containers per multus pod is
> leading to this slow update. It is my understanding that we can't set a
> maxUnavailable of 10% for the networking daemonsets like some of the other
> daemonsets, as that will impact workload connectivity. It might make sense
> to reduce the number of init containers/consolidate some logic for a faster
> rollout of the multus daemonset. There are currently 5 init containers in
> the multus pod.

This is explicitly not true. We moved OVS to the host, therefore 10% should work for networking. Any failure that impacts a user at maxUnavailable: 10% would also impact a workload at maxUnavailable: 1, and components are not allowed to impact workloads at maxUnavailable: 1 either (it is a bug in the component if they do not maintain availability during upgrade). All daemonsets in OpenShift should be at maxUnavailable: 10%, and any bugs that block that should be prioritized.
I believe that we may have changes in the latest downstream Multus master that will cause the daemonset to exit in a timely fashion. This is relative to Clayton's point that the previous modification should not have been needed @ https://github.com/openshift/cluster-network-operator/pull/962#issuecomment-782380195

Weibin, I'm setting this back to ON_QA to get you to verify, but the steps should be to time how long it takes to kill a multus daemonset pod. The general steps are to:

1. Find the pods

```
$ oc get pods -n openshift-multus
```

And then time killing one, for example:

```
$ time oc exec -it kube-multus-ds-rjv6r -n openshift-multus -- kill 1

real	0m0.108s
user	0m0.068s
sys	0m0.014s
```

We're looking for something in the realm of "a couple of seconds max" to verify.

Thanks Weibin!
-Doug
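If it helps, here is a hedged sketch of a loop for timing the kill across every multus pod rather than one at a time. It assumes bash, the openshift-multus namespace, an app=multus pod label, and that PID 1 in the container is the multus process (as in the single-pod example above):

```
#!/usr/bin/env bash
# Time how quickly each multus pod's PID 1 handles SIGTERM.
# Assumptions: pods carry the app=multus label and PID 1 is the main process.
for pod in $(oc get pods -n openshift-multus -l app=multus -o name); do
  echo "== ${pod}"
  time oc exec -n openshift-multus "${pod#pod/}" -- kill 1
done
```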
Tested in 4.8.0-0.nightly-2021-04-08-005413:

[weliang@weliang Config]$ time oc exec -it multus-52lx6 -n openshift-multus -- kill 1

real	0m0.774s
user	0m0.235s
sys	0m0.037s
[weliang@weliang Config]$ time oc exec -it multus-l77mk -n openshift-multus -- kill 1

real	0m0.799s
user	0m0.244s
sys	0m0.032s
[weliang@weliang Config]$ time oc exec -it multus-fv6r6 -n openshift-multus -- kill 1

real	0m0.786s
user	0m0.224s
sys	0m0.045s
[weliang@weliang Config]$
Hey Doug, so was maxUnavailable raised to 10%? Or did this involve only faster handling of SIGTERM? Like Clayton said, maybe we should prioritize making maxUnavailable 10%?
Good question -- it is indeed set to maxUnavailable: 10%: https://github.com/openshift/cluster-network-operator/blob/master/bindata/network/multus/multus.yaml#L103

So, three things were tweaked as part of this BZ:

* SIGTERM handled correctly (I think this is the root cause of the problem, generally)
* Set maxUnavailable: 10%
* Set terminationGracePeriodSeconds (likely the least important given the SIGTERM handling)
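To illustrate the SIGTERM point in general terms (this is not the actual Multus entrypoint, just a generic shell pattern): an entrypoint script that does not trap SIGTERM sits out the full terminationGracePeriodSeconds before the kubelet escalates to SIGKILL, whereas one that traps the signal exits almost immediately.

```
#!/usr/bin/env bash
# Generic pattern for exiting promptly on SIGTERM; the cleanup body is a
# placeholder, not the real Multus shutdown logic.
cleanup() {
  echo "received SIGTERM, cleaning up"
  exit 0
}
trap cleanup TERM INT

# Run the long-lived work in the background and wait on it so the trap
# can fire as soon as the signal arrives.
while true; do
  sleep 3600 &
  wait $!
done
```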
The 'maxUnavailable: 10%' change went in via bug 1933159 and [1]. So while that was also recently tweaked and is in the "Multus daemonset upgrade takes the longest time" space, it's not actually "tweaked as part of this BZ".

[1]: https://github.com/openshift/cluster-network-operator/pull/992
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438