Bug 1867603

Summary: ds/node-exporter roll out takes 100+ minutes on 250 node cluster
Product: OpenShift Container Platform Reporter: Scott Dodson <sdodson>
Component: Monitoring Assignee: Sergiusz Urbaniak <surbania>
Status: VERIFIED --- QA Contact: Junqi Zhao <juzhao>
Severity: low Docs Contact:
Priority: low    
Version: 4.4 CC: alegrand, anpicker, erooth, kakkoyun, lcosic, mloibl, pkrupa, surbania
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
The node_exporter daemonset does not affect workload availability, so its rollout can be parallelized. The fix reduces the rollout time on a 250-node cluster from about 100 minutes to about 10 minutes.
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---

Description Scott Dodson 2020-08-10 13:07:33 UTC
Description of problem:
Because ds/node-exporter defaults to a rollingUpdate maxUnavailable of 1, the rollout is entirely serialized and thus very slow on large clusters. We can speed up the rollout of daemonsets that don't immediately affect availability by allowing maxUnavailable to scale with cluster size.

A quick test on a 250-node cluster shows that the current behavior takes around 100 minutes, whereas with maxUnavailable set to 10% it takes under 10 minutes.
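
As a rough sanity check on these numbers, the sketch below models the rollout as a sequence of waves, each replacing at most maxUnavailable pods. The per-wave time is not a measurement from this bug: it is back-derived from the 100-minute serialized figure (100 min / 250 pods = 24 s) and assumed to stay constant regardless of wave size. The function names are illustrative only, and the percentage is rounded up, which is my understanding of how the upstream DaemonSet API resolves a percentage maxUnavailable.

```python
import math


def rollout_waves(nodes: int, max_unavailable: int) -> int:
    """Number of sequential waves needed if each wave replaces at most
    max_unavailable pods (idealized model, one pod per node)."""
    return math.ceil(nodes / max_unavailable)


def estimate_minutes(nodes: int, max_unavailable: int, seconds_per_wave: float) -> float:
    """Estimated rollout time, assuming every wave takes the same time."""
    return rollout_waves(nodes, max_unavailable) * seconds_per_wave / 60


NODES = 250
# Assumption: derived from the reported ~100 minutes for 250 serialized pods.
SECONDS_PER_WAVE = 100 * 60 / NODES

# maxUnavailable: 1 (the old default) gives one pod per wave.
serial_minutes = estimate_minutes(NODES, 1, SECONDS_PER_WAVE)

# maxUnavailable: 10% of 250 nodes resolves to 25 pods per wave.
ten_percent = math.ceil(NODES * 10 / 100)
parallel_minutes = estimate_minutes(NODES, ten_percent, SECONDS_PER_WAVE)

print(f"maxUnavailable=1:   {rollout_waves(NODES, 1)} waves, ~{serial_minutes:.0f} min")
print(f"maxUnavailable=10%: {rollout_waves(NODES, ten_percent)} waves, ~{parallel_minutes:.0f} min")
```

Under these assumptions the 10% setting needs 10 waves instead of 250, which is consistent with the observed drop to under 10 minutes. A real rollout has scheduling and image-pull overhead that this model ignores.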

Version-Release number of selected component (if applicable):
4.4

How reproducible:
100%

Steps to Reproduce:
1. Install a cluster that's got 20 or more hosts
2. Perform an upgrade
3. Observe that only one node-exporter pod is unavailable at a time, and note how long the upgrade takes.

Actual results:
1 node-exporter unavailable at a time, slow rollout

Expected results:
At most 10% of node-exporter pods unavailable at a time, faster / more parallel rollout

Additional info:

Comment 4 Junqi Zhao 2020-09-14 02:35:47 UTC
Tested with 4.6.0-0.nightly-2020-09-12-230035; maxUnavailable for rollingUpdate is now 10%:
# oc -n openshift-monitoring get ds node-exporter -oyaml
--
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 10%
    type: RollingUpdate
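
To illustrate what a value of 10% means in pods for a given cluster size, here is a minimal sketch of resolving maxUnavailable to an absolute count. The helper `resolve_max_unavailable` is hypothetical and not part of any OpenShift or Kubernetes client library. It assumes one node-exporter pod per node and that percentages are rounded up, which is my reading of the upstream DaemonSet API documentation.

```python
def resolve_max_unavailable(value, desired_pods: int) -> int:
    """Resolve an int-or-percent maxUnavailable value to a pod count.

    Hypothetical helper for illustration. Percentages are rounded up
    (assumption based on the DaemonSet rollingUpdate field description).
    """
    if isinstance(value, int):
        return value
    text = str(value).strip()
    if not text.endswith("%"):
        raise ValueError(f"expected an integer or a percentage, got {value!r}")
    percent = int(text[:-1])
    # Integer ceiling division avoids floating-point rounding surprises.
    return (desired_pods * percent + 99) // 100


# One pod per node is assumed, so desired_pods equals the node count.
for nodes in (5, 20, 250):
    pods = resolve_max_unavailable("10%", nodes)
    print(f"{nodes} nodes: up to {pods} node-exporter pod(s) unavailable at once")
```

On the 250-node cluster from the description this gives 25 pods per batch, and on a 20-node cluster it gives 2, so the parallelism is already visible at the cluster size named in the reproduction steps.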