Bug 1880148 - dns daemonset rolls out slowly in large clusters
Summary: dns daemonset rolls out slowly in large clusters
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Miciah Dashiel Butler Masters
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks: 1903887
 
Reported: 2020-09-17 19:15 UTC by Scott Dodson
Modified: 2022-08-04 22:39 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1903887
Environment:
Last Closed: 2021-02-24 15:18:31 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-dns-operator pull 217 0 None closed Bug 1880148: Set DNS DaemonSet's maxUnavailable value to 10% 2021-01-23 04:54:21 UTC
Github openshift cluster-dns-operator pull 221 0 None closed Bug 1880148: Fix DNS DaemonSet's updateStrategy stanza 2021-01-23 04:54:22 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:19:03 UTC

Description Scott Dodson 2020-09-17 19:15:05 UTC
Description of problem:
Since the DNS architecture is fault tolerant, we should roll out the DNS daemonset more aggressively. Most other cluster-wide daemonsets that are not critical to local workload availability now use a maxUnavailable of 10%. In 250-node clusters this typically reduces the rollout time from around 100 minutes to 10 minutes.

Please update your daemonset and operator status code to work with maxUnavailable of 10% so that upgrade time doesn't scale linearly with node count.

https://github.com/openshift/cluster-dns-operator/blob/master/pkg/operator/controller/dns_status.go#L84
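For illustration, here is a minimal sketch of what tolerating up to 10% unavailable pods in the status computation could look like. The package, function, and threshold below are assumptions for illustration, not the operator's actual code.

// Hypothetical sketch: treat the DNS DaemonSet as still available while no
// more than ceil(10% of desired) pods are unavailable, so a 10% maxUnavailable
// rollout does not flap the operator's Available/Degraded conditions.
package status

import (
	"math"

	appsv1 "k8s.io/api/apps/v1"
)

// daemonSetWithinUnavailabilityBudget reports whether the number of
// unavailable pods is within ceil(10% of desired).
func daemonSetWithinUnavailabilityBudget(ds *appsv1.DaemonSet) bool {
	desired := ds.Status.DesiredNumberScheduled
	if desired == 0 {
		return false
	}
	unavailable := desired - ds.Status.NumberAvailable
	budget := int32(math.Ceil(float64(desired) * 0.10))
	return unavailable <= budget
}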

Comment 2 Andrew McDermott 2020-10-02 16:07:19 UTC
Tagging with UpcomingSprint while investigation is either ongoing or
pending. Will be considered for earlier release versions when
diagnosed and resolved.

Comment 3 Miciah Dashiel Butler Masters 2020-10-26 05:31:52 UTC
I'll try to get this done in the upcoming sprint.

Comment 4 Miciah Dashiel Butler Masters 2020-11-14 00:02:29 UTC
I'll see about doing this in the upcoming sprint.

Comment 5 W. Trevor King 2020-11-26 14:23:28 UTC
Cross-referencing bug 1867608, which landed a similar bump to 10% maxUnavailable for the machine-config DaemonSet.

Also, in 4.5->4.6 tests, looks like the DNS DaemonSet did not actually block the DNS operator from leveling, with the DNS operator claiming a 4.6 version ~6:30Z, with the dns-node-resolver transitioning from 4.5 images to 4.6 images between ~6:30Z and ~9:15Z.  Is the current logic "bump the DaemonSet spec, and then claim the operator and operand have leveled, regardless of how the DaemonSet rollouts go"?

I'd rather wait for at least a handful of DaemonSet pods to complete the transition before claiming completion, to limit the risk of unnoticed version skew (if a handful of pods make it over, it's likely, but not guaranteed, that the rest will also transition smoothly).

And as long as we are leveling the operator before the operand has entirely transitioned, we should consider the risks of longer-term version skew.  For example, what happens if someone moves 4.4->4.5->4.6, and when they start 4.5->4.6 they still have some dangling 4.4 DNS pods?  Maybe that's fine?  Maybe not?
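A rough sketch of the gating suggested above, assuming an arbitrary threshold and illustrative names rather than anything from the actual operator:

// Hypothetical sketch: only report the new operator/operand version once at
// least a handful of DaemonSet pods have been updated, rather than leveling
// immediately after bumping the DaemonSet spec. Threshold and names are
// assumptions for illustration.
package status

import appsv1 "k8s.io/api/apps/v1"

const minUpdatedPodsBeforeLeveling = 3 // illustrative, not a real operator setting

func readyToReportNewVersion(ds *appsv1.DaemonSet) bool {
	if ds.Status.DesiredNumberScheduled <= minUpdatedPodsBeforeLeveling {
		// Small clusters: wait for the whole rollout to complete.
		return ds.Status.UpdatedNumberScheduled == ds.Status.DesiredNumberScheduled &&
			ds.Status.NumberAvailable == ds.Status.DesiredNumberScheduled
	}
	// Larger clusters: a few updated pods give reasonable confidence that the
	// rest will also transition, without waiting for the full rollout.
	return ds.Status.UpdatedNumberScheduled >= minUpdatedPodsBeforeLeveling
}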

Comment 8 Hongan Li 2020-12-04 10:18:15 UTC
Scaled the worker nodes from 3 to 8 to 20, but the DaemonSet always shows "maxUnavailable: 1". Am I missing something? Restarting the dns operator does not help either.

e.g.
### scale up worker nodes
# oc -n openshift-machine-api scale machinesets/hongli-pl004-khlwb-worker-b --replicas=8
...

### with 23 nodes in total (20 workers + 3 masters)
# oc -n openshift-dns get ds/dns-default -oyaml
<---snip--->

  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 23
  desiredNumberScheduled: 23
  numberAvailable: 23
  numberMisscheduled: 0
  numberReady: 23
  observedGeneration: 1

Comment 9 W. Trevor King 2020-12-04 23:51:34 UTC
Hard to figure out why the run failed if you don't tell us which release image you used ;).  Picking a recent-ish one that still predates your comment by a few hours:

$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.7.0-0.nightly-2020-12-04-013308 | grep cluster-dns-operator
cluster-dns-operator                           https://github.com/openshift/cluster-dns-operator                           af4dd73a36a2afc1d91a6c570dbe4d5d2ab6ea89

So that includes the PR [1].  AWS promotion job for that nightly [2]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1334672707534458880/artifacts/e2e-aws/daemonsets.json | gunzip | jq -r '.items[] | select(.metadata.name == "dns-default").spec.updateStrategy'
{
  "rollingUpdate": {
    "maxUnavailable": 1
  },
  "type": "RollingUpdate"
}

So yeah, the 10% goal seems to be getting lost before it makes it into the DaemonSet.  Checking the DaemonSets that have moved to 10%:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1334672707534458880/artifacts/e2e-aws/daemonsets.json | gunzip | jq -r '.items[] | .metadata as $m | select(.spec.updateStrategy.rollingUpdate.maxUnavailable == "10%") | $m.namespace + " " + $m.name'
openshift-cluster-node-tuning-operator tuned
openshift-machine-config-operator machine-config-daemon
openshift-monitoring node-exporter

[1]: https://github.com/openshift/cluster-dns-operator/commit/af4dd73a36a2afc1d91a6c570dbe4d5d2ab6ea89
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1334672707534458880
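One plausible (unverified) explanation for the value getting lost: if the operator sets the strategy type to RollingUpdate but leaves the rollingUpdate sub-struct unset, or the sub-struct gets dropped when the DaemonSet is built or updated, the API server defaults maxUnavailable back to 1. A sketch of a fully specified stanza, with assumed package and function names:

// Hypothetical sketch: fully populate the updateStrategy so API-server
// defaulting cannot reintroduce maxUnavailable: 1. Not the exact change that
// eventually merged; names are illustrative.
package manifests

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func desiredUpdateStrategy() appsv1.DaemonSetUpdateStrategy {
	maxUnavailable := intstr.FromString("10%")
	return appsv1.DaemonSetUpdateStrategy{
		Type: appsv1.RollingUpdateDaemonSetStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDaemonSet{
			MaxUnavailable: &maxUnavailable,
		},
	}
}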

Comment 10 Miciah Dashiel Butler Masters 2020-12-07 03:02:17 UTC
I'll investigate why the fix isn't having the expected effect in the upcoming sprint.

Comment 11 Hongan Li 2020-12-07 03:10:03 UTC
(In reply to W. Trevor King from comment #9)
> Hard to figure out why the run failed if you don't tell us which release
> image you used ;).  Picking a recent-ish one that still predates your
> comment by a few hours:
> 
sorry, my bad. I tested it with 4.7.0-0.nightly-2020-12-03-205004.

Comment 13 Hongan Li 2020-12-10 02:49:12 UTC
verified with 4.7.0-0.nightly-2020-12-09-112139 and passed.

# oc -n openshift-dns get ds/dns-default -oyaml 
<---snip--->
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 10%
    type: RollingUpdate
status:
  currentNumberScheduled: 6
  desiredNumberScheduled: 6

Comment 16 errata-xmlrpc 2021-02-24 15:18:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

