Bug 1880148
| Summary: | dns daemonset rolls out slowly in large clusters | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Scott Dodson <sdodson> |
| Component: | Networking | Assignee: | Miciah Dashiel Butler Masters <mmasters> |
| Networking sub component: | DNS | QA Contact: | Hongan Li <hongli> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | aos-bugs, sgreene, wking |
| Version: | 4.5 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Last Closed: | 2021-02-24 15:18:31 UTC | Type: | Bug |
| Regression: | --- | | |
| Bug Blocks: | 1903887 | | |
**Description** (Scott Dodson, 2020-09-17 19:15:05 UTC)
Tagging with UpcomingSprint while investigation is either ongoing or pending. Will be considered for earlier release versions when diagnosed and resolved.

I'll try to get this done in the upcoming sprint.

I'll see about doing this in the upcoming sprint.

Cross-referencing bug 1867608, which landed a similar bump to 10% maxUnavailable for the machine-config DaemonSet.

Also, in 4.5->4.6 tests, it looks like the DNS DaemonSet did not actually block the DNS operator from leveling: the DNS operator claimed a 4.6 version at ~6:30Z, while the dns-node-resolver transitioned from 4.5 images to 4.6 images between ~6:30Z and ~9:15Z. Is the current logic "bump the DaemonSet spec, and then claim the operator and operand have leveled, regardless of how the DaemonSet rollout goes"? I'd rather wait for at least a handful of DaemonSet pods to complete the transition before claiming completion, to limit the risk of unnoticed version skew (if a handful of pods make it over, it's likely, but not guaranteed, that the rest will also transition smoothly). And as long as we level the operator before the operand has entirely transitioned, we should consider the risks of longer-term version skew. For example, what happens if someone moves 4.4->4.5->4.6, and when they start 4.5->4.6 they still have some dangling 4.4 DNS pods? Maybe that's fine? Maybe not?

I scaled the worker nodes from 3 to 8 to 20, but the DaemonSet always shows "maxUnavailable: 1" — am I missing something? Restarting the DNS operator is also not helpful. For example:

```
### scale up worker nodes
# oc -n openshift-machine-api scale machinesets/hongli-pl004-khlwb-worker-b --replicas=8
...
```
```
### with 23 nodes total (+3 masters)
# oc -n openshift-dns get ds/dns-default -oyaml
<---snip--->
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 23
  desiredNumberScheduled: 23
  numberAvailable: 23
  numberMisscheduled: 0
  numberReady: 23
  observedGeneration: 1
```

Hard to figure out why the run failed if you don't tell us which release image you used ;). Picking a recent-ish one that still predates your comment by a few hours:

```
$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.7.0-0.nightly-2020-12-04-013308 | grep cluster-dns-operator
  cluster-dns-operator https://github.com/openshift/cluster-dns-operator af4dd73a36a2afc1d91a6c570dbe4d5d2ab6ea89
```

So that includes the PR [1]. Checking the AWS promotion job for that nightly [2]:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1334672707534458880/artifacts/e2e-aws/daemonsets.json | gunzip | jq -r '.items[] | select(.metadata.name == "dns-default").spec.updateStrategy'
{
  "rollingUpdate": {
    "maxUnavailable": 1
  },
  "type": "RollingUpdate"
}
```

So yeah, the 10% goal seems to be getting lost before it makes it into the DaemonSet.
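As context for why the 10% setting matters at scale: as I understand the Kubernetes DaemonSet controller, a percentage maxUnavailable is resolved against desiredNumberScheduled with the result rounded up. A minimal sketch of that arithmetic (the function name is illustrative, not OpenShift or Kubernetes code):

```python
import math

def effective_max_unavailable(max_unavailable, desired_scheduled):
    """Resolve a DaemonSet maxUnavailable value (an int or an "N%" string)
    against the number of desired pods. Percentages are rounded up,
    mirroring (as I understand it) the controller's behavior."""
    if isinstance(max_unavailable, str) and max_unavailable.endswith("%"):
        percent = int(max_unavailable.rstrip("%"))
        return math.ceil(desired_scheduled * percent / 100)
    return int(max_unavailable)

# With the 23 scheduled dns-default pods from the transcript above:
print(effective_max_unavailable(1, 23))      # fixed default: 1 pod at a time
print(effective_max_unavailable("10%", 23))  # 10% of 23, rounded up: 3 pods at a time
```

So on a 23-node cluster the rollout can replace roughly three pods per wave instead of one, while a small cluster (e.g. the 6-pod verification below) still degrades to one pod at a time.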
Checking the DaemonSets that have moved to 10%:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1334672707534458880/artifacts/e2e-aws/daemonsets.json | gunzip | jq -r '.items[] | .metadata as $m | select(.spec.updateStrategy.rollingUpdate.maxUnavailable == "10%") | $m.namespace + " " + $m.name'
openshift-cluster-node-tuning-operator tuned
openshift-machine-config-operator machine-config-daemon
openshift-monitoring node-exporter
```

[1]: https://github.com/openshift/cluster-dns-operator/commit/af4dd73a36a2afc1d91a6c570dbe4d5d2ab6ea89
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1334672707534458880

I'll investigate why the fix isn't having the expected effect in the upcoming sprint.

(In reply to W. Trevor King from comment #9)
> Hard to figure out why the run failed if you don't tell us which release image you used ;). Picking a recent-ish one that still predates your comment by a few hours:

Sorry, my bad. I tested it with 4.7.0-0.nightly-2020-12-03-205004.

Verified with 4.7.0-0.nightly-2020-12-09-112139 and it passed:

```
# oc -n openshift-dns get ds/dns-default -oyaml
<---snip--->
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 10%
    type: RollingUpdate
status:
  currentNumberScheduled: 6
  desiredNumberScheduled: 6
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633