1903887 – dns daemonset rolls out slowly in large clusters

Summary: dns daemonset rolls out slowly in large clusters

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	4.6.z
Assignee:	Miciah Dashiel Butler Masters
QA Contact:	Hongan Li
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1917579
Depends On:	1880148
Blocks:

Reported:	2020-12-03 03:00 UTC by Miciah Dashiel Butler Masters
Modified:	2022-08-04 22:39 UTC
CC List:	5 users
Fixed In Version:
Doc Type:
Doc Text:
Clone Of:	1880148
Environment:
Last Closed:	2021-02-08 13:50:51 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-dns-operator pull 219	0	None	closed	[release-4.6] Bug 1903887: Set DNS DaemonSet's maxUnavailable value to 10%	2021-02-05 18:32:38 UTC
Red Hat Product Errata	RHSA-2021:0308	0	None	None	None	2021-02-08 13:51:05 UTC

Description Miciah Dashiel Butler Masters 2020-12-03 03:00:01 UTC

+++ This bug was initially created as a clone of Bug #1880148 +++

Description of problem:
Since the DNS architecture is fault tolerant, we should roll out the DNS daemonset more aggressively. Most other cluster-wide daemonsets that are not critical to local workload availability now use a maxUnavailable of 10%. In 250-node clusters this typically reduces the rollout time from around 100 minutes to 10 minutes.

Please update your daemonset and operator status code to work with a maxUnavailable of 10% so that upgrade time doesn't scale linearly with node count.

https://github.com/openshift/cluster-dns-operator/blob/master/pkg/operator/controller/dns_status.go#L84
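
For illustration, the effect of the setting on rollout batching can be sketched in Python. This is a rough model, not the operator's code: the function names are hypothetical, it assumes the previous setting was the Kubernetes default maxUnavailable of 1, and it relies on the documented DaemonSet behavior of rounding percentage values up. It counts update waves only and says nothing about how long each wave takes.

```python
def max_unavailable_pods(desired: int, max_unavailable) -> int:
    """Resolve a DaemonSet maxUnavailable value (int or 'N%') to a pod count.

    Hypothetical helper for illustration; not taken from cluster-dns-operator.
    """
    if isinstance(max_unavailable, str) and max_unavailable.endswith("%"):
        percent = int(max_unavailable[:-1])
        # Percentages are rounded up for DaemonSets (integer ceiling division).
        count = (desired * percent + 99) // 100
    else:
        count = int(max_unavailable)
    # Assumption: at least one pod may always be replaced at a time.
    return max(count, 1)


def rollout_waves(desired: int, max_unavailable) -> int:
    """Number of sequential update waves needed to replace every pod."""
    batch = max_unavailable_pods(desired, max_unavailable)
    return (desired + batch - 1) // batch  # ceiling division


if __name__ == "__main__":
    for nodes in (6, 250):
        print(nodes, "nodes:",
              rollout_waves(nodes, 1), "waves at maxUnavailable=1,",
              rollout_waves(nodes, "10%"), "waves at maxUnavailable=10%")
```

Under these assumptions a 250-node cluster needs 10 waves of 25 pods instead of 250 single-pod steps, while on a 6-node cluster 10% still resolves to one pod at a time.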

Comment 1 Miciah Dashiel Butler Masters 2020-12-07 03:03:26 UTC

The original fix needs a follow-up before the backport can proceed.

Comment 2 Hongan Li 2020-12-10 03:55:50 UTC

Verified with a cluster launched by cluster-bot; the check passed.

# oc get clusterversion
NAME      VERSION                                           AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.ci.test-2020-12-10-031026-ci-ln-kgv0vgb   True        False         4m25s   Cluster version is 4.6.0-0.ci.test-2020-12-10-031026-ci-ln-kgv0vgb

# oc -n openshift-dns get ds/dns-default -oyaml 
<---snip--->
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 10%
    type: RollingUpdate
status:
  currentNumberScheduled: 6

Comment 3 Andrew McDermott 2021-01-19 17:53:12 UTC

*** Bug 1917579 has been marked as a duplicate of this bug. ***

Comment 4 Miciah Dashiel Butler Masters 2021-01-19 19:25:21 UTC

Bumping severity because every 4.6.z release that doesn't have this fix is going to roll out slowly.

Comment 8 errata-xmlrpc 2021-02-08 13:50:51 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.6.16 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0308
