Bug 1880148
| Summary: | dns daemonset rolls out slowly in large clusters | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Scott Dodson <sdodson> |
| Component: | Networking | Assignee: | Miciah Dashiel Butler Masters <mmasters> |
| Networking sub component: | DNS | QA Contact: | Hongan Li <hongli> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | aos-bugs, sgreene, wking |
| Version: | 4.5 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Last Closed: | 2021-02-24 15:18:31 UTC | Type: | Bug |
| Regression: | --- | | |
| Bug Blocks: | 1903887 | | |
**Description** (Scott Dodson, 2020-09-17 19:15:05 UTC)
Tagging with UpcomingSprint while investigation is either ongoing or pending. Will be considered for earlier release versions when diagnosed and resolved.

I'll try to get this done in the upcoming sprint.

I'll see about doing this in the upcoming sprint.

Cross-referencing bug 1867608, which landed a similar bump to 10% maxUnavailable for the machine-config DaemonSet.

Also, in 4.5->4.6 tests, it looks like the DNS DaemonSet did not actually block the DNS operator from leveling: the DNS operator claimed a 4.6 version at ~6:30Z, while the dns-node-resolver transitioned from 4.5 images to 4.6 images between ~6:30Z and ~9:15Z. Is the current logic "bump the DaemonSet spec, and then claim the operator and operand have leveled, regardless of how the DaemonSet rollout goes"? I'd rather wait for at least a handful of DaemonSet pods to complete the transition before claiming completion, to limit the risk of unnoticed version skew (if a handful of pods make it over, it's likely, but not guaranteed, that the rest will also transition smoothly). And as long as we level the operator before the operand has entirely transitioned, we should consider the risks of longer-term version skew. For example, what happens if someone moves 4.4->4.5->4.6, and when they start 4.5->4.6 they still have some dangling 4.4 DNS pods? Maybe that's fine? Maybe not?

I scaled the worker nodes from 3 to 8 to 20, but the DaemonSet always shows "maxUnavailable: 1" — am I missing something? Restarting the DNS operator is also not helpful. For example:

```
### scale up worker nodes
# oc -n openshift-machine-api scale machinesets/hongli-pl004-khlwb-worker-b --replicas=8
...
```
```
### with 23 nodes total (+3 masters)
# oc -n openshift-dns get ds/dns-default -oyaml
<---snip--->
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 23
  desiredNumberScheduled: 23
  numberAvailable: 23
  numberMisscheduled: 0
  numberReady: 23
  observedGeneration: 1
```

Hard to figure out why the run failed if you don't tell us which release image you used ;). Picking a recent-ish one that still predates your comment by a few hours:

```
$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.7.0-0.nightly-2020-12-04-013308 | grep cluster-dns-operator
  cluster-dns-operator https://github.com/openshift/cluster-dns-operator af4dd73a36a2afc1d91a6c570dbe4d5d2ab6ea89
```

So that includes the PR [1]. Checking the AWS promotion job for that nightly [2]:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1334672707534458880/artifacts/e2e-aws/daemonsets.json | gunzip | jq -r '.items[] | select(.metadata.name == "dns-default").spec.updateStrategy'
{
  "rollingUpdate": {
    "maxUnavailable": 1
  },
  "type": "RollingUpdate"
}
```

So yeah, the 10% goal seems to be getting lost before it makes it into the DaemonSet.
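As context for why the 10% setting matters at scale: as I understand the Kubernetes DaemonSet controller, a percentage maxUnavailable is resolved against desiredNumberScheduled with the result rounded up. A minimal sketch of that arithmetic (the function name is illustrative, not OpenShift or Kubernetes code):

```python
import math

def effective_max_unavailable(max_unavailable, desired_scheduled):
    """Resolve a DaemonSet maxUnavailable value (an int or an "N%" string)
    against the number of desired pods. Percentages are rounded up,
    mirroring (as I understand it) the controller's behavior."""
    if isinstance(max_unavailable, str) and max_unavailable.endswith("%"):
        percent = int(max_unavailable.rstrip("%"))
        return math.ceil(desired_scheduled * percent / 100)
    return int(max_unavailable)

# With the 23 scheduled dns-default pods from the transcript above:
print(effective_max_unavailable(1, 23))      # fixed default: 1 pod at a time
print(effective_max_unavailable("10%", 23))  # 10% of 23, rounded up: 3 pods at a time
```

So on a 23-node cluster the rollout can replace roughly three pods per wave instead of one, while a small cluster (e.g. the 6-pod verification below) still degrades to one pod at a time.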
Checking the DaemonSets that have moved to 10%:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1334672707534458880/artifacts/e2e-aws/daemonsets.json | gunzip | jq -r '.items[] | .metadata as $m | select(.spec.updateStrategy.rollingUpdate.maxUnavailable == "10%") | $m.namespace + " " + $m.name'
openshift-cluster-node-tuning-operator tuned
openshift-machine-config-operator machine-config-daemon
openshift-monitoring node-exporter
```

[1]: https://github.com/openshift/cluster-dns-operator/commit/af4dd73a36a2afc1d91a6c570dbe4d5d2ab6ea89
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1334672707534458880

I'll investigate why the fix isn't having the expected effect in the upcoming sprint.

(In reply to W. Trevor King from comment #9)
> Hard to figure out why the run failed if you don't tell us which release image you used ;). Picking a recent-ish one that still predates your comment by a few hours:

Sorry, my bad. I tested it with 4.7.0-0.nightly-2020-12-03-205004.

Verified with 4.7.0-0.nightly-2020-12-09-112139 and it passed:

```
# oc -n openshift-dns get ds/dns-default -oyaml
<---snip--->
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 10%
    type: RollingUpdate
status:
  currentNumberScheduled: 6
  desiredNumberScheduled: 6
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633