Description of problem:
The OpenShift DNS deployment for Calico SDN does not keep DNS traffic local to the node or zone. It selects one of the backends at random, which can send DNS traffic across zones, adding latency to requests and sometimes causing failures. One option is to ensure topology hints are enabled in 4.11+, which aims to keep traffic within a zone boundary whenever possible. Ideally, traffic would be kept on the node; however, keeping it within a zone is a major improvement over the current topology for DNS.

OpenShift release version:
All OpenShift releases

Cluster Platform:
Any provider using Calico SDN (IBM ROKS, IBM Cloud Satellite)

How reproducible:
100%

Steps to Reproduce (in detail):
1. Set up tcpdump
2. Send DNS requests from a pod on a node in a multi-zone cluster
3. Watch as the requests are distributed to different zones over time

Actual results:
DNS requests can leave the node and zone for any random backend DNS pod

Expected results:
When possible, DNS requests stay local to the node and/or zone

Impact of the problem:
Higher latency of DNS requests
Increased failures of DNS requests

Additional info:
https://github.com/openshift/cluster-dns-operator/pull/322
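To make the difference between the actual and expected results concrete, here is a rough Python model of how zone hints narrow the set of DNS backends a node may choose from. It is only a sketch of the general topology-hints idea, not the code used by kube-proxy, Calico, or the DNS operator. The Endpoint class, the function names, the IPs, and the zone names are all hypothetical, and the fallback rules are simplified assumptions.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    """One DNS backend pod (hypothetical model, not a Kubernetes API type)."""
    ip: str
    zone: str                                       # zone the pod runs in
    hint_zones: list = field(default_factory=list)  # zones this endpoint is hinted for


def candidate_endpoints(endpoints, node_zone, hints_enabled):
    """Return the backends a node in node_zone may pick from."""
    # Without hints every backend is a candidate, so traffic can cross zones.
    if not hints_enabled:
        return list(endpoints)
    # Assumption: if any endpoint carries no hint, ignore hints entirely.
    if any(not ep.hint_zones for ep in endpoints):
        return list(endpoints)
    local = [ep for ep in endpoints if node_zone in ep.hint_zones]
    # Assumption: with no endpoint hinted for this zone, fall back to all.
    return local or list(endpoints)


def pick_backend(endpoints, node_zone, hints_enabled, rng=random):
    """Pick one backend at random from the allowed candidates."""
    return rng.choice(candidate_endpoints(endpoints, node_zone, hints_enabled))
```

With hints disabled, a node in "zone-a" can be handed a backend in any zone, which matches the behaviour reported above. With hints enabled and populated, the random choice is limited to backends hinted for the node's own zone.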
I did my best to write up doc text for this BZ. Please feel free to suggest or make corrections.
Verified with 4.11.0-0.ci-2022-06-20-211630 (since the latest available nightly build is 5 days old); the annotation "service.kubernetes.io/topology-aware-hints: auto" is added to the dns-default service.

$ oc -n openshift-dns get svc/dns-default -oyaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.alpha.openshift.io/serving-cert-signed-by: openshift-service-serving-signer@1655777103
    service.beta.openshift.io/serving-cert-secret-name: dns-default-metrics-tls
    service.beta.openshift.io/serving-cert-signed-by: openshift-service-serving-signer@1655777103
    service.kubernetes.io/topology-aware-hints: auto
Checked with the latest nightly build, 4.11.0-0.nightly-2022-06-21-151125, and it passed as well.

metadata:
  annotations:
    service.alpha.openshift.io/serving-cert-signed-by: openshift-service-serving-signer@1655866069
    service.beta.openshift.io/serving-cert-secret-name: dns-default-metrics-tls
    service.beta.openshift.io/serving-cert-signed-by: openshift-service-serving-signer@1655866069
    service.kubernetes.io/topology-aware-hints: auto
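For anyone who wants to script the same check, a minimal Python sketch follows. It assumes the service is dumped as JSON (for example with `oc -n openshift-dns get svc/dns-default -o json`) and passed in as a string. The function name topology_hints_enabled is hypothetical; the annotation key and the value "auto" are taken from the output above.

```python
import json

# Annotation key and expected value as they appear on the dns-default service.
HINTS_ANNOTATION = "service.kubernetes.io/topology-aware-hints"
EXPECTED_VALUE = "auto"


def topology_hints_enabled(service_json: str) -> bool:
    """Return True if the Service JSON carries the topology hints annotation."""
    svc = json.loads(service_json)
    # metadata.annotations may be absent or null on a Service without annotations.
    annotations = (svc.get("metadata") or {}).get("annotations") or {}
    return annotations.get(HINTS_ANNOTATION) == EXPECTED_VALUE
```

This only confirms that the annotation is present on the service. It does not show whether hints are actually populated on the endpoints or honoured by the data plane.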
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069