Bug 2095941

Summary: DNS Traffic not kept local to zone or node when Calico SDN utilized
Product: OpenShift Container Platform Reporter: Tyler Lisowski <lisowski>
Component: Networking Assignee: Miciah Dashiel Butler Masters <mmasters>
Networking sub component: DNS QA Contact: Hongan Li <hongli>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: aos-bugs, mmasters
Version: 4.10   
Target Milestone: ---   
Target Release: 4.11.0   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: Topology Aware Hints is a new feature in OpenShift 4.11 that allows the EndpointSlice controller to specify hints to the CNI network provider for how it should route traffic to a service's endpoints. The DNS operator did not enable Topology Aware Hints for the cluster DNS service.
Consequence: CNI network providers such as Calico SDN did not keep DNS traffic local to the zone or node. (Note that the OpenShift SDN and OVN-Kubernetes CNI network providers that are included in OpenShift have logic to prefer local DNS pods for the cluster DNS service and were not affected by this issue as long as the node had a local DNS pod.)
Fix: The DNS operator was changed to specify Topology Aware Hints on the cluster DNS service.
Result: The Topology Aware Hints feature is now enabled for the cluster DNS service for CNI network providers that support it.
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-08-10 11:17:24 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Tyler Lisowski 2022-06-11 05:34:15 UTC
Description of problem:
The OpenShift DNS deployment on clusters using Calico SDN does not keep DNS traffic local to the node or zone. A backend is selected at random, so DNS traffic can cross zones, which raises request latency and sometimes causes failures.

One option is to ensure topology hints are enabled in 4.11+, which aims to keep traffic within a zone boundary whenever possible. Ideally, traffic would stay on the node; however, keeping it within a zone is still a major improvement over the current topology for DNS.

OpenShift release version:
All OpenShift releases

Cluster Platform:
Any provider using Calico SDN (IBM ROKS, IBM Cloud Satellite)

How reproducible:

Steps to Reproduce (in detail):
1. Set up tcpdump
2. Send DNS requests from a pod on a node in a multi-zone cluster
3. Watch as the requests are distributed to different zones over time
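
One way to make step 3 easier to read is to tally the captured packets by destination address. The helper below is only a sketch: the function name `count_dns_destinations` is hypothetical, and it assumes the default one-line text format that `tcpdump -nn` prints for IPv4 packets (`<time> IP <src>.<port> > <dst>.<port>: ...`).

```shell
# Hypothetical helper: tally destination IPs from `tcpdump -nn` text output
# read on stdin. Assumes the default single-line IPv4 packet format.
count_dns_destinations() {
  awk '$2 == "IP" && $4 == ">" {
         dst = $5
         sub(/:$/, "", dst)          # drop the trailing colon
         sub(/\.[0-9]+$/, "", dst)   # drop the port suffix, leaving the IP
         count[dst]++
       }
       END { for (ip in count) print count[ip], ip }' |
    sort -k1,1nr -k2,2               # most frequently hit destination first
}
```

It could be fed from a live capture on the node, for example `tcpdump -nn -l udp port 5353 | count_dns_destinations` (the port is an assumption about where the DNS pods listen; adjust the filter to match the cluster). Destination addresses that belong to DNS pods in other zones indicate cross-zone traffic.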

Actual results:
DNS requests can leave the node and zone and reach any random backend DNS pod

Expected results:
When possible, DNS requests stay local to the node and/or zone

Impact of the problem:
Higher latency of DNS requests
Increased failures of DNS requests

Additional info:



Comment 1 Miciah Dashiel Butler Masters 2022-06-14 03:28:56 UTC
I did my best to write up doc text for this BZ.  Please feel free to suggest or make corrections.

Comment 3 Hongan Li 2022-06-21 04:17:43 UTC
Verified with 4.11.0-0.ci-2022-06-20-211630 (since the latest available nightly build is 5 days old); the annotation "service.kubernetes.io/topology-aware-hints: auto" has been added to the dns-default service.

$ oc -n openshift-dns get svc/dns-default -oyaml
apiVersion: v1
kind: Service
    service.alpha.openshift.io/serving-cert-signed-by: openshift-service-serving-signer@1655777103
    service.beta.openshift.io/serving-cert-secret-name: dns-default-metrics-tls
    service.beta.openshift.io/serving-cert-signed-by: openshift-service-serving-signer@1655777103
    service.kubernetes.io/topology-aware-hints: auto
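
The same check can be scripted instead of read by eye. This is a minimal sketch, not part of the original verification: `has_topology_hints` is a hypothetical helper name, and it simply greps the Service YAML on stdin for the annotation shown above.

```shell
# Hypothetical helper: exit 0 when the Service YAML on stdin carries the
# topology-aware-hints annotation with the value "auto" (case-insensitive),
# and exit non-zero otherwise.
has_topology_hints() {
  grep -Eiq '^[[:space:]]*service\.kubernetes\.io/topology-aware-hints:[[:space:]]*"?auto"?[[:space:]]*$'
}
```

Usage would look like `oc -n openshift-dns get svc/dns-default -oyaml | has_topology_hints && echo "hints enabled"`.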

Comment 4 Hongan Li 2022-06-22 03:42:12 UTC
Checked with the latest nightly build, 4.11.0-0.nightly-2022-06-21-151125; it passed as well.

    service.alpha.openshift.io/serving-cert-signed-by: openshift-service-serving-signer@1655866069
    service.beta.openshift.io/serving-cert-secret-name: dns-default-metrics-tls
    service.beta.openshift.io/serving-cert-signed-by: openshift-service-serving-signer@1655866069
    service.kubernetes.io/topology-aware-hints: auto

Comment 6 errata-xmlrpc 2022-08-10 11:17:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.