Description of problem:

Bug 1933761 set the max TTL of DNS records in CoreDNS's cache to 900 seconds. The previous max TTL before Bug 1933761 was 30 seconds. With Bug 1933761 in place, CoreDNS now caches negative DNS responses from the upstream resolver (such as NXDOMAIN) for up to 900 seconds (depending on the TTL set by the upstream resolver). This is undesired behavior since waiting 15 minutes to retry a query for a domain that may have recently been registered can increase cluster install time in some cases unnecessarily.

Version-Release number of selected component (if applicable): 4.6, 4.7, and 4.8

Note that this issue is more likely to cause problems in compact or single node clusters as the overall number of CoreDNS caches decreases (the service proxy load-balances DNS queries, so the odds of hitting a CoreDNS instance with a cached NXDOMAIN response for a domain decrease as cluster size increases).
*** Bug 1939070 has been marked as a duplicate of this bug. ***
Verified in "4.8.0-0.nightly-2021-03-29-000904" release version. With this payload it is observed that the additional configuration of a 30 second TTL for negative records gets set by default, along with 900 seconds for positive records, in the cache plugin section:

-----
oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-03-29-000904   True        False         46m     Cluster version is 4.8.0-0.nightly-2021-03-29-000904

Use 'oc describe pod/dns-default-7xz8b -n openshift-dns' to see all of the containers in this pod.
.:5353 {
    errors
    health {
        lameduck 20s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus 127.0.0.1:9153
    forward . /etc/resolv.conf {
        policy sequential
    }
    cache 900 {    <----
        denial 9984 30    <---
    }
    reload
}
-----
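
The effect of the verified cache stanza on negative caching can be checked offline. The following is an added example, not code from this bug report or from the OpenShift DNS operator: it parses the cache stanza of a Corefile and computes how long a cached NXDOMAIN can be served. The function names are hypothetical, the BEFORE sample assumes a bare "cache 900" as the description implies, and the 3600/1800 defaults are the CoreDNS cache plugin defaults.

```python
import re

# CoreDNS cache plugin defaults when no TTL argument is given (seconds).
DEFAULT_MAX_TTL = {"success": 3600, "denial": 1800}

# Constructed samples. BEFORE assumes a bare "cache 900", as the report
# describes; AFTER uses the cache stanza from the verified Corefile above.
BEFORE = """.:5353 {
    forward . /etc/resolv.conf {
        policy sequential
    }
    cache 900
    reload
}"""
AFTER = """.:5353 {
    forward . /etc/resolv.conf {
        policy sequential
    }
    cache 900 {
        denial 9984 30
    }
    reload
}"""

def cache_max_ttls(corefile):
    """Effective max TTL per answer class for the first cache stanza."""
    m = re.search(r"^\s*cache\b([^\n{]*)(?:\{([^}]*)\})?", corefile, re.M)
    if m is None:
        raise ValueError("no cache plugin in Corefile")
    ttl = dict(DEFAULT_MAX_TTL)
    args = m.group(1).split()
    if args and args[0].isdigit():        # cache [TTL] [ZONES...]
        ttl = {"success": int(args[0]), "denial": int(args[0])}
    for line in (m.group(2) or "").splitlines():
        tok = line.split()                # success|denial CAPACITY [TTL] [MINTTL]
        if len(tok) >= 3 and tok[0] in ttl and tok[2].isdigit():
            ttl[tok[0]] = int(tok[2])
    return ttl

def nxdomain_hold_seconds(corefile, upstream_negative_ttl):
    """Longest time a cached NXDOMAIN can be served (MINTTL not modelled)."""
    return min(upstream_negative_ttl, cache_max_ttls(corefile)["denial"])

if __name__ == "__main__":
    for name, cf in (("before", BEFORE), ("after", AFTER)):
        print(name, cache_max_ttls(cf), nxdomain_hold_seconds(cf, 3600))
```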
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438