Bug 1944245 - CoreDNS caches NXDOMAIN responses for up to 900 seconds
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: DNS
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.z
Assignee: Stephen Greene
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On: 1943826
Blocks:
Reported: 2021-03-29 15:20 UTC by OpenShift BugZilla Robot
Modified: 2021-04-20 19:27 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Bug 1936587 set the global CoreDNS cache max TTL to 900 seconds.
Consequence: NXDOMAIN responses received from upstream resolvers are cached for up to 900 seconds.
Fix: Explicitly cache negative DNS responses for a maximum of 30 seconds.
Result: Domains that are in the process of being published resolve within about 30 seconds of becoming available, instead of staying unresolvable for up to 15 minutes.
Clone Of:
Environment:
Last Closed: 2021-04-20 19:27:50 UTC
Target Upstream Version:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-dns-operator pull 256 | 0 | None | open | [release-4.6] Bug 1944245: Corefile: Use 30 second max TTL for caching of negative responses | 2021-04-02 00:50:23 UTC
Red Hat Product Errata | RHBA-2021:1153 | 0 | None | None | None | 2021-04-20 19:27:58 UTC

Comment 1 Miciah Dashiel Butler Masters 2021-03-29 15:29:44 UTC
To fix bug 1936589, we configured cluster DNS to honor TTL values of up to 15 minutes from upstream resolvers and to cap higher TTL values at 15 minutes.  Before that change, TTL values were capped at 30 seconds.  Capping TTL values for NXDOMAIN responses at 15 minutes instead of 30 seconds causes long delays (up to 15 minutes) when provisioning service load-balancers, including the default ingress load-balancer that is provisioned when a cluster is installed.  Bug 1936589 is verified but not shipped.  The justification for marking this new BZ as a blocker is that we want to fix the problem introduced by the fix for bug 1936589 before it ships.
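
As a rough illustration of the capping behavior described above, the following Python sketch models how a cached entry's TTL could be derived from the upstream TTL once positive and negative responses have separate caps. It is not CoreDNS code: the function name and constants are hypothetical, the values 900 and 30 are taken from this report, and any minimum-TTL handling in the cache is ignored.

```python
# Hypothetical model of TTL capping; names are illustrative, not CoreDNS internals.
SUCCESS_MAX_TTL = 900  # cap for positive responses, in seconds (15 minutes)
DENIAL_MAX_TTL = 30    # cap for negative (e.g. NXDOMAIN) responses, in seconds


def cached_ttl(upstream_ttl: int, is_denial: bool) -> int:
    """Return the TTL a cache entry would be stored with under the caps above."""
    cap = DENIAL_MAX_TTL if is_denial else SUCCESS_MAX_TTL
    return min(upstream_ttl, cap)


if __name__ == "__main__":
    # A positive answer with a one-hour upstream TTL is held for 15 minutes.
    print(cached_ttl(3600, is_denial=False))
    # An NXDOMAIN whose upstream negative TTL is 900 seconds is held for 30 seconds.
    print(cached_ttl(900, is_denial=True))
```

With a single global cap of 900 seconds, the second case would instead be cached for the full 900 seconds, which is the delay this bug describes.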

Comment 3 Hongan Li 2021-04-02 02:44:10 UTC
Verified with the cluster launched by cluster-bot (launch openshift/cluster-dns-operator#256) and passed.

$ oc get clusterversion
NAME      VERSION                                           AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.ci.test-2021-04-02-005725-ci-ln-9c610wt   True        False         59m     Cluster version is 4.6.0-0.ci.test-2021-04-02-005725-ci-ln-9c610wt

$ oc -n openshift-dns get cm/dns-default -oyaml
apiVersion: v1
data:
  Corefile: |
    .:5353 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            upstream
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf {
            policy sequential
        }
        cache 900 {
            denial 9984 30
        }
        reload
    }

### check the TTL of positive response
sh-4.4# dig stackoverflow.com

; <<>> DiG 9.11.13-RedHat-9.11.13-6.el8_2.1 <<>> stackoverflow.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 10317
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;stackoverflow.com.		IN	A

;; ANSWER SECTION:
stackoverflow.com.	300	IN	A	151.101.65.69
stackoverflow.com.	300	IN	A	151.101.129.69


### check the TTL of negative response
sh-4.4# dig nxdomain.google.com

; <<>> DiG 9.11.13-RedHat-9.11.13-6.el8_2.1 <<>> nxdomain.google.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 24399
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;nxdomain.google.com.		IN	A

;; AUTHORITY SECTION:
google.com.		27	IN	SOA	ns1.google.com. dns-admin.google.com. 366215971 900 900 1800 60
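
The manual check above can also be expressed as a small script. This is only a sketch, assuming the record lines are copied from the dig output as shown; the helper name is hypothetical, and the caps are the ones from the Corefile in this comment.

```python
# Caps taken from the Corefile shown above; the helper name is hypothetical.
SUCCESS_MAX_TTL = 900  # from "cache 900"
DENIAL_MAX_TTL = 30    # from "denial 9984 30"


def record_ttl(record_line: str) -> int:
    """Extract the TTL (second whitespace-separated field) from a dig record line."""
    return int(record_line.split()[1])


# Record lines copied from the ANSWER and AUTHORITY sections above.
positive = "stackoverflow.com.\t300\tIN\tA\t151.101.65.69"
negative = (
    "google.com.\t\t27\tIN\tSOA\tns1.google.com. "
    "dns-admin.google.com. 366215971 900 900 1800 60"
)

# Positive answers may carry TTLs of up to 900 seconds; negative ones up to 30.
assert record_ttl(positive) <= SUCCESS_MAX_TTL
assert record_ttl(negative) <= DENIAL_MAX_TTL
```

A negative-response TTL above 30 in the AUTHORITY section would indicate that the denial cap is not being applied.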

Comment 10 errata-xmlrpc 2021-04-20 19:27:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.25 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1153

