Bug 1934905

Summary: CoreDNS's "errors" plugin is not enabled for custom upstream resolvers
Product: OpenShift Container Platform Reporter: Miciah Dashiel Butler Masters <mmasters>
Component: Networking Assignee: Miciah Dashiel Butler Masters <mmasters>
Networking sub component: DNS QA Contact: jechen <jechen>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: aos-bugs, hongli
Version: 4.8   
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 22:49:27 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1953609    

Description Miciah Dashiel Butler Masters 2021-03-04 00:43:23 UTC
Description of problem:

When the cluster administrator configures the cluster DNS service to forward to custom upstream resolvers, the DNS operator configures CoreDNS with server blocks for the custom resolvers but does not enable the "errors" plugin in these server blocks.  As a result, CoreDNS does not log errors for custom upstream resolvers.  


Version-Release number of selected component (if applicable):

OCP 4.3 (when the DNS forwarding API was introduced) and later.  


Steps to Reproduce:

1. Set up a custom nameserver that resolves names for some zone; for example:

    $ oc adm new-project mydns
    Created project mydns
    $ cat >Corefile <<'EOF'
    .:5353 {
        hosts {
          1.2.3.4 www.redhat.com
        }
    }
    EOF
    $ oc -n mydns create configmap coredns --from-file=Corefile
    configmap/coredns created
    $ oc -n mydns create deployment coredns \
          --image=openshift/origin-coredns:latest --replicas=0 --port=5353 \
          -- coredns --conf=/etc/coredns/Corefile
    deployment.apps/coredns created
    $ oc -n mydns set volume deployments/coredns --add \
         --mount-path=/etc/coredns --type=configmap --configmap-name=coredns
    info: Generated volume name: volume-pj5g6
    deployment.apps/coredns volume updated
    $ oc -n mydns scale deployments/coredns --replicas=1
    deployment.apps/coredns scaled
    $ oc -n mydns expose deployments/coredns --port=5353 --protocol=UDP
    service/coredns exposed
    $ oc -n mydns get services
    NAME      TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
    coredns   ClusterIP   172.30.67.42   <none>        5353/UDP   3s

2. Configure cluster DNS to forward queries for the zone to this custom nameserver:

    $ oc patch dns.operator/default --type=merge --patch='{"spec":{"servers":[{"name":"mydns","zones":["redhat.com"],"forwardPlugin":{"upstreams":["172.30.67.42:5353"]}}]}}'
    dns.operator.openshift.io/default patched

3. From a pod inside the cluster, continuously perform lookups for a name in the zone for which the custom nameserver is responsible:

    $ while :; do nslookup www.redhat.com; sleep 0.5; done
    Server:         172.30.0.10
    Address:        172.30.0.10#53
    
    Name:   www.redhat.com
    Address: 1.2.3.4
    
    Server:         172.30.0.10
    Address:        172.30.0.10#53
    
    Name:   www.redhat.com
    ...

4. Delete the custom nameserver pod:

    $ oc -n mydns delete pods --all
    pod "coredns-d8d568c95-69x6v" deleted

5. Check the output of the nslookup commands.

    Server:         172.30.0.10
    Address:        172.30.0.10#53
    
    ** server can't find www.redhat.com: SERVFAIL
    
    Server:         172.30.0.10
    Address:        172.30.0.10#53
    
    ** server can't find www.redhat.com: SERVFAIL
    ...

6. Check the DNS pods' log output.

    $ for pod in $(oc -n openshift-dns get pods -o name)
      do oc -n openshift-dns logs -c dns $pod
      done

7. Check the Corefile configmap for the cluster DNS pods:

    $ oc -n openshift-dns get configmaps/dns-default -o yaml
    apiVersion: v1
    data:
      Corefile: |
        # mydns
        redhat.com:5353 {
            forward . 172.30.67.42:5353
        }
        .:5353 {
            errors
            health
            kubernetes cluster.local in-addr.arpa ip6.arpa {
                pods insecure
                fallthrough in-addr.arpa ip6.arpa
            }
            prometheus 127.0.0.1:9153
            forward . /etc/resolv.conf {
                policy sequential
            }
            cache 30
            reload
        }
    ...
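
The patch in step 2 hardcodes the service IP reported in step 1. As a minimal sketch, the same patch can be built from variables; "build_forward_patch" is a hypothetical helper name (not part of oc or the DNS operator), and the commented-out usage assumes a logged-in oc session and the "mydns" project from step 1.

```shell
# Hypothetical helper: print the merge patch from step 2 for a given
# server name, zone, and upstream address (host:port).
build_forward_patch() {
    name=$1
    zone=$2
    upstream=$3
    printf '{"spec":{"servers":[{"name":"%s","zones":["%s"],"forwardPlugin":{"upstreams":["%s"]}}]}}\n' \
        "$name" "$zone" "$upstream"
}

# Example use against a live cluster (commented out; requires oc and a login):
#   ip=$(oc -n mydns get service coredns -o jsonpath='{.spec.clusterIP}')
#   oc patch dns.operator/default --type=merge \
#       --patch="$(build_forward_patch mydns redhat.com "$ip:5353")"
```

Reading the cluster IP from the service avoids a stale address if the service is recreated between runs.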


Actual results:

At step 6, the DNS pods' logs do not show any errors.

At step 7, the Corefile configmap does not include the "errors" plugin in the server block for the custom resolver.


Expected results:

At step 6, some DNS pods should log errors:

    [ERROR] plugin/errors: 2 www.redhat.com. A: read udp 10.128.2.26:42666->172.30.188.79:5353: i/o timeout

At step 7, the Corefile should include the "errors" plugin in the custom resolver's server block:

        # mydns
        redhat.com:5353 {
            forward . 172.30.67.42:5353
            errors
        }
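
The check in step 7 can be scripted. The following is only a sketch: "blocks_missing_errors" is a hypothetical helper name, and it assumes the Corefile layout shown above (each block opens with a line ending in "{" and closes with a line containing only "}").

```shell
# Hypothetical helper: read a Corefile on stdin and print the header of every
# top-level server block that does not enable the "errors" plugin.
blocks_missing_errors() {
    awk '
        /^[ \t]*#/ { next }                       # skip comment lines
        depth == 0 && /[{][ \t]*$/ {              # a server block opens
            header = $0
            sub(/[ \t]*[{][ \t]*$/, "", header)   # drop the trailing brace
            sub(/^[ \t]+/, "", header)            # drop leading indentation
            has_errors = 0
            depth = 1
            next
        }
        depth == 1 && $1 == "errors" { has_errors = 1 }
        depth >= 1 && /[{][ \t]*$/ { depth++ }    # a nested plugin block opens
        depth >= 1 && /^[ \t]*[}][ \t]*$/ {       # a block closes
            depth--
            if (depth == 0 && !has_errors) print header
        }
    '
}

# Example use against a live cluster (commented out; requires oc and a login):
#   oc -n openshift-dns get configmaps/dns-default \
#       -o jsonpath='{.data.Corefile}' | blocks_missing_errors
```

Run against the Corefile from step 7, this should print "redhat.com:5353"; with the expected Corefile it should print nothing.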

Comment 2 jechen 2021-03-08 23:18:23 UTC
Verified in 4.8.0-0.nightly-2021-03-08-133419


$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-03-08-133419   True        False         118m    Cluster version is 4.8.0-0.nightly-2021-03-08-133419

1. Set up a custom nameserver
$ oc adm new-project mydns
Created project mydns

$ oc -n mydns create configmap coredns --from-file=Corefile 
configmap/coredns created

$ oc -n mydns create deployment coredns --image=openshift/origin-coredns:latest --replicas=0 --port=5353 -- coredns --conf=/etc/coredns/Corefile
deployment.apps/coredns created


$ oc -n mydns set volume deployments/coredns --add --mount-path=/etc/coredns --type=configmap --configmap-name=coredns
info: Generated volume name: volume-6wqvb
deployment.apps/coredns volume updated

$ oc -n mydns scale deployments/coredns --replicas=1
deployment.apps/coredns scaled

$ oc -n mydns expose deployments/coredns --port=5353 --protocol=UDP
service/coredns exposed

$ oc -n mydns get services
NAME      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
coredns   ClusterIP   172.30.197.57   <none>        5353/UDP   9s


2. Configured cluster DNS to forward queries for the zone to this custom nameserver
$ oc patch dns.operator/default --type=merge --patch='{"spec":{"servers":[{"name":"mydns","zones":["redhat.com"],"forwardPlugin":{"upstreams":["172.30.197.57:5353"]}}]}}'
dns.operator.openshift.io/default patched


3. Created a test pod; from the pod, continuously performed lookups for a name in the zone for which the custom nameserver is responsible
$ oc create -f https://raw.githubusercontent.com/openshift/verification-tests/master/testdata/networking/aosqe-pod-for-ping.json
pod/hello-pod created

$ oc rsh hello-pod 
/ # while :; do nslookup www.redhat.com; sleep 0.5; done
Server:		172.30.0.10
Address:	172.30.0.10:53

Name:	www.redhat.com
Address: 1.2.3.4

Server:		172.30.0.10
Address:	172.30.0.10:53

Name:	www.redhat.com
Address: 1.2.3.4

4. Scaled the custom nameserver deployment down to 0 replicas
$ oc -n mydns scale deployments/coredns --replicas=0
deployment.apps/coredns scaled

5. After step 4, nslookup from the test pod failed as expected
$ oc rsh hello-pod
/ # while :; do nslookup www.redhat.com; sleep 0.5; done
;; connection timed out; no servers could be reached

Server:		172.30.0.10
Address:	172.30.0.10:53

** server can't find www.redhat.com: SERVFAIL

** server can't find www.redhat.com: SERVFAIL


6. Verified that the DNS pods started logging errors

$ for pod in $(oc -n openshift-dns get pods -o name)
>       do oc -n openshift-dns logs -c dns $pod
>       done
.:5353
<---snip--->
[ERROR] plugin/errors: 2 www.redhat.com. A: read udp 10.129.2.3:60009->172.30.197.57:5353: i/o timeout
[ERROR] plugin/errors: 2 www.redhat.com. AAAA: read udp 10.129.2.3:53513->172.30.197.57:5353: i/o timeout
[ERROR] plugin/errors: 2 www.redhat.com. A: read udp 10.129.2.3:35611->172.30.197.57:5353: read: connection refused
[ERROR] plugin/errors: 2 www.redhat.com. AAAA: read udp 10.129.2.3:41772->172.30.197.57:5353: read: connection refused
[ERROR] plugin/errors: 2 www.redhat.com. A: read udp 10.129.2.3:44281->172.30.197.57:5353: i/o timeout
[ERROR] plugin/errors: 2 www.redhat.com. AAAA: read udp 10.129.2.3:50071->172.30.197.57:5353: i/o timeout

<---snip--->



7. Verified that the Corefile server block for the custom nameserver now includes the "errors" plugin
$ oc -n openshift-dns get configmaps/dns-default -o yaml
apiVersion: v1
data:
  Corefile: |
    # mydns
    redhat.com:5353 {
        forward . 172.30.197.57:5353
        errors                           <-- verified the fix from https://github.com/openshift/cluster-dns-operator/pull/241
    }
    .:5353 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus 127.0.0.1:9153
        forward . /etc/resolv.conf {
            policy sequential
        }
        cache 900
        reload
    }
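
The log check in step 6 can be condensed into counts per query type and upstream. This is a sketch only: "summarize_forward_errors" is a hypothetical helper name, and it assumes the "plugin/errors" line format shown in the log output above.

```shell
# Hypothetical helper: read CoreDNS log lines on stdin and count the
# "plugin/errors" entries per query type and upstream address.
# Assumed line format (taken from the output in this report):
#   [ERROR] plugin/errors: 2 <name> <type>: read udp <src>-><upstream>: <reason>
summarize_forward_errors() {
    awk '
        $1 == "[ERROR]" && $2 == "plugin/errors:" {
            qtype = $5
            sub(/:$/, "", qtype)          # "A:" becomes "A"
            upstream = $8                 # "<src>-><upstream>:"
            sub(/^.*->/, "", upstream)    # keep only the upstream part
            sub(/:$/, "", upstream)       # drop the trailing colon
            count[qtype " " upstream]++
        }
        END { for (k in count) print count[k], k }
    ' | sort -k2
}

# Example use against a live cluster (commented out; requires oc and a login):
#   for pod in $(oc -n openshift-dns get pods -o name)
#     do oc -n openshift-dns logs -c dns $pod
#     done | summarize_forward_errors
```

Each output line has the form "<count> <type> <upstream>", so a forwarder that never appears in the summary produced no logged errors.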

Comment 5 errata-xmlrpc 2021-07-27 22:49:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438