Description of problem:

When the cluster administrator configures the cluster DNS service to forward to custom upstream resolvers, the DNS operator configures CoreDNS with server blocks for the custom resolvers but does not enable the "errors" plugin in these server blocks. As a result, CoreDNS does not log errors for custom upstream resolvers.

Version-Release number of selected component (if applicable):

OCP 4.3 (when the DNS forwarding API was introduced) and later.

Steps to Reproduce:

1. Set up a custom nameserver that resolves names for some zone; for example:

    $ oc adm new-project mydns
    Created project mydns
    $ cat >Corefile <<'EOF'
    .:5353 {
        hosts {
            1.2.3.4 www.redhat.com
        }
    }
    EOF
    $ oc -n mydns create configmap coredns --from-file=Corefile
    configmap/coredns created
    $ oc -n mydns create deployment coredns \
        --image=openshift/origin-coredns:latest --replicas=0 --port=5353 \
        -- coredns --conf=/etc/coredns/Corefile
    deployment.apps/coredns created
    $ oc -n mydns set volume deployments/coredns --add \
        --mount-path=/etc/coredns --type=configmap --configmap-name=coredns
    info: Generated volume name: volume-pj5g6
    deployment.apps/coredns volume updated
    $ oc -n mydns scale deployments/coredns --replicas=1
    deployment.apps/coredns scaled
    $ oc -n mydns expose deployments/coredns --port=5353 --protocol=UDP
    service/coredns exposed
    $ oc -n mydns get services
    NAME      TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
    coredns   ClusterIP   172.30.67.42   <none>        5353/UDP   3s

2. Configure cluster DNS to forward queries for the zone to this custom nameserver:

    $ oc patch dns.operator/default --type=merge --patch='{"spec":{"servers":[{"name":"mydns","zones":["redhat.com"],"forwardPlugin":{"upstreams":["172.30.67.42:5353"]}}]}}'
    dns.operator.openshift.io/default patched

3. From a pod inside the cluster, continuously perform lookups for a name in the zone for which the custom nameserver is responsible:

    $ while :; do nslookup www.redhat.com; sleep 0.5; done
    Server:    172.30.0.10
    Address:   172.30.0.10#53

    Name:      www.redhat.com
    Address:   1.2.3.4

    Server:    172.30.0.10
    Address:   172.30.0.10#53

    Name:      www.redhat.com
    ...

4. Delete the custom nameserver pod:

    $ oc -n mydns delete pods --all
    pod "coredns-d8d568c95-69x6v" deleted

5. Check the output of the nslookup commands:

    Server:    172.30.0.10
    Address:   172.30.0.10#53

    ** server can't find www.redhat.com: SERVFAIL

    Server:    172.30.0.10
    Address:   172.30.0.10#53

    ** server can't find www.redhat.com: SERVFAIL
    ...

6. Check the DNS pods' log output:

    $ for pod in $(oc -n openshift-dns get pods -o name)
      do oc -n openshift-dns logs -c dns $pod
      done

7. Check the Corefile configmap for the cluster DNS pods:

    $ oc -n openshift-dns get configmaps/dns-default -o yaml
    apiVersion: v1
    data:
      Corefile: |
        # mydns
        redhat.com:5353 {
            forward . 172.30.67.42:5353
        }
        .:5353 {
            errors
            health
            kubernetes cluster.local in-addr.arpa ip6.arpa {
                pods insecure
                fallthrough in-addr.arpa ip6.arpa
            }
            prometheus 127.0.0.1:9153
            forward . /etc/resolv.conf {
                policy sequential
            }
            cache 30
            reload
        }
    ...

Actual results:

At step 6, the DNS pods' logs do not show any errors. At step 7, the Corefile configmap does not include the "errors" plugin in the server block for the custom resolver.

Expected results:

At step 6, some DNS pods should log errors, for example:

    [ERROR] plugin/errors: 2 www.redhat.com. A: read udp 10.128.2.26:42666->172.30.188.79:5353: i/o timeout

At step 7, the Corefile should include the "errors" plugin in the custom resolver's server block:

    # mydns
    redhat.com:5353 {
        forward . 172.30.67.42:5353
        errors
    }
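The shape of the fix is simple: whatever renders the per-server stanzas of the Corefile must emit "errors" inside every custom server block, not only in the default ".:5353" block. The actual operator is written in Go (see openshift/cluster-dns-operator); the following is a hypothetical Python sketch of that rendering step, with invented helper and parameter names, only to illustrate where the plugin line belongs:

```python
def render_server_block(name, zones, upstreams, port=5353):
    """Render a CoreDNS server block for a custom forwarder.

    Hypothetical sketch; the real operator uses its own Go templating.
    The point of this bug is that "errors" must appear inside each
    custom server block so CoreDNS logs upstream failures for it.
    """
    header = " ".join("%s:%d" % (zone, port) for zone in zones)
    lines = [
        "# %s" % name,
        "%s {" % header,
        "    forward . %s" % " ".join(upstreams),
        "    errors",  # the plugin that was missing before the fix
        "}",
    ]
    return "\n".join(lines)
```

With the values from the reproduction above, `render_server_block("mydns", ["redhat.com"], ["172.30.67.42:5353"])` yields the corrected "# mydns" stanza shown under "Expected results".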
Verified in 4.8.0-0.nightly-2021-03-08-133419.

    $ oc get clusterversion
    NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.8.0-0.nightly-2021-03-08-133419   True        False         118m    Cluster version is 4.8.0-0.nightly-2021-03-08-133419

1. Set up a custom nameserver:

    $ oc adm new-project mydns
    Created project mydns
    $ oc -n mydns create configmap coredns --from-file=Corefile
    configmap/coredns created
    $ oc -n mydns create deployment coredns --image=openshift/origin-coredns:latest --replicas=0 --port=5353 -- coredns --conf=/etc/coredns/Corefile
    deployment.apps/coredns created
    $ oc -n mydns set volume deployments/coredns --add --mount-path=/etc/coredns --type=configmap --configmap-name=coredns
    info: Generated volume name: volume-6wqvb
    deployment.apps/coredns volume updated
    $ oc -n mydns scale deployments/coredns --replicas=1
    deployment.apps/coredns scaled
    $ oc -n mydns expose deployments/coredns --port=5353 --protocol=UDP
    service/coredns exposed
    $ oc -n mydns get services
    NAME      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
    coredns   ClusterIP   172.30.197.57   <none>        5353/UDP   9s

2. Configured cluster DNS to forward queries for the zone to this custom nameserver:

    $ oc patch dns.operator/default --type=merge --patch='{"spec":{"servers":[{"name":"mydns","zones":["redhat.com"],"forwardPlugin":{"upstreams":["172.30.197.57:5353"]}}]}}'
    dns.operator.openshift.io/default patched

3. Created a test pod and, from the pod, continuously performed nslookups for a name in the zone for which the custom nameserver is responsible:

    $ oc create -f https://raw.githubusercontent.com/openshift/verification-tests/master/testdata/networking/aosqe-pod-for-ping.json
    pod/hello-pod created
    $ oc rsh hello-pod
    / # while :; do nslookup www.redhat.com; sleep 0.5; done
    Server:    172.30.0.10
    Address:   172.30.0.10:53

    Name:      www.redhat.com
    Address:   1.2.3.4

    Server:    172.30.0.10
    Address:   172.30.0.10:53

    Name:      www.redhat.com
    Address:   1.2.3.4

4. Changed the replicas for the deployment to 0:

    $ oc -n mydns scale deployments/coredns --replicas=0
    deployment.apps/coredns scaled

5. After step 4, nslookup from the test pod failed as expected:

    $ oc rsh hello-pod
    / # while :; do nslookup www.redhat.com; sleep 0.5; done
    ;; connection timed out; no servers could be reached

    Server:    172.30.0.10
    Address:   172.30.0.10:53

    ** server can't find www.redhat.com: SERVFAIL
    ** server can't find www.redhat.com: SERVFAIL

6. Verified that the DNS pods started logging errors:

    $ for pod in $(oc -n openshift-dns get pods -o name)
    > do oc -n openshift-dns logs -c dns $pod
    > done
    .:5353
    <--- snip --->
    [ERROR] plugin/errors: 2 www.redhat.com. A: read udp 10.129.2.3:60009->172.30.197.57:5353: i/o timeout
    [ERROR] plugin/errors: 2 www.redhat.com. AAAA: read udp 10.129.2.3:53513->172.30.197.57:5353: i/o timeout
    [ERROR] plugin/errors: 2 www.redhat.com. A: read udp 10.129.2.3:35611->172.30.197.57:5353: read: connection refused
    [ERROR] plugin/errors: 2 www.redhat.com. AAAA: read udp 10.129.2.3:41772->172.30.197.57:5353: read: connection refused
    [ERROR] plugin/errors: 2 www.redhat.com. A: read udp 10.129.2.3:44281->172.30.197.57:5353: i/o timeout
    [ERROR] plugin/errors: 2 www.redhat.com. AAAA: read udp 10.129.2.3:50071->172.30.197.57:5353: i/o timeout
    <--- snip --->

7. Verified that the custom nameserver's server block in the Corefile now includes the "errors" plugin:

    $ oc -n openshift-dns get configmaps/dns-default -o yaml
    apiVersion: v1
    data:
      Corefile: |
        # mydns
        redhat.com:5353 {
            forward . 172.30.197.57:5353
            errors    <-- verifies the fix from https://github.com/openshift/cluster-dns-operator/pull/241
        }
        .:5353 {
            errors
            health
            kubernetes cluster.local in-addr.arpa ip6.arpa {
                pods insecure
                fallthrough in-addr.arpa ip6.arpa
            }
            prometheus 127.0.0.1:9153
            forward . /etc/resolv.conf {
                policy sequential
            }
            cache 900
            reload
        }
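The check in step 7 above can also be automated. Here is a rough Python sketch (not part of the verification tests; function name and parsing approach are my own) that scans a rendered Corefile and reports any top-level server block lacking the "errors" plugin:

```python
def blocks_missing_errors(corefile):
    """Return headers of top-level server blocks that lack "errors".

    A rough single-pass brace parser, good enough for Corefiles like
    the dns-default configmap above. It tracks nesting depth so that
    inner stanzas (e.g. the kubernetes plugin block) do not start or
    end a server block; it does not handle braces inside quotes.
    """
    missing = []
    depth = 0
    header = None
    has_errors = False
    for raw in corefile.splitlines():
        line = raw.strip()
        if depth == 0:
            if line.endswith("{"):          # start of a server block
                header = line[:-1].strip()
                has_errors = False
                depth = 1
            continue
        tokens = line.split()
        if tokens and tokens[0] == "errors":
            has_errors = True
        depth += line.count("{") - line.count("}")
        if depth == 0 and not has_errors:   # server block just closed
            missing.append(header)
    return missing
```

Run against the buggy Corefile from the description, this returns ["redhat.com:5353"]; against the fixed Corefile from step 7, it returns an empty list.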
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438