Description of problem:

Deploying OpenShift 4.2 IPI on OpenStack, pods trying to resolve the wildcard "apps" records randomly fail to get an answer back.

How reproducible: always

Steps to Reproduce:
1. Deploy OpenShift 4.2 IPI using the openshift-installer on an OpenStack cluster
2. Run: "oc logs 2>&1 -f -n openshift-authentication-operator authentication-operator-996ddcc5b-bkdtt"
3. Watch for the "RouteHealthDegraded" and "lookup oauth-openshift.apps.<cluster domain> on 172.30.0.10:53: no such host" messages. They tend to appear about every five minutes.

Alternatively:
1. SSH into a node in the cluster
2. Run: "watch -n1 dig @172.30.0.10 -p 53 oauth-openshift.apps.<cluster domain>"
3. Watch as the Answer section randomly appears and disappears

Actual results:

Here is a snippet from the authentication-operator logs:

E0925 13:13:01.763313       1 controller.go:129] {AuthenticationOperator2 AuthenticationOperator2} failed with: error checking current version: unable to check route health: failed to GET route: dial tcp: lookup oauth-openshift.apps.tsedovic.upshift-dev.test on 172.30.0.10:53: no such host
I0925 13:13:01.763637       1 status_controller.go:165] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2019-09-25T12:34:23Z","message":"RouteHealthDegraded: failed to GET route: dial tcp: lookup oauth-openshift.apps.tsedovic.upshift-dev.test on 172.30.0.10:53: no such host","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2019-09-25T12:36:28Z","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2019-09-25T12:36:28Z","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2019-09-25T12:26:13Z","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}

Here is a link to a full log (executing the conformance/parallel test suite) from the OpenStack CI:
https://storage.googleapis.com/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-openstack-4.2/284/build-log.txt

Look for the "RouteHealthDegraded" lines and the corresponding "no such host" messages. These errors persist for up to a few minutes (the duration differs from case to case) and the operator eventually rights itself. In the meantime, however, the operator is degraded, and we have seen similar failures in tests that do not recover when the DNS resolution fails.

Expected results:

No DNS lookup errors in the authentication-operator or any other pod. Wildcard apps queries against the cluster DNS service (172.30.0.10) should always return an IP address.

Additional info:

The OpenStack and Baremetal platforms cannot rely on a DNS-as-a-service project to provide the records necessary to install and keep the cluster functioning, i.e. api-int, *.apps, etcd SRV and node names. As a result, we run a static coredns pod on every node before the cluster is even up (and therefore before it could rely on the cluster DNS), serve these records from there, and put its IP address as the first line in the node's /etc/resolv.conf (the remaining lines are the upstream resolvers added by OpenStack, which our static pod forwards to). This works well, but it relies on name requests preferentially going to the first nameserver in resolv.conf. That is mostly the case; however, the "openshift-dns/dns-default" pods backing the "openshift-dns/dns-default" service run CoreDNS configured to select the nameservers from resolv.conf in a random order. This breaks us because only the first nameserver is able to resolve the *.apps records; every other server is an upstream that knows nothing about the OpenShift cluster.
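The failure mode can be sketched with a toy model (this is not CoreDNS code, and the resolver IPs are made up): when only the first resolver in resolv.conf serves *.apps, CoreDNS's default random forward policy answers only a fraction of the queries, while a sequential policy always hits the right resolver.

```python
import random

# Toy model of nameserver selection in CoreDNS's forward plugin.
# Assumption: only the first resolv.conf entry (the node-local static
# CoreDNS) serves the *.apps wildcard; the rest are OpenStack upstreams.
# All IP addresses here are illustrative placeholders.
RESOLVERS = ["192.0.2.10", "203.0.113.1", "203.0.113.2"]
KNOWS_APPS = {"192.0.2.10"}

def resolve_apps(policy):
    """Return True if the resolver picked under `policy` can answer *.apps."""
    if policy == "sequential":
        chosen = RESOLVERS[0]             # always try in resolv.conf order
    else:                                 # "random" is the forward default
        chosen = random.choice(RESOLVERS)
    return chosen in KNOWS_APPS

# "sequential" always reaches the resolver that serves *.apps ...
assert all(resolve_apps("sequential") for _ in range(100))
# ... while "random" succeeds only about 1 in 3 times with three resolvers.
failures = sum(not resolve_apps("random") for _ in range(300))
print(f"random policy failures out of 300 queries: {failures}")
```

This matches what the watch-dig reproduction shows: the Answer section comes and goes depending on which upstream the query happened to be forwarded to.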
I've locally verified that adding `policy sequential` to the `forward` block in the Corefile fixes this issue. I'll open up a pull request soon.
verified with 4.3.0-0.ci-2019-10-09-222432 and `policy sequential` has been added to the Corefile:

$ oc get cm/dns-default -n openshift-dns -o yaml
apiVersion: v1
data:
  Corefile: |
    .:5353 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            upstream
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf {
            policy sequential
        }
        cache 30
        reload
    }
kind: ConfigMap
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062