Bug 1756344 - [IPI][OpenStack] Intermittent DNS errors for ingress records within the pods
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: DNS
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Dan Mace
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks: 1757124
Reported: 2019-09-27 12:52 UTC by Tomas Sedovic
Modified: 2020-05-18 21:10 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1757124
Environment:
Last Closed: 2020-05-13 21:25:42 UTC
Target Upstream Version:



Links
System | ID | Priority | Status | Summary | Last Updated
Github | openshift cluster-dns-operator pull 129 | None | closed | Bug 1756344: Add `policy sequential` to Corefile | 2020-06-25 14:56:49 UTC
Red Hat Product Errata | RHBA-2020:0062 | None | None | None | 2020-05-13 21:25:44 UTC

Description Tomas Sedovic 2019-09-27 12:52:36 UTC
Description of problem:

When OpenShift 4.2 IPI is deployed on OpenStack, pods trying to resolve the wildcard "apps" records intermittently fail to get an answer back.


How reproducible: always


Steps to Reproduce:
1. Deploy OpenShift 4.2 IPI using the openshift-installer on an OpenStack cluster
2. Run: "oc logs 2>&1 -f -n openshift-authentication-operator authentication-operator-996ddcc5b-bkdtt"
3. Watch for the "RouteHealthDegraded" and "lookup oauth-openshift.apps.<cluster domain> on 172.30.0.10:53: no such host" messages. They tend to appear about every five minutes

Alternatively:
1. SSH into a node in the cluster
2. Run: "watch -n1 dig @172.30.0.10 -p 53 oauth-openshift.apps.<cluster domain>"
3. Watch as the Answer section randomly appears and disappears
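
For a count rather than a visual check, the dig loop above can be sketched in Python. This is only a rough illustration: the helper names dig_short and count_missing_answers are made up here, and it assumes dig is installed on the node. The lookup is passed in as a function so the counting logic can be tried without a cluster.

```python
import subprocess
import time
from typing import Callable, List


def dig_short(name: str, server: str = "172.30.0.10", port: int = 53) -> List[str]:
    """Run `dig +short` against the given server and return its answer lines.

    An empty list means the Answer section was empty. Lines starting with ';'
    are dig diagnostics (for example timeouts) and are not counted as answers.
    """
    result = subprocess.run(
        ["dig", "+short", f"@{server}", "-p", str(port), name],
        capture_output=True,
        text=True,
        check=False,
    )
    lines = [line.strip() for line in result.stdout.splitlines()]
    return [line for line in lines if line and not line.startswith(";")]


def count_missing_answers(
    lookup: Callable[[str], List[str]],
    name: str,
    attempts: int,
    delay: float = 1.0,
) -> int:
    """Call lookup(name) `attempts` times and count how often it returns nothing."""
    missing = 0
    for attempt in range(attempts):
        if not lookup(name):
            missing += 1
        if attempt < attempts - 1:
            time.sleep(delay)  # roughly mirrors `watch -n1`
    return missing
```

On a node, something like count_missing_answers(dig_short, "oauth-openshift.apps.<cluster domain>", attempts=60) would report how many of 60 queries came back empty. On an affected cluster that number is expected to be above zero.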


Actual results:

Here is a snippet from the authentication-operator logs:

E0925 13:13:01.763313       1 controller.go:129] {AuthenticationOperator2 AuthenticationOperator2} failed with: error checking current version: unable to check route health: failed to GET route: dial tcp: lookup oauth-openshift.apps.tsedovic.upshift-dev.test on 172.30.0.10:53: no such host
I0925 13:13:01.763637       1 status_controller.go:165] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2019-09-25T12:34:23Z","message":"RouteHealthDegraded: failed to GET route: dial tcp: lookup oauth-openshift.apps.tsedovic.upshift-dev.test on 172.30.0.10:53: no such host","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2019-09-25T12:36:28Z","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2019-09-25T12:36:28Z","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2019-09-25T12:26:13Z","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}

Here is a link for a full log (executing the conformance/parallel test suite) from the OpenStack CI:

https://storage.googleapis.com/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-openstack-4.2/284/build-log.txt

Look for the "RouteHealthDegraded" lines and the corresponding "no such host" messages.

These errors persist for up to a few minutes (the duration differs from case to case) and the operator eventually rights itself. In the meantime, however, the operator is degraded, and we have seen similar issues fail tests that do not recover from a failed DNS resolution.



Expected results:

No DNS lookup errors in the authentication-operator or any other pod. Making wildcard apps queries against the cluster DNS service (172.30.0.10) should always return an IP address.


Additional info:

The OpenStack and Baremetal platforms cannot rely on a DNS-as-a-service project to provide the records needed to install the cluster and keep it functioning, i.e. api-int, *.apps, etcd SRV and node names.

As a result, we run a static coredns pod on every node before the cluster is even up (and therefore before it could rely on the cluster DNS), serve these records from there and put its IP address as the first line in the node's /etc/resolv.conf (the remaining lines are the upstream resolvers added by OpenStack that our static pod forwards to).

This functions well, but it relies on name requests preferentially going to the first nameserver in resolv.conf. That mostly holds, but the "openshift-dns/dns-default" pods backing the "openshift-dns/dns-default" service run CoreDNS configured in a way that it selects the nameservers from resolv.conf in a random order.

This breaks us because only the first nameserver is able to resolve the *.apps records; any other server is an upstream that doesn't know anything about the OpenShift cluster.
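
The effect of the selection order can be shown with a small Python simulation. It is a toy model, not CoreDNS code: the server names, the 192.0.2.10 address and the .apps.example.test suffix are invented for illustration. It also assumes every server is healthy and that a negative answer from the chosen server is returned as-is rather than retried on another one.

```python
import random
from typing import List, Optional

# Hypothetical names and addresses, for illustration only.
NODE_LOCAL = "node-local-coredns"  # first entry in resolv.conf, knows *.apps
SERVERS = [NODE_LOCAL, "upstream-a", "upstream-b"]
APPS_SUFFIX = ".apps.example.test"


def answer_from(server: str, name: str) -> Optional[str]:
    """Only the node-local server can answer *.apps names; upstreams cannot."""
    if server == NODE_LOCAL and name.endswith(APPS_SUFFIX):
        return "192.0.2.10"
    return None  # models "no such host"


def forward(name: str, servers: List[str], policy: str, rng: random.Random) -> Optional[str]:
    """Pick one server according to the policy and return whatever it answers."""
    if policy == "sequential":
        chosen = servers[0]  # all servers assumed healthy, so the first one wins
    elif policy == "random":
        chosen = rng.choice(servers)
    else:
        raise ValueError(f"unknown policy: {policy}")
    return answer_from(chosen, name)


def failure_count(policy: str, queries: int, seed: int = 0) -> int:
    """Count how many of `queries` lookups for an apps name get no answer."""
    rng = random.Random(seed)
    name = "oauth-openshift" + APPS_SUFFIX
    return sum(forward(name, SERVERS, policy, rng) is None for _ in range(queries))
```

Under this model, failure_count("sequential", 300) is zero, while failure_count("random", 300) fails whenever one of the two upstreams is picked. That matches the Answer section appearing and disappearing in the dig output.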

Comment 1 Tomas Sedovic 2019-09-27 13:00:55 UTC
I've locally verified that adding `policy sequential` to the `forward` block in the Corefile fixes this issue. I'll open up a pull request soon.

Comment 4 Hongan Li 2019-10-10 03:39:26 UTC
Verified with 4.3.0-0.ci-2019-10-09-222432; `policy sequential` has been added to the Corefile.

$ oc get cm/dns-default -n openshift-dns -o yaml
apiVersion: v1
data:
  Corefile: |
    .:5353 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            upstream
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf {
            policy sequential
        }
        cache 30
        reload
    }
kind: ConfigMap

Comment 6 errata-xmlrpc 2020-05-13 21:25:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

