Bug 1966116 - DNS SRV request which worked in 4.7.9 stopped working in 4.7.11
Summary: DNS SRV request which worked in 4.7.9 stopped working in 4.7.11
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: DNS
Version: 4.6
Hardware: All
OS: All
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: Stephen Greene
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks: 1967766
Reported: 2021-05-31 12:19 UTC by Gabriel Stein
Modified: 2021-07-27 23:10 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The fix for Bug 1953097 enabled the CoreDNS bufsize plugin with a size of 1232 bytes. Some primitive DNS resolvers cannot receive DNS response messages over UDP that are larger than 512 bytes. Note that DNS resolvers that retry lookups using TCP (such as dig) are not affected by this bug.
Consequence: Some DNS resolvers (such as Go's internal DNS library) are unable to receive long DNS responses from openshift-dns.
Fix: Set the CoreDNS bufsize to 512 bytes for all servers.
Result: DNS clients that require UDP DNS messages not to exceed 512 bytes function as expected.
Clone Of:
Environment:
Last Closed: 2021-07-27 23:10:35 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-dns-operator pull 276 0 None open Bug 1966116: Corefile: Set bufsize to 512 bytes for all servers 2021-06-02 17:45:12 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 23:10:48 UTC

Comment 4 Stephen Greene 2021-06-01 19:10:51 UTC
I was also able to reproduce this on 4.8.0-0.ci-2021-06-01-124321. Looks like querying for the SRV record via dig works, but using Go's resolver does not. Using a shorter service/namespace name does appear to mitigate the issue (to some extent).

I was able to confirm that dropping the `bufsize 1232` option from the main section (:5353) of the CoreDNS Corefile resolves the issue, so this does appear to be a regression introduced by https://bugzilla.redhat.com/show_bug.cgi?id=1953097.

See the following upstream Go issues relevant to this issue. At the moment, it's not precisely clear to me why Go's resolver is not compatible with a bufsize of 1232 when it comes to SRV records, but I am working to better understand that.

https://github.com/golang/go/issues/21160
https://github.com/golang/go/issues/37362
https://github.com/golang/go/issues/10622
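
The dig-versus-Go difference described above can be checked with a small program along these lines. This is a minimal sketch, not the reproducer used in this bug: the service and namespace names are hypothetical, and `srvQueryName` and `lookupSRV` are illustrative helpers. Setting `PreferGo` asks for Go's built-in resolver rather than the system one; whether the lookup fails depends on the Go version and on the response size.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"os"
	"time"
)

// srvQueryName mirrors the name that net.LookupSRV builds per RFC 2782:
// _service._proto.name
func srvQueryName(service, proto, name string) string {
	return "_" + service + "._" + proto + "." + name
}

// lookupSRV resolves an SRV record using Go's built-in DNS client.
func lookupSRV(ctx context.Context, service, proto, name string) ([]*net.SRV, error) {
	r := &net.Resolver{PreferGo: true}
	_, addrs, err := r.LookupSRV(ctx, service, proto, name)
	return addrs, err
}

func main() {
	// Hypothetical service; substitute a real (long) service and namespace name.
	service, proto := "http", "tcp"
	name := "my-long-service-name.my-long-namespace-name.svc.cluster.local"

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	fmt.Println("querying", srvQueryName(service, proto, name))
	addrs, err := lookupSRV(ctx, service, proto, name)
	if err != nil {
		fmt.Fprintln(os.Stderr, "lookup failed:", err)
		os.Exit(1)
	}
	for _, a := range addrs {
		fmt.Printf("%s:%d\n", a.Target, a.Port)
	}
}
```

Running the same query with dig from the same pod gives the comparison point.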

Comment 8 Miciah Dashiel Butler Masters 2021-06-02 16:09:46 UTC
This appears to be a regression per comment 4, and we backported the change that caused it to 4.6.z, so we'll need to fix this in 4.8 and backport to 4.6.z.

Comment 9 Stephen Greene 2021-06-02 18:07:39 UTC
> Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?

Customers running workloads that utilize Go's built-in DNS resolver, such as Grafana Loki, to resolve DNS records that exceed 512 bytes. This bug is a regression caused by the fix for Bug 1949361, which merged into 4.7.11 and 4.6.30.
Other primitive DNS resolvers that cannot accept UDP DNS messages longer than 512 bytes would also be affected. Note that DNS resolvers that retry lookups using TCP (such as dig) are not affected by this bug.

> What is the impact?  Is it serious enough to warrant blocking edges?

This bug could affect DNS queries of any type for primitive DNS resolvers, but long-winded SRV lookups, such as those used by Loki, are likely to hit this issue.

> How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

https://access.redhat.com/solutions/5984291 details an immediate workaround that involves configuring CoreDNS to force the use of TCP for DNS. This workaround is entirely unsupported.
Service names can also be shortened in the case of SRV lookups as a potential mitigation. In some cases, a workload's DNS client could be switched out, or better configured, so that either DNS UDP messages longer than 512 bytes are accepted or failed queries are retried over TCP.

> Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?

Yes, this is a regression caused by the fix merged for Bug 1949361.

Comment 11 Hongan Li 2021-06-04 09:02:18 UTC
Verified with 4.8.0-0.nightly-2021-06-03-221810; passed.

The bufsize is set to 512 for all servers.

# oc -n openshift-dns get cm/dns-default -oyaml
apiVersion: v1
data:
  Corefile: |
    # test
    mytest.ocp:5353 {
        forward . 192.168.1.1
        errors
        bufsize 512
    }
    .:5353 {
        bufsize 512
        errors
        health {
            lameduck 20s
        }

Comment 16 errata-xmlrpc 2021-07-27 23:10:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

