Bug 1966116

Summary: DNS SRV request which worked in 4.7.9 stopped working in 4.7.11
Product: OpenShift Container Platform
Component: Networking
Networking sub component: DNS
Status: CLOSED ERRATA
Severity: urgent
Priority: high
Version: 4.6
Keywords: Regression
Target Release: 4.8.0
Hardware: All
OS: All
Reporter: Gabriel Stein <gferrazs>
Assignee: Stephen Greene <sgreene>
QA Contact: Hongan Li <hongli>
CC: aos-bugs, mapandey, mjoseph, mmasters, sgreene
Doc Type: Bug Fix
Doc Text:
Cause: The fix for Bug 1953097 enabled the CoreDNS bufsize plugin with a size of 1232 bytes. Some primitive DNS resolvers cannot receive DNS response messages over UDP that are larger than 512 bytes. DNS resolvers that retry lookups over TCP (such as dig) are not affected by this bug.
Consequence: Some DNS resolvers (such as Go's internal DNS library) are unable to receive long DNS responses from openshift-dns.
Fix: Set the CoreDNS bufsize to 512 bytes for all servers.
Result: DNS clients that require UDP DNS messages not to exceed 512 bytes function as expected.
Type: Bug
Last Closed: 2021-07-27 23:10:35 UTC
Bug Blocks: 1967766

Comment 4 Stephen Greene 2021-06-01 19:10:51 UTC
I was also able to reproduce this on 4.8.0-0.ci-2021-06-01-124321. Looks like querying for the SRV record via dig works, but using Go's resolver does not. Using a shorter service/namespace name does appear to mitigate the issue (to some extent).

I was able to confirm that dropping the `bufsize 1232` option from the main section (:5353) of the CoreDNS Corefile resolves the issue, so this does appear to be a regression introduced by https://bugzilla.redhat.com/show_bug.cgi?id=1953097.

The upstream Go issues below are relevant. At the moment, it's not precisely clear to me why Go's resolver is incompatible with a bufsize of 1232 when it comes to SRV records, but I am working to better understand that.

https://github.com/golang/go/issues/21160
https://github.com/golang/go/issues/37362
https://github.com/golang/go/issues/10622
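
For illustration, a minimal sketch of how the Go side of this reproduction might look. It forces Go's built-in DNS client (rather than the platform resolver) and performs an SRV lookup. The helper names and the service name in `main` are hypothetical and not taken from this report; substitute a real headless service with enough endpoints to produce a long response.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// pureGoResolver returns a resolver that always uses Go's built-in DNS
// client instead of the platform (cgo) resolver.
func pureGoResolver() *net.Resolver {
	return &net.Resolver{PreferGo: true}
}

// lookupSRVTargets resolves _service._proto.name and returns the answers
// as "target:port" strings.
func lookupSRVTargets(ctx context.Context, r *net.Resolver, service, proto, name string) ([]string, error) {
	_, records, err := r.LookupSRV(ctx, service, proto, name)
	if err != nil {
		return nil, err
	}
	targets := make([]string, 0, len(records))
	for _, rec := range records {
		targets = append(targets, fmt.Sprintf("%s:%d", rec.Target, rec.Port))
	}
	return targets, nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Hypothetical service and namespace names, chosen only to be long.
	targets, err := lookupSRVTargets(ctx, pureGoResolver(), "grpc", "tcp",
		"my-long-service-name.my-long-namespace.svc.cluster.local")
	if err != nil {
		fmt.Println("SRV lookup failed:", err)
		return
	}
	fmt.Println(targets)
}
```

Running the same query with dig from the same pod gives a point of comparison, since dig retries over TCP when needed.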

Comment 8 Miciah Dashiel Butler Masters 2021-06-02 16:09:46 UTC
This appears to be a regression per comment 4, and we backported the change that caused it to 4.6.z, so we'll need to fix this in 4.8 and backport to 4.6.z.

Comment 9 Stephen Greene 2021-06-02 18:07:39 UTC
> Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?

Customers running workloads that utilize Go's built-in DNS resolver, such as Grafana Loki, to resolve DNS records that exceed 512 bytes. This bug is a regression caused by the fix for Bug 1949361, which merged into 4.7.11 and 4.6.30.
Other primitive DNS resolvers that cannot accept UDP DNS messages longer than 512 bytes would also be affected. DNS resolvers that retry lookups over TCP (such as dig) are not affected by this bug.

> What is the impact?  Is it serious enough to warrant blocking edges?

This bug could affect DNS queries of any type for primitive DNS resolvers, but long-winded SRV lookups, such as those used by Loki, are likely to hit this issue.
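
To get a feel for when an SRV response crosses the 512-byte mark, here is a rough sketch that estimates the size of the question and answer sections without name compression. It ignores the additional section and any EDNS0 OPT record, and real servers usually compress owner names, so treat the number as an approximation only. All names in `main` are made up.

```go
package main

import "fmt"

const (
	dnsHeaderLen  = 12  // fixed DNS message header
	classicUDPMax = 512 // UDP payload limit without EDNS0 (RFC 1035)
)

// wireNameLen returns the uncompressed wire length of a domain name:
// one length byte per label plus the terminating root byte.
func wireNameLen(name string) int {
	if name == "" || name == "." {
		return 1
	}
	if name[len(name)-1] == '.' {
		return len(name) + 1
	}
	return len(name) + 2
}

// srvResponseEstimate approximates the size of an SRV response that
// carries one answer per target, with no name compression.
func srvResponseEstimate(qname string, targets []string) int {
	size := dnsHeaderLen + wireNameLen(qname) + 4 // question: name + type + class
	for _, t := range targets {
		// owner name + type/class/TTL/rdlength (10) + priority/weight/port (6) + target
		size += wireNameLen(qname) + 10 + 6 + wireNameLen(t)
	}
	return size
}

func main() {
	// Hypothetical headless service with ten endpoints.
	qname := "_grpc._tcp.my-long-service-name.my-long-namespace.svc.cluster.local"
	var targets []string
	for i := 0; i < 10; i++ {
		targets = append(targets, fmt.Sprintf(
			"10-128-0-%d.my-long-service-name.my-long-namespace.svc.cluster.local", i))
	}
	size := srvResponseEstimate(qname, targets)
	fmt.Printf("estimated %d bytes, exceeds %d: %t\n", size, classicUDPMax, size > classicUDPMax)
}
```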

> How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

https://access.redhat.com/solutions/5984291 details an immediate workaround that involves configuring CoreDNS to force the use of TCP for DNS. This workaround is entirely unsupported.
Service names can also be shortened in the case of SRV lookups as a potential mitigation. In some cases, a workload's DNS client could be switched out, or better configured, so that either DNS UDP messages longer than 512 bytes are accepted or failed queries are retried over TCP.

> Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?

Yes, this is a regression caused by the fix merged for Bug 1949361.

Comment 11 Hongan Li 2021-06-04 09:02:18 UTC
Verified with 4.8.0-0.nightly-2021-06-03-221810; passed.

The bufsize is set to 512 for all servers.

# oc -n openshift-dns get cm/dns-default -oyaml
apiVersion: v1
data:
  Corefile: |
    # test
    mytest.ocp:5353 {
        forward . 192.168.1.1
        errors
        bufsize 512
    }
    .:5353 {
        bufsize 512
        errors
        health {
            lameduck 20s
        }

Comment 16 errata-xmlrpc 2021-07-27 23:10:35 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438