I was also able to reproduce this on 4.8.0-0.ci-2021-06-01-124321. Looks like querying for the SRV record via dig works, but using Go's resolver does not. Using a shorter service/namespace name does appear to mitigate the issue (to some extent).
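For anyone wanting to repeat the comparison against dig, a minimal Go sketch of the SRV lookup follows. The service, protocol, and domain names are hypothetical placeholders, and `srvQueryName` and `newGoResolver` are helper names introduced only for this illustration. `PreferGo` asks the standard library to use its built-in resolver where available, which is the code path reported as failing here.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// srvQueryName shows the name that an SRV lookup queries for:
// _service._proto.name (this is how net.LookupSRV documents it).
func srvQueryName(service, proto, name string) string {
	return "_" + service + "._" + proto + "." + name
}

// newGoResolver returns a resolver that prefers Go's built-in DNS client
// over the system (cgo) resolver.
func newGoResolver() *net.Resolver {
	return &net.Resolver{PreferGo: true}
}

func main() {
	// Hypothetical names; substitute a real headless service and namespace.
	service, proto := "http", "tcp"
	name := "my-long-service-name.my-long-namespace.svc.cluster.local"

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	fmt.Println("querying", srvQueryName(service, proto, name))
	_, addrs, err := newGoResolver().LookupSRV(ctx, service, proto, name)
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	for _, a := range addrs {
		fmt.Printf("%s:%d\n", a.Target, a.Port)
	}
}
```

Run from a pod in the affected cluster, this should show whether the Go resolver returns the records that dig returns for the same name.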
I was able to confirm that dropping the `bufsize 1232` option from the main section (:5353) of the CoreDNS Corefile resolves the issue, so this does appear to be a regression introduced by https://bugzilla.redhat.com/show_bug.cgi?id=1953097.
See the following upstream Go issues relevant to this issue. At the moment, it's not precisely clear to me why Go's resolver is not compatible with a bufsize of 1232 when it comes to SRV records, but I am working to better understand that.
This appears to be a regression per comment 4, and we backported the change that caused it to 4.6.z, so we'll need to fix this in 4.8 and backport to 4.6.z.
> Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
Customers running workloads that utilize Go's built-in DNS resolver, such as Grafana Loki, to resolve DNS records that exceed 512 bytes. This bug is a regression caused by the fix for Bug 1949361, which merged into 4.7.11 and 4.6.30.
Other primitive DNS resolvers that cannot accept UDP DNS messages longer than 512 bytes would also be affected. Note that DNS resolvers that retry lookups using TCP (such as dig) are not affected by this bug.
> What is the impact? Is it serious enough to warrant blocking edges?
This bug could affect DNS queries of any type made by primitive DNS resolvers, but SRV lookups with long service or namespace names, such as those used by Loki, are the most likely to hit this issue.
> How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
https://access.redhat.com/solutions/5984291 details an immediate workaround that involves configuring CoreDNS to force the use of TCP for DNS. This workaround is entirely unsupported.
Service names can also be shortened in the case of SRV lookups as a potential mitigation. In some cases, a workload's DNS client could be switched out, or better configured, so that either DNS UDP messages longer than 512 bytes are accepted or failed queries are retried over TCP.
> Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
Yes, this is a regression caused by the fix merged for Bug 1949361.
Verified with 4.8.0-0.nightly-2021-06-03-221810 and passed.
The bufsize is set to 512 for all servers.
# oc -n openshift-dns get cm/dns-default -oyaml
forward . 192.168.1.1
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.