What problem/issue/behavior are you having trouble with? What do you expect to see?

After upgrading to 4.7.13, several Go-based applications started reporting DNS resolution errors for FQDNs that resolve through chains of CNAMEs, so the DNS responses are large - for example:

```
level=error msg="Request from: prod.mydomain.com Namespace foo. Error: : Get \"https://prod.mydomain.com/blah\": dial tcp: lookup prod.mydomain.com on 11.32.0.10:53: cannot unmarshal DNS message"

$ dig prod.mydomain.com

; <<>> DiG 9.11.13-RedHat-9.11.13-6.el8_2.2 <<>> prod.mydomain.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 5825
;; flags: qr rd ra; QUERY: 1, ANSWER: 6, AUTHORITY: 6, ADDITIONAL: 7

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: f5a33702c9d7152571b66aa260c33157208c0199e3d261f7 (good)
;; QUESTION SECTION:
;prod.mydomain.com.				IN	A

;; ANSWER SECTION:
prod.mydomain.com.			5	IN	CNAME	prd-live.anotherdomain.com.
prd-live.anotherdomain.com.		5	IN	CNAME	prd-live-general.anotherdomain.com.
prd-live-general.anotherdomain.com.	5	IN	CNAME	prd-live-general.container.domain.com.
prd-live-general.container.domain.com.	5	IN	CNAME	prd-live-ingress.container.domain.com.
prd-live-ingress.container.domain.com.	5	IN	CNAME	vip-84127-1-004.dc.gs.com.
vip-84127-1-004.dc.gs.com.		60	IN	A	10.38.18.128

;; AUTHORITY SECTION:
dc.gs.com.		287	IN	NS	vip-101783-1-006.dc.gs.com.
dc.gs.com.		287	IN	NS	vip-101783-1-003.dc.gs.com.
dc.gs.com.		287	IN	NS	vip-107537-1-006.dc.gs.com.
dc.gs.com.		287	IN	NS	vip-111164-1-004.dc.gs.com.
dc.gs.com.		287	IN	NS	vip-107537-1-003.dc.gs.com.
dc.gs.com.		287	IN	NS	vip-111164-1-003.dc.gs.com.

;; ADDITIONAL SECTION:
vip-111164-1-003.dc.gs.com.	26	IN	A	10.231.173.43
vip-111164-1-004.dc.gs.com.	26	IN	A	10.231.171.16
vip-101783-1-003.dc.gs.com.	26	IN	A	10.205.35.60
vip-101783-1-006.dc.gs.com.	26	IN	A	10.205.38.252
vip-107537-1-003.dc.gs.com.	26	IN	A	10.238.118.15
vip-107537-1-006.dc.gs.com.	26	IN	A	10.238.117.162

;; Query time: 8 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Jun 11 10:48:07 BST 2021
;; MSG SIZE  rcvd: 615
```

These same FQDNs resolve fine on 4.6.19 and 4.6.23 clusters. The main difference between FQDNs that work and those that don't appears to be the size of the response: those below roughly 512 bytes resolve correctly, while larger ones fail (the exact threshold may vary). I suspect the response is being truncated and the truncated reply is then rejected by the Go DNS resolver code (a minimal Go reproduction is sketched at the end of this report).

Comparing the CoreDNS configuration between 4.6 and 4.7.13:

```
$ diff -u Corefile-4.6.23 Corefile-4.7.13
--- Corefile-4.6.23	2021-06-11 10:52:48.509166803 +0100
+++ Corefile-4.7.13	2021-06-11 10:52:22.742180047 +0100
@@ -1,13 +1,19 @@
+    bufsize 1232
     errors
-    health
+    health {
+        lameduck 20s
+    }
+    ready
     kubernetes cluster.local in-addr.arpa ip6.arpa {
         pods insecure
         upstream
         fallthrough in-addr.arpa ip6.arpa
     }
-    prometheus :9153
+    prometheus 127.0.0.1:9153
     forward . /etc/resolv.conf {
         policy sequential
     }
-    cache 30
+    cache 900 {
+        denial 9984 30
+    }
     reload
```

the bufsize change seems to be the most likely cause (a probe for testing this hypothesis is also sketched below).

Where are you experiencing the behavior? What environment?

Newly built 4.7.13 cluster - not currently production.

When does the behavior occur? Frequency? Repeatedly? At certain times?

Always, with certain FQDNs.

What information can you provide around timeframes and the business impact?

No business impact, but cluster handover is delayed due to this issue.
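For reference, here is a minimal sketch of the failing lookup path, assuming the affected applications resolve through Go's pure-Go resolver (the usual case for statically linked binaries in containers). The FQDN is the one from the error log above; substitute any name whose CNAME chain produces a large response:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// Force the pure-Go resolver: it parses DNS wire responses itself and
	// is the code path that reports "cannot unmarshal DNS message". The
	// cgo resolver would hand the lookup to glibc instead.
	r := &net.Resolver{PreferGo: true}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// prod.mydomain.com is the FQDN from the error log above.
	addrs, err := r.LookupHost(ctx, "prod.mydomain.com")
	if err != nil {
		// Expected failure mode on 4.7.13: "cannot unmarshal DNS message".
		fmt.Printf("lookup failed: %v\n", err)
		return
	}
	fmt.Println("resolved:", addrs)
}
```

Run from a pod on an affected 4.7.13 cluster, this fails for the large-response FQDNs; on 4.6.x the same program resolves them.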
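And a rough probe for the truncation hypothesis: it sends the same query with different advertised EDNS0 buffer sizes and reports the reply size and TC (truncated) bit, which should show at what point CoreDNS starts truncating. This is a sketch, not application code - the probe helper is hypothetical, and the server address 11.32.0.10:53 is taken from the error log, so adjust for your environment:

```go
package main

import (
	"fmt"
	"net"
	"time"

	"golang.org/x/net/dns/dnsmessage"
)

// probe sends one UDP query for name to server, advertising the given
// EDNS0 UDP payload size, and reports the reply size and TC bit.
func probe(server, name string, bufsize int) {
	var q dnsmessage.Message
	q.Header.ID = 1
	q.Header.RecursionDesired = true
	q.Questions = []dnsmessage.Question{{
		Name:  dnsmessage.MustNewName(name),
		Type:  dnsmessage.TypeA,
		Class: dnsmessage.ClassINET,
	}}

	// Attach an OPT pseudo-record advertising our receive buffer size.
	var opt dnsmessage.ResourceHeader
	if err := opt.SetEDNS0(bufsize, dnsmessage.RCodeSuccess, false); err != nil {
		panic(err)
	}
	q.Additionals = []dnsmessage.Resource{{Header: opt, Body: &dnsmessage.OPTResource{}}}

	packed, err := q.Pack()
	if err != nil {
		panic(err)
	}

	conn, err := net.Dial("udp", server)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	conn.SetDeadline(time.Now().Add(3 * time.Second))

	if _, err := conn.Write(packed); err != nil {
		panic(err)
	}
	buf := make([]byte, 65535)
	n, err := conn.Read(buf)
	if err != nil {
		panic(err)
	}

	var resp dnsmessage.Message
	if err := resp.Unpack(buf[:n]); err != nil {
		fmt.Printf("bufsize %4d: %3d bytes, unpack error: %v\n", bufsize, n, err)
		return
	}
	fmt.Printf("bufsize %4d: %3d bytes, truncated=%v, answers=%d\n",
		bufsize, n, resp.Header.Truncated, len(resp.Answers))
}

func main() {
	const server = "11.32.0.10:53" // cluster DNS address from the error log
	probe(server, "prod.mydomain.com.", 512)  // near the observed working/failing threshold
	probe(server, "prod.mydomain.com.", 1232) // what the 4.7 bufsize plugin configures
	probe(server, "prod.mydomain.com.", 4096) // what dig advertised in the capture above
}
```

If the 4.7.13 cluster truncates at sizes where 4.6.x does not, that would support the bufsize hypothesis.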
*** This bug has been marked as a duplicate of bug 1970889 ***