Bug 1970888

Summary: cannot unmarshal DNS message
Product: OpenShift Container Platform
Reporter: peter ducai <pducai>
Component: Networking
Assignee: aos-network-edge-staff <aos-network-edge-staff>
Networking sub component: DNS
QA Contact: Hongan Li <hongli>
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
CC: aos-bugs, mmasters
Version: 4.7
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Last Closed: 2021-06-11 13:42:34 UTC
Type: Bug

Description peter ducai 2021-06-11 12:09:09 UTC
What problem/issue/behavior are you having trouble with?  What do you expect to see?
After upgrading to 4.7.13, several Go-based applications started reporting DNS resolution errors for FQDNs that resolve through long CNAME chains, which makes the DNS responses large. For example:
```
level=error msg="Request from: prod.mydomain.com Namespace foo. Error: : Get \"https://prod.mydomain.com/blah\": dial tcp: lookup prod.mydomain.com on 11.32.0.10:53: cannot unmarshal DNS message"

$ dig prod.mydomain.com

; <<>> DiG 9.11.13-RedHat-9.11.13-6.el8_2.2 <<>> prod.mydomain.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 5825
;; flags: qr rd ra; QUERY: 1, ANSWER: 6, AUTHORITY: 6, ADDITIONAL: 7

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: f5a33702c9d7152571b66aa260c33157208c0199e3d261f7 (good)
;; QUESTION SECTION:
;prod.mydomain.com.     IN      A

;; ANSWER SECTION:
prod.mydomain.com. 5    IN      CNAME   prd-live.anotherdomain.com.
prd-live.anotherdomain.com.     5 IN CNAME prd-live-general.anotherdomain.com.
prd-live-general.anotherdomain.com.     5 IN CNAME prd-live-general.container.domain.com.
prd-live-general.container.domain.com. 5 IN CNAME       prd-live-ingress.container.domain.com.
prd-live-ingress.container.domain.com.  5 IN CNAME vip-84127-1-004.dc.gs.com.
vip-84127-1-004.dc.gs.com. 60   IN      A       10.38.18.128

;; AUTHORITY SECTION:
dc.gs.com.              287     IN      NS      vip-101783-1-006.dc.gs.com.
dc.gs.com.              287     IN      NS      vip-101783-1-003.dc.gs.com.
dc.gs.com.              287     IN      NS      vip-107537-1-006.dc.gs.com.
dc.gs.com.              287     IN      NS      vip-111164-1-004.dc.gs.com.
dc.gs.com.              287     IN      NS      vip-107537-1-003.dc.gs.com.
dc.gs.com.              287     IN      NS      vip-111164-1-003.dc.gs.com.

;; ADDITIONAL SECTION:
vip-111164-1-003.dc.gs.com. 26  IN      A       10.231.173.43
vip-111164-1-004.dc.gs.com. 26  IN      A       10.231.171.16
vip-101783-1-003.dc.gs.com. 26  IN      A       10.205.35.60
vip-101783-1-006.dc.gs.com. 26  IN      A       10.205.38.252
vip-107537-1-003.dc.gs.com. 26  IN      A       10.238.118.15
vip-107537-1-006.dc.gs.com. 26  IN      A       10.238.117.162

;; Query time: 8 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Jun 11 10:48:07 BST 2021
;; MSG SIZE  rcvd: 615
```
These same FQDNs resolve fine on 4.6.19 and 4.6.23 clusters.

The main difference between FQDNs that work and those that don't appears to be the size of the response: those below roughly 512 bytes resolve fine, while larger ones fail (the exact cutoff may differ). I suspect the response is being truncated and then rejected by the Go DNS resolver code.
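For reference, a minimal sketch of how these lookups surface in a Go application; the hostname is the one from the log above, and the program itself is illustrative rather than taken from the affected apps:
```
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// PreferGo forces the pure-Go resolver, which is the code path that
	// returns "cannot unmarshal DNS message" when it cannot parse the
	// UDP response it gets back from the cluster DNS service.
	r := &net.Resolver{PreferGo: true}

	addrs, err := r.LookupHost(ctx, "prod.mydomain.com")
	if err != nil {
		fmt.Println("lookup failed:", err) // e.g. "... cannot unmarshal DNS message"
		return
	}
	fmt.Println("resolved:", addrs)
}
```
As a diagnostic aid (not a fix), running an affected application with GODEBUG=netdns=cgo should route lookups through glibc instead of the Go resolver; if the error then disappears, that points at the Go-side parsing of the UDP response rather than at the records themselves.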

Comparing the CoreDNS configuration between 4.6 and 4.7.13:
```
$ diff -u Corefile-4.6.23 Corefile-4.7.13
--- Corefile-4.6.23     2021-06-11 10:52:48.509166803 +0100
+++ Corefile-4.7.13     2021-06-11 10:52:22.742180047 +0100
@@ -1,13 +1,19 @@
+        bufsize 1232
         errors
-        health
+        health {
+            lameduck 20s
+        }
+        ready
         kubernetes cluster.local in-addr.arpa ip6.arpa {
             pods insecure
             upstream
             fallthrough in-addr.arpa ip6.arpa
         }
-        prometheus :9153
+        prometheus 127.0.0.1:9153
         forward . /etc/resolv.conf {
             policy sequential
         }
-        cache 30
+        cache 900 {
+            denial 9984 30
+        }
         reload
```
It seems that the bufsize change is the most likely cause.
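A minimal sketch of how one could probe this, assuming the third-party github.com/miekg/dns library (not part of this report) and using the nameserver address from the error log: send the same query while advertising a 1232-byte EDNS0 UDP buffer, which appears to be what the new `bufsize 1232` line makes CoreDNS use, and see whether the reply is truncated or fails to parse:
```
package main

import (
	"fmt"
	"log"

	"github.com/miekg/dns"
)

func main() {
	const server = "11.32.0.10:53" // cluster DNS address from the error log

	m := new(dns.Msg)
	m.SetQuestion(dns.Fqdn("prod.mydomain.com"), dns.TypeA)
	// Advertise a 1232-byte EDNS0 UDP buffer, mirroring the new
	// "bufsize 1232" setting in the 4.7 Corefile.
	m.SetEdns0(1232, false)

	c := &dns.Client{Net: "udp"}
	r, _, err := c.Exchange(m, server)
	if err != nil {
		log.Fatalf("exchange failed: %v", err)
	}

	fmt.Printf("rcode=%s truncated=%v answers=%d msg_len=%d\n",
		dns.RcodeToString[r.Rcode], r.Truncated, len(r.Answer), r.Len())
}
```
If the reply comes back with TC set (or fails to parse) when a 1232-byte buffer is advertised, but looks fine with a 4096-byte buffer, that would support the theory that the bufsize change, rather than the records themselves, is what broke these lookups.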

Where are you experiencing the behavior? What environment?
Newly built 4.7.13 cluster - not currently production.

When does the behavior occur? Frequency? Repeatedly? At certain times?
Always with certain FQDNs.

What information can you provide around timeframes and the business impact?
No business impact, but cluster handover is delayed due to this issue.

Comment 1 Miciah Dashiel Butler Masters 2021-06-11 13:42:34 UTC

*** This bug has been marked as a duplicate of bug 1970889 ***