Bug 1970888 - cannot unmarshal DNS message
Summary: cannot unmarshal DNS message
Keywords:
Status: CLOSED DUPLICATE of bug 1970889
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: aos-network-edge-staff
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-11 12:09 UTC by peter ducai
Modified: 2022-08-04 22:39 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-11 13:42:34 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1953518 1 high CLOSED thanos-ruler pods failed to start up for "cannot unmarshal DNS message" 2022-02-04 08:58:24 UTC
Red Hat Bugzilla 1967890 1 unspecified CLOSED Observability Thanos store shard crashing - cannot unmarshal DNS message 2021-10-06 19:53:56 UTC

Description peter ducai 2021-06-11 12:09:09 UTC
What problem/issue/behavior are you having trouble with?  What do you expect to see?
After upgrading to 4.7.13, several Go-based applications started reporting DNS resolution errors for FQDNs that resolve through chains of CNAMEs and therefore return large DNS responses, for example:
```
level=error msg="Request from: prod.mydomain.com Namespace foo. Error: : Get \"https://prod.mydomain.com/blah\": dial tcp: lookup prod.mydomain.com on 11.32.0.10:53: cannot unmarshal DNS message"

$ dig prod.mydomain.com

; <<>> DiG 9.11.13-RedHat-9.11.13-6.el8_2.2 <<>> prod.mydomain.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 5825
;; flags: qr rd ra; QUERY: 1, ANSWER: 6, AUTHORITY: 6, ADDITIONAL: 7

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: f5a33702c9d7152571b66aa260c33157208c0199e3d261f7 (good)
;; QUESTION SECTION:
;prod.mydomain.com.     IN      A

;; ANSWER SECTION:
prod.mydomain.com. 5    IN      CNAME   prd-live.anotherdomain.com.
prd-live.anotherdomain.com.     5 IN CNAME prd-live-general.anotherdomain.com.
prd-live-general.anotherdomain.com.     5 IN CNAME prd-live-general.container.domain.com.
prd-live-general.container.domain.com. 5 IN CNAME       prd-live-ingress.container.domain.com.
prd-live-ingress.container.domain.com.  5 IN CNAME vip-84127-1-004.dc.gs.com.
vip-84127-1-004.dc.gs.com. 60   IN      A       10.38.18.128

;; AUTHORITY SECTION:
dc.gs.com.              287     IN      NS      vip-101783-1-006.dc.gs.com.
dc.gs.com.              287     IN      NS      vip-101783-1-003.dc.gs.com.
dc.gs.com.              287     IN      NS      vip-107537-1-006.dc.gs.com.
dc.gs.com.              287     IN      NS      vip-111164-1-004.dc.gs.com.
dc.gs.com.              287     IN      NS      vip-107537-1-003.dc.gs.com.
dc.gs.com.              287     IN      NS      vip-111164-1-003.dc.gs.com.

;; ADDITIONAL SECTION:
vip-111164-1-003.dc.gs.com. 26  IN      A       10.231.173.43
vip-111164-1-004.dc.gs.com. 26  IN      A       10.231.171.16
vip-101783-1-003.dc.gs.com. 26  IN      A       10.205.35.60
vip-101783-1-006.dc.gs.com. 26  IN      A       10.205.38.252
vip-107537-1-003.dc.gs.com. 26  IN      A       10.238.118.15
vip-107537-1-006.dc.gs.com. 26  IN      A       10.238.117.162

;; Query time: 8 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Jun 11 10:48:07 BST 2021
;; MSG SIZE  rcvd: 615
```
These same FQDNs resolve fine on 4.6.19 and 4.6.23 clusters.

The main difference between FQDNs that work and those that don't appears to be the size of the response: those below roughly 512 bytes resolve fine, while larger ones fail (the exact threshold is not confirmed).  I suspect the response is being truncated and then rejected by the Go DNS resolver code.
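
To illustrate the suspected failure mode (a rough sketch only, not the applications' actual code; the nameserver 11.32.0.10:53 and prod.mydomain.com are simply the values from the error log above), the following sends a plain UDP query, reads the answer into a buffer smaller than the full response, and then fails to parse the cut-off payload with Go's dnsmessage package:
```
// probe.go - rough sketch mimicking a resolver whose receive buffer is
// smaller than the DNS answer.  Substitute a real nameserver and FQDN.
package main

import (
	"fmt"
	"net"

	"golang.org/x/net/dns/dnsmessage"
)

func main() {
	// Build an A query and advertise a 4096-byte EDNS0 buffer so the server
	// is willing to send the full (~615-byte) answer over UDP.
	var opt dnsmessage.Resource
	if err := opt.Header.SetEDNS0(4096, dnsmessage.RCodeSuccess, false); err != nil {
		panic(err)
	}
	opt.Body = &dnsmessage.OPTResource{}

	q := dnsmessage.Message{
		Header: dnsmessage.Header{ID: 1, RecursionDesired: true},
		Questions: []dnsmessage.Question{{
			Name:  dnsmessage.MustNewName("prod.mydomain.com."),
			Type:  dnsmessage.TypeA,
			Class: dnsmessage.ClassINET,
		}},
		Additionals: []dnsmessage.Resource{opt},
	}
	query, err := q.Pack()
	if err != nil {
		panic(err)
	}

	conn, err := net.Dial("udp", "11.32.0.10:53")
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	if _, err := conn.Write(query); err != nil {
		panic(err)
	}

	// Read with a buffer that is too small: the rest of the datagram is
	// silently discarded, leaving a message cut off mid-record.
	buf := make([]byte, 512)
	n, err := conn.Read(buf)
	if err != nil {
		panic(err)
	}

	var resp dnsmessage.Message
	if err := resp.Unpack(buf[:n]); err != nil {
		// Same class of error the applications log: the payload cannot be
		// unmarshalled as a complete DNS message.
		fmt.Println("unpack failed:", err)
		return
	}
	fmt.Printf("parsed %d answer records (%d bytes)\n", len(resp.Answers), n)
}
```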

Comparing the CoreDNS configuration between 4.6 and 4.7.13:
```
$ diff -u Corefile-4.6.23 Corefile-4.7.13
--- Corefile-4.6.23     2021-06-11 10:52:48.509166803 +0100
+++ Corefile-4.7.13     2021-06-11 10:52:22.742180047 +0100
@@ -1,13 +1,19 @@
+        bufsize 1232
         errors
-        health
+        health {
+            lameduck 20s
+        }
+        ready
         kubernetes cluster.local in-addr.arpa ip6.arpa {
             pods insecure
             upstream
             fallthrough in-addr.arpa ip6.arpa
         }
-        prometheus :9153
+        prometheus 127.0.0.1:9153
         forward . /etc/resolv.conf {
             policy sequential
         }
-        cache 30
+        cache 900 {
+            denial 9984 30
+        }
         reload
```
it seems that the bufsize change is the most likely cause.
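
To confirm whether the Go resolver path is what breaks behind the new bufsize setting, something like the following could be run from a pod on the 4.7.13 cluster and compared against a 4.6 cluster (a minimal sketch; prod.mydomain.com again stands in for any affected FQDN):
```
// resolve_check.go - minimal sketch for comparing lookups on 4.6 vs 4.7.13.
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// Force Go's built-in resolver (the code path that reports
	// "cannot unmarshal DNS message") rather than the cgo/libc one.
	r := &net.Resolver{PreferGo: true}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	addrs, err := r.LookupHost(ctx, "prod.mydomain.com")
	if err != nil {
		fmt.Println("lookup failed:", err) // expected to fail on the affected cluster
		return
	}
	fmt.Println("resolved:", addrs)
}
```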

Where are you experiencing the behavior? What environment?
Newly built 4.7.13 cluster - not currently production.

When does the behavior occur? Frequency? Repeatedly? At certain times?
Always with certain FQDNs.

What information can you provide around timeframes and the business impact?
No business impact but cluster handover is delayed due to this issue.

Comment 1 Miciah Dashiel Butler Masters 2021-06-11 13:42:34 UTC

*** This bug has been marked as a duplicate of bug 1970889 ***

