Bug 1970888 - cannot unmarshal DNS message
Summary: cannot unmarshal DNS message
Keywords:
Status: CLOSED DUPLICATE of bug 1970889
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: aos-network-edge-staff
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-11 12:09 UTC by peter ducai
Modified: 2022-08-04 22:39 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-11 13:42:34 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1953518 1 high CLOSED thanos-ruler pods failed to start up for "cannot unmarshal DNS message" 2022-02-04 08:58:24 UTC
Red Hat Bugzilla 1967890 1 unspecified CLOSED Observability Thanos store shard crashing - cannot unmarshal DNS message 2021-10-06 19:53:56 UTC

Description peter ducai 2021-06-11 12:09:09 UTC
What problem/issue/behavior are you having trouble with?  What do you expect to see?
After upgrading to 4.7.13, several Go-based applications started reporting DNS resolution errors for FQDNs that resolve through chains of CNAMEs and therefore return large DNS responses, for example:
```
level=error msg="Request from: prod.mydomain.com Namespace foo. Error: : Get \"https://prod.mydomain.com/blah\": dial tcp: lookup prod.mydomain.com on 11.32.0.10:53: cannot unmarshal DNS message"

$ dig prod.mydomain.com

; <<>> DiG 9.11.13-RedHat-9.11.13-6.el8_2.2 <<>> prod.mydomain.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 5825
;; flags: qr rd ra; QUERY: 1, ANSWER: 6, AUTHORITY: 6, ADDITIONAL: 7

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: f5a33702c9d7152571b66aa260c33157208c0199e3d261f7 (good)
;; QUESTION SECTION:
;prod.mydomain.com.     IN      A

;; ANSWER SECTION:
prod.mydomain.com. 5    IN      CNAME   prd-live.anotherdomain.com.
prd-live.anotherdomain.com.     5 IN CNAME prd-live-general.anotherdomain.com.
prd-live-general.anotherdomain.com.     5 IN CNAME prd-live-general.container.domain.com.
prd-live-general.container.domain.com. 5 IN CNAME       prd-live-ingress.container.domain.com.
prd-live-ingress.container.domain.com.  5 IN CNAME vip-84127-1-004.dc.gs.com.
vip-84127-1-004.dc.gs.com. 60   IN      A       10.38.18.128

;; AUTHORITY SECTION:
dc.gs.com.              287     IN      NS      vip-101783-1-006.dc.gs.com.
dc.gs.com.              287     IN      NS      vip-101783-1-003.dc.gs.com.
dc.gs.com.              287     IN      NS      vip-107537-1-006.dc.gs.com.
dc.gs.com.              287     IN      NS      vip-111164-1-004.dc.gs.com.
dc.gs.com.              287     IN      NS      vip-107537-1-003.dc.gs.com.
dc.gs.com.              287     IN      NS      vip-111164-1-003.dc.gs.com.

;; ADDITIONAL SECTION:
vip-111164-1-003.dc.gs.com. 26  IN      A       10.231.173.43
vip-111164-1-004.dc.gs.com. 26  IN      A       10.231.171.16
vip-101783-1-003.dc.gs.com. 26  IN      A       10.205.35.60
vip-101783-1-006.dc.gs.com. 26  IN      A       10.205.38.252
vip-107537-1-003.dc.gs.com. 26  IN      A       10.238.118.15
vip-107537-1-006.dc.gs.com. 26  IN      A       10.238.117.162

;; Query time: 8 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Jun 11 10:48:07 BST 2021
;; MSG SIZE  rcvd: 615
```
These same FQDNs resolve fine on 4.6.19 and 4.6.23 clusters.

The main difference between FQDNs that work and those that don't appears to be the size of the response: those below roughly 512 bytes resolve fine, while larger ones fail (the exact threshold is not confirmed).  I suspect the response is being truncated and then rejected by the Go DNS resolver code.
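
To illustrate the suspected failure mode (a rough sketch only, not the applications' actual code; the nameserver 11.32.0.10:53 and prod.mydomain.com are simply the values from the error log above), the following sends a plain UDP query, reads the answer into a buffer smaller than the full response, and then fails to parse the cut-off payload with Go's dnsmessage package:
```
// probe.go - rough sketch mimicking a resolver whose receive buffer is
// smaller than the DNS answer.  Substitute a real nameserver and FQDN.
package main

import (
	"fmt"
	"net"

	"golang.org/x/net/dns/dnsmessage"
)

func main() {
	// Build an A query and advertise a 4096-byte EDNS0 buffer so the server
	// is willing to send the full (~615-byte) answer over UDP.
	var opt dnsmessage.Resource
	if err := opt.Header.SetEDNS0(4096, dnsmessage.RCodeSuccess, false); err != nil {
		panic(err)
	}
	opt.Body = &dnsmessage.OPTResource{}

	q := dnsmessage.Message{
		Header: dnsmessage.Header{ID: 1, RecursionDesired: true},
		Questions: []dnsmessage.Question{{
			Name:  dnsmessage.MustNewName("prod.mydomain.com."),
			Type:  dnsmessage.TypeA,
			Class: dnsmessage.ClassINET,
		}},
		Additionals: []dnsmessage.Resource{opt},
	}
	query, err := q.Pack()
	if err != nil {
		panic(err)
	}

	conn, err := net.Dial("udp", "11.32.0.10:53")
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	if _, err := conn.Write(query); err != nil {
		panic(err)
	}

	// Read with a buffer that is too small: the rest of the datagram is
	// silently discarded, leaving a message cut off mid-record.
	buf := make([]byte, 512)
	n, err := conn.Read(buf)
	if err != nil {
		panic(err)
	}

	var resp dnsmessage.Message
	if err := resp.Unpack(buf[:n]); err != nil {
		// Same class of error the applications log: the payload cannot be
		// unmarshalled as a complete DNS message.
		fmt.Println("unpack failed:", err)
		return
	}
	fmt.Printf("parsed %d answer records (%d bytes)\n", len(resp.Answers), n)
}
```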

Comparing the CoreDNS configuration between 4.6 and 4.7.13:
```
$ diff -u Corefile-4.6.23 Corefile-4.7.13
--- Corefile-4.6.23     2021-06-11 10:52:48.509166803 +0100
+++ Corefile-4.7.13     2021-06-11 10:52:22.742180047 +0100
@@ -1,13 +1,19 @@
+        bufsize 1232
         errors
-        health
+        health {
+            lameduck 20s
+        }
+        ready
         kubernetes cluster.local in-addr.arpa ip6.arpa {
             pods insecure
             upstream
             fallthrough in-addr.arpa ip6.arpa
         }
-        prometheus :9153
+        prometheus 127.0.0.1:9153
         forward . /etc/resolv.conf {
             policy sequential
         }
-        cache 30
+        cache 900 {
+            denial 9984 30
+        }
         reload
```
it seems that the bufsize change is the most likely cause.
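
To confirm whether the Go resolver path is what breaks behind the new bufsize setting, something like the following could be run from a pod on the 4.7.13 cluster and compared against a 4.6 cluster (a minimal sketch; prod.mydomain.com again stands in for any affected FQDN):
```
// resolve_check.go - minimal sketch for comparing lookups on 4.6 vs 4.7.13.
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// Force Go's built-in resolver (the code path that reports
	// "cannot unmarshal DNS message") rather than the cgo/libc one.
	r := &net.Resolver{PreferGo: true}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	addrs, err := r.LookupHost(ctx, "prod.mydomain.com")
	if err != nil {
		fmt.Println("lookup failed:", err) // expected to fail on the affected cluster
		return
	}
	fmt.Println("resolved:", addrs)
}
```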

Where are you experiencing the behavior? What environment?
Newly built 4.7.13 cluster - not currently production.

When does the behavior occur? Frequency? Repeatedly? At certain times?
Always with certain FQDNs.

What information can you provide around timeframes and the business impact?
No business impact but cluster handover is delayed due to this issue.

Comment 1 Miciah Dashiel Butler Masters 2021-06-11 13:42:34 UTC

*** This bug has been marked as a duplicate of bug 1970889 ***

