Bug 2043046 - nslookup reporting Truncated, retrying in TCP mode errors
Summary: nslookup reporting Truncated, retrying in TCP mode errors
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
unspecified
urgent
Target Milestone: ---
: ---
Assignee: aos-network-edge-staff
QA Contact: Melvin Joseph
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-01-20 14:31 UTC by Andy Bartlett
Modified: 2023-09-18 04:30 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-01 16:03:13 UTC
Target Upstream Version:
Embargoed:



Description Andy Bartlett 2022-01-20 14:31:06 UTC
Description of problem:

My customer has reported the following issue:

[root@control-host-01 ~]# kubectl run -i -t network-utils --image=amouat/network-utils --restart=Never -n testns
If you don't see a command prompt, try pressing enter.
root@network-utils:/#
bash-5.1# nslookup www.google.nl
;; Truncated, retrying in TCP mode.
Server:		172.30.0.10
Address:	172.30.0.10#53

Non-authoritative answer:
Name:	www.google.nl
Address: 142.250.180.3
Name:	www.google.nl
Address: 2a00:1450:4009:816::2003

bash-5.1# for i in $(seq 1 100); do nslookup vk.nl | grep -i tcp; done
;; Truncated, retrying in TCP mode.
;; Truncated, retrying in TCP mode.
;; Truncated, retrying in TCP mode.
;; Truncated, retrying in TCP mode.
;; Truncated, retrying in TCP mode.
;; Truncated, retrying in TCP mode.
;; Truncated, retrying in TCP mode.


OpenShift release version:

OCP 4.8.18 and above

Cluster Platform:

Confirmed on Openstack and VMware

How reproducible:

100%

Steps to Reproduce (in detail):
1.
2.
3.


Actual results:


Expected results:


Impact of the problem:

This has caused my customer to put a hold on the 4.7 to 4.8 upgrades


Additional info:

The issue appears to be caused by the "bufsize 512" setting in CoreDNS. If this setting is removed, everything works normally.


** Please do not disregard the report template; filling the template out as much as possible will allow us to help you. Please consider attaching a must-gather archive (via `oc adm must-gather`). Please review must-gather contents for sensitive information before attaching any must-gathers to a bugzilla report.  You may also mark the bug private if you wish.

Comment 3 Miciah Dashiel Butler Masters 2022-01-21 01:46:54 UTC
The description of this report doesn't explicitly state what the expected behavior is, but based on the summary of the report ("nslookup reporting Truncated, retrying in TCP mode errors"), I assume the customer is concerned about the "Truncated, retrying in TCP mode" messages from nslookup.  

The basic DNS protocol allows query and response sizes of up to 512 bytes when using UDP; it is necessary to use TCP or an extension to the DNS protocol to accommodate larger queries or responses.  The "truncated" messages from nslookup indicate that nslookup attempted to perform a lookup using UDP, the response was truncated (probably because the response was more than 512 bytes), and so nslookup retried the request using TCP.  Afterwards, nslookup provided a response, which indicates that the TCP query succeeded in getting a response, so the lookup ultimately succeeded with no problems.  This is exactly the expected behavior for a DNS query over UDP that elicits a large response.  
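
To make the mechanism above concrete, here is a minimal Python sketch of the truncate-then-retry behavior. It is not nslookup's actual code: the function names (build_query, is_truncated, lookup) and the injectable send_udp/send_tcp callables are hypothetical and exist only for illustration. The only protocol facts it relies on are the 12-byte DNS header and the TC ("truncated") bit in the header's flags word.

```python
import struct

TC_FLAG = 0x0200  # "truncated" bit in the 16-bit DNS header flags word


def build_query(name: str, qtype: int = 1, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for `name` (type A by default, no EDNS OPT record)."""
    # Header: ID, flags (RD set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # QTYPE, QCLASS=IN


def is_truncated(response: bytes) -> bool:
    """Return True if the TC bit is set in the response header."""
    (flags,) = struct.unpack("!H", response[2:4])
    return bool(flags & TC_FLAG)


def lookup(query: bytes, send_udp, send_tcp) -> bytes:
    """Try UDP first; if the answer is marked truncated, repeat the query over TCP."""
    response = send_udp(query)
    if is_truncated(response):
        # This is the point at which nslookup reports the retry to the user.
        response = send_tcp(query)
    return response
```

In this sketch, send_udp and send_tcp are whatever transport functions the caller supplies; a truncated UDP answer is an ordinary signal to retry, not a failure.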

(In case you are curious, here is the relevant code in nslookup: <https://gitlab.isc.org/isc-projects/bind9/-/blob/7267c3932362fe100ee2717b8a8ada1d21ce7987/bin/dig/dighost.c#L3850-3851>.)

In addition to TCP, there is the EDNS standard (cf. <https://en.wikipedia.org/wiki/Extension_Mechanisms_for_DNS>, <https://datatracker.ietf.org/doc/html/rfc2671>, and <https://datatracker.ietf.org/doc/html/rfc6891>).  Some resolvers and nameservers use this standard to send queries and responses larger than 512 bytes using UDP.  However, support for EDNS is not universal.  In particular, we have discovered that Go does not support EDNS (cf. bug 1949361, bug 1953518, bug 1966116, bug 1991067, <https://github.com/golang/go/issues/6464>, <https://github.com/golang/go/issues/11070>), which means that using EDNS to send more than 512 bytes over UDP causes problems for Go-based operators and builds.  

Because of this issue, we use CoreDNS's "bufsize" plugin (cf. <https://coredns.io/plugins/bufsize/>) to restrict UDP queries to 512 bytes (cf. <https://github.com/openshift/cluster-dns-operator/blob/0fcb6e5e330c26bb9d2ee32e9a63e87515c58784/pkg/operator/controller/controller_dns_configmap.go#L45-L52>, <https://github.com/openshift/machine-config-operator/blob/45d7287d05bcbcc8ef892e6613db3f02df05fd43/templates/common/on-prem/files/coredns-corefile.yaml#L7>).  
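
As a rough illustration of what capping the UDP payload size means for a response, the following Python sketch models the size decision. It is a simplified model under stated assumptions, not CoreDNS's implementation: the function names and the server_cap parameter are hypothetical, and the only rules assumed are that a client without EDNS gets the classic 512-byte limit and that advertised sizes below 512 are treated as 512.

```python
from typing import Optional

CLASSIC_UDP_LIMIT = 512  # limit of the basic DNS protocol over UDP


def effective_udp_limit(client_advertised: Optional[int], server_cap: int = 512) -> int:
    """Largest UDP response (in bytes) the server would send in this model.

    client_advertised is the EDNS buffer size from the query, or None if the
    query carried no EDNS OPT record. server_cap is a hypothetical stand-in
    for a configured cap such as "bufsize 512".
    """
    if client_advertised is None:
        return CLASSIC_UDP_LIMIT
    # Never go below the classic limit, never exceed the configured cap.
    return max(CLASSIC_UDP_LIMIT, min(client_advertised, server_cap))


def must_truncate(response_size: int, client_advertised: Optional[int],
                  server_cap: int = 512) -> bool:
    """True if a response of this size would be sent truncated over UDP."""
    return response_size > effective_udp_limit(client_advertised, server_cap)
```

With a cap of 512, a client that advertises 4096 bytes is still limited to 512 in this model, so any larger response is marked truncated and the client is expected to retry over TCP.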

This restriction requires that resolvers retry with TCP when the UDP responses are truncated.  However, it is the best solution we have found to maximize compatibility with resolvers and nameservers that implement the basic DNS protocol but may not implement EDNS.  It does require that firewalls be properly configured to allow TCP port 53 and that resolvers properly retry with TCP when UDP does not work, but compliant resolvers should all do this.  
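
Since the TCP retry only works if TCP port 53 is reachable, a quick connectivity check from inside a pod can help rule out firewall problems. The following is a minimal Python sketch; the helper name tcp_port_open is hypothetical, and the check only confirms that a TCP connection can be opened, not that the server answers DNS queries correctly.

```python
import socket


def tcp_port_open(host: str, port: int = 53, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, tcp_port_open("172.30.0.10") would test the cluster DNS address shown in the nslookup output in the description.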

Does that resolve the matter?

Comment 5 Miciah Dashiel Butler Masters 2022-02-21 04:58:39 UTC
Is any further action required on this BZ?

Comment 6 Miciah Dashiel Butler Masters 2022-03-01 16:03:13 UTC
Closing because the described behavior is expected.

Comment 7 Andy Bartlett 2022-04-29 07:37:04 UTC
Hi Miciah,
Yup, thanks for closing this; the problem seems to be resolved on the customer's side.

Regards,

Andy

Comment 8 Pedro 2022-09-21 05:44:52 UTC
Can we reopen this? We have exactly the same issue, but removing the 'bufsize 512' setting does not change the UDP payload size.
We have a WebLogic Operator and WebLogic containers that need to resolve names that do not exist at pod boot.

Comment 10 Red Hat Bugzilla 2023-09-18 04:30:36 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

