Bug 2192225 - nslookup communications error to x.x.x.x#53: timed out
Summary: nslookup communications error to x.x.x.x#53: timed out
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: bind
Version: 38
Hardware: x86_64
OS: Linux
unspecified
medium
Target Milestone: ---
Assignee: Petr Menšík
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On: 2223913
Blocks:
TreeView+ depends on / blocked
 
Reported: 2023-04-30 16:30 UTC by dan
Modified: 2023-09-22 17:22 UTC (History)
7 users (show)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2023-09-22 17:22:05 UTC
Type: ---
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Internet Systems Consortium (ISC) isc-projects bind9 issues 4044 0 None opened nslookup reports timeout if input lookup is delayed 2023-05-03 12:08:23 UTC

Description dan 2023-04-30 16:30:25 UTC
For some time I have observed that nslookup queries on my various Fedora systems were returning:

nslookup communications error to x.x.x.x#53: timed out

and then immediately resolved for the query.

In trying to track this down I did 2 fresh installs of Fedora Workstation 38 on my own hypervisor and in the cloud.  The problem persists, leading me to believe that there is a bug.

Example from fresh install of Fedora 38:

$ nslookup
> server 8.8.8.8
Default server: 8.8.8.8
Address: 8.8.8.8#53
> set debug
> cisco.com
;; communications error to 8.8.8.8#53: timed out
Server:		8.8.8.8
Address:	8.8.8.8#53

------------
    QUESTIONS:
	cisco.com, type = A, class = IN
    ANSWERS:
    ->  cisco.com
	internet address = 72.163.4.185
	ttl = 281
    AUTHORITY RECORDS:
    ADDITIONAL RECORDS:
------------
Non-authoritative answer:
Name:	cisco.com
Address: 72.163.4.185
------------
    QUESTIONS:
	cisco.com, type = AAAA, class = IN
    ANSWERS:
    ->  cisco.com
	has AAAA address 2001:420:1101:1::185
	ttl = 133
    AUTHORITY RECORDS:
    ADDITIONAL RECORDS:
------------
Name:	cisco.com
Address: 2001:420:1101:1::185
> 




Reproducible: Always

Steps to Reproduce:
Install Fedora 38

nslookup
set debug
server 8.8.8.8 (or any server of your choosing)
cisco.com (or any other valid name of your choosing)

8 out of 10 times you'll get the communications error message, then a result.



Unable to select bind-utils as the involved package for nslookup.  Not in list.

Comment 1 Petr Menšík 2023-05-02 16:00:04 UTC
Interesting issue. It seems to be present just in interactive mode of nslookup. When I set those parameters from command line, I have not seen it there.

$ nslookup -debug cisco.com 8.8.8.8
Server:		8.8.8.8
Address:	8.8.8.8#53

------------
    QUESTIONS:
	cisco.com, type = A, class = IN
    ANSWERS:
    ->  cisco.com
	internet address = 72.163.4.185
	ttl = 247
    AUTHORITY RECORDS:
    ADDITIONAL RECORDS:
------------
Non-authoritative answer:
Name:	cisco.com
Address: 72.163.4.185
------------
    QUESTIONS:
	cisco.com, type = AAAA, class = IN
    ANSWERS:
    ->  cisco.com
	has AAAA address 2001:420:1101:1::185
	ttl = 69
    AUTHORITY RECORDS:
    ADDITIONAL RECORDS:
------------
Name:	cisco.com
Address: 2001:420:1101:1::185

But in interactive mode, even set debug followed by name seems to be enough:
$ nslookup
> set debug
> isc.org
;; communications error to 127.0.0.1#53: timed out
Server:		127.0.0.1
Address:	127.0.0.1#53

------------
    QUESTIONS:
	isc.org, type = A, class = IN
    ANSWERS:
    ->  isc.org
	internet address = 149.20.1.66
	ttl = 300
    AUTHORITY RECORDS:
    ADDITIONAL RECORDS:
------------
Non-authoritative answer:
Name:	isc.org
Address: 149.20.1.66
------------
    QUESTIONS:
	isc.org, type = AAAA, class = IN
    ANSWERS:
    ->  isc.org
	has AAAA address 2001:4f8:1:f::66
	ttl = 300
    AUTHORITY RECORDS:
    ADDITIONAL RECORDS:
------------
Name:	isc.org
Address: 2001:4f8:1:f::66

I agree it is too fast reply to met timeout, so there is more likely some logic wrong to print timeout message without any timeout happening.

Comment 2 Petr Menšík 2023-05-02 16:52:05 UTC
For some reason set debug made interactively results in the first recv_done returns eresult==timeout, always. If I set it on command line, even the first one returns with success. Queried name does not matter, even '.' works too.

(gdb) p eresult
$3 = ISC_R_TIMEDOUT
(gdb) bt
#0  recv_done (handle=0x7ffff5081000, eresult=ISC_R_TIMEDOUT, region=<optimized out>, arg=<optimized out>) at ../../../bin/dig/dighost.c:4037
#1  0x00007ffff7a29a30 in isc__nm_async_readcb (worker=worker@entry=0x0, ev0=ev0@entry=0x7ffff5dfe140) at ../../../lib/isc/netmgr/netmgr.c:2885
#2  0x00007ffff7a29b7c in isc__nm_readcb (sock=sock@entry=0x7ffff5027a00, uvreq=<optimized out>, eresult=eresult@entry=ISC_R_TIMEDOUT)
    at ../../../lib/isc/netmgr/netmgr.c:2858
#3  0x00007ffff7a2aa17 in isc__nmsocket_readtimeout_cb (timer=0x7ffff5027e98) at ../../../lib/isc/netmgr/netmgr.c:2085
#4  0x00007ffff793c1f0 in uv__run_timers (loop=0x7ffff622f010) at src/timer.c:178
#5  uv_run (loop=loop@entry=0x7ffff622f010, mode=mode@entry=UV_RUN_DEFAULT) at src/unix/core.c:393
#6  0x00007ffff7a2e6b4 in nm_thread (worker0=0x7ffff622f000) at ../../../lib/isc/netmgr/netmgr.c:698
#7  0x00007ffff7a64569 in isc__trampoline_run (arg=0x5555555b38b0) at ../../../lib/isc/trampoline.c:189
#8  0x00007ffff74ae12d in start_thread (arg=<optimized out>) at pthread_create.c:442
#9  0x00007ffff752fbc0 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) p eresult
$4 = ISC_R_TIMEDOUT
(gdb) info locals
sockstr = "127.0.0.1#53", '\000' <repeats 50 times>
query = 0x7ffff5042000
b = {magic = 0, base = 0x0, length = 0, used = 0, current = 0, active = 0, link = {prev = 0x0, next = 0x0}, mctx = 0x0, autore = false}
msg = 0x0
result = <optimized out>
n = 0x0
l = 0x7ffff5000000
docancel = false
donext = false
match = true
done_process_opt = false
parseflags = <optimized out>
id = 0
msgflags = 0
newedns = <optimized out>
peer = {type = {sa = {sa_family = 0, sa_data = '\000' <repeats 13 times>}, sin = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, 
      sin_zero = "\000\000\000\000\000\000\000"}, sin6 = {sin6_family = 0, sin6_port = 0, sin6_flowinfo = 0, sin6_addr = {__in6_u = {
          __u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, sin6_scope_id = 0}, 
    ss = {ss_family = 0, __ss_padding = '\000' <repeats 117 times>, __ss_align = 0}, sunix = {sun_family = 0, 
      sun_path = '\000' <repeats 107 times>}}, length = 0, link = {prev = 0x0, next = 0x0}}
next_lookup = <optimized out>
cancel_lookup = <optimized out>
keep_query = <optimized out>
__func__ = "recv_done"

Comment 3 Petr Menšík 2023-05-03 10:52:58 UTC
I have tried to capture the reason by using strace, but once I follow also children forks, it never happened to me. I think that means there is some weird race condition, which does not happen when it is slower.

$ strace -o /tmp/nslookup.strace -f nslookup

Comment 4 Petr Menšík 2023-05-03 11:15:19 UTC
Ah, found reliable reproducer. The set at the beginning does matter. What does matter is just waiting before query is asked.

Reproduce with:
(sleep 6 && echo localhost) | nslookup

The problem is not present in development version 9.19.11, but reliably reproduced on 9.18.14.

Comment 5 Petr Menšík 2023-05-03 12:08:23 UTC
Reported upstream: https://gitlab.isc.org/isc-projects/bind9/-/issues/4044

Comment 6 Petr Menšík 2023-09-22 17:22:05 UTC
Fixed in release 9.18.17, bug #2223913.


Note You need to log in before you can comment on or make changes to this bug.