Bug 2181244

Summary:	TCP queries hang forever when an upstream server is not reachable
Product:	[Fedora] Fedora	Reporter:	Petr Menšík <pemensik>
Component:	dnsmasq	Assignee:	Petr Menšík <pemensik>
Status:	CLOSED EOL	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	39	CC:	aegorenkov.91, dns-sig, dougsland, horst.thaller, jfindysz, pemensik, rhel-cs-infra-services-qe, rmetrich, tmihinto
Target Milestone:	---	Keywords:	Triaged
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	2160466	Environment:
Last Closed:	2024-11-27 21:09:36 UTC	Type:	Bug
Bug Depends On:	2160466
Bug Blocks:

Comment 1 Petr Menšík 2023-03-31 17:28:18 UTC

Description of problem:

A customer is using dnsmasq with 3 upstream servers.
When one of them is not reachable, queries hang until they time out.
This happens even though --all-servers is used, which is supposed to send the query to all servers concurrently, at least from the manpage:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
       --all-servers
              By  default,  when  dnsmasq has more than one upstream server available, it will send queries to just
              one server. Setting this flag forces dnsmasq to send all queries to all available servers. The  reply
              from the server which answers first will be returned to the original requester.
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Stracing dnsmasq, we can indeed see that it hangs in connect() until the daemon is killed:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
8731  14:27:08.458095 connect(13<TCP:[432416]>, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("1.2.3.4")}, 16 <unfinished ...>
 :
8731  14:29:05.145298 <... connect resumed>) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) <116.687148>
8731  14:29:05.145373 --- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Version-Release number of selected component (if applicable):

dnsmasq-2.79-24.el8.x86_64 (also seen on RHEL9 dnsmasq-2.85-5.el9.x86_64)

How reproducible:

Always

Steps to Reproduce:
1. Set up dnsmasq with upstream servers 192.168.122.1 (my VM gateway) and 1.2.3.4 (not reachable)

  -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
  # dnsmasq -k --conf-file=/dev/null --port 2053 --server 192.168.122.1 --server 1.2.3.4 -i lo -z --all-servers --no-resolv --no-hosts
  -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

2. Query using *dig*

  -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
  # dig +tcp @localhost -p 2053 srv foo.bar
  
  ; <<>> DiG 9.11.36-RedHat-9.11.36-5.el8_7.2 <<>> +tcp @localhost -p 2053 srv foo.bar
  ; (2 servers found)
  ;; global options: +cmd
  ;; connection timed out; no servers could be reached
  -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
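
For reference, the bytes that a TCP query like the one above puts on the wire can be sketched in Python. This is a minimal sketch only: dig adds EDNS and other options by default, so its actual message differs. The function name `build_query` and the default `query_id` are hypothetical and chosen purely for illustration.

```python
import struct

def build_query(name, qtype=33, query_id=0x1234):
    """Build a minimal DNS query (QTYPE 33 is SRV) and frame it for TCP."""
    # Header: ID, flags (only RD set), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length; a zero byte ends the name.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # QCLASS 1 is IN
    message = header + question
    # DNS over TCP prefixes every message with a 2-byte big-endian length.
    return struct.pack("!H", len(message)) + message
```

Calling `build_query("foo.bar")` yields the framed SRV question for the name used in the reproducer.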

Actual results:

The query times out and no result is returned.

Expected results:

A result from the reachable upstream server.

Additional info:

When the --server options are reversed (--server 1.2.3.4 --server 192.168.122.1), the query is answered immediately, which "proves" that 192.168.122.1 is then queried first and that nothing is queried in parallel.

ss shows that both children query the same server, which is not reachable:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
# ss -anp | grep SYN
tcp   SYN-SENT   0      1                                             192.168.122.184:56355             1.2.3.4:53     users:(("dnsmasq",pid=9373,fd=13))                                                                                                                             
tcp   SYN-SENT   0      1                                             192.168.122.184:57173             1.2.3.4:53     users:(("dnsmasq",pid=9370,fd=13))
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Two children are present because dig internally retries the query when it gets no result.

--- Additional comment from Renaud Métrich on 2023-01-12 15:46:12 CET ---

Clearly there is no parallelism in the tcp_request() code: servers are queried sequentially, and there is no timeout handling either (in particular, the socket is in blocking mode).
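
To illustrate the missing timeout handling, here is a minimal Python sketch of a connect that gives up after a bounded wait instead of blocking. It is not dnsmasq code; the name `tcp_connect` and the 2-second default are assumptions made for illustration only.

```python
import socket

def tcp_connect(host, port, timeout=2.0):
    """Return a connected socket, or None if the connect fails or times out."""
    try:
        # create_connection() applies the timeout to connect() itself, so an
        # unreachable server cannot block the caller indefinitely.
        return socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers connection refused, network unreachable and timeouts alike.
        return None
```

With a bound like this, a caller could move on to the next upstream server after a couple of seconds rather than waiting for the kernel to abandon the SYN retries.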

Checking upstream, I can see a complete rewrite of this code, which now brings concurrency:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
commit 12a9aa7c628e2d7dcd34949603848a3fb53fce9c
Author: Simon Kelley <simon.uk>
Date:   Tue Jun 8 22:10:55 2021 +0100

    Major rewrite of the DNS server and domain handling code.
...
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

--- Additional comment from Petr Menšík on 2023-01-23 19:14:25 CET ---

It seems to me those versions work very similarly. There is one difference, however. Both version 2.79 in RHEL8 and version 2.85 in RHEL9 use the last --server first, then try the others in reverse order. Version 2.88 from Fedora, on the other hand, tries the first --server first, then continues in forward order. If the same parameters are passed and only one server works well, the results differ.

127.0.0.1 has a listening forwarder,
127.0.0.83 has a closed port,
10.0.137.114 drops all incoming requests, so queries to it time out.

I tried it with this example:
# 2.79 and 2.85 time out
src/dnsmasq -d --log-queries --port 2053 --no-resolv --server=127.0.0.1 --server=127.0.0.83 --server=10.0.137.114
# 2.88 responds in time
dnsmasq -d --log-queries --port 2053 --no-resolv --server=127.0.0.1 --server=127.0.0.83 --server=10.0.137.114
# 2.88 times out too
dnsmasq -d --log-queries --port 2053 --no-resolv --server=127.0.0.83 --server=10.0.137.114 --server=127.0.0.1
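
The three outcomes above follow from the try order alone. The toy model below reproduces them; it is not dnsmasq code. All names are hypothetical, and the two time values (120 s for a blocked connect, 5 s of client patience) are assumptions, not measured values.

```python
# Toy model: a server either answers, refuses at once, or silently drops
# packets (which costs the caller the full blocking connect wait).
ANSWERS, REFUSES, DROPS = "answers", "refuses", "drops"

def sequential_lookup(servers, behaviour, connect_wait=120.0, client_timeout=5.0):
    """Try servers in order; return the one that answers, or None if the
    client gives up before any server has answered."""
    elapsed = 0.0
    for server in servers:
        kind = behaviour[server]
        if kind == ANSWERS:
            return server
        if kind == DROPS:
            elapsed += connect_wait
            if elapsed >= client_timeout:
                return None  # the client has already timed out
        # REFUSES costs effectively nothing, so move on to the next server.
    return None

# Server behaviour as described in the test setup.
behaviour = {
    "127.0.0.1": ANSWERS,
    "127.0.0.83": REFUSES,
    "10.0.137.114": DROPS,
}
given = ["127.0.0.1", "127.0.0.83", "10.0.137.114"]

forward = sequential_lookup(given, behaviour)                   # 2.88 order
backward = sequential_lookup(list(reversed(given)), behaviour)  # 2.79/2.85 order
```

In this model `forward` finds 127.0.0.1 while `backward` gives up, and placing 127.0.0.1 last in forward order gives up as well, which matches the three observations.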

I do not think --all-servers makes any difference here. The problem is that no short timeout is used, and I have seen no code that allows multiple TCP queries to run in parallel. Queries are always sent sequentially, which does not work well over TCP. Dnsmasq does not seem to be able to do this even in the latest master branch of the upstream git repository.
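
For comparison, parallel TCP forwarding with a short timeout could look roughly like the Python sketch below. It only illustrates the idea under stated assumptions: it uses threads, it is not how dnsmasq is structured, and `recv_exact`, `query_tcp` and `first_answer` are hypothetical names. Each server is a `(host, port)` tuple and `message` is an unframed DNS message.

```python
import socket
import struct
from concurrent.futures import ThreadPoolExecutor, as_completed

def recv_exact(sock, n):
    """Read exactly n bytes, or raise if the peer closes the connection early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        buf += chunk
    return buf

def query_tcp(server, message, timeout):
    """Send one DNS message over TCP and return the raw reply."""
    with socket.create_connection(server, timeout=timeout) as sock:
        # DNS over TCP prefixes each message with a 2-byte big-endian length.
        sock.sendall(struct.pack("!H", len(message)) + message)
        (length,) = struct.unpack("!H", recv_exact(sock, 2))
        return recv_exact(sock, length)

def first_answer(servers, message, timeout=2.0):
    """Query all servers at once; return (server, reply) for the first
    successful reply, or None if every server fails or times out."""
    if not servers:
        return None
    pool = ThreadPoolExecutor(max_workers=len(servers))
    futures = {pool.submit(query_tcp, s, message, timeout): s for s in servers}
    try:
        for fut in as_completed(futures):
            try:
                return futures[fut], fut.result()
            except OSError:
                continue  # refused, unreachable or timed out: try the next one
        return None
    finally:
        # Do not wait for slow workers; each one ends within its own timeout.
        pool.shutdown(wait=False)
```

Because every connection has its own timeout and the first reply wins, one unreachable server delays nothing as long as another one answers.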

Comment 2 Fedora Release Engineering 2023-08-16 08:07:30 UTC

This bug appears to have been reported against 'rawhide' during the Fedora Linux 39 development cycle.
Changing version to 39.

Comment 3 Aoife Moloney 2024-11-08 10:50:18 UTC

This message is a reminder that Fedora Linux 39 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 39 on 2024-11-26.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '39'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version. Note that the version field may be hidden.
Click the "Show advanced fields" button if you do not see it.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 39 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 4 Aoife Moloney 2024-11-27 21:09:36 UTC

Fedora Linux 39 entered end-of-life (EOL) status on 2024-11-26.

Fedora Linux 39 is no longer maintained, which means that it
will not receive any further security or bug fix updates. As a result we
are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux
please feel free to reopen this bug against that version. Note that the version
field may be hidden. Click the "Show advanced fields" button if you do not see
the version field.

If you are unable to reopen this bug, please file a new report against an
active release.

Thank you for reporting this bug and we are sorry it could not be fixed.