1825248 – glibc: TCP-based DNS resolution does not appear to timeout correctly

Keywords:
Status:	CLOSED UPSTREAM
Alias:	None
Product:	Red Hat Enterprise Linux 8
Classification:	Red Hat
Component:	glibc
Sub Component:
Version:	8.0
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	rc
Target Release:	8.0
Assignee:	glibc team
QA Contact:	qe-baseos-tools-bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-04-17 13:37 UTC by Dave Poston
Modified:	2023-07-18 14:30 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-04-21 14:54:02 UTC
Type:	Bug
Target Upstream Version:
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Sourceware	19643	0	P2	ASSIGNED	libresolv: Lack of TCP timeout	2022-10-28 13:11:57 UTC

Description Dave Poston 2020-04-17 13:37:26 UTC

Description of problem:

Witnessed DNS resolution hanging a thread for several days (so assumed to be an indefinite hang) after a network outage. When the network was restored, DNS resolution succeeded again for other threads, but the stuck threads did not resume. This appears to be related to TCP-based DNS resolution.

Version-Release number of selected component (if applicable):
RHEL 6.10

How reproducible:
Not reproduced yet; best guess is below.

Steps to Reproduce (my best guess - not yet tried):
1. Set up a 'vanilla' TCP server on the DNS port that accepts connections but never responds
2. Configure DNS to use the server from 1) over TCP
3. Do a DNS lookup
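
For step 1, a silent TCP listener could be sketched as below. This is an untested-against-glibc illustration, not a confirmed reproducer: the helper name `start_silent_tcp_server` is hypothetical, and the sketch binds to an ephemeral loopback port by default. A real reproduction would need the listener on port 53 (which normally requires root), and how the resolver is forced onto TCP (for example `options use-vc` in /etc/resolv.conf, on glibc versions that support it) is an assumption outside this sketch.

```python
import socket
import threading


def start_silent_tcp_server(host="127.0.0.1", port=0):
    """Listen on a TCP port, accept connections, and never send a byte back.

    Hypothetical helper for illustration only. port=0 picks a free port;
    pass port=53 (as root) to stand in for a DNS server.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(16)
    held = []  # keep accepted sockets open so the client sees no FIN or RST

    def accept_loop():
        while True:
            try:
                conn, _addr = srv.accept()
            except OSError:
                return  # listening socket was closed
            held.append(conn)  # accept, then stay silent forever

    threading.Thread(target=accept_loop, daemon=True).start()
    return srv, held
```

A client that connects to this listener and reads with a blocking socket and no timeout will wait for as long as the connection stays open, which matches the suspected behaviour of the stuck thread.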

Actual results:
DNS lookup hangs indefinitely

Expected results:
DNS lookup obeys the timeout and fails after ~5 seconds

Additional info:
The callstack for the hang is:
#0 0x00000031ec80e82d in read () from /lib64/libpthread.so.0
#1 0x00000031ed80a85b in send_vc () from /lib64/libresolv.so.2
#2 0x00000031ed80c4cc in __libc_res_nsend () from /lib64/libresolv.so.2
#3 0x00000031ed808821 in __libc_res_nquery () from /lib64/libresolv.so.2
#4 0x00000031ed808de0 in __libc_res_nquerydomain () from /lib64/libresolv.so.2
#5 0x00000031ed809aa1 in __libc_res_nsearch () from /lib64/libresolv.so.2
#6 0x00002ae9cf7f8401 in _nss_dns_gethostbyname3_r () from /lib64/libnss_dns.so.2
#7 0x00002ae9cf7f86d4 in _nss_dns_gethostbyname2_r () from /lib64/libnss_dns.so.2
#8 0x00000031ebd03995 in gethostbyname2_r@@GLIBC_2.2.5 () from /lib64/libc.so.6
#9 0x00000031ebcd0de2 in gaih_inet () from /lib64/libc.so.6
#10 0x00000031ebcd303f in getaddrinfo () from /lib64/libc.so.6

The thread was stuck here for days whilst other DNS lookups succeeded. It appears that send_vc uses blocking sockets without any timeout, and so is liable to get stuck indefinitely under certain conditions. This is a nasty thing to deal with for applications that want to be reliable under adverse conditions.

The solution would be to make send_vc use non-blocking sockets with poll, similar to how send_dg works. It could then time out.
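
The general pattern (non-blocking socket, poll, overall deadline) can be sketched as follows. This is written in Python rather than C purely for illustration and is not how glibc implements or would implement send_vc; the function name `recv_exact_with_deadline` is hypothetical.

```python
import select
import socket
import time


def recv_exact_with_deadline(sock, nbytes, timeout_s):
    """Read exactly nbytes from sock, or raise TimeoutError at the deadline.

    Hypothetical helper: the socket is switched to non-blocking mode and
    poll() is used to wait, so a silent peer cannot hold the caller forever.
    """
    sock.setblocking(False)
    poller = select.poll()
    poller.register(sock, select.POLLIN)
    deadline = time.monotonic() + timeout_s
    buf = bytearray()
    while len(buf) < nbytes:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("no complete reply before the deadline")
        # poll() takes milliseconds; wait at least 1 ms to avoid spinning.
        if not poller.poll(max(1, int(remaining * 1000))):
            continue  # nothing readable yet; the loop re-checks the deadline
        try:
            chunk = sock.recv(nbytes - len(buf))
        except BlockingIOError:
            continue  # spurious wakeup, try again
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return bytes(buf)
```

The deadline covers the whole read rather than each recv call, so a peer that trickles one byte at a time still cannot extend the wait beyond `timeout_s`.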

Comment 2 Carlos O'Donell 2020-04-21 14:54:02 UTC

This issue touches some very sensitive code within the resolver. Making those changes in RHEL6 and RHEL7 would directly impact the behaviour of applications. As such I'm going to move this issue to RHEL 8 where we can backport more aggressive changes from upstream. The idea is that we need to use RES_TIMEOUT and RES_DFLRETRY to compute a reasonable timeout for the TCP connection, and likewise verify that the UDP timeout matches. There are other changes we might also like to make in this area, like serializing the requests, but we'll discuss this upstream.
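
As a rough illustration of what such a computation might look like, the sketch below multiplies the per-attempt timeout by the number of attempts to get a worst-case budget. The formula and the function name `tcp_deadline_budget` are assumptions made for this example; the comment above does not specify how upstream would combine the two values. The defaults of 5 and 2 are the values of RES_TIMEOUT and RES_DFLRETRY in glibc's <resolv.h>.

```python
RES_TIMEOUT = 5   # seconds; default per-attempt timeout in glibc's <resolv.h>
RES_DFLRETRY = 2  # default number of attempts in glibc's <resolv.h>


def tcp_deadline_budget(timeout_s=RES_TIMEOUT, attempts=RES_DFLRETRY, servers=1):
    """Hypothetical worst-case budget, in seconds, for one lookup.

    Assumes every attempt against every configured server runs to its
    full timeout. This is an illustrative formula, not glibc's.
    """
    if timeout_s <= 0 or attempts <= 0 or servers <= 0:
        raise ValueError("all inputs must be positive")
    return timeout_s * attempts * servers
```

With the defaults and a single name server this gives a 10 second ceiling, compared with the unbounded wait described in the report.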

We have an open ticket upstream to manage this issue and we are going to use that:
https://sourceware.org/bugzilla/show_bug.cgi?id=19643

I am moving this bug to RHEL 8 and marking it CLOSED/UPSTREAM, and when the upstream bug is fixed we can consider a backport.
