Description of problem:
Witnessed DNS resolution hanging a thread for several days (so assumed to be an indefinite hang) after a network outage. When the network was restored, DNS resolution succeeded again for other threads, but the stuck threads did not resume. This appears to be related to TCP-based DNS resolution.

Version-Release number of selected component (if applicable):
RHEL 6.10

How reproducible:
Not reproduced yet; best guess is below.

Steps to Reproduce (my best guess, not yet tried):
1. Set up a 'vanilla' TCP server on the DNS port that never responds
2. Configure DNS to use the server from step 1 over TCP
3. Do a DNS lookup

Actual results:
DNS lookup hangs indefinitely

Expected results:
DNS lookup obeys the timeout and fails after ~5 secs

Additional info:
The callstack for the hang is:

#0  0x00000031ec80e82d in read () from /lib64/libpthread.so.0
#1  0x00000031ed80a85b in send_vc () from /lib64/libresolv.so.2
#2  0x00000031ed80c4cc in __libc_res_nsend () from /lib64/libresolv.so.2
#3  0x00000031ed808821 in __libc_res_nquery () from /lib64/libresolv.so.2
#4  0x00000031ed808de0 in __libc_res_nquerydomain () from /lib64/libresolv.so.2
#5  0x00000031ed809aa1 in __libc_res_nsearch () from /lib64/libresolv.so.2
#6  0x00002ae9cf7f8401 in _nss_dns_gethostbyname3_r () from /lib64/libnss_dns.so.2
#7  0x00002ae9cf7f86d4 in _nss_dns_gethostbyname2_r () from /lib64/libnss_dns.so.2
#8  0x00000031ebd03995 in gethostbyname2_r@@GLIBC_2.2.5 () from /lib64/libc.so.6
#9  0x00000031ebcd0de2 in gaih_inet () from /lib64/libc.so.6
#10 0x00000031ebcd303f in getaddrinfo () from /lib64/libc.so.6

The thread was stuck here for days whilst other DNS lookups succeeded.

It appears that send_vc uses blocking sockets without any timeout, and so is liable to get stuck indefinitely under certain conditions. This is a nasty thing to deal with for applications that want to be reliable under adverse conditions.

The solution would be to make send_vc use non-blocking sockets and poll etc., similar to how send_dg works. It could then time out.
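
To illustrate both the suspected failure mode and the proposed fix, here is a minimal sketch in Python rather than the resolver's C. It is not glibc code: `start_silent_server` and `read_with_deadline` are hypothetical helpers, the server listens on an ephemeral loopback port instead of the real DNS port, and the 0.5 second timeout is arbitrary. The server accepts connections and never replies, as in step 1 above; the client reads through a non-blocking socket and poll(), so the read gives up at a deadline instead of blocking forever.

```python
import select
import socket
import threading
import time


def start_silent_server():
    """Listen on an ephemeral loopback port, accept connections, never reply."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    held = []  # keep accepted sockets open so the client never sees EOF

    def accept_loop():
        while True:
            try:
                conn, _ = listener.accept()
            except OSError:
                return  # listener was closed
            held.append(conn)

    threading.Thread(target=accept_loop, daemon=True).start()
    return listener, held


def read_with_deadline(sock, nbytes, timeout_s):
    """Read up to nbytes; raise TimeoutError if nothing arrives in time."""
    sock.setblocking(False)
    poller = select.poll()
    poller.register(sock, select.POLLIN)
    deadline = time.monotonic() + timeout_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("no response within %.1f s" % timeout_s)
        # poll() takes milliseconds; wait at least 1 ms to avoid spinning.
        if poller.poll(max(1, int(remaining * 1000))):
            try:
                return sock.recv(nbytes)
            except BlockingIOError:
                continue  # woken up but no data was ready; wait again


if __name__ == "__main__":
    listener, _held = start_silent_server()
    client = socket.create_connection(listener.getsockname())
    client.sendall(b"\x00\x01\x00")  # placeholder bytes, not a real DNS query
    try:
        read_with_deadline(client, 2, 0.5)
    except TimeoutError as exc:
        print("gave up:", exc)
    finally:
        client.close()
        listener.close()
```

A plain blocking recv() on the same client socket would wait for as long as the server keeps the connection open, which matches the read() frame at the top of the callstack.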
This issue touches some very sensitive code within the resolver. Making those changes in RHEL6 and RHEL7 would directly impact the behaviour of applications. As such I'm going to move this issue to RHEL 8, where we can backport more aggressive changes from upstream.

The idea is that we need to use RES_TIMEOUT and RES_DFLRETRY to compute a reasonable timeout for the TCP connection, and likewise verify that the UDP timeout matches. There are other changes we might also like to make in this area, like serializing the requests, but we'll discuss this upstream.

We have an open ticket upstream to manage this issue and we are going to use that:
https://sourceware.org/bugzilla/show_bug.cgi?id=19643

I am moving this bug to RHEL 8 and marking it CLOSED/UPSTREAM, and when the upstream bug is fixed we can consider a backport.
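
As a rough illustration of the timeout computation mentioned above, the sketch below takes the simplest possible reading: every attempt gets the full per-attempt timeout, so the worst-case wait is timeout × attempts × nameservers. This is an assumption for illustration only, not the formula glibc uses or the one upstream will adopt. The values 5 and 2 are assumed to mirror RES_TIMEOUT and RES_DFLRETRY from <resolv.h>, and `tcp_budget` is a hypothetical name.

```python
# Assumed defaults, intended to mirror RES_TIMEOUT and RES_DFLRETRY.
RES_TIMEOUT = 5   # seconds allowed per attempt (assumption)
RES_DFLRETRY = 2  # attempts per nameserver (assumption)


def tcp_budget(timeout_s=RES_TIMEOUT, attempts=RES_DFLRETRY, nameservers=1):
    """Worst-case total wait if every attempt uses its full timeout."""
    return timeout_s * attempts * nameservers


if __name__ == "__main__":
    print(tcp_budget())               # one nameserver
    print(tcp_budget(nameservers=3))  # three nameservers
```

Under these assumptions a lookup against a single unresponsive server would fail after a bounded number of seconds rather than hanging for days.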