Bug 1868106 - glibc: Transaction ID collisions cause slow DNS lookups in getaddrinfo
Summary: glibc: Transaction ID collisions cause slow DNS lookups in getaddrinfo
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: glibc
Version: 8.2
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 8.4
Assignee: Florian Weimer
QA Contact: Sergey Kolosov
Docs Contact: Zuzana Zoubkova
URL:
Whiteboard: multi-arch
Depends On:
Blocks: 1796871 1903880 1904153
 
Reported: 2020-08-11 17:18 UTC by Andy McCrae
Modified: 2021-09-17 12:34 UTC
CC List: 35 users

Fixed In Version: glibc-2.28-134.el8
Doc Type: Bug Fix
Doc Text:
.The `glibc` DNS stub resolver correctly processes parallel queries with identical transaction IDs
Prior to this update, the DNS stub resolver in the GNU C library `glibc` did not process responses to parallel queries with identical transaction IDs correctly. Consequently, when the transaction IDs were equal, the second parallel response was never matched to a query, resulting in a timeout and retry. With this update, the second parallel response is now recognized as valid. As a result, the `glibc` DNS stub resolver avoids excessive timeouts due to unrecognized responses.
Clone Of:
: 1904153 (view as bug list)
Environment:
Last Closed: 2021-05-18 14:36:34 UTC
Type: Bug
Target Upstream Version:


Attachments (Terms of Use)
tcpdump from container interface (7.02 KB, text/plain)
2020-08-11 17:18 UTC, Andy McCrae
no flags Details
tcpdump udp 53 (21.21 KB, application/vnd.tcpdump.pcap)
2020-08-14 09:55 UTC, Andy McCrae
no flags Details
tcpdump for enc600 interface (141.31 KB, application/vnd.tcpdump.pcap)
2020-08-14 09:56 UTC, Andy McCrae
no flags Details
strace pid 80756 (76.61 KB, text/plain)
2020-08-14 09:57 UTC, Andy McCrae
no flags Details
strace pid 80757 (21.19 KB, text/plain)
2020-08-14 09:58 UTC, Andy McCrae
no flags Details
zKVM LPAR ethtool -S output (1.89 KB, text/plain)
2020-08-21 13:24 UTC, Andy McCrae
no flags Details
zVM LPAR ethtool -S output (1.91 KB, text/plain)
2020-08-21 13:25 UTC, Andy McCrae
no flags Details
pcap from master-0 (51.08 KB, application/vnd.tcpdump.pcap)
2020-09-01 15:47 UTC, Andy McCrae
no flags Details
pcap from worker-0 (58.02 KB, application/vnd.tcpdump.pcap)
2020-09-01 15:49 UTC, Andy McCrae
no flags Details
strace for curl command from worker-0 (15.61 KB, text/plain)
2020-09-01 15:52 UTC, Andy McCrae
no flags Details
strace with higher STRSIZE (24.26 KB, text/plain)
2020-09-02 14:14 UTC, Andy McCrae
no flags Details


Links
System ID Private Priority Status Summary Last Updated
IBM Linux Technology Center 187505 0 None None None 2020-08-12 13:23:34 UTC
Sourceware 26600 0 P2 RESOLVED Transaction ID collisions cause slow DNS lookups in getaddrinfo 2021-02-19 14:54:10 UTC

Internal Links: 1846875

Description Andy McCrae 2020-08-11 17:18:25 UTC
Created attachment 1711098 [details]
tcpdump from container interface

Description of problem:

When making a curl request to an external site (e.g. www.redhat.com) from within a pod, the DNS lookup takes around 5s (the default timeout).

Version-Release number of selected component (if applicable):

All versions currently.

How reproducible:
Consistently

Steps to Reproduce:
1. On an s390x OCP cluster spin up a test pod with curl (e.g. nginx pod)
2. # oc exec $pod -- bash -c 'cat /etc/resolv.conf; while true; do echo -n "$(date)  "; curl -s -o /dev/null -w "%{time_namelookup} %{time_total} %{http_code}\n" https://www.redhat.com -k; done'

Actual results:

The time_namelookup value is just over 5s reasonably consistently.

Expected results:

The time_namelookup value is consistently well below 1s.

Additional info:

I was unable to recreate this on a non-s390x cluster.

tcpdump from the container's interface shows the requests going out and being responded to, but a 5s gap occurs before the request is retried, and the retry succeeds (I've attached the tcpdump).

The issue looks to be related to the optimization that requests A & AAAA records in parallel, since setting "options single-request" in the pod's /etc/resolv.conf makes the requests resolve quickly, as they should.
# oc exec $pod -- bash -c 'echo "options single-request" >> /etc/resolv.conf'

This is tied to the timeout value, since adjusting the DNS timeout to 2 seconds brings time_namelookup down to around 2 seconds instead:
# oc exec $pod -- bash -c 'echo "options timeout:2" >> /etc/resolv.conf'

Comment 1 Andy McCrae 2020-08-11 17:27:04 UTC
One additional note: this was originally thought to be related to https://bugzilla.redhat.com/show_bug.cgi?id=1821404. That fix is now released and in the latest nightlies for OCP (as well as the 4.3.31 full release), but the issue persists across all of those builds.

Comment 2 Dennis Gilmore 2020-08-11 17:46:37 UTC
Steve, I suspect this is a kernel bug. Could you please help us identify who on the Z side can work with us to resolve this issue?

Comment 4 Hendrik Brueckner 2020-08-12 08:09:01 UTC
HaJo, could you reverse-mirror this bug? Thanks.

Comment 9 Andy McCrae 2020-08-13 15:16:25 UTC
I was able to recreate this bug on a RHEL 8.2 s390x host outside of openshift/libvirt:

# uname -r
4.18.0-193.14.3.el8_2.s390x

# cat /etc/resolv.conf 
# Generated by NetworkManager
search s390.bos.redhat.com default.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.11.5.19
options ndots:5

# while true; do echo -n "$(date)  "; curl -s -o /dev/null -w "%{time_namelookup} %{time_total} %{http_code}\n" https://www.redhat.com -k; done
Thu 13 Aug 11:14:00 EDT 2020  5.266154 5.411858 301
Thu 13 Aug 11:14:06 EDT 2020  0.106627 0.256583 301
Thu 13 Aug 11:14:06 EDT 2020  5.266541 5.408212 301
Thu 13 Aug 11:14:12 EDT 2020  5.271446 5.426200 301
Thu 13 Aug 11:14:17 EDT 2020  5.265497 5.487069 301
Thu 13 Aug 11:14:22 EDT 2020  0.103492 0.241870 301
Thu 13 Aug 11:14:23 EDT 2020  5.262853 5.428042 301

Comment 10 Carlos O'Donell 2020-08-13 20:38:57 UTC
(In reply to Andy McCrae from comment #9)
> I was able to recreate this bug on a RHEL 8.2 s390x host outside of
> openshift/libvirt:
> 
> # uname -r
> 4.18.0-193.14.3.el8_2.s390x
> 
> # cat /etc/resolv.conf 
> # Generated by NetworkManager
> search s390.bos.redhat.com default.svc.cluster.local svc.cluster.local
> cluster.local
> nameserver 10.11.5.19
> options ndots:5
> 
> # while true; do echo -n "$(date)  "; curl -s -o /dev/null -w
> "%{time_namelookup} %{time_total} %{http_code}\n" https://www.redhat.com -k;
> done
> Thu 13 Aug 11:14:00 EDT 2020  5.266154 5.411858 301
> Thu 13 Aug 11:14:06 EDT 2020  0.106627 0.256583 301
> Thu 13 Aug 11:14:06 EDT 2020  5.266541 5.408212 301
> Thu 13 Aug 11:14:12 EDT 2020  5.271446 5.426200 301
> Thu 13 Aug 11:14:17 EDT 2020  5.265497 5.487069 301
> Thu 13 Aug 11:14:22 EDT 2020  0.103492 0.241870 301
> Thu 13 Aug 11:14:23 EDT 2020  5.262853 5.428042 301

The default is to use UDP to the DNS server.

If the system or DNS server is loaded you'll see dropped UDP packets, retries, and long DNS lookup times.

The fact that this either takes 100ms or 5s + 100ms indicates to me that the DNS server may be overloaded, dropping UDP packets, and you're getting 1 reply after the first timeout.

Can you please capture an strace -ff -ttt so we can see the syscalls and their timing?

Can you please do a pcap capture of the data going out the interface?

With the strace and pcap we'll see exactly what is taking time.

The above is not sufficient to rule out the loaded server / DNS server scenario.

Comment 11 Andy McCrae 2020-08-14 09:55:17 UTC
Created attachment 1711431 [details]
tcpdump udp 53

output of: tcpdump -i enc600 udp port 53 -nn -w dns.pcap

Comment 12 Andy McCrae 2020-08-14 09:56:17 UTC
Created attachment 1711432 [details]
tcpdump for enc600 interface

output of: tcpdump -i enc600 -nn -w interface.pcap

Comment 13 Andy McCrae 2020-08-14 09:57:11 UTC
Created attachment 1711433 [details]
strace pid 80756

First pid output from: # strace -o strace.file -ff -ttt curl -s -o /dev/null https://www.redhat.com -k

Comment 14 Andy McCrae 2020-08-14 09:58:00 UTC
Created attachment 1711434 [details]
strace pid 80757

Second pid output from: # strace -o strace.file -ff -ttt curl -s -o /dev/null https://www.redhat.com -k

Comment 15 Andy McCrae 2020-08-14 10:03:19 UTC
Hi Carlos,

I've attached the strace and pcap files.

For the pcap I ran around 6-7 curls in a loop, and have separated the capture into one tcpdump with just the UDP port 53 traffic, and one for the interface as a whole.

I agree it could be overload on one side, especially since reducing the number of domains searched, or using single-request, gets the desired effect. I've tried the same test from a non-s390x host against the exact same nameserver and it worked as expected, so I think it may be a client-side issue rather than a server-side one. I've also been able to reproduce this on several different s390x hosts, including a RHEL 7 host.

Thanks for looking into it, and let me know if I can provide other or more output.

Comment 16 IBM Bug Proxy 2020-08-17 15:31:24 UTC
------- Comment From STLI@de.ibm.com 2020-08-17 11:20 EDT-------
curl is using getaddrinfo:
(gdb) p	*hints
$16 = {ai_flags = 0, ai_family = 0, ai_socktype = 1, ai_protocol = 0, ai_addrlen = 0, ai_addr = 0x0, ai_canonname = 0x0, ai_next = 0x0}
=> ai_family = AF_UNSPEC; ai_socktype = SOCK_STREAM;
../../lib/curl_addrinfo.c:131       error = getaddrinfo(nodename, servname, hints, &aihead);

According to the strace.file.80757 file, it is using <glibc-src>/resolv/nss_dns/dns-host.c:_nss_dns_gethostbyname4_r() -> libnss_dns.so.
In __res_context_query(), the T_QUERY_A_AND_AAAA query is created, and in __res_context_send(), send_dg() is called, which uses the sendmmsg/recvfrom syscalls seen in the strace output in strace.file.80757:
...

1597397632.995409 socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 3
1597397632.995457 setsockopt(3, SOL_IP, IP_RECVERR, [1], 4) = 0
1597397632.995499 connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.11.5.19")}, 16) = 0

1597397632.995554 poll([{fd=3, events=POLLOUT}], 1, 0) = 1 ([{fd=3, revents=POLLOUT}])
1597397632.995613 sendmmsg(3, [{msg_hdr={msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\260C\1\0\0\1\0\0\0\0\0\0\3www\6redhat\3com\7defa"..., iov_len=58}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=58}, {msg_hdr={msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\260C\1\0\0\1\0\0\0\0\0\0\3www\6redhat\3com\7defa"..., iov_len=58}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=58}], 2, MSG_NOSIGNAL) = 2

1597397632.995691 poll([{fd=3, events=POLLIN}], 1, 5000) = 1 ([{fd=3, revents=POLLIN}])
1597397633.016204 ioctl(3, FIONREAD, [133]) = 0
1597397633.016241 recvfrom(3, "\260C\201\203\0\1\0\0\0\1\0\0\3www\6redhat\3com\7defa"..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.11.5.19")}, [28->16]) = 133

1597397633.016299 poll([{fd=3, events=POLLIN}], 1, 4979) = 0 (Timeout)
=> According to dns.pcap, there is no response.
Thus this poll times out:
No	Time    	Source		Destination	Prot	Length	Info
25	0.333208	10.16.107.4	10.11.5.19	DNS	100	Standard query 0xc337 A www.redhat.com.default.svc.cluster.local
26	0.333223	10.16.107.4	10.11.5.19	DNS	100	Standard query 0xc337 AAAA www.redhat.com.default.svc.cluster.local
27	0.352611	10.11.5.19	10.16.107.4	DNS	175	Standard query response 0xc337 No such name A www.redhat.com.default.svc.cluster.local SOA a.root-servers.net
Then, in res_send.c:send_dg(), "single_request = true;" and "goto retry;" take effect, which sends both requests again (see the send/recvfrom syscalls below in the strace output):
28	5.337219	10.16.107.4	10.11.5.19	DNS	100	Standard query 0xc337 A www.redhat.com.default.svc.cluster.local
29	5.356647	10.11.5.19	10.16.107.4	DNS	175	Standard query response 0xc337 No such name A www.redhat.com.default.svc.cluster.local SOA a.root-servers.net
30	5.356734	10.16.107.4	10.11.5.19	DNS	100	Standard query 0xc337 AAAA www.redhat.com.default.svc.cluster.local
31	5.376432	10.11.5.19	10.16.107.4	DNS	175	Standard query response 0xc337 No such name AAAA www.redhat.com.default.svc.cluster.local SOA a.root-servers.net
Note: The dns.pcap file contains multiple 5s timeouts where sometimes the A or AAAA response is missing.

1597397638.010618 poll([{fd=3, events=POLLOUT}], 1, 0) = 1 ([{fd=3, revents=POLLOUT}])
1597397638.011052 send(3, "\260C\1\0\0\1\0\0\0\0\0\0\3www\6redhat\3com\7defa"..., 58, MSG_NOSIGNAL) = 58
1597397638.011181 poll([{fd=3, events=POLLIN}], 1, 5000) = 1 ([{fd=3, revents=POLLIN}])

1597397638.034042 ioctl(3, FIONREAD, [133]) = 0
1597397638.034356 recvfrom(3, "\260C\201\203\0\1\0\0\0\1\0\0\3www\6redhat\3com\7defa"..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.11.5.19")}, [28->16]) = 133
1597397638.034940 poll([{fd=3, events=POLLOUT}], 1, 4971) = 1 ([{fd=3, revents=POLLOUT}])

1597397638.037381 send(3, "\260C\1\0\0\1\0\0\0\0\0\0\3www\6redhat\3com\7defa"..., 58, MSG_NOSIGNAL) = 58
1597397638.042603 poll([{fd=3, events=POLLIN}], 1, 4965) = 1 ([{fd=3, revents=POLLIN}])
1597397638.061346 ioctl(3, FIONREAD, [133]) = 0
1597397638.061486 recvfrom(3, "\260C\201\203\0\1\0\0\0\1\0\0\3www\6redhat\3com\7defa"..., 65536, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.11.5.19")}, [28->16]) = 133

1597397638.061629 close(3)              = 0

Carlos, is this correct?
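The send_dg() flow analyzed above (parallel sendmmsg, a 5s poll() timeout when a reply goes missing, then single_request = true and a sequential retry) can be modeled in a few lines. This is an illustrative Python sketch of that control flow, not glibc source; fake_server and the timing bookkeeping are hypothetical stand-ins:

```python
# Illustrative model (not glibc source) of the send_dg() behavior described
# above: both queries go out together; if a reply goes missing, the resolver
# times out, switches to single-request mode, and retries sequentially.

TIMEOUT = 5.0  # seconds, matching the default resolver timeout seen here

def fake_server(queries, drop_first_aaaa=True, attempt=0):
    """Return the replies that actually arrive for this attempt."""
    replies = []
    for q in queries:
        if drop_first_aaaa and attempt == 0 and q == "AAAA":
            continue  # the AAAA response is lost on the wire
        replies.append(q)
    return replies

def resolve(drop_first_aaaa=True):
    elapsed = 0.0
    # Attempt 1: parallel A + AAAA (one sendmmsg call).
    got = fake_server(["A", "AAAA"], drop_first_aaaa, attempt=0)
    if set(got) == {"A", "AAAA"}:
        return elapsed
    # A reply is missing: poll() runs out after TIMEOUT seconds.
    elapsed += TIMEOUT
    # Attempt 2: single_request = True, so send each query on its own.
    for q in ["A", "AAAA"]:
        assert q in fake_server([q], drop_first_aaaa, attempt=1)
    return elapsed

print(resolve(drop_first_aaaa=True))   # 5.0 -> the ~5s time_namelookup
print(resolve(drop_first_aaaa=False))  # 0.0 -> the fast case
```

With the AAAA reply dropped, the model reproduces the observed pattern: a 5-second stall followed by a successful sequential retry.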

Comment 17 IBM Bug Proxy 2020-08-18 14:54:41 UTC
------- Comment From STLI@de.ibm.com 2020-08-18 10:48 EDT-------
According to the interface.pcap file, the DNS responses arrive fast if there is no further traffic, but the 5s timeouts happen while there is other traffic, like:
No	Time    	Source		Destination	Prot	Length	Info
162	1.914149	10.16.107.4	10.11.5.19	DNS	100	Standard query 0x7604 A www.redhat.com.default.svc.cluster.local
163	1.914158	10.16.107.4	10.11.5.19	DNS	100	Standard query 0x7604 AAAA www.redhat.com.default.svc.cluster.local
164	1.915390	104.106.245.11	10.16.107.4	TCP	66	443 → 51256 [ACK] Seq=4391 Ack=845 Win=30080 Len=0 TSval=1270333268 TSecr=1121359670
165	1.915471	104.106.245.11	10.16.107.4	TLSv1.3	90	Application Data
166	1.915500	10.16.107.4	104.106.245.11	TCP	54	51256 → 443 [RST] Seq=845 Win=0 Len=0
...
172	2.031267	IBM_3b:3f:88	Broadcast	ARP	42	Who has 10.16.106.130? Tell 10.16.106.128
...
222	4.495787	10.16.107.4	192.168.145.10	TCP	74	[TCP Retransmission] 36062 → 22623 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=378392883 TSecr=0 WS=128
...
270	6.415853	fe80::38fa:94ff:fef7:2719	fe80::21f:9dff:fe27:e400	ICMPv6	86	Neighbor Solicitation for fe80::21f:9dff:fe27:e400 from 3a:fa:94:f7:27:19
271	6.419865	fe80::21f:9dff:fe27:e400	fe80::38fa:94ff:fef7:2719	ICMPv6	78	Neighbor Advertisement fe80::21f:9dff:fe27:e400 (rtr, sol)
...
289	6.495932	10.16.107.4	192.168.145.53	TCP	74	[TCP Retransmission] 40178 → 80 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=600140501 TSecr=0 WS=128
290	6.917005	10.16.107.4	10.11.5.19	DNS	100	Standard query 0x7604 A www.redhat.com.default.svc.cluster.local
291	6.936443	10.11.5.19	10.16.107.4	DNS	175	Standard query response 0x7604 No such name A www.redhat.com.default.svc.cluster.local SOA a.root-servers.net
292	6.936555	10.16.107.4	10.11.5.19	DNS	100	Standard query 0x7604 AAAA www.redhat.com.default.svc.cluster.local
293	6.956319	10.11.5.19	10.16.107.4	DNS	175	Standard query response 0x7604 No such name AAAA www.redhat.com.default.svc.cluster.local SOA a.root-servers.net

Is there suspicious output with "ethtool -S <interface name>" or "netstat -s -u"?
Which network setup do you use? OSA, HiperSockets, Mellanox, zVM VSWITCH, ...

Comment 18 Carlos O'Donell 2020-08-20 17:36:07 UTC
(In reply to IBM Bug Proxy from comment #16)
> ------- Comment From STLI@de.ibm.com 2020-08-17 11:20 EDT-------
> curl is using getaddrinfo:
> (gdb) p	*hints
> $16 = {ai_flags = 0, ai_family = 0, ai_socktype = 1, ai_protocol = 0,
> ai_addrlen = 0, ai_addr = 0x0,
> ai_canonname = 0x0, ai_next = 0x0}
> => ai_family = AF_UNSPEC; ai_socktype = SOCK_STREAM;
> ../../lib/curl_addrinfo.c:131       error = getaddrinfo(nodename, servname,
> hints, &aihead);

Agreed.

PID 80756 is the main thread, and PID 80757 is the second thread that does
the query and lookup.

The main thread does no DNS lookups.

The second thread connects 5 times to the DNS server.

1597397632.975441 connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.11.5.19")}, 16) = 0
1597397632.995499 connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.11.5.19")}, 16) = 0
1597397638.063248 connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.11.5.19")}, 16) = 0
1597397638.103372 connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.11.5.19")}, 16) = 0
1597397638.158504 connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.11.5.19")}, 16) = 0

Notice there is a 5 second gap between the connect groups.

> According to the strace.file.80757 file, it is using
> <glibc-src>/resolv/nss_dns/dns-host.c:_nss_dns_gethostbyname4_r() ->
> libnss_dns.so.
> In __res_context_query(), the T_QUERY_A_AND_AAAA is created and in
> __res_context_send(), send_dg() is called which is using the
> sendmmsg/recvfrom syscalls seen in the strace output in strace.file.80757:
> ...

Correct. It should be send_dg(), we're using UDP.

> 1597397632.995409 socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK,
> IPPROTO_IP) = 3
> 1597397632.995457 setsockopt(3, SOL_IP, IP_RECVERR, [1], 4) = 0
> 1597397632.995499 connect(3, {sa_family=AF_INET, sin_port=htons(53),
> sin_addr=inet_addr("10.11.5.19")}, 16) = 0
> 
> 1597397632.995554 poll([{fd=3, events=POLLOUT}], 1, 0) = 1 ([{fd=3,
> revents=POLLOUT}])
> 1597397632.995613 sendmmsg(3, [{msg_hdr={msg_name=NULL, msg_namelen=0,
> msg_iov=[{iov_base="\260C\1\0\0\1\0\0\0\0\0\0\3www\6redhat\3com\7defa"...,
> iov_len=58}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=58},
> {msg_hdr={msg_name=NULL, msg_namelen=0,
> msg_iov=[{iov_base="\260C\1\0\0\1\0\0\0\0\0\0\3www\6redhat\3com\7defa"...,
> iov_len=58}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=58}], 2,
> MSG_NOSIGNAL) = 2
> 
> 1597397632.995691 poll([{fd=3, events=POLLIN}], 1, 5000) = 1 ([{fd=3,
> revents=POLLIN}])
> 1597397633.016204 ioctl(3, FIONREAD, [133]) = 0
> 1597397633.016241 recvfrom(3,
> "\260C\201\203\0\1\0\0\0\1\0\0\3www\6redhat\3com\7defa"..., 2048, 0,
> {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.11.5.19")},
> [28->16]) = 133

This isn't the first query in the DNS pcap.

We have a first query for www.redhat.com.s390.bos.redhat.com and a very quick response within 22ms that returns "No such name".

The second query is for www.redhat.com.default.svc.cluster.local.

We have a response for that quickly within 19ms and that returns "No such name" for both A and AAAA.

This pattern seems to repeat.

Then at packet 163 in interface.pcap we have a serious amount of RST and TCP Retransmission.

At that point we don't see a returned AAAA answer from the server and so that triggers a timeout in the stub resolver.
 
> 1597397633.016299 poll([{fd=3, events=POLLIN}], 1, 4979) = 0 (Timeout)
> => According to dns.pcap, there is no response.
> Thus this poll times out:
> No	Time    	Source		Destination	Prot	Length	Info
> 25	0.333208	10.16.107.4	10.11.5.19	DNS	100	Standard query 0xc337 A
> www.redhat.com.default.svc.cluster.local
> 26	0.333223	10.16.107.4	10.11.5.19	DNS	100	Standard query 0xc337 AAAA
> www.redhat.com.default.svc.cluster.local
> 27	0.352611	10.11.5.19	10.16.107.4	DNS	175	Standard query response 0xc337 No
> such name A www.redhat.com.default.svc.cluster.local SOA a.root-servers.net
> Then, in res_send.c:send_dg() "single_request = true;" and "goto retry;",
> which sends the both requests again (see send-, recvfrom-syscalls below in
> strace output)
> 28	5.337219	10.16.107.4	10.11.5.19	DNS	100	Standard query 0xc337 A
> www.redhat.com.default.svc.cluster.local
> 29	5.356647	10.11.5.19	10.16.107.4	DNS	175	Standard query response 0xc337 No
> such name A www.redhat.com.default.svc.cluster.local SOA a.root-servers.net
> 30	5.356734	10.16.107.4	10.11.5.19	DNS	100	Standard query 0xc337 AAAA
> www.redhat.com.default.svc.cluster.local
> 31	5.376432	10.11.5.19	10.16.107.4	DNS	175	Standard query response 0xc337 No
> such name AAAA www.redhat.com.default.svc.cluster.local SOA
> a.root-servers.net
> Note: The dns.pcap file contains multiple 5s timeouts where sometimes the A
> or AAAA response is missing.

The strace strace.file.80757 contains one listed timeout.

I see one missing UDP response for packet 163 in interface.pcap.

Looking at UDP responses in dns.pcap I see a missing response for packet 26, 57, 76, 95, 130, 145 etc.

> 1597397638.010618 poll([{fd=3, events=POLLOUT}], 1, 0) = 1 ([{fd=3,
> revents=POLLOUT}])
> 1597397638.011052 send(3,
> "\260C\1\0\0\1\0\0\0\0\0\0\3www\6redhat\3com\7defa"..., 58, MSG_NOSIGNAL) =
> 58
> 1597397638.011181 poll([{fd=3, events=POLLIN}], 1, 5000) = 1 ([{fd=3,
> revents=POLLIN}])
> 
> 1597397638.034042 ioctl(3, FIONREAD, [133]) = 0
> 1597397638.034356 recvfrom(3,
> "\260C\201\203\0\1\0\0\0\1\0\0\3www\6redhat\3com\7defa"..., 2048, 0,
> {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.11.5.19")},
> [28->16]) = 133
> 1597397638.034940 poll([{fd=3, events=POLLOUT}], 1, 4971) = 1 ([{fd=3,
> revents=POLLOUT}])
> 
> 1597397638.037381 send(3,
> "\260C\1\0\0\1\0\0\0\0\0\0\3www\6redhat\3com\7defa"..., 58, MSG_NOSIGNAL) =
> 58
> 1597397638.042603 poll([{fd=3, events=POLLIN}], 1, 4965) = 1 ([{fd=3,
> revents=POLLIN}])
> 1597397638.061346 ioctl(3, FIONREAD, [133]) = 0
> 1597397638.061486 recvfrom(3,
> "\260C\201\203\0\1\0\0\0\1\0\0\3www\6redhat\3com\7defa"..., 65536, 0,
> {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.11.5.19")},
> [28->16]) = 133
> 
> 1597397638.061629 close(3)              = 0
> 
> Carlos, is this correct?

I concur. Given the interface.pcap and dns.pcap data it looks like the UDP responses are being dropped due to some other network factor.

Thus glibc is correct to retry within the timeout window.

Comment 19 Carlos O'Donell 2020-08-20 17:46:22 UTC
(In reply to IBM Bug Proxy from comment #17)
> ------- Comment From STLI@de.ibm.com 2020-08-18 10:48 EDT-------
> According to the interface.pcap file, the DNS responses arrive fast if there
> is no further traffic, but the 5s timeouts happen while there is other
> traffic, like:

I concur. I see about a dozen RST/TCP Retransmission sequences in the interface trace. This could be network load. As indicated, the dropped UDP packets happen during these times (as would be expected).

> Is there suspicious output with "ethtool -S <interface name>" or "netstat -s
> -u"?
> Which network setup do you use? OSA, HiperSockets, Mellanox, zVM VSWITCH, ...

Setting needinfo for Andy.

Have you tried setting 'use-vc' in /etc/resolv.conf to force TCP (send_vc() for those reading the sources) for DNS lookups?

It would cost a bit more to set up the DNS lookup because of the TCP setup and teardown, but could be a reliable workaround.
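The workarounds discussed in this report are all resolv.conf options. A hypothetical /etc/resolv.conf for testing them might look like the following; the nameserver address is a placeholder, and the options are alternatives to try one at a time:

```
# Hypothetical /etc/resolv.conf for testing the workarounds discussed here.
nameserver 10.11.5.19
options use-vc                  # force TCP (send_vc) for DNS lookups
#options single-request         # serialize the parallel A and AAAA queries
#options single-request-reopen  # on a missed reply, reopen the socket for the second query
#options timeout:2              # shrink the per-query timeout from 5s to 2s
```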

Comment 20 Andy McCrae 2020-08-21 13:23:42 UTC
Thanks for the feedback so far!

I've tried the use-vc setting, which works fine (as expected), as does restricting lookups to A records only (with UDP) using the curl -4 option, and the single-request option (also with UDP) in resolv.conf, which ensures only one query happens at a time.

To sum up my understanding so far:
* The A & AAAA record requests go out as expected
* One of the responses isn't received (the packet is dropped/lost)
* glibc is doing the correct thing since a response is not received and it waits for the timeout (5s) and retries. 

The issue must then exist at the network level, leading to the packets being lost, with glibc responding as it should to that situation.
In that case, it seems the next step is to try to get a pcap from the DNS server side. I can achieve that in the OCP deployment, since I can access the DNS server there, whereas I don't have access to the external server.

To respond to the initial queries around what networking is setup:

lsqeth on 2 of the hosts I can recreate it on, one zKVM LPAR using 10G OSA:
OSD_1000

One zVM LPAR using zVM VSWITCH:
VSWITCH: SYSTEM OCPDEVO1 (Type: QDIO)

I've attached the ethtool -S output for the interfaces, although I don't see anything untoward there. The ss -su output doesn't show anything of note.

I'll try to get a pcap from the dns server in the openshift example, and see if that will provide more light on the matter.

Comment 21 Andy McCrae 2020-08-21 13:24:47 UTC
Created attachment 1712172 [details]
zKVM LPAR ethtool -S output

Comment 22 Andy McCrae 2020-08-21 13:25:12 UTC
Created attachment 1712173 [details]
zVM LPAR ethtool -S output

Comment 23 IBM Bug Proxy 2020-08-27 15:51:04 UTC
------- Comment From ursula.braun@de.ibm.com 2020-08-27 11:44 EDT-------
Since we are not able to recreate the problem here, we want to debug the problem on your system. Let's skip KVM for now, and start with your zVM system.
As a first step could you please rerun the failing test on your zVM system, and then invoke "dbginfo.sh" and attach its output to this bugzilla?

Comment 24 alex.osadchyy@ibm.com 2020-08-28 21:46:39 UTC
@Ursula,
As Andy described, the issue is more evident inside the OpenShift pods. However, it can also be reproduced on the underlying RHEL and RHCOS hosts. I ran into this when working on a project with a customer. I have collected the dbginfo.sh dumps and will share those separately, as they may contain customer info. The way to reproduce it on the host is to apply a similar DNS lookup configuration as in the OpenShift pods. For example, modify /etc/resolv.conf as:
search net-testing.svc.cluster.local svc.cluster.local cluster.local ocp.local
nameserver x.x.x.x
options ndots:5

Run curl continuously and observe occasional requests taking 5s:
# while true; do echo -n "$(date)  "; curl -s -o /dev/null -w "%{time_namelookup} %{time_total} %{http_code}\n" https://www.redhat.com -k; done
Fri Aug 28 17:15:18 EDT 2020  5.514 5.910 301
Fri Aug 28 17:15:24 EDT 2020  0.004 0.392 301
Fri Aug 28 17:15:24 EDT 2020  0.004 0.447 301
Fri Aug 28 17:15:25 EDT 2020  0.004 0.394 301
Fri Aug 28 17:15:25 EDT 2020  0.004 0.394 301

Comment 25 Carlos O'Donell 2020-09-01 14:49:05 UTC
(In reply to Andy McCrae from comment #20)
> To sum up my understanding so far:
> * The A & AAAA record requests go out as expected
> * One of the responses isn't received (the packet is dropped/lost)
> * glibc is doing the correct thing since a response is not received and it
> waits for the timeout (5s) and retries. 

That looks accurate to me. Thank you for the review and summary.

Comment 26 Andy McCrae 2020-09-01 15:47:55 UTC
Created attachment 1713337 [details]
pcap from master-0

tcpdump from the master-0 host's interface running the DNS server, filtered to traffic from the client's source IP.

Comment 27 Andy McCrae 2020-09-01 15:49:34 UTC
Created attachment 1713338 [details]
pcap from worker-0

tcpdump from the worker-0 interface running the curl, to the DNS server. NB: since this is an OpenShift install, the DNS server is a service listening on 172.30.0.10, but there is only one host behind the service (running on master-0).

Comment 28 Andy McCrae 2020-09-01 15:52:17 UTC
Created attachment 1713339 [details]
strace for curl command from worker-0

strace output for curl PID running the request.

Comment 29 Andy McCrae 2020-09-01 16:04:11 UTC
I was able to recreate this in a zVM OpenShift deployment, and then get a pcap from both sides, as well as the strace output (which I've attached).

Interestingly, it looks like the host makes the request, and receives responses for both A & AAAA records from the server, but then does another POLLIN:

1598544230.050077 connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("172.30.0.10")}, 16) = 0
1598544230.050116 poll([{fd=3, events=POLLOUT}], 1, 0) = 1 ([{fd=3, revents=POLLOUT}])
1598544230.050152 sendmmsg(3, [{msg_hdr={msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\245\\\1\0\0\1\0\0\0\0\0\0\3www\6redhat\3com\3svc\7"..., iov_len=50}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=50}, {msg_hdr={msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\245\\\1\0\0\1\0\0\0\0\0\0\3www\6redhat\3com\3svc\7"..., iov_len=50}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=50}], 2, MSG_NOSIGNAL) = 2
1598544230.050304 poll([{fd=3, events=POLLIN}], 1, 5000) = 1 ([{fd=3, revents=POLLIN}])
1598544230.050659 ioctl(3, FIONREAD, [143]) = 0
1598544230.050766 recvfrom(3, "\245\\\205\3\0\1\0\0\0\1\0\0\3www\6redhat\3com\3svc\7"..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("172.30.0.10")}, [28->16]) = 143
1598544230.050851 poll([{fd=3, events=POLLIN}], 1, 4999) = 1 ([{fd=3, revents=POLLIN}])
1598544230.050948 ioctl(3, FIONREAD, [143]) = 0
1598544230.050995 recvfrom(3, "\245\\\205\3\0\1\0\0\0\1\0\0\3www\6redhat\3com\3svc\7"..., 65536, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("172.30.0.10")}, [28->16]) = 143
1598544230.051064 poll([{fd=3, events=POLLIN}], 1, 4999) = 0 (Timeout)

My understanding is that here it connects (connect()), confirms the socket is writable (POLLOUT), and sends the requests for A & AAAA records (sendmmsg).
It then waits for the response, which it gets (POLLIN, ioctl, recvfrom), and a second response, which it also gets (for the A & AAAA records, I assume).
However, at that point it waits again with the final POLLIN, expecting more data that never comes, although it should already have received responses to both the A & AAAA requests.
This POLLIN then times out and the 5 sec gap happens.

The tcpdump shows similar output, where the requests go out and the responses are received, but it still asks again for one of the records.

Comment 30 Florian Weimer 2020-09-01 16:18:46 UTC
Andy, could you perhaps get an strace with a higher -s value, so that we can read the actual packets received by the stub resolver? Thanks.

Comment 31 IBM Bug Proxy 2020-09-01 17:51:14 UTC
------- Comment From ursula.braun@de.ibm.com 2020-09-01 13:43 EDT-------
I could not detect anything strange in the network stuff within the provided dbginfo.sh output files.

But looking at the dns.pcap file I see a pattern before every 5-second gap: every DNS query has a transaction ID, and the 5-second gap occurs when both the A request and the AAAA request are sent with the same transaction ID. After 5 seconds the A request is repeated with the same transaction ID, its response is awaited, and afterwards the AAAA request is repeated, again with the same transaction ID.

It looks like sending an AAAA request with the same transaction ID as the preceding A request, before the A response is received, should be avoided. But I am not a DNS expert.
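The collision pattern observed in the pcap can be modeled in a few lines. This is an illustrative Python sketch of a first-match-by-transaction-ID loop, not the actual glibc code: with distinct IDs both replies find their query, but with equal IDs the second reply re-matches the already-answered slot, the AAAA query is never marked answered, and the 5s timeout follows:

```python
# Illustrative model (not glibc code) of why equal transaction IDs break the
# matching: the buggy logic pairs each reply with the *first* outstanding
# query whose ID matches, so with colliding IDs the second reply re-matches
# the already-answered A slot and the AAAA slot is never marked answered.

def match_replies(query_ids, reply_ids):
    """Return which query slots end up answered under first-match logic."""
    answered = [False, False]
    for rid in reply_ids:
        for slot, qid in enumerate(query_ids):
            if qid == rid:
                answered[slot] = True
                break  # bug: stops at the first ID match, answered or not
    return answered

# Distinct IDs: both replies are matched, the lookup completes quickly.
print(match_replies([0x1234, 0x5678], [0x1234, 0x5678]))  # [True, True]

# Colliding IDs (as in the pcap, e.g. 0xc337 for both A and AAAA):
# the second reply hits the A slot again, AAAA stays unanswered.
print(match_replies([0xc337, 0xc337], [0xc337, 0xc337]))  # [True, False]
```

Per the Doc Text at the top of this bug, the eventual fix makes the resolver recognize the second parallel response as valid even when the transaction IDs are equal.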

Comment 32 Florian Westphal 2020-09-02 11:19:20 UTC
This is most likely a duplicate of 1821404. Could someone have a look at "/proc/net/stat/nf_conntrack"? The "drop" and "insert_failed" columns are the more interesting ones for this bug.

Comment 33 IBM Bug Proxy 2020-09-02 11:31:33 UTC
------- Comment From tstaudt@de.ibm.com 2020-09-02 07:29 EDT-------
(In reply to comment #32)
> This is most likely a duplicate of 1821404.  Could someone have a look at
> "/proc/net/stat/nf_conntrack"?  The "drop" and "insert_failed" columns are
> the more interesting ones for this bug.

That bug is not accessible to IBM btw.

Comment 34 Florian Weimer 2020-09-02 11:34:11 UTC
(In reply to IBM Bug Proxy from comment #31)
> ------- Comment From ursula.braun@de.ibm.com 2020-09-01 13:43 EDT-------
> I could not detect anything strange in the network stuff within the provided
> dbginfo.sh output files.
> 
> But looking at the dns.pcap file I see a pattern before every 5 seconds gap:
> Every DNS query has a transaction ID. The 5 seconds gap is seen, if both the
> A-request and the AAAA-request are sent and have the same transaction ID.
> After 5 seconds the A-request is repeated with the same transaction ID,
> waiting for its response, and afterwards the AAAA-request is repeated with
> the same transaction ID.

This has been the glibc behavior for a long time (more than a decade, I think). This is the first time I have seen a report that our internal s390x environment is incompatible with it (although we have had other networking issues over the years, affecting more than just DNS). This suggests to me that the problem is caused by some sort of regression in the environment.

> Looks like it should be avoided to send an AAAA-request with the same
> transaction ID as the preceding A-request before the A-response is received.
> But I am not a DNS-expert.

Usually, when the glibc approach breaks, it's also necessary to use different *source port* numbers to work around the breakage; switching the transaction IDs alone isn't sufficient. But until we find the root cause, we don't know whether this is caused by general UDP packet drops.

It may be possible to put

options single-request-reopen

in /etc/resolv.conf and see if it improves name resolution behavior. This will reduce the number of parallel queries, but I don't know if it will paper over the environment issue completely.

Comment 35 Andy McCrae 2020-09-02 14:14:47 UTC
Created attachment 1713448 [details]
strace with higher STRSIZE

Output of:
strace -o strace.file -ff -ttt -s 1024 -p <pid>
On the curl command:
while true; do echo -n "$(date)  "; curl -s -o /dev/null -w "%{time_namelookup} %{time_total} %{http_code}\n" https://www.redhat.com -k; done

Comment 36 Andy McCrae 2020-09-02 14:30:08 UTC
Thanks Florian - I've uploaded a new strace for the same scenario, with a much higher STRSIZE, but let me know if you need additional output.

For reference, setting options single-request or single-request-reopen does resolve the issue.

I see BZ 1821404 was reopened, so it could well be related! Initially we had updated the kernel to ensure the fix was included, but since it's been re-opened I'll keep an eye on that one.

Comment 37 Florian Westphal 2020-09-02 14:34:19 UTC
(In reply to Andy McCrae from comment #36)
> Thanks Florian - I've uploaded a new strace for the same scenario, with a
> much higher STRSIZE, but let me know if you need additional output.
> 
> As a reference, changing options single-request or single-request-reopen
> does resolve the issue when set.
> 
> I see BZ 1821404 was reopened, so it could well be related! Initially we had
> updated the kernel to ensure the fix was included, but since it's been
> re-opened I'll keep an eye on that one.

Can you check "cat /proc/net/stat/nf_conntrack" output for dropped packets?

I suspect there is a net namespace that performs some type of load balancing for DNS via iptables "-m statistic" + DNAT? If so, "cat /proc/net/stat/nf_conntrack" from that netns would be useful.

Comment 38 Andy McCrae 2020-09-02 19:01:28 UTC
(In reply to Florian Westphal from comment #37)
> Can you check "cat /proc/net/stat/nf_conntrack" output for dropped packets?
> 
> I suspect there is a net namespace that performs some type of load balancing
> for DNS via
> iptables "-m statistic" + DNAT?  If so, "cat /proc/net/stat/nf_conntrack"
> from that
> netns would be useful.

Usually there would be. In this case I set the OpenShift Service to use a nodeSelector based on one hostname, so it only exists on one host (to make the strace easier); it therefore isn't using the -m statistic --mode random --probability 1/#of nodes rule.

-A KUBE-SERVICES -d 172.30.0.10/32 -p udp -m comment --comment "openshift-dns/dns-default:dns cluster IP" -m udp --dport 53 -j KUBE-SVC-BGNS3J6UB7MMLVDO
-A KUBE-SVC-BGNS3J6UB7MMLVDO -m comment --comment "openshift-dns/dns-default:dns" -j KUBE-SEP-2VAV3YJ6L4CHY6AE
-A KUBE-SEP-2VAV3YJ6L4CHY6AE -s 10.129.0.51/32 -m comment --comment "openshift-dns/dns-default:dns" -j KUBE-MARK-MASQ
-A KUBE-SEP-2VAV3YJ6L4CHY6AE -p udp -m comment --comment "openshift-dns/dns-default:dns" -m udp -j DNAT --to-destination 10.129.0.51:5353

I can get the output from the 10.129.0.51 netns, the only columns & rows that have entries other than 00000000 are included below:
# ip netns exec 30661060-12f8-4b01-acc0-9c04b0f95cd4 ip a | grep 10.129.0.51
    inet 10.129.0.51/23 brd 10.129.1.255 scope global eth0
# ip netns exec 30661060-12f8-4b01-acc0-9c04b0f95cd4 cat /proc/net/stat/nf_conntrack | awk '{ print $1 " " $6  " " $17 }'
entries ignore search_restart
00000153 00004100 00000049
00000153 00005655 0000000c
00000153 000055da 00000014
00000153 00007a02 00000015
00000153 00000000 00000000

It seems like the 'ignore' and 'search_restart' columns are not indicative of the issue here. 
All the other columns, including the drop and early_drop columns, are '00000000'.
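For reference, the per-CPU hex rows in /proc/net/stat/nf_conntrack can be reduced programmatically instead of picking awk column indices by hand; a small sketch (assuming the standard header-plus-hex-rows layout shown above):

```python
def nonzero_counters(stat_text):
    """Sum each /proc/net/stat/nf_conntrack column across the per-CPU
    rows and return only the counters that are nonzero anywhere.

    Caveat: the 'entries' column repeats the global table size on every
    row, so its sum is the real value times the number of CPU rows."""
    lines = stat_text.strip().splitlines()
    header = lines[0].split()
    totals = [0] * len(header)
    for row in lines[1:]:
        for i, field in enumerate(row.split()):
            totals[i] += int(field, 16)  # fields are hexadecimal
    return {name: t for name, t in zip(header, totals) if t}
```

Feeding it the raw file contents immediately shows whether drop, early_drop, or insert_failed ever incremented on any CPU.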

Should I be collecting that output from a different netns? The iptables rules themselves exist on each of the hosts in the cluster; I could run it from the worker running the curl request, and from the host running the DNS pod, if that'd be helpful.

Thanks again for the help so far.

Comment 39 Florian Westphal 2020-09-02 19:47:24 UTC
(In reply to Andy McCrae from comment #38)
> (In reply to Florian Westphal from comment #37)
> > Can you check "cat /proc/net/stat/nf_conntrack" output for dropped packets?
> > 
> > I suspect there is a net namespace that performs some type of load balancing
> > for DNS via
> > iptables "-m statistic" + DNAT?  If so, "cat /proc/net/stat/nf_conntrack"
> > from that
> > netns would be useful.
> 
> Usually there would be, in this case I set the Openshift Service to have a
> nodeSelector based on 1 hostname, so it only exists on 1 host (to make the
> strace easier), so it isn't using the -m statistic --mode random
> --probability 1/#of nodes.

Oh, wow.  In that case this doesn't look related to the changes in 1821404 at all (there, the drop happens when the A and AAAA requests race and each gets a different DNAT).
 
> -A KUBE-SERVICES -d 172.30.0.10/32 -p udp -m comment --comment
> "openshift-dns/dns-default:dns cluster IP" -m udp --dport 53 -j
[..]

> "openshift-dns/dns-default:dns" -m udp -j DNAT --to-destination
> 10.129.0.51:5353

Ok, so even if A and AAAA race they would always have the same dnat-to address. This should already be handled correctly even before the changes added in bz 1821404.
 
> I can get the output from the 10.129.0.51 netns, the only columns & rows
> that have entries other than 00000000 are included below:
> # ip netns exec 30661060-12f8-4b01-acc0-9c04b0f95cd4 ip a | grep 10.129.0.51
>     inet 10.129.0.51/23 brd 10.129.1.255 scope global eth0
> # ip netns exec 30661060-12f8-4b01-acc0-9c04b0f95cd4 cat
> /proc/net/stat/nf_conntrack | awk '{ print $1 " " $6  " " $17 }'
> entries ignore search_restart
> 00000153 00004100 00000049
> 00000153 00005655 0000000c
> 00000153 000055da 00000014
> 00000153 00007a02 00000015
> 00000153 00000000 00000000
> 
> It seems like the 'ignore' and 'search_restart' columns are not indicative
> of the issue here. 

Right.

> All the other columns, including the drop and early_drop columns, are
> '00000000'.

Ok, at least in this netns conntrack did not auto-drop packets.

> Should I be attempting that output from a different netns? The iptables
> rules themselves exist on each of the hosts in the cluster, I could run that
> from the worker running the curl request, and the host running the DNS pod
> if that'd be helpful.

Yes, that would be good.  Did anyone try a "perf script net_dropmonitor"?

Comment 40 Florian Weimer 2020-09-03 09:59:29 UTC
(In reply to Andy McCrae from comment #36)
> Thanks Florian - I've uploaded a new strace for the same scenario, with a
> much higher STRSIZE, but let me know if you need additional output.

This was very helpful, thanks!

I have a theory what is going on here, and I'm working on a reproducer. This is going to take some time (a few days) due to other commitments and the fact that my DNS test harness is ill-equipped to reproduce the particular conditions that tickle this bug.

What kind of DNS implementation is running on 172.30.0.10? It will need fixes as well. The current behavior results in seriously degraded performance, potentially causing a lot of unnecessary DNS traffic. This problem will persist after the glibc change I'm investigating. In fact, it will get worse because the timeouts will be eliminated. So we really need to fix that DNS implementation as well.

Comment 41 Florian Weimer 2020-09-03 17:10:37 UTC
(In reply to Florian Weimer from comment #40)
> (In reply to Andy McCrae from comment #36)
> > Thanks Florian - I've uploaded a new strace for the same scenario, with a
> > much higher STRSIZE, but let me know if you need additional output.
> 
> This was very helpful, thanks!
> 
> I have a theory what is going on here, and I'm working on a reproducer. This
> is going to take some time (a few days) due to other commitments and the
> fact that my DNS test harness is ill-equipped to reproduce the particular
> conditions that tickle this bug.

I was slightly wrong about this. I should not try to parse packets in my head. I think I have understood the issue now, but I need to sleep over it.

> What kind of DNS implementation is running on 172.30.0.10? It will need
> fixes as well. The current behavior results in seriously degraded
> performance, potentially causing a lot of unnecessary DNS traffic. This
> problem will persist after the glibc change I'm investigating. In fact, it
> will get worse because the timeouts will be eliminated. So we really need to
> fix that DNS implementation as well.

Please disregard this for now. There does not seem to be anything wrong with the DNS server implementation after all.

Comment 42 Andy McCrae 2020-09-05 11:29:56 UTC
Hi Florian,

Thanks for the updates! That sounds like good progress.
I was able to run the perf drop_monitor as well as get the nf_conntrack stat from the worker-0 and master-0 hosts:

I've removed all 00000000 columns and rows.
From the worker-0 host (running the pod that makes the curl request):

# cat /proc/net/stat/nf_conntrack | awk '{ print $1 " " $5 " " $6 " " $10 " " $13 " " $17 }' | head -n 4
entries invalid ignore insert_failed icmp_error search_restart
00000c09 0002767e 00017a14 00000002 00000021 0026c0e6
00000c09 00002472 00015f4c 00000006 00000006 0002539a
00000c09 00000000 00000000 00000000 00000000 00000000

The netns for the container running the curl request is all 00000000 entries in all columns/rows.
# ip netns exec 9f0e4e95-c7ee-4b36-a0bb-5d6066c25ee9 cat /proc/net/stat/nf_conntrack | awk '{ print $1 }' | head -n 2
entries 
00000000

From the master-0 host (running the DNS service pod):
# cat /proc/net/stat/nf_conntrack | awk '{print $1 " " $3 " " $5 " " $6 " " $10 " " $11 " " $13 " " $17 }' | head -n 6
entries found invalid ignore insert_failed drop icmp_error search_restart
00001bf3 000000bb 0004cce0 00758f83 000002ee 00000000 0000000c 002bd193
00001bf3 000000ab 00003970 00777fda 00000321 00000001 00000007 000cef4d
00001bf3 000000c1 00002fc7 0076227e 0000038a 00000000 0000000a 000ba2b3
00001bf3 000000b9 00002e0a 007c90b6 00000338 00000000 00000007 000b4750
00001bf3 00000000 00000000 00000000 00000000 00000000 00000000 00000000

I've managed to run perf from the master-0 and worker-0 hosts, I don't have it running inside the specific curl pod though (I could try to build a pod with the perf tools installed to test that out), I then ran the curl loop and let it run a few times (around 10):

The master-0 host didn't output anything aside from the 3 headers themselves.
Here is the output from the worker-0 host:

# perf script net_dropmonitor
Starting trace (Ctrl-C to dump results)
^CGathering kallsyms data
                 LOCATION                    OFFSET                     COUNT
      unix_stream_connect                       772                        94
               tcp_v6_rcv                       508                        14
    sk_stream_kill_queues                       108                        11
               tcp_v4_rcv                       742                        24
          skb_queue_purge                        50                         1
             nf_hook_slow                       244                         3
      nfnetlink_rcv_batch                       602                         3
         skb_release_data                       194                         2

Comment 43 Florian Westphal 2020-09-08 16:25:25 UTC
(In reply to Andy McCrae from comment #42)
>                  LOCATION                    OFFSET                     COUNT
>              nf_hook_slow                       244                         3

This means 3 packets got dropped by iptables/nftables (or in another base hook, but this is unlikely because the drop stat counter in conntrack is 0 and other hooks typically do not drop packets).

Is it possible that we're dropping replies because the NAT reverse mapping was already destroyed?
You could try:
iptables -t raw -A PREROUTING -p udp -j TRACE

and then run "xtables-monitor --trace".
If there is a lot of udp traffic you might have to make the selector more specific, e.g.

iptables -t raw -A PREROUTING -p udp -m multiport --ports 53,5353 -j TRACE

(Port 5353 is a guess; it might be needed due to some DNAT/port rewrite along the way.)

This should be done in the pod containing the dns dnat rules.

Comment 44 Florian Weimer 2020-09-08 16:30:39 UTC
Just to reiterate, I think I have found the root cause in the glibc DNS stub resolver. At least I can observe something very similar locally. I just need to double-check that my reproducer is correct, and I have been buried in other commitments during the last few days.

Comment 45 Florian Weimer 2020-09-11 12:43:41 UTC
I have uploaded an unsupported test build here:

https://people.redhat.com/~fweimer/PuF9w7ZoYkpY/glibc-2.28-127.el8.bz1868106.0/

Andy, would you mind giving it a try? I would appreciate it a lot.

The core issue is indeed this mishandling of transaction ID collisions in the answer matching, as mentioned in comment 31.

For some reason, I assumed that we were using the same transaction ID for both queries all the time anyway, but that's clearly not the case. The bug is not s390x-specific but timing-dependent; it just happens to show up only on the overloaded s390x systems. (Time measurement might also be a factor.)

I'm going to post the patch upstream later today, with a slightly more polished test case. The actual code change will be the same.
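In outline, the corrected matching must consider every in-flight query with a matching transaction ID rather than only the first; a simplified sketch of that idea (not the actual glibc patch, which operates on raw DNS packets):

```python
def match_responses_fixed(a_id, aaaa_id, response_ids):
    """Corrected matcher: a response whose ID matches an already-answered
    query is tried against the other in-flight query before being
    discarded, so colliding transaction IDs no longer lose a response."""
    answered = {"A": False, "AAAA": False}
    for rid in response_ids:
        if rid == a_id and not answered["A"]:
            answered["A"] = True
        elif rid == aaaa_id and not answered["AAAA"]:
            answered["AAAA"] = True
        # else: genuine duplicate or unrelated response; discard
    return answered
```

With this ordering, two responses that both carry a colliding ID satisfy both queries, so no timeout-and-retry cycle is triggered.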

Comment 46 Andy McCrae 2020-09-16 10:20:51 UTC
Hi Florian,

Sorry for the delay getting back to you. I've built a container with the new patched glibc, and it's looking good:

# oc exec ubi-image -- bash -c 'rpm -qa | grep glibc; cat /etc/resolv.conf; while true; do echo -n "$(date)  "; curl -s -o /dev/null -w "%{time_namelookup} %{time_total} %{http_code}\n" https://www.redhat.com -k; done' 
glibc-2.28-127.el8.bz1868106.0.s390x
glibc-minimal-langpack-2.28-127.el8.bz1868106.0.s390x
glibc-common-2.28-127.el8.bz1868106.0.s390x
glibc-langpack-en-2.28-127.el8.bz1868106.0.s390x
search default.svc.cluster.local svc.cluster.local cluster.local andytest.example.com
nameserver 172.30.0.10
options ndots:5
Wed Sep 16 10:10:19 UTC 2020  0.288680 0.572678 301
Wed Sep 16 10:10:19 UTC 2020  0.028136 0.320243 301
Wed Sep 16 10:10:20 UTC 2020  0.115957 0.391503 301
Wed Sep 16 10:10:20 UTC 2020  0.059568 0.396658 301
Wed Sep 16 10:10:21 UTC 2020  0.011433 0.285790 301
Wed Sep 16 10:10:21 UTC 2020  0.031892 0.331723 301
Wed Sep 16 10:10:21 UTC 2020  0.168128 0.452927 301
Wed Sep 16 10:10:22 UTC 2020  0.299400 0.585025 301
Wed Sep 16 10:10:22 UTC 2020  0.101148 0.378738 301

In comparison the exact same image but with the older glibc (displaying the issue from before):
# oc exec ubi-image-oldglibc -- bash -c 'rpm -qa | grep glibc; cat /etc/resolv.conf; while true; do echo -n "$(date)  "; curl -s -o /dev/null -w "%{time_namelookup} %{time_total} %{http_code}\n" https://www.redhat.com -k; done'
glibc-common-2.28-101.el8.s390x
glibc-minimal-langpack-2.28-101.el8.s390x
glibc-2.28-101.el8.s390x
search default.svc.cluster.local svc.cluster.local cluster.local andytest.example.com
nameserver 172.30.0.10
options ndots:5
Wed Sep 16 10:10:40 UTC 2020  5.395583 5.678167 301
Wed Sep 16 10:10:46 UTC 2020  5.304105 5.576888 301
Wed Sep 16 10:10:51 UTC 2020  5.077270 5.517953 301
Wed Sep 16 10:10:57 UTC 2020  5.145085 5.417293 301
Wed Sep 16 10:11:02 UTC 2020  0.014790 0.286077 301
Wed Sep 16 10:11:03 UTC 2020  5.265114 5.632382 301
Wed Sep 16 10:11:08 UTC 2020  5.489430 5.768916 301
Wed Sep 16 10:11:14 UTC 2020  5.144675 5.432714 301

Thanks for the patch! I imagine the next steps are to wait for it to land upstream, and then for it to propagate down into the individual container builds to address the issue the customer has seen.

It's good to know it isn't s390x-specific. The timing-dependent aspect makes sense given the way Kubernetes is set up by default: there are a lot more requests (with ndots:5), which would explain why we don't see this on a regular basis.
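A rough back-of-envelope calculation (assuming independent, uniformly random 16-bit transaction IDs, which is my reading of the comments above) shows why the ndots:5 search-list expansion makes the collision visible in practice:

```python
def collision_probability(query_pairs):
    """Probability that at least one A/AAAA pair out of `query_pairs`
    parallel pairs shares a transaction ID, assuming each query draws an
    independent uniform 16-bit ID (1/65536 chance per pair)."""
    p_single = 1.0 / 65536
    return 1.0 - (1.0 - p_single) ** query_pairs

# A single pair almost never collides, but a busy cluster issuing tens
# of thousands of lookups (each expanded by the ndots:5 search list into
# several query pairs) hits a collision routinely.
```

For example, at around 40,000 query pairs the chance of at least one collision is already close to even odds, so a cluster under load would see the 5-second stalls regularly.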

Comment 47 Florian Weimer 2020-09-16 10:43:48 UTC
(In reply to Andy McCrae from comment #46)
> Thanks for the patch! I imagine the steps now are to wait for that to land
> upstream, and then to propagate down into the individual container builds to
> address the issue the customer has seen.

Somewhere in the middle: it has to be shipped in a release of Red Hat Enterprise Linux first. It would be helpful to know the breadth of customer impact to determine whether we have to release z-stream updates for the fix. Thanks.

Comment 48 Dennis Gilmore 2020-09-16 15:15:04 UTC
Hi Florian, 

Given that OCP 4.6 is an EUS release and will be built on RHEL 8.2 EUS, we need to get the patch into a z-stream update for 8.2.

It has come up a few times, and we expect that 4.6 will see high adoption.

Dennis

Comment 49 Florian Weimer 2020-09-16 15:41:14 UTC
(In reply to Dennis Gilmore from comment #48)
> Given that OCP 4.6 is an EUS release and will be built on RHEL 8.2 EUS we
> need to get the patch into a z stream update for 8.2
> 
> It has come up a few times and we expect that 4.6 will see high adoption

I see, thanks. Container content is impacted as well, but we can start with fixing 8.2.z (after the usual preconditions are met, of course).

Comment 50 alex.osadchyy@ibm.com 2020-09-16 19:29:31 UTC
>...it has to be shipped in a release of Red Hat
> Enterprise Linux first. It would be helpful to know about the breadth of
> customer impact to determine if we have to release z-stream updates for the
> fix. Thanks.

Regarding business impact: the representative ISV that ran into this issue has 2 initial clients who will need the fix before going live. The fix is also important to remove doubts when marketing the offering, and to be able to position against managed K8s clusters.

Comment 78 errata-xmlrpc 2021-05-18 14:36:34 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: glibc security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1585

