Description of problem:
5 second timeout problem with the internal OpenShift DNS service

Version-Release number of selected component (if applicable):
v3.9.51

How reproducible:
After upgrading the cluster from 3.9.41 to 3.9.51, this issue appeared on all the nodes.

Actual results:
5 second timeout with the internal OpenShift DNS service

Expected results:
DNS lookups should not be delayed this long.

Additional info:
Additional information in comments to follow as private.
Thanks Weibin, I was able to reproduce this in your environment and will take a look.
Just a quick note for anyone following along: I can reproduce this quickly and reliably from a container, closer to the kernel (i.e. without curl):

$ while true; do time getent ahosts $service; sleep 11; done
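For anyone scripting this up, here is a hedged sketch of the same idea that flags slow lookups explicitly. The 5 s threshold reflects glibc's default RES_TIMEOUT, and the `localhost` default is just a stand-in for the service hostname under test; adjust both for your environment.

```shell
# Sketch: time each lookup and flag any that hit the ~5s glibc
# resolver retry window. check_lookup takes the hostname to test.
check_lookup() {
  start=$(date +%s)
  getent ahosts "$1" > /dev/null 2>&1
  elapsed=$(( $(date +%s) - start ))
  if [ "$elapsed" -ge 5 ]; then
    echo "SLOW: ${elapsed}s"
  else
    echo "ok: ${elapsed}s"
  fi
}

# Replace localhost with the service name, e.g. $service as above.
check_lookup localhost
```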
Got more interesting results when running more tests. Same testing env setup:
v3.9.51
Red Hat Enterprise Linux Server release 7.6 (Maipo)
openvswitch-2.9.0-83.el7fdp.1.x86_64
openvswitch-selinux-extra-policy-1.0-8.el7fdp.noarch

Case   Platform    instance_type   Test result
1      OpenStack   m1.large        fail
2      OpenStack   m1.medium       pass
3      AWS/EC2     m1.large        pass
4      AWS/EC2     m1.medium       pass

All of the above test results are consistent.
(In reply to Weibin Liang from comment #13)
> The problem can be reproduced every time now
>
> Setup cluster through QE Jenkins tool:
> https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Launch%20Environment%20Flexy/
>
> The same 5s curl delay issue can only be seen when vm_type: m1.large is set,
> not vm_type: m1.medium
>
> The same 5s curl delay issue can be seen in both v3.9.51 and v3.11.67
>
> [root@qe-weliang-case4master-etcd-nfs-1 ~]# oc new-app https://github.com/OpenShiftDemos/os-sample-python.git
> [root@qe-weliang-case4master-etcd-nfs-1 ~]# oc rsh os-sample-python-1-hv9ms
> (app-root) sh-4.2$ export svc5=os-sample-python.p1.svc.cluster.local
> (app-root) sh-4.2$ OUTPUT=" %{time_namelookup} %{time_connect} %{time_appconnect} %{time_pretransfer} %{time_redirect} %{time_starttransfer} %{time_total}\n"; \
>   echo ""; \
>   echo "time                          namelookup  connect  appconnect  pretransfer  redirect  starttransfer  total"; \
>   echo "-----------------------------------------------------------------------------------------------------"; \
>   while true; do \
>     echo -n "$(date)"; curl -w "$OUTPUT" -o /dev/null -s $svc5:8080; sleep 2; \
>     echo -n "$(date)"; curl -w "$OUTPUT" -o /dev/null -s $svc5:8080; sleep 120; \
>   done
>
> time                          namelookup  connect  appconnect  pretransfer  redirect  starttransfer  total
> -----------------------------------------------------------------------------------------------------
> Wed Jan  9 20:35:16 UTC 2019  0.125  0.126  0.000  0.126  0.000  0.130  0.130
> Wed Jan  9 20:35:19 UTC 2019  0.012  0.013  0.000  0.013  0.000  0.016  0.016
> Wed Jan  9 20:37:19 UTC 2019  5.515  5.515  0.000  5.516  0.000  5.518  5.519
> Wed Jan  9 20:37:26 UTC 2019  0.012  0.013  0.000  0.013  0.000  0.016  0.016
> Wed Jan  9 20:39:26 UTC 2019  5.514  5.515  0.000  5.515  0.000  5.521  5.521
> Wed Jan  9 20:39:34 UTC 2019  0.012  0.013  0.000  0.013  0.000  0.016  0.016
> Wed Jan  9 20:41:34 UTC 2019  5.513  5.514  0.000  5.514  0.000  5.519  5.519
> Wed Jan  9 20:41:41 UTC 2019  0.012  0.013  0.000  0.013  0.000  0.014  0.014
> Wed Jan  9 20:43:41 UTC 2019  5.515  5.515  0.000  5.516  0.000  5.518  5.518
> Wed Jan  9 20:43:49 UTC 2019  0.012  0.013  0.000  0.013  0.000  0.014  0.014

Here's the /etc/resolv.conf from the container:

nameserver 172.16.122.6
search default.svc.cluster.local svc.cluster.local cluster.local int.0110-p6h.qe.rhcloud.com openstacklocal
options ndots:5

The upstreams are:

server=172.16.122.4
server=172.16.122.3
server=172.16.122.2

Because of ndots and the search path, dnsmasq is forwarding queries for os-sample-python.p1.svc.cluster.local to the upstream (not ideal, and probably a separate issue). The delay comes from a problematic interaction between dnsmasq and the upstream, triggered by a separate process making periodic AAAA queries for some other name to the same upstream. Every 10 seconds, some other process resolves qe-weliang-39master-etcd-nfs-1.int.0110-p6h.qe.rhcloud.com; it's the upstream's AAAA reply for this "other" query that seems to temporarily block other queries to the same nameserver. Here's a tcpdump:

17:49:04.385698 IP (tos 0x0, ttl 64, id 41203, offset 0, flags [DF], proto UDP (17), length 104)
    172.16.122.6.7510 > 172.16.122.2.53: 40796+ A? qe-weliang-39master-etcd-nfs-1.int.0110-p6h.qe.rhcloud.com. (76)
17:49:04.385808 IP (tos 0x0, ttl 64, id 41204, offset 0, flags [DF], proto UDP (17), length 104)
    172.16.122.6.62840 > 172.16.122.2.53: 55908+ AAAA? qe-weliang-39master-etcd-nfs-1.int.0110-p6h.qe.rhcloud.com. (76)
    ^-- The AAAA request from some other process that triggers a problematic response.
17:49:04.386822 IP (tos 0x0, ttl 64, id 50914, offset 0, flags [DF], proto UDP (17), length 120)
    172.16.122.2.53 > 172.16.122.6.7510: 40796 1/0/0 qe-weliang-39master-etcd-nfs-1.int.0110-p6h.qe.rhcloud.com. A 172.16.122.6 (92)
    ^-- The A record response is fine.
17:49:04.386846 IP (tos 0x0, ttl 64, id 50915, offset 0, flags [DF], proto UDP (17), length 182)
    172.16.122.2.53 > 172.16.122.6.62840: 55908 0/1/0 (154)
    ^-- This AAAA response to dnsmasq blocks ALL subsequent queries to this
        nameserver for the 5 seconds following the response. Notice the absence
        of a response code (e.g. NXDomain).
17:49:08.686055 IP (tos 0x0, ttl 64, id 44817, offset 0, flags [DF], proto UDP (17), length 111)
    172.16.122.6.51978 > 172.16.122.2.53: 49930+ A? os-sample-python.p1.svc.cluster.local.int.0110-p6h.qe.rhcloud.com. (83)
17:49:08.686319 IP (tos 0x0, ttl 64, id 51927, offset 0, flags [DF], proto UDP (17), length 111)
    172.16.122.2.53 > 172.16.122.6.51978: 49930 NXDomain 0/0/0 (83)
17:49:08.686584 IP (tos 0x0, ttl 64, id 44818, offset 0, flags [DF], proto UDP (17), length 111)
    172.16.122.6.63158 > 172.16.122.2.53: 51097+ AAAA? os-sample-python.p1.svc.cluster.local.int.0110-p6h.qe.rhcloud.com. (83)
17:49:08.687506 IP (tos 0x0, ttl 64, id 51929, offset 0, flags [DF], proto UDP (17), length 189)
    172.16.122.2.53 > 172.16.122.6.63158: 51097 NXDomain 0/1/0 (161)
    ^-- These continue as if they had been queued.

The upstream AAAA query for qe-weliang-39master-etcd-nfs-1.int.0110-p6h.qe.rhcloud.com results in the problematic reply every time, and if my own test query for "os-sample-python" falls in the 5 second window following the bad AAAA response, the "os-sample-python" query is blocked until the AAAA from the upstream is handled (what "handled" means here is still undefined).

Still digging, but this is the progress I've made in Weibin's environment. I haven't had a chance to cross-reference it with the other reported scenarios to see if anything here is applicable to them.
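To make the ndots/search behavior above concrete, here is a minimal sketch (not the resolver's actual code) of how glibc expands a name against the search list. Because os-sample-python.p1.svc.cluster.local has only 4 dots, which is below the ndots:5 threshold, every search suffix is tried before the bare name, which is why the upstream ends up seeing queries like os-sample-python.p1.svc.cluster.local.int.0110-p6h.qe.rhcloud.com.

```shell
# Sketch of glibc search-list expansion for ndots:5 with the search
# path from the resolv.conf above. Names with fewer than ndots dots
# get each search domain appended first; the absolute name goes last.
expand_query() {
  name="$1"
  ndots=5
  search="default.svc.cluster.local svc.cluster.local cluster.local int.0110-p6h.qe.rhcloud.com openstacklocal"
  dots=$(printf '%s' "$name" | tr -cd '.' | wc -c)
  if [ "$dots" -lt "$ndots" ]; then
    for d in $search; do
      echo "$name.$d"
    done
  fi
  echo "$name."
}

expand_query os-sample-python.p1.svc.cluster.local
```

Each printed line corresponds to a full A+AAAA query pair sent upstream, so a single lookup can fan out into many round trips before succeeding.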
Also reproducible on OpenStack with instance_type m1.xlarge. It looks like the only differences between the types are the number of vCPUs and the RAM:

             vCPUs  RAM   Reproducible
m1.medium    2      4G    no
m1.large     4      8G    yes
m1.xlarge    8      16G   yes

Same testing env setup:
v3.9.51
Red Hat Enterprise Linux Server release 7.6 (Maipo)
openvswitch-2.9.0-83.el7fdp.1.x86_64
openvswitch-selinux-extra-policy-1.0-8.el7fdp.noarch
Editing the dc to add the dnsConfig option name: single-request can be an optional workaround for this bug.

[root@qe-weliang-311master-etcd-nfs-1 ~]# oc edit deploymentconfig.apps.openshift.io/os-sample-python

spec:
  template:
    spec:
      ...
      dnsConfig:
        options:
        - name: single-request
      dnsPolicy: ClusterFirst

But this option is only supported in 3.11, not in earlier releases such as 3.7.
*** Bug 1662122 has been marked as a duplicate of this bug. ***
Hi, we are running 3.9.51 at the moment and have encountered similar behaviour on our application nodes that have 8+ CPU cores. In our case the timeouts do not occur on nodes with 4 or 6 cores.

Our investigation has shown that every time we encounter a timeout in a container (due to RES_TIMEOUT = 5), the conntrack "insert_failed" counter increases on one of the CPUs (visible with conntrack -S). We can reproduce this behavior with a simple Python app calling getaddrinfo(). Due to the 4+ search domains in resolv.conf, dnsmasq will try to resolve all of them (both A and AAAA). Could this problem actually be a race condition in netfilter?

Workarounds: CustomPODConfig could be activated on 3.9+ by enabling it as a feature gate, but adding "single-request" or "single-request-reopen" does not always solve the problem.
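For reference, the resolver options mentioned above can also be tested directly in a container's /etc/resolv.conf without any OpenShift feature gate (a config sketch only; per the observation above, serializing the A/AAAA queries mitigates but does not always eliminate the timeouts):

```
# /etc/resolv.conf fragment: make glibc send A and AAAA lookups
# sequentially instead of in parallel over the same socket.
options single-request-reopen
# or alternatively:
# options single-request
```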
Hi, this is very useful. Do you need 8 cores in a POD to trigger the bug?
Hi, at first glance we thought that the timeout only occurs on application nodes with multiple CPUs, but after running further tests we had to accept that the behavior also exists on application nodes with 4 CPUs. We needed to insert a sleep between requests in our test runner to achieve the same behavior, but basically running "getent ahosts" in a while loop (with a sleep period like in #c15) in a container shows that requests last over 5 sec. And of course, every time this happens the conntrack "insert_failed" counter increases.

Some (maybe) helpful links:
https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts
https://github.com/kubernetes/kubernetes/issues/62628
https://blog.quentin-machu.fr/2018/06/24/5-15s-dns-lookups-on-kubernetes/
https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02
https://github.com/torvalds/linux/blob/24de3d377539e384621c5b8f8f8d8d01852dddc8/net/netfilter/nf_nat_core.c#L290-L291
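A hedged helper for correlating the timeouts with the dropped-insert counters: this awk snippet sums the per-CPU insert_failed values from conntrack -S style output. The sample lines below are made up for illustration; on a real node you would pipe the live `conntrack -S` output in instead and compare the total before and after a slow lookup.

```shell
# Sum the per-CPU insert_failed counters from conntrack -S output.
sum_insert_failed() {
  awk '{ for (i = 1; i <= NF; i++)
           if ($i ~ /^insert_failed=/) { split($i, a, "="); s += a[2] } }
       END { print s + 0 }'
}

# Illustrative sample; on a node: conntrack -S | sum_insert_failed
sample='cpu=0 found=10 invalid=0 ignore=5 insert=0 insert_failed=3 drop=3
cpu=1 found=12 invalid=0 ignore=7 insert=0 insert_failed=1 drop=1'

printf '%s\n' "$sample" | sum_insert_failed   # prints 4
```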
Hi, I'm trying to replicate the bug on an OCP cluster I've set up. Regards,
(In reply to Weibin Liang from comment #17)

An update against the test results in comment #17: I CAN reproduce the issue on AWS with instance type "m4.large", which has 4 vCPUs. I checked the "m3.large" instance type; it has only 2 vCPUs and I cannot reproduce the issue on it. It seems the issue can be reproduced easily when the nodes have 4 or more CPUs, and it should not be related to the IaaS platform.

3.9 env version:
openshift v3.9.51
openvswitch-selinux-extra-policy-1.0-8.el7fdp.noarch
openvswitch-2.9.0-83.el7fdp.1.x86_64

3.11 env version:
openshift v3.11.43
openvswitch-selinux-extra-policy-1.0-7.el7fdp.noarch
openvswitch-2.9.0-70.el7fdp.1.x86_64
Sorry for a typo in the previous comment: the instance type should be "m4.xlarge", which has 4 vCPUs.
*** Bug 1662982 has been marked as a duplicate of this bug. ***
I'm not sure if this has been noticed already, but the reproduction of the 5s delay in my 3.9.60 cluster coincides with atomic-openshift-node service restarts. This would be because the .svc.cluster.local queries are going to SkyDNS in the node service (?)
(In reply to Robert Bost from comment #54)
> I'm not sure if this has been noticed already but the reproduction of the 5s
> delay in my 3.9.60 cluster coincides with atomic-openshift-node service
> restarts. This would be b/c the .svc.cluster.local queries are going to
> SkyDNS in node service (?)

Is the restart happening because of a package upgrade, or if not, for what reason?
(In reply to Dan Williams from comment #57)
> (In reply to Robert Bost from comment #54)
> > I'm not sure if this has been noticed already but the reproduction of the 5s
> > delay in my 3.9.60 cluster coincides with atomic-openshift-node service
> > restarts. This would be b/c the .svc.cluster.local queries are going to
> > SkyDNS in node service (?)
>
> Is the restart happening because of a package upgrade, or if not what reason?

I was performing the restart by hand (systemctl restart atomic-openshift-node) and just poking around at another issue while running `getent hosts` as well as a Java app doing DNS resolution. I had noticed an almost exactly 5 second (5001-5010 ms) delay in the response after about every other service restart.
A workaround is adding the below iptables rule to the nodes:

iptables -t raw -A PREROUTING -p udp --dport 53 -j NOTRACK

Tested with both 3.9.51 and 3.11.43.
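A sketch of applying that rule idempotently on each node (assumes root; `iptables -C` checks whether the rule already exists so re-running the script does not append duplicates; whether the reply direction also needs a matching rule may depend on the environment):

```
# Skip conntrack for inbound DNS queries on this node
# (the workaround from the comment above).
iptables -t raw -C PREROUTING -p udp --dport 53 -j NOTRACK 2>/dev/null \
  || iptables -t raw -A PREROUTING -p udp --dport 53 -j NOTRACK
```

Note that NOTRACK exempts the matched packets from connection tracking entirely, which is why it sidesteps the conntrack insert race, at the cost of those flows no longer appearing in the conntrack table.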
Hi Aaron,

Since the customer confirmed that the workaround (iptables -t raw -A PREROUTING -p udp --dport 53 -j NOTRACK) works in the test environment, please go ahead with the RCA and fix. We also need to report the details of the RCA to the customer.

Thanks & Regards,
Judith
Hi Aaron & Jon,

Since we have tested that the kernel patch (in comment #75) could fix the issue (comments #87 and #88), could you please work together on the RCA and proceed with the final fix?

Thanks & Regards,
Judith
(In reply to Judith Zhu from comment #89)
> Hi Aaron & Jon,
>
> Since we have tested the kernel patch(in comment#75) could fix the issue
> (comment# 87, 88), could you please work together on the RCA and proceed
> with the final fix ?
>
> Thanks & Regards,
> Judith

Hi Judith,

The RCA is that nf_conntrack is exposed to race conditions when OpenShift/containers are in use. This causes packet drops for both UDP and TCP traffic. DNS uses UDP and, as we have found here, is also exposed. After verifying that Florian's kernel works for a customer affected by this bz, I think it's safe to say that this is actually a duplicate of bug 1648965. We will pursue the fix there.

Regards,
Jon
*** This bug has been marked as a duplicate of bug 1648965 ***
Are you able to provide any information about the resolution of bug 1648965? I don't have access to it.
Sorry about the delay; bz 1648965 is verified.
Hi, I'm from the OpenShift consulting team and I have a customer that hits the same issue in production (OCP v3.11.43). Are you able to provide any information about the resolution of bug #1648965? I don't have access to it. Is the above workaround (iptables -t raw -A PREROUTING -p udp --dport 53 -j NOTRACK) the final solution to this issue? Do I need to apply it on all master/infra/compute nodes?
*** Bug 1717105 has been marked as a duplicate of this bug. ***
Is this issue resolved? I can't access bug 1648965 as it seems to be private, and I heard that a fix has been released for it. Can you tell me which specific kernel version contains the fix? We're planning to upgrade the RHEL version, and it would be great to have the exact kernel version so that we could update the kernel as well. Thanks
Hello Waseem! Check this article and see if it helps; it lists the specific kernel versions and erratas: https://access.redhat.com/solutions/3827421