Bug 1661928 - 5 seconds timeout problem with internal openshift DNS service
Summary: 5 seconds timeout problem with internal openshift DNS service
Keywords:
Status: CLOSED DUPLICATE of bug 1648965
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.9.z
Assignee: Aaron Conole
QA Contact: zhaozhanqi
URL:
Whiteboard:
Duplicates: 1662122 1662982 1717105
Depends On: 1648965
Blocks:
 
Reported: 2018-12-24 14:03 UTC by Madhusudan Upadhyay
Modified: 2023-10-06 18:03 UTC
CC List: 67 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-12 11:38:43 UTC
Target Upstream Version:
Embargoed:


Links
GitHub kubernetes/kubernetes issue 56903 (closed): "DNS intermittent delays of 5s", last updated 2021-02-16 07:10:56 UTC
Red Hat Knowledge Base (Solution) 3827421, last updated 2019-01-24 01:27:49 UTC

Internal Links: 1648965 1686737 1686766

Description Madhusudan Upadhyay 2018-12-24 14:03:07 UTC
Description of problem:

5 seconds timeout problem with internal openshift DNS service

Version-Release number of selected component (if applicable):

v3.9.51

How reproducible:

After upgrading the cluster from 3.9.41 to 3.9.51, this issue appeared on all the nodes.



Actual results:

5-second timeouts with the internal OpenShift DNS service


Expected results:

DNS lookups should not be delayed by 5 seconds.

Additional info:

Additional information will follow in private comments.

Comment 14 Dan Mace 2019-01-09 21:59:54 UTC
Thanks Weibin, I was able to reproduce this in your environment and will take a look.

Comment 15 Dan Mace 2019-01-10 14:14:09 UTC
Just a quick note for anyone following along: I can reproduce this quickly and reliably from a container closer to the kernel (without curl):

   $ while true; do time getent ahosts $service; sleep 11; done

Comment 17 Weibin Liang 2019-01-10 16:33:47 UTC
Got more interesting results when running more tests:

Same testing env setup:
v3.9.51
Red Hat Enterprise Linux Server release 7.6 (Maipo)
openvswitch-2.9.0-83.el7fdp.1.x86_64
openvswitch-selinux-extra-policy-1.0-8.el7fdp.noarch

Case1: Cluster created in openstack with instance_type: m1.large
Test result: fail

Case2: Cluster created in openstack with instance_type: m1.medium
Test result: pass

Case3: Cluster created in AWS/EC2 with instance_type: m1.large
Test result: pass

Case4: Cluster created in AWS/EC2  with instance_type: m1.medium
Test result: pass

All above test results are consistent.

Comment 18 Dan Mace 2019-01-10 23:13:29 UTC
(In reply to Weibin Liang from comment #13)
> The problem can be reproduced every time now
> 
> Setup cluster through QE Jenkins tool:
> https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Launch%20Environment%20Flexy/
> 
> Same curl delay 5s issue can only be seen when set vm_type: m1.large, but not vm_type: m1.medium
> 
> Same curl delay 5s issue can be seen in both v3.9.51 and v3.11.67
> 
> [root@qe-weliang-case4master-etcd-nfs-1 ~]# oc new-app https://github.com/OpenShiftDemos/os-sample-python.git
> [root@qe-weliang-case4master-etcd-nfs-1 ~]# oc rsh os-sample-python-1-hv9ms
> (app-root) sh-4.2$ export svc5=os-sample-python.p1.svc.cluster.local
> (app-root) sh-4.2$ OUTPUT="  %{time_namelookup}  %{time_connect}  %{time_appconnect}  %{time_pretransfer}  %{time_redirect}  %{time_starttransfer}  %{time_total}\n"; echo ""; echo "                              time  namelookup  connect  appconnect  pretransfer  redirect  starttransfer  total"; echo "-----------------------------------------------------------------------------------------------------------------"; while true; do echo -n "$(date)"; curl -w "$OUTPUT" -o /dev/null -s $svc5:8080; sleep 2; echo -n "$(date)"; curl -w "$OUTPUT" -o /dev/null -s $svc5:8080; sleep 120; done
> 
>                               time  namelookup  connect  appconnect  pretransfer  redirect  starttransfer  total
> -----------------------------------------------------------------------------------------------------------------
> Wed Jan  9 20:35:16 UTC 2019        0.125       0.126    0.000       0.126        0.000     0.130          0.130
> Wed Jan  9 20:35:19 UTC 2019        0.012       0.013    0.000       0.013        0.000     0.016          0.016
> Wed Jan  9 20:37:19 UTC 2019        5.515       5.515    0.000       5.516        0.000     5.518          5.519
> Wed Jan  9 20:37:26 UTC 2019        0.012       0.013    0.000       0.013        0.000     0.016          0.016
> Wed Jan  9 20:39:26 UTC 2019        5.514       5.515    0.000       5.515        0.000     5.521          5.521
> Wed Jan  9 20:39:34 UTC 2019        0.012       0.013    0.000       0.013        0.000     0.016          0.016
> Wed Jan  9 20:41:34 UTC 2019        5.513       5.514    0.000       5.514        0.000     5.519          5.519
> Wed Jan  9 20:41:41 UTC 2019        0.012       0.013    0.000       0.013        0.000     0.014          0.014
> Wed Jan  9 20:43:41 UTC 2019        5.515       5.515    0.000       5.516        0.000     5.518          5.518
> Wed Jan  9 20:43:49 UTC 2019        0.012       0.013    0.000       0.013        0.000     0.014          0.014


Here's the /etc/resolv.conf from the container:

  nameserver 172.16.122.6
  search default.svc.cluster.local svc.cluster.local cluster.local int.0110-p6h.qe.rhcloud.com openstacklocal
  options ndots:5

The upstreams are:

  server=172.16.122.4
  server=172.16.122.3
  server=172.16.122.2

Because of ndots and the search path, dnsmasq is forwarding queries to `os-sample-python.p1.svc.cluster.local` to the upstream (not ideal, probably a separate issue). The delay comes from a problematic interaction between dnsmasq and the upstream caused by a separate process making periodic AAAA queries for some other name to the same upstream. Every 10 seconds, some other process is resolving qe-weliang-39master-etcd-nfs-1.int.0110-p6h.qe.rhcloud.com; it's the upstream AAAA reply for this "other" query that seems to temporarily block other queries to the same nameserver. Here's a tcpdump:

17:49:04.385698 IP (tos 0x0, ttl 64, id 41203, offset 0, flags [DF], proto UDP (17), length 104)
    172.16.122.6.7510 > 172.16.122.2.53: 40796+ A? qe-weliang-39master-etcd-nfs-1.int.0110-p6h.qe.rhcloud.com. (76)
17:49:04.385808 IP (tos 0x0, ttl 64, id 41204, offset 0, flags [DF], proto UDP (17), length 104)
    172.16.122.6.62840 > 172.16.122.2.53: 55908+ AAAA? qe-weliang-39master-etcd-nfs-1.int.0110-p6h.qe.rhcloud.com. (76)
    ^
  --
  |
  |
  The AAAA request from some other process that triggers a problematic response.

17:49:04.386822 IP (tos 0x0, ttl 64, id 50914, offset 0, flags [DF], proto UDP (17), length 120)
    172.16.122.2.53 > 172.16.122.6.7510: 40796 1/0/0 qe-weliang-39master-etcd-nfs-1.int.0110-p6h.qe.rhcloud.com. A 172.16.122.6 (92)
  --^
  |
  |
  The A record response is fine.

17:49:04.386846 IP (tos 0x0, ttl 64, id 50915, offset 0, flags [DF], proto UDP (17), length 182)
    172.16.122.2.53 > 172.16.122.6.62840: 55908 0/1/0 (154)
  --^
  |
  |
  This AAAA response to dnsmasq blocks ALL subsequent queries to this nameserver
  for 5 seconds following the response. Notice the absence of a status code
  (e.g. NXDomain) in the reply.

17:49:08.686055 IP (tos 0x0, ttl 64, id 44817, offset 0, flags [DF], proto UDP (17), length 111)
    172.16.122.6.51978 > 172.16.122.2.53: 49930+ A? os-sample-python.p1.svc.cluster.local.int.0110-p6h.qe.rhcloud.com. (83)
17:49:08.686319 IP (tos 0x0, ttl 64, id 51927, offset 0, flags [DF], proto UDP (17), length 111)
    172.16.122.2.53 > 172.16.122.6.51978: 49930 NXDomain 0/0/0 (83)
17:49:08.686584 IP (tos 0x0, ttl 64, id 44818, offset 0, flags [DF], proto UDP (17), length 111)
    172.16.122.6.63158 > 172.16.122.2.53: 51097+ AAAA? os-sample-python.p1.svc.cluster.local.int.0110-p6h.qe.rhcloud.com. (83)
17:49:08.687506 IP (tos 0x0, ttl 64, id 51929, offset 0, flags [DF], proto UDP (17), length 189)
    172.16.122.2.53 > 172.16.122.6.63158: 51097 NXDomain 0/1/0 (161)
    ^
  --
  |
  |
  These continue as if they were queued.

The upstream AAAA query to qe-weliang-39master-etcd-nfs-1.int.0110-p6h.qe.rhcloud.com results in the problematic reply every time, and if my own test query to "os-sample-python" falls in the 5 second window following the bad AAAA response, the query to "os-sample-python" is blocked until the AAAA from the upstream is handled (what handled means here is still undefined).

Still digging, but this is the progress I've made in Weibin's environment. I haven't had a chance to cross reference it with other issues to see if there's anything applicable to other reported scenarios.
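For anyone who wants to capture a similar trace, here is a minimal sketch (the interface name and flags are assumptions, not the exact command used for the dump above):

   # Watch DNS traffic between dnsmasq on the node and the upstream 172.16.122.2;
   # adjust -i to the node's actual interface.
   tcpdump -nn -v -i eth0 udp port 53 and host 172.16.122.2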

Comment 19 Hongan Li 2019-01-11 07:59:23 UTC
Can also reproduce on OpenStack with instance_type: m1.xlarge.

It looks like the only differences between the instance types are the number of vCPUs and the amount of RAM.

             vCPUs     RAM    Reproducible
m1.medium      2        4G         no
m1.large       4        8G         yes
m1.xlarge      8       16G         yes


Same testing env setup:
v3.9.51
Red Hat Enterprise Linux Server release 7.6 (Maipo)
openvswitch-2.9.0-83.el7fdp.1.x86_64
openvswitch-selinux-extra-policy-1.0-8.el7fdp.noarch

Comment 32 Weibin Liang 2019-01-15 16:26:00 UTC
Editing the dc to add the dnsConfig option name: single-request can be an optional workaround for this bug.

[root@qe-weliang-311master-etcd-nfs-1 ~]# oc edit deploymentconfig.apps.openshift.io/os-sample-python
spec:
  template:
    spec:
      ...
      dnsConfig:
        options:
        - name: single-request
      dnsPolicy: ClusterFirst


But this option is only supported in 3.11, not in 3.7 or 3.9.
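For reference, a sketch of applying the same option with `oc patch` instead of `oc edit` (assumptions: the example dc name from this bug and a 3.11 cluster with the feature enabled; not the exact command used here):

   oc patch dc/os-sample-python --type=merge -p '
   spec:
     template:
       spec:
         dnsConfig:
           options:
           - name: single-request
         dnsPolicy: ClusterFirst
   '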

Comment 35 Stephen Cuppett 2019-01-18 13:20:37 UTC
*** Bug 1662122 has been marked as a duplicate of this bug. ***

Comment 36 seferovic 2019-01-21 09:00:19 UTC
Hi,

we are running 3.9.51 ATM and have encountered similar behaviour on our application nodes that have 8+ CPU cores. In our case the timeouts do not occur on a node with 4 or 6 cores. Our investigation has shown that every time we encounter a timeout in a container (due to RES_TIMEOUT = 5), the "insert_failed" counter of conntrack increases on one of the CPUs (this can be seen with conntrack -S). We can reproduce this behavior with a simple Python app calling getaddrinfo(). Due to the 4+ search domains in resolv.conf, dnsmasq will try to resolve all of them (including both AAAA and A queries). Could this problem actually be a race condition in netfilter?
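A minimal sketch of that correlation check (assuming the conntrack tool is installed on the node and reusing the example service name from this bug):

   # On the node: watch the per-CPU insert_failed counters.
   watch -n1 'conntrack -S | grep -o "insert_failed=[0-9]*"'

   # In a pod on that node: exercise the resolver, as in comment 15.
   while true; do time getent ahosts os-sample-python.p1.svc.cluster.local; sleep 11; done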

Workarounds: CustomPODConfig could be activated on 3.9+ by enabling it as a feature gate, but adding "single-request" or "single-request-reopen" does not always solve the problem.

Comment 37 Matteo Croce 2019-01-21 10:20:47 UTC
Hi,

this is very useful. Do you need 8 cores in a POD to trigger the bug?

Comment 38 seferovic 2019-01-21 15:07:30 UTC
Hi,

at first glance we thought that the timeout only occurs on application nodes with multiple CPUs, but after running further tests we had to accept that the behavior also exists on application nodes with 4 CPUs. We needed to insert a sleep between requests in our test runner to achieve the same behavior, but basically running "getent ahosts" in a while loop (with a sleep period as in #c15) in a container shows that requests can take over 5 sec. And of course, every time this happens the conntrack "insert_failed" counter increases.

Some (maybe) helpful links:

https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts
https://github.com/kubernetes/kubernetes/issues/62628
https://blog.quentin-machu.fr/2018/06/24/5-15s-dns-lookups-on-kubernetes/
https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02
https://github.com/torvalds/linux/blob/24de3d377539e384621c5b8f8f8d8d01852dddc8/net/netfilter/nf_nat_core.c#L290-L291

Comment 39 Matteo Croce 2019-01-21 17:26:41 UTC
Hi,

I'm trying to replicate the bug on an OCP cluster I've set up.

Regards,

Comment 47 Hongan Li 2019-01-23 06:04:30 UTC
(In reply to Weibin Liang from comment #17)

An update against the test results in comment #17:

I CAN reproduce the issue on AWS with instance type "m4.large", which has 4 vCPUs.

I checked the "m3.large" instance; it has only 2 vCPUs and I cannot reproduce the issue on it.

It seems the issue can be reproduced easily if the number of vCPUs on the node is >= 4, and it should not be related to the IaaS platform.

3.9 env version:
openshift v3.9.51
openvswitch-selinux-extra-policy-1.0-8.el7fdp.noarch
openvswitch-2.9.0-83.el7fdp.1.x86_64

3.11 env version:
openshift v3.11.43
openvswitch-selinux-extra-policy-1.0-7.el7fdp.noarch
openvswitch-2.9.0-70.el7fdp.1.x86_64

Comment 48 Hongan Li 2019-01-23 08:50:07 UTC
Sorry, there was a typo in the previous comment: the instance type should be "m4.xlarge", which has 4 vCPUs.

Comment 53 Dan Mace 2019-01-23 20:39:55 UTC
*** Bug 1662982 has been marked as a duplicate of this bug. ***

Comment 54 Robert Bost 2019-01-24 01:49:34 UTC
I'm not sure if this has been noticed already but the reproduction of the 5s delay in my 3.9.60 cluster coincides with atomic-openshift-node service restarts. This would be b/c the .svc.cluster.local queries are going to SkyDNS in node service (?)

Comment 57 Dan Williams 2019-01-24 11:04:43 UTC
(In reply to Robert Bost from comment #54)
> I'm not sure if this has been noticed already but the reproduction of the 5s
> delay in my 3.9.60 cluster coincides with atomic-openshift-node service
> restarts. This would be b/c the .svc.cluster.local queries are going to
> SkyDNS in node service (?)

Is the restart happening because of a package upgrade, or if not what reason?

Comment 60 Robert Bost 2019-01-24 23:57:16 UTC
(In reply to Dan Williams from comment #57)
> (In reply to Robert Bost from comment #54)
> > I'm not sure if this has been noticed already but the reproduction of the 5s
> > delay in my 3.9.60 cluster coincides with atomic-openshift-node service
> > restarts. This would be b/c the .svc.cluster.local queries are going to
> > SkyDNS in node service (?)
> 
> Is the restart happening because of a package upgrade, or if not what reason?

I was performing the restart by hand (systemctl restart atomic-openshift-node) and just poking around at another issue while running `getent hosts` as well as a Java app doing DNS resolution. I noticed an almost exactly 5-second (5001-5010 ms) delay in the response on about every other service restart.
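A hedged sketch of that check (the service name is the sample app used earlier in this bug, not necessarily what was queried here):

   # In a pod: time lookups continuously.
   while true; do time getent hosts os-sample-python.p1.svc.cluster.local; sleep 1; done

   # On the node, in another shell: restart the node service and watch for ~5s lookups.
   systemctl restart atomic-openshift-node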

Comment 61 Hongan Li 2019-01-25 01:21:14 UTC
A workaround is adding the iptables rule below to the nodes:

iptables -t raw -A PREROUTING -p udp --dport 53 -j NOTRACK

tested with both 3.9.51 and 3.11.43.
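For reference, a verification sketch (assumes the rule above has already been added on the node):

   # Confirm the NOTRACK rule is in place.
   iptables -t raw -L PREROUTING -n -v | grep NOTRACK

   # With NOTRACK in effect, new DNS flows should no longer appear in conntrack.
   conntrack -L -p udp 2>/dev/null | grep "dport=53" | head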

Comment 66 Judith Zhu 2019-01-28 12:37:41 UTC
Hi Aaron,

Since the customer confirmed that the workaround (iptables -t raw -A PREROUTING -p udp --dport 53 -j NOTRACK) works in their test environment, please go ahead with the RCA and fix. We also need to report the details of the RCA to the customer.

Thanks & Regards,
Judith

Comment 89 Judith Zhu 2019-02-12 09:48:26 UTC
Hi Aaron & Jon,

Since we have tested that the kernel patch (in comment #75) fixes the issue (comments #87 and #88), could you please work together on the RCA and proceed with the final fix?


Thanks & Regards,
Judith

Comment 90 Jonathan Maxwell 2019-02-13 01:49:06 UTC
(In reply to Judith Zhu from comment #89)
> Hi Aaron & Jon,
> 
> Since we have tested the kernel patch(in comment#75) could fix the issue 
> (comment# 87, 88), could you please work together on the RCA and proceed
> with the final fix ?
> 
> 
> Thanks & Regards,
> Judith

Hi Judith,

The RCA is that nf_conntrack is exposed to race conditions when OpenShift/containers are in use. This causes packet drop issues for both UDP and TCP traffic. DNS uses UDP, and as we have found here it is also exposed. After verifying that Florian's kernel works for a customer affected by this bz, I think it's safe to say that it is actually a duplicate of bug 1648965. We will pursue the fix there.

Regards

Jon

Comment 114 Stephen Cuppett 2019-03-12 11:38:43 UTC

*** This bug has been marked as a duplicate of bug 1648965 ***

Comment 116 Cameron 2019-03-12 12:03:52 UTC
Are you able to provide any information about the resolution of bug 1648965, as I don't have access to it?

Comment 117 yiche 2019-03-13 05:39:19 UTC
Sorry about the delay; bz1648965 is verified.

Comment 120 Ahmed Bessifi 2019-08-01 10:04:23 UTC
Hi,

I'm from the OpenShift consultant team and I have a customer with the same issue in production (OCP v3.11.43).
Are you able to provide any information about the resolution of #1648965, as I don't have access to it?

Is the above workaround (iptables -t raw -A PREROUTING -p udp --dport 53 -j NOTRACK) the final solution to this issue? Do I need to apply it on all master/infra/compute nodes?

Comment 124 Dan Mace 2019-10-11 10:26:09 UTC
*** Bug 1717105 has been marked as a duplicate of this bug. ***

Comment 126 Waseem Hassan Shahid 2019-11-13 10:40:31 UTC
Is this issue resolved? I can't access BZ 1648965 as it seems to be private, and I heard that a fix has been released for it. Can you tell us which specific kernel version contains the fix? We're planning to upgrade the RHEL version, and it would be great if we could have the exact kernel version so that we could update the kernel as well. Thanks

Comment 127 Alejandro Coma 2019-11-13 10:48:37 UTC
Hello Waseem!

Check this article and see if it helps. You can find the specific kernel versions and errata at https://access.redhat.com/solutions/3827421

