Bug 1772593 - [4.1]egressnetworkpolicy with dnsname has performance impact due to calling dig often
Summary: [4.1]egressnetworkpolicy with dnsname has performance impact due to calling d...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.1.z
Assignee: Juan Luis de Sousa-Valadas
QA Contact: huirwang
URL:
Whiteboard:
Depends On: 1684079
Blocks:
Reported: 2019-11-14 17:18 UTC by Juan Luis de Sousa-Valadas
Modified: 2020-03-11 14:19 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-11 14:19:06 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github openshift origin pull 24515 0 None closed [release-4.1] Bug 1772593: Make DNS querying more efficient by querying once per dns name 2020-06-15 18:20:13 UTC
Github openshift ose pull 1549 0 None None None 2020-06-15 18:20:12 UTC

Description Juan Luis de Sousa-Valadas 2019-11-14 17:18:46 UTC
This bug was initially created as a copy of Bug #1684079

I am copying this bug because: 
Backport request for 4.1.
I'll do the backport myself.


Description of problem:

The customer has 15 egressNetworkPolicy objects with 479 rules in total, of which 150 are dnsName rules. Most of these dnsName values are repeated; only 17 are unique:
$ cat enp.txt | grep namespace: -c
15
$ cat enp.txt | grep -c -- '- to'
479
$ cat enp.txt | grep dnsName: | wc -l
150
$ cat enp.txt | grep dnsName: | sort -u | wc -l
17
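
For anyone who wants to see which names are repeated and how often, rather than just the totals, here is a minimal Python sketch. It assumes enp.txt is a YAML dump in which each dnsName rule appears on its own "dnsName: <host>" line; the sample text and the function name count_dns_names are made up for illustration and are not taken from the customer's data.

```python
from collections import Counter


def count_dns_names(text):
    """Count how often each dnsName value appears in a policy dump."""
    names = Counter()
    for line in text.splitlines():
        stripped = line.strip()
        # Tolerate a leading YAML list marker such as "- dnsName: foo".
        if stripped.startswith("- "):
            stripped = stripped[2:].strip()
        if stripped.startswith("dnsName:"):
            names[stripped.split(":", 1)[1].strip()] += 1
    return names


# Hypothetical sample, not the customer's policies.
sample = """\
- to:
    dnsName: github.com
  type: Allow
- to:
    cidrSelector: 10.0.0.0/8
  type: Allow
- to:
    dnsName: github.com
  type: Allow
- to:
    dnsName: example.com
  type: Allow
"""

counts = count_dns_names(sample)
print(sum(counts.values()), "dnsName rules,", len(counts), "unique")
for name, n in counts.most_common():
    print(n, name)
```

Names with a high count are the ones where a per-name lookup saves the most dig calls.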

This causes a severe performance issue because dig is being called constantly.

The egressNetworkPolicy code determines the TTL of each dnsName's A record by calling dig. Because dig queries dnsmasq, which caches the answer, the first query returns the full TTL and every later query returns only the remaining TTL (the original TTL minus the time elapsed since the record was cached).

If an A record has a very short TTL (e.g. github.com has only 60 seconds), dig is called even more often, which makes things worse.

Customer has 14 entries for github.com:
$ cat enp.txt | grep 'dnsName: github.com' -c
14

I asked the customer to run execsnoop ( https://github.com/iovisor/bcc/blob/master/tools/execsnoop.py ). In 10.283 seconds I see 82 occurrences of "/usr/bin/dig +nocmd +noall +answer +ttlid a github.com"

In those same 10.283 seconds, atomic-openshift-node called dig 1038 times in total, fairly evenly distributed over time (calls per one-second bucket):

$ for i in {0..9}; do cat digsnoop | grep -v ^10 | grep -c ^$i; done
72
135
143
103
81
126
89
83
63
104
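
The same per-second breakdown can be produced without the grep loop. The sketch below is only an illustration: it assumes each line of the capture starts with the elapsed time in seconds (as the grep on the leading digit above implies), and the sample lines are invented, not taken from the customer's digsnoop file.

```python
from collections import Counter


def calls_per_second(lines):
    """Bucket trace lines by the whole second of their leading timestamp."""
    buckets = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        try:
            timestamp = float(fields[0])
        except ValueError:
            continue  # header line or anything without a numeric first field
        buckets[int(timestamp)] += 1
    return buckets


# Hypothetical capture lines in an assumed "time command args" layout.
sample = [
    "TIME(s) PCOMM PID PPID RET ARGS",
    "0.120 dig 1201 900 0 /usr/bin/dig +nocmd +noall +answer +ttlid a github.com",
    "0.870 dig 1202 900 0 /usr/bin/dig +nocmd +noall +answer +ttlid a github.com",
    "1.004 dig 1203 900 0 /usr/bin/dig +nocmd +noall +answer +ttlid a example.com",
    "10.010 dig 1204 900 0 /usr/bin/dig +nocmd +noall +answer +ttlid a github.com",
]

histogram = calls_per_second(sample)
for second in sorted(histogram):
    print(second, histogram[second])
```

Unlike the grep on the first character, this also keeps second 10 in its own bucket instead of having to exclude it.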

Version-Release number of selected component (if applicable):
3.9, but I don't see any relevant change in 3.11 so it probably affects both

How reproducible:
Always

Steps to Reproduce:
1. Create several egressNetworkPolicy objects in several projects pointing to the same hostnames. Use at least 10 different hostnames and make sure their A records have a low TTL (25 seconds is pretty low).
2. Wait two minutes so that the caches start refreshing.

Actual results:
dig is called several times per second

Expected results:
dig is called once per TTL for each DNS name, regardless of how many rules reference that name
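
To illustrate the expected behaviour, here is a rough Python sketch of a lookup cache keyed by DNS name and shared by all rules. It is not the actual openshift-sdn code (which is written in Go) and not necessarily how the fix is implemented; SharedDNSCache, fake_resolve, and the injected clock are hypothetical names used only for this example.

```python
import time


class SharedDNSCache:
    """One cache entry per DNS name, shared by every rule that uses the name."""

    def __init__(self, resolve, clock=time.monotonic):
        self._resolve = resolve  # callable: name -> (list of IPs, TTL in seconds)
        self._clock = clock
        self._entries = {}       # name -> (list of IPs, expiry time)

    def lookup(self, name):
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]      # still fresh: no new query
        ips, ttl = self._resolve(name)
        self._entries[name] = (ips, now + ttl)
        return ips


# Fake resolver and fake clock so the example runs without any network access.
calls = []


def fake_resolve(name):
    calls.append(name)
    return (["192.0.2.1"], 60)


now = [0.0]
cache = SharedDNSCache(fake_resolve, clock=lambda: now[0])

for _ in range(14):              # 14 rules referencing the same name
    cache.lookup("github.com")

now[0] = 61.0                    # the 60-second TTL has expired
cache.lookup("github.com")

print(len(calls))                # prints 2
```

With 14 rules for the same name, the resolver is hit once at the start and once after the TTL expires, instead of once per rule.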

Additional info:
Calling dig so often on every node has a big performance impact.

