1772594 – [3.11] egressnetworkpolicy with dnsname has performance impact due to calling dig often

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	3.11.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	3.11.z
Assignee:	Juan Luis de Sousa-Valadas
QA Contact:	huirwang
Docs Contact:
URL:
Whiteboard:
Depends On:	1684079
Blocks:

Reported:	2019-11-14 17:19 UTC by Juan Luis de Sousa-Valadas
Modified:	2024-06-13 22:18 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:	Cause: DNS names are stored in each EgressNetworkPolicy in which they are defined. When the DNS records for a given policy are refreshed, the current code calls dig even if that DNS record has already been refreshed by virtue of the same DNS name being present in another EgressNetworkPolicy. Consequence: If the same DNS name is present in multiple EgressNetworkPolicy objects, at scale, dig is called too often. Fix: Query DNS records once per unique DNS name rather than once per EgressNetworkPolicy. Result: Each DNS name is queried only once, no matter how many EgressNetworkPolicy objects it belongs to. This significantly improves query performance.
Clone Of:
Environment:
Last Closed:	2020-05-28 05:44:13 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift origin pull 24518	0	None	closed	[release-3.11] Bug 1772594: Make DNS querying more efficient by querying once per dns name	2021-02-09 06:05:18 UTC
Red Hat Product Errata	RHBA-2020:2215	0	None	None	None	2020-05-28 05:44:30 UTC

Description Juan Luis de Sousa-Valadas 2019-11-14 17:19:49 UTC

This bug was initially created as a copy of Bug #1684079

I am copying this bug because: 
Backport request for 3.11.
I'll do the backport myself.



Description of problem:

The customer has 15 egressNetworkPolicy objects with 479 rules, of which 150 are dnsName rules. Most of these dnsName entries are repeated; only 17 are unique:
$ cat enp.txt | grep namespace: -c
15
$ cat enp.txt | grep -c -- '- to'
479
$ cat enp.txt | grep dnsName: | wc -l
150
$ cat enp.txt | grep dnsName: | sort -u | wc -l
17

This causes a severe performance issue because dig is being called constantly.

The egressNetworkPolicy code checks the TTL of each dnsName's A record by calling dig. Because dig queries dnsmasq, the first query returns the full TTL, and each later query returns the TTL minus the time elapsed since the previous query.

If an A record has a very small TTL (e.g. github.com has only 60 seconds), dig is called even more often, making things worse.
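The TTL behaviour described above can be sketched as a small model. This is only an illustration, assuming the caching resolver reports the original TTL minus the time the record has spent in its cache; `remaining_ttl` and its parameters are hypothetical names, not part of OpenShift or dnsmasq.

```python
def remaining_ttl(original_ttl: int, cached_at: float, now: float) -> int:
    """TTL a caching resolver would report for a record cached at `cached_at`.

    Assumption: the resolver counts the TTL down linearly and never
    reports a negative value.
    """
    elapsed = int(now - cached_at)
    return max(0, original_ttl - elapsed)


# A 60-second record (the github.com case) looked up at different moments.
for now in (0, 45, 59):
    print(now, remaining_ttl(60, cached_at=0, now=now))
```

Under this model, a client that schedules its next lookup from the reported value re-queries sooner the later it hits the cache.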

Customer has 14 entries for github.com:
$ cat enp.txt | grep 'dnsName: github.com' -c
14

I asked the customer to use execsnoop ( https://github.com/iovisor/bcc/blob/master/tools/execsnoop.py ) and in 10.283 seconds I see 82 occurrences of "/usr/bin/dig +nocmd +noall +answer +ttlid a github.com"

In those 10.283 seconds I also see dig being called 1038 times by atomic-openshift-node, pretty evenly distributed:

$ for i in {0..9}; do cat digsnoop | grep -v ^10 | grep -c ^$i; done
72
135
143
103
81
126
89
83
63
104
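As a quick check of the figures above, the arithmetic can be worked through as follows. The numbers are copied from this report; the variable names are illustrative only, and the "ideal" rate assumes one lookup per TTL period for a single name.

```python
# Figures copied from the report.
per_second_counts = [72, 135, 143, 103, 81, 126, 89, 83, 63, 104]  # seconds 0..9
total_calls = 1038
window_seconds = 10.283
github_calls = 82
github_ttl_seconds = 60

# Calls whose timestamp starts with 0..9; the loop excludes 10.x via `grep -v ^10`.
calls_in_first_10s = sum(per_second_counts)
# The rest fall in the final 0.283 seconds of the capture.
calls_after_10s = total_calls - calls_in_first_10s

average_rate = total_calls / window_seconds
github_rate = github_calls / window_seconds
# Assumption: one lookup per TTL period would be enough for one name.
github_ideal_rate = 1 / github_ttl_seconds

print(f"calls in seconds 0-9:   {calls_in_first_10s}")
print(f"calls after second 10:  {calls_after_10s}")
print(f"average dig calls/s:    {average_rate:.1f}")
print(f"github.com dig calls/s: {github_rate:.1f}")
print(f"github.com ideal rate:  {github_ideal_rate:.3f} calls/s")
```

That is roughly 100 dig executions per second on the node, and about 8 per second for github.com alone.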

Version-Release number of selected component (if applicable):
3.9, but I don't see any relevant change in 3.11 so it probably affects both

How reproducible:
Always

Steps to Reproduce:
1. Create several egressNetworkPolicy objects in several projects pointing to the same hostnames. Use at least 10 different hostnames and make sure the A records have a low TTL (25 seconds is pretty low)
2. Wait two minutes so that the caches start refreshing

Actual results:
dig is called several times per second

Expected results:
dig is called once per TTL for each hostname, shared across all the rules that reference it

Additional info:
Calling dig so often on every node has a big performance impact.
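The expected behaviour (one lookup per unique hostname per TTL, however many policies reference it) could look roughly like the sketch below. This is not the actual OpenShift code, which is written in Go; `SharedDnsCache`, `DnsEntry`, `count_queries` and the injected resolver are hypothetical names, and the resolver is a stand-in for the dig call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical resolver signature: name -> (addresses, ttl_seconds).
Resolver = Callable[[str], Tuple[List[str], int]]


@dataclass
class DnsEntry:
    addresses: List[str] = field(default_factory=list)
    next_query_at: float = 0.0  # due immediately until first resolved
    policies: Set[str] = field(default_factory=set)


class SharedDnsCache:
    """Tracks each DNS name once, regardless of how many policies use it."""

    def __init__(self, resolve: Resolver) -> None:
        self._resolve = resolve
        self._entries: Dict[str, DnsEntry] = {}

    def add(self, policy: str, dns_name: str) -> None:
        # Policies sharing a name share a single entry.
        self._entries.setdefault(dns_name, DnsEntry()).policies.add(policy)

    def refresh_due(self, now: float) -> List[str]:
        """Resolve every name whose TTL has expired; return the names queried."""
        refreshed = []
        for name, entry in self._entries.items():
            if now >= entry.next_query_at:
                entry.addresses, ttl = self._resolve(name)
                entry.next_query_at = now + ttl
                refreshed.append(name)
        return refreshed


def count_queries(num_policies: int, ttl: int, tick_times: List[float]) -> int:
    """Count resolver calls when `num_policies` policies share one hostname."""
    calls: List[str] = []

    def fake_resolve(name: str) -> Tuple[List[str], int]:
        calls.append(name)
        return ["192.0.2.1"], ttl  # documentation-range address, not real data

    cache = SharedDnsCache(fake_resolve)
    for i in range(num_policies):
        cache.add(f"project-{i}", "github.com")
    for now in tick_times:
        cache.refresh_due(now)
    return len(calls)


# 14 policies referencing github.com with a 60-second TTL, checked at 0 s, 30 s and 60 s.
print(count_queries(14, 60, [0, 30, 60]))
```

Keying the cache by hostname rather than by policy means the number of lookups depends only on the number of unique names and their TTLs.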

Comment 2 kedar 2020-02-25 03:59:41 UTC

Hello,

As the status of this Bugzilla is POST, what is the ETA for getting it shipped with the errata?

Comment 3 Juan Luis de Sousa-Valadas 2020-02-25 12:02:51 UTC

There is no ETA. I need to know whether we are merging 4.1 first, and there are a couple of issues with this backport which I have to fix.
Once it's fixed, it needs to be approved, cherry-picked, verified by QA and finally released. Counting all this, it will take no less than 1 month.

Comment 21 errata-xmlrpc 2020-05-28 05:44:13 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2215
