Bug 1684079

Summary: egressnetworkpolicy with dnsname has performance impact due to calling dig often
Product: OpenShift Container Platform Reporter: Juan Luis de Sousa-Valadas <jdesousa>
Component: NetworkingAssignee: Aniket Bhat <anbhat>
Networking sub component: openshift-sdn QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: medium CC: anbhat, anusaxen, aos-bugs, cdc, erich, fgrosjea, travi, weliang
Version: 3.11.0   
Target Milestone: ---   
Target Release: 4.3.0   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: DNS Names are present in each EgressNetworkPolicy they are defined as a part of. When the DNS records for a given network policy are refreshed, the current code calls dig irrespective of whether that particular DNS record has been refreshed as a virtue of the same DNS Name being present in another EgressNetworkPolicy. Consequence: If the same DNS Name is present in multiple egress network policies, at scale, we will end up calling DIG too often. Fix: Make the querying of DNS records based on uniqueness of DNS names rather than for each EgressNetworkPolicy Result: DNS records are queried only once uniquely no matter how many EgressNetworkPolicy objects they belong to. This significantly improves the performance of the queries.
Story Points: ---
Clone Of:
: 1743881 (view as bug list) Environment:
Last Closed: 2020-01-23 11:03:45 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1743881, 1772592, 1772593, 1772594    

Description Juan Luis de Sousa-Valadas 2019-02-28 11:24:10 UTC
Description of problem:

Customer has 15  egressNetworkpolicies, with 479 rules, of which 150 are dnsName. Most of these dnsName are repeated:
$ cat enp.txt | grep namespace: -c
15
$ cat enp.txt | grep -c -- '- to'
479
$ cat enp.txt | grep dnsName: | wc -l
150
$ cat enp.txt | grep dnsName: | sort -u | wc -l
17

This causes a severe performance issue because dig is being called constantly.

The egressNetworkPolicy checks for the dnsName A record TTL calling dig, as dig calls dnsmasq the first time this dig is called, dnsmasq returns the TTL, the second time it returns TTL - time elapsed since the previous query.

If an A record has a very small TTL (i.e. github.com has only 60 seconds) there will be a lot of digs called making things even worse.

Customer has 14 entries for github.com:
$ cat enp.txt | grep 'dnsName: github.com' -c
14

I asked the customer to use execsnoop ( https://github.com/iovisor/bcc/blob/master/tools/execsnoop.py ) and I see in 10.283 seconds 82 occurences of "/usr/bin/dig +nocmd +noall +answer +ttlid a github.com"

In those 10.283 seconds I also see dig being called 1038 seconds by atomic-openshift-node pretty evenly distributed:

$ for i in {0..9}; do cat digsnoop | grep -v ^10 | grep -c ^$i; done
72
135
143
103
81
126
89
83
63
104

Version-Release number of selected component (if applicable):
3.9, but I don't see any relevant change in 3.11 so it probably affects both

How reproducible:
Always

Steps to Reproduce:
1. Create several egressNetworkPolicy objects in several projects pointing to the same hostnames. Use at least 10 different hostnames and make sure the A record has a low TTL (25 is pretty low)
2. Wait two minutes so that the caches start refreshing

Actual results:
dig is called several times per second

Expected results:
Dig is called once every TTL for all the rules

Additional info:
Calling dig so often on every node has a big performance impact.

Comment 10 Casey Callendrello 2019-09-24 16:26:32 UTC
Aniket, can you take a look at this one next?

Comment 14 Anurag saxena 2019-10-18 17:00:47 UTC
Apparently the PR got merged in 4.3. So this needs to be verified on 4.3 first and then it will be back ported to 3.11. Hope my understanding is correct here.

Comment 16 Anurag saxena 2019-10-21 23:23:41 UTC
Verified based on Comment 15. Juan, please re-open if you see something different in your env

Comment 18 errata-xmlrpc 2020-01-23 11:03:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062