Bug 1684079 - egressnetworkpolicy with dnsname has performance impact due to calling dig often
Summary: egressnetworkpolicy with dnsname has performance impact due to calling dig often
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.11.0
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Aniket Bhat
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks: 1772594 1743881 1772592 1772593
Reported: 2019-02-28 11:24 UTC by Juan Luis de Sousa-Valadas
Modified: 2020-02-07 16:27 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: DNS names are stored in each EgressNetworkPolicy that defines them. When the DNS records for a given policy are refreshed, the current code calls dig regardless of whether that DNS record has already been refreshed by virtue of the same DNS name being present in another EgressNetworkPolicy.
Consequence: If the same DNS name is present in multiple EgressNetworkPolicy objects, dig is called too often at scale.
Fix: Query DNS records per unique DNS name rather than per EgressNetworkPolicy.
Result: Each DNS record is queried only once, no matter how many EgressNetworkPolicy objects reference it. This significantly improves the performance of the queries.
Clone Of:
: 1743881 (view as bug list)
Environment:
Last Closed: 2020-01-23 11:03:45 UTC
Target Upstream Version:



Links
System ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 4027301 Performance tune None EgressNetworkPolicies with dnsName cause high CPU usage 2019-04-02 11:41:14 UTC
Red Hat Product Errata RHBA-2020:0062 None None None 2020-01-23 11:03:59 UTC

Description Juan Luis de Sousa-Valadas 2019-02-28 11:24:10 UTC
Description of problem:

Customer has 15 egressNetworkPolicies with 479 rules, of which 150 are dnsName rules. Most of these dnsName values are repeated; only 17 are unique:
$ cat enp.txt | grep namespace: -c
15
$ cat enp.txt | grep -c -- '- to'
479
$ cat enp.txt | grep dnsName: | wc -l
150
$ cat enp.txt | grep dnsName: | sort -u | wc -l
17
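
The shell counts above show that names are repeated, but not which ones. A minimal sketch of a per-name breakdown follows; it assumes the dump has one `dnsName: <host>` entry per line, as the grep commands imply. The sample lines are made up for illustration and are not the customer's data.

```python
from collections import Counter

def count_dns_names(lines):
    """Count how many times each dnsName value appears in the given lines."""
    counts = Counter()
    for line in lines:
        _, sep, value = line.partition("dnsName:")
        if sep:  # the line contains a dnsName entry
            counts[value.strip()] += 1
    return counts

# Hypothetical sample in the same shape as an EgressNetworkPolicy dump.
sample = [
    "  - to:",
    "      dnsName: github.com",
    "  - to:",
    "      cidrSelector: 10.0.0.0/8",
    "  - to:",
    "      dnsName: github.com",
    "  - to:",
    "      dnsName: example.com",
]

counts = count_dns_names(sample)
total = sum(counts.values())
unique = len(counts)
for name, n in counts.most_common():
    print(f"{n:4d} {name}")
print(f"{total} dnsName rules, {unique} unique names")
```

To run it against a real dump, pass the open file instead of the sample, for example `count_dns_names(open("enp.txt"))`.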

This causes a severe performance issue because dig is being called constantly.

The egressNetworkPolicy code checks the TTL of each dnsName's A record by calling dig. dig queries dnsmasq: the first time a name is queried, dnsmasq returns the full TTL; the second time, it returns the TTL minus the time elapsed since the previous query.

If an A record has a very short TTL (for example, github.com has only 60 seconds), dig is called even more often, which makes things worse.
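
The TTL countdown described above can be illustrated with a toy model. This is only a sketch of the behaviour as reported here, not dnsmasq's actual implementation; the class name and the explicit `now` parameter are assumptions made for the example.

```python
class CachingResolver:
    """Toy model of a caching resolver (illustrative, not dnsmasq's code)."""

    def __init__(self, upstream_ttls):
        self.upstream_ttls = upstream_ttls  # name -> TTL served upstream, in seconds
        self.expires_at = {}                # name -> time at which the cached record expires

    def query(self, name, now):
        """Return the TTL a client would see when asking at time `now` (seconds)."""
        expiry = self.expires_at.get(name)
        if expiry is None or now >= expiry:
            # Cache miss or expired entry: fetch upstream, cache for the full TTL.
            expiry = now + self.upstream_ttls[name]
            self.expires_at[name] = expiry
        # Cached entries are served with the remaining lifetime, not the full TTL.
        return expiry - now

resolver = CachingResolver({"github.com": 60})
first = resolver.query("github.com", now=0)   # full TTL
later = resolver.query("github.com", now=45)  # TTL minus the 45 seconds elapsed
print(first, later)
```

A caller that schedules its next refresh from the returned TTL will therefore come back sooner when it happens to query late in the record's lifetime.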

Customer has 14 entries for github.com:
$ cat enp.txt | grep 'dnsName: github.com' -c
14

I asked the customer to use execsnoop ( https://github.com/iovisor/bcc/blob/master/tools/execsnoop.py ) and in 10.283 seconds I see 82 occurrences of "/usr/bin/dig +nocmd +noall +answer +ttlid a github.com"

In those 10.283 seconds I also see dig being called 1038 times by atomic-openshift-node, pretty evenly distributed:

$ for i in {0..9}; do cat digsnoop | grep -v ^10 | grep -c ^$i; done
72
135
143
103
81
126
89
83
63
104
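
As a quick arithmetic check of the figures above, the following sketch sums the per-second buckets and derives the average call rate. It assumes the first column of the capture is elapsed time in seconds, so that whatever is not counted in buckets 0 to 9 falls in the final partial second excluded by `grep -v ^10`.

```python
# Per-second dig counts for seconds 0..9, copied from the output above.
per_second = [72, 135, 143, 103, 81, 126, 89, 83, 63, 104]
total_calls = 1038   # all dig calls seen in the capture
window = 10.283      # length of the capture in seconds
github_calls = 82    # dig calls for github.com alone

counted = sum(per_second)             # calls in seconds 0..9
remainder = total_calls - counted     # calls left over for the last partial second
rate = total_calls / window           # average dig calls per second
github_share = github_calls / total_calls

print(f"counted={counted} remainder={remainder}")
print(f"average rate: {rate:.1f} calls/s, github.com share: {github_share:.1%}")
```

The buckets account for all but a few dozen of the calls, and the average works out to roughly one hundred dig executions per second on this node.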

Version-Release number of selected component (if applicable):
3.9, but I don't see any relevant change in 3.11 so it probably affects both

How reproducible:
Always

Steps to Reproduce:
1. Create several egressNetworkPolicy objects in several projects pointing to the same hostnames. Use at least 10 different hostnames and make sure the A records have a low TTL (25 seconds is pretty low)
2. Wait two minutes so that the caches start refreshing

Actual results:
dig is called several times per second

Expected results:
dig is called once per TTL per hostname, and the result is shared by all the rules
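
The difference between refreshing per rule and refreshing per unique hostname can be sketched with a small simulation. This is an idealized model, not the OpenShift code: it assumes every tracked entry is looked up exactly once per full TTL, so it understates the call rate observed in this report. The function name and parameters are hypothetical.

```python
import heapq

def simulate(policies, ttls, duration, dedupe):
    """Count DNS lookups issued over `duration` seconds.

    policies: list of lists of DNS names, one inner list per policy
    ttls:     mapping of DNS name -> TTL in seconds (must be positive)
    dedupe:   True to track each unique name once, False to track every rule
    """
    if dedupe:
        entries = sorted({name for policy in policies for name in policy})
    else:
        entries = [name for policy in policies for name in policy]
    # Min-heap of (next refresh time, tie-breaker, name).
    queue = [(0, i, name) for i, name in enumerate(entries)]
    heapq.heapify(queue)
    lookups = 0
    while queue and queue[0][0] < duration:
        when, i, name = heapq.heappop(queue)
        lookups += 1  # one dig-style lookup for this entry
        heapq.heappush(queue, (when + ttls[name], i, name))
    return lookups

# 14 policies that each reference github.com, as in the customer's setup.
policies = [["github.com"] for _ in range(14)]
ttls = {"github.com": 60}
per_rule = simulate(policies, ttls, duration=120, dedupe=False)
per_name = simulate(policies, ttls, duration=120, dedupe=True)
print(per_rule, per_name)
```

With deduplication the lookup count depends only on the number of unique hostnames and their TTLs, not on how many policies repeat them.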

Additional info:
Calling dig so often on every node has a big performance impact.

Comment 10 Casey Callendrello 2019-09-24 16:26:32 UTC
Aniket, can you take a look at this one next?

Comment 14 Anurag saxena 2019-10-18 17:00:47 UTC
Apparently the PR got merged in 4.3. So this needs to be verified on 4.3 first and then it will be back ported to 3.11. Hope my understanding is correct here.

Comment 16 Anurag saxena 2019-10-21 23:23:41 UTC
Verified based on Comment 15. Juan, please re-open if you see something different in your env

Comment 18 errata-xmlrpc 2020-01-23 11:03:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

