Description of problem:

Customer has 15 egressNetworkPolicy objects with 479 rules, of which 150 are dnsName rules. Most of these dnsName entries are repeated:

$ cat enp.txt | grep namespace: -c
15
$ cat enp.txt | grep -c -- '- to'
479
$ cat enp.txt | grep dnsName: | wc -l
150
$ cat enp.txt | grep dnsName: | sort -u | wc -l
17

This causes a severe performance issue because dig is being called constantly.

The egressNetworkPolicy code checks the TTL of each dnsName A record by calling dig, and dig queries dnsmasq. The first time dig is called, dnsmasq returns the full TTL; on later calls it returns the TTL minus the time elapsed since the previous query. If an A record has a very small TTL (e.g. github.com has only 60 seconds), dig is called a lot, making things even worse.

Customer has 14 entries for github.com:

$ cat enp.txt | grep 'dnsName: github.com' -c
14

I asked the customer to use execsnoop ( https://github.com/iovisor/bcc/blob/master/tools/execsnoop.py ) and in 10.283 seconds I see 82 occurrences of "/usr/bin/dig +nocmd +noall +answer +ttlid a github.com".

In those 10.283 seconds I also see dig being called 1038 times by atomic-openshift-node, pretty evenly distributed:

$ for i in {0..9}; do cat digsnoop | grep -v ^10 | grep -c ^$i; done
72
135
143
103
81
126
89
83
63
104

Version-Release number of selected component (if applicable):
3.9, but I don't see any relevant change in 3.11, so it probably affects both.

How reproducible:
Always

Steps to Reproduce:
1. Create several egressNetworkPolicy objects in several projects pointing to the same hostnames. Use at least 10 different hostnames and make sure the A records have a low TTL (25 seconds is pretty low).
2. Wait two minutes so that the caches start refreshing.

Actual results:
dig is called several times per second.

Expected results:
dig is called once per TTL for each hostname, regardless of how many rules reference it.

Additional info:
Calling dig so often on every node has a big performance impact.
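The duplication counts in the description can also be broken down per hostname, which shows which names would benefit most from a shared lookup. This is a minimal sketch, assuming enp.txt is the same YAML dump of the egressNetworkPolicy objects and that each dnsName value sits on its own line; count_dns_names is a hypothetical helper name, not part of any OpenShift tooling.

```shell
#!/bin/sh
# count_dns_names FILE   (hypothetical helper, for illustration only)
# Prints "<count> <hostname>" for every dnsName found in FILE,
# most frequently repeated hostname first.
count_dns_names() {
    grep 'dnsName:' "$1" |      # keep only the dnsName rule lines
        awk '{print $NF}' |     # last field on the line is the hostname
        sort |                  # group identical hostnames together
        uniq -c |               # count each group
        sort -rn |              # highest count first
        awk '{print $1, $2}'    # normalize the padding added by uniq -c
}

# Usage (assumes enp.txt exists in the current directory):
#   count_dns_names enp.txt
```

Given the counts reported above, running this against the customer's enp.txt should list 17 hostnames, with github.com showing a count of 14.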
Aniket, can you take a look at this one next?
Apparently the PR got merged in 4.3, so this needs to be verified on 4.3 first and then it will be backported to 3.11. I hope my understanding is correct here.
Verified based on Comment 15. Juan, please re-open if you see something different in your environment.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062