Description of problem:
Customer reports that after upgrading to 3.11.219 they are seeing timeouts on several connections. The issue was identified after upgrading both from 3.11.216 and from 3.11.188, with ovs-multitenant in both cases. They are actively using egressNetworkPolicy, but after checking the flows on the node where this happens and correlating them with the timestamps of the timeouts, it doesn't look like those flows are correlated with the failures in any obvious way.

Version-Release number of selected component (if applicable):
3.11.219

How reproducible:
Intermittent.

Steps to Reproduce:
Customer runs builds, and they fail intermittently with both dnsName and cidrSelector rules.

Actual results:
Sometimes the connections time out.

Expected results:
Connections don't time out.
Setting the target release to the latest dev branch. Once we have reproduced the issue we can consider a backport.
*** Bug 1843390 has been marked as a duplicate of this bug. ***
So at this point the customer is noticing 3 different cases, which I believe right now have at least 2 independent root causes:

1- egressNetworkPolicy rule with a dnsName which resolves to multiple IP addresses that change over time
2- egressNetworkPolicy rule with a dnsName with exactly one IP address that doesn't change over time
3- egressNetworkPolicy rule with a cidrSelector

Problem #1 is well understood: we query a hostname, gather every A record for it, and add a flow in OVS for each one of them. Because the DNS server doesn't return every IP address on every query, the flow table doesn't cover every IP address the name can resolve to, and we see connections failing more often. This has always been a well-known problem, but on 3.11.219, due to https://github.com/openshift/origin/pull/24518, the problem is significantly amplified.

For fixing problem #1 I'm considering a few possibilities:

1- Disregard the DNS TTL and query it once per second. This is the simplest fix, but most likely also the least effective one.
2- Every time an IP address disappears from the response, don't remove it unless it is missing from N consecutive queries (5?). (See the sketch below.)
3- Query the server a few times on every loop, maybe something like 3-5 times. I'm not certain this makes sense because I believe dnsmasq will just return its cache.

Problems 2 and 3 aren't caused by Open vSwitch having a bad flow table. We ran a test and captured the flows of table 100 during it: right after the timeout the flow table still had a flow for the CIDR block or the A record, and that flow had been there since long before the timeout. As problems 2 and 3 aren't understood at all and they have a smaller impact, they'll be addressed in a different BZ which I'll file once the problem is understood.
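A minimal Go sketch of option 2 above (an IP is only dropped after it has been missing from N consecutive resolutions). The names here (dnsEntry, update, missThreshold) are hypothetical and this is not the actual openshift-sdn code; it only illustrates the bookkeeping being proposed:

package main

import (
	"fmt"
	"net"
)

// missThreshold: remove an IP's flow only after it has been absent from
// this many consecutive DNS responses.
const missThreshold = 5

// dnsEntry tracks, per dnsName, the IPs we currently have flows for and
// how many consecutive resolutions each one has been missing from.
type dnsEntry struct {
	misses map[string]int // IP -> consecutive misses since last seen
}

func newDNSEntry() *dnsEntry {
	return &dnsEntry{misses: map[string]int{}}
}

// update merges one resolution result and returns the IPs whose flows
// should now be removed.
func (e *dnsEntry) update(resolved []net.IP) (removed []string) {
	seen := map[string]bool{}
	for _, ip := range resolved {
		seen[ip.String()] = true
		e.misses[ip.String()] = 0 // seen again: reset the miss counter
	}
	for ip, n := range e.misses {
		if seen[ip] {
			continue
		}
		if n+1 >= missThreshold {
			delete(e.misses, ip) // missed 5 times in a row: drop the flow
			removed = append(removed, ip)
		} else {
			e.misses[ip] = n + 1
		}
	}
	return removed
}

func main() {
	e := newDNSEntry()
	e.update([]net.IP{net.ParseIP("192.0.2.10"), net.ParseIP("192.0.2.11")})
	for i := 1; i <= 5; i++ {
		// 192.0.2.11 stops being returned; its flow survives the first 4 misses.
		fmt.Println(i, e.update([]net.IP{net.ParseIP("192.0.2.10")}))
	}
}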
After some investigation with 3.11.219 and 3.11.216 I see the flows last a similar amount of time, so that theory doesn't add up. At this point I have two theories:

1- There is a larger amount of time between the flow being deleted and re-created. I think this is the most likely option (illustrated in the sketch below).
2- There is a problem in a different component, such as the kernel or vswitchd.

I asked the customer to create the egress network policy flows manually; if we stop seeing timeouts, then we should be able to say the main issue is 1. This should also fix problems 2 and 3.
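To illustrate theory 1, a hypothetical Go sketch (flowAdd/flowDel stand in for the real OVS calls; this is not the actual openshift-sdn update logic): if the flow for an unchanged IP is deleted and then re-added on every DNS refresh, there is a window with no matching allow flow during which new connections hit the deny behavior and time out.

package main

import (
	"fmt"
	"time"
)

func flowAdd(ip string) { fmt.Println("add allow flow for", ip) }
func flowDel(ip string) { fmt.Println("del allow flow for", ip) }

// naiveRefresh deletes and re-adds the flow even when the IP did not
// change; anything that delays the re-add (a busy node, slow flow
// programming, ...) widens the window where egress traffic is dropped.
func naiveRefresh(ip string) {
	flowDel(ip)
	time.Sleep(50 * time.Millisecond) // stand-in for the delete/add gap
	flowAdd(ip)
}

// saferRefresh only touches flows whose IPs actually changed, and adds
// the new flow before removing the old one, so traffic that is still
// allowed never loses its covering flow.
func saferRefresh(oldIP, newIP string) {
	if oldIP == newIP {
		return // nothing changed: leave the flow alone
	}
	flowAdd(newIP)
	flowDel(oldIP)
}

func main() {
	naiveRefresh("192.0.2.10")               // unnecessary outage window
	saferRefresh("192.0.2.10", "192.0.2.10") // no-op
	saferRefresh("192.0.2.10", "192.0.2.20") // add before delete
}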
We've confirmed the scenario through manual egress flows and conversation between SDN Engineering and the customer, and SDN Engineering has provided a candidate build of the patch for the customer to test. The normal process is for the PR to be built in the current 4.x master (4.6), then backported down the 4.x chain back to 4.3, and finally built for 3.11. Between each of those backports it has to be validated by QA, so it typically takes several weeks. However, since this is critical, SDN Engineering is willing to take some shortcuts. Once the fix is confirmed to work for 4.6 and is approved, tested, and merged, SDN Engineering will ask QA to validate the image for 3.11.
*** Bug 1835646 has been marked as a duplicate of this bug. ***
Setting the target release to the current development branch. We will consider backports once we have the fix.
My customer enabled egress network policies using the patched sdn image, and they immediately started seeing a spike of timeout issues. I've quoted the team's update to the case below. Please let us know if you need any other information to assist with this.
--------------------------------
"We deployed the image quay.io/jdesousa/sdn:v3.11.219-concurrent-dns in our clusters and started noticing failed egress connections immediately, with some behavior differences that I'll list below:
- Egress Network Policies that allow access to one domain would work for some pods and not for others. For example, pods of the same deployment were affected differently.
  * Killing a pod (so that it got rescheduled on another node) had a high probability that on the other node egress access behaved as configured with ENPs.
  * At least a third of the nodes were affected, so it's difficult to tell whether or not this is related to the nodes. [1]
A concrete example was access to the domain `api.icims.com`, which was working from podA and not from podB. I verified manually using a remote shell and initiating a TCP connection with telnet: I was able to connect from podA and not from podB.
The deployment of the fix was done as per your suggestion: I edited the sdn daemonset to remove the image trigger and replace the image, then deleted the sdn pods one by one and let new ones restart. I verified that the nodes had the new image by ssh-ing into them and checking the running image."
--------------------------------
A new image was provided yesterday between 16:00-17:00 UTC; so far there are no complaints. I will wait until tomorrow before proceeding to write the tests for this so that it can get merged.
The core dump is corrupt:

[jdesousa@systemv /tmp/mozilla_jdesousa0] $ md5sum sdn.core.18146.gz
2e19daad569f65486733a12257d2217e  sdn.core.18146.gz
[jdesousa@systemv /tmp/mozilla_jdesousa0] $ gzip -d sdn.core.18146.gz
[jdesousa@systemv /tmp/mozilla_jdesousa0] $ md5sum sdn.core.18146
14ec2612b326ee5bc1d96d56f26925e7  sdn.core.18146
[jdesousa@systemv /tmp/mozilla_jdesousa0] $ dlv core openshift sdn.core.18146
reading NT_PRPSINFO: unexpected EOF

The logs however contain useful information. Will provide an update today.
The issue is well known; I'll make a few test builds and test them myself, and we'll test a new image on Friday.
For short-lived TTLs (i.e. less than half an hour) we'll assume the TTL is at most 10 seconds, and we'll only remove an IP address from the list if the name hasn't resolved to it 5 times in a row.
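A minimal Go sketch of the re-resolution interval described above (constant names are illustrative, this is not the merged patch): TTLs shorter than half an hour are re-queried at least every 10 seconds, longer TTLs are honored as-is. This combines with the 5-consecutive-misses rule sketched earlier in the bug.

package main

import (
	"fmt"
	"time"
)

const (
	shortTTLCutoff = 30 * time.Minute // "short lived" TTLs
	minRefresh     = 10 * time.Second // cap applied to short TTLs
)

// refreshInterval returns how long to wait before re-resolving a dnsName
// whose records carried the given TTL.
func refreshInterval(ttl time.Duration) time.Duration {
	if ttl < shortTTLCutoff && ttl > minRefresh {
		return minRefresh // treat the TTL as at most 10 seconds
	}
	return ttl
}

func main() {
	fmt.Println(refreshInterval(5 * time.Second)) // 5s: already below the cap
	fmt.Println(refreshInterval(5 * time.Minute)) // 10s: short TTL, clamped
	fmt.Println(refreshInterval(2 * time.Hour))   // 2h: long TTL honored
}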
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days