Bug 1850060
| Summary: | After upgrading to 3.11.219 timeouts are appearing. | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Juan Luis de Sousa-Valadas <jdesousa> |
| Component: | Networking | Assignee: | Ben Bennett <bbennett> |
| Networking sub component: | openshift-sdn | QA Contact: | huirwang |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | aconstan, airshad, anbhat, bbennett, bleanhar, bpritche, bretm, erich, fpan, gvaughn, nstielau, suchaudh, zzhao |
| Version: | 3.11.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-27 22:32:27 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Juan Luis de Sousa-Valadas
2020-06-23 13:34:34 UTC
Setting the target release to the latest dev branch. Once we have reproduced the issue we can consider a backport.

*** Bug 1843390 has been marked as a duplicate of this bug. ***

At this point the customer is noticing three different cases, which I believe have at least two independent root causes:

1. An egressNetworkPolicy rule with a dnsName that resolves to multiple IP addresses which change over time.
2. An egressNetworkPolicy rule with a dnsName that resolves to exactly one IP address which doesn't change over time.
3. An egressNetworkPolicy rule with a cidrSelector.

(A minimal example policy covering these cases, and a command to inspect the resulting flows, are sketched after this comment.)

Problem #1 is well understood: we query a hostname, gather every A record for it, and add a flow in OVS for each one. Because the DNS server doesn't return every IP address on each query, the flow table ends up missing some of them, and we see connections failing more often. This has always been a well-known problem, but on 3.11.219, due to https://github.com/openshift/origin/pull/24518, the problem is significantly amplified.

For fixing problem #1 I'm considering a few possibilities:

1. Disregard the DNS TTL and query once per second. This is the simplest fix, but most likely also the least effective one.
2. Every time an IP address disappears, don't remove it unless it has been absent from N consecutive queries (5?).
3. Query the server a few times on every loop, maybe 3-5 times. I'm not certain this makes sense, because I believe dnsmasq will just return its cache.

Problems #2 and #3 are not caused by Open vSwitch having a bad flow table. We ran a test and captured the flows of table 100 during it: the flow for the CIDR block or the A record was present long before the timeout and still present right after it. As problems #2 and #3 are not understood at all, and they are having a smaller impact, they will be addressed in a different BZ which I'll file once the problem is understood.

After some investigation with 3.11.219 and 3.11.216, I see the flows last a similar amount of time, so the theory doesn't add up. At this point I have two theories:

1. There is a larger amount of time between the flow being deleted and re-created. I think this is the most likely option.
2. There is a problem in a different component, such as the kernel or vswitchd.

I asked the customer to create the egress network policy flows manually; if we stop seeing timeouts, then we should be able to say the main issue is #1. This should also fix problems #2 and #3.
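For reference, a minimal policy of the kind described above might look like the following. This is an illustrative sketch against the network.openshift.io/v1 EgressNetworkPolicy API used by openshift-sdn; the namespace and CIDR are invented for the example, and api.icims.com is the domain the customer reports testing against later in this bug:

```yaml
apiVersion: network.openshift.io/v1
kind: EgressNetworkPolicy
metadata:
  name: default
  namespace: example-project   # illustrative namespace
spec:
  egress:
  # Cases 1 and 2: dnsName rules. The SDN resolves the name and
  # programs one OVS flow per A record it currently sees.
  - type: Allow
    to:
      dnsName: api.icims.com
  # Case 3: a cidrSelector rule, no DNS resolution involved.
  - type: Allow
    to:
      cidrSelector: 192.0.2.0/24   # illustrative block
  # Deny everything else.
  - type: Deny
    to:
      cidrSelector: 0.0.0.0/0
```

The SDN programs the flows for such a policy into table 100 of the br0 bridge, the table captured in the test above; one way to inspect them is:

```console
$ ovs-ofctl -O OpenFlow13 dump-flows br0 table=100
```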
We've confirmed the scenario through manual egress flows and conversation between SDN Engineering and the customer, and SDN Engineering has provided a candidate build of the patch for the customer to test. The normal process is for the PR to be built in the current 4.x master (4.6), then backported down the 4.x chain back to 4.3, and finally built for 3.11. Between each of those backports it has to be validated by QA, so it typically takes several weeks. However, since this is critical, SDN Engineering is willing to take some shortcuts: once the fix is confirmed to work for 4.6 and gets approved, tested, and merged, SDN Engineering will ask QA to validate the image for 3.11.

*** Bug 1835646 has been marked as a duplicate of this bug. ***

Setting the target release to the current development branch. We will consider backports once we have the fix.

My customer enabled egress network policies using the patched sdn image, and they immediately started seeing a spike of timeout issues. I've quoted the team's update to the case below. Please let us know if you need any other information to assist with this.

--------------------------------
"We deployed the image quay.io/jdesousa/sdn:v3.11.219-concurrent-dns in our clusters and started noticing failed egress connections immediately, with some behavior differences that I'll list below:

- Egress Network Policies that allow access to one domain would work for some pods and not for others. For example, pods of the same deployment were affected differently.
  * Killing a pod (and it being rescheduled on another node) had a high probability that on the other node egress access behaved as configured with ENPs.
  * At least 1/3 of the nodes were affected, so it's difficult to exclude that this is (or isn't) related to the nodes. [1]

[1] A concrete example was access to the domain `api.icims.com`, which was working from podA and not from podB. I verified manually using a remote shell, initiating a TCP connection using telnet. I was able to connect from podA and not from podB.

The deployment of the fix was as per your suggestion: simply edit the sdn daemonset by removing the image trigger and replacing the image. Then I deleted the sdn pods one by one and let new ones restart. I verified that the nodes had the new image by ssh-ing into them and checking the running image."
--------------------------------

A new image was provided yesterday between 16:00 and 17:00 UTC; so far there are no complaints. I will wait until tomorrow before writing the tests for this so that it can get merged.

The core dump is corrupt:

[jdesousa@systemv /tmp/mozilla_jdesousa0] $ md5sum sdn.core.18146.gz
2e19daad569f65486733a12257d2217e  sdn.core.18146.gz
[jdesousa@systemv /tmp/mozilla_jdesousa0] $ gzip -d sdn.core.18146.gz
[jdesousa@systemv /tmp/mozilla_jdesousa0] $ md5sum sdn.core.18146
14ec2612b326ee5bc1d96d56f26925e7  sdn.core.18146
[jdesousa@systemv /tmp/mozilla_jdesousa0] $ dlv core openshift sdn.core.18146
reading NT_PRPSINFO: unexpected EOF

The logs however contain useful information. I will provide an update today.

The issue is well known. I'll make a few test builds and test them myself, and we'll test a new image on Friday. For short-lived TTLs (i.e., less than half an hour) we'll assume the TTL is at most 10 seconds, and we'll only remove an IP address from the list if the name hasn't resolved to it 5 times in a row. (A sketch of this logic follows at the end of this report.)

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.
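As a footnote to the final fix described above (clamp short TTLs to at most 10 seconds, drop an IP only after 5 consecutive misses), here is a minimal Go sketch of that logic. It is not the actual openshift-sdn patch; the names and types are hypothetical, and only the two thresholds come from this report:

```go
package dnstracker

import (
	"net"
	"time"
)

// Illustrative constants matching the fix described above.
const (
	shortTTLThreshold = 30 * time.Minute // "short-lived" TTL cutoff
	clampedTTL        = 10 * time.Second // assume at most this for short TTLs
	missesBeforeDrop  = 5                // consecutive absent lookups before removal
)

// refreshInterval clamps short TTLs so names with rotating A records
// are re-queried frequently instead of trusted for their full TTL.
func refreshInterval(ttl time.Duration) time.Duration {
	if ttl < shortTTLThreshold && ttl > clampedTTL {
		return clampedTTL
	}
	return ttl
}

// trackedName records, per resolved IP, how many consecutive lookups
// that IP has been missing from.
type trackedName struct {
	misses map[string]int
}

// update merges one DNS resolution into the tracked set. New IPs are
// added (and would get an OVS flow) immediately; an existing IP is
// removed (and would lose its flow) only after missesBeforeDrop
// consecutive lookups without it, so round-robin DNS answers that omit
// some A records no longer make the flows flap.
func (t *trackedName) update(resolved []net.IP) (added, removed []string) {
	seen := make(map[string]bool, len(resolved))
	for _, ip := range resolved {
		seen[ip.String()] = true
	}
	for ip := range seen {
		if _, known := t.misses[ip]; !known {
			added = append(added, ip)
		}
		t.misses[ip] = 0 // present in this lookup: reset the counter
	}
	for ip, n := range t.misses {
		if seen[ip] {
			continue
		}
		if n+1 >= missesBeforeDrop {
			delete(t.misses, ip)
			removed = append(removed, ip)
		} else {
			t.misses[ip] = n + 1
		}
	}
	return added, removed
}
```

With a 10-second refresh and 5 tolerated misses, an A record that a round-robin DNS server omits intermittently has to stay absent for roughly 50 seconds before its flow is withdrawn, which directly addresses the flow flapping described in problem #1.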