Bug 1825355
Summary: | openshift-sdn node can permanently miss NetNamespaces when LIST times out | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Jim Minter <jminter> | ||||||
Component: | Networking | Assignee: | Casey Callendrello <cdc> | ||||||
Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> | ||||||
Status: | CLOSED ERRATA | Docs Contact: | |||||||
Severity: | high | ||||||||
Priority: | urgent | CC: | aconstan, bbennett, ffranz, gmarkley, jeder, mjudeiki, skrenger, sreber | ||||||
Version: | 4.3.z | Keywords: | ServiceDeliveryBlocker | ||||||
Target Milestone: | --- | ||||||||
Target Release: | 4.5.0 | ||||||||
Hardware: | Unspecified | ||||||||
OS: | Unspecified | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||
Doc Text: |
Cause: When etcd is very slow, openshift-sdn can miss namespace creation events due to a race condition.
Consequence: Pods in that namespace have no connectivity.
Fix: The race condition was removed.
Result: Pods eventually have connectivity.
|
Story Points: | --- | ||||||
Clone Of: | |||||||||
: | 1839107 (view as bug list) | Environment: | |||||||
Last Closed: | 2020-07-13 17:28:35 UTC | Type: | Bug | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | |||||||||
Bug Blocks: | 1839107 | ||||||||
Attachments: |
|
Description
Jim Minter
2020-04-17 18:56:55 UTC
Created attachment 1679732 [details]
log from sdn-zcwr6 pod on bad master
Created attachment 1679733 [details]
ovs-ofctl dump-flows br0 -O OpenFlow13 on bad master
Setting the target release to 4.5 so that this can get worked on. We will backport fixes as needed once we have identified the problem.

There is a (Prometheus) metric for VNID-not-found errors. You can graph this on a test cluster and see if there is a correlation.

Jim, I think your analysis is exactly correct. And it should be an easy fix. It would be better if we didn't make direct GET calls to the apiserver, but that's a pattern that openshift-sdn often uses.

Is it only NetNamespace watches that fail? Clearly GETs succeed.

Hi Casey, do you have any advice on how to verify this bug, since this kind of issue is not always reproducible? Thanks in advance.

This is quite difficult to verify. It's only caused by etcd resource starvation, and has never happened in any meaningful way outside of Azure, which we can't use. Sorry, I don't immediately have any ideas. Maybe you can really throttle down disk IOPS? But that will break the cluster in other ways.

Perhaps the best way to verify this is just to create a lot of namespaces very quickly and ensure that pods work.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409