Bug 1752636
Summary: NetworkPolicy resources not getting applied on update
Product: OpenShift Container Platform
Component: Networking
Networking sub component: openshift-sdn
Version: 3.11.0
Target Release: 4.3.0
Reporter: rvanderp
Assignee: Dan Winship <danw>
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2020-01-23 11:05:53 UTC
CC: anusaxen, cdc, danw, dsafford, dyocum, erich, farandac, gparente, jack.ottofaro, jdesousa, mfiedler, mifiedle, misalunk, openshift-bugs-escalate, palonsor, piqin, rhowe, ricarril, rkshirsa, scuppett, sreber, tsmetana, wabouham, weliang
Clones: 1758232, 1758233, 1758235 (view as bug list)
Bug Blocks: 1758232, 1758233, 1758235

Doc Type: Bug Fix
Doc Text:
Cause: In clusters with many namespaces (especially with frequent namespace creation and deletion) and many NetworkPolicies that select namespaces, OpenShift could take a very long time to apply the NetworkPolicy rules for newly created namespaces.
Consequence: When a namespace was created, it could take an hour or more before it was correctly accessible to and from other namespaces.
Fix: Improvements were made to the Namespace and NetworkPolicy handling code.
Result: NetworkPolicies are now applied promptly to newly created namespaces.
Description
rvanderp
2019-09-16 18:46:30 UTC
Hi. Can you please paste the output of `oc get netnamespaces`? Do you know if this cluster was initially created as Multitenant and then moved to NetworkPolicy?

Yeah, thanks, that's what I was after. I wanted to know whether the flows were being created at all. Since there are flows in table 80, something had to be slowing it down, because the flow-sync code in the SDN's networkpolicy handling syncs every second.

Created attachment 1616210 [details]
AFTER deleting netpols

Created attachment 1616211 [details]
BEFORE deleting netpols
OK, so further debugging showed that every call to networkpolicy.go:handleAddOrUpdateNamespace() was taking 2 seconds to run. This turns out to be because they have a huge number of namespaces, and every one of them has an "allow from default namespace" policy (as created by the multitenant-to-networkpolicy migration script, and as recommended to make routers work in 3.11). The problem is that networkpolicy.go makes no effort to recognize that these are all the same policy. So every time a new Namespace is added, it sees that there are 10,000 (or however many) NetworkPolicies with namespaceSelectors, and tests each one against the new Namespace to see whether it matches.

The fix is to reorganize that code to keep only a single copy of each NamespaceSelector, and apply its matches to every policy that uses that selector. This may not be easy to do. (I don't think there are any good workarounds until there is a fix: getting rid of the allow-from-default policies would be disruptive, in that it would break routers and some other things.)

(In reply to Dan Winship from comment #49)
> The fix is to reorganize that code to keep only a single copy of each
> NamespaceSelector, and apply its matches to every policy that uses that
> selector. This may not be easy to do.

It would be interesting to maintain an inverted index of which network policies are interested in which label selectors. That should be a pretty good shortcut. We would have to be clever with the deleted-label case, though.

Posted a patch against master; additional clones of this bug will be created for backports.

Sure Dan Winship, will keep an eye on the merge. Thanks!

Not sure why this didn't get automatically moved to ON_QA before (are we not QA'ing 4.3 bugs yet?), but this needs to be officially VERIFIED before the backports can proceed.
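The single-copy-per-selector idea described above can be sketched as follows. This is a minimal standalone illustration, not the actual patch: the types and function names are hypothetical, only matchLabels-style selectors are modeled, and the real code operates on Kubernetes LabelSelector objects and OVS flows. The point is that identical namespaceSelectors collapse into one index entry, so a new namespace costs one selector evaluation per distinct selector instead of one per policy.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// policy stands in for a NetworkPolicy whose peer rule uses a namespaceSelector.
type policy struct {
	name     string
	selector map[string]string // matchLabels only, for illustration
}

// selectorKey canonicalizes a selector so that identical selectors
// (e.g. 10,000 copies of "name=default") collapse to one index entry.
func selectorKey(sel map[string]string) string {
	keys := make([]string, 0, len(sel))
	for k := range sel {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, k+"="+sel[k])
	}
	return strings.Join(parts, ",")
}

// matchesLabels reports whether nsLabels satisfies every matchLabels entry.
func matchesLabels(sel, nsLabels map[string]string) bool {
	for k, v := range sel {
		if nsLabels[k] != v {
			return false
		}
	}
	return true
}

// matchNewNamespace evaluates each *distinct* selector once against the new
// namespace's labels, then fans the result out to every policy sharing that
// selector. It returns the number of selector evaluations performed and the
// number of policies that matched.
func matchNewNamespace(policies []policy, nsLabels map[string]string) (evaluations, matched int) {
	byKey := map[string][]int{}                // canonical selector -> policy indices
	selByKey := map[string]map[string]string{} // canonical selector -> one representative
	for i, p := range policies {
		k := selectorKey(p.selector)
		byKey[k] = append(byKey[k], i)
		selByKey[k] = p.selector
	}
	for k, sel := range selByKey {
		evaluations++
		if matchesLabels(sel, nsLabels) {
			matched += len(byKey[k])
		}
	}
	return evaluations, matched
}

func main() {
	// 10,000 policies, all carrying the identical allow-from-default selector,
	// mimicking what the multitenant-to-networkpolicy migration produces.
	policies := make([]policy, 0, 10000)
	for i := 0; i < 10000; i++ {
		policies = append(policies, policy{
			name:     fmt.Sprintf("allow-from-default-%d", i),
			selector: map[string]string{"name": "default"},
		})
	}
	evals, matched := matchNewNamespace(policies, map[string]string{"name": "default"})
	fmt.Printf("selector evaluations: %d, policies matched: %d\n", evals, matched)
	// → selector evaluations: 1, policies matched: 10000
}
```

The tricky case the comment flags (the "deleted-label case") is not handled here: when a namespace's labels change or a label is removed, previously matched entries in the index have to be invalidated, not just re-added.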
Verifying based on comment 68 and today's scale-test observations, as follows.

Steps:
1) A 3-master, 4-worker cluster was brought up on 4.3 CI build 4.3.0-0.ci-2019-10-02-102400 (there are no nightlies yet).
2) 5000 projects, each containing an `allow-from-default-namespace` policy, were created.
3) 950 pods were created randomly among the 5000 projects to observe OVS flows (we are bound by 250 pods/node).
4) OVS table=80 flows across the workers total around 974, which looks good:

$ oc exec ovs-fpj9s -- ovs-ofctl dump-flows br0 -O openflow13 | grep table=80 | wc -l
245
$ oc exec ovs-t49cn -- ovs-ofctl dump-flows br0 -O openflow13 | grep table=80 | wc -l
241
$ oc exec ovs-vzp6c -- ovs-ofctl dump-flows br0 -O openflow13 | grep table=80 | wc -l
244
$ oc exec ovs-xzp5d -- ovs-ofctl dump-flows br0 -O openflow13 | grep table=80 | wc -l
244

5) After a 6-hour longevity run, the OVS flow totals remain the same.
6) Network policy updates across the projects are also working, without any sdn/ovs pod restarts.

Will verify again on 3.11 once backported. Thanks.

I guess we need a bug to be opened for 3.11.z as well.

I changed the version of this bz to 4.3. The 3.11 bz clone is https://bugzilla.redhat.com/show_bug.cgi?id=1758235

(In reply to Borja from comment #72)
> I changed the version of this bz to 4.3.

"Version" is the version the bug was reported in; "Target Release" is the version it's being fixed in.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062