Bug 1859682 - Network Policies stale entries exists indefinitely under NB db
Summary: Network Policies stale entries exists indefinitely under NB db
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Andrew Stoycos
QA Contact: Anurag saxena
URL:
Whiteboard:
Duplicates: 1877560
Depends On:
Blocks: 1877560 1877561
Reported: 2020-07-22 17:00 UTC by Anurag saxena
Modified: 2020-10-27 16:17 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Race condition.
Consequence: ACLs and associated OVN network entities were orphaned on the cluster.
Fix: Add a network policy cleanup function that is called upon namespace deletion.
Result: The fix stops the race; all network policies and their associated ACLs and entities are removed.
Clone Of:
Clones: 1877560 1877561
Environment:
Last Closed: 2020-10-27 16:16:41 UTC
Target Upstream Version:
Embargoed:


Attachments
ovnkube master logs (816.98 KB, text/plain)
2020-07-22 17:00 UTC, Anurag saxena
ovnkube-node logs (244.06 KB, text/plain)
2020-07-22 17:01 UTC, Anurag saxena


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift ovn-kubernetes pull 286 | 0 | None | closed | Bug 1882037: 9-23-2020 merge | 2020-12-16 18:24:47 UTC
Github | ovn-org ovn-kubernetes pull 1665 | 0 | None | closed | Fix Orphaned ACL race condition | 2020-12-16 18:24:47 UTC
Red Hat Product Errata | RHBA-2020:4196 | 0 | None | None | None | 2020-10-27 16:17:00 UTC

Description Anurag saxena 2020-07-22 17:00:42 UTC
Created attachment 1702116 [details]
ovnkube master logs

Description of problem: Some network policies persist indefinitely in the NB db even though the project containing them was deleted.


Version-Release number of selected component (if applicable): 4.5.3


How reproducible: Always


Steps to Reproduce:
1. On an OVNKubernetes cluster, create 10 network policies in a project.

2. Query OVN NB for those NPs: oc exec ovnkube-master-mhrv6 -- ovn-nbctl list ACL

3. Delete the project containing all the NPs.

4. Repeat step 2 to check that the policies were deleted from the NB db as well.

Actual results: In step 4, 4 of the 10 NPs still exist in the NB db indefinitely.

Expected results: Network policies should be cleared from the NB db as well.


Additional info: ovnkube-master and ovnkube-node logs are attached.

The policies created were "allow-from-blue-1" to "allow-from-blue-10", but the following still exist indefinitely:

$ oc exec ovnkube-master-mhrv6 -- ovn-nbctl list ACL|grep -i allow-from-blue
Defaulting container name to northd.
Use 'oc describe pod/ovnkube-master-mhrv6 -n openshift-ovn-kubernetes' to see all of the containers in this pod.
external_ids        : {Ingress_num="0", ipblock_cidr="false", l4Match=None, namespace=d9yev, policy=allow-from-blue-6, policy_type=Ingress}
external_ids        : {Ingress_num="0", ipblock_cidr="false", l4Match=None, namespace=d9yev, policy=allow-from-blue-8, policy_type=Ingress}
external_ids        : {Ingress_num="0", ipblock_cidr="false", l4Match=None, namespace=d9yev, policy=allow-from-blue-9, policy_type=Ingress}
external_ids        : {Ingress_num="0", ipblock_cidr="false", l4Match=None, namespace=d9yev, policy=allow-from-blue-7, policy_type=Ingress}
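
For repeated checks, the leftover policy names can be pulled out of the `ovn-nbctl list ACL` output with a small script. The following is only a sketch: the function name `stale_policies` is hypothetical, and it assumes the `external_ids` lines look exactly like the ones pasted above (unquoted `namespace=` and `policy=` values).

```python
import re


def stale_policies(acl_listing: str, namespace: str) -> list[str]:
    """Return the sorted policy names whose ACL rows still reference `namespace`.

    Hypothetical helper: it only understands external_ids lines in the
    format shown in this bug report.
    """
    found = set()
    for line in acl_listing.splitlines():
        if not line.strip().startswith("external_ids"):
            continue
        ns = re.search(r"\bnamespace=([^,}\s]+)", line)
        # "\bpolicy=" does not match "policy_type=", so only the name is captured.
        pol = re.search(r"\bpolicy=([^,}\s]+)", line)
        if ns and pol and ns.group(1).strip('"') == namespace:
            found.add(pol.group(1).strip('"'))
    return sorted(found)


if __name__ == "__main__":
    sample = (
        'external_ids        : {Ingress_num="0", ipblock_cidr="false", '
        "l4Match=None, namespace=d9yev, policy=allow-from-blue-6, policy_type=Ingress}\n"
    )
    print(stale_policies(sample, "d9yev"))
```

Feeding it the saved output of the command above, with the deleted project's name as `namespace`, should return an empty list once the NB db is clean.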

Comment 1 Anurag saxena 2020-07-22 17:01:19 UTC
Created attachment 1702117 [details]
ovnkube-node logs

Comment 2 Andrew Stoycos 2020-08-11 15:46:38 UTC
Hi Anurag, 

I have not been able to reproduce this in my local testing environment; I was able to create 10 NetworkPolicies, and they were all correctly removed upon namespace deletion. Can you still reproduce this bug on your cluster?

Comment 3 Anurag saxena 2020-08-11 16:08:33 UTC
Sure, Andrew. I will try to repro this and share env with you

Comment 6 Andrew Stoycos 2020-08-19 20:20:39 UTC
Closing for now, please reopen if the problem is reproduced.

Comment 7 Anurag saxena 2020-08-19 20:41:06 UTC
@Andrew, I am not sure why this is CLOSED. As I mentioned above, this always reproduces on 4.5 but not on 4.6. Customers like VZ are expected to upgrade to 4.5 and might encounter this. WDYT?

Comment 8 Andrew Stoycos 2020-08-19 20:44:40 UTC
Sorry about that, Anurag. I think my BZ permissions must be incorrect, since I could not see any new commits. I will attempt to reproduce on 4.5 and get back to you.

Comment 9 Anurag saxena 2020-08-19 20:47:53 UTC
No worries. Let me know; I can also share a cluster with you. The one above has been pruned now.

Comment 10 Andrew Stoycos 2020-08-19 20:52:19 UTC
OK, that would be great. If you don't mind, just send the cluster info to my email <astoycos> and I will see what I can do.

Comment 12 Andrew Stoycos 2020-08-20 20:12:08 UTC
Interesting... OVNkube Master is sending the delete command properly for all ACLs:

[astoycos@blademm ovn-kubernetes]$ oc logs ovnkube-master-6tds9 ovnkube-master | grep policy
I0819 21:50:19.820426       1 policy.go:930] Adding network policy allow-from-blue-1 in namespace c-4sw
I0819 21:50:21.126521       1 policy.go:930] Adding network policy allow-from-blue-2 in namespace c-4sw
I0819 21:50:22.425348       1 policy.go:930] Adding network policy allow-from-blue-3 in namespace c-4sw
I0819 21:50:23.570671       1 policy.go:930] Adding network policy allow-from-blue-4 in namespace c-4sw
I0819 21:50:25.099853       1 policy.go:930] Adding network policy allow-from-blue-5 in namespace c-4sw
I0819 21:50:26.488301       1 policy.go:930] Adding network policy allow-from-blue-6 in namespace c-4sw
I0819 21:50:27.681563       1 policy.go:930] Adding network policy allow-from-blue-7 in namespace c-4sw
I0819 21:50:29.090299       1 policy.go:930] Adding network policy allow-from-blue-8 in namespace c-4sw
I0819 21:50:30.302540       1 policy.go:930] Adding network policy allow-from-blue-9 in namespace c-4sw
I0819 21:50:31.564745       1 policy.go:930] Adding network policy allow-from-blue-10 in namespace c-4sw
I0819 21:52:14.571226       1 policy.go:930] Adding network policy allow-from-red-1 in namespace c-4sw
I0819 21:52:15.709963       1 policy.go:930] Adding network policy allow-from-red-2 in namespace c-4sw
I0819 21:52:16.907521       1 policy.go:930] Adding network policy allow-from-red-3 in namespace c-4sw
I0819 21:52:18.166393       1 policy.go:930] Adding network policy allow-from-red-4 in namespace c-4sw
I0819 21:52:19.461405       1 policy.go:930] Adding network policy allow-from-red-5 in namespace c-4sw
I0819 21:52:20.643500       1 policy.go:930] Adding network policy allow-from-red-6 in namespace c-4sw
I0819 21:52:21.847235       1 policy.go:930] Adding network policy allow-from-red-7 in namespace c-4sw
I0819 21:52:22.971010       1 policy.go:930] Adding network policy allow-from-red-8 in namespace c-4sw
I0819 21:52:24.476897       1 policy.go:930] Adding network policy allow-from-red-9 in namespace c-4sw
I0819 21:52:25.657679       1 policy.go:930] Adding network policy allow-from-red-10 in namespace c-4sw
I0819 21:54:02.072549       1 policy.go:1115] Deleting network policy allow-from-blue-1 in namespace c-4sw
I0819 21:54:02.114856       1 policy.go:1115] Deleting network policy allow-from-blue-10 in namespace c-4sw
I0819 21:54:02.138441       1 policy.go:1115] Deleting network policy allow-from-blue-2 in namespace c-4sw
I0819 21:54:02.162150       1 policy.go:1115] Deleting network policy allow-from-blue-3 in namespace c-4sw
I0819 21:54:02.190051       1 policy.go:1115] Deleting network policy allow-from-blue-4 in namespace c-4sw
I0819 21:54:02.212620       1 policy.go:1115] Deleting network policy allow-from-blue-5 in namespace c-4sw
I0819 21:54:02.240378       1 policy.go:1115] Deleting network policy allow-from-blue-6 in namespace c-4sw
I0819 21:54:02.267691       1 policy.go:1115] Deleting network policy allow-from-blue-7 in namespace c-4sw
I0819 21:54:02.298269       1 policy.go:1115] Deleting network policy allow-from-blue-8 in namespace c-4sw
I0819 21:54:02.337928       1 policy.go:1115] Deleting network policy allow-from-blue-9 in namespace c-4sw
I0819 21:54:02.372556       1 policy.go:1115] Deleting network policy allow-from-red-1 in namespace c-4sw
I0819 21:54:02.404705       1 policy.go:1115] Deleting network policy allow-from-red-10 in namespace c-4sw
I0819 21:54:02.478611       1 policy.go:1115] Deleting network policy allow-from-red-2 in namespace c-4sw
I0819 21:54:02.478771       1 policy.go:1115] Deleting network policy allow-from-red-3 in namespace c-4sw
I0819 21:54:02.479137       1 policy.go:1115] Deleting network policy allow-from-red-4 in namespace c-4sw
I0819 21:54:02.479220       1 policy.go:1115] Deleting network policy allow-from-red-5 in namespace c-4sw
I0819 21:54:02.479314       1 policy.go:1115] Deleting network policy allow-from-red-6 in namespace c-4sw
I0819 21:54:02.479385       1 policy.go:1115] Deleting network policy allow-from-red-7 in namespace c-4sw
I0819 21:54:02.479468       1 policy.go:1115] Deleting network policy allow-from-red-8 in namespace c-4sw
I0819 21:54:02.479559       1 policy.go:1115] Deleting network policy allow-from-red-9 in namespace c-4sw

However, they still persist in the OVN NB db, so the issue lies somewhere in the communication between ovnkube-master and the nbdb. I am leaning towards this being a race condition in which the namespace is deleted before all the ACLs can be removed, which I believe was solved by the addition of workqueues in 4.6. I am going to dig a bit more to confirm the problem.
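
To make the suspected race concrete, here is a toy Python model (not the ovn-kubernetes Go code). Everything in it is an assumption for illustration: the `Controller` class, the early return when the namespace state is missing, and the point at which the namespace deletion lands. It only shows how a policy delete that is logged can still leave its ACL behind if the namespace state disappears first.

```python
class Controller:
    """Toy stand-in for the controller: cached namespace state plus NB db ACL rows."""

    def __init__(self):
        self.namespaces = {}  # namespace name -> set of policy names
        self.acls = set()     # (namespace, policy) pairs standing in for ACL rows

    def add_policy(self, ns, policy):
        self.namespaces.setdefault(ns, set()).add(policy)
        self.acls.add((ns, policy))

    def delete_policy(self, ns, policy):
        cached = self.namespaces.get(ns)
        if cached is None:
            # Assumed behaviour: with the namespace state gone, the handler
            # gives up and the ACL row is never removed.
            return
        cached.discard(policy)
        self.acls.discard((ns, policy))

    def delete_namespace(self, ns):
        # No policy cleanup here; that is the gap being modelled.
        self.namespaces.pop(ns, None)


def run_race():
    ctrl = Controller()
    names = [f"allow-from-blue-{i}" for i in range(1, 11)]
    for name in names:
        ctrl.add_policy("c-4sw", name)
    order = sorted(names)  # lexical order, as in the log: 1, 10, 2, 3, ...
    for name in order[:6]:
        ctrl.delete_policy("c-4sw", name)
    ctrl.delete_namespace("c-4sw")  # assumed to land part-way through the deletes
    for name in order[6:]:
        ctrl.delete_policy("c-4sw", name)
    return sorted(policy for _, policy in ctrl.acls)


if __name__ == "__main__":
    print(run_race())
```

With the namespace deletion placed after the first six deletes, the model leaves allow-from-blue-6 through allow-from-blue-9 behind. The cut point was picked by hand; the real interleaving on the cluster is not known from the logs.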

Comment 15 Ben Bennett 2020-09-03 13:46:59 UTC
This appears not to affect 4.6, so will only need a backport to 4.5.

Comment 16 Andrew Stoycos 2020-09-03 14:01:08 UTC
OK, so after talking with some other team members, I think we've isolated where the race is: a namespace is deleted before all of the networking entities are deleted in https://github.com/openshift/ovn-kubernetes/blob/release-4.5/go-controller/pkg/ovn/policy.go#L1114, which results in OVN not cleaning up any ACLs that remain. @Anurag, if you could recreate this one more time and share the cluster, I think we will be able to confirm the exact issue.

Comment 18 Andrew Stoycos 2020-09-04 20:19:00 UTC
I was finally able to force a reproduction of this race on upstream OVN-K8s by adding a pause in the NetworkPolicy delete code. This convinces me that the race is not isolated to OCP OVN-K8s 4.5 but affects master as well. The linked PR tracks my work to fix this bug.
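
The direction of the fix, as summarized in this bug's Doc Text (a network policy cleanup that runs on namespace deletion), can be sketched in the same toy style. This is a Python illustration under assumptions, not the code from the linked PR: `FixedController` and its methods are hypothetical names, and the early return in `delete_policy` is an assumed model of the original behaviour.

```python
class FixedController:
    """Toy model: namespace deletion also removes every ACL of its policies."""

    def __init__(self):
        self.namespaces = {}  # namespace name -> set of policy names
        self.acls = set()     # (namespace, policy) pairs standing in for ACL rows

    def add_policy(self, ns, policy):
        self.namespaces.setdefault(ns, set()).add(policy)
        self.acls.add((ns, policy))

    def delete_policy(self, ns, policy):
        cached = self.namespaces.get(ns)
        if cached is None:
            return  # late delete event: nothing left to do after cleanup
        cached.discard(policy)
        self.acls.discard((ns, policy))

    def delete_namespace(self, ns):
        # Cleanup step: drop the ACLs of every policy still cached for the
        # namespace before forgetting the namespace itself.
        for policy in self.namespaces.pop(ns, set()):
            self.acls.discard((ns, policy))


def run_with_cleanup():
    ctrl = FixedController()
    names = [f"allow-from-blue-{i}" for i in range(1, 11)]
    for name in names:
        ctrl.add_policy("c-4sw", name)
    order = sorted(names)
    for name in order[:6]:
        ctrl.delete_policy("c-4sw", name)
    ctrl.delete_namespace("c-4sw")  # arrives before the remaining deletes
    for name in order[6:]:
        ctrl.delete_policy("c-4sw", name)
    return sorted(policy for _, policy in ctrl.acls)


if __name__ == "__main__":
    print(run_with_cleanup())
```

Here the late policy deletes become harmless no-ops, because the namespace deletion has already removed the ACLs they would have targeted.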

Comment 19 Anurag saxena 2020-09-09 20:28:00 UTC
I created the clones for 4.5 and 4.6 just in case. This needs to be tested again on the backported releases. Thanks. Correct me if I am mistaken.

Comment 20 Ben Bennett 2020-09-10 13:12:43 UTC
*** Bug 1877560 has been marked as a duplicate of this bug. ***

Comment 21 Andrew Stoycos 2020-09-23 21:39:33 UTC
Has been merged upstream as of 9/23, just waiting for downstream rebase for backport.

Comment 23 Andrew Stoycos 2020-09-24 19:09:50 UTC
@Anurag, do you mind verifying this fix so I can backport?

Comment 24 Anurag saxena 2020-09-24 19:36:38 UTC
Sure, Andrew. I will try to verify this by EOD

Comment 25 Anurag saxena 2020-09-24 20:22:35 UTC
Thanks Andrew. Test (mentioned in bug description) looks good on 4.6.0-0.nightly-2020-09-24-111253. Verifying this one.

Comment 26 Andrew Stoycos 2020-09-24 20:23:16 UTC
Thanks!

Comment 29 errata-xmlrpc 2020-10-27 16:16:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

