Created attachment 1702116
ovnkube master logs

Description of problem:
Some network policies persist indefinitely in the NB db even though the project containing those policies was deleted.

Version-Release number of selected component (if applicable):
4.5.3

How reproducible:
Always

Steps to Reproduce:
1. On an OVNKubernetes cluster, create 10 network policies in a project.
2. Query OVN NB for those NPs:
   oc exec ovnkube-master-mhrv6 -- ovn-nbctl list ACL
3. Delete the project containing all the NPs.
4. Repeat step 2 to make sure the policies were deleted from the NB db as well.

Actual results:
In step 4, 4 of the 10 NPs are still present in the NB db and persist indefinitely.

Expected results:
Network policies should be cleared from the NB db as well.

Additional info:
ovnkube-master and ovnkube-node logs attached.

The policies created were "allow-from-blue-1" to "allow-from-blue-10", but the following persist indefinitely:

$ oc exec ovnkube-master-mhrv6 -- ovn-nbctl list ACL|grep -i allow-from-blue
Defaulting container name to northd.
Use 'oc describe pod/ovnkube-master-mhrv6 -n openshift-ovn-kubernetes' to see all of the containers in this pod.
external_ids : {Ingress_num="0", ipblock_cidr="false", l4Match=None, namespace=d9yev, policy=allow-from-blue-6, policy_type=Ingress}
external_ids : {Ingress_num="0", ipblock_cidr="false", l4Match=None, namespace=d9yev, policy=allow-from-blue-8, policy_type=Ingress}
external_ids : {Ingress_num="0", ipblock_cidr="false", l4Match=None, namespace=d9yev, policy=allow-from-blue-9, policy_type=Ingress}
external_ids : {Ingress_num="0", ipblock_cidr="false", l4Match=None, namespace=d9yev, policy=allow-from-blue-7, policy_type=Ingress}
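The check in step 4 can be scripted rather than eyeballed. The following is a minimal sketch, assuming the `external_ids` lines look like the ones pasted in this report; the function name `leftover_policies` is hypothetical, and the parsing is deliberately loose (it only relies on the `namespace=` and `policy=` fields). It works on captured text, so the `ovn-nbctl list ACL` output would need to be saved or piped in separately.

```python
import re
from typing import List

# Loose key=value matcher for the contents of an external_ids map.
# Assumption: keys are word characters and values contain no commas or spaces,
# as in the output pasted in this report.
_FIELD = re.compile(r"(\w+)=([^,}\s]+)")


def leftover_policies(acl_output: str, namespace: str) -> List[str]:
    """Return the sorted policy names that still have ACLs for `namespace`."""
    found = set()
    for line in acl_output.splitlines():
        line = line.strip()
        if not line.startswith("external_ids"):
            continue  # skip every other column of the ACL record
        fields = {key: value.strip('"') for key, value in _FIELD.findall(line)}
        if fields.get("namespace") == namespace and "policy" in fields:
            found.add(fields["policy"])
    return sorted(found)
```

An empty list for the deleted project's namespace would mean the NB db is clean; any names returned are policies whose ACLs were left behind.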
Created attachment 1702117
ovnkube-node logs
Hi Anurag, I have not been able to reproduce this in my local testing environment: I was able to create 10 NetworkPolicies, and all of them were removed upon namespace deletion. Can you still reproduce this bug on your cluster?
Sure, Andrew. I will try to repro this and share env with you
Closing for now, please reopen if the problem is reproduced.
@Andrew, I'm not sure why this is CLOSED. As I mentioned above, this always reproduces on 4.5 but not on 4.6. Customers like VZ are expected to upgrade to 4.5 and might encounter this. WDYT?
Sorry about that, Anurag. I think my BZ permissions must be incorrect, since I could not see any new commits. I will attempt to reproduce on 4.5 and get back to you.
No worries. Let me know; I can also share a cluster with you. The one above is pruned now.
OK, that would be great. If you don't mind, just send the cluster info to my email <astoycos> and I will see what I can do.
Interesting... OVNkube Master is sending the delete command properly for all ACLs:

[astoycos@blademm ovn-kubernetes]$ oc logs ovnkube-master-6tds9 ovnkube-master | grep policy
I0819 21:50:19.820426 1 policy.go:930] Adding network policy allow-from-blue-1 in namespace c-4sw
I0819 21:50:21.126521 1 policy.go:930] Adding network policy allow-from-blue-2 in namespace c-4sw
I0819 21:50:22.425348 1 policy.go:930] Adding network policy allow-from-blue-3 in namespace c-4sw
I0819 21:50:23.570671 1 policy.go:930] Adding network policy allow-from-blue-4 in namespace c-4sw
I0819 21:50:25.099853 1 policy.go:930] Adding network policy allow-from-blue-5 in namespace c-4sw
I0819 21:50:26.488301 1 policy.go:930] Adding network policy allow-from-blue-6 in namespace c-4sw
I0819 21:50:27.681563 1 policy.go:930] Adding network policy allow-from-blue-7 in namespace c-4sw
I0819 21:50:29.090299 1 policy.go:930] Adding network policy allow-from-blue-8 in namespace c-4sw
I0819 21:50:30.302540 1 policy.go:930] Adding network policy allow-from-blue-9 in namespace c-4sw
I0819 21:50:31.564745 1 policy.go:930] Adding network policy allow-from-blue-10 in namespace c-4sw
I0819 21:52:14.571226 1 policy.go:930] Adding network policy allow-from-red-1 in namespace c-4sw
I0819 21:52:15.709963 1 policy.go:930] Adding network policy allow-from-red-2 in namespace c-4sw
I0819 21:52:16.907521 1 policy.go:930] Adding network policy allow-from-red-3 in namespace c-4sw
I0819 21:52:18.166393 1 policy.go:930] Adding network policy allow-from-red-4 in namespace c-4sw
I0819 21:52:19.461405 1 policy.go:930] Adding network policy allow-from-red-5 in namespace c-4sw
I0819 21:52:20.643500 1 policy.go:930] Adding network policy allow-from-red-6 in namespace c-4sw
I0819 21:52:21.847235 1 policy.go:930] Adding network policy allow-from-red-7 in namespace c-4sw
I0819 21:52:22.971010 1 policy.go:930] Adding network policy allow-from-red-8 in namespace c-4sw
I0819 21:52:24.476897 1 policy.go:930] Adding network policy allow-from-red-9 in namespace c-4sw
I0819 21:52:25.657679 1 policy.go:930] Adding network policy allow-from-red-10 in namespace c-4sw
I0819 21:54:02.072549 1 policy.go:1115] Deleting network policy allow-from-blue-1 in namespace c-4sw
I0819 21:54:02.114856 1 policy.go:1115] Deleting network policy allow-from-blue-10 in namespace c-4sw
I0819 21:54:02.138441 1 policy.go:1115] Deleting network policy allow-from-blue-2 in namespace c-4sw
I0819 21:54:02.162150 1 policy.go:1115] Deleting network policy allow-from-blue-3 in namespace c-4sw
I0819 21:54:02.190051 1 policy.go:1115] Deleting network policy allow-from-blue-4 in namespace c-4sw
I0819 21:54:02.212620 1 policy.go:1115] Deleting network policy allow-from-blue-5 in namespace c-4sw
I0819 21:54:02.240378 1 policy.go:1115] Deleting network policy allow-from-blue-6 in namespace c-4sw
I0819 21:54:02.267691 1 policy.go:1115] Deleting network policy allow-from-blue-7 in namespace c-4sw
I0819 21:54:02.298269 1 policy.go:1115] Deleting network policy allow-from-blue-8 in namespace c-4sw
I0819 21:54:02.337928 1 policy.go:1115] Deleting network policy allow-from-blue-9 in namespace c-4sw
I0819 21:54:02.372556 1 policy.go:1115] Deleting network policy allow-from-red-1 in namespace c-4sw
I0819 21:54:02.404705 1 policy.go:1115] Deleting network policy allow-from-red-10 in namespace c-4sw
I0819 21:54:02.478611 1 policy.go:1115] Deleting network policy allow-from-red-2 in namespace c-4sw
I0819 21:54:02.478771 1 policy.go:1115] Deleting network policy allow-from-red-3 in namespace c-4sw
I0819 21:54:02.479137 1 policy.go:1115] Deleting network policy allow-from-red-4 in namespace c-4sw
I0819 21:54:02.479220 1 policy.go:1115] Deleting network policy allow-from-red-5 in namespace c-4sw
I0819 21:54:02.479314 1 policy.go:1115] Deleting network policy allow-from-red-6 in namespace c-4sw
I0819 21:54:02.479385 1 policy.go:1115] Deleting network policy allow-from-red-7 in namespace c-4sw
I0819 21:54:02.479468 1 policy.go:1115] Deleting network policy allow-from-red-8 in namespace c-4sw
I0819 21:54:02.479559 1 policy.go:1115] Deleting network policy allow-from-red-9 in namespace c-4sw

but they are still being persisted in ovnnbdb, so the issue is somewhere in the communication between ovnkube-master and the nbdb. I am leaning towards this being a race condition where the namespace is deleted before all the ACLs can be removed, which I believe was solved by the addition of workqueues in 4.6. However, I am going to dig a bit more to confirm the problem.
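To cross-check the observation above (every delete is logged, yet some ACLs remain), the log lines can be compared against whatever is still found in the NB db. This is a rough sketch that assumes the "Adding network policy X in namespace Y" / "Deleting network policy X in namespace Y" wording shown in this comment; the helper names `policy_events` and `logged_deleted_but_present` are hypothetical, and the set of (namespace, policy) pairs still present in the NB db has to be collected separately.

```python
import re
from typing import Iterable, List, Set, Tuple

# Matches the two ovnkube-master messages quoted in this comment.
_EVENT = re.compile(r"(Adding|Deleting) network policy (\S+) in namespace (\S+)")

PolicyKey = Tuple[str, str]  # (namespace, policy name)


def policy_events(log_text: str) -> Tuple[Set[PolicyKey], Set[PolicyKey]]:
    """Return (added, deleted) sets of (namespace, policy) seen in the log."""
    added: Set[PolicyKey] = set()
    deleted: Set[PolicyKey] = set()
    for verb, policy, namespace in _EVENT.findall(log_text):
        target = added if verb == "Adding" else deleted
        target.add((namespace, policy))
    return added, deleted


def logged_deleted_but_present(
    log_text: str, nb_policies: Iterable[PolicyKey]
) -> List[PolicyKey]:
    """Policies the master logged as deleted that still have ACLs in the NB db."""
    _, deleted = policy_events(log_text)
    return sorted(deleted & set(nb_policies))
```

A non-empty result points at deletes that were started on the ovnkube-master side but never took effect in the NB db, which is the symptom reported here.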
This appears not to affect 4.6, so will only need a backport to 4.5.
OK, so after talking with some other team members, I think we've isolated where the race is: a namespace is deleted before all of its networking entities are deleted in https://github.com/openshift/ovn-kubernetes/blob/release-4.5/go-controller/pkg/ovn/policy.go#L1114, resulting in OVN not cleaning up any ACLs that remain. @Anurag, if you could recreate this one more time and share the cluster, I think we will be able to confirm the exact issue.
I was finally able to force-reproduce this race on upstream OVN-Kubernetes by adding a pause in the NetworkPolicy delete code. This convinces me that the race is not isolated to OCP OVN-Kubernetes 4.5 but affects master as well. The linked PR tracks my work to fix this bug.
I created the clones for 4.5 and 4.6 just in case. The fix needs to be tested again on the backported releases. Thanks. Correct me if I am mistaken.
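For readers unfamiliar with this kind of ordering bug, here is a toy model of the race as described in the last two comments. It is not the ovn-kubernetes code: `ToyController`, `simulate`, and the "skip cleanup when the namespace state is gone" behaviour are all assumptions made purely for illustration. The model is deterministic, so the point where the namespace delete overtakes the policy deletes is passed in as a parameter.

```python
from typing import Dict, List, Set, Tuple


class ToyController:
    """Illustrative stand-in for a controller; NOT the real ovn-kubernetes logic."""

    def __init__(self) -> None:
        self.namespaces: Dict[str, Set[str]] = {}   # in-memory namespace state
        self.nb_acls: Set[Tuple[str, str]] = set()  # stand-in for ACL rows in NB

    def add_policy(self, namespace: str, policy: str) -> None:
        self.namespaces.setdefault(namespace, set()).add(policy)
        self.nb_acls.add((namespace, policy))

    def delete_namespace(self, namespace: str) -> None:
        # Drops the bookkeeping for the namespace without touching the ACLs.
        self.namespaces.pop(namespace, None)

    def delete_policy(self, namespace: str, policy: str) -> None:
        state = self.namespaces.get(namespace)
        if state is None:
            # Assumed failure mode: namespace state is already gone, so the
            # handler returns early and the ACL is never removed.
            return
        state.discard(policy)
        self.nb_acls.discard((namespace, policy))


def simulate(
    namespace: str, policies: List[str], handled_before_namespace: int
) -> List[Tuple[str, str]]:
    """Delete `handled_before_namespace` policies, then the namespace, then the rest."""
    ctl = ToyController()
    for policy in policies:
        ctl.add_policy(namespace, policy)
    for policy in policies[:handled_before_namespace]:
        ctl.delete_policy(namespace, policy)
    ctl.delete_namespace(namespace)
    for policy in policies[handled_before_namespace:]:
        ctl.delete_policy(namespace, policy)
    return sorted(ctl.nb_acls)  # ACLs left behind
```

With ten policies, `simulate(ns, policies, 10)` leaves nothing behind, while `simulate(ns, policies, 6)` leaves the last four ACLs orphaned, which has the same shape as the 4/10 leftovers in the original report. In this model, a pause in the policy delete path simply makes the second ordering more likely.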
*** Bug 1877560 has been marked as a duplicate of this bug. ***
The fix has been merged upstream as of 9/23; just waiting for the downstream rebase for the backport.
@Anurag, do you mind verifying this fix so I can backport it?
Sure, Andrew. I will try to verify this by EOD
Thanks, Andrew. The test (mentioned in the bug description) looks good on 4.6.0-0.nightly-2020-09-24-111253. Verifying this one.
Thanks!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196