Bug 2095852 - Unable to create Network Policies: error: unexpectedly found multiple equivalent ACLs (arp v/s arp||nd) (ns_netpol1 v/s ns_netpol2)
Summary: Unable to create Network Policies: error: unexpectedly found multiple equival...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.12.0
Assignee: Surya Seetharaman
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-06-10 18:47 UTC by Nate Childers
Modified: 2023-01-17 19:50 UTC
6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: In order to support allowing ARP traffic through the default ACLs that are created for network policies, we changed the match on the ACL from "arp" to "arp || nd" in this change: https://github.com/openshift/ovn-kubernetes/pull/1043/files. Instead of updating the existing "arp" match ACLs to "arp || nd", this change started creating a new default ARP ACL for every network policy namespace in addition to the existing ones that were already present in the cluster. Consequence: That produced errors like "unexpectedly found multiple equivalent ACLs (arp v/s arp||nd)" in ovnkube-master, which prevented network policies from being created properly. Fix: This fix removes the older ACLs with just the "arp" match so that only the ones with the new match "arp || nd" remain. Result: Network policies can be created correctly and no errors will be observed on ovnkube-master. NOTE: This affects customers upgrading into 4.8.14, 4.9.32, 4.10.13 or higher from older versions.
Clone Of:
: 2123121 (view as bug list)
Environment:
Last Closed: 2023-01-17 19:49:58 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
selected items from must-gather.local.6813085809915985852 (4.64 MB, application/zip)
2022-06-13 11:55 UTC, Nate Childers
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github openshift ovn-kubernetes pull 1210 0 None Merged OCPBUGSM-45393: Bug 2078691: [Downstream Merge] 22-07-2022 2023-01-09 17:53:33 UTC
Github ovn-org ovn-kubernetes pull 3038 0 None Merged Cleanup stale acls as part of syncNetworkPolicies on startup 2023-01-09 17:53:34 UTC
Github ovn-org ovn-kubernetes pull 3076 0 None Merged syncNetworkPolicies: Remove ACLs from PGs before deleting 2023-01-09 17:53:36 UTC
Red Hat Product Errata RHSA-2022:7399 0 None None None 2023-01-17 19:50:17 UTC

Description Nate Childers 2022-06-10 18:47:08 UTC
I saw the following while trying to debug the "unexpectedly found multiple equivalent ACLs" error.

Add a generic networkpolicy:

kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: allow-same-namespace
  namespace: nbc9-demo-project
spec:
  podSelector: {}
  ingress:
    - from:
        - podSelector: {}
  policyTypes:
    - Ingress


$ kubectl get pod ovnkube-master-pk89w -o jsonpath='{range .spec.containers[]}{@.image}'
quay.io/openshift/okd-content@sha256:79ee71e045a7b224a132f6c75b4220ec35b9a06049061a6bd9ca9fc976c412e5

[root@dev-nkjpp-master-2 ~]# ovnkube -v
I0609 17:33:34.930787      58 ovs.go:93] Maximum command line arguments set to: 191102
Version: 0.3.0
Git commit: 7bf36eea28fe66365d0dfdf8c39e3311ea14d19b
Git branch: release-4.10
Go version: go1.16.6
Build date: 2022-05-27
OS/Arch: linux amd64

The policy then fails to apply and is retried, and when the networkpolicy is deleted, the ovnkube-master pod segfaults:

I0609 17:00:26.653710       1 policy.go:1092] Adding network policy allow-same-namespace in namespace nbc9-demo-project
E0609 17:00:26.656858       1 ovn.go:753] Failed to create network policy nbc9-demo-project/allow-same-namespace, error: failed to create default port groups and acls for policy: nbc9-demo-project/allow-same-namespace, error: unexpectedly found multiple equivalent ACLs: [{UUID:7b55ba0c-150f-4a63-9601-cfde25f29408 Action:drop Direction:from-lport ExternalIDs:map[default-deny-policy-type:Egress] Label:0 Log:false Match:inport == @a7830797310894963783_egressDefaultDeny Meter:0xc0010df310 Name:0xc0010df320 Options:map[apply-after-lb:true] Priority:1000 Severity:0xc0010df330} {UUID:60cb946a-46e9-4623-9ba4-3cb35f018ed6 Action:drop Direction:from-lport ExternalIDs:map[default-deny-policy-type:Egress] Label:0 Log:false Match:inport == @a7830797310894963783_egressDefaultDeny Meter:0xc0010df390 Name:0xc0010df3d0 Options:map[apply-after-lb:true] Priority:1000 Severity:0xc0010df3e0}]
I0609 17:00:51.437895       1 policy_retry.go:46] Network Policy Retry: nbc9-demo-project/allow-same-namespace retry network policy setup
I0609 17:00:51.437935       1 policy_retry.go:63] Network Policy Retry: Creating new policy for nbc9-demo-project/allow-same-namespace
I0609 17:00:51.437941       1 policy.go:1092] Adding network policy allow-same-namespace in namespace nbc9-demo-project
I0609 17:00:51.438174       1 policy_retry.go:65] Network Policy Retry create failed for nbc9-demo-project/allow-same-namespace, will try again later: failed to create default port groups and acls for policy: nbc9-demo-project/allow-same-namespace, error: unexpectedly found multiple equivalent ACLs: [{UUID:60cb946a-46e9-4623-9ba4-3cb35f018ed6 Action:drop Direction:from-lport ExternalIDs:map[default-deny-policy-type:Egress] Label:0 Log:false Match:inport == @a7830797310894963783_egressDefaultDeny Meter:0xc002215e00 Name:0xc002215e70 Options:map[apply-after-lb:true] Priority:1000 Severity:0xc002215e80} {UUID:7b55ba0c-150f-4a63-9601-cfde25f29408 Action:drop Direction:from-lport ExternalIDs:map[default-deny-policy-type:Egress] Label:0 Log:false Match:inport == @a7830797310894963783_egressDefaultDeny Meter:0xc0022b0310 Name:0xc0022b03a0 Options:map[apply-after-lb:true] Priority:1000 Severity:0xc000070ab0}]
I0609 17:01:02.679219       1 policy.go:1174] Deleting network policy allow-same-namespace in namespace nbc9-demo-project


E0609 17:01:02.679407       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 249 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1c19c80, 0x2e9a810)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x95
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x86
panic(0x1c19c80, 0x2e9a810)
	/usr/lib/golang/src/runtime/panic.go:965 +0x1b9
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).destroyNetworkPolicy(0xc0022c2000, 0x0, 0xc000bb9000, 0x0, 0x0)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/policy.go:1210 +0x55
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).deleteNetworkPolicy(0xc0022c2000, 0xc002544f00, 0x0, 0x0, 0x0)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/policy.go:1198 +0x43f
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).WatchNetworkPolicy.func4(0x1e7e840, 0xc002544f00)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/ovn.go:800 +0xae
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete(...)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:245
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnDelete(0xc000f4c4c0, 0x2160f10, 0xc002f498c0, 0x1e7e840, 0xc002544f00)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:288 +0x6a
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*Handler).OnDelete(...)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:52
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).newFederatedHandler.func3.1(0xc00463dbf0)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:340 +0x65
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).forEachHandler(0xc0002c61b0, 0x1e7e840, 0xc002544f00, 0xc003dc9d60)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:114 +0x156
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).newFederatedHandler.func3(0x1e7e840, 0xc002544f00)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:339 +0x1b2
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete(...)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:245
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:779 +0x166
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc002367760)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc003dc9f60, 0x2127a00, 0xc000229a70, 0x1bd5d01, 0xc000039740)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0x9b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc002367760, 0x3b9aca00, 0x0, 0x1, 0xc000039740)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run(0xc0004f3180)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:771 +0x95
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1(0xc0002bed80, 0xc000ed5850)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:73 +0x51
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71 +0x65
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1a021d5]

goroutine 249 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x109
panic(0x1c19c80, 0x2e9a810)
	/usr/lib/golang/src/runtime/panic.go:965 +0x1b9
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).destroyNetworkPolicy(0xc0022c2000, 0x0, 0xc000bb9000, 0x0, 0x0)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/policy.go:1210 +0x55
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).deleteNetworkPolicy(0xc0022c2000, 0xc002544f00, 0x0, 0x0, 0x0)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/policy.go:1198 +0x43f
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).WatchNetworkPolicy.func4(0x1e7e840, 0xc002544f00)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/ovn.go:800 +0xae
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete(...)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:245
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnDelete(0xc000f4c4c0, 0x2160f10, 0xc002f498c0, 0x1e7e840, 0xc002544f00)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:288 +0x6a
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*Handler).OnDelete(...)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:52
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).newFederatedHandler.func3.1(0xc00463dbf0)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:340 +0x65
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).forEachHandler(0xc0002c61b0, 0x1e7e840, 0xc002544f00, 0xc003dc9d60)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:114 +0x156
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).newFederatedHandler.func3(0x1e7e840, 0xc002544f00)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:339 +0x1b2
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete(...)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:245
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:779 +0x166
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc002367760)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc003dc9f60, 0x2127a00, 0xc000229a70, 0x1bd5d01, 0xc000039740)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0x9b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc002367760, 0x3b9aca00, 0x0, 0x1, 0xc000039740)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run(0xc0004f3180)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:771 +0x95
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1(0xc0002bed80, 0xc000ed5850)
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:73 +0x51
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
	/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71 +0x65

Please let me know if any further information is required. I have a must-gather for this cluster but the file attachment tool in bugzilla won't let me attach anything larger than 19.5MB (the must-gather is 212.1MB)

Comment 1 Surya Seetharaman 2022-06-10 19:25:10 UTC
This is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2091238#c1 and a fix is being worked on.

Comment 2 Surya Seetharaman 2022-06-10 19:35:08 UTC
Hi Nate,

thanks for the bz. Could you tell me the exact reproduction steps, and also how you ended up with

error: unexpectedly found multiple equivalent ACLs: [{UUID:60cb946a-46e9-4623-9ba4-3cb35f018ed6 Action:drop Direction:from-lport ExternalIDs:map[default-deny-policy-type:Egress] Label:0 Log:false Match:inport == @a7830797310894963783_egressDefaultDeny Meter:0xc002215e00 Name:0xc002215e70 Options:map[apply-after-lb:true] Priority:1000 Severity:0xc002215e80} {UUID:7b55ba0c-150f-4a63-9601-cfde25f29408 Action:drop Direction:from-lport ExternalIDs:map[default-deny-policy-type:Egress] Label:0 Log:false Match:inport == @a7830797310894963783_egressDefaultDeny Meter:0xc0022b0310 Name:0xc0022b03a0 Options:map[apply-after-lb:true] Priority:1000 Severity:0xc000070ab0}]
?

I know the panic here is a problem, of course, but I'm also trying to understand the multiple ACLs error. Which exact 4.10 version of OCP is this?

Comment 3 Nate Childers 2022-06-10 19:57:50 UTC
Hi Surya,

the multiple ACLs error happens when I create the network policy using the policy YAML I put at the top of the description. That's all. Create the policy with that YAML, see the failure in the logs, delete the policy, OVN crashes. It's very strange.

This is OKD 4.10, not OCP.

Your good colleagues on github sent me here:
https://github.com/openshift/okd/issues/1257
https://github.com/ovn-org/ovn-kubernetes/issues/3031

Comment 4 Surya Seetharaman 2022-06-10 20:03:26 UTC
Thanks Nate!

In your case I know why the panic is happening, I will post a PR to fix this.
Let me see if I can reproduce the ACL issue you are seeing, because when I tried to use your YAML on my cluster, creation worked fine:

I0610 19:29:43.867352      53 policy.go:715] Processing NetworkPolicy nbc9-demo-project/allow-same-namespace to have 0 local pods...
I0610 19:29:43.868605      53 model_client.go:344] Create operations generated as: [{Op:insert Table:Port_Group Row:map[external_ids:{GoMap:map[name:nbc9-demo-project_allow-same-namespace]} name:a17939553712128537144] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:u2596996267}]
I0610 19:29:43.868740      53 transact.go:41] Configuring OVN: [{Op:insert Table:Port_Group Row:map[external_ids:{GoMap:map[name:nbc9-demo-project_allow-same-namespace]} name:a17939553712128537144] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:u2596996267}]
I0610 19:29:43.868873      53 client.go:781]  "msg"="transacting operations"  "database"="OVN_Northbound" "operations"="[{Op:insert Table:Port_Group Row:map[external_ids:{GoMap:map[name:nbc9-demo-project_allow-same-namespace]} name:a17939553712128537144] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:u2596996267}]"
I0610 19:29:43.869697      53 cache.go:668] cache "msg"="inserting row" "database"="OVN_Northbound" "table"="Port_Group" "uuid"="9b7f55fc-af82-48b0-a2e9-f6a5c364bcb4" "model"="&{UUID:9b7f55fc-af82-48b0-a2e9-f6a5c364bcb4 ACLs:[] ExternalIDs:map[name:nbc9-demo-project_allow-same-namespace] Name:a17939553712128537144 Ports:[]}"
I0610 19:29:43.871151      53 policy.go:380] ACL for network policy: allow-same-namespace, updated to new log level: 
I0610 19:29:43.871206      53 obj_retry.go:1363] Creating *v1.NetworkPolicy nbc9-demo-project/allow-same-namespace took: 11.452211ms



To me it seems like there is some residue of stale ACLs in your cluster or something. But let me close this bug as a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=2091238 and work on the fix for the nil pointer panic. Is that ok for you?

Comment 5 Surya Seetharaman 2022-06-10 20:09:55 UTC
Or actually, on second thought, the other bug shows the same symptoms:

2022-06-01T08:17:45.418763259Z I0601 08:17:45.418759       1 policy.go:1092] Adding network policy allow-from-same-namespace in namespace xxxxx
2022-06-01T08:17:45.419281334Z E0601 08:17:45.419267       1 ovn.go:753] Failed to create network policy xxxxx/allow-from-same-namespace, error: failed to create default port groups and acls for policy: xxxxx/allow-from-same-namespace, error: unexpectedly found multiple equivalent ACLs: [{UUID:801a19c3-0464-451d-b9d3-373e7503f6ba Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a11263692294733561997_ingressDefaultDeny && arp Meter:0xc0025e1000 Name:0xc0025e1010 Options:map[] Priority:1001 Severity:0xc0025e1020} {UUID:2b2f0849-0075-4046-b4cf-2dec306c9550 Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a11263692294733561997_ingressDefaultDeny && (arp || nd) Meter:0xc0025e1070 Name:0xc0025e1080 Options:map[] Priority:1001 Severity:0xc0025e1090}]
2022-06-01T08:17:45.419289685Z I0601 08:17:45.419279       1 policy.go:1092] Adding network policy allow-from-other-namespaces in namespace xxxxx
2022-06-01T08:17:45.419750650Z E0601 08:17:45.419739       1 ovn.go:753] Failed to create network policy xxxxx/allow-from-other-namespaces, error: failed to create default port groups and acls for policy: xxxxx/allow-from-other-namespaces, error: unexpectedly found multiple equivalent ACLs: [{UUID:14c169a9-069f-45cd-86d3-3d2038340105 Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a4969899092748592978_ingressDefaultDeny && (arp || nd) Meter:0xc0025e16f0 Name:0xc0025e1700 Options:map[] Priority:1001 Severity:0xc0025e1710} {UUID:00202ce5-2851-426a-ba46-16d4f1b06c41 Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a4969899092748592978_ingressDefaultDeny && arp Meter:0xc0025e1750 Name:0xc0025e1760 Options:map[] Priority:1001 Severity:0xc0025e1770}]
2022-06-01T08:17:45.419759490Z I0601 08:17:45.419748       1 policy.go:1092] Adding network policy allow-from-same-namespace in namespace xxxxx
2022-06-01T08:17:45.422286498Z E0601 08:17:45.420247       1 ovn.go:753] Failed to create network policy xxxxx/allow-from-same-namespace, error: failed to create default port groups and acls for policy: xxxxx/allow-from-same-namespace, error: unexpectedly found multiple equivalent ACLs: [{UUID:c10e465c-4241-4d8c-8f78-7ceb6e38aeaa Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a2533190148303767946_ingressDefaultDeny && arp Meter:0xc0025e1e00 Name:0xc0025e1e10 Options:map[] Priority:1001 Severity:0xc0025e1e20} {UUID:626d3785-8bce-4198-b322-ae747a8c95fb Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a2533190148303767946_ingressDefaultDeny && (arp || nd) Meter:0xc0025e1ed0 Name:0xc0025e1ee0 Options:map[] Priority:1001 Severity:0xc0025e1ef0}]
2022-06-01T08:17:45.422286498Z I0601 08:17:45.420261       1 ovn.go:808] Bootstrapping existing policies and cleaning stale policies took 1.049654614s
2022-06-01T08:17:45.422286498Z I0601 08:17:45.420352       1 policy.go:1174] Deleting network policy allow-from-same-namespace in namespace xxxxx
2022-06-01T08:17:45.422286498Z E0601 08:17:45.420411       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)

Since creation never worked, we ended up passing a nil pointer during deletion, which causes the panic. I'll do a fix for that as part of https://bugzilla.redhat.com/show_bug.cgi?id=2091238
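
To make the failure pattern concrete, here is a minimal, self-contained Go sketch of what happens (using simplified, hypothetical types rather than the real ones in pkg/ovn/policy.go): creation fails, so nothing is cached for the policy, and the deletion path then hands a nil policy object to the teardown helper, which dereferences it.

package main

import "fmt"

// Simplified, hypothetical stand-ins for the controller's internal types;
// the real structures live in pkg/ovn/policy.go and are more involved.
type networkPolicy struct {
	portGroupUUID string
}

type controller struct {
	// cache of successfully created policies, keyed by namespace/name
	policies map[string]*networkPolicy
}

// destroyNetworkPolicy dereferences np without a nil check, mirroring the
// crash at policy.go:1210 when the policy was never created successfully.
func (c *controller) destroyNetworkPolicy(np *networkPolicy) {
	fmt.Println("deleting port group", np.portGroupUUID) // panics if np == nil
}

func (c *controller) deleteNetworkPolicy(key string) {
	np := c.policies[key] // nil here: creation failed, so nothing was cached
	c.destroyNetworkPolicy(np)
}

func main() {
	c := &controller{policies: map[string]*networkPolicy{}}
	// ACL creation failed earlier, so the policy was never stored; deleting
	// it now panics with a nil pointer dereference.
	c.deleteNetworkPolicy("nbc9-demo-project/allow-same-namespace")
}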

I will keep this bug open to track the "unexpectedly found multiple equivalent ACLs" error.

Comment 6 Surya Seetharaman 2022-06-10 21:02:11 UTC
Hmm, it seems we definitely introduced this with: https://github.com/openshift/ovn-kubernetes/pull/1043/files

[root@bc104a3c4732 ~]# ovn-nbctl list acl | grep @a14044754821019150557_ingressDefaultDeny -C 5
action              : allow
direction           : to-lport
external_ids        : {default-deny-policy-type=Ingress}
label               : 0
log                 : false
match               : "outport == @a14044754821019150557_ingressDefaultDeny && (arp || nd)"
meter               : acl-logging
name                : elk_ARPallowPolicy
options             : {}
priority            : 1001
severity            : info
--
action              : allow
direction           : to-lport
external_ids        : {default-deny-policy-type=Ingress}
label               : 0
log                 : false
match               : "outport == @a14044754821019150557_ingressDefaultDeny && arp"
meter               : acl-logging
name                : elk_ARPallowPolicy
options             : {}
priority            : 1001
severity            : info

Comment 7 Surya Seetharaman 2022-06-10 21:30:24 UTC
> Please let me know if any further information is required. I have a must-gather for this cluster but the file attachment tool in bugzilla won't let me attach anything larger than 19.5MB (the must-gather is 212.1MB)

Could you please attach the following from the must-gather:

must-gather.local.6240950779494572355/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-6066981e2b360c670ce2b872db3ae6cacb13bf9cb3b80c5235955ea5de29be14/network_logs/leader_nbdb.gz

and

must-gather.local.6240950779494572355/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-6066981e2b360c670ce2b872db3ae6cacb13bf9cb3b80c5235955ea5de29be14/network_logs/leader_sbdb.gz

and

all logs inside:

must-gather.local.6240950779494572355/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-6066981e2b360c670ce2b872db3ae6cacb13bf9cb3b80c5235955ea5de29be14/namespaces/openshift-ovn-kubernetes/pods

These shouldn't be too large for the attachment limit, and attaching them individually will help.

Comment 8 Nate Childers 2022-06-13 11:55:46 UTC
Created attachment 1889401 [details]
selected items from must-gather.local.6813085809915985852

Thank you very much for looking into this. Please find attached:

.../network_logs/leader_sbdb.gz
.../network_logs/leader_nbdb.gz
and
.../namespaces/openshift-ovn-kubernetes/.../logs/*.log

Let me know if I missed anything or if you need anything else. I'll poke around some today and see if I can find a place to put the full must-gather. Worst case I can put it on our Linux mirror. :)

Comment 9 Nate Childers 2022-06-13 12:34:37 UTC
OK, I found a place to put the full must-gather in case I missed anything in the slimmed-down one: https://nbc9-snips.cloud.duke.edu/snips/must-gather.local.6813085809915985852.zip

Comment 10 Surya Seetharaman 2022-06-13 19:14:07 UTC
Thanks Nate! Downloaded it! 

            Projects: 75
             API URL: ['https://api.dev.okd4.fitz.cloud.duke.edu:6443']
            Platform: ['VSphere']
          Cluster ID: ['98250500-3303-4c96-9873-ab4e969da463']
     Desired Version: ['4.10.0-0.okd-2022-05-28-062148']


Looking at why the ACL creation failed in your case:

_uuid               : b9349cb8-99b0-4130-92cf-ab11b4874a11
action              : drop
direction           : from-lport
external_ids        : {default-deny-policy-type=Egress}
label               : 0
log                 : false
match               : "inport == @a12933912868060780448_egressDefaultDeny"
meter               : acl-logging
name                : oit-ssi-fluentd_allow-same-namespace
options             : {apply-after-lb="true"}
priority            : 1000
severity            : info

--
_uuid               : 5b98f17d-789f-4de1-9beb-36741bfa40d1
action              : drop
direction           : from-lport
external_ids        : {default-deny-policy-type=Egress}
label               : 0
log                 : false
match               : "inport == @a12933912868060780448_egressDefaultDeny"
meter               : acl-logging
name                : oit-ssi-fluentd_allow-from-openshift-ingress
options             : {apply-after-lb="true"}
priority            : 1000
severity            : info

The OVNK code thinks these two are the same ACL, so we error out.

  name: allow-from-openshift-ingress
  namespace: oit-ssi-fluentd
  resourceVersion: "4266701"
  uid: 8b86da03-db3e-4385-ace9-75df5564d1bd
spec:
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          policy-group.network.openshift.io/ingress: ""
  podSelector: {}
  policyTypes:
  - Ingress 

versus

  name: allow-same-namespace
  namespace: oit-ssi-fluentd
  resourceVersion: "4267434"
  uid: fb8e716c-3c94-447e-b8e7-cc249d68c410
spec:
  ingress:
  - from:
    - podSelector: {}
  podSelector: {}
  policyTypes:
  - Ingress

I feel that's a bug on our side. I will look at:

// IsEquivalentACL if it has same uuid, or if it has same name
// and external ids, or if it has same priority, direction, match
// and action.
func IsEquivalentACL(existing *nbdb.ACL, searched *nbdb.ACL) bool {
	if searched.UUID != "" && existing.UUID == searched.UUID {
		return true
	}

	eName := getACLName(existing)
	sName := getACLName(searched)
	// TODO if we want to support adding/removing external ids,
	// we need to compare them differently, perhaps just the common subset
	if eName != "" && eName == sName && reflect.DeepEqual(existing.ExternalIDs, searched.ExternalIDs) {
		return true
	}
	return existing.Priority == searched.Priority &&
		existing.Direction == searched.Direction &&
		existing.Match == searched.Match &&
		existing.Action == searched.Action
}

and talk to the team about the logic around "matching ACLs".
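
To make that matching logic concrete, here is a small self-contained sketch (using a trimmed-down ACL struct for illustration rather than the generated nbdb model) that runs the two Egress default-deny ACLs from your DB through the same fallthrough condition: because priority, direction, match and action are all identical, they are reported as equivalent even though their names differ.

package main

import (
	"fmt"
	"reflect"
)

// Trimmed-down stand-in for the generated nbdb.ACL model used above.
type ACL struct {
	UUID        string
	Name        string
	ExternalIDs map[string]string
	Priority    int
	Direction   string
	Match       string
	Action      string
}

// Same decision logic as IsEquivalentACL, adapted to the trimmed struct.
func isEquivalentACL(existing, searched ACL) bool {
	if searched.UUID != "" && existing.UUID == searched.UUID {
		return true
	}
	if existing.Name != "" && existing.Name == searched.Name &&
		reflect.DeepEqual(existing.ExternalIDs, searched.ExternalIDs) {
		return true
	}
	return existing.Priority == searched.Priority &&
		existing.Direction == searched.Direction &&
		existing.Match == searched.Match &&
		existing.Action == searched.Action
}

func main() {
	// The two Egress default-deny ACLs from the must-gather: different UUIDs
	// and names, but identical priority, direction, match and action.
	a := ACL{
		UUID:        "b9349cb8-99b0-4130-92cf-ab11b4874a11",
		Name:        "oit-ssi-fluentd_allow-same-namespace",
		ExternalIDs: map[string]string{"default-deny-policy-type": "Egress"},
		Priority:    1000,
		Direction:   "from-lport",
		Match:       "inport == @a12933912868060780448_egressDefaultDeny",
		Action:      "drop",
	}
	b := a
	b.UUID = "5b98f17d-789f-4de1-9beb-36741bfa40d1"
	b.Name = "oit-ssi-fluentd_allow-from-openshift-ingress"

	// Prints true: the lookup finds two "equivalent" ACLs and errors out.
	fmt.Println(isEquivalentACL(a, b))
}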

Comment 11 Nate Childers 2022-06-13 19:30:38 UTC
I appreciate your expert analysis! Hopefully fixing that bug will help some other poor soul who finds themselves in this weird edge case with me. My understanding from the documentation and examples was that the "allow ingress" rule and "allow same namespace" rule would be needed in the presence of a "default deny" rule. I'm a little surprised that this specific configuration isn't more common, so I'm assuming I've misunderstood something.

Please let me know if you need anything else from me. It does sound like in this scenario I should be able to just clear out the network policies but ... given the above apparent misunderstanding I'm not quite sure how to structure the NetworkPolicy objects such that:
- Ingress related traffic allowed
- same namespace traffic allowed
- everything else not allowed.

If I've misunderstood, I'll happily accept a pointer to some documentation. I was previously referring to
https://docs.okd.io/4.10/networking/network_policy/about-network-policy.html

Comment 12 Surya Seetharaman 2022-06-14 17:33:40 UTC
This bug will fix two things: https://github.com/ovn-org/ovn-kubernetes/pull/3034#issuecomment-1155488213.

The actual panic will be fixed via https://bugzilla.redhat.com/show_bug.cgi?id=2091238

Comment 13 Surya Seetharaman 2022-06-14 19:48:46 UTC
Hey Nate:

So I spoke to my team about this and I actually see on the 4.10 code:

	if len(nsInfo.networkPolicies) == 0 {
		err = oc.createDefaultDenyPGAndACLs(policy.Namespace, policy.Name, nsInfo)
		if err != nil {
			nsUnlock()
			return fmt.Errorf("failed to create default port groups and acls for policy: %s/%s, error: %v",
				policy.Namespace, policy.Name, err)
		}
	}

So in the end we only create ONE default-deny ACL for ingress and ONE for egress, even if we have more than one network policy per namespace... I am not sure how you ended up in this situation.

I took all your policies from the must-gather and created them on a 4.10 OKD cluster, and it works fine for me! As in, I do not get TWO egressDefaultDeny ACLs like your cluster did. You somehow seem to have gotten:


_uuid               : b9349cb8-99b0-4130-92cf-ab11b4874a11
action              : drop
direction           : from-lport
external_ids        : {default-deny-policy-type=Egress}
label               : 0
log                 : false
match               : "inport == @a12933912868060780448_egressDefaultDeny"
meter               : acl-logging
name                : oit-ssi-fluentd_allow-same-namespace
options             : {apply-after-lb="true"}
priority            : 1000
severity            : info

AND

--
_uuid               : 5b98f17d-789f-4de1-9beb-36741bfa40d1
action              : drop
direction           : from-lport
external_ids        : {default-deny-policy-type=Egress}
label               : 0
log                 : false
match               : "inport == @a12933912868060780448_egressDefaultDeny"
meter               : acl-logging
name                : oit-ssi-fluentd_allow-from-openshift-ingress
options             : {apply-after-lb="true"}
priority            : 1000
severity            : info

in your DB. I can't understand from the logs how that happened. The must-gather you attached is missing logs from the startup of the pods, probably because of the panics and restarts, so I can't really figure out exactly when the allow-from-openshift-ingress and allow-same-namespace ACLs were created for the first time before you ended up with duplicate ACLs... But basically what I am trying to say is we never should have hit that problem to begin with.

Here is what I did on my cluster. I created the oit-ssi-fluentd namespace and then created the allow-from-openshift-ingress.yaml, allow-from-openshift-monitoring.yaml, allow-same-namespace.yaml and fluentd-input.yaml network policies. It works perfectly fine for me. I only have one ACL:

_uuid               : f09c94a7-6dda-4327-9903-39162c439644                                                                                                                   
--                                                                                                                                                                           
external_ids        : {default-deny-policy-type=Egress}                                                                                                                      
label               : 0                                                                                                                                                      
log                 : false                                                                                                                                                  
match               : "inport == @a12933912868060780448_egressDefaultDeny"                                                                                                   
meter               : acl-logging                                                                                                                                            
name                : oit-ssi-fluentd_allow-same-namespace                                                                                                                   
options             : {apply-after-lb="true"}                                                                                                                                
priority            : 1000                                                                                                                                                   
severity            : info  

So the problem is that you aren't able to clear out the policies because of the other bug about the panic on deletion: https://bugzilla.redhat.com/show_bug.cgi?id=2091238. One thing you could do is clean up the OVN DBs, i.e. ssh into the master nodes, rm -rf the OVN DBs and then restart the ovnk masters to clear this up, or reinstall OKD fresh and see. Because as per the code, we never should have gotten the two ACLs you have. Not sure if you have more container logs on the master nodes we could look at to see how we reached this state with two ACLs (there must have been a successful create in one of the policies before ovnk master restarted).

Comment 14 Nate Childers 2022-06-15 11:43:04 UTC
Thank you again for the masterful analysis. We really do appreciate it. That scenario does make sense to me, as I know we had policies working previously (the fluentd aggregator mentioned in this ACL has been working for months). However, this cluster has been updated several times, so there's some chance this was introduced during that process. There was quite a bit of rockiness in the 4.9 era.

I had considered clearing out the OVN DB but wanted to capture this bug report and the issue on GitHub first, so that we could make sure it wasn't a deeper flaw. It sounds like you and your team are confident that this shouldn't be an issue going forward, which is also reassuring.

Thanks again!

Comment 15 Surya Seetharaman 2022-06-15 11:59:15 UTC
Yes, thanks Nate! Meanwhile I have done a proactive fix in case we do run into these issues again: https://github.com/ovn-org/ovn-kubernetes/pull/3038, where we fix these glitches on restart.
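
Roughly, the idea of that startup cleanup is the following; this is only an illustrative, self-contained sketch with a simplified ACL type and placeholder UUIDs, not the actual libovsdb-based implementation in the PR: find the ARP-allow ACLs that still carry the old "&& arp" match and whose "(arp || nd)" replacement already exists, and remove those stale duplicates.

package main

import (
	"fmt"
	"strings"
)

// Simplified stand-in for an NB DB ACL row; UUIDs below are placeholders.
type ACL struct {
	UUID  string
	Name  string
	Match string
}

const (
	oldARPMatchSuffix = " && arp"
	newARPMatchSuffix = " && (arp || nd)"
)

// staleARPAllowACLs returns the ACLs that still use the pre-PR-1043 "arp"
// match and whose updated "(arp || nd)" counterpart already exists, i.e. the
// duplicates that trigger the "multiple equivalent ACLs" error.
func staleARPAllowACLs(acls []ACL) []ACL {
	updated := map[string]bool{}
	for _, acl := range acls {
		if strings.HasSuffix(acl.Match, newARPMatchSuffix) {
			updated[strings.TrimSuffix(acl.Match, newARPMatchSuffix)] = true
		}
	}
	var stale []ACL
	for _, acl := range acls {
		if strings.HasSuffix(acl.Match, oldARPMatchSuffix) &&
			updated[strings.TrimSuffix(acl.Match, oldARPMatchSuffix)] {
			stale = append(stale, acl)
		}
	}
	return stale
}

func main() {
	acls := []ACL{
		{UUID: "uuid-new", Name: "elk_ARPallowPolicy",
			Match: "outport == @a14044754821019150557_ingressDefaultDeny && (arp || nd)"},
		{UUID: "uuid-old", Name: "elk_ARPallowPolicy",
			Match: "outport == @a14044754821019150557_ingressDefaultDeny && arp"},
	}
	for _, acl := range staleARPAllowACLs(acls) {
		// The real cleanup also removes the ACL from its port group before deleting it.
		fmt.Println("would delete stale ACL", acl.UUID)
	}
}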

Comment 16 Andreas Karis 2022-06-27 12:24:46 UTC
*** Bug 2101366 has been marked as a duplicate of this bug. ***

Comment 21 Surya Seetharaman 2022-07-22 18:05:16 UTC
Steps to verify for QE:

1) Bring up a cluster in 4.9.31 or older
2) Create a sample network policy
3) Check that an ACL like this:

_uuid               : 8907909e-ff7f-48d0-becd-28fc995a7f60                                                                                                                   
action              : allow                                                                                                                                                  
direction           : from-lport                                                                                                                                             
external_ids        : {default-deny-policy-type=Egress}                                                                                                                      
label               : 0                                                                                                                                                      
log                 : false                                                                                                                                                  
match               : "inport == @a16323395479447859119_egressDefaultDeny && arp"                                                                                            
meter               : acl-logging                                                                                                                                            
name                : surya_ARPallowPolicy                                                                                                                                   
options             : {apply-after-lb="true"}                                                                                                                                
priority            : 1001                                                                                                                                                   
severity            : info 

exists by running the ovn-nbctl list acl command
4) Upgrade to 4.9.32 or just swap out images for ovnk after scaling down the CVO (but really this is an upgrade bug).
Now repeat step 3 and you should see:
_uuid               : 91392d7e-d2db-4530-a632-4db0f7cb6edf
action              : allow
direction           : from-lport
external_ids        : {default-deny-policy-type=Egress}
label               : 0
log                 : false
match               : "inport == @a16323395479447859119_egressDefaultDeny && (arp || nd)"
meter               : acl-logging
name                : surya_ARPallowPolicy
options             : {apply-after-lb="true"}
priority            : 1001
severity            : info

_uuid               : 8907909e-ff7f-48d0-becd-28fc995a7f60                                                                                                                   
action              : allow                                                                                                                                                  
direction           : from-lport                                                                                                                                             
external_ids        : {default-deny-policy-type=Egress}                                                                                                                      
label               : 0                                                                                                                                                      
log                 : false                                                                                                                                                  
match               : "inport == @a16323395479447859119_egressDefaultDeny && arp"                                                                                            
meter               : acl-logging                                                                                                                                            
name                : surya_ARPallowPolicy                                                                                                                                   
options             : {apply-after-lb="true"}                                                                                                                                
priority            : 1001                                                                                                                                                   
severity            : info 

5) Now upgrade to or swap images to 4.10.13 or something in 4.10.z, and you will see errors like:
I0721 13:45:34.493941       1 policy_retry.go:65] Network Policy Retry create failed for surya/default-deny, will try again later: failed to create default port groups and acls for policy: surya/default-deny, error: unexpectedly found multiple equivalent ACLs: [{UUID:e00a3879-2ab3-4944-939f-90cf61c11d8f Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a16323395479447859119_ingressDefaultDeny && (arp || nd) Meter:0xc00112af30 Name:0xc00112af40 Options:map[] Priority:1001 Severity:0xc00112af50} {UUID:972c1aad-2f5f-4394-8d28-23e0b997e2ee Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a16323395479447859119_ingressDefaultDeny && arp Meter:0xc00112aea0 Name:0xc00112aeb0 Options:map[] Priority:1001 Severity:0xc00112aec0}]


This is what this bug fixes. If you upgrade to a version with the PR merged, you should not see these errors anymore.

Comment 22 Surya Seetharaman 2022-07-23 08:47:47 UTC
PR merged:  openshift/ovn-kubernetes/pull/1210

Comment 25 Manish Pandey 2022-08-31 17:32:20 UTC
Surya, it seems my customer is also facing the same issue in production while upgrading from 4.9 to 4.10.z.
May I know in which version this fix will be delivered? Also, can this be backported to 4.10.z, since my customer is currently on 4.10.26?
Because the NetworkPolicy allow-from-other-namespaces fails to be created, the customer cannot access the OpenShift application route: the hostNetwork ingress pod is not able to connect to the app pod due to the broken NetworkPolicy.

Comment 26 Surya Seetharaman 2022-08-31 18:29:29 UTC
Technically backports should land all the way back to 4.9.z; the problem is that I now don't know how to open/clone backport bugs in JIRA.

Comment 27 Surya Seetharaman 2022-08-31 19:02:20 UTC
Here is the 4.11 backport bug: https://issues.redhat.com//browse/OCPBUGS-772. The PR has been opened; once that merges I will backport into 4.10 as well. Thanks for your patience.

Comment 28 Manish Pandey 2022-09-01 10:17:34 UTC
Surya, until this BZ is backported, is there any workaround to overcome this issue? If yes, could you please provide it? This broken NetworkPolicy is causing an outage for the customer's production cluster, since the router pods are unable to connect to the app pods and hence the OpenShift routes are inaccessible.

I know one way could be to rebuild the DB using the article https://access.redhat.com/articles/6963671, but I am wondering if there is some other, better way?

Comment 32 errata-xmlrpc 2023-01-17 19:49:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

