I saw the "unexpectedly found multiple equivalent ACLs" error below while trying to debug a NetworkPolicy problem.

Add a generic NetworkPolicy:

kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: allow-same-namespace
  namespace: nbc9-demo-project
spec:
  podSelector: {}
  ingress:
  - from:
    - podSelector: {}
  policyTypes:
  - Ingress

$ kubectl get pod ovnkube-master-pk89w -o jsonpath='{range .spec.containers[]}{@.image}'
quay.io/openshift/okd-content@sha256:79ee71e045a7b224a132f6c75b4220ec35b9a06049061a6bd9ca9fc976c412e5

[root@dev-nkjpp-master-2 ~]# ovnkube -v
I0609 17:33:34.930787 58 ovs.go:93] Maximum command line arguments set to: 191102
Version: 0.3.0
Git commit: 7bf36eea28fe66365d0dfdf8c39e3311ea14d19b
Git branch: release-4.10
Go version: go1.16.6
Build date: 2022-05-27
OS/Arch: linux amd64

The policy then fails to apply and retries, and when the NetworkPolicy is deleted, the ovnkube-master pod segfaults:

I0609 17:00:26.653710 1 policy.go:1092] Adding network policy allow-same-namespace in namespace nbc9-demo-project
E0609 17:00:26.656858 1 ovn.go:753] Failed to create network policy nbc9-demo-project/allow-same-namespace, error: failed to create default port groups and acls for policy: nbc9-demo-project/allow-same-namespace, error: unexpectedly found multiple equivalent ACLs: [{UUID:7b55ba0c-150f-4a63-9601-cfde25f29408 Action:drop Direction:from-lport ExternalIDs:map[default-deny-policy-type:Egress] Label:0 Log:false Match:inport == @a7830797310894963783_egressDefaultDeny Meter:0xc0010df310 Name:0xc0010df320 Options:map[apply-after-lb:true] Priority:1000 Severity:0xc0010df330} {UUID:60cb946a-46e9-4623-9ba4-3cb35f018ed6 Action:drop Direction:from-lport ExternalIDs:map[default-deny-policy-type:Egress] Label:0 Log:false Match:inport == @a7830797310894963783_egressDefaultDeny Meter:0xc0010df390 Name:0xc0010df3d0 Options:map[apply-after-lb:true] Priority:1000 Severity:0xc0010df3e0}]
I0609 17:00:51.437895 1 policy_retry.go:46] Network Policy Retry: 
nbc9-demo-project/allow-same-namespace retry network policy setup
I0609 17:00:51.437935 1 policy_retry.go:63] Network Policy Retry: Creating new policy for nbc9-demo-project/allow-same-namespace
I0609 17:00:51.437941 1 policy.go:1092] Adding network policy allow-same-namespace in namespace nbc9-demo-project
I0609 17:00:51.438174 1 policy_retry.go:65] Network Policy Retry create failed for nbc9-demo-project/allow-same-namespace, will try again later: failed to create default port groups and acls for policy: nbc9-demo-project/allow-same-namespace, error: unexpectedly found multiple equivalent ACLs: [{UUID:60cb946a-46e9-4623-9ba4-3cb35f018ed6 Action:drop Direction:from-lport ExternalIDs:map[default-deny-policy-type:Egress] Label:0 Log:false Match:inport == @a7830797310894963783_egressDefaultDeny Meter:0xc002215e00 Name:0xc002215e70 Options:map[apply-after-lb:true] Priority:1000 Severity:0xc002215e80} {UUID:7b55ba0c-150f-4a63-9601-cfde25f29408 Action:drop Direction:from-lport ExternalIDs:map[default-deny-policy-type:Egress] Label:0 Log:false Match:inport == @a7830797310894963783_egressDefaultDeny Meter:0xc0022b0310 Name:0xc0022b03a0 Options:map[apply-after-lb:true] Priority:1000 Severity:0xc000070ab0}]
I0609 17:01:02.679219 1 policy.go:1174] Deleting network policy allow-same-namespace in namespace nbc9-demo-project
E0609 17:01:02.679407 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 249 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1c19c80, 0x2e9a810)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x95
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x86
panic(0x1c19c80, 0x2e9a810)
    /usr/lib/golang/src/runtime/panic.go:965 +0x1b9
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).destroyNetworkPolicy(0xc0022c2000, 0x0, 0xc000bb9000, 0x0, 0x0)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/policy.go:1210 +0x55
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).deleteNetworkPolicy(0xc0022c2000, 0xc002544f00, 0x0, 0x0, 0x0)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/policy.go:1198 +0x43f
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).WatchNetworkPolicy.func4(0x1e7e840, 0xc002544f00)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/ovn.go:800 +0xae
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete(...)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:245
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnDelete(0xc000f4c4c0, 0x2160f10, 0xc002f498c0, 0x1e7e840, 0xc002544f00)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:288 +0x6a
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*Handler).OnDelete(...)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:52
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).newFederatedHandler.func3.1(0xc00463dbf0)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:340 +0x65
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).forEachHandler(0xc0002c61b0, 0x1e7e840, 0xc002544f00, 0xc003dc9d60)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:114 +0x156
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).newFederatedHandler.func3(0x1e7e840, 0xc002544f00)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:339 +0x1b2
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete(...)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:245
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:779 +0x166
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc002367760)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc003dc9f60, 0x2127a00, 0xc000229a70, 0x1bd5d01, 0xc000039740)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0x9b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc002367760, 0x3b9aca00, 0x0, 0x1, 0xc000039740)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(...)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run(0xc0004f3180)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:771 +0x95
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1(0xc0002bed80, 0xc000ed5850)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:73 +0x51
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71 +0x65
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
    panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1a021d5]
goroutine 249 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x109
panic(0x1c19c80, 0x2e9a810)
    /usr/lib/golang/src/runtime/panic.go:965 +0x1b9
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).destroyNetworkPolicy(0xc0022c2000, 0x0, 0xc000bb9000, 0x0, 0x0)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/policy.go:1210 +0x55
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).deleteNetworkPolicy(0xc0022c2000, 0xc002544f00, 0x0, 0x0, 0x0)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/policy.go:1198 +0x43f
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).WatchNetworkPolicy.func4(0x1e7e840, 0xc002544f00)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/ovn.go:800 +0xae
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete(...)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:245
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnDelete(0xc000f4c4c0, 0x2160f10, 0xc002f498c0, 0x1e7e840, 0xc002544f00)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:288 +0x6a
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*Handler).OnDelete(...)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:52
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).newFederatedHandler.func3.1(0xc00463dbf0)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:340 +0x65
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).forEachHandler(0xc0002c61b0, 0x1e7e840, 0xc002544f00, 0xc003dc9d60)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:114 +0x156
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).newFederatedHandler.func3(0x1e7e840, 0xc002544f00)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:339 +0x1b2
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete(...)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:245
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:779 +0x166
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc002367760)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc003dc9f60, 0x2127a00, 0xc000229a70, 0x1bd5d01, 0xc000039740)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0x9b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc002367760, 0x3b9aca00, 0x0, 0x1, 0xc000039740)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(...)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run(0xc0004f3180)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:771 +0x95
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1(0xc0002bed80, 0xc000ed5850)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:73 +0x51
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71 +0x65

Please let me know if any further information is required. I have a must-gather for this cluster, but the file attachment tool in Bugzilla won't let me attach anything larger than 19.5MB (the must-gather is 212.1MB).
This is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2091238#c1 and a fix is being worked on.
Hi Nate, thanks for the bz. Could you tell me the exact reproduction steps, and also how you ended up with this error:

unexpectedly found multiple equivalent ACLs: [{UUID:60cb946a-46e9-4623-9ba4-3cb35f018ed6 Action:drop Direction:from-lport ExternalIDs:map[default-deny-policy-type:Egress] Label:0 Log:false Match:inport == @a7830797310894963783_egressDefaultDeny Meter:0xc002215e00 Name:0xc002215e70 Options:map[apply-after-lb:true] Priority:1000 Severity:0xc002215e80} {UUID:7b55ba0c-150f-4a63-9601-cfde25f29408 Action:drop Direction:from-lport ExternalIDs:map[default-deny-policy-type:Egress] Label:0 Log:false Match:inport == @a7830797310894963783_egressDefaultDeny Meter:0xc0022b0310 Name:0xc0022b03a0 Options:map[apply-after-lb:true] Priority:1000 Severity:0xc000070ab0}]

I know the panic here is a problem, of course, but I am also trying to understand the multiple-ACLs error. Which exact 4.10 version of OCP is this?
Hi Surya, the multiple ACLs error happens when I create the network policy using the YAML at the top of the description. That's all: create the policy with that YAML, see the failure in the logs, delete the policy, and OVN crashes. It's very strange. This is OKD 4.10, not OCP. Your good colleagues on GitHub sent me here:

https://github.com/openshift/okd/issues/1257
https://github.com/ovn-org/ovn-kubernetes/issues/3031
Thanks Nate! In your case I know why the panic is happening, and I will post a PR to fix it. Let me see if I can reproduce the ACLs issue you are seeing, because when I tried to use your yaml on my cluster, creation worked fine:

I0610 19:29:43.867352 53 policy.go:715] Processing NetworkPolicy nbc9-demo-project/allow-same-namespace to have 0 local pods...
I0610 19:29:43.868605 53 model_client.go:344] Create operations generated as: [{Op:insert Table:Port_Group Row:map[external_ids:{GoMap:map[name:nbc9-demo-project_allow-same-namespace]} name:a17939553712128537144] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:u2596996267}]
I0610 19:29:43.868740 53 transact.go:41] Configuring OVN: [{Op:insert Table:Port_Group Row:map[external_ids:{GoMap:map[name:nbc9-demo-project_allow-same-namespace]} name:a17939553712128537144] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:u2596996267}]
I0610 19:29:43.868873 53 client.go:781] "msg"="transacting operations" "database"="OVN_Northbound" "operations"="[{Op:insert Table:Port_Group Row:map[external_ids:{GoMap:map[name:nbc9-demo-project_allow-same-namespace]} name:a17939553712128537144] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:u2596996267}]"
I0610 19:29:43.869697 53 cache.go:668] cache "msg"="inserting row" "database"="OVN_Northbound" "table"="Port_Group" "uuid"="9b7f55fc-af82-48b0-a2e9-f6a5c364bcb4" "model"="&{UUID:9b7f55fc-af82-48b0-a2e9-f6a5c364bcb4 ACLs:[] ExternalIDs:map[name:nbc9-demo-project_allow-same-namespace] Name:a17939553712128537144 Ports:[]}"
I0610 19:29:43.871151 53 policy.go:380] ACL for network policy: allow-same-namespace, updated to new log level:
I0610 19:29:43.871206 53 obj_retry.go:1363] Creating *v1.NetworkPolicy nbc9-demo-project/allow-same-namespace took: 11.452211ms

To me it seems like there is some residue of ACLs in your cluster, or something like that. But let me close this bug as a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=2091238 and work on the fix for the nil pointer panic. Is that ok with you?
Or actually, on second thought: the other bug shows the same symptoms as well:

2022-06-01T08:17:45.418763259Z I0601 08:17:45.418759 1 policy.go:1092] Adding network policy allow-from-same-namespace in namespace xxxxx
2022-06-01T08:17:45.419281334Z E0601 08:17:45.419267 1 ovn.go:753] Failed to create network policy xxxxx/allow-from-same-namespace, error: failed to create default port groups and acls for policy: xxxxx/allow-from-same-namespace, error: unexpectedly found multiple equivalent ACLs: [{UUID:801a19c3-0464-451d-b9d3-373e7503f6ba Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a11263692294733561997_ingressDefaultDeny && arp Meter:0xc0025e1000 Name:0xc0025e1010 Options:map[] Priority:1001 Severity:0xc0025e1020} {UUID:2b2f0849-0075-4046-b4cf-2dec306c9550 Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a11263692294733561997_ingressDefaultDeny && (arp || nd) Meter:0xc0025e1070 Name:0xc0025e1080 Options:map[] Priority:1001 Severity:0xc0025e1090}]
2022-06-01T08:17:45.419289685Z I0601 08:17:45.419279 1 policy.go:1092] Adding network policy allow-from-other-namespaces in namespace xxxxx
2022-06-01T08:17:45.419750650Z E0601 08:17:45.419739 1 ovn.go:753] Failed to create network policy xxxxx/allow-from-other-namespaces, error: failed to create default port groups and acls for policy: xxxxx/allow-from-other-namespaces, error: unexpectedly found multiple equivalent ACLs: [{UUID:14c169a9-069f-45cd-86d3-3d2038340105 Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a4969899092748592978_ingressDefaultDeny && (arp || nd) Meter:0xc0025e16f0 Name:0xc0025e1700 Options:map[] Priority:1001 Severity:0xc0025e1710} {UUID:00202ce5-2851-426a-ba46-16d4f1b06c41 Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a4969899092748592978_ingressDefaultDeny && arp Meter:0xc0025e1750 Name:0xc0025e1760 Options:map[] Priority:1001 Severity:0xc0025e1770}]
2022-06-01T08:17:45.419759490Z I0601 08:17:45.419748 1 policy.go:1092] Adding network policy allow-from-same-namespace in namespace xxxxx
2022-06-01T08:17:45.422286498Z E0601 08:17:45.420247 1 ovn.go:753] Failed to create network policy xxxxx/allow-from-same-namespace, error: failed to create default port groups and acls for policy: xxxxx/allow-from-same-namespace, error: unexpectedly found multiple equivalent ACLs: [{UUID:c10e465c-4241-4d8c-8f78-7ceb6e38aeaa Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a2533190148303767946_ingressDefaultDeny && arp Meter:0xc0025e1e00 Name:0xc0025e1e10 Options:map[] Priority:1001 Severity:0xc0025e1e20} {UUID:626d3785-8bce-4198-b322-ae747a8c95fb Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a2533190148303767946_ingressDefaultDeny && (arp || nd) Meter:0xc0025e1ed0 Name:0xc0025e1ee0 Options:map[] Priority:1001 Severity:0xc0025e1ef0}]
2022-06-01T08:17:45.422286498Z I0601 08:17:45.420261 1 ovn.go:808] Bootstrapping existing policies and cleaning stale policies took 1.049654614s
2022-06-01T08:17:45.422286498Z I0601 08:17:45.420352 1 policy.go:1174] Deleting network policy allow-from-same-namespace in namespace xxxxx
2022-06-01T08:17:45.422286498Z E0601 08:17:45.420411 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)

Since creation never worked, we ended up passing a nil pointer during deletion, which causes the panic. I'll fix that as part of https://bugzilla.redhat.com/show_bug.cgi?id=2091238 and keep this bug open to track the "unexpectedly found multiple equivalent ACLs" error.
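The create-failed-then-delete crash described above boils down to a classic pattern: the policy cache is only populated on a successful create, and the delete path dereferences the cache entry without a nil check. The sketch below is illustrative only (the types and method names loosely mirror the stack trace, not the actual ovn-kubernetes code):

```go
package main

import "fmt"

// networkPolicy and controller are hypothetical stand-ins for the real
// ovn-kubernetes types; the cache is filled only on successful create.
type networkPolicy struct{ portGroupName string }

type controller struct {
	policies map[string]*networkPolicy
}

// destroyNetworkPolicy dereferences np without a nil check, like the
// pre-fix policy.go:1210 frame in the trace above. recover() here stands
// in for runtime.HandleCrash so the sketch can report the failure.
func (c *controller) destroyNetworkPolicy(np *networkPolicy) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("panic: %v", r)
		}
	}()
	fmt.Println("deleting port group", np.portGroupName) // np == nil -> nil deref
	return nil
}

func (c *controller) deleteNetworkPolicy(key string) error {
	np := c.policies[key] // nil: the earlier create failed, nothing was cached
	return c.destroyNetworkPolicy(np)
}

func main() {
	c := &controller{policies: map[string]*networkPolicy{}}
	// Create failed earlier ("multiple equivalent ACLs"), so the cache is
	// empty; deleting the policy dereferences a nil pointer.
	err := c.deleteNetworkPolicy("nbc9-demo-project/allow-same-namespace")
	fmt.Println(err != nil)
}
```

The real fix is simply to check for the nil cache entry before tearing down OVN state.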
hm, seems like we surely introduced this with https://github.com/openshift/ovn-kubernetes/pull/1043/files:

[root@bc104a3c4732 ~]# ovn-nbctl list acl | grep @a14044754821019150557_ingressDefaultDeny -C 5
action              : allow
direction           : to-lport
external_ids        : {default-deny-policy-type=Ingress}
label               : 0
log                 : false
match               : "outport == @a14044754821019150557_ingressDefaultDeny && (arp || nd)"
meter               : acl-logging
name                : elk_ARPallowPolicy
options             : {}
priority            : 1001
severity            : info
--
action              : allow
direction           : to-lport
external_ids        : {default-deny-policy-type=Ingress}
label               : 0
log                 : false
match               : "outport == @a14044754821019150557_ingressDefaultDeny && arp"
meter               : acl-logging
name                : elk_ARPallowPolicy
options             : {}
priority            : 1001
severity            : info
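The two rows above differ only in the match string: the linked PR changed the ARP-allow match from `&& arp` to `&& (arp || nd)`. One plausible way to see how the duplicate appears, assuming the pre-fix code looked up the existing ACL by its exact match string (a simplification; `buildARPAllowMatch` is a hypothetical helper, not the real function):

```go
package main

import "fmt"

// buildARPAllowMatch sketches the ARP-allow match string before and after
// the change in openshift/ovn-kubernetes#1043.
func buildARPAllowMatch(portGroup string, includeND bool) string {
	if includeND {
		return "outport == @" + portGroup + " && (arp || nd)" // post-PR form
	}
	return "outport == @" + portGroup + " && arp" // pre-upgrade form
}

func main() {
	pg := "a14044754821019150557_ingressDefaultDeny"
	before := buildARPAllowMatch(pg, false)
	after := buildARPAllowMatch(pg, true)
	// If the lookup is keyed on the exact match string, the upgraded
	// controller does not find the pre-upgrade row and inserts a second,
	// near-identical ACL next to it -- the pair seen in ovn-nbctl above.
	fmt.Println(before == after)
}
```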
> Please let me know if any further information is required. I have a must-gather for this cluster but the file attachment tool in bugzilla won't let me attach anything larger than 19.5MB (the must-gather is 212.1MB)

Could you please attach the following from the must-gather:

must-gather.local.6240950779494572355/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-6066981e2b360c670ce2b872db3ae6cacb13bf9cb3b80c5235955ea5de29be14/network_logs/leader_nbdb.gz

and

must-gather.local.6240950779494572355/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-6066981e2b360c670ce2b872db3ae6cacb13bf9cb3b80c5235955ea5de29be14/network_logs/leader_sbdb.gz

and all logs inside:

must-gather.local.6240950779494572355/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-6066981e2b360c670ce2b872db3ae6cacb13bf9cb3b80c5235955ea5de29be14/namespaces/openshift-ovn-kubernetes/pods

These shouldn't come anywhere near the size limit, and attaching them individually will help.
Created attachment 1889401 [details]
selected items from must-gather.local.6813085809915985852

Thank you very much for looking into this. Please find attached:

.../network_logs/leader_sbdb.gz
.../network_logs/leader_nbdb.gz

and

.../namespaces/openshift-ovn-kubernetes/.../logs/*.log

Let me know if I missed anything or if you need anything else. I'll poke around some today and see if I can find a place to put the full must-gather. Worst case I can put it on our linux mirror. :)
ok, I found a place to put the full must-gather in case I missed anything in the slimmed-down one: https://nbc9-snips.cloud.duke.edu/snips/must-gather.local.6813085809915985852.zip
Thanks Nate! Downloaded it!

Projects: 75
API URL: ['https://api.dev.okd4.fitz.cloud.duke.edu:6443']
Platform: ['VSphere']
Cluster ID: ['98250500-3303-4c96-9873-ab4e969da463']
Desired Version: ['4.10.0-0.okd-2022-05-28-062148']

Looking at why the ACL creation failed in your case:

_uuid               : b9349cb8-99b0-4130-92cf-ab11b4874a11
action              : drop
direction           : from-lport
external_ids        : {default-deny-policy-type=Egress}
label               : 0
log                 : false
match               : "inport == @a12933912868060780448_egressDefaultDeny"
meter               : acl-logging
name                : oit-ssi-fluentd_allow-same-namespace
options             : {apply-after-lb="true"}
priority            : 1000
severity            : info
--
_uuid               : 5b98f17d-789f-4de1-9beb-36741bfa40d1
action              : drop
direction           : from-lport
external_ids        : {default-deny-policy-type=Egress}
label               : 0
log                 : false
match               : "inport == @a12933912868060780448_egressDefaultDeny"
meter               : acl-logging
name                : oit-ssi-fluentd_allow-from-openshift-ingress
options             : {apply-after-lb="true"}
priority            : 1000
severity            : info

OVNK code thinks these two are the same ACL, and we error out. The two policies involved:

  name: allow-from-openshift-ingress
  namespace: oit-ssi-fluentd
  resourceVersion: "4266701"
  uid: 8b86da03-db3e-4385-ace9-75df5564d1bd
spec:
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          policy-group.network.openshift.io/ingress: ""
  podSelector: {}
  policyTypes:
  - Ingress

versus

  name: allow-same-namespace
  namespace: oit-ssi-fluentd
  resourceVersion: "4267434"
  uid: fb8e716c-3c94-447e-b8e7-cc249d68c410
spec:
  ingress:
  - from:
    - podSelector: {}
  podSelector: {}
  policyTypes:
  - Ingress

I feel that's a bug on our side. I will look at:

// IsEquivalentACL if it has same uuid, or if it has same name
// and external ids, or if it has same priority, direction, match
// and action.
func IsEquivalentACL(existing *nbdb.ACL, searched *nbdb.ACL) bool {
	if searched.UUID != "" && existing.UUID == searched.UUID {
		return true
	}

	eName := getACLName(existing)
	sName := getACLName(searched)
	// TODO if we want to support adding/removing external ids,
	// we need to compare them differently, perhaps just the common subset
	if eName != "" && eName == sName && reflect.DeepEqual(existing.ExternalIDs, searched.ExternalIDs) {
		return true
	}

	return existing.Priority == searched.Priority &&
		existing.Direction == searched.Direction &&
		existing.Match == searched.Match &&
		existing.Action == searched.Action
}

and talk to the team about the logic around "matching ACLs".
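The equivalence rule quoted above can be distilled into a small standalone sketch (the `ACL` type is simplified, and the ExternalIDs/name comparison is reduced to a plain name check). It shows why the two default-deny ACLs from the must-gather, which share priority, direction, match, and action but carry different names, fall through to the last comparison and are treated as duplicates:

```go
package main

import "fmt"

// ACL is a simplified stand-in for nbdb.ACL (illustration only).
type ACL struct {
	UUID      string
	Name      string
	Priority  int
	Direction string
	Match     string
	Action    string
}

// isEquivalentACL mirrors the quoted 4.10 logic: same UUID, or same
// non-empty name, or same priority/direction/match/action.
func isEquivalentACL(existing, searched ACL) bool {
	if searched.UUID != "" && existing.UUID == searched.UUID {
		return true
	}
	if existing.Name != "" && existing.Name == searched.Name {
		return true
	}
	return existing.Priority == searched.Priority &&
		existing.Direction == searched.Direction &&
		existing.Match == searched.Match &&
		existing.Action == searched.Action
}

func main() {
	// The two egressDefaultDeny ACLs from the must-gather: different names
	// (set by different policies), otherwise identical fields.
	a := ACL{
		UUID: "b9349cb8", Name: "oit-ssi-fluentd_allow-same-namespace",
		Priority: 1000, Direction: "from-lport",
		Match: "inport == @a12933912868060780448_egressDefaultDeny", Action: "drop",
	}
	b := ACL{
		UUID: "5b98f17d", Name: "oit-ssi-fluentd_allow-from-openshift-ingress",
		Priority: 1000, Direction: "from-lport",
		Match: "inport == @a12933912868060780448_egressDefaultDeny", Action: "drop",
	}
	// Names differ, so the fallback field comparison decides: equivalent.
	fmt.Println(isEquivalentACL(a, b))
}
```

This is exactly the situation the "unexpectedly found multiple equivalent ACLs" check trips over once two such rows exist in the NB database.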
I appreciate your expert analysis! Hopefully fixing that bug will help some other poor soul who finds themselves in this weird edge case with me. My understanding from the documentation and examples was that the "allow ingress" rule and "allow same namespace" rule would be needed in the presence of a "default deny" rule. I'm a little surprised that this specific configuration isn't more common, so I'm assuming I've misunderstood something. Please let me know if you need anything else from me.

It does sound like in this scenario I should be able to just clear out the network policies, but given the above apparent misunderstanding I'm not quite sure how to structure the NetworkPolicy objects such that:

- ingress-related traffic is allowed
- same-namespace traffic is allowed
- everything else is not allowed

If I've misunderstood, I'll happily accept a pointer to some documentation. I was previously referring to https://docs.okd.io/4.10/networking/network_policy/about-network-policy.html
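For the three goals listed above, the usual pattern is the pair of policies below, essentially the same manifests already discussed in this thread (namespace name kept from this report; treat this as a sketch, not official guidance). Note that any pod selected by a policy with `policyTypes: [Ingress]` is automatically default-denied for all ingress not explicitly allowed, so a separate deny-all object is optional:

```yaml
# Allow traffic from the OpenShift ingress router pods.
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: allow-from-openshift-ingress
  namespace: oit-ssi-fluentd
spec:
  podSelector: {}
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          policy-group.network.openshift.io/ingress: ""
  policyTypes:
  - Ingress
---
# Allow traffic between pods in the same namespace.
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: allow-same-namespace
  namespace: oit-ssi-fluentd
spec:
  podSelector: {}
  ingress:
  - from:
    - podSelector: {}
  policyTypes:
  - Ingress
```

Because both policies select all pods (`podSelector: {}`), every pod in the namespace is isolated for ingress, and only router and same-namespace traffic is admitted.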
This bug will fix two things: https://github.com/ovn-org/ovn-kubernetes/pull/3034#issuecomment-1155488213. The actual panic will be fixed via https://bugzilla.redhat.com/show_bug.cgi?id=2091238
Hey Nate: So I spoke to my team about this, and I actually see in the 4.10 code:

if len(nsInfo.networkPolicies) == 0 {
	err = oc.createDefaultDenyPGAndACLs(policy.Namespace, policy.Name, nsInfo)
	if err != nil {
		nsUnlock()
		return fmt.Errorf("failed to create default port groups and acls for policy: %s/%s, error: %v", policy.Namespace, policy.Name, err)
	}
}

So in the end we only create ONE default deny ACL for ingress and ONE for egress, even if we have more than one network policy in the namespace... I am not sure how you ended up in this situation. I took all your policies from the must-gather and created them on a 4.10 OKD cluster, and it works fine for me. That is, I do not get TWO egressDefaultDeny ACLs like your cluster did. You somehow seem to have gotten both:

_uuid               : b9349cb8-99b0-4130-92cf-ab11b4874a11
action              : drop
direction           : from-lport
external_ids        : {default-deny-policy-type=Egress}
label               : 0
log                 : false
match               : "inport == @a12933912868060780448_egressDefaultDeny"
meter               : acl-logging
name                : oit-ssi-fluentd_allow-same-namespace
options             : {apply-after-lb="true"}
priority            : 1000
severity            : info

AND

_uuid               : 5b98f17d-789f-4de1-9beb-36741bfa40d1
action              : drop
direction           : from-lport
external_ids        : {default-deny-policy-type=Egress}
label               : 0
log                 : false
match               : "inport == @a12933912868060780448_egressDefaultDeny"
meter               : acl-logging
name                : oit-ssi-fluentd_allow-from-openshift-ingress
options             : {apply-after-lb="true"}
priority            : 1000
severity            : info

in your DB. I can't tell from the logs how that happened: the must-gather you attached is missing logs from the startup of the pods, probably because of the panic and restarts, so I can't figure out exactly when the allow-from-openshift-ingress and allow-same-namespace ACLs were first created before you ended up with duplicates. But basically, what I am trying to say is we never should have hit that problem to begin with.

Here is what I did on my cluster: I created the oit-ssi-fluentd namespace and then created the allow-from-openshift-ingress.yaml, allow-from-openshift-monitoring.yaml, allow-same-namespace.yaml, and fluentd-input.yaml network policies. It works perfectly fine for me; I only have one ACL:

_uuid               : f09c94a7-6dda-4327-9903-39162c439644
--
external_ids        : {default-deny-policy-type=Egress}
label               : 0
log                 : false
match               : "inport == @a12933912868060780448_egressDefaultDeny"
meter               : acl-logging
name                : oit-ssi-fluentd_allow-same-namespace
options             : {apply-after-lb="true"}
priority            : 1000
severity            : info

So the problem is that you aren't able to clear out the policies because of the other bug, the panic on deletion: https://bugzilla.redhat.com/show_bug.cgi?id=2091238. One thing you could do is clean up the OVN dbs, i.e. ssh into the master nodes, rm -rf the ovn dbs, and then restart the ovnk masters, or reinstall OKD fresh and see. Because as per the code, we never should have gotten the two ACLs you have. Not sure if you have more container logs on the master nodes we could look at to see how we reached this state with two ACLs (there must have been a successful create in one of the policies before ovnk master restarted).
Thank you again for the masterful analysis. We really do appreciate it. That scenario does make sense to me, as I know we had policies working previously (the fluentd aggregator mentioned in this ACL has been working for months). However, this cluster has been updated several times, so there's some chance this was introduced during one of those upgrades. There was quite a bit of rockiness in the 4.9 era. I had considered clearing out the ovndb, but wanted to capture this bug report and the GitHub issue first so that we could make sure it wasn't a deeper flaw. It sounds like you and your team are confident that this shouldn't be an issue going forward, which is also reassuring. Thanks again!
Yes, thanks Nate! Meanwhile I have done a proactive fix in case we do run into these issues again: https://github.com/ovn-org/ovn-kubernetes/pull/3038 where we fix these glitches on restart.
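The cleanup referenced above amounts to deduplicating equivalent ACL rows at startup. A rough sketch of that idea, under the assumption that "equivalent" means the same (priority, direction, match, action) tuple (this is an illustration, not the actual implementation in ovn-org/ovn-kubernetes#3038):

```go
package main

import "fmt"

// ACL is a simplified stand-in for nbdb.ACL (illustration only).
type ACL struct {
	UUID      string
	Priority  int
	Direction string
	Match     string
	Action    string
}

// dedupeACLs keeps the first ACL for each (priority, direction, match,
// action) tuple and returns the UUIDs of the redundant copies, which a
// startup cleanup pass could then delete from the NB database.
func dedupeACLs(acls []ACL) (stale []string) {
	type key struct {
		Priority                 int
		Direction, Match, Action string
	}
	seen := map[key]bool{}
	for _, a := range acls {
		k := key{a.Priority, a.Direction, a.Match, a.Action}
		if seen[k] {
			stale = append(stale, a.UUID) // duplicate: schedule for deletion
			continue
		}
		seen[k] = true
	}
	return stale
}

func main() {
	// The duplicate pair from the must-gather: same fields, different UUIDs.
	acls := []ACL{
		{UUID: "b9349cb8", Priority: 1000, Direction: "from-lport",
			Match: "inport == @a12933912868060780448_egressDefaultDeny", Action: "drop"},
		{UUID: "5b98f17d", Priority: 1000, Direction: "from-lport",
			Match: "inport == @a12933912868060780448_egressDefaultDeny", Action: "drop"},
	}
	fmt.Println(dedupeACLs(acls)) // only the second copy is flagged as stale
}
```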
*** Bug 2101366 has been marked as a duplicate of this bug. ***
Hey Anurag, downstream merge posted: https://github.com/openshift/ovn-kubernetes/pull/1210 specific commit is https://github.com/openshift/ovn-kubernetes/pull/1210/commits/4d446baf305e16970f541bf0373f08eb7f6c72ca
Steps to verify for QE:

1) Bring up a cluster on 4.9.31 or older.

2) Create a sample network policy.

3) Check that an ACL like this exists by running the ovn-nbctl list acl command:

_uuid               : 8907909e-ff7f-48d0-becd-28fc995a7f60
action              : allow
direction           : from-lport
external_ids        : {default-deny-policy-type=Egress}
label               : 0
log                 : false
match               : "inport == @a16323395479447859119_egressDefaultDeny && arp"
meter               : acl-logging
name                : surya_ARPallowPolicy
options             : {apply-after-lb="true"}
priority            : 1001
severity            : info

4) Upgrade to 4.9.32, or just swap out the images for ovnk after scaling down CVO (but really this is an upgrade bug). Now repeat step 3 and you should see:

_uuid               : 91392d7e-d2db-4530-a632-4db0f7cb6edf
action              : allow
direction           : from-lport
external_ids        : {default-deny-policy-type=Egress}
label               : 0
log                 : false
match               : "inport == @a16323395479447859119_egressDefaultDeny && (arp || nd)"
meter               : acl-logging
name                : surya_ARPallowPolicy
options             : {apply-after-lb="true"}
priority            : 1001
severity            : info

_uuid               : 8907909e-ff7f-48d0-becd-28fc995a7f60
action              : allow
direction           : from-lport
external_ids        : {default-deny-policy-type=Egress}
label               : 0
log                 : false
match               : "inport == @a16323395479447859119_egressDefaultDeny && arp"
meter               : acl-logging
name                : surya_ARPallowPolicy
options             : {apply-after-lb="true"}
priority            : 1001
severity            : info

5) Now upgrade (or swap images) to 4.10.13 or something else in 4.10.z, and you will see errors like:

I0721 13:45:34.493941 1 policy_retry.go:65] Network Policy Retry create failed for surya/default-deny, will try again later: failed to create default port groups and acls for policy: surya/default-deny, error: unexpectedly found multiple equivalent ACLs: [{UUID:e00a3879-2ab3-4944-939f-90cf61c11d8f Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a16323395479447859119_ingressDefaultDeny && (arp || nd) Meter:0xc00112af30 Name:0xc00112af40 Options:map[] Priority:1001 Severity:0xc00112af50} {UUID:972c1aad-2f5f-4394-8d28-23e0b997e2ee Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a16323395479447859119_ingressDefaultDeny && arp Meter:0xc00112aea0 Name:0xc00112aeb0 Options:map[] Priority:1001 Severity:0xc00112aec0}]

This is what this bug fixes: if you upgrade to a version with the PR merged, you should not see these errors anymore.
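For step 3 and 4 above, spotting the duplicate pair in a long `ovn-nbctl list acl` dump can be tedious. A small helper QE could adapt (a sketch that assumes the default plain-text `key : value` output format shown in this thread):

```go
package main

import (
	"fmt"
	"strings"
)

// duplicateMatches scans `ovn-nbctl list acl` plain-text output and reports
// match strings that appear on more than one ACL row -- the signature of
// the duplicate default-deny / ARP-allow ACLs described in this bug.
func duplicateMatches(nbctlOutput string) []string {
	counts := map[string]int{}
	for _, line := range strings.Split(nbctlOutput, "\n") {
		k, v, ok := strings.Cut(line, ":")
		if !ok || strings.TrimSpace(k) != "match" {
			continue
		}
		counts[strings.TrimSpace(v)]++
	}
	var dups []string
	for m, n := range counts {
		if n > 1 {
			dups = append(dups, m)
		}
	}
	return dups
}

func main() {
	// Two rows with the same match string, as in step 4 above.
	out := `match               : "inport == @pg_egressDefaultDeny && arp"
match               : "inport == @pg_egressDefaultDeny && arp"`
	fmt.Println(duplicateMatches(out)) // reports the duplicated match string
}
```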
PR merged: openshift/ovn-kubernetes/pull/1210
Surya, it seems my customer is also facing the same issue in production while upgrading from 4.9 to 4.10.z. May I know which version this fix will be delivered in? Also, can this be backported to 4.10.z, since my customer is currently on 4.10.26? Because of the failure to create the NetworkPolicy allow-from-other-namespaces, the customer cannot access the OpenShift application route: the hostNetwork ingress pod is not able to connect to the app pod due to the broken NetworkPolicy.
Technically, backports should land all the way into 4.9.z; the problem is that I don't yet know how to open/clone backport bugs in JIRA.
Here is the 4.11 backport bug: https://issues.redhat.com//browse/OCPBUGS-772. The PR has been opened; once that merges, I will backport into 4.10 as well. Thanks for your patience.
Surya, until this BZ is backported, is there any workaround to overcome this issue? If yes, could you please share it? This broken NetworkPolicy is causing an outage for the customer's production cluster, since the router pods are unable to connect to the app pods, making the OpenShift routes inaccessible. I know one option would be to rebuild the db using the article https://access.redhat.com/articles/6963671, but is there a better way?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399