Bug 2091238
| Summary: | NetworkPolicies: ovnkube-master pods crashing due to panic: "invalid memory address or nil pointer dereference" | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Daein Park <dapark> |
| Component: | Networking | Assignee: | Surya Seetharaman <surya> |
| Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | bugzilla.redhat.com, bverschu, christoph.obexer, fgrosjea, jan.public, surya, wking |
| Version: | 4.8 | Keywords: | FastFix |
| Target Milestone: | --- | | |
| Target Release: | 4.12.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: |
Cause:
If a network policy fails to be created due to an error in the creation path, it is not added to the internal OVNK cache. The cached entry for that policy is therefore nil, and a subsequent attempt to delete the policy panics on the nil pointer.
Consequence:
The OVNK master leader pod would go into CrashLoopBackOff.
Fix:
When a deletion is attempted for a policy that was never created successfully, check whether the cached policy is nil and, if so, skip the deletion, since the policy was never created in the first place.
Result:
The OVNK master stays healthy.
|
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-01-17 19:49:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2104717 | | |
This occurs on 4.10.15 as well. Looking at the stack trace from the 4.10 must-gather:

~~~
2022-06-01T08:17:45.422286498Z github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).deleteNetworkPolicy(0xc00142eb40, 0xc003d6b580, 0x0)
2022-06-01T08:17:45.422286498Z /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/policy.go:1198 +0x369
2022-06-01T08:17:45.422286498Z github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).WatchNetworkPolicy.func3({0x1b33da0, 0xc003d6b580}, {0x1b33da0, 0xc002c4c000})
2022-06-01T08:17:45.422286498Z /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/ovn.go:776 +0x2aa
~~~

and comparing this call to deleteNetworkPolicy [1] with the func definition [2]:

> func (oc *Controller) deleteNetworkPolicy(policy *knet.NetworkPolicy, np *networkPolicy) error {

it looks like a nil pointer is passed in for the second parameter, np. That pointer is still nil in the next call, to destroyNetworkPolicy [3]:

~~~
2022-06-01T08:17:45.422286498Z github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).destroyNetworkPolicy(0xc00142eb40, 0x0, 0x0)
2022-06-01T08:17:45.422286498Z /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/policy.go:1210 +0x5d
~~~

> func (oc *Controller) destroyNetworkPolicy(np *networkPolicy, lastPolicy bool) error {

making the call to Lock [4] panic.

There is a third-party tool (espejo: https://github.com/vshn/espejo) used to manage network policies in this cluster. This tool does a full reconcile every couple of minutes. During this reconciliation, all NetworkPolicies are updated, but with no actual changes (only the "generation" and "resourceVersion" fields are updated). That may explain why the update handler is called without obvious reason in the first place [1].
[1] https://github.com/openshift/ovn-kubernetes/blob/9495a1db74e8166298da8a9a392daaee977e6889/go-controller/pkg/ovn/ovn.go#L776
[2] https://github.com/openshift/ovn-kubernetes/blob/9495a1db74e8166298da8a9a392daaee977e6889/go-controller/pkg/ovn/policy.go#L1173
[3] https://github.com/openshift/ovn-kubernetes/blob/9495a1db74e8166298da8a9a392daaee977e6889/go-controller/pkg/ovn/policy.go#L1198
[4] https://github.com/openshift/ovn-kubernetes/blob/9495a1db74e8166298da8a9a392daaee977e6889/go-controller/pkg/ovn/policy.go#L1210

Looking at the code and the tracebacks from both 4.9.32 and 4.10.15 in the previous comments, yes, it's clear we pass a nil value as a parameter into:

~~~
// destroys a particular network policy
// if nsInfo is provided, the entire port group will be deleted for ingress/egress directions
// lastPolicy indicates if no other policies are using the respective portgroup anymore
func (oc *Controller) destroyNetworkPolicy(np *networkPolicy, lastPolicy bool) error {
	np.Lock()
	defer np.Unlock()
	np.deleted = true
	oc.shutdownHandlers(np)
~~~

It seems np.Lock() is where we panic. Looking into why this is a problem during updates. np comes from:

~~~
// try to use the more official np found in nsInfo
if foundNp := nsInfo.networkPolicies[policy.Name]; foundNp != nil {
	np = foundNp
}
~~~

Wondering if foundNp and np would also be nil?
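The crash mechanism described above can be reproduced in isolation. The sketch below is a minimal, hypothetical stand-in (the type is simplified; the real networkPolicy in ovn-kubernetes carries much more state), showing why calling Lock() through a nil *networkPolicy panics with exactly the "invalid memory address or nil pointer dereference" runtime error seen in the traces:

```go
package main

import (
	"fmt"
	"sync"
)

// Simplified stand-in for ovn-kubernetes' networkPolicy type, which
// embeds a sync.Mutex (hence the np.Lock() call in destroyNetworkPolicy).
type networkPolicy struct {
	sync.Mutex
	deleted bool
}

// destroy mirrors the shape of destroyNetworkPolicy: it locks the policy
// before mutating it. With np == nil, np.Lock() must dereference the nil
// pointer to reach the embedded mutex, which panics at runtime. The
// recover here only exists to turn the panic into a printable error.
func destroy(np *networkPolicy) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	np.Lock()
	defer np.Unlock()
	np.deleted = true
	return nil
}

func main() {
	fmt.Println(destroy(&networkPolicy{})) // <nil>
	fmt.Println(destroy(nil))              // recovered: runtime error: invalid memory address or nil pointer dereference
}
```

In the real controller there is no recover at this call site, so the panic unwinds through the event handler and takes down the ovnkube-master process.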
4.9.32:

~~~
2022-05-27T15:48:27.957950868Z github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).deleteNetworkPolicy(0xc002fcc6c0, 0xc0006cc580, 0x0, 0x0, 0x0)
~~~

4.10.15:

~~~
2022-06-01T08:17:45.422286498Z github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).deleteNetworkPolicy(0xc00142eb40, 0xc003d6b580, 0x0)
~~~

Yes, that's the case here. In update we call:

~~~
} else if err := oc.deleteNetworkPolicy(oldPolicy, nil); err != nil {
	oc.initRetryPolicyWithDelete(oldPolicy, nil)
	oc.initRetryPolicy(newPolicy)
	oc.unSkipRetryPolicy(oldPolicy)
	klog.Errorf("Failed to delete network policy %s, during update: %v", getPolicyNamespacedName(oldPolicy), err)
	return
}
~~~

which means np = nil here:

~~~
func (oc *Controller) deleteNetworkPolicy(policy *knet.NetworkPolicy, np *networkPolicy) error {
~~~

Then inside this function I am guessing foundNp is also nil somehow (not quite sure how yet), but we seem to pass that nil into destroyNetworkPolicy:

4.9.32:

~~~
2022-05-27T15:48:27.957950868Z github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).destroyNetworkPolicy(0xc002fcc6c0, 0x0, 0xc000fc9901, 0x0, 0x0)
2022-05-27T15:48:27.957950868Z /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/policy.go:1287 +0x55
~~~

4.10.15:

~~~
2022-06-01T08:17:45.422286498Z github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).destroyNetworkPolicy(0xc00142eb40, 0x0, 0x0)
2022-06-01T08:17:45.422286498Z /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/policy.go:1210 +0x5d
~~~

So calling Lock on a nil policy is causing the panic.

Update: Bram is collecting a core dump from the cluster for further analysis, because the real question is how foundNp could be nil here:

~~~
// try to use the more official np found in nsInfo
if foundNp := nsInfo.networkPolicies[policy.Name]; foundNp != nil {
	np = foundNp
}
~~~

The policy.Name key from oldPolicy should have yielded a result... right? A core dump will be able to help more here.
Looking at the master logs, it is clear what happened and what caused the panic:

~~~
2022-06-01T08:17:45.418763259Z I0601 08:17:45.418759 1 policy.go:1092] Adding network policy allow-from-same-namespace in namespace xxxxx
2022-06-01T08:17:45.419281334Z E0601 08:17:45.419267 1 ovn.go:753] Failed to create network policy xxxxx/allow-from-same-namespace, error: failed to create default port groups and acls for policy: xxxxx/allow-from-same-namespace, error: unexpectedly found multiple equivalent ACLs: [{UUID:801a19c3-0464-451d-b9d3-373e7503f6ba Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a11263692294733561997_ingressDefaultDeny && arp Meter:0xc0025e1000 Name:0xc0025e1010 Options:map[] Priority:1001 Severity:0xc0025e1020} {UUID:2b2f0849-0075-4046-b4cf-2dec306c9550 Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a11263692294733561997_ingressDefaultDeny && (arp || nd) Meter:0xc0025e1070 Name:0xc0025e1080 Options:map[] Priority:1001 Severity:0xc0025e1090}]
2022-06-01T08:17:45.419289685Z I0601 08:17:45.419279 1 policy.go:1092] Adding network policy allow-from-other-namespaces in namespace xxxxx
2022-06-01T08:17:45.419750650Z E0601 08:17:45.419739 1 ovn.go:753] Failed to create network policy xxxxx/allow-from-other-namespaces, error: failed to create default port groups and acls for policy: xxxxx/allow-from-other-namespaces, error: unexpectedly found multiple equivalent ACLs: [{UUID:14c169a9-069f-45cd-86d3-3d2038340105 Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a4969899092748592978_ingressDefaultDeny && (arp || nd) Meter:0xc0025e16f0 Name:0xc0025e1700 Options:map[] Priority:1001 Severity:0xc0025e1710} {UUID:00202ce5-2851-426a-ba46-16d4f1b06c41 Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a4969899092748592978_ingressDefaultDeny && arp Meter:0xc0025e1750 Name:0xc0025e1760 Options:map[] Priority:1001 Severity:0xc0025e1770}]
2022-06-01T08:17:45.419759490Z I0601 08:17:45.419748 1 policy.go:1092] Adding network policy allow-from-same-namespace in namespace xxxxx
2022-06-01T08:17:45.422286498Z E0601 08:17:45.420247 1 ovn.go:753] Failed to create network policy xxxxx/allow-from-same-namespace, error: failed to create default port groups and acls for policy: xxxxx/allow-from-same-namespace, error: unexpectedly found multiple equivalent ACLs: [{UUID:c10e465c-4241-4d8c-8f78-7ceb6e38aeaa Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a2533190148303767946_ingressDefaultDeny && arp Meter:0xc0025e1e00 Name:0xc0025e1e10 Options:map[] Priority:1001 Severity:0xc0025e1e20} {UUID:626d3785-8bce-4198-b322-ae747a8c95fb Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a2533190148303767946_ingressDefaultDeny && (arp || nd) Meter:0xc0025e1ed0 Name:0xc0025e1ee0 Options:map[] Priority:1001 Severity:0xc0025e1ef0}]
2022-06-01T08:17:45.422286498Z I0601 08:17:45.420261 1 ovn.go:808] Bootstrapping existing policies and cleaning stale policies took 1.049654614s
2022-06-01T08:17:45.422286498Z I0601 08:17:45.420352 1 policy.go:1174] Deleting network policy allow-from-same-namespace in namespace xxxxx
2022-06-01T08:17:45.422286498Z E0601 08:17:45.420411 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
~~~

It seems like we tried to add network policies which then failed to get created due to another error, and then we tried to delete them; however, they were never added to the nsInfo.networkPolicies cache in the first place! So foundNp yields a nil value and we hit the crash. I'll open a PR to fix this.
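The root cause is visible in miniature with a plain Go map. This is a hypothetical, simplified stand-in for nsInfo.networkPolicies (the real cache lives on the namespace info struct and holds richer values), showing why a lookup for a policy whose creation failed comes back nil:

```go
package main

import "fmt"

// Hypothetical, simplified stand-in for the real networkPolicy type.
type networkPolicy struct{ name string }

func main() {
	// Stand-in for nsInfo.networkPolicies: because creation of the
	// policy failed earlier, it was never inserted into this cache.
	networkPolicies := map[string]*networkPolicy{}

	// A Go map lookup for a missing key returns the zero value of the
	// map's value type; for a pointer type that zero value is nil,
	// which is exactly what foundNp ends up being in the delete path.
	foundNp := networkPolicies["allow-from-same-namespace"]
	fmt.Println(foundNp == nil) // true
}
```

That nil then flows unchecked into destroyNetworkPolicy, where np.Lock() dereferences it and panics.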
Sequence of events:

1) https://github.com/openshift/ovn-kubernetes/blob/92f783c82b360d582753e61856968f25f9bde18d/go-controller/pkg/ovn/ovn.go#L752

That fails:

2) https://github.com/openshift/ovn-kubernetes/blob/92f783c82b360d582753e61856968f25f9bde18d/go-controller/pkg/ovn/ovn.go#L753 at https://github.com/openshift/ovn-kubernetes/blob/92f783c82b360d582753e61856968f25f9bde18d/go-controller/pkg/ovn/policy.go#L1116

~~~
2022-06-01T08:17:44.888114304Z I0601 08:17:44.885468 1 policy.go:1092] Adding network policy allow-from-same-namespace in namespace surya
2022-06-01T08:17:44.888114304Z E0601 08:17:44.885929 1 ovn.go:753] Failed to create network policy surya/allow-from-same-namespace, error: failed to create default port groups and acls for policy: surya/allow-from-same-namespace, error: unexpectedly found multiple equivalent ACLs: [{UUID:3bc36c0e-ee1a-4609-a240-c211f14f379b Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a13985064446031893020_ingressDefaultDeny && arp Meter:0xc002bec640 Name:0xc002bec650 Options:map[] Priority:1001 Severity:0xc002bec660} {UUID:41702278-fb49-4418-abca-6425912d1e63 Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a13985064446031893020_ingressDefaultDeny && (arp || nd) Meter:0xc002bec6e0 Name:0xc002bec6f0 Options:map[] Priority:1001 Severity:0xc002bec700}]
~~~

3) This is being retried for creation but keeps failing because of duplicate ACLs:

~~~
Match:outport == @a13985064446031893020_ingressDefaultDeny && arp
outport == @a13985064446031893020_ingressDefaultDeny && (arp || nd)
~~~

probably caused by an upgrade from a version that didn't have https://github.com/openshift/ovn-kubernetes/commit/6cafe521f1a13b1e10aae7cbedba92e3a2e9c7a7 to one that does; that commit clearly didn't exist in 4.9.31 but exists in 4.9.32, and these issues were reported after the upgrade to 4.9.32.
4) Finally an update to this never-created policy comes in, and we try to delete the oldPolicy and add the newPolicy: https://github.com/openshift/ovn-kubernetes/blob/92f783c82b360d582753e61856968f25f9bde18d/go-controller/pkg/ovn/ovn.go#L776. But the add of the policy wasn't successful, having failed at https://github.com/openshift/ovn-kubernetes/blob/92f783c82b360d582753e61856968f25f9bde18d/go-controller/pkg/ovn/policy.go#L1116, so we never added the policy to the nsInfo.networkPolicies map: https://github.com/openshift/ovn-kubernetes/blob/92f783c82b360d582753e61856968f25f9bde18d/go-controller/pkg/ovn/policy.go#L1143.

5) So every update here triggered a deletion of a nil policy: https://github.com/openshift/ovn-kubernetes/blob/92f783c82b360d582753e61856968f25f9bde18d/go-controller/pkg/ovn/policy.go#L1194 — bam, panic...

Upstream fix posted and in review: https://github.com/ovn-org/ovn-kubernetes/pull/3034

*** Bug 2101366 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399
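The shape of the fix, skipping the teardown when the cached policy is nil, can be sketched as follows. This is a minimal illustration only: the function name, signature, and body are simplified stand-ins, not the actual change; see the upstream PR for the real implementation.

```go
package main

import "fmt"

// Hypothetical, simplified stand-in for the real networkPolicy type.
type networkPolicy struct{ name string }

// deletePolicy sketches the guard pattern: if the policy was never
// created successfully, nothing was cached, so there is nothing to
// tear down, and we return early instead of dereferencing nil.
func deletePolicy(np *networkPolicy) error {
	if np == nil {
		// Policy was never added to the cache; skip the destroy.
		return nil
	}
	// In the real code this is where the locked teardown
	// (destroyNetworkPolicy) would happen.
	fmt.Println("destroying", np.name)
	return nil
}

func main() {
	fmt.Println(deletePolicy(nil))                         // <nil>, no destroy attempted
	fmt.Println(deletePolicy(&networkPolicy{name: "np1"})) // destroys np1, then <nil>
}
```

With this guard, the repeated no-op updates from the reconciler no longer crash the master; the failed creation is still retried separately.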
Created attachment 1884418 [details]
ovnkube-master pod full logs

Description of problem:

After upgrading from v4.9.31 to v4.9.32, the ovnkube-master pods in "openshift-ovn-kubernetes" crash regularly with the following panic messages. Refer to the attachments in private for full pod logs and details.

~~~
$ omg get pod
NAME                  READY  STATUS   RESTARTS  AGE
ovnkube-master-aaaaa  6/6    Running  19        6h19m
ovnkube-master-bbbbb  6/6    Running  8         4h34m
ovnkube-master-ccccc  6/6    Running  17        4h37m

$ omg logs -c ovnkube-master -p ovnkube-master-aaaaa
:
2022-05-27T15:48:27.957882179Z I0527 15:48:27.957839 1 policy.go:1251] Deleting network policy XXXXX in namespace XXXXX
2022-05-27T15:48:27.957950868Z E0527 15:48:27.957928 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
2022-05-27T15:48:27.957950868Z goroutine 132 [running]:
2022-05-27T15:48:27.957950868Z k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1930dc0, 0x2907f50)
2022-05-27T15:48:27.957950868Z /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x95
2022-05-27T15:48:27.957950868Z k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
2022-05-27T15:48:27.957950868Z /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x86
2022-05-27T15:48:27.957950868Z panic(0x1930dc0, 0x2907f50)
2022-05-27T15:48:27.957950868Z /usr/lib/golang/src/runtime/panic.go:965 +0x1b9
2022-05-27T15:48:27.957950868Z github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).destroyNetworkPolicy(0xc002fcc6c0, 0x0, 0xc000fc9901, 0x0, 0x0)
2022-05-27T15:48:27.957950868Z /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/policy.go:1287 +0x55
2022-05-27T15:48:27.957950868Z github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).deleteNetworkPolicy(0xc002fcc6c0, 0xc0006cc580, 0x0, 0x0, 0x0)
2022-05-27T15:48:27.957950868Z /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/policy.go:1275 +0x451
2022-05-27T15:48:27.957950868Z github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).WatchNetworkPolicy.func3(0x1b1eec0, 0xc0006cc580, 0x1b1eec0, 0xc001356c00)
2022-05-27T15:48:27.957950868Z /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/ovn.go:736 +0x116
2022-05-27T15:48:27.957950868Z k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...)
2022-05-27T15:48:27.957950868Z /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:238
2022-05-27T15:48:27.957950868Z k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnUpdate(0xc001c96f00, 0x1db6840, 0xc0020a9cb0, 0x1b1eec0, 0xc0006cc580, 0x1b1eec0, 0xc001356c00)
:
~~~

Version-Release number of selected component (if applicable):
OpenShift Container Platform: v4.9.32

How reproducible:
This issue happened in the customer's environment, so there is no specific reproduction procedure.

Steps to Reproduce:
1.
2.
3.

Actual results:
The ovnkube-master pods in the openshift-ovn-kubernetes namespace crash regularly with panic messages.

Expected results:
No crashes of the ovnkube-master pods; they keep running without any issue.

Additional info: