Bug 2091238 - NetworkPolicies: ovnkube-master pods crashing due to panic: "invalid memory address or nil pointer dereference"
Summary: NetworkPolicies: ovnkube-master pods crashing due to panic: "invalid memory address or nil pointer dereference"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.12.0
Assignee: Surya Seetharaman
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks: 2104717
 
Reported: 2022-05-28 03:54 UTC by Daein Park
Modified: 2023-01-17 19:49 UTC
CC: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: If a network policy does not get created properly due to errors in the creation path, it is not added to the internal OVNK cache. The cached policy is therefore nil, and a subsequent attempt to delete that policy panics on the nil pointer.
Consequence: The OVNK master leader goes into CrashLoopBackOff.
Fix: When a deletion is attempted, check whether the cached policy is nil; if so, skip the deletion, since the policy was never created in the first place.
Result: The OVNK master stays healthy.
Clone Of:
Environment:
Last Closed: 2023-01-17 19:49:26 UTC
Target Upstream Version:
Embargoed:


Attachments: ovnkube-master pod full logs (attachment 1884418, see comment 0)


Links:
- GitHub openshift/ovn-kubernetes pull 1166 (open): Bug 2091238: Fix Panic in Network Policy deletion (last updated 2022-06-28 17:09:31 UTC)
- GitHub ovn-org/ovn-kubernetes pull 3034 (merged): Fix Panic in network policy deletion (last updated 2022-06-27 19:52:08 UTC)
- Red Hat Knowledge Base (Solution) 6962905 (last updated 2022-06-13 09:57:26 UTC)
- Red Hat Product Errata RHSA-2022:7399 (last updated 2023-01-17 19:49:41 UTC)

Description Daein Park 2022-05-28 03:54:16 UTC
Created attachment 1884418 [details]
ovnkube-master pod full logs

Description of problem:

After upgrading from v4.9.31 to v4.9.32, the ovnkube-master pods in the "openshift-ovn-kubernetes" namespace crash regularly with the following panic messages.
Refer to the attachments (marked private) for the full pod logs and details.

~~~
$ omg get pod
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-aaaaa   6/6     Running   19         6h19m
ovnkube-master-bbbbb   6/6     Running   8          4h34m
ovnkube-master-ccccc   6/6     Running   17         4h37m

$ omg logs -c ovnkube-master -p ovnkube-master-aaaaa
:
2022-05-27T15:48:27.957882179Z I0527 15:48:27.957839       1 policy.go:1251] Deleting network policy XXXXX in namespace XXXXX
2022-05-27T15:48:27.957950868Z E0527 15:48:27.957928       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
2022-05-27T15:48:27.957950868Z goroutine 132 [running]:
2022-05-27T15:48:27.957950868Z k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1930dc0, 0x2907f50)
2022-05-27T15:48:27.957950868Z 	/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x95
2022-05-27T15:48:27.957950868Z k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
2022-05-27T15:48:27.957950868Z 	/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x86
2022-05-27T15:48:27.957950868Z panic(0x1930dc0, 0x2907f50)
2022-05-27T15:48:27.957950868Z 	/usr/lib/golang/src/runtime/panic.go:965 +0x1b9
2022-05-27T15:48:27.957950868Z github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).destroyNetworkPolicy(0xc002fcc6c0, 0x0, 0xc000fc9901, 0x0, 0x0)
2022-05-27T15:48:27.957950868Z 	/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/policy.go:1287 +0x55
2022-05-27T15:48:27.957950868Z github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).deleteNetworkPolicy(0xc002fcc6c0, 0xc0006cc580, 0x0, 0x0, 0x0)
2022-05-27T15:48:27.957950868Z 	/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/policy.go:1275 +0x451
2022-05-27T15:48:27.957950868Z github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).WatchNetworkPolicy.func3(0x1b1eec0, 0xc0006cc580, 0x1b1eec0, 0xc001356c00)
2022-05-27T15:48:27.957950868Z 	/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/ovn.go:736 +0x116
2022-05-27T15:48:27.957950868Z k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...)
2022-05-27T15:48:27.957950868Z 	/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:238
2022-05-27T15:48:27.957950868Z k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnUpdate(0xc001c96f00, 0x1db6840, 0xc0020a9cb0, 0x1b1eec0, 0xc0006cc580, 0x1b1eec0, 0xc001356c00)
:
~~~

Version-Release number of selected component (if applicable):

OpenShift Container Platform: v4.9.32

How reproducible:

This issue happened in the customer's environment, so there is no specific reproduction procedure.

Steps to Reproduce:
1.
2.
3.

Actual results:

The ovnkube-master pods in the openshift-ovn-kubernetes namespace crash regularly with panic messages.

Expected results:

The ovnkube-master pods run without crashing or any other issue.

Additional info:

Comment 1 Bram Verschueren 2022-06-01 10:29:14 UTC
This occurs on 4.10.15 as well. Looking at the stack trace from the 4.10 must-gather:

2022-06-01T08:17:45.422286498Z github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).deleteNetworkPolicy(0xc00142eb40, 0xc003d6b580, 0x0)
2022-06-01T08:17:45.422286498Z  /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/policy.go:1198 +0x369
2022-06-01T08:17:45.422286498Z github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).WatchNetworkPolicy.func3({0x1b33da0, 0xc003d6b580}, {0x1b33da0, 0xc002c4c000})
2022-06-01T08:17:45.422286498Z  /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/ovn.go:776 +0x2aa

and comparing this call to deleteNetworkPolicy [1] with the func definition [2]:

> func (oc *Controller) deleteNetworkPolicy(policy *knet.NetworkPolicy, np *networkPolicy) error {

It looks like nil is passed in for the second parameter, the *networkPolicy.

Which is still nil in the next call to destroyNetworkPolicy [3]:

2022-06-01T08:17:45.422286498Z github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).destroyNetworkPolicy(0xc00142eb40, 0x0, 0x0)
2022-06-01T08:17:45.422286498Z  /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/policy.go:1210 +0x5d

> func (oc *Controller) destroyNetworkPolicy(np *networkPolicy, lastPolicy bool) error {

This makes the call to Lock [4] panic.

There's a third-party tool (espejo: https://github.com/vshn/espejo) used to manage network policies in this cluster. This tool does a full reconcile every couple of minutes. During this reconciliation, all NetworkPolicies are updated, but with no actual changes (only the "generation" and "resourceVersion" fields are updated). That may explain why the update handler is called in the first place even though nothing meaningful changed [1]; a sketch of how such no-op updates could be filtered follows the reference links below.

[1] https://github.com/openshift/ovn-kubernetes/blob/9495a1db74e8166298da8a9a392daaee977e6889/go-controller/pkg/ovn/ovn.go#L776
[2] https://github.com/openshift/ovn-kubernetes/blob/9495a1db74e8166298da8a9a392daaee977e6889/go-controller/pkg/ovn/policy.go#L1173
[3] https://github.com/openshift/ovn-kubernetes/blob/9495a1db74e8166298da8a9a392daaee977e6889/go-controller/pkg/ovn/policy.go#L1198
[4] https://github.com/openshift/ovn-kubernetes/blob/9495a1db74e8166298da8a9a392daaee977e6889/go-controller/pkg/ovn/policy.go#L1210
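
As a side note on the update handler behaviour: client-go informers invoke UpdateFunc on periodic resyncs and on writes that only bump bookkeeping metadata, so a reconcile loop like espejo's will keep triggering it. Below is a minimal, hypothetical sketch (not the ovn-kubernetes handler, and not the fix for this bug) of how such no-op updates could be filtered by comparing the spec before tearing anything down:

~~~
package main

import (
	"fmt"
	"reflect"

	networkingv1 "k8s.io/api/networking/v1"
	"k8s.io/client-go/tools/cache"
)

// newPolicyUpdateHandler is a hypothetical handler showing one way to skip
// updates in which nothing but metadata (resourceVersion/generation) changed,
// e.g. the periodic full reconciles done by espejo, before tearing down and
// re-creating any controller state.
func newPolicyUpdateHandler(onSpecChange func(oldNP, newNP *networkingv1.NetworkPolicy)) cache.ResourceEventHandlerFuncs {
	return cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldNP, okOld := oldObj.(*networkingv1.NetworkPolicy)
			newNP, okNew := newObj.(*networkingv1.NetworkPolicy)
			if !okOld || !okNew {
				return
			}
			// UpdateFunc also fires on resyncs and metadata-only writes;
			// only act when the spec actually changed.
			if reflect.DeepEqual(oldNP.Spec, newNP.Spec) {
				return
			}
			onSpecChange(oldNP, newNP)
		},
	}
}

func main() {
	handler := newPolicyUpdateHandler(func(oldNP, newNP *networkingv1.NetworkPolicy) {
		fmt.Printf("policy %s/%s spec changed\n", newNP.Namespace, newNP.Name)
	})
	_ = handler // would normally be passed to informer.AddEventHandler(...)
}
~~~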

Comment 2 Surya Seetharaman 2022-06-09 09:34:56 UTC
Looking at the code and the tracebacks from both 4.9.32 and 4.10.15 in the previous comments, yes, it's clear we pass a nil value as a parameter into:

// destroys a particular network policy
// if nsInfo is provided, the entire port group will be deleted for ingress/egress directions
// lastPolicy indicates if no other policies are using the respective portgroup anymore
func (oc *Controller) destroyNetworkPolicy(np *networkPolicy, lastPolicy bool) error {
	np.Lock()
	defer np.Unlock()
	np.deleted = true
	oc.shutdownHandlers(np)

It seems np.Lock() is where we panic. Looking into why this is a problem during updates.

np comes from:

	// try to use the more official np found in nsInfo
	if foundNp := nsInfo.networkPolicies[policy.Name]; foundNp != nil {
		np = foundNp
	}

I'm wondering if foundNp, and therefore np, could also be nil?

4.9.32: 2022-05-27T15:48:27.957950868Z github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).deleteNetworkPolicy(0xc002fcc6c0, 0xc0006cc580, 0x0, 0x0, 0x0)

4.10.15: 2022-06-01T08:17:45.422286498Z github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).deleteNetworkPolicy(0xc00142eb40, 0xc003d6b580, 0x0)

Yes, that's the case here, so we call:
} else if err := oc.deleteNetworkPolicy(oldPolicy, nil); err != nil {
					oc.initRetryPolicyWithDelete(oldPolicy, nil)
					oc.initRetryPolicy(newPolicy)
					oc.unSkipRetryPolicy(oldPolicy)
					klog.Errorf("Failed to delete network policy %s, during update: %v",
						getPolicyNamespacedName(oldPolicy), err)
					return
				}
in the update path, which means np = nil here: func (oc *Controller) deleteNetworkPolicy(policy *knet.NetworkPolicy, np *networkPolicy) error {
Then inside this function I am guessing foundNp is also nil somehow, not quite sure how yet, but we seem to pass that nil into destroyNetworkPolicy:

4.9.32:
2022-05-27T15:48:27.957950868Z github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).destroyNetworkPolicy(0xc002fcc6c0, 0x0, 0xc000fc9901, 0x0, 0x0)
2022-05-27T15:48:27.957950868Z 	/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/policy.go:1287 +0x55

4.10.15:
2022-06-01T08:17:45.422286498Z github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).destroyNetworkPolicy(0xc00142eb40, 0x0, 0x0)
2022-06-01T08:17:45.422286498Z  /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/policy.go:1210 +0x5d

So calling Lock on a nil policy is what causes the panic.
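
To make the failure mode concrete, here is a minimal standalone program (illustrative names only, not the real ovn-kubernetes types) showing that calling Lock through a nil pointer to a struct embedding sync.Mutex panics with exactly this runtime error:

~~~
package main

import "sync"

// networkPolicy is an illustrative stand-in for the internal ovn-kubernetes
// type: a struct that embeds sync.Mutex, like the np passed to
// destroyNetworkPolicy above.
type networkPolicy struct {
	sync.Mutex
	deleted bool
}

// destroy mimics the first few lines of destroyNetworkPolicy quoted above.
func destroy(np *networkPolicy) {
	np.Lock() // with np == nil this is the panic site
	defer np.Unlock()
	np.deleted = true
}

func main() {
	var np *networkPolicy // nil, as when the policy never made it into the cache
	destroy(np)           // runtime error: invalid memory address or nil pointer dereference
}
~~~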

Comment 3 Surya Seetharaman 2022-06-09 10:15:17 UTC
Update: Bram is collecting a core dump from the cluster for further analysis, because the real question is how:

// try to use the more official np found in nsInfo
	if foundNp := nsInfo.networkPolicies[policy.Name]; foundNp != nil {
		np = foundNp
	}

foundNp could be nil in this case. The policy.Name key from oldPolicy should have yielded a result... right? The core dump will be able to help more here.

Comment 5 Surya Seetharaman 2022-06-10 20:12:03 UTC
Looking at the master logs it is clear what happened and what caused the panic:


2022-06-01T08:17:45.418763259Z I0601 08:17:45.418759       1 policy.go:1092] Adding network policy allow-from-same-namespace in namespace xxxxx
2022-06-01T08:17:45.419281334Z E0601 08:17:45.419267       1 ovn.go:753] Failed to create network policy xxxxx/allow-from-same-namespace, error: failed to create default port groups and acls for policy: xxxxx/allow-from-same-namespace, error: unexpectedly found multiple equivalent ACLs: [{UUID:801a19c3-0464-451d-b9d3-373e7503f6ba Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a11263692294733561997_ingressDefaultDeny && arp Meter:0xc0025e1000 Name:0xc0025e1010 Options:map[] Priority:1001 Severity:0xc0025e1020} {UUID:2b2f0849-0075-4046-b4cf-2dec306c9550 Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a11263692294733561997_ingressDefaultDeny && (arp || nd) Meter:0xc0025e1070 Name:0xc0025e1080 Options:map[] Priority:1001 Severity:0xc0025e1090}]
2022-06-01T08:17:45.419289685Z I0601 08:17:45.419279       1 policy.go:1092] Adding network policy allow-from-other-namespaces in namespace xxxxx
2022-06-01T08:17:45.419750650Z E0601 08:17:45.419739       1 ovn.go:753] Failed to create network policy xxxxx/allow-from-other-namespaces, error: failed to create default port groups and acls for policy: xxxxx/allow-from-other-namespaces, error: unexpectedly found multiple equivalent ACLs: [{UUID:14c169a9-069f-45cd-86d3-3d2038340105 Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a4969899092748592978_ingressDefaultDeny && (arp || nd) Meter:0xc0025e16f0 Name:0xc0025e1700 Options:map[] Priority:1001 Severity:0xc0025e1710} {UUID:00202ce5-2851-426a-ba46-16d4f1b06c41 Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a4969899092748592978_ingressDefaultDeny && arp Meter:0xc0025e1750 Name:0xc0025e1760 Options:map[] Priority:1001 Severity:0xc0025e1770}]
2022-06-01T08:17:45.419759490Z I0601 08:17:45.419748       1 policy.go:1092] Adding network policy allow-from-same-namespace in namespace xxxxx
2022-06-01T08:17:45.422286498Z E0601 08:17:45.420247       1 ovn.go:753] Failed to create network policy xxxxx/allow-from-same-namespace, error: failed to create default port groups and acls for policy: xxxxx/allow-from-same-namespace, error: unexpectedly found multiple equivalent ACLs: [{UUID:c10e465c-4241-4d8c-8f78-7ceb6e38aeaa Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a2533190148303767946_ingressDefaultDeny && arp Meter:0xc0025e1e00 Name:0xc0025e1e10 Options:map[] Priority:1001 Severity:0xc0025e1e20} {UUID:626d3785-8bce-4198-b322-ae747a8c95fb Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a2533190148303767946_ingressDefaultDeny && (arp || nd) Meter:0xc0025e1ed0 Name:0xc0025e1ee0 Options:map[] Priority:1001 Severity:0xc0025e1ef0}]
2022-06-01T08:17:45.422286498Z I0601 08:17:45.420261       1 ovn.go:808] Bootstrapping existing policies and cleaning stale policies took 1.049654614s
2022-06-01T08:17:45.422286498Z I0601 08:17:45.420352       1 policy.go:1174] Deleting network policy allow-from-same-namespace in namespace xxxxx
2022-06-01T08:17:45.422286498Z E0601 08:17:45.420411       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)

It seems we tried to add network policies which then failed to get created due to another error, and we then tried to delete them; however, they were never added to the nsInfo.networkPolicies cache in the first place! So foundNp yields a nil value and we hit the crash. I'll open a PR to fix this.
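
For illustration, a minimal sketch of the shape of the fix described here, with simplified, hypothetical types and names (the actual change is in the PRs linked in this bug): fall back to the cache entry, and skip the teardown entirely when the policy was never successfully created.

~~~
package main

import (
	"fmt"
	"sync"
)

// Illustrative, simplified types; not the real ovn-kubernetes structures.
type networkPolicy struct {
	sync.Mutex
	deleted bool
}

type namespaceInfo struct {
	networkPolicies map[string]*networkPolicy
}

// deletePolicy sketches the shape of the fix: prefer the cached entry, and if
// the policy was never added to the cache (its creation failed earlier), skip
// the teardown instead of dereferencing nil.
func deletePolicy(nsInfo *namespaceInfo, name string, np *networkPolicy) error {
	if foundNp := nsInfo.networkPolicies[name]; foundNp != nil {
		np = foundNp
	}
	if np == nil {
		// Creation never completed, so there is nothing to destroy.
		fmt.Printf("policy %s not found in cache, skipping deletion\n", name)
		return nil
	}
	np.Lock()
	defer np.Unlock()
	np.deleted = true
	delete(nsInfo.networkPolicies, name)
	return nil
}

func main() {
	nsInfo := &namespaceInfo{networkPolicies: map[string]*networkPolicy{}}
	// The policy never got created successfully, so the cache is empty;
	// without the nil check this call would panic on np.Lock().
	_ = deletePolicy(nsInfo, "allow-from-same-namespace", nil)
}
~~~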

Comment 6 Surya Seetharaman 2022-06-10 21:48:12 UTC
Sequence of events:

1) https://github.com/openshift/ovn-kubernetes/blob/92f783c82b360d582753e61856968f25f9bde18d/go-controller/pkg/ovn/ovn.go#L752
That fails:
2) https://github.com/openshift/ovn-kubernetes/blob/92f783c82b360d582753e61856968f25f9bde18d/go-controller/pkg/ovn/ovn.go#L753
at https://github.com/openshift/ovn-kubernetes/blob/92f783c82b360d582753e61856968f25f9bde18d/go-controller/pkg/ovn/policy.go#L1116

2022-06-01T08:17:44.888114304Z I0601 08:17:44.885468       1 policy.go:1092] Adding network policy allow-from-same-namespace in namespace surya
2022-06-01T08:17:44.888114304Z E0601 08:17:44.885929       1 ovn.go:753] Failed to create network policy surya/allow-from-same-namespace, error: failed to create default port groups and acls for policy: surya/allow-from-same-namespace, error: unexpectedly found multiple equivalent ACLs: [{UUID:3bc36c0e-ee1a-4609-a240-c211f14f379b Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a13985064446031893020_ingressDefaultDeny && arp Meter:0xc002bec640 Name:0xc002bec650 Options:map[] Priority:1001 Severity:0xc002bec660} {UUID:41702278-fb49-4418-abca-6425912d1e63 Action:allow Direction:to-lport ExternalIDs:map[default-deny-policy-type:Ingress] Label:0 Log:false Match:outport == @a13985064446031893020_ingressDefaultDeny && (arp || nd) Meter:0xc002bec6e0 Name:0xc002bec6f0 Options:map[] Priority:1001 Severity:0xc002bec700}]

3) This is retried for creation but keeps failing because of duplicate ACLs:

Match:outport == @a13985064446031893020_ingressDefaultDeny && arp
outport == @a13985064446031893020_ingressDefaultDeny && (arp || nd)

This is probably caused by an upgrade from a version that didn't have https://github.com/openshift/ovn-kubernetes/commit/6cafe521f1a13b1e10aae7cbedba92e3a2e9c7a7 to one that does: that change didn't exist in 4.9.31 but exists in 4.9.32, and these issues were reported after upgrading to 4.9.32.

4) Finally, an update to this never-created policy comes in, and we try to delete the oldPolicy and add the newPolicy: https://github.com/openshift/ovn-kubernetes/blob/92f783c82b360d582753e61856968f25f9bde18d/go-controller/pkg/ovn/ovn.go#L776

But the problem is that the add of the policy wasn't successful; it failed at https://github.com/openshift/ovn-kubernetes/blob/92f783c82b360d582753e61856968f25f9bde18d/go-controller/pkg/ovn/policy.go#L1116, so we never added the policy to the nsInfo.networkPolicies map: https://github.com/openshift/ovn-kubernetes/blob/92f783c82b360d582753e61856968f25f9bde18d/go-controller/pkg/ovn/policy.go#L1143.

5) So every update here triggered a deletion of a nil policy (a compact end-to-end sketch follows below): https://github.com/openshift/ovn-kubernetes/blob/92f783c82b360d582753e61856968f25f9bde18d/go-controller/pkg/ovn/policy.go#L1194

bamm panic...
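
A compact end-to-end sketch of steps 1 through 5 above, with illustrative types and names rather than the real ovn-kubernetes code: the add fails, the cache stays empty, and the update-driven deletion then dereferences a nil policy.

~~~
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Illustrative types and names only; not the real ovn-kubernetes code.
type networkPolicy struct {
	sync.Mutex
	deleted bool
}

type controller struct {
	networkPolicies map[string]*networkPolicy // stand-in for nsInfo.networkPolicies
}

// addNetworkPolicy fails, as in step 2 (the duplicate-ACL error), so the
// cache never gets populated (step 4).
func (c *controller) addNetworkPolicy(name string) error {
	return errors.New("unexpectedly found multiple equivalent ACLs")
}

// deleteNetworkPolicy is reached from the update handler with np == nil
// (step 5); the cache lookup also misses, so np.Lock() panics.
func (c *controller) deleteNetworkPolicy(name string, np *networkPolicy) {
	if found := c.networkPolicies[name]; found != nil {
		np = found
	}
	np.Lock() // panic: invalid memory address or nil pointer dereference
	defer np.Unlock()
	np.deleted = true
}

func main() {
	c := &controller{networkPolicies: map[string]*networkPolicy{}}
	if err := c.addNetworkPolicy("allow-from-same-namespace"); err != nil {
		fmt.Println("add failed:", err) // step 2: policy never cached
	}
	c.deleteNetworkPolicy("allow-from-same-namespace", nil) // step 5
}
~~~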

Comment 8 Surya Seetharaman 2022-06-14 17:28:05 UTC
Upstream fix posted and in review: https://github.com/ovn-org/ovn-kubernetes/pull/3034

Comment 12 Andreas Karis 2022-06-27 12:30:16 UTC
*** Bug 2101366 has been marked as a duplicate of this bug. ***

Comment 20 errata-xmlrpc 2023-01-17 19:49:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

