Bug 1997050 - CNO panic: runtime error: invalid memory address or nil pointer dereference
Summary: CNO panic: runtime error: invalid memory address or nil pointer dereference
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.9
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.9.0
Assignee: Douglas Smith
QA Contact: Weibin Liang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-08-24 10:20 UTC by errordeveloper@gmail.com
Modified: 2021-10-18 17:48 UTC
CC: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:48:10 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
Github openshift cluster-network-operator pull 1188 (last updated 2021-08-25 09:29:32 UTC)
Red Hat Product Errata RHSA-2021:3759 (last updated 2021-10-18 17:48:37 UTC)

Description errordeveloper@gmail.com 2021-08-24 10:20:22 UTC
Description of problem:

CNO crashed using UPI with `networkType: Cilium`.

Version-Release number of selected component (if applicable):

network-operator version 4.9.0-202108191637.p0.git.2547b8f.assembly.stream-2547b8f-2547b8f21304a6fc5b9a32d3341c4c091612940a

How reproducible:


I have reproduced this using a custom setup (UPI in AWS using https://github.com/cilium/openshift-terraform-upi/), but I am pretty sure this is also what's happening in CI here:
https://github.com/openshift/release/pull/21293.

Additional info:


$ kubectl --kubeconfig tfc-scripts/test-ocp49fc0-3.kubeconfig logs -n openshift-network-operator network-operator-55759b6967-72bnq -f -p
W0824 09:44:09.240328       1 cmd.go:204] Using insecure, self-signed certificates
I0824 09:44:09.590115       1 observer_polling.go:159] Starting file observer
I0824 09:44:09.615399       1 builder.go:240] network-operator version 4.9.0-202108191637.p0.git.2547b8f.assembly.stream-2547b8f-2547b8f21304a6fc5b9a32d3341c4c091612940a
W0824 09:44:09.933440       1 secure_serving.go:69] Use of insecure cipher 'TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256' detected.
W0824 09:44:09.933457       1 secure_serving.go:69] Use of insecure cipher 'TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256' detected.
I0824 09:44:09.933960       1 leaderelection.go:243] attempting to acquire leader lease openshift-network-operator/network-operator-lock...
I0824 09:44:09.936895       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0824 09:44:09.936908       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0824 09:44:09.936924       1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0824 09:44:09.936924       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0824 09:44:09.936981       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0824 09:44:09.937517       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0824 09:44:09.938306       1 secure_serving.go:197] Serving securely on [::]:9104
I0824 09:44:09.938351       1 dynamic_serving_content.go:130] Starting serving-cert::/tmp/serving-cert-753254612/tls.crt::/tmp/serving-cert-753254612/tls.key
I0824 09:44:09.938814       1 tlsconfig.go:240] Starting DynamicServingCertificateController
I0824 09:44:10.037732       1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController 
I0824 09:44:10.037824       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file 
I0824 09:44:10.037825       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file 
I0824 09:46:38.274694       1 leaderelection.go:253] successfully acquired lease openshift-network-operator/network-operator-lock
I0824 09:46:38.274886       1 event.go:282] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"openshift-network-operator", Name:"network-operator-lock", UID:"bd30022a-0c4e-43e0-97cb-f970f4f91fa7", APIVersion:"v1", ResourceVersion:"7016", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' ip-10-0-52-221_28b7af0e-afaf-497c-a43f-037e9e0e5a8d became leader
I0824 09:46:38.303038       1 operator.go:65] Adding controller-runtime controllers
I0824 09:46:38.306480       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-network-operator", Name:"network-operator", UID:"dc34e71e-7f4c-4ef6-ad56-f07bca2624bf", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'FastControllerResync' Controller "LoggingSyncer" resync interval is set to 0s which might lead to client request throttling
I0824 09:46:38.306544       1 client.go:146] Starting informers...
I0824 09:46:38.306558       1 client.go:157] Waiting for informers to sync...
I0824 09:46:38.407133       1 client.go:178] Informers started and synced
I0824 09:46:38.407157       1 operator.go:89] Starting controller-manager
I0824 09:46:38.407237       1 base_controller.go:66] Waiting for caches to sync for LoggingSyncer
I0824 09:46:38.407319       1 base_controller.go:72] Caches are synced for LoggingSyncer 
I0824 09:46:38.407333       1 base_controller.go:109] Starting #1 worker of LoggingSyncer controller ...
I0824 09:46:38.407381       1 base_controller.go:66] Waiting for caches to sync for ManagementStateController
I0824 09:46:38.407409       1 base_controller.go:72] Caches are synced for ManagementStateController 
I0824 09:46:38.407436       1 base_controller.go:109] Starting #1 worker of ManagementStateController controller ...
I0824 09:46:38.407612       1 reflector.go:219] Starting reflector *v1.Deployment (9h38m12.516370409s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.407815       1 reflector.go:219] Starting reflector *v1.OperatorPKI (9h42m51.692172522s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.408146       1 reflector.go:219] Starting reflector *v1.IngressController (10h6m3.679258725s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.408360       1 reflector.go:219] Starting reflector *v1.EgressRouter (10h52m0.392855353s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.408412       1 reflector.go:219] Starting reflector *v1.Proxy (9h11m31.874163869s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.408525       1 reflector.go:219] Starting reflector *v1.CertificateSigningRequest (9h10m48.918786869s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.408728       1 reflector.go:219] Starting reflector *v1.DaemonSet (10h58m44.699357645s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.408820       1 reflector.go:219] Starting reflector *v1.Network (9h10m9.553305611s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.408840       1 log.go:184] Controller "ConnectivityCheckController" resync interval is set to 0s which might lead to client request throttling
I0824 09:46:38.408956       1 reflector.go:219] Starting reflector *v1.ConfigMap (9h40m44.631127996s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.409112       1 reflector.go:219] Starting reflector *v1.Network (9h48m31.108198886s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.409166       1 base_controller.go:66] Waiting for caches to sync for ConnectivityCheckController
I0824 09:46:38.409209       1 reflector.go:219] Starting reflector *v1.CustomResourceDefinition (10m0s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.409284       1 reflector.go:219] Starting reflector *unstructured.Unstructured (12h0m0s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.409346       1 reflector.go:219] Starting reflector *v1.Pod (10m0s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.409415       1 reflector.go:219] Starting reflector *v1.Endpoints (10m0s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.409469       1 reflector.go:219] Starting reflector *v1.Endpoints (10m0s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.409578       1 reflector.go:219] Starting reflector *v1.Infrastructure (10m0s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.409255       1 reflector.go:219] Starting reflector *v1.ClusterVersion (10m0s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.409620       1 reflector.go:219] Starting reflector *v1.Service (10m0s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.409301       1 reflector.go:219] Starting reflector *v1.Service (10m0s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.409726       1 reflector.go:219] Starting reflector *v1.Endpoints (10m0s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.409184       1 reflector.go:219] Starting reflector *v1.Service (10m0s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.509318       1 log.go:184] Reconciling Network.operator.openshift.io cluster
I0824 09:46:38.509431       1 log.go:184] Reconciling Network.config.openshift.io cluster
I0824 09:46:38.509535       1 log.go:184] Reconciling proxy 'cluster'
I0824 09:46:38.509662       1 log.go:184] httpProxy, httpsProxy and noProxy not defined for proxy 'cluster'; validation will be skipped
I0824 09:46:38.509748       1 log.go:184] Detected uplink MTU 9001
I0824 09:46:38.509881       1 log.go:184] Starting render phase
I0824 09:46:38.510299       1 reflector.go:219] Starting reflector *v1.ClusterOperator (9h31m28.143199109s) from k8s.io/client-go.1/tools/cache/reflector.go:167
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x240 pc=0x1c7cd73]

goroutine 777 [running]:
github.com/openshift/cluster-network-operator/pkg/network.Render(0xc0003e5718, 0x0, 0x2697255, 0x9, 0x1, 0x0, 0x0, 0x0, 0x0)
	github.com/openshift/cluster-network-operator/pkg/network/render.go:34 +0xf3
github.com/openshift/cluster-network-operator/pkg/controller/operconfig.(*ReconcileOperConfig).Reconcile(0xc000d332c0, 0x2a2c898, 0xc000a08810, 0x0, 0x0, 0xc00093cca4, 0x7, 0xc000a08810, 0xc00003e000, 0x2477be0, ...)
	github.com/openshift/cluster-network-operator/pkg/controller/operconfig/operconfig_controller.go:210 +0x9e6
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000dbdcc0, 0x2a2c7f0, 0xc0006b4680, 0x23fbe20, 0xc00086a240)
	sigs.k8s.io/controller-runtime.0-beta.0/pkg/internal/controller/controller.go:298 +0x30d
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000dbdcc0, 0x2a2c7f0, 0xc0006b4680, 0xc000c1ef00)
	sigs.k8s.io/controller-runtime.0-beta.0/pkg/internal/controller/controller.go:253 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2(0xc000dbf8e0, 0xc000dbdcc0, 0x2a2c7f0, 0xc0006b4680)
	sigs.k8s.io/controller-runtime.0-beta.0/pkg/internal/controller/controller.go:214 +0x6b
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	sigs.k8s.io/controller-runtime.0-beta.0/pkg/internal/controller/controller.go:210 +0x425

$ kubectl --kubeconfig tfc-scripts/test-ocp49fc0-3.kubeconfig get network.operator cluster -o yaml
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  creationTimestamp: "2021-08-24T09:10:10Z"
  generation: 2
  managedFields: { }
  name: cluster
  resourceVersion: "2753"
  uid: 3ccf4c6b-5f74-423d-baec-2164ad1cd1a5
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  defaultNetwork:
    type: Cilium
  deployKubeProxy: true
  disableNetworkDiagnostics: false
  logLevel: Normal
  managementState: Managed
  observedConfig: null
  operatorLogLevel: Normal
  serviceNetwork:
  - 172.30.0.0/16
  unsupportedConfigOverrides: null
status:
  conditions:
  - lastTransitionTime: "2021-08-24T09:12:13Z"
    status: "False"
    type: ManagementStateDegraded
  readyReplicas: 0

Comment 1 errordeveloper@gmail.com 2021-08-24 10:24:41 UTC
By the way, I set the subcomponent to 'multus' because there was no CNO category and the stack trace points at multus-related code... I think adding a CNO subcomponent would be quite handy.

Comment 2 Vadim Rutkovsky 2021-08-24 13:28:37 UTC
Reproducible in CI as well - https://github.com/openshift/release/pull/21293

Comment 3 errordeveloper@gmail.com 2021-08-25 08:26:49 UTC
After a quick look at recent changes, I found this:


diff --git a/pkg/network/render.go b/pkg/network/render.go
index 6295d816..ef1b4508 100644
--- a/pkg/network/render.go
+++ b/pkg/network/render.go
@@ -31,7 +31,7 @@ func Render(conf *operv1.NetworkSpec, bootstrapResult *bootstrap.BootstrapResult
        objs = append(objs, o...)
 
        // render MultusAdmissionController
-       o, err = renderMultusAdmissionController(conf, manifestDir)
+       o, err = renderMultusAdmissionController(conf, manifestDir, bootstrapResult.ExternalControlPlane)
        if err != nil {
                return nil, err
        }
@@ -559,7 +559,7 @@ func renderAdditionalNetworks(conf *operv1.NetworkSpec, manifestDir string) ([]*
 }



(see https://github.com/openshift/cluster-network-operator/commit/403c4ff06400be0879c81d310da4600b497159fb#diff-e68f94c90e1c01972187483984b3da57967f404e4e2a85a48433af6ca9d34c8cR34)

The issue is that `bootstrapResult` is passed unchecked. It's possible there are other changes related to this;
this is just the first one I've noticed that can be connected to the stack trace.
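
To make the failure mode concrete, here is a minimal, self-contained Go sketch (hypothetical names throughout; the real types live in pkg/bootstrap and pkg/network). Reading any field through a nil *BootstrapResult is exactly the SIGSEGV in the trace above:

package main

import "fmt"

// BootstrapResult stands in for bootstrap.BootstrapResult; only the field
// touched at render.go:34 is modeled here.
type BootstrapResult struct {
	ExternalControlPlane bool
}

// render mirrors the shape of network.Render: it reads a field off the
// pointer without a nil check first.
func render(br *BootstrapResult) {
	// When br is nil, br.ExternalControlPlane dereferences a nil pointer,
	// producing the "invalid memory address" panic from the log above.
	fmt.Println("external control plane:", br.ExternalControlPlane)
}

func main() {
	render(&BootstrapResult{}) // fine: zero-value struct, prints false
	render(nil)                // panics, as in the reported crash
}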

Comment 4 errordeveloper@gmail.com 2021-08-25 09:22:30 UTC
Looks like an easy fix: https://github.com/openshift/cluster-network-operator/pull/1188
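
For reference, one defensive shape such a fix can take (illustrative only, reusing the BootstrapResult stand-in from the sketch in comment 3 plus the stdlib "errors" package; see the PR itself for the actual change) is to reject a nil result before any field access:

// Variant of render that surfaces an ordinary reconcile error instead of
// panicking when the bootstrap result is missing.
func render(br *BootstrapResult) error {
	if br == nil {
		return errors.New("bootstrap result is nil")
	}
	_ = br.ExternalControlPlane // safe past this point
	return nil
}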

Comment 8 Weibin Liang 2021-09-03 19:24:59 UTC
errordeveloper,

I tried to use "networkType: Cilium" in cluster-network-02-config.yml when using openshift-install to install an AWS IPI cluster, and the installation failed; the failure log is attached.

So far this is the only way QE can try to verify this bug, but I do not think it is the correct approach.

Could you help verify this bug? Thanks!

Comment 10 Vadim Rutkovsky 2021-09-06 09:07:56 UTC
(In reply to Weibin Liang from comment #8)
> errordeveloper 
> 
> I tried to use "networkType: Cilium" in cluster-network-02-config.yml when I
> use openshift-install to install a AWS IPI cluster, and installation failed,
> the installation failed log  is attached. 

You'd also have to add your own CNI provider. See https://github.com/openshift/release/blob/master/ci-operator/step-registry/cilium/conf/cilium-conf-commands.sh

Comment 12 Weibin Liang 2021-09-07 20:54:36 UTC
(In reply to Vadim Rutkovsky from comment #10)
> (In reply to Weibin Liang from comment #8)
> > errordeveloper 
> > 
> > I tried to use "networkType: Cilium" in cluster-network-02-config.yml when I
> > use openshift-install to install a AWS IPI cluster, and installation failed,
> > the installation failed log  is attached. 
> 
> You'd also have to add your own CNI provider. See
> https://github.com/openshift/release/blob/master/ci-operator/step-registry/
> cilium/conf/cilium-conf-commands.sh

errordeveloper,
Thanks for your suggestion. I followed the above steps to add the Cilium CNI plugin, but the installation still failed.
The failure log (Cilium-log.txt) is attached; please help check whether I am still missing anything in the installation, thanks!

Comment 14 errordeveloper@gmail.com 2021-09-09 12:52:46 UTC
@

Comment 15 errordeveloper@gmail.com 2021-09-09 12:59:54 UTC
Weibin Liang, I think you can reproduce this without installing Cilium as such. Please take a look at the changes in https://github.com/openshift/cluster-network-operator/commit/0d9f70fa1e1af6b18385065f645a778679d7edcc.
Prior to 0d9f70fa1, `TestRenderUnknownNetwork` was calling `Render(prev, &bootstrap.BootstrapResult{}, manifestDir)`, and it passed.
However, in reality `Bootstrap` returned `nil, nil` for unknown network types, so the real call was effectively `Render(prev, nil, manifestDir)`, and it panicked. So the fix was to test against what `Bootstrap` actually returns. If you change the code to make `Bootstrap` return `nil, nil` again and run the unit tests, the test will panic.
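
To illustrate that test-wiring point, a sketch using the same hypothetical render and BootstrapResult stand-ins from the comment 3 example, plus the standard "testing" package: when the test renders from whatever Bootstrap actually returns, reintroducing the nil, nil behavior makes the unit test panic instead of a live cluster.

// bootstrap mirrors the pre-fix behavior: network types the operator does
// not recognize returned nil, nil.
func bootstrap(networkType string) (*BootstrapResult, error) {
	switch networkType {
	case "OpenShiftSDN", "OVNKubernetes":
		return &BootstrapResult{}, nil
	default:
		return nil, nil // pre-fix behavior for e.g. Cilium
	}
}

func TestRenderUnknownNetwork(t *testing.T) {
	// Old form: a hand-built empty struct, so the nil path was never hit.
	render(&BootstrapResult{})

	// New form: render from Bootstrap's actual return value. While
	// bootstrap still returns nil, nil, the render call below panics,
	// failing the test, as comment 15 describes.
	br, err := bootstrap("Cilium")
	if err != nil {
		t.Fatal(err)
	}
	render(br)
}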

Comment 16 Vadim Rutkovsky 2021-09-09 13:05:47 UTC
>I tried to use "networkType: Cilium" in cluster-network-02-config.yml when I use openshift-install to install a AWS IPI cluster, and installation failed, the installation failed log  is attached. 
> So far this is the way QE can try to verify this bug, but I do not think it is a correct way. 

Cilium requires additional ports to be opened, so an AWS IPI install would need LB changes. We use Azure in CI to work around this.

Let's move this to VERIFIED, as https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-azure-cilium/1435320456222609408 is now reaching the test phase again.

Comment 18 errata-xmlrpc 2021-10-18 17:48:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

