Description of problem:

CNO crashed using UPI with `networkType: Cilium`.

Version-Release number of selected component (if applicable):

network-operator version 4.9.0-202108191637.p0.git.2547b8f.assembly.stream-2547b8f-2547b8f21304a6fc5b9a32d3341c4c091612940a

How reproducible:

I have reproduced this using a custom setup (UPI in AWS using https://github.com/cilium/openshift-terraform-upi/), but I am pretty sure this is also what's happening in CI here: https://github.com/openshift/release/pull/21293.

Additional info:

$ kubectl --kubeconfig tfc-scripts/test-ocp49fc0-3.kubeconfig logs -n openshift-network-operator network-operator-55759b6967-72bnq -f -p
W0824 09:44:09.240328 1 cmd.go:204] Using insecure, self-signed certificates
I0824 09:44:09.590115 1 observer_polling.go:159] Starting file observer
I0824 09:44:09.615399 1 builder.go:240] network-operator version 4.9.0-202108191637.p0.git.2547b8f.assembly.stream-2547b8f-2547b8f21304a6fc5b9a32d3341c4c091612940a
W0824 09:44:09.933440 1 secure_serving.go:69] Use of insecure cipher 'TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256' detected.
W0824 09:44:09.933457 1 secure_serving.go:69] Use of insecure cipher 'TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256' detected.
I0824 09:44:09.933960 1 leaderelection.go:243] attempting to acquire leader lease openshift-network-operator/network-operator-lock...
I0824 09:44:09.936895 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0824 09:44:09.936908 1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0824 09:44:09.936924 1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0824 09:44:09.936924 1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0824 09:44:09.936981 1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0824 09:44:09.937517 1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0824 09:44:09.938306 1 secure_serving.go:197] Serving securely on [::]:9104
I0824 09:44:09.938351 1 dynamic_serving_content.go:130] Starting serving-cert::/tmp/serving-cert-753254612/tls.crt::/tmp/serving-cert-753254612/tls.key
I0824 09:44:09.938814 1 tlsconfig.go:240] Starting DynamicServingCertificateController
I0824 09:44:10.037732 1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController
I0824 09:44:10.037824 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0824 09:44:10.037825 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0824 09:46:38.274694 1 leaderelection.go:253] successfully acquired lease openshift-network-operator/network-operator-lock
I0824 09:46:38.274886 1 event.go:282] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"openshift-network-operator", Name:"network-operator-lock", UID:"bd30022a-0c4e-43e0-97cb-f970f4f91fa7", APIVersion:"v1", ResourceVersion:"7016", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' ip-10-0-52-221_28b7af0e-afaf-497c-a43f-037e9e0e5a8d became leader
I0824 09:46:38.303038 1 operator.go:65] Adding controller-runtime controllers
I0824 09:46:38.306480 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-network-operator", Name:"network-operator", UID:"dc34e71e-7f4c-4ef6-ad56-f07bca2624bf", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'FastControllerResync' Controller "LoggingSyncer" resync interval is set to 0s which might lead to client request throttling
I0824 09:46:38.306544 1 client.go:146] Starting informers...
I0824 09:46:38.306558 1 client.go:157] Waiting for informers to sync...
I0824 09:46:38.407133 1 client.go:178] Informers started and synced
I0824 09:46:38.407157 1 operator.go:89] Starting controller-manager
I0824 09:46:38.407237 1 base_controller.go:66] Waiting for caches to sync for LoggingSyncer
I0824 09:46:38.407319 1 base_controller.go:72] Caches are synced for LoggingSyncer
I0824 09:46:38.407333 1 base_controller.go:109] Starting #1 worker of LoggingSyncer controller ...
I0824 09:46:38.407381 1 base_controller.go:66] Waiting for caches to sync for ManagementStateController
I0824 09:46:38.407409 1 base_controller.go:72] Caches are synced for ManagementStateController
I0824 09:46:38.407436 1 base_controller.go:109] Starting #1 worker of ManagementStateController controller ...
I0824 09:46:38.407612 1 reflector.go:219] Starting reflector *v1.Deployment (9h38m12.516370409s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.407815 1 reflector.go:219] Starting reflector *v1.OperatorPKI (9h42m51.692172522s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.408146 1 reflector.go:219] Starting reflector *v1.IngressController (10h6m3.679258725s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.408360 1 reflector.go:219] Starting reflector *v1.EgressRouter (10h52m0.392855353s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.408412 1 reflector.go:219] Starting reflector *v1.Proxy (9h11m31.874163869s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.408525 1 reflector.go:219] Starting reflector *v1.CertificateSigningRequest (9h10m48.918786869s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.408728 1 reflector.go:219] Starting reflector *v1.DaemonSet (10h58m44.699357645s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.408820 1 reflector.go:219] Starting reflector *v1.Network (9h10m9.553305611s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.408840 1 log.go:184] Controller "ConnectivityCheckController" resync interval is set to 0s which might lead to client request throttling
I0824 09:46:38.408956 1 reflector.go:219] Starting reflector *v1.ConfigMap (9h40m44.631127996s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.409112 1 reflector.go:219] Starting reflector *v1.Network (9h48m31.108198886s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.409166 1 base_controller.go:66] Waiting for caches to sync for ConnectivityCheckController
I0824 09:46:38.409209 1 reflector.go:219] Starting reflector *v1.CustomResourceDefinition (10m0s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.409284 1 reflector.go:219] Starting reflector *unstructured.Unstructured (12h0m0s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.409346 1 reflector.go:219] Starting reflector *v1.Pod (10m0s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.409415 1 reflector.go:219] Starting reflector *v1.Endpoints (10m0s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.409469 1 reflector.go:219] Starting reflector *v1.Endpoints (10m0s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.409578 1 reflector.go:219] Starting reflector *v1.Infrastructure (10m0s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.409255 1 reflector.go:219] Starting reflector *v1.ClusterVersion (10m0s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.409620 1 reflector.go:219] Starting reflector *v1.Service (10m0s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.409301 1 reflector.go:219] Starting reflector *v1.Service (10m0s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.409726 1 reflector.go:219] Starting reflector *v1.Endpoints (10m0s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.409184 1 reflector.go:219] Starting reflector *v1.Service (10m0s) from k8s.io/client-go.1/tools/cache/reflector.go:167
I0824 09:46:38.509318 1 log.go:184] Reconciling Network.operator.openshift.io cluster
I0824 09:46:38.509431 1 log.go:184] Reconciling Network.config.openshift.io cluster
I0824 09:46:38.509535 1 log.go:184] Reconciling proxy 'cluster'
I0824 09:46:38.509662 1 log.go:184] httpProxy, httpsProxy and noProxy not defined for proxy 'cluster'; validation will be skipped
I0824 09:46:38.509748 1 log.go:184] Detected uplink MTU 9001
I0824 09:46:38.509881 1 log.go:184] Starting render phase
I0824 09:46:38.510299 1 reflector.go:219] Starting reflector *v1.ClusterOperator (9h31m28.143199109s) from k8s.io/client-go.1/tools/cache/reflector.go:167
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x240 pc=0x1c7cd73]

goroutine 777 [running]:
github.com/openshift/cluster-network-operator/pkg/network.Render(0xc0003e5718, 0x0, 0x2697255, 0x9, 0x1, 0x0, 0x0, 0x0, 0x0)
	github.com/openshift/cluster-network-operator/pkg/network/render.go:34 +0xf3
github.com/openshift/cluster-network-operator/pkg/controller/operconfig.(*ReconcileOperConfig).Reconcile(0xc000d332c0, 0x2a2c898, 0xc000a08810, 0x0, 0x0, 0xc00093cca4, 0x7, 0xc000a08810, 0xc00003e000, 0x2477be0, ...)
	github.com/openshift/cluster-network-operator/pkg/controller/operconfig/operconfig_controller.go:210 +0x9e6
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000dbdcc0, 0x2a2c7f0, 0xc0006b4680, 0x23fbe20, 0xc00086a240)
	sigs.k8s.io/controller-runtime.0-beta.0/pkg/internal/controller/controller.go:298 +0x30d
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000dbdcc0, 0x2a2c7f0, 0xc0006b4680, 0xc000c1ef00)
	sigs.k8s.io/controller-runtime.0-beta.0/pkg/internal/controller/controller.go:253 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2(0xc000dbf8e0, 0xc000dbdcc0, 0x2a2c7f0, 0xc0006b4680)
	sigs.k8s.io/controller-runtime.0-beta.0/pkg/internal/controller/controller.go:214 +0x6b
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	sigs.k8s.io/controller-runtime.0-beta.0/pkg/internal/controller/controller.go:210 +0x425

$ kubectl --kubeconfig tfc-scripts/test-ocp49fc0-3.kubeconfig get network.operator cluster -o yaml
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  creationTimestamp: "2021-08-24T09:10:10Z"
  generation: 2
  managedFields: {}
  name: cluster
  resourceVersion: "2753"
  uid: 3ccf4c6b-5f74-423d-baec-2164ad1cd1a5
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  defaultNetwork:
    type: Cilium
  deployKubeProxy: true
  disableNetworkDiagnostics: false
  logLevel: Normal
  managementState: Managed
  observedConfig: null
  operatorLogLevel: Normal
  serviceNetwork:
  - 172.30.0.0/16
  unsupportedConfigOverrides: null
status:
  conditions:
  - lastTransitionTime: "2021-08-24T09:12:13Z"
    status: "False"
    type: ManagementStateDegraded
  readyReplicas: 0
By the way, I set the subcomponent to 'multus' because there was no CNO category and the stack trace points at multus-related code... I think adding a CNO subcomponent would be quite handy.
Reproducible in CI as well - https://github.com/openshift/release/pull/21293
After a quick look at recent changes, I found this:

diff --git a/pkg/network/render.go b/pkg/network/render.go
index 6295d816..ef1b4508 100644
--- a/pkg/network/render.go
+++ b/pkg/network/render.go
@@ -31,7 +31,7 @@ func Render(conf *operv1.NetworkSpec, bootstrapResult *bootstrap.BootstrapResult
 	objs = append(objs, o...)

 	// render MultusAdmissionController
-	o, err = renderMultusAdmissionController(conf, manifestDir)
+	o, err = renderMultusAdmissionController(conf, manifestDir, bootstrapResult.ExternalControlPlane)
 	if err != nil {
 		return nil, err
 	}
@@ -559,7 +559,7 @@ func renderAdditionalNetworks(conf *operv1.NetworkSpec, manifestDir string) ([]*
 }

(see https://github.com/openshift/cluster-network-operator/commit/403c4ff06400be0879c81d310da4600b497159fb#diff-e68f94c90e1c01972187483984b3da57967f404e4e2a85a48433af6ca9d34c8cR34)

The issue is that `bootstrapResult` is passed unchecked. It's possible there are other changes that relate to this; this is just the first one I've noticed that can be connected to the stack trace.
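For illustration, here is a minimal standalone sketch of the failure mode (types heavily simplified; `BootstrapResult` and `render` here are stand-ins, not the operator's actual code):

package main

// BootstrapResult stands in for bootstrap.BootstrapResult; only the field
// dereferenced by the new code path is modeled.
type BootstrapResult struct {
	ExternalControlPlane bool
}

// render mirrors the shape of network.Render: it dereferences the bootstrap
// result without checking it for nil first.
func render(bootstrapResult *BootstrapResult) {
	_ = bootstrapResult.ExternalControlPlane // faults when bootstrapResult == nil
}

func main() {
	var result *BootstrapResult // Bootstrap effectively returned nil, nil
	render(result)              // panic: invalid memory address or nil pointer dereference
}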
Looks like an easy fix: https://github.com/openshift/cluster-network-operator/pull/1188
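For comparison, one possible shape of such a fix, sketched as a runnable toy (I have not verified that this is what the PR actually does; it may instead ensure `Bootstrap` never returns nil):

package main

import (
	"errors"
	"fmt"
)

type BootstrapResult struct {
	ExternalControlPlane bool
}

// render now refuses a nil bootstrap result instead of dereferencing it,
// turning the SIGSEGV into an ordinary, reportable error.
func render(bootstrapResult *BootstrapResult) error {
	if bootstrapResult == nil {
		return errors.New("nil bootstrap result passed to render")
	}
	_ = bootstrapResult.ExternalControlPlane
	return nil
}

func main() {
	if err := render(nil); err != nil {
		fmt.Println("render failed cleanly:", err)
	}
}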
errordeveloper,

I tried to use "networkType: Cilium" in cluster-network-02-config.yml when using openshift-install to install an AWS IPI cluster, and the installation failed; the installation failure log is attached.

So far this is the way QE can try to verify this bug, but I do not think it is the correct way. Could you help to verify this bug? Thanks!
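For reference, the manifest edit looks roughly like this (a sketch assembled from the operator CR dumped above; treat the exact values as assumptions about this particular install, not the file's verbatim contents):

apiVersion: config.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  networkType: Cilium
  serviceNetwork:
  - 172.30.0.0/16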
(In reply to Weibin Liang from comment #8)
> errordeveloper
>
> I tried to use "networkType: Cilium" in cluster-network-02-config.yml when
> using openshift-install to install an AWS IPI cluster, and the installation
> failed; the installation failure log is attached.

You'd also have to add your own CNI provider. See https://github.com/openshift/release/blob/master/ci-operator/step-registry/cilium/conf/cilium-conf-commands.sh
(In reply to Vadim Rutkovsky from comment #10)
> (In reply to Weibin Liang from comment #8)
> > errordeveloper
> >
> > I tried to use "networkType: Cilium" in cluster-network-02-config.yml when
> > using openshift-install to install an AWS IPI cluster, and the installation
> > failed; the installation failure log is attached.
>
> You'd also have to add your own CNI provider. See
> https://github.com/openshift/release/blob/master/ci-operator/step-registry/cilium/conf/cilium-conf-commands.sh

errordeveloper, thanks for your suggestion. I followed the above steps to add the Cilium CNI plugin, but the installation still failed. The failure log (Cilium-log.txt) is attached; please help check whether I am still missing anything in the installation. Thanks!
@Weibin Liang, I think you can reproduce this without installing Cilium as such. Please take a look at the changes in https://github.com/openshift/cluster-network-operator/commit/0d9f70fa1e1af6b18385065f645a778679d7edcc. Prior to 0d9f70fa1, `TestRenderUnknownNetwork` was calling `Render(prev, &bootstrap.BootstrapResult{}, manifestDir)`, and it passed. However, in reality `Bootstrap` returned `nil, nil` for unknown network types, so the effective call was `Render(prev, nil, manifestDir)`, and it panicked. So the fix was to test what `Bootstrap` actually returns. If you change the code to make `Bootstrap` return `nil, nil` again and run the unit tests, the unit test will panic.
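In test form, the regression scenario looks roughly like this (a hypothetical sketch living alongside pkg/network, not the actual `TestRenderUnknownNetwork`; the empty spec and the manifest directory path are placeholders):

import (
	"testing"

	operv1 "github.com/openshift/api/operator/v1"
)

// TestRenderNilBootstrapResult is a hypothetical regression test: it feeds
// Render the nil bootstrap result that Bootstrap used to return for unknown
// network types, and turns the panic into an explicit test failure.
func TestRenderNilBootstrapResult(t *testing.T) {
	defer func() {
		if r := recover(); r != nil {
			t.Fatalf("Render must not panic on a nil bootstrap result: %v", r)
		}
	}()
	prev := &operv1.NetworkSpec{}             // placeholder; the real test builds a full spec
	_, _ = Render(prev, nil, "../../bindata") // manifest dir path is an assumption
}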
> I tried to use "networkType: Cilium" in cluster-network-02-config.yml when using openshift-install to install an AWS IPI cluster, and the installation failed; the installation failure log is attached.
> So far this is the way QE can try to verify this bug, but I do not think it is the correct way.

Cilium requires additional ports to be opened, so an AWS IPI install would need load balancer changes. We use Azure in CI to work around this.

Let's move this to VERIFIED, as https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-azure-cilium/1435320456222609408 is now reaching the test phase again.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759