Description of problem:

After an upgrade from 4.1.23 -> 4.2.4, the network operator is found to be crashlooping.

Version-Release number of selected component (if applicable):

OCP 4.2.4

How reproducible:

Observed on the vSphere platform.

Steps to Reproduce:
1. oc edit infrastructure cluster
2. Delete status.platformStatus
3. Update the proxy configuration
4. openshift-network-operator begins crashlooping with the stack trace below

Actual results:

2019/11/16 22:32:36 Go Version: go1.11.13
2019/11/16 22:32:36 Go OS/Arch: linux/amd64
2019/11/16 22:32:36 operator-sdk Version: v0.4.1
2019/11/16 22:32:36 overriding kubernetes api to https://api-int.eng.openshift.tcc.etn.com:6443
2019/11/16 22:32:37 Registering Components.
2019/11/16 22:32:37 Configuring Controllers
2019/11/16 22:32:37 Starting the Cmd.
2019/11/16 22:32:37 Reconciling Network.config.openshift.io cluster
2019/11/16 22:32:37 Reconciling Network.operator.openshift.io cluster
2019/11/16 22:32:37 Reconciling update for openshift-service-ca from openshift-network-operator/cluster
2019/11/16 22:32:37 Reconciling proxy 'cluster'
2019/11/16 22:32:37 Reconciling configmap from v4-0-config-system-trusted-ca-bundle/openshift-authentication
2019/11/16 22:32:37 Starting render phase
2019/11/16 22:32:37 Render phase done, rendered 35 objects
2019/11/16 22:32:37 Updated ClusterOperator with conditions:
- lastTransitionTime: "2019-09-18T21:59:41Z"
  status: "False"
  type: Degraded
- lastTransitionTime: "2019-11-15T06:45:51Z"
  status: "False"
  type: Progressing
- lastTransitionTime: "2019-08-15T14:39:42Z"
  status: "True"
  type: Available
- lastTransitionTime: "2019-11-15T05:35:58Z"
  status: "True"
  type: Upgradeable
2019/11/16 22:32:37 Reconciling configmap from trusted-ca-bundle/openshift-kube-apiserver
2019/11/16 22:32:37 reconciling (/v1, Kind=ConfigMap) openshift-network-operator/applied-cluster
2019/11/16 22:32:37 update was successful
2019/11/16 22:32:37 reconciling (apiextensions.k8s.io/v1beta1, Kind=CustomResourceDefinition) /network-attachment-definitions.k8s.cni.cncf.io
2019/11/16 22:32:37 update was successful
2019/11/16 22:32:37 reconciling (/v1, Kind=Namespace) /openshift-multus
2019/11/16 22:32:37 update was successful
2019/11/16 22:32:37 reconciling (rbac.authorization.k8s.io/v1, Kind=ClusterRole) /multus
2019/11/16 22:32:37 Updated ClusterOperator with conditions:
- lastTransitionTime: "2019-09-18T21:59:41Z"
  status: "False"
  type: Degraded
- lastTransitionTime: "2019-11-15T06:45:51Z"
  status: "False"
  type: Progressing
- lastTransitionTime: "2019-08-15T14:39:42Z"
  status: "True"
  type: Available
- lastTransitionTime: "2019-11-15T05:35:58Z"
  status: "True"
  type: Upgradeable
2019/11/16 22:32:37 Reconciling configmap from telemeter-trusted-ca-bundle/openshift-monitoring
2019/11/16 22:32:37 update was successful
2019/11/16 22:32:37 reconciling (/v1, Kind=ServiceAccount) openshift-multus/multus
2019/11/16 22:32:37 update was successful
2019/11/16 22:32:37 reconciling (rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding) /multus
2019/11/16 22:32:37 Reconciling configmap from openshift-global-ca/openshift-controller-manager
2019/11/16 22:32:37 Reconciling configmap from trusted-ca-bundle/openshift-apiserver-operator
2019/11/16 22:32:37 Reconciling configmap from trusted-ca/openshift-image-registry
2019/11/16 22:32:37 Reconciling configmap from trusted-ca-bundle/openshift-authentication-operator
2019/11/16 22:32:37 Reconciling configmap from trusted-ca-bundle/openshift-console
2019/11/16 22:32:37 Reconciling configmap from marketplace-trusted-ca/openshift-marketplace
2019/11/16 22:32:37 Reconciling configmap from trusted-ca-bundle/openshift-insights
2019/11/16 22:32:37 Reconciling configmap from trusted-ca-bundle/openshift-service-catalog-controller-manager
2019/11/16 22:32:37 Reconciling configmap from trusted-ca-bundle/openshift-config-managed
2019/11/16 22:32:37 trusted-ca-bundle changed, updating 13 configMaps
2019/11/16 22:32:37 Reconciling configmap from alertmanager-trusted-ca-bundle/openshift-monitoring
2019/11/16 22:32:37 Reconciling configmap from cco-trusted-ca/openshift-cloud-credential-operator
E1116 22:32:38.065543       1 runtime.go:66] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/go/src/github.com/openshift/cluster-network-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
/go/src/github.com/openshift/cluster-network-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/openshift/cluster-network-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/asm_amd64.s:522
/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/panic.go:513
/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/panic.go:82
/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/signal_unix.go:390
/go/src/github.com/openshift/cluster-network-operator/pkg/util/proxyconfig/no_proxy.go:75
/go/src/github.com/openshift/cluster-network-operator/pkg/controller/proxyconfig/status.go:24
/go/src/github.com/openshift/cluster-network-operator/pkg/controller/proxyconfig/controller.go:195
/go/src/github.com/openshift/cluster-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:213
/go/src/github.com/openshift/cluster-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158
/go/src/github.com/openshift/cluster-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/go/src/github.com/openshift/cluster-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/go/src/github.com/openshift/cluster-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/asm_amd64.s:1333
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1144982]

goroutine 451 [running]:
github.com/openshift/cluster-network-operator/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/cluster-network-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x108
panic(0x12e0d60, 0x22aa360)
	/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/panic.go:513 +0x1b9
github.com/openshift/cluster-network-operator/pkg/util/proxyconfig.MergeUserSystemNoProxy(0xc0006761a0, 0xc0004b9e00, 0xc0006764e0, 0xc001eb07e0, 0x9, 0xc0020b5168, 0xeff99e, 0xc0007022f0)
	/go/src/github.com/openshift/cluster-network-operator/pkg/util/proxyconfig/no_proxy.go:75 +0x422
github.com/openshift/cluster-network-operator/pkg/controller/proxyconfig.(*ReconcileProxyConfig).syncProxyStatus(0xc000290f00, 0xc0006761a0, 0xc0004b9e00, 0xc0006764e0, 0xc001eb07e0, 0x147d8b8, 0x11)
	/go/src/github.com/openshift/cluster-network-operator/pkg/controller/proxyconfig/status.go:24 +0x3e8
github.com/openshift/cluster-network-operator/pkg/controller/proxyconfig.(*ReconcileProxyConfig).Reconcile(0xc000290f00, 0x0, 0x0, 0xc00038d520, 0x7, 0x22c1620, 0x49, 0x4057bb, 0xc00008b1a0)
	/go/src/github.com/openshift/cluster-network-operator/pkg/controller/proxyconfig/controller.go:195 +0x35bb
github.com/openshift/cluster-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000540000, 0x0)
	/go/src/github.com/openshift/cluster-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:213 +0x1d3
github.com/openshift/cluster-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1()
	/go/src/github.com/openshift/cluster-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158 +0x36
github.com/openshift/cluster-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000028890)
	/go/src/github.com/openshift/cluster-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
github.com/openshift/cluster-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000028890, 0x3b9aca00, 0x0, 0x1, 0xc00008a1e0)
	/go/src/github.com/openshift/cluster-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbe
github.com/openshift/cluster-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc000028890, 0x3b9aca00, 0xc00008a1e0)
	/go/src/github.com/openshift/cluster-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by github.com/openshift/cluster-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
	/go/src/github.com/openshift/cluster-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:157 +0x32a

Expected results:

Network operator should reconcile and apply proxy settings.

Additional info:

I noticed that this was working on a scratch bare-metal installation, so I copied its settings to my real test environment on vSphere. I was able to get the operator to reconcile by doing "oc edit infrastructure cluster" and copying the following from my bare-metal installation:

status:
  platformStatus:
    type: None

With that added, the network reconciliation was able to complete. I don't know what sort of breakage may result from this change, so I then removed platformStatus, leaving:

status:
  apiServerInternalURI: https://api-int.eng.openshift.<domain>:6443
  apiServerURL: https://api.eng.openshift.<domain>:6443
  etcdDiscoveryDomain: eng.openshift.<domain>
  infrastructureName: eng-vv66d
  platform: VSphere

After making another test change to the proxy configuration, the network operator again began crashlooping.
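To make the failure mode concrete, here is a minimal, self-contained Go sketch of how a nil platformStatus produces exactly this panic. The types only mirror the relevant openshift/api config/v1 fields; this is an illustration of the crash pattern, not the operator's actual code.

package main

import "fmt"

type PlatformType string

// PlatformStatus mirrors the optional status struct from openshift/api.
type PlatformStatus struct {
	Type PlatformType
}

// InfrastructureStatus mirrors the two relevant fields: the deprecated
// Platform value and the optional PlatformStatus pointer.
type InfrastructureStatus struct {
	Platform       PlatformType    // deprecated, still set on upgraded clusters
	PlatformStatus *PlatformStatus // optional; nil here after a 4.1 -> 4.2 upgrade
}

func main() {
	// An upgraded cluster's infrastructure status: PlatformStatus was
	// never populated, so the pointer is nil.
	status := InfrastructureStatus{Platform: "VSphere"}

	// Dereferencing the optional pointer without a nil check, as the 4.2.4
	// proxy reconciler effectively does while building the no_proxy list,
	// panics with "invalid memory address or nil pointer dereference".
	fmt.Println(status.PlatformStatus.Type)
}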
https://github.com/openshift/cluster-network-operator/blob/release-4.2/vendor/github.com/openshift/api/config/v1/types_infrastructure.go notes that the InfrastructureStatus.platform field has been deprecated in favor of InfrastructureStatus.platformStatus.type. Given that platformStatus is marked 'optional', openshift-network-operator should handle the case where it's missing. In the meantime it's not clear, as a user, whether I *should* be modifying status fields on the infrastructure/cluster CR instance, since status fields are usually hands-off from a user standpoint.
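As a sketch of the defensive handling being asked for, assuming a fallback to the deprecated Platform field is acceptable; the platformType helper and mirror types here are hypothetical, not the fix the operator actually shipped:

package main

import "fmt"

type PlatformType string

type PlatformStatus struct {
	Type PlatformType
}

type InfrastructureStatus struct {
	Platform       PlatformType    // deprecated in favor of PlatformStatus.Type
	PlatformStatus *PlatformStatus // marked optional in the API
}

// platformType is a hypothetical helper: prefer PlatformStatus.Type when the
// optional pointer is set, and fall back to the deprecated Platform field
// instead of panicking when it is missing.
func platformType(status InfrastructureStatus) PlatformType {
	if status.PlatformStatus != nil {
		return status.PlatformStatus.Type
	}
	return status.Platform
}

func main() {
	upgraded := InfrastructureStatus{Platform: "VSphere"} // nil PlatformStatus
	fmt.Println(platformType(upgraded))                   // "VSphere", no panic
}

Either a fallback like this or an explicit "platform status missing" error returned from the reconcile would keep the loop alive instead of crashlooping the whole operator.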
As you say, status fields should be treated as read-only. Is there a reason you are trying to modify them? I'm searching the manuals for a documented procedure that involves this, but I'm not finding one. Can you comment?
This is answered by the name of this bug: “network operator crashes when infra.status.platformStatus is missing”. The network operator crashes without a value there. It is unable to apply proxy settings.
Yikes, missed that, thanks. I will try to repro and work on a fix.
Much appreciated. I'm happy to provide additional info if there's anything that might help. A fresh 4.2 install seems to include the value; a cluster upgraded 4.1.23 -> 4.2.4 did not. The network operator crash relates to applying proxy settings, and that's the motivation behind wanting to go 4.1 -> 4.2 so aggressively: we don't have the option of allowing direct internet access, and because remote IPs for CDNs change often, we have frequent issues pulling container images. So we don't want to put critical apps on OCP without proxy support, but our existing prod cluster is on 4.1.
FYI, I've also filed https://bugzilla.redhat.com/show_bug.cgi?id=1779299 against the CVO. I'm doubtful that new features are as extensively tested on upgraded clusters as on freshly built ones. I'd expect either the CVO or a responsible operator to populate the new field on upgrade, rather than burdening every other component, like the network operator, with test cases for its absence.
The workaround should be fine. FWIW, the fix landed in master and is pending pickup by QE. It has also been backported to 4.2 (we'll backport to 4.3 as well, since it seems we've already branched).
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581