1773870 – network operator crashes when infra.status.platformStatus is missing

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.4.0
Assignee:	Ricardo Carrillo Cruz
QA Contact:	zhaozhanqi
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1778235 1781558 1787765

Reported:	2019-11-19 08:05 UTC by Chet Hosey
Modified:	2023-10-06 18:47 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1778235 1781558 1787765
Environment:
Last Closed:	2020-05-13 21:52:51 UTC
Target Upstream Version:
Embargoed:
Flags:	ricarril: needinfo- ricarril: needinfo- ricarril: needinfo-

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-network-operator pull 406	0	'None'	closed	Bug 1773870: Check if infra.Status.PlatformStatus exists before accessing it	2021-02-08 03:09:59 UTC
Red Hat Product Errata	RHBA-2020:0581	0	None	None	None	2020-05-13 21:52:58 UTC

Internal Links: 1779299

Description Chet Hosey 2019-11-19 08:05:39 UTC

Description of problem:

After an upgrade from 4.1.23 -> 4.2.4, the network operator is found to be crashlooping.


Version-Release number of selected component (if applicable): OCP 4.2.4


How reproducible:

observed on VSphere platform


Steps to Reproduce:
1. oc edit infrastructure cluster
2. Delete status.platformStatus
3. Update the proxy configuration
4. openshift-network-operator begins crashlooping with the stack trace below

Actual results:

2019/11/16 22:32:36 Go Version: go1.11.13
2019/11/16 22:32:36 Go OS/Arch: linux/amd64
2019/11/16 22:32:36 operator-sdk Version: v0.4.1
2019/11/16 22:32:36 overriding kubernetes api to https://api-int.eng.openshift.tcc.etn.com:6443
2019/11/16 22:32:37 Registering Components.
2019/11/16 22:32:37 Configuring Controllers
2019/11/16 22:32:37 Starting the Cmd.
2019/11/16 22:32:37 Reconciling Network.config.openshift.io cluster
2019/11/16 22:32:37 Reconciling Network.operator.openshift.io cluster
2019/11/16 22:32:37 Reconciling update for openshift-service-ca from openshift-network-operator/cluster
2019/11/16 22:32:37 Reconciling proxy 'cluster'
2019/11/16 22:32:37 Reconciling configmap from  v4-0-config-system-trusted-ca-bundle/openshift-authentication
2019/11/16 22:32:37 Starting render phase
2019/11/16 22:32:37 Render phase done, rendered 35 objects
2019/11/16 22:32:37 Updated ClusterOperator with conditions:
- lastTransitionTime: "2019-09-18T21:59:41Z"
  status: "False"
  type: Degraded
- lastTransitionTime: "2019-11-15T06:45:51Z"
  status: "False"
  type: Progressing
- lastTransitionTime: "2019-08-15T14:39:42Z"
  status: "True"
  type: Available
- lastTransitionTime: "2019-11-15T05:35:58Z"
  status: "True"
  type: Upgradeable
2019/11/16 22:32:37 Reconciling configmap from  trusted-ca-bundle/openshift-kube-apiserver
2019/11/16 22:32:37 reconciling (/v1, Kind=ConfigMap) openshift-network-operator/applied-cluster
2019/11/16 22:32:37 update was successful
2019/11/16 22:32:37 reconciling (apiextensions.k8s.io/v1beta1, Kind=CustomResourceDefinition) /network-attachment-definitions.k8s.cni.cncf.io
2019/11/16 22:32:37 update was successful
2019/11/16 22:32:37 reconciling (/v1, Kind=Namespace) /openshift-multus
2019/11/16 22:32:37 update was successful
2019/11/16 22:32:37 reconciling (rbac.authorization.k8s.io/v1, Kind=ClusterRole) /multus
2019/11/16 22:32:37 Updated ClusterOperator with conditions:
- lastTransitionTime: "2019-09-18T21:59:41Z"
  status: "False"
  type: Degraded
- lastTransitionTime: "2019-11-15T06:45:51Z"
  status: "False"
  type: Progressing
- lastTransitionTime: "2019-08-15T14:39:42Z"
  status: "True"
  type: Available
- lastTransitionTime: "2019-11-15T05:35:58Z"
  status: "True"
  type: Upgradeable
2019/11/16 22:32:37 Reconciling configmap from  telemeter-trusted-ca-bundle/openshift-monitoring
2019/11/16 22:32:37 update was successful
2019/11/16 22:32:37 reconciling (/v1, Kind=ServiceAccount) openshift-multus/multus
2019/11/16 22:32:37 update was successful
2019/11/16 22:32:37 reconciling (rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding) /multus
2019/11/16 22:32:37 Reconciling configmap from  openshift-global-ca/openshift-controller-manager
2019/11/16 22:32:37 Reconciling configmap from  trusted-ca-bundle/openshift-apiserver-operator
2019/11/16 22:32:37 Reconciling configmap from  trusted-ca/openshift-image-registry
2019/11/16 22:32:37 Reconciling configmap from  trusted-ca-bundle/openshift-authentication-operator
2019/11/16 22:32:37 Reconciling configmap from  trusted-ca-bundle/openshift-console
2019/11/16 22:32:37 Reconciling configmap from  marketplace-trusted-ca/openshift-marketplace
2019/11/16 22:32:37 Reconciling configmap from  trusted-ca-bundle/openshift-insights
2019/11/16 22:32:37 Reconciling configmap from  trusted-ca-bundle/openshift-service-catalog-controller-manager
2019/11/16 22:32:37 Reconciling configmap from  trusted-ca-bundle/openshift-config-managed
2019/11/16 22:32:37 trusted-ca-bundle changed, updating 13 configMaps
2019/11/16 22:32:37 Reconciling configmap from  alertmanager-trusted-ca-bundle/openshift-monitoring
2019/11/16 22:32:37 Reconciling configmap from  cco-trusted-ca/openshift-cloud-credential-operator
E1116 22:32:38.065543       1 runtime.go:66] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/go/src/github.com/openshift/cluster-network-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
/go/src/github.com/openshift/cluster-network-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/openshift/cluster-network-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/asm_amd64.s:522
/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/panic.go:513
/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/panic.go:82
/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/signal_unix.go:390
/go/src/github.com/openshift/cluster-network-operator/pkg/util/proxyconfig/no_proxy.go:75
/go/src/github.com/openshift/cluster-network-operator/pkg/controller/proxyconfig/status.go:24
/go/src/github.com/openshift/cluster-network-operator/pkg/controller/proxyconfig/controller.go:195
/go/src/github.com/openshift/cluster-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:213
/go/src/github.com/openshift/cluster-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158
/go/src/github.com/openshift/cluster-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/go/src/github.com/openshift/cluster-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/go/src/github.com/openshift/cluster-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/asm_amd64.s:1333
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1144982]

goroutine 451 [running]:
github.com/openshift/cluster-network-operator/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/src/github.com/openshift/cluster-network-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x108
panic(0x12e0d60, 0x22aa360)
        /opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/panic.go:513 +0x1b9
github.com/openshift/cluster-network-operator/pkg/util/proxyconfig.MergeUserSystemNoProxy(0xc0006761a0, 0xc0004b9e00, 0xc0006764e0, 0xc001eb07e0, 0x9, 0xc0020b5168, 0xeff99e, 0xc0007022f0)
        /go/src/github.com/openshift/cluster-network-operator/pkg/util/proxyconfig/no_proxy.go:75 +0x422
github.com/openshift/cluster-network-operator/pkg/controller/proxyconfig.(*ReconcileProxyConfig).syncProxyStatus(0xc000290f00, 0xc0006761a0, 0xc0004b9e00, 0xc0006764e0, 0xc001eb07e0, 0x147d8b8, 0x11)
        /go/src/github.com/openshift/cluster-network-operator/pkg/controller/proxyconfig/status.go:24 +0x3e8
github.com/openshift/cluster-network-operator/pkg/controller/proxyconfig.(*ReconcileProxyConfig).Reconcile(0xc000290f00, 0x0, 0x0, 0xc00038d520, 0x7, 0x22c1620, 0x49, 0x4057bb, 0xc00008b1a0)
        /go/src/github.com/openshift/cluster-network-operator/pkg/controller/proxyconfig/controller.go:195 +0x35bb
github.com/openshift/cluster-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000540000, 0x0)
        /go/src/github.com/openshift/cluster-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:213 +0x1d3
github.com/openshift/cluster-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1()
        /go/src/github.com/openshift/cluster-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158 +0x36
github.com/openshift/cluster-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000028890)
        /go/src/github.com/openshift/cluster-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
github.com/openshift/cluster-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000028890, 0x3b9aca00, 0x0, 0x1, 0xc00008a1e0)
        /go/src/github.com/openshift/cluster-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbe
github.com/openshift/cluster-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc000028890, 0x3b9aca00, 0xc00008a1e0)
        /go/src/github.com/openshift/cluster-network-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by github.com/openshift/cluster-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
        /go/src/github.com/openshift/cluster-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:157 +0x32a



Expected results:

Network operator should reconcile and apply proxy settings.

Additional info:

I noticed that this was working on a scratch bare-metal installation, so I copied its settings to my real test environment on VSphere.

I was able to get the operator to reconcile by doing "oc edit infrastructure cluster" and copying the following from my bare-metal installation:

    status:
      platformStatus:
        type: None

With that added the network reconciliation was able to complete.

I don't know what sort of breakage may result from this change, so I then removed platformStatus again, leaving the status as follows:

    status:
      apiServerInternalURI: https://api-int.eng.openshift.<domain>:6443
      apiServerURL: https://api.eng.openshift.<domain>:6443
      etcdDiscoveryDomain: eng.openshift.<domain>
      infrastructureName: eng-vv66d
      platform: VSphere

After making another test change to the proxy configuration, the network operator again began crashlooping.

Comment 1 Chet Hosey 2019-11-19 20:00:48 UTC

https://github.com/openshift/cluster-network-operator/blob/release-4.2/vendor/github.com/openshift/api/config/v1/types_infrastructure.go mentions that the InfrastructureStatus.platform field has been deprecated in favor of InfrastructureStatus.platformStatus.type.

Given that platformStatus is marked 'optional', openshift-network-operator should handle the case where it's missing. In the meantime, it's not clear to me as a user whether I *should* be modifying status fields on the infrastructure/cluster CR instance, because status fields are usually hands-off from a user standpoint.
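
To illustrate the kind of guard being asked for, here is a minimal sketch. It is not the operator's actual code. The types are simplified local stand-ins for the Infrastructure type in openshift/api config/v1, and platformType is a hypothetical helper name. It assumes platformStatus is an optional pointer field, and that falling back to the deprecated platform field is acceptable when it is absent.

```go
package main

import "fmt"

// Simplified stand-ins for the relevant parts of the Infrastructure type in
// github.com/openshift/api/config/v1. They are illustrative assumptions only.
type PlatformType string

type PlatformStatus struct {
	Type PlatformType
}

type InfrastructureStatus struct {
	Platform       PlatformType    // deprecated in favor of PlatformStatus.Type
	PlatformStatus *PlatformStatus // optional, so it may be nil
}

type Infrastructure struct {
	Status InfrastructureStatus
}

// platformType (hypothetical helper) returns the platform type without
// dereferencing a nil PlatformStatus. It prefers PlatformStatus.Type and
// falls back to the deprecated Platform field when the new field is absent.
func platformType(infra *Infrastructure) PlatformType {
	if infra == nil {
		return ""
	}
	if ps := infra.Status.PlatformStatus; ps != nil && ps.Type != "" {
		return ps.Type
	}
	return infra.Status.Platform
}

func main() {
	// An upgraded cluster: only the deprecated field is populated.
	upgraded := &Infrastructure{Status: InfrastructureStatus{Platform: "VSphere"}}

	// A fresh install: both fields are populated.
	fresh := &Infrastructure{Status: InfrastructureStatus{
		Platform:       "VSphere",
		PlatformStatus: &PlatformStatus{Type: "VSphere"},
	}}

	fmt.Println(platformType(upgraded))
	fmt.Println(platformType(fresh))
}
```

With a guard like this, a missing platformStatus results in a fallback value instead of a nil pointer dereference.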

Comment 4 Ricardo Carrillo Cruz 2019-11-26 11:20:39 UTC

As you say, status should be treated as read-only.
Is there a reason why you are trying to modify them?
I've been looking through the manuals for a documented procedure that involves this, but I'm not finding one.
Can you comment?

Comment 5 Chet Hosey 2019-11-26 14:29:05 UTC

This is answered by the name of this bug: “network operator crashes when infra.status.platformStatus is missing”.

The network operator crashes without a value there. It is unable to apply proxy settings.

Comment 6 Ricardo Carrillo Cruz 2019-11-26 15:06:39 UTC

Yikes, missed that, thanks.
I will try to repro and work on a fix.

Comment 7 Chet Hosey 2019-11-26 16:13:20 UTC

Much appreciated. I’m happy to provide additional info if there’s any that might help. A fresh 4.2 install seems to include the value; a cluster upgraded 4.1.23 -> 4.2.4 did not.

The network operator crash relates to applying proxy settings. And that’s the motivation behind wanting to go 4.1 -> 4.2 so aggressively. We don’t have an option to allow direct internet access.

Remote IPs for CDNs change often so we have frequent issues pulling container images. So we don’t want to put critical apps on OCP without proxy support, but our existing prod cluster is 4.1.

Comment 8 Chet Hosey 2019-12-03 17:24:29 UTC

FYI I've also filed https://bugzilla.redhat.com/show_bug.cgi?id=1779299 against the CVO.

I'm doubtful that new features are extensively tested on upgraded clusters, as opposed to freshly built clusters. I'd expect either the CVO or a responsible operator to populate the new field on upgrades to ease the potential burden on other components, like the network operator, to include such test cases.
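
As a rough sketch of what populating the new field on upgrade could look like (purely illustrative: the types are simplified local stand-ins for the openshift/api ones, backfillPlatformStatus is a hypothetical name, and this is not what the CVO or any operator actually does):

```go
package main

import "fmt"

// Simplified stand-ins for the openshift/api config/v1 types. They are
// illustrative assumptions only.
type PlatformType string

type PlatformStatus struct {
	Type PlatformType
}

type InfrastructureStatus struct {
	Platform       PlatformType    // deprecated field, present on 4.1 clusters
	PlatformStatus *PlatformStatus // newer optional field, may be nil after upgrade
}

// backfillPlatformStatus (hypothetical) sets PlatformStatus from the
// deprecated Platform field when the new field is missing. It reports
// whether it changed the status, so a caller knows if an update is needed.
func backfillPlatformStatus(status *InfrastructureStatus) bool {
	if status == nil || status.PlatformStatus != nil || status.Platform == "" {
		return false
	}
	status.PlatformStatus = &PlatformStatus{Type: status.Platform}
	return true
}

func main() {
	// Status as it might look on a cluster upgraded from 4.1.
	status := &InfrastructureStatus{Platform: "VSphere"}

	changed := backfillPlatformStatus(status)
	fmt.Println(changed, status.PlatformStatus.Type)
}
```

The function is idempotent: a second call on the same status makes no change, so it would be safe to run on every reconcile.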

Comment 13 Ricardo Carrillo Cruz 2019-12-10 09:14:05 UTC

The workaround should be fine.
FWIW, the fix has landed in master and is waiting to be picked up by QE.
Also backported to 4.2 (will also backport to 4.3, it seems we already branched).

Comment 17 errata-xmlrpc 2020-05-13 21:52:51 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581
