Bug 2095229

Summary:	ingress-operator pod in CrashLoopBackOff in 4.11 after upgrade starting in 4.6 due to go panic
Product:	OpenShift Container Platform	Reporter:	Jon Uriarte <juriarte>
Component:	Networking	Assignee:	Miciah Dashiel Butler Masters <mmasters>
Networking sub component:	router	QA Contact:	Arvind iyengar <aiyengar>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	medium
Priority:	high	CC:	aos-bugs, hongli, mmasters
Version:	4.11
Target Milestone:	---
Target Release:	4.11.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-08-10 11:17:14 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Jon Uriarte 2022-06-09 10:48:18 UTC

Description of problem:

The ingress-operator pod keeps restarting after the 4.10 to 4.11 upgrade because of a Go "invalid memory address or nil pointer dereference" panic:

NAMESPACE                   NAME                               READY  STATUS            RESTARTS       AGE
openshift-ingress-operator  ingress-operator-76fb9cbb6c-65dgd  1/2    CrashLoopBackOff  15 (3m39s ago) 57m

$ oc -n openshift-ingress-operator logs -p ingress-operator-76fb9cbb6c-65dgd -c ingress-operator
[...]
2022-06-09T10:26:09.039Z        INFO    operator.init.controller.dns_controller controller/controller.go:234    Starting workers        {"worker count": 1}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x14 pc=0x15bcff6]

goroutine 1781 [running]:
github.com/openshift/cluster-ingress-operator/pkg/operator/controller/ingress.desiredRouterDeployment(0xc000cce000, {0x7ffdc46ba87a, 0x76}, 0xc001cf0ea0, 0xc000a72d80, 0xc0011097c8?, 0xc000026a80, 0x0, 0x0, 0xc0002e3400)
        /ingress-operator/pkg/operator/controller/ingress/deployment.go:661 +0x3d16
github.com/openshift/cluster-ingress-operator/pkg/operator/controller/ingress.(*reconciler).ensureRouterDeployment(0xc000beba40, 0xc000cce000, 0x2799770?, 0x10?, 0xc001109848?, 0x40b0d6?, 0x40?, 0x22dc480?)
        /ingress-operator/pkg/operator/controller/ingress/deployment.go:125 +0x2ba
github.com/openshift/cluster-ingress-operator/pkg/operator/controller/ingress.(*reconciler).ensureIngressController(0xc000beba40, 0xc000cce000, 0xc000c78870?, 0x0?, 0xb?, 0x2336b88?, 0x2?)
        /ingress-operator/pkg/operator/controller/ingress/controller.go:851 +0x654
github.com/openshift/cluster-ingress-operator/pkg/operator/controller/ingress.(*reconciler).Reconcile(0xc000beba40, {0x27b4578, 0xc000c78870}, {{{0xc0006a84e0?, 0x220e3a0?}, {0xc000ee0e00?, 0x30?}}})
        /ingress-operator/pkg/operator/controller/ingress/controller.go:261 +0xad2
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc000ad6840, {0x27b4578, 0xc000c787b0}, {{{0xc0006a84e0?, 0x220e3a0?}, {0xc000ee0e00?, 0x4041f4?}}})
        /ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114 +0x27e
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000ad6840, {0x27b44d0, 0xc000bef980}, {0x20af780?, 0xc000e45680?})
        /ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311 +0x349
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000ad6840, {0x27b44d0, 0xc000bef980})
        /ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
        /ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
        /ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:223 +0x31c
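
The report does not state the root cause of the panic. As a general illustration only, a nil pointer dereference of this kind typically arises when code reads a field through an optional pointer that was never populated, for example on an object first written by a much older release. The sketch below uses hypothetical types (`publishingStrategy`, `hostNetworkParams`) and hypothetical helper names; it is not the operator's code, and the default port value is an assumption.

```go
package main

import "fmt"

// Hypothetical stand-ins for illustration only; these are NOT the
// operator's real types.
type hostNetworkParams struct {
	Protocol  string
	HTTPPort  int32
	HTTPSPort int32
}

type publishingStrategy struct {
	Type        string
	HostNetwork *hostNetworkParams // optional; may be nil on older objects
}

// httpPortUnsafe reads through the optional pointer without a check and
// panics with a nil pointer dereference when HostNetwork is unset.
func httpPortUnsafe(s *publishingStrategy) int32 {
	return s.HostNetwork.HTTPPort
}

// httpPortSafe returns def when the strategy or its optional field is unset.
func httpPortSafe(s *publishingStrategy, def int32) int32 {
	if s == nil || s.HostNetwork == nil {
		return def
	}
	return s.HostNetwork.HTTPPort
}

// panics reports whether calling f triggers a panic.
func panics(f func()) (p bool) {
	defer func() {
		if r := recover(); r != nil {
			p = true
		}
	}()
	f()
	return false
}

func main() {
	// An object whose optional HostNetwork field was never populated.
	old := &publishingStrategy{Type: "HostNetwork"}

	fmt.Println("unsafe access panics:", panics(func() { _ = httpPortUnsafe(old) }))
	fmt.Println("safe access returns:", httpPortSafe(old, 80)) // 80 is an assumed default
}
```

Guarding the optional field, or defaulting it before use, keeps a reconcile loop from crashing on objects that predate the field.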

The complete upgrade chain is as follows:
 4.6.0-0.nightly-2022-06-08-054151  ->
 4.7.0-0.nightly-2022-06-08-093003  ->
 4.8.0-0.nightly-2022-06-08-100908  ->
 4.9.0-0.nightly-2022-06-08-150705  ->
 4.10.0-0.nightly-2022-06-08-150219 ->
 4.11.0-0.nightly-2022-06-06-201913


The same pod remains healthy through every upgrade until the cluster reaches 4.11.

The ingress cluster operator nevertheless reports as healthy:
NAME     VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE                                                                                                         
ingress  4.11.0-0.nightly-2022-06-06-201913   True        False         False      9h

OpenShift release version: 4.11.0-0.nightly-2022-06-06-201913


Cluster Platform: OSP 16.2.2


How reproducible: always (3 times out of 3)


Steps to Reproduce (in detail):
1. Install OCP 4.6 (with OpenShiftSDN in this case)
2. Upgrade to 4.7 -> 4.8 -> 4.9 -> 4.10 -> 4.11


Actual results: ingress-operator pod in CrashLoopBackOff status


Expected results: successful 4.10 to 4.11 upgrade


Impact of the problem: unknown


Additional info:

upgrade history:
    - completionTime: "2022-06-09T09:10:58Z"
      image: registry.ci.openshift.org/ocp/release@sha256:49af2f8eeef5a24b2418aa6ba0be097a6e74bc747d35d403afe51ff1b173fa0b
      startedTime: "2022-06-09T07:33:17Z"
      state: Completed
      verified: true
      version: 4.11.0-0.nightly-2022-06-06-201913
    - completionTime: "2022-06-09T05:31:05Z"
      image: registry.ci.openshift.org/ocp/release@sha256:6bb01826e3996b4b792c0eed75316cfd55fd45f87fdd08a54d4953311c6ae985
      startedTime: "2022-06-09T04:05:37Z"
      state: Completed
      verified: false
      version: 4.10.0-0.nightly-2022-06-08-150219
    - completionTime: "2022-06-09T03:57:07Z"
      image: registry.ci.openshift.org/ocp/release@sha256:331d14da907366908786c489f4192973531a8ed819fee816ae4dcc7a710d1025
      startedTime: "2022-06-09T02:12:52Z"
      state: Completed
      verified: false
      version: 4.9.0-0.nightly-2022-06-08-150705
    - completionTime: "2022-06-09T01:57:07Z"
      image: registry.ci.openshift.org/ocp/release@sha256:a6a8d24bdf18f090b642dccd0d9a3d86b01e59c568629b2b7a27c549e10a00e9
      startedTime: "2022-06-09T00:30:33Z"
      state: Completed
      verified: false
      version: 4.8.0-0.nightly-2022-06-08-100908
    - completionTime: "2022-06-09T00:22:34Z"
      image: registry.ci.openshift.org/ocp/release@sha256:cb8217b51d438c4e082b8a88b918700f10363a71e18ab5dbb58f5ce61ba318d7
      startedTime: "2022-06-08T22:28:08Z"
      state: Completed
      verified: false
      version: 4.7.0-0.nightly-2022-06-08-093003
    - completionTime: "2022-06-08T22:07:23Z"
      image: registry.ci.openshift.org/ocp/release@sha256:1c94ff2760667cbb6f130619e3b0ee5d0c2d3ede4dcdb3fd27c55c7fee5853c3
      startedTime: "2022-06-08T21:25:20Z"
      state: Completed
      verified: false
      version: 4.6.0-0.nightly-2022-06-08-054151

must-gather data:
ClusterID: e34de7bb-52f5-44e1-8072-8f278d4e1f15
ClusterVersion: Stable at "4.11.0-0.nightly-2022-06-06-201913"
ClusterOperators:
        All healthy and stable



$ oc -n openshift-ingress-operator get pod ingress-operator-76fb9cbb6c-65dgd -o yaml
apiVersion: v1                                                                                                                                                                                                                               
kind: Pod                               
metadata:
[...]
containerStatuses:
  - containerID: cri-o://7376e7c9c1ce064d55b70b0a4961f9da1e4c1e1d295ab32f6aaa2ce2b9d5ede3
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fb372d527cb475812d1ee16a5fa6499ade4f0afc56f8b427eb539736700dea71
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fb372d527cb475812d1ee16a5fa6499ade4f0afc56f8b427eb539736700dea71
    lastState:
      terminated:
        containerID: cri-o://7376e7c9c1ce064d55b70b0a4961f9da1e4c1e1d295ab32f6aaa2ce2b9d5ede3
        exitCode: 2
        finishedAt: "2022-06-09T10:20:57Z"
        message: "rator/controller/ingress/deployment.go:661 +0x3d16\ngithub.com/openshift/cluster-ingress-operator/pkg/operator/controller/ingress.(*reconciler).ensureRouterDeployment(0xc000c5f950, 0xc000956300, 0x2799770?, 0x10?, 0xc00175d848?, 0x40b0d6?, 0x40?, 0x22dc480?)\n\t/ingress-operator/pkg/operator/controller/ingress/deployment.go:125 +0x2ba\ngithub.com/openshift/cluster-ingress-operator/pkg/operator/controller/ingress.(*reconciler).ensureIngressController(0xc000c5f950, 0xc000956300, 0xc0017042d0?, 0x0?, 0xb?, 0x2336b88?, 0x2?)\n\t/ingress-operator/pkg/operator/controller/ingress/controller.go:851 +0x654\ngithub.com/openshift/cluster-ingress-operator/pkg/operator/controller/ingress.(*reconciler).Reconcile(0xc000c5f950, {0x27b4578, 0xc0017042d0}, {{{0xc0004659a0?, 0x220e3a0?}, {0xc000d92620?, 0x30?}}})\n\t/ingress-operator/pkg/operator/controller/ingress/controller.go:261 +0xad2\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc000a76e70, {0x27b4578, 0xc001704180}, {{{0xc0004659a0?, 0x220e3a0?}, {0xc000d92620?, 0x4041f4?}}})\n\t/ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114 +0x27e\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000a76e70, {0x27b44d0, 0xc000528e40}, {0x20af780?, 0xc000c6d340?})\n\t/ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311 +0x349\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000a76e70, {0x27b44d0, 0xc000528e40})\n\t/ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266 +0x1d9\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()\n\t/ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227 +0x85\ncreated by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2\n\t/ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:223 +0x31c\n"
        reason: Error
        startedAt: "2022-06-09T10:20:50Z"
    name: ingress-operator
    ready: false
    restartCount: 20
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=ingress-operator pod=ingress-operator-76fb9cbb6c-65dgd_openshift-ingress-operator(87e1e203-2869-4714-a030-f8f42ab31f64)
        reason: CrashLoopBackOff
[...]
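
The status above can also be checked programmatically. The following is a minimal sketch, assuming the pod is fetched as JSON (for example with `-o json` instead of `-o yaml`); it declares only the Pod fields it reads (`status.containerStatuses[].name` and `.state.waiting.reason`) and uses the standard library only. The function name `crashLooping`, the sample document, and the second container name `sidecar` are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pod declares only the Pod fields this sketch needs.
type pod struct {
	Status struct {
		ContainerStatuses []struct {
			Name  string `json:"name"`
			State struct {
				// Waiting is nil unless the container is in the waiting state.
				Waiting *struct {
					Reason string `json:"reason"`
				} `json:"waiting"`
			} `json:"state"`
		} `json:"containerStatuses"`
	} `json:"status"`
}

// crashLooping returns the names of containers whose current state is
// "waiting" with reason CrashLoopBackOff.
func crashLooping(podJSON []byte) ([]string, error) {
	var p pod
	if err := json.Unmarshal(podJSON, &p); err != nil {
		return nil, fmt.Errorf("decode pod: %w", err)
	}
	var names []string
	for _, cs := range p.Status.ContainerStatuses {
		if cs.State.Waiting != nil && cs.State.Waiting.Reason == "CrashLoopBackOff" {
			names = append(names, cs.Name)
		}
	}
	return names, nil
}

func main() {
	// Trimmed, hypothetical sample shaped like the status in this report.
	sample := []byte(`{"status":{"containerStatuses":[
		{"name":"ingress-operator","state":{"waiting":{"reason":"CrashLoopBackOff"}}},
		{"name":"sidecar","state":{"running":{}}}
	]}}`)

	names, err := crashLooping(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("containers in CrashLoopBackOff:", names)
}
```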


Comment 6 Arvind iyengar 2022-06-22 10:10:39 UTC

Verified in "4.11.0-0.nightly-2022-06-21-040754". When upgrading a cluster from 4.10 to the fixed nightly release on an OSP16 environment, the process completes successfully with no failures or ingress-operator pod crashes:
--------
Pre-upgrade:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.18   True        False         12m     Cluster version is 4.10.18

oc get infrastructures.config.openshift.io cluster -ojsonpath='{.spec}' | jq .  
{
  "cloudConfig": {
    "key": "config",
    "name": "cloud-provider-config"
  },
  "platformSpec": {
    "type": "OpenStack"
  }
}

oc -n openshift-ingress-operator get ingresscontroller default -ojsonpath='{.status.endpointPublishingStrategy}' | jq . 
{
  "hostNetwork": {
    "protocol": "TCP"
  },
  "type": "HostNetwork"
}


Post upgrade:
oc get clusterversion  
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-21-040754   True        False         14s     Cluster version is 4.11.0-0.nightly-2022-06-21-040754

oc -n openshift-ingress-operator get all 
NAME                                    READY   STATUS    RESTARTS   AGE
pod/ingress-operator-5d548f9467-bmflw   2/2     Running   0          14m

NAME              TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
service/metrics   ClusterIP   172.30.15.98   <none>        9393/TCP   125m

NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/ingress-operator   1/1     1            1           125m

NAME                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/ingress-operator-5d548f9467   1         1         1       43m
replicaset.apps/ingress-operator-7899578f6    0         0         0       125m


oc -n openshift-ingress-operator logs pod/ingress-operator-5d548f9467-bmflw -c ingress-operator
2022-06-22T05:53:20.053Z	INFO	operator.main	ingress-operator/start.go:63	using operator namespace	{"namespace": "openshift-ingress-operator"}
I0622 05:53:24.193309       1 request.go:665] Waited for 1.045122844s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/snapshot.storage.k8s.io/v1?timeout=32s
2022-06-22T05:53:25.610Z	INFO	operator.main	ingress-operator/start.go:63	registering Prometheus metrics for canary_controller
2022-06-22T05:53:25.610Z	INFO	operator.main	ingress-operator/start.go:63	registering Prometheus metrics for ingress_controller
2022-06-22T05:53:25.610Z	INFO	operator.init	runtime/asm_amd64.s:1571	starting metrics listener	{"addr": "127.0.0.1:60000"}
2022-06-22T05:53:25.610Z	INFO	operator.main	ingress-operator/start.go:63	watching file	{"filename": "/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem"}
2022-06-22T05:53:28.120Z	INFO	operator.init.controller-runtime.metrics	metrics/listener.go:44	Metrics server is starting to listen	{"addr": ":8080"}
I0622 05:53:28.121036       1 base_controller.go:67] Waiting for caches to sync for spread-default-router-pods
2022-06-22T05:53:28.148Z	ERROR	operator.init	ingress-operator/start.go:197	failed to handle single node 4.11 upgrade logic	{"error": "unable to update ingress config \"cluster\": ingresses.config.openshift.io \"cluster\" is forbidden: User \"system:serviceaccount:openshift-ingress-operator:ingress-operator\" cannot patch resource \"ingresses/status\" in API group \"config.openshift.io\" at the cluster scope"}
2022-06-22T05:53:28.149Z	INFO	operator.init	runtime/asm_amd64.s:1571	Starting server	{"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
I0622 05:53:28.221804       1 base_controller.go:73] Caches are synced for spread-default-router-pods
--------

Comment 8 errata-xmlrpc 2022-08-10 11:17:14 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069