Description of problem:

ingress-operator pod keeps restarting after a 4.10 to 4.11 upgrade due to a Go "invalid memory address or nil pointer dereference" panic:

NAMESPACE                    NAME                                READY   STATUS             RESTARTS         AGE
openshift-ingress-operator   ingress-operator-76fb9cbb6c-65dgd   1/2     CrashLoopBackOff   15 (3m39s ago)   57m

$ oc -n openshift-ingress-operator logs -p ingress-operator-76fb9cbb6c-65dgd -c ingress-operator
[...]
2022-06-09T10:26:09.039Z    INFO    operator.init.controller.dns_controller    controller/controller.go:234    Starting workers    {"worker count": 1}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x14 pc=0x15bcff6]

goroutine 1781 [running]:
github.com/openshift/cluster-ingress-operator/pkg/operator/controller/ingress.desiredRouterDeployment(0xc000cce000, {0x7ffdc46ba87a, 0x76}, 0xc001cf0ea0, 0xc000a72d80, 0xc0011097c8?, 0xc000026a80, 0x0, 0x0, 0xc0002e3400)
        /ingress-operator/pkg/operator/controller/ingress/deployment.go:661 +0x3d16
github.com/openshift/cluster-ingress-operator/pkg/operator/controller/ingress.(*reconciler).ensureRouterDeployment(0xc000beba40, 0xc000cce000, 0x2799770?, 0x10?, 0xc001109848?, 0x40b0d6?, 0x40?, 0x22dc480?)
        /ingress-operator/pkg/operator/controller/ingress/deployment.go:125 +0x2ba
github.com/openshift/cluster-ingress-operator/pkg/operator/controller/ingress.(*reconciler).ensureIngressController(0xc000beba40, 0xc000cce000, 0xc000c78870?, 0x0?, 0xb?, 0x2336b88?, 0x2?)
        /ingress-operator/pkg/operator/controller/ingress/controller.go:851 +0x654
github.com/openshift/cluster-ingress-operator/pkg/operator/controller/ingress.(*reconciler).Reconcile(0xc000beba40, {0x27b4578, 0xc000c78870}, {{{0xc0006a84e0?, 0x220e3a0?}, {0xc000ee0e00?, 0x30?}}})
        /ingress-operator/pkg/operator/controller/ingress/controller.go:261 +0xad2
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc000ad6840, {0x27b4578, 0xc000c787b0}, {{{0xc0006a84e0?, 0x220e3a0?}, {0xc000ee0e00?, 0x4041f4?}}})
        /ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114 +0x27e
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000ad6840, {0x27b44d0, 0xc000bef980}, {0x20af780?, 0xc000e45680?})
        /ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311 +0x349
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000ad6840, {0x27b44d0, 0xc000bef980})
        /ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
        /ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
        /ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:223 +0x31c

The complete upgrade chain is as follows:
4.6.0-0.nightly-2022-06-08-054151 -> 4.7.0-0.nightly-2022-06-08-093003 -> 4.8.0-0.nightly-2022-06-08-100908 -> 4.9.0-0.nightly-2022-06-08-150705 -> 4.10.0-0.nightly-2022-06-08-150219 -> 4.11.0-0.nightly-2022-06-06-201913

The same pod remains healthy through all the upgrades until it reaches the 4.11 version.
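For context, a panic like the one above in desiredRouterDeployment typically comes from dereferencing a pointer field that can legitimately be nil on objects written by older releases. The sketch below is only an illustration of that failure pattern under assumed, simplified types; the HostNetworkStrategy/EndpointPublishingStrategy stand-ins, the desiredProtocol helper, and the "TCP" fallback are assumptions for the example, not the operator's actual code at deployment.go:661.

// Illustrative sketch only: assumed, simplified types showing how an
// unguarded dereference of a possibly-nil sub-struct produces a
// "nil pointer dereference" panic, and how a nil check avoids it.
package main

import "fmt"

// Stand-ins for the operator's API types (assumption for this example).
type HostNetworkStrategy struct {
    Protocol string
}

type EndpointPublishingStrategy struct {
    Type        string
    HostNetwork *HostNetworkStrategy // may be nil on objects written by older releases
}

// desiredProtocol is a hypothetical helper. Returning
// strategy.HostNetwork.Protocol without the nil checks below would panic with
// "invalid memory address or nil pointer dereference" whenever HostNetwork is nil.
func desiredProtocol(strategy *EndpointPublishingStrategy) string {
    if strategy != nil && strategy.HostNetwork != nil && strategy.HostNetwork.Protocol != "" {
        return strategy.HostNetwork.Protocol
    }
    return "TCP" // assumed fallback default
}

func main() {
    // Status carried over from an older release: Type is set but HostNetwork is nil.
    s := &EndpointPublishingStrategy{Type: "HostNetwork"}
    fmt.Println(desiredProtocol(s)) // prints "TCP" instead of panicking
}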
The ingress cluster operator itself reports healthy:

NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.11.0-0.nightly-2022-06-06-201913   True        False         False      9h

OpenShift release version:
4.11.0-0.nightly-2022-06-06-201913

Cluster Platform:
OSP 16.2.2

How reproducible:
Always (3 out of 3 attempts)

Steps to Reproduce (in detail):
1. Install OCP 4.6 (with OpenShiftSDN in this case)
2. Upgrade to 4.7 -> 4.8 -> 4.9 -> 4.10 -> 4.11

Actual results:
ingress-operator pod in CrashLoopBackOff status

Expected results:
Successful 4.10 to 4.11 upgrade

Impact of the problem:
Unknown

Additional info:

Upgrade history:
- completionTime: "2022-06-09T09:10:58Z"
  image: registry.ci.openshift.org/ocp/release@sha256:49af2f8eeef5a24b2418aa6ba0be097a6e74bc747d35d403afe51ff1b173fa0b
  startedTime: "2022-06-09T07:33:17Z"
  state: Completed
  verified: true
  version: 4.11.0-0.nightly-2022-06-06-201913
- completionTime: "2022-06-09T05:31:05Z"
  image: registry.ci.openshift.org/ocp/release@sha256:6bb01826e3996b4b792c0eed75316cfd55fd45f87fdd08a54d4953311c6ae985
  startedTime: "2022-06-09T04:05:37Z"
  state: Completed
  verified: false
  version: 4.10.0-0.nightly-2022-06-08-150219
- completionTime: "2022-06-09T03:57:07Z"
  image: registry.ci.openshift.org/ocp/release@sha256:331d14da907366908786c489f4192973531a8ed819fee816ae4dcc7a710d1025
  startedTime: "2022-06-09T02:12:52Z"
  state: Completed
  verified: false
  version: 4.9.0-0.nightly-2022-06-08-150705
- completionTime: "2022-06-09T01:57:07Z"
  image: registry.ci.openshift.org/ocp/release@sha256:a6a8d24bdf18f090b642dccd0d9a3d86b01e59c568629b2b7a27c549e10a00e9
  startedTime: "2022-06-09T00:30:33Z"
  state: Completed
  verified: false
  version: 4.8.0-0.nightly-2022-06-08-100908
- completionTime: "2022-06-09T00:22:34Z"
  image: registry.ci.openshift.org/ocp/release@sha256:cb8217b51d438c4e082b8a88b918700f10363a71e18ab5dbb58f5ce61ba318d7
  startedTime: "2022-06-08T22:28:08Z"
  state: Completed
  verified: false
  version: 4.7.0-0.nightly-2022-06-08-093003
- completionTime: "2022-06-08T22:07:23Z"
  image: registry.ci.openshift.org/ocp/release@sha256:1c94ff2760667cbb6f130619e3b0ee5d0c2d3ede4dcdb3fd27c55c7fee5853c3
  startedTime: "2022-06-08T21:25:20Z"
  state: Completed
  verified: false
  version: 4.6.0-0.nightly-2022-06-08-054151

must-gather data:
ClusterID: e34de7bb-52f5-44e1-8072-8f278d4e1f15
ClusterVersion: Stable at "4.11.0-0.nightly-2022-06-06-201913"
ClusterOperators: All healthy and stable

$ oc -n openshift-ingress-operator get pod ingress-operator-76fb9cbb6c-65dgd -o yaml
apiVersion: v1
kind: Pod
metadata:
[...]
containerStatuses:
- containerID: cri-o://7376e7c9c1ce064d55b70b0a4961f9da1e4c1e1d295ab32f6aaa2ce2b9d5ede3
  image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fb372d527cb475812d1ee16a5fa6499ade4f0afc56f8b427eb539736700dea71
  imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fb372d527cb475812d1ee16a5fa6499ade4f0afc56f8b427eb539736700dea71
  lastState:
    terminated:
      containerID: cri-o://7376e7c9c1ce064d55b70b0a4961f9da1e4c1e1d295ab32f6aaa2ce2b9d5ede3
      exitCode: 2
      finishedAt: "2022-06-09T10:20:57Z"
      message: "rator/controller/ingress/deployment.go:661 +0x3d16\ngithub.com/openshift/cluster-ingress-operator/pkg/operator/controller/ingress.(*reconciler).ensureRouterDeployment(0xc000c5f950, 0xc000956300, 0x2799770?, 0x10?, 0xc00175d848?, 0x40b0d6?, 0x40?, 0x22dc480?)\n\t/ingress-operator/pkg/operator/controller/ingress/deployment.go:125 +0x2ba\ngithub.com/openshift/cluster-ingress-operator/pkg/operator/controller/ingress.(*reconciler).ensureIngressController(0xc000c5f950, 0xc000956300, 0xc0017042d0?, 0x0?, 0xb?, 0x2336b88?, 0x2?)\n\t/ingress-operator/pkg/operator/controller/ingress/controller.go:851 +0x654\ngithub.com/openshift/cluster-ingress-operator/pkg/operator/controller/ingress.(*reconciler).Reconcile(0xc000c5f950, {0x27b4578, 0xc0017042d0}, {{{0xc0004659a0?, 0x220e3a0?}, {0xc000d92620?, 0x30?}}})\n\t/ingress-operator/pkg/operator/controller/ingress/controller.go:261 +0xad2\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc000a76e70, {0x27b4578, 0xc001704180}, {{{0xc0004659a0?, 0x220e3a0?}, {0xc000d92620?, 0x4041f4?}}})\n\t/ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114 +0x27e\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000a76e70, {0x27b44d0, 0xc000528e40}, {0x20af780?, 0xc000c6d340?})\n\t/ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311 +0x349\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000a76e70, {0x27b44d0, 0xc000528e40})\n\t/ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266 +0x1d9\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()\n\t/ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227 +0x85\ncreated by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2\n\t/ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:223 +0x31c\n"
      reason: Error
      startedAt: "2022-06-09T10:20:50Z"
  name: ingress-operator
  ready: false
  restartCount: 20
  started: false
  state:
    waiting:
      message: back-off 5m0s restarting failed container=ingress-operator pod=ingress-operator-76fb9cbb6c-65dgd_openshift-ingress-operator(87e1e203-2869-4714-a030-f8f42ab31f64)
      reason: CrashLoopBackOff
[...]

** Please do not disregard the report template; filling the template out as much as possible will allow us to help you. Please consider attaching a must-gather archive (via `oc adm must-gather`). Please review must-gather contents for sensitive information before attaching any must-gathers to a bugzilla report. You may also mark the bug private if you wish.
Verified in "4.11.0-0.nightly-2022-06-21-040754". Upgrading a cluster from 4.10 to the fixed nightly release on an OSP 16 environment, the process completes successfully with no failures or ingress-operator pod crashes:

--------
Pre-upgrade:

NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.18   True        False         12m     Cluster version is 4.10.18

oc get infrastructures.config.openshift.io cluster -ojsonpath='{.spec}' | jq .
{
  "cloudConfig": {
    "key": "config",
    "name": "cloud-provider-config"
  },
  "platformSpec": {
    "type": "OpenStack"
  }
}

oc -n openshift-ingress-operator get ingresscontroller default -ojsonpath='{.status.endpointPublishingStrategy}' | jq .
{
  "hostNetwork": {
    "protocol": "TCP"
  },
  "type": "HostNetwork"
}

Post-upgrade:

oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-21-040754   True        False         14s     Cluster version is 4.11.0-0.nightly-2022-06-21-040754

oc -n openshift-ingress-operator get all
NAME                                    READY   STATUS    RESTARTS   AGE
pod/ingress-operator-5d548f9467-bmflw   2/2     Running   0          14m

NAME              TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
service/metrics   ClusterIP   172.30.15.98   <none>        9393/TCP   125m

NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/ingress-operator   1/1     1            1           125m

NAME                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/ingress-operator-5d548f9467   1         1         1       43m
replicaset.apps/ingress-operator-7899578f6    0         0         0       125m

oc -n openshift-ingress-operator logs pod/ingress-operator-5d548f9467-bmflw -c ingress-operator
2022-06-22T05:53:20.053Z    INFO    operator.main    ingress-operator/start.go:63    using operator namespace    {"namespace": "openshift-ingress-operator"}
I0622 05:53:24.193309       1 request.go:665] Waited for 1.045122844s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/snapshot.storage.k8s.io/v1?timeout=32s
2022-06-22T05:53:25.610Z    INFO    operator.main    ingress-operator/start.go:63    registering Prometheus metrics for canary_controller
2022-06-22T05:53:25.610Z    INFO    operator.main    ingress-operator/start.go:63    registering Prometheus metrics for ingress_controller
2022-06-22T05:53:25.610Z    INFO    operator.init    runtime/asm_amd64.s:1571    starting metrics listener    {"addr": "127.0.0.1:60000"}
2022-06-22T05:53:25.610Z    INFO    operator.main    ingress-operator/start.go:63    watching file    {"filename": "/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem"}
2022-06-22T05:53:28.120Z    INFO    operator.init.controller-runtime.metrics    metrics/listener.go:44    Metrics server is starting to listen    {"addr": ":8080"}
I0622 05:53:28.121036       1 base_controller.go:67] Waiting for caches to sync for spread-default-router-pods
2022-06-22T05:53:28.148Z    ERROR    operator.init    ingress-operator/start.go:197    failed to handle single node 4.11 upgrade logic    {"error": "unable to update ingress config \"cluster\": ingresses.config.openshift.io \"cluster\" is forbidden: User \"system:serviceaccount:openshift-ingress-operator:ingress-operator\" cannot patch resource \"ingresses/status\" in API group \"config.openshift.io\" at the cluster scope"}
2022-06-22T05:53:28.149Z    INFO    operator.init    runtime/asm_amd64.s:1571    Starting server    {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
I0622 05:53:28.221804       1 base_controller.go:73] Caches are synced for spread-default-router-pods
--------
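For reference, the same post-upgrade check of status.endpointPublishingStrategy could also be done programmatically. The minimal sketch below uses client-go's dynamic client and is an illustration only; the kubeconfig handling, error policy, and output format are assumptions, not part of the verification run above.

// Minimal sketch: read status.endpointPublishingStrategy of the default
// IngressController, roughly equivalent to the oc -ojsonpath | jq check above.
package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Load credentials from the default kubeconfig location (assumption).
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client, err := dynamic.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }

    gvr := schema.GroupVersionResource{
        Group:    "operator.openshift.io",
        Version:  "v1",
        Resource: "ingresscontrollers",
    }

    ic, err := client.Resource(gvr).Namespace("openshift-ingress-operator").
        Get(context.TODO(), "default", metav1.GetOptions{})
    if err != nil {
        panic(err)
    }

    // Read the field defensively: NestedMap reports whether it exists
    // instead of panicking when it is absent.
    strategy, found, err := unstructured.NestedMap(ic.Object, "status", "endpointPublishingStrategy")
    if err != nil || !found {
        fmt.Println("endpointPublishingStrategy not set in status")
        return
    }
    fmt.Printf("endpointPublishingStrategy: %v\n", strategy)
}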
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069