Description of problem:

I am upgrading from 4.4.12 to 4.5.2 and the machine-config-operator is consistently running into a CrashLoopBackOff state. I've deleted the pod a few times with no success in getting the pod to do anything different. I don't seem to have a clear way to remediate my cluster, so I'm stuck in the upgrade to 4.5.2.

Version-Release number of selected component (if applicable):

OpenShift 4.5.2 (upgrade from 4.4.12)

How reproducible:

I kicked off the upgrade on my cluster in a connected install. Prior to upgrading, all of my nodes and operators were healthy. I'm not sure if this is reproducible as I don't have a way to re-test this.

Steps to Reproduce:
1. Press upgrade to 4.5.2 in the UI

Actual results:

My machine-config-operator pod logs show the following panic and stack trace:

$ oc logs machine-config-operator-594c89d579-6nz56 -p
I0716 20:13:42.464336 1 start.go:46] Version: 4.5.2 (Raw: v4.5.0-202007131801.p0-dirty, Hash: 4173030d89fbf4a7a0976d1665491a4d9a6e54f1)
I0716 20:13:42.467768 1 leaderelection.go:242] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
E0716 20:15:40.301124 1 event.go:316] Could not construct reference to: '&v1.ConfigMap{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"machine-config", GenerateName:"", Namespace:"openshift-machine-config-operator", SelfLink:"/api/v1/namespaces/openshift-machine-config-operator/configmaps/machine-config", UID:"e5e382bf-c5e4-11ea-8005-001a4a0ab023", ResourceVersion:"1029551", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63730336548, loc:(*time.Location)(0x2530700)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string{"control-plane.alpha.kubernetes.io/leader":"{\"holderIdentity\":\"machine-config-operator-594c89d579-6nz56_908a2725-b997-43a0-83e2-3cebc6a000e8\",\"leaseDurationSeconds\":90,\"acquireTime\":\"2020-07-16T20:15:40Z\",\"renewTime\":\"2020-07-16T20:15:40Z\",\"leaderTransitions\":82}"}, OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Immutable:(*bool)(nil), Data:map[string]string(nil), BinaryData:map[string][]uint8(nil)}' due to: 'no kind is registered for the type v1.ConfigMap in scheme "github.com/openshift/machine-config-operator/cmd/common/helpers.go:30"'. Will not report event: 'Normal' 'LeaderElection' 'machine-config-operator-594c89d579-6nz56_908a2725-b997-43a0-83e2-3cebc6a000e8 became leader'
I0716 20:15:40.301308 1 leaderelection.go:252] successfully acquired lease openshift-machine-config-operator/machine-config
I0716 20:15:40.933008 1 operator.go:265] Starting MachineConfigOperator
E0716 20:15:40.975285 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 276 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1577500, 0x25113c0)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x82
panic(0x1577500, 0x25113c0)
	/opt/rh/go-toolset-1.13/root/usr/lib/go-toolset-1.13-golang/src/runtime/panic.go:679 +0x1b2
github.com/openshift/machine-config-operator/pkg/operator.isCloudConfigRequired(...)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/sync.go:105
github.com/openshift/machine-config-operator/pkg/operator.(*Operator).syncCloudConfig(0xc00017b8c0, 0xc0016d0480, 0xc00073a340, 0x8, 0xe)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/sync.go:120 +0x237
github.com/openshift/machine-config-operator/pkg/operator.(*Operator).syncRenderConfig(0xc00017b8c0, 0x0, 0xc03959fad0, 0x25112261be3)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/sync.go:255 +0x865
github.com/openshift/machine-config-operator/pkg/operator.(*Operator).syncAll(0xc00017b8c0, 0xc000019c98, 0x6, 0x6, 0x43f631, 0xc0002a8600)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/sync.go:59 +0x177
github.com/openshift/machine-config-operator/pkg/operator.(*Operator).sync(0xc00017b8c0, 0xc000577170, 0x30, 0x0, 0x0)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:357 +0x37e
github.com/openshift/machine-config-operator/pkg/operator.(*Operator).processNextWorkItem(0xc00017b8c0, 0xc0005e2600)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:313 +0x102
github.com/openshift/machine-config-operator/pkg/operator.(*Operator).worker(0xc00017b8c0)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:302 +0x2b
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc00019cda0)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00019cda0, 0x19960e0, 0xc000979080, 0xc00060e001, 0xc0000aa300)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xa3
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00019cda0, 0x3b9aca00, 0x0, 0x1, 0xc0000aa300)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0xe2
k8s.io/apimachinery/pkg/util/wait.Until(0xc00019cda0, 0x3b9aca00, 0xc0000aa300)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by github.com/openshift/machine-config-operator/pkg/operator.(*Operator).Run
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:271 +0x41f
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x13faff7]
goroutine 276 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x105
panic(0x1577500, 0x25113c0)
	/opt/rh/go-toolset-1.13/root/usr/lib/go-toolset-1.13-golang/src/runtime/panic.go:679 +0x1b2
github.com/openshift/machine-config-operator/pkg/operator.isCloudConfigRequired(...)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/sync.go:105
github.com/openshift/machine-config-operator/pkg/operator.(*Operator).syncCloudConfig(0xc00017b8c0, 0xc0016d0480, 0xc00073a340, 0x8, 0xe)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/sync.go:120 +0x237
github.com/openshift/machine-config-operator/pkg/operator.(*Operator).syncRenderConfig(0xc00017b8c0, 0x0, 0xc03959fad0, 0x25112261be3)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/sync.go:255 +0x865
github.com/openshift/machine-config-operator/pkg/operator.(*Operator).syncAll(0xc00017b8c0, 0xc000019c98, 0x6, 0x6, 0x43f631, 0xc0002a8600)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/sync.go:59 +0x177
github.com/openshift/machine-config-operator/pkg/operator.(*Operator).sync(0xc00017b8c0, 0xc000577170, 0x30, 0x0, 0x0)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:357 +0x37e
github.com/openshift/machine-config-operator/pkg/operator.(*Operator).processNextWorkItem(0xc00017b8c0, 0xc0005e2600)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:313 +0x102
github.com/openshift/machine-config-operator/pkg/operator.(*Operator).worker(0xc00017b8c0)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:302 +0x2b
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc00019cda0)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00019cda0, 0x19960e0, 0xc000979080, 0xc00060e001, 0xc0000aa300)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xa3
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00019cda0, 0x3b9aca00, 0x0, 0x1, 0xc0000aa300)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0xe2
k8s.io/apimachinery/pkg/util/wait.Until(0xc00019cda0, 0x3b9aca00, 0xc0000aa300)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by github.com/openshift/machine-config-operator/pkg/operator.(*Operator).Run
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:271 +0x41f

My cluster operators are stuck towards the end of the upgrade:

$ oc get co
NAME  VERSION  AVAILABLE  PROGRESSING  DEGRADED  SINCE
authentication  4.5.2  True  False  False  2d5h
cloud-credential  4.5.2  True  False  False  2d5h
cluster-autoscaler  4.5.2  True  False  False  2d5h
config-operator  4.5.2  True  False  False  6h55m
console  4.5.2  True  False  False  51m
csi-snapshot-controller  4.5.2  True  False  False  53m
dns  4.5.2  True  False  False  2d5h
etcd  4.5.2  True  False  False  2d2h
image-registry  4.5.2  True  False  False  2d4h
ingress  4.5.2  True  False  False  2d5h
insights  4.5.2  True  False  False  2d3h
kube-apiserver  4.5.2  True  False  False  2d5h
kube-controller-manager  4.5.2  True  False  False  2d5h
kube-scheduler  4.5.2  True  False  False  2d2h
kube-storage-version-migrator  4.5.2  True  False  False  46m
machine-api  4.5.2  True  False  False  2d5h
machine-approver  4.5.2  True  False  False  6h51m
machine-config  4.4.12  True  True  False  2d
marketplace  4.5.2  True  False  False  47m
monitoring  4.5.2  True  False  False  50m
network  4.5.2  True  False  False  2d5h
node-tuning  4.5.2  True  False  False  6h50m
openshift-apiserver  4.5.2  True  False  False  2d
openshift-controller-manager  4.5.2  True  False  False  2d5h
openshift-samples  4.5.2  True  False  False  6h50m
operator-lifecycle-manager  4.5.2  True  False  False  2d5h
operator-lifecycle-manager-catalog  4.5.2  True  False  False  2d5h
operator-lifecycle-manager-packageserver  4.5.2  True  False  False  47m
service-ca  4.5.2  True  False  False  2d5h
service-catalog-apiserver  4.4.12  True  False  False  2d5h
service-catalog-controller-manager  4.4.12  True  False  False  2d5h
storage  4.5.2  True  False  False  6h50m

Expected results:

The 4.5.2 machine-config-operator should not have encountered the panic and stack trace above, and should have properly rolled out new machineconfigs to all the nodes.

Additional info:
Hi Kevin, Can you please attach a must-gather from this cluster?
In the meantime, I've kicked off a few tests to see if I can replicate it that way while we wait for the must-gather.
Update: I ran 3 tests from 4.4.12 -> 4.5.2 and they all passed.
Hi Kirsten, I created support case #02705174 to attach a large must-gather from this cluster.

Also of note, I attempted and failed to run the must-gather two times before I succeeded. Not entirely sure if it's related.

$ oc adm must-gather
[must-gather ] OUT Using must-gather plugin-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:61198ba5bd46fc26b3d40d83a2fb7f859614f516a7896404b70fa468c8efa5da
[must-gather ] OUT namespace/openshift-must-gather-kgx49 created
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-dtvgp created
[must-gather ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:61198ba5bd46fc26b3d40d83a2fb7f859614f516a7896404b70fa468c8efa5da created
[must-gather-tzmwz] OUT gather did not start: Get https://api.ocp4.csa.gsslab.rdu2.redhat.com:6443/api/v1/namespaces/openshift-must-gather-kgx49/pods/must-gather-tzmwz: unexpected EOF
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-dtvgp deleted
[must-gather ] OUT namespace/openshift-must-gather-kgx49 deleted
error: gather did not start for pod must-gather-tzmwz: Get https://api.ocp4.csa.gsslab.rdu2.redhat.com:6443/api/v1/namespaces/openshift-must-gather-kgx49/pods/must-gather-tzmwz: unexpected EOF

$ oc adm must-gather
[must-gather ] OUT Using must-gather plugin-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:61198ba5bd46fc26b3d40d83a2fb7f859614f516a7896404b70fa468c8efa5da
[must-gather ] OUT namespace/openshift-must-gather-29sqc created
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-hrzwx created
[must-gather ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:61198ba5bd46fc26b3d40d83a2fb7f859614f516a7896404b70fa468c8efa5da created
[must-gather-7gdhh] OUT gather did not start: Get https://api.ocp4.csa.gsslab.rdu2.redhat.com:6443/api/v1/namespaces/openshift-must-gather-29sqc/pods/must-gather-7gdhh: unexpected EOF
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-hrzwx deleted
[must-gather ] OUT namespace/openshift-must-gather-29sqc deleted
error: gather did not start for pod must-gather-7gdhh: Get https://api.ocp4.csa.gsslab.rdu2.redhat.com:6443/api/v1/namespaces/openshift-must-gather-29sqc/pods/must-gather-7gdhh: unexpected EOF

Kevin
Notes going through the logs:

Looking at the MCP:

```
  - lastTransitionTime: "2020-07-17T00:23:15Z"
    message: All nodes are updating to rendered-worker-237e0016060efcefe9cebacf7a047840
    reason: ""
    status: "True"
    type: Updating
  configuration:
    name: rendered-worker-237e0016060efcefe9cebacf7a047840
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-chrony-configuration
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-e9125565-c5e4-11ea-8005-001a4a0ab023-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
  degradedMachineCount: 0
  machineCount: 3
  observedGeneration: 8
  readyMachineCount: 2
  unavailableMachineCount: 1
  updatedMachineCount: 3
```

3 updated machines but 1 unavailable?

But looking at MCC logs the day before:

2020-07-16T19:42:38.865102277Z I0716 19:42:38.865026 1 status.go:82] Pool worker: All nodes are updated with rendered-worker-237e0016060efcefe9cebacf7a047840

... Many hours later? After the pool was finished? ...

2020-07-17T00:23:10.432537403Z I0717 00:23:10.432234 1 node_controller.go:433] Pool worker: node worker3.ocp4.csa.gsslab.rdu2.redhat.com is now reporting unready: node worker3.ocp4.csa.gsslab.rdu2.redhat.com is reporting Unschedulable
NOTE: in the above post ^^^ rendered-worker-237e0016060efcefe9cebacf7a047840 looks to be an update to 4.4.12.

Another weird thing is that the kubelet_service.log just... cuts off?

```
Jul 16 04:11:43.514356 master1.ocp4.csa.gsslab.rdu2.redhat.com hyperkube[1369]: I0716 04:11:43.514273 1369 prober.go:129] Readiness probe for "console-7cb4fbcc9-6tvmv_openshift-console(8a5e9555-7561-4dd7-862c-5cf687c349f9):console" succeeded
Jul 16 04:11:43.631471 master1.ocp4.csa.gsslab.rdu2.redhat.com hyperkube[1369]: I0716 04:11:43.631388 1369 prober.go:129] Readiness probe for "marketplace-operator-684775c9cd-7cbjj_openshift-marketplace(08616998-805e-4839-aa31-44bac2c7410a):marketplace-operator" succeeded
Jul 16 04:11:43.710163 master1.ocp4.csa.gsslab.rdu2.redhat.com hyperkube[1369]: I0716 04:11:43.710109 1369 prober.go:129] Readiness probe for "oauth-openshift-5755d79585-9mq77_openshift-authentication(a79bde14-22f9-4647-a9fe-814bd1e8433c):oauth-openshift" succeeded
Jul 16 04:11:44.095075 master1.ocp4.csa.gsslab.rdu2.redhat.com hyperkube[1369]: I0716 04:11:44.095028 1369 prober.go:129] Liveness probe for "kube-apiserver-master1.ocp4.csa.gsslab.rdu2.redhat.com_openshift-kube-apiserver(b045dd46f58ef123b734cb62cd0d4b36):kube-apiserver" succeeded
Jul 16 04:11:44.177008 master1.ocp4.csa.gsslab.rdu2.redhat.com hyperkube[1369]: I0716 04:11:44.176957 1369 prober.go:129] Readiness pro
```

Even though the masters seem to have updated to 4.4.12 just fine?

```
  - lastTransitionTime: "2020-07-16T20:38:51Z"
    message: All nodes are updated with rendered-master-25beb3d199cfc42b52b3c2034c96497c
    reason: ""
    status: "True"
    type: Updated
```

So the masters were alive but kubelet_service.log stopped?
This shouldn't be related to the worker pool, however. Investigating further.
To add some context, this cluster was built from scratch three days ago starting at 4.1.0, and I stepped through a number of upgrades from the stable channels, all successful with no issues (as displayed in 'oc get clusterversion'). I left the cluster fully functional on 4.4.12 for a couple of days before I attempted the 4.5.2 upgrade when it became available yesterday.

The worker3 node that was reporting Unschedulable is actually a result of my comment #4: I wasn't able to run a must-gather on that node for some reason, so I cordoned it to try running the must-gather from a different node, but the pod spun up on worker3 again anyway and this time succeeded (that's the must-gather I uploaded). So you can ignore the Unschedulable node.
Created attachment 1701585 [details] Web console for this OCP cluster
I've also attached a screenshot of the web console showing the current state of my OpenShift cluster. The machine-config-operator pod is down, but everything else is up.

Also, here's the state of each of the nodes:

$ oc get nodes -o wide
NAME  STATUS  ROLES  AGE  VERSION  INTERNAL-IP  EXTERNAL-IP  OS-IMAGE  KERNEL-VERSION  CONTAINER-RUNTIME
master1.ocp4.csa.gsslab.rdu2.redhat.com  Ready  master  3d5h  v1.17.1+a1af596  10.10.179.161  <none>  Red Hat Enterprise Linux CoreOS 44.81.202007070223-0 (Ootpa)  4.18.0-147.20.1.el8_1.x86_64  cri-o://1.17.4-19.rhaos4.4.gitfb8131a.el8
master2.ocp4.csa.gsslab.rdu2.redhat.com  Ready  master  3d5h  v1.17.1+a1af596  10.10.179.180  <none>  Red Hat Enterprise Linux CoreOS 44.81.202007070223-0 (Ootpa)  4.18.0-147.20.1.el8_1.x86_64  cri-o://1.17.4-19.rhaos4.4.gitfb8131a.el8
master3.ocp4.csa.gsslab.rdu2.redhat.com  Ready  master  3d5h  v1.17.1+a1af596  10.10.179.175  <none>  Red Hat Enterprise Linux CoreOS 44.81.202007070223-0 (Ootpa)  4.18.0-147.20.1.el8_1.x86_64  cri-o://1.17.4-19.rhaos4.4.gitfb8131a.el8
worker1.ocp4.csa.gsslab.rdu2.redhat.com  Ready  worker  3d5h  v1.17.1+a1af596  10.10.179.176  <none>  Red Hat Enterprise Linux CoreOS 44.81.202007070223-0 (Ootpa)  4.18.0-147.20.1.el8_1.x86_64  cri-o://1.17.4-19.rhaos4.4.gitfb8131a.el8
worker2.ocp4.csa.gsslab.rdu2.redhat.com  Ready  worker  3d5h  v1.17.1+a1af596  10.10.179.177  <none>  Red Hat Enterprise Linux CoreOS 44.81.202007070223-0 (Ootpa)  4.18.0-147.20.1.el8_1.x86_64  cri-o://1.17.4-19.rhaos4.4.gitfb8131a.el8
worker3.ocp4.csa.gsslab.rdu2.redhat.com  Ready  worker  3d5h  v1.17.1+a1af596  10.10.179.178  <none>  Red Hat Enterprise Linux CoreOS 44.81.202007070223-0 (Ootpa)  4.18.0-147.20.1.el8_1.x86_64  cri-o://1.17.4-19.rhaos4.4.gitfb8131a.el8
my must-gather: https://drive.google.com/file/d/1EwgsYjv7UVpif1EhwjO6oPpTsiyzKvyy/view?usp=sharing
For reference, Kevin's cluster is RHEV installed as bare metal and Jason's is bare metal.
I think the `syncCloudConfig()` bit here is key. On bare metal the cloud config won't exist. The code looks like it's trying to handle it not existing, but the bug likely lies there.

https://github.com/openshift/machine-config-operator/blob/1f52e483b93ffd88ba7d8217b273357e61e0cc6a/pkg/operator/sync.go#L131 last touched this.
Sorry I meant https://github.com/openshift/machine-config-operator/commit/e7455dcb4e0150e00f78e0ae4954b73047d1bf75
@Colin The weird thing is that the bare metal team already updated that two months ago and removed bare metal from that list (along with oVirt): https://github.com/openshift/machine-config-operator/commit/7c6e1ba9dbcec56f02f13b071664e160d9552b16
must-gather includes:

$ grep -r 'layer not known'
host_service_logs/masters/crio_service.log:Jul 16 19:37:59.241925 master1.ocp4.csa.gsslab.rdu2.redhat.com crio[1321]: time="2020-07-16 19:37:59.232234035Z" level=warning msg="failed to stop container k8s_packageserver_packageserver-54646bfd7d-58h7p_openshift-operator-lifecycle-manager_633f410c-6d1b-48d4-9277-7157e922ea49_0 in pod sandbox 253281d149544387f315b280380987d361a9f7539351c13ab13b39d731a4ff4b: layer not known" id=af68b800-ff16-47df-881c-fc50099950b0

which is suspicious for bug 1857224. Although I'm not clear yet on how the sync corruption discussed there would cause "failed to stop container" errors instead of "failed to create container" errors.
@wking That error seems to coincide with the upgrade to 4.4.12 (which finished around 19:45), not the subsequent upgrade to 4.5.2 (as best as I can follow the logs).
Just double checking the configs in infrastructure/cluster.yaml:

```
spec:
  cloudConfig:
    name: ""
status:
  ...
  platform: None
```

and in config.openshift.io/infrastructures.yaml:

```
spec:
  cloudConfig:
    name: ""
status:
  ...
  platform: None
```

I believe `switch infra.Status.PlatformStatus.Type` is the problem. If that does not exist, I get the same panic!
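To make the failure mode concrete, here is a minimal standalone Go sketch (illustrative only, not the MCO source; the object construction is hypothetical) assuming the openshift/api config/v1 types. Status.PlatformStatus is a pointer, and on a platform: None cluster whose infrastructure object predates platformStatus it is simply nil, so selecting .Type dereferences a nil pointer, which matches the SIGSEGV in the stack trace above:

```
package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

func main() {
	// Mirrors the infrastructure object quoted above: platform is None and
	// platformStatus was never populated, so the pointer is nil.
	infra := &configv1.Infrastructure{
		Status: configv1.InfrastructureStatus{
			Platform: configv1.NonePlatformType,
			// PlatformStatus intentionally left nil, as on clusters born before platformStatus existed
		},
	}

	// Shape of the failing code path: evaluating infra.Status.PlatformStatus.Type
	// dereferences the nil pointer and panics with
	// "invalid memory address or nil pointer dereference".
	switch infra.Status.PlatformStatus.Type {
	case configv1.NonePlatformType:
		fmt.Println("no cloud config required")
	default:
		fmt.Println("cloud config may be required")
	}
}
```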
Copying over my summary from the PR:

infra.Status.PlatformStatus is a *PlatformStatus, and in pre-4.5.x bare metal setups this entire thing is empty; only platform is set to None. So when we hit the switch statement in the MCO we panic because of the nil pointer, so check that it's really there before you do the switch statement case comparisons. Before my fix, the unit test I added failed with the same panic; now it passes and the function returns false.

This behavior was seen in bare metal clusters updating from 4.4.12 -> 4.5.2, which is also when Platform was deprecated in favor of PlatformStatus and the MCO was missing the check. The checks do exist in the MCC transitioning them to the new type.

As for why we saw this with users but not in CI: AFAIK there isn't an e2e metal upgrade job anywhere, and I believe the existing metal job would just install a 4.5.x cluster with the new PlatformStatus.
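A hedged sketch of the shape of that fix (the helper name and the platform list here are illustrative, not the exact MCO diff): guard on PlatformStatus being populated before switching on its Type, so a platform: None cluster with no platformStatus simply reports that no cloud config is required instead of panicking.

```
package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

// cloudConfigRequiredSketch is an illustrative stand-in for the MCO helper:
// the platform list below is an example, the nil guard is the point.
func cloudConfigRequiredSketch(infra *configv1.Infrastructure) bool {
	if infra.Spec.CloudConfig.Name != "" {
		return true
	}
	// The guard: platform: None clusters upgraded from older releases carry
	// no platformStatus at all, so bail out before touching the pointer.
	if infra.Status.PlatformStatus == nil {
		return false
	}
	switch infra.Status.PlatformStatus.Type {
	case configv1.AzurePlatformType, configv1.OpenStackPlatformType, configv1.VSpherePlatformType:
		return true
	default:
		return false
	}
}

func main() {
	// The case the new unit test exercises: nil PlatformStatus no longer
	// panics and the helper reports that no cloud config is needed.
	infra := &configv1.Infrastructure{
		Status: configv1.InfrastructureStatus{Platform: configv1.NonePlatformType},
	}
	fmt.Println(cloudConfigRequiredSketch(infra)) // prints "false"
}
```

With that guard in place, the nil-PlatformStatus case that previously panicked just returns false, which is what the unit test described above asserts.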
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it's always been like this, we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z or 4.y.z to 4.y.z+1
For clusters such as mine with an upgrade in limbo due to this machine-config-operator bug, how should we recover once this errata is published? Should we just force an upgrade to the latest 4.5.z?
Draft impact statement, to be updated as we get more information:

Who is impacted?
- Customers upgrading to 4.5.2 with platform: None, which is some subset of bare metal deployments

What is the impact? Is it serious enough to warrant blocking edges?
- While the upgrade is rolling out the MCO panics and the upgrade is blocked. This will happen to every `platform: None` deployment.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
- We are currently investigating remediation, but so far have no confirmed fix.

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
- No, this is new in 4.5.2
To test this fix, you will need a cluster upgraded from 4.1 -> 4.4.12. Verify that it has platform: None and no platformStatus set in the infrastructure object, then upgrade to a master build which contains the fix. The expectation is that you should upgrade successfully and not hit the above MCO panic.
We can reproduce it with a baremetal-on-OSP cluster upgraded from 4.1.41 -> 4.2.36 -> 4.3.29 -> 4.4.12 -> 4.5.2. For details, please refer to [1].

Upgrading across 3 y-versions, such as from 4.4.12 -> 4.6, is not officially supported. That means, to test it, we would need to upgrade a cluster from 4.1.41 -> 4.2.36 -> 4.3.29 -> 4.4.12 -> 4.5.2 -> 4.6. However, when it comes to 4.5, it will definitely fail. To bypass the issue, we thought we could edit the infrastructure object and remove platformStatus to mimic this case. It did work on a 4.4 cluster: with platformStatus removed on a freshly installed 4.4 cluster, the upgrade failed. For details, please refer to [2]. We tried the same operations on a freshly installed 4.5 cluster, but platformStatus could not be removed. So without the fix in 4.5, we're getting stuck here.

[1] https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/upgrade_CI/3931/console
[2] https://gitlab.cee.redhat.com/openshift-qe/qe-40-blog/-/blob/master/gpei/BZ%231858026_reproduce.md
Thanks @yangyang for the update, it does look like you reproduced the bug correctly. I'm wondering if there is a way to take a 4.5 CI build from the 4.5 PR. Let me try this out, I'm pretty sure I've done it in the past. Will update shortly.
I think the most expedient thing will be to merge the fix into 4.5 since it's nearly impossible to test in 4.6 and QE has a confirmed reproducer. I'm going to override the bugzilla/valid-bug based on this reasoning.
SGTM
*** Bug 1859781 has been marked as a duplicate of this bug. ***
Verified upgrade from 4.4.13 -> 4.5.0-0.nightly-2020-07-24-091850 using the reproducer of removing `platformStatus.type=None`.

[root@helper openshift]# oc get clusterversion
NAME  VERSION  AVAILABLE  PROGRESSING  SINCE  STATUS
version  4.4.13  True  False  16m  Cluster version is 4.4.13

[root@helper openshift]# oc get co
NAME  VERSION  AVAILABLE  PROGRESSING  DEGRADED  SINCE
authentication  4.4.13  True  False  False  17m
cloud-credential  4.4.13  True  False  False  62m
cluster-autoscaler  4.4.13  True  False  False  37m
console  4.4.13  True  False  False  20m
csi-snapshot-controller  4.4.13  True  False  False  22m
dns  4.4.13  True  False  False  45m
etcd  4.4.13  True  False  False  44m
image-registry  4.4.13  True  False  False  38m
ingress  4.4.13  True  False  False  22m
insights  4.4.13  True  False  False  38m
kube-apiserver  4.4.13  True  False  False  43m
kube-controller-manager  4.4.13  True  False  False  44m
kube-scheduler  4.4.13  True  False  False  43m
kube-storage-version-migrator  4.4.13  True  False  False  22m
machine-api  4.4.13  True  False  False  38m
machine-config  4.4.13  True  False  False  45m
marketplace  4.4.13  True  False  False  37m
monitoring  4.4.13  True  False  False  20m
network  4.4.13  True  False  False  46m
node-tuning  4.4.13  True  False  False  46m
openshift-apiserver  4.4.13  True  False  False  40m
openshift-controller-manager  4.4.13  True  False  False  37m
openshift-samples  4.4.13  True  False  False  36m
operator-lifecycle-manager  4.4.13  True  False  False  45m
operator-lifecycle-manager-catalog  4.4.13  True  False  False  45m
operator-lifecycle-manager-packageserver  4.4.13  True  False  False  40m
service-ca  4.4.13  True  False  False  46m
service-catalog-apiserver  4.4.13  True  False  False  46m
service-catalog-controller-manager  4.4.13  True  False  False  46m
storage  4.4.13  True  False  False  37m

[root@helper openshift]# oc get infrastructure -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: Infrastructure
  metadata:
    creationTimestamp: "2020-07-24T12:18:06Z"
    generation: 1
    name: cluster
    resourceVersion: "430"
    selfLink: /apis/config.openshift.io/v1/infrastructures/cluster
    uid: 09e21c21-e9ab-4686-880a-7ab31e0ac80f
  spec:
    cloudConfig:
      name: ""
  status:
    apiServerInternalURI: https://api-int.ocp4.example.com:6443
    apiServerURL: https://api.ocp4.example.com:6443
    etcdDiscoveryDomain: ocp4.example.com
    infrastructureName: ocp4-j52w2
    platform: None
    platformStatus:
      type: None
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

[root@helper openshift]# oc edit infrastructure
infrastructure.config.openshift.io/cluster edited

[root@helper openshift]# oc get infrastructure -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: Infrastructure
  metadata:
    creationTimestamp: "2020-07-24T12:18:06Z"
    generation: 2
    name: cluster
    resourceVersion: "32704"
    selfLink: /apis/config.openshift.io/v1/infrastructures/cluster
    uid: 09e21c21-e9ab-4686-880a-7ab31e0ac80f
  spec:
    cloudConfig:
      name: ""
  status:
    apiServerInternalURI: https://api-int.ocp4.example.com:6443
    apiServerURL: https://api.ocp4.example.com:6443
    etcdDiscoveryDomain: ocp4.example.com
    infrastructureName: ocp4-j52w2
    platform: None
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

[root@helper openshift]# oc adm upgrade --force --allow-explicit-upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-07-24-091850
Updating to release image registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-07-24-091850

[root@helper openshift]# watch oc get clusterversion
[root@helper openshift]# oc get clusterversion
NAME  VERSION  AVAILABLE  PROGRESSING  SINCE  STATUS
version  4.4.13  True  True  10s  Working towards registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-07-24-091850: downloading update

[root@helper openshift]# watch oc get clusterversion
[root@helper openshift]# oc get clusterversion
NAME  VERSION  AVAILABLE  PROGRESSING  SINCE  STATUS
version  4.4.13  True  True  24s  Unable to apply 4.5.0-0.nightly-2020-07-24-091850: the workload openshift-cluster-version/cluster-version-operator has not yet successfully rolled out

[root@helper openshift]# watch oc get clusterversion
[root@helper openshift]# oc get co
NAME  VERSION  AVAILABLE  PROGRESSING  DEGRADED  SINCE
authentication  4.4.13  True  False  False  19m
cloud-credential  4.4.13  True  False  False  63m
cluster-autoscaler  4.4.13  True  False  False  38m
config-operator
console  4.4.13  True  False  False  21m
csi-snapshot-controller  4.4.13  True  False  False  23m
dns  4.4.13  True  False  False  46m
etcd  4.4.13  True  False  False  45m
image-registry  4.4.13  True  False  False  39m
ingress  4.4.13  True  False  False  24m
insights  4.4.13  True  False  False  39m
kube-apiserver  4.4.13  True  False  False  45m
kube-controller-manager  4.4.13  True  False  False  45m
kube-scheduler  4.4.13  True  False  False  45m
kube-storage-version-migrator  4.4.13  True  False  False  23m
machine-api  4.4.13  True  False  False  39m
machine-approver
machine-config  4.4.13  True  False  False  46m
marketplace  4.4.13  True  False  False  39m
monitoring  4.4.13  True  False  False  21m
network  4.4.13  True  False  False  48m
node-tuning  4.4.13  True  False  False  48m
openshift-apiserver  4.4.13  True  False  False  41m
openshift-controller-manager  4.4.13  True  False  False  39m
openshift-samples  4.4.13  True  False  False  38m
operator-lifecycle-manager  4.4.13  True  False  False  47m
operator-lifecycle-manager-catalog  4.4.13  True  False  False  47m
operator-lifecycle-manager-packageserver  4.4.13  True  False  False  42m
service-ca  4.4.13  True  False  False  48m
service-catalog-apiserver  4.4.13  True  False  False  48m
service-catalog-controller-manager  4.4.13  True  False  False  48m
storage  4.4.13  True  False  False  39m

[root@helper openshift]# watch oc get clusterversion
[root@helper openshift]# oc get co
NAME  VERSION  AVAILABLE  PROGRESSING  DEGRADED  SINCE
authentication  4.5.0-0.nightly-2020-07-24-091850  True  False  False  38m
cloud-credential  4.5.0-0.nightly-2020-07-24-091850  True  False  False  82m
cluster-autoscaler  4.5.0-0.nightly-2020-07-24-091850  True  False  False  57m
config-operator  4.5.0-0.nightly-2020-07-24-091850  True  False  False  16m
console  4.5.0-0.nightly-2020-07-24-091850  True  False  False  8m10s
csi-snapshot-controller  4.5.0-0.nightly-2020-07-24-091850  True  False  False  42m
dns  4.5.0-0.nightly-2020-07-24-091850  True  True  False  65m
etcd  4.5.0-0.nightly-2020-07-24-091850  True  False  False  64m
image-registry  4.5.0-0.nightly-2020-07-24-091850  True  False  False  58m
ingress  4.5.0-0.nightly-2020-07-24-091850  True  False  False  43m
insights  4.5.0-0.nightly-2020-07-24-091850  True  False  False  58m
kube-apiserver  4.5.0-0.nightly-2020-07-24-091850  True  False  False  64m
kube-controller-manager  4.5.0-0.nightly-2020-07-24-091850  True  False  False  64m
kube-scheduler  4.5.0-0.nightly-2020-07-24-091850  True  False  False  64m
kube-storage-version-migrator  4.5.0-0.nightly-2020-07-24-091850  True  False  False  10m
machine-api  4.5.0-0.nightly-2020-07-24-091850  True  False  False  58m
machine-approver  4.5.0-0.nightly-2020-07-24-091850  True  False  False  11m
machine-config  4.4.13  True  False  False  6m30s
marketplace  4.5.0-0.nightly-2020-07-24-091850  True  False  False  9m13s
monitoring  4.5.0-0.nightly-2020-07-24-091850  True  False  False  7m39s
network  4.5.0-0.nightly-2020-07-24-091850  True  False  False  67m
node-tuning  4.5.0-0.nightly-2020-07-24-091850  True  False  False  10m
openshift-apiserver  4.5.0-0.nightly-2020-07-24-091850  True  False  False  10m
openshift-controller-manager  4.5.0-0.nightly-2020-07-24-091850  True  False  False  58m
openshift-samples  4.5.0-0.nightly-2020-07-24-091850  True  False  False  9m13s
operator-lifecycle-manager  4.5.0-0.nightly-2020-07-24-091850  True  False  False  66m
operator-lifecycle-manager-catalog  4.5.0-0.nightly-2020-07-24-091850  True  False  False  66m
operator-lifecycle-manager-packageserver  4.5.0-0.nightly-2020-07-24-091850  True  False  False  8m59s
service-ca  4.5.0-0.nightly-2020-07-24-091850  True  False  False  66m
service-catalog-apiserver  4.4.13  True  False  False  67m
service-catalog-controller-manager  4.4.13  True  False  False  67m
storage  4.5.0-0.nightly-2020-07-24-091850  True  False  False  11m

[root@helper openshift]# oc -n openshift-machine-config-operator get pods
NAME  READY  STATUS  RESTARTS  AGE
etcd-quorum-guard-54896968c-kzxpc  1/1  Running  0  65m
etcd-quorum-guard-54896968c-prcl7  1/1  Running  0  65m
etcd-quorum-guard-54896968c-xlnz2  1/1  Running  0  65m
machine-config-controller-5b89ddfc68-zd8mb  1/1  Running  1  66m
machine-config-daemon-68xgq  2/2  Running  0  67m
machine-config-daemon-7b2dx  2/2  Running  0  45m
machine-config-daemon-j6nz7  2/2  Running  0  45m
machine-config-daemon-vlglq  2/2  Running  0  67m
machine-config-daemon-vzhb6  2/2  Running  0  67m
machine-config-operator-59bbb54b9c-nb7td  1/1  Running  0  64s
machine-config-server-llnz2  1/1  Running  0  66m
machine-config-server-wpwrv  1/1  Running  0  66m
machine-config-server-z76jk  1/1  Running  0  66m

[root@helper openshift]# oc -n openshift-machine-config-operator logs -f machine-config-operator-59bbb54b9c-nb7td
I0724 13:40:37.239693 1 start.go:46] Version: 4.5.0-0.nightly-2020-07-24-091850 (Raw: v4.5.0-202007240519.p0-dirty, Hash: 99eb744f5094224edb60d88ca85d607ab151ebdf)
I0724 13:40:37.244312 1 leaderelection.go:242] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
^C

[root@helper openshift]# oc get clusterversion
NAME  VERSION  AVAILABLE  PROGRESSING  SINCE  STATUS
version  4.5.0-0.nightly-2020-07-24-091850  True  False  2m39s  Cluster version is 4.5.0-0.nightly-2020-07-24-091850

[root@helper openshift]# oc get co
NAME  VERSION  AVAILABLE  PROGRESSING  DEGRADED  SINCE
authentication  4.5.0-0.nightly-2020-07-24-091850  True  False  False  65m
cloud-credential  4.5.0-0.nightly-2020-07-24-091850  True  False  False  109m
cluster-autoscaler  4.5.0-0.nightly-2020-07-24-091850  True  False  False  84m
config-operator  4.5.0-0.nightly-2020-07-24-091850  True  False  False  43m
console  4.5.0-0.nightly-2020-07-24-091850  True  False  False  15m
csi-snapshot-controller  4.5.0-0.nightly-2020-07-24-091850  True  False  False  20m
dns  4.5.0-0.nightly-2020-07-24-091850  True  False  False  92m
etcd  4.5.0-0.nightly-2020-07-24-091850  True  False  False  91m
image-registry  4.5.0-0.nightly-2020-07-24-091850  True  False  False  85m
ingress  4.5.0-0.nightly-2020-07-24-091850  True  False  False  69m
insights  4.5.0-0.nightly-2020-07-24-091850  True  False  False  85m
kube-apiserver  4.5.0-0.nightly-2020-07-24-091850  True  False  False  91m
kube-controller-manager  4.5.0-0.nightly-2020-07-24-091850  True  False  False  91m
kube-scheduler  4.5.0-0.nightly-2020-07-24-091850  True  False  False  91m
kube-storage-version-migrator  4.5.0-0.nightly-2020-07-24-091850  True  False  False  17m
machine-api  4.5.0-0.nightly-2020-07-24-091850  True  False  False  85m
machine-approver  4.5.0-0.nightly-2020-07-24-091850  True  False  False  38m
machine-config  4.5.0-0.nightly-2020-07-24-091850  True  False  False  5m22s
marketplace  4.5.0-0.nightly-2020-07-24-091850  True  False  False  14m
monitoring  4.5.0-0.nightly-2020-07-24-091850  True  False  False  34m
network  4.5.0-0.nightly-2020-07-24-091850  True  False  False  93m
node-tuning  4.5.0-0.nightly-2020-07-24-091850  True  False  False  36m
openshift-apiserver  4.5.0-0.nightly-2020-07-24-091850  True  False  False  7m23s
openshift-controller-manager  4.5.0-0.nightly-2020-07-24-091850  True  False  False  84m
openshift-samples  4.5.0-0.nightly-2020-07-24-091850  True  False  False  36m
operator-lifecycle-manager  4.5.0-0.nightly-2020-07-24-091850  True  False  False  92m
operator-lifecycle-manager-catalog  4.5.0-0.nightly-2020-07-24-091850  True  False  False  92m
operator-lifecycle-manager-packageserver  4.5.0-0.nightly-2020-07-24-091850  True  False  False  6m58s
service-ca  4.5.0-0.nightly-2020-07-24-091850  True  False  False  93m
storage  4.5.0-0.nightly-2020-07-24-091850  True  False  False  37m
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196