Bug 1858026
| Summary: | Panic in machine-config-operator when attempting to upgrade to 4.5.2 |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Machine Config Operator |
| Sub component: | Machine Config Operator |
| Status: | CLOSED ERRATA |
| Severity: | urgent |
| Priority: | urgent |
| Version: | 4.5 |
| Target Release: | 4.6.0 |
| Hardware: | x86_64 |
| OS: | Linux |
| Keywords: | Upgrades |
| Reporter: | Kevin Chung <kechung> |
| Assignee: | MCO Team <team-mco> |
| QA Contact: | Rio Liu <rioliu> |
| CC: | alchan, amurdaca, aos-bugs, jbrooks, kgarriso, mkrejci, mnguyen, nschuetz, rioliu, sdodson, sregidor, vpagar, walters, wking, xtian |
| Bug Blocks: | 1858907 |
| Type: | Bug |
| Last Closed: | 2020-10-27 16:15:30 UTC |
Hi Kevin, can you please attach a must-gather from this cluster? I've kicked off a few tests to see if I can replicate that way in the meantime while we wait for the must-gather. Update: I ran 3 tests from 4.4.12 -> 4.5.2 and they all passed.

Hi Kirsten, I created support case #02705174 to attach a large must-gather from this cluster. Also of note, I attempted and failed to run the must-gather two times before I succeeded. Not entirely sure if it's related.

```
$ oc adm must-gather
[must-gather      ] OUT Using must-gather plugin-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:61198ba5bd46fc26b3d40d83a2fb7f859614f516a7896404b70fa468c8efa5da
[must-gather      ] OUT namespace/openshift-must-gather-kgx49 created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-dtvgp created
[must-gather      ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:61198ba5bd46fc26b3d40d83a2fb7f859614f516a7896404b70fa468c8efa5da created
[must-gather-tzmwz] OUT gather did not start: Get https://api.ocp4.csa.gsslab.rdu2.redhat.com:6443/api/v1/namespaces/openshift-must-gather-kgx49/pods/must-gather-tzmwz: unexpected EOF
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-dtvgp deleted
[must-gather      ] OUT namespace/openshift-must-gather-kgx49 deleted
error: gather did not start for pod must-gather-tzmwz: Get https://api.ocp4.csa.gsslab.rdu2.redhat.com:6443/api/v1/namespaces/openshift-must-gather-kgx49/pods/must-gather-tzmwz: unexpected EOF

$ oc adm must-gather
[must-gather      ] OUT Using must-gather plugin-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:61198ba5bd46fc26b3d40d83a2fb7f859614f516a7896404b70fa468c8efa5da
[must-gather      ] OUT namespace/openshift-must-gather-29sqc created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-hrzwx created
[must-gather      ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:61198ba5bd46fc26b3d40d83a2fb7f859614f516a7896404b70fa468c8efa5da created
[must-gather-7gdhh] OUT gather did not start: Get https://api.ocp4.csa.gsslab.rdu2.redhat.com:6443/api/v1/namespaces/openshift-must-gather-29sqc/pods/must-gather-7gdhh: unexpected EOF
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-hrzwx deleted
[must-gather      ] OUT namespace/openshift-must-gather-29sqc deleted
error: gather did not start for pod must-gather-7gdhh: Get https://api.ocp4.csa.gsslab.rdu2.redhat.com:6443/api/v1/namespaces/openshift-must-gather-29sqc/pods/must-gather-7gdhh: unexpected EOF
```

Kevin

Notes going through the logs:
Looking at the worker MCP:

```yaml
- lastTransitionTime: "2020-07-17T00:23:15Z"
  message: All nodes are updating to rendered-worker-237e0016060efcefe9cebacf7a047840
  reason: ""
  status: "True"
  type: Updating
configuration:
  name: rendered-worker-237e0016060efcefe9cebacf7a047840
  source:
  - apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    name: 00-worker
  - apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    name: 01-worker-container-runtime
  - apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    name: 01-worker-kubelet
  - apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    name: 99-worker-chrony-configuration
  - apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    name: 99-worker-e9125565-c5e4-11ea-8005-001a4a0ab023-registries
  - apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    name: 99-worker-ssh
degradedMachineCount: 0
machineCount: 3
observedGeneration: 8
readyMachineCount: 2
unavailableMachineCount: 1
updatedMachineCount: 3
```
3 updated machines but 1 unavailable?
But looking at the MCC logs from the day before:

```
2020-07-16T19:42:38.865102277Z I0716 19:42:38.865026 1 status.go:82] Pool worker: All nodes are updated with rendered-worker-237e0016060efcefe9cebacf7a047840
```

Many hours later? After the pool was finished?

```
2020-07-17T00:23:10.432537403Z I0717 00:23:10.432234 1 node_controller.go:433] Pool worker: node worker3.ocp4.csa.gsslab.rdu2.redhat.com is now reporting unready: node worker3.ocp4.csa.gsslab.rdu2.redhat.com is reporting Unschedulable
```

NOTE: in the above, rendered-worker-237e0016060efcefe9cebacf7a047840 looks to be an update to 4.4.12.
Another weird thing is that the kubelet_service.log just... cuts off?
```
Jul 16 04:11:43.514356 master1.ocp4.csa.gsslab.rdu2.redhat.com hyperkube[1369]: I0716 04:11:43.514273 1369 prober.go:129] Readiness probe for "console-7cb4fbcc9-6tvmv_openshift-console(8a5e9555-7561-4dd7-862c-5cf687c349f9):console" succeeded
Jul 16 04:11:43.631471 master1.ocp4.csa.gsslab.rdu2.redhat.com hyperkube[1369]: I0716 04:11:43.631388 1369 prober.go:129] Readiness probe for "marketplace-operator-684775c9cd-7cbjj_openshift-marketplace(08616998-805e-4839-aa31-44bac2c7410a):marketplace-operator" succeeded
Jul 16 04:11:43.710163 master1.ocp4.csa.gsslab.rdu2.redhat.com hyperkube[1369]: I0716 04:11:43.710109 1369 prober.go:129] Readiness probe for "oauth-openshift-5755d79585-9mq77_openshift-authentication(a79bde14-22f9-4647-a9fe-814bd1e8433c):oauth-openshift" succeeded
Jul 16 04:11:44.095075 master1.ocp4.csa.gsslab.rdu2.redhat.com hyperkube[1369]: I0716 04:11:44.095028 1369 prober.go:129] Liveness probe for "kube-apiserver-master1.ocp4.csa.gsslab.rdu2.redhat.com_openshift-kube-apiserver(b045dd46f58ef123b734cb62cd0d4b36):kube-apiserver" succeeded
Jul 16 04:11:44.177008 master1.ocp4.csa.gsslab.rdu2.redhat.com hyperkube[1369]: I0716 04:11:44.176957 1369 prober.go:129] Readiness pro
```
Even though the masters seem to have updated to 4.4.12 just fine?

```yaml
- lastTransitionTime: "2020-07-16T20:38:51Z"
  message: All nodes are updated with rendered-master-25beb3d199cfc42b52b3c2034c96497c
  reason: ""
  status: "True"
  type: Updated
```

So the masters were alive but kubelet_service.log stopped?
This shouldn't be related to the worker pool, however. Investigating further.

To add some context, this cluster was built from scratch three days ago starting at 4.1.0, and I stepped through a number of upgrades from stable channels, all successful with no issues (displayed in `oc get clusterversion`). I left the cluster fully functional on 4.4.12 for a couple of days before I attempted the 4.5.2 upgrade when it became available yesterday. The worker3 that was reporting Unschedulable was actually a result of my comment #4: I wasn't able to run a must-gather on the node for some reason, cordoned it to attempt the must-gather from a different node, and found the pod spun up on worker3 again anyway, but this time with success (which I uploaded). So you can ignore the Unschedulable node.

Created attachment 1701585 [details]
Web console for this OCP cluster
I've also attached a screenshot of the web console showing the current state of my OpenShift cluster. The machine-config-operator pod is down, but everything else is up. Also, here's the state of each of the nodes:

```
$ oc get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
master1.ocp4.csa.gsslab.rdu2.redhat.com Ready master 3d5h v1.17.1+a1af596 10.10.179.161 <none> Red Hat Enterprise Linux CoreOS 44.81.202007070223-0 (Ootpa) 4.18.0-147.20.1.el8_1.x86_64 cri-o://1.17.4-19.rhaos4.4.gitfb8131a.el8
master2.ocp4.csa.gsslab.rdu2.redhat.com Ready master 3d5h v1.17.1+a1af596 10.10.179.180 <none> Red Hat Enterprise Linux CoreOS 44.81.202007070223-0 (Ootpa) 4.18.0-147.20.1.el8_1.x86_64 cri-o://1.17.4-19.rhaos4.4.gitfb8131a.el8
master3.ocp4.csa.gsslab.rdu2.redhat.com Ready master 3d5h v1.17.1+a1af596 10.10.179.175 <none> Red Hat Enterprise Linux CoreOS 44.81.202007070223-0 (Ootpa) 4.18.0-147.20.1.el8_1.x86_64 cri-o://1.17.4-19.rhaos4.4.gitfb8131a.el8
worker1.ocp4.csa.gsslab.rdu2.redhat.com Ready worker 3d5h v1.17.1+a1af596 10.10.179.176 <none> Red Hat Enterprise Linux CoreOS 44.81.202007070223-0 (Ootpa) 4.18.0-147.20.1.el8_1.x86_64 cri-o://1.17.4-19.rhaos4.4.gitfb8131a.el8
worker2.ocp4.csa.gsslab.rdu2.redhat.com Ready worker 3d5h v1.17.1+a1af596 10.10.179.177 <none> Red Hat Enterprise Linux CoreOS 44.81.202007070223-0 (Ootpa) 4.18.0-147.20.1.el8_1.x86_64 cri-o://1.17.4-19.rhaos4.4.gitfb8131a.el8
worker3.ocp4.csa.gsslab.rdu2.redhat.com Ready worker 3d5h v1.17.1+a1af596 10.10.179.178 <none> Red Hat Enterprise Linux CoreOS 44.81.202007070223-0 (Ootpa) 4.18.0-147.20.1.el8_1.x86_64 cri-o://1.17.4-19.rhaos4.4.gitfb8131a.el8
```

For reference, Kevin's cluster is RHEV installed as bare metal and Jason's is bare metal. I think the `syncCloudConfig()` bit here is key. On bare metal that won't exist. The code looks like it's trying to handle it not existing, but the bug likely lies there.
https://github.com/openshift/machine-config-operator/blob/1f52e483b93ffd88ba7d8217b273357e61e0cc6a/pkg/operator/sync.go#L131 last touched this. Sorry, I meant https://github.com/openshift/machine-config-operator/commit/e7455dcb4e0150e00f78e0ae4954b73047d1bf75

@Colin The weird thing is that the baremetal team already updated that two months ago and removed baremetal from that (along with ovirt): https://github.com/openshift/machine-config-operator/commit/7c6e1ba9dbcec56f02f13b071664e160d9552b16

must-gather includes:

```
$ grep -r 'layer not known'
host_service_logs/masters/crio_service.log:Jul 16 19:37:59.241925 master1.ocp4.csa.gsslab.rdu2.redhat.com crio[1321]: time="2020-07-16 19:37:59.232234035Z" level=warning msg="failed to stop container k8s_packageserver_packageserver-54646bfd7d-58h7p_openshift-operator-lifecycle-manager_633f410c-6d1b-48d4-9277-7157e922ea49_0 in pod sandbox 253281d149544387f315b280380987d361a9f7539351c13ab13b39d731a4ff4b: layer not known" id=af68b800-ff16-47df-881c-fc50099950b0
```

which is suspicious for bug 1857224. Although I'm not clear yet on how the sync corruption discussed there would cause "failed to stop container" errors instead of "failed to create container" errors.

@wking That error seems to coincide with the upgrade to 4.4.12 (which finished around 19:45), not the subsequent upgrade to 4.5.2 (as best as I can follow the logs).

Just double-checking the configs in infrastructure/cluster.yaml:
```
spec:
cloudConfig:
name: ""
status:
...
platform: None
```
config.openshift.io/infrastructures.yaml:
```
spec:
cloudConfig:
name: ""
status:
...
platform: None
```
I believe `switch infra.Status.PlatformStatus.Type` is the problem. If `PlatformStatus` doesn't exist, I get the same panic!
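A standalone sketch of why that dereference crashes. The types below are hand-written stand-ins for the config.openshift.io/v1 Infrastructure structs (the real ones live in github.com/openshift/api), and the platform list in the switch is purely illustrative:

```go
package main

import "fmt"

// Simplified stand-ins for the config.openshift.io/v1 types; field names
// here are illustrative, not the exact upstream definitions.
type PlatformType string

type PlatformStatus struct {
	Type PlatformType
}

type InfrastructureStatus struct {
	Platform       PlatformType    // legacy field: "None" on these clusters
	PlatformStatus *PlatformStatus // nil on clusters installed before this field existed
}

type Infrastructure struct {
	Status InfrastructureStatus
}

// unguardedCheck mirrors the shape of the failing code: it dereferences
// Status.PlatformStatus without a nil check.
func unguardedCheck(infra *Infrastructure) bool {
	switch infra.Status.PlatformStatus.Type { // panics when PlatformStatus is nil
	case "Azure", "GCP", "OpenStack", "VSphere": // illustrative list only
		return true
	default:
		return false
	}
}

// panics reports whether fn panics, recovering so the program can continue.
func panics(fn func()) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
			fmt.Println("recovered:", r)
		}
	}()
	fn()
	return false
}

func main() {
	// A cluster born before platformStatus existed: only the legacy field is set.
	legacy := &Infrastructure{Status: InfrastructureStatus{Platform: "None"}}
	if panics(func() { unguardedCheck(legacy) }) {
		// recover() reports: runtime error: invalid memory address or nil pointer dereference
		fmt.Println("same class of failure as the MCO panic")
	}
}
```

On an object deserialized without `platformStatus`, the pointer stays nil, so the first unguarded field access faults exactly like the stack trace in this bug.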
Copying over my summary from the PR: `infra.Status.PlatformStatus` is a `*PlatformStatus`, and in pre-4.5.x bare-metal setups this entire thing is empty; only `platform` is set to `None`. So when we hit the switch statement in the MCO we panic, because pointers, so check that it's really there before you do the switch-statement case comparisons. Before my fix, the unit test I added failed with the same panic; now it passes and the func returns false. This behavior was seen in bare-metal clusters updating from 4.4.12 -> 4.5.2, which is also when `Platform` was deprecated in favor of `PlatformStatus` and the MCO was missing the check. The checks do exist in the MCC transitioning them to the new type. As for why we saw this with users but not in CI: AFAIK there isn't an e2e metal upgrade job anywhere, and I believe the existing metal job would just install a 4.5.x cluster with the new `PlatformStatus`.

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
- example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
- example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
- example: Up to 2 minute disruption in edge routing
- example: Up to 90 seconds of API downtime
- example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
- example: Issue resolves itself after five minutes
- example: Admin uses oc to fix things
- example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
- example: No, it's always been like this, we just never noticed
- example: Yes, from 4.y.z to 4.y+1.z, or 4.y.z to 4.y.z+1

For clusters such as mine with an upgrade in limbo due to this machine-config-operator bug, once this errata is published, how shall we recover? Shall we just force upgrade to the 4.5.latest?

Draft impact statement, to be updated as we get more information:

Who is impacted?
- Customers upgrading to 4.5.2 with `platform: None`, which is some subset of bare-metal deployments

What is the impact? Is it serious enough to warrant blocking edges?
- When the upgrade is rolling out, the MCO panics and the upgrade is blocked. This will happen to every `platform: None` deployment.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
- We are currently investigating remediation, but so far have no confirmed fix.

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
- No, this is new in 4.5.2

To test this fix, you will need a cluster upgraded from 4.1 -> 4.4.12; verify that it has `platform: None` and no `platformStatus` set in the infrastructure object, then upgrade to master, which contains the fix. The expectation is that you should upgrade successfully and not hit the above MCO panic.
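A minimal sketch of the guarded pattern the fix describes: check the pointer before switching on it, so a nil `PlatformStatus` reports "no cloud config required" instead of panicking. Types are simplified stand-ins and the platform list is illustrative, not the MCO's actual cloud-config logic:

```go
package main

import "fmt"

// Simplified stand-ins for the Infrastructure types; names are illustrative,
// not the exact github.com/openshift/api definitions.
type PlatformType string

type PlatformStatus struct {
	Type PlatformType
}

type InfrastructureStatus struct {
	Platform       PlatformType
	PlatformStatus *PlatformStatus
}

type Infrastructure struct {
	Status InfrastructureStatus
}

// cloudConfigRequired guards the pointer before the switch: a nil
// PlatformStatus now means the function returns false rather than crashing.
func cloudConfigRequired(infra *Infrastructure) bool {
	if infra.Status.PlatformStatus == nil {
		return false
	}
	switch infra.Status.PlatformStatus.Type {
	case "Azure", "GCP", "OpenStack", "VSphere": // illustrative list only
		return true
	default:
		return false
	}
}

func main() {
	// Cluster born before platformStatus existed: legacy field only.
	legacy := &Infrastructure{Status: InfrastructureStatus{Platform: "None"}}
	// Freshly installed cluster with the new field populated.
	modern := &Infrastructure{Status: InfrastructureStatus{
		PlatformStatus: &PlatformStatus{Type: "OpenStack"},
	}}
	fmt.Println(cloudConfigRequired(legacy)) // false, no panic
	fmt.Println(cloudConfigRequired(modern)) // true
}
```

This matches the behavior described in the PR summary: with the guard in place, the unit test that previously panicked passes and the function returns false for the legacy object.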
We can reproduce it with a bare-metal-on-OSP cluster upgraded from 4.1.41 -> 4.2.36 -> 4.3.29 -> 4.4.12 -> 4.5.2. For details, please refer to [1].

Upgrading across 3 y-versions, such as from 4.4.12 -> 4.6, is not officially supported. That means, to test it, we would need to upgrade a cluster from 4.1.41 -> 4.2.36 -> 4.3.29 -> 4.4.12 -> 4.5.2 -> 4.6. However, when it comes to 4.5, it will definitely fail. To bypass the issue, we thought we could edit the infrastructure object and remove platformStatus to mimic this case. It did work on a 4.4 cluster: with platformStatus removed on a freshly installed 4.4 cluster, the upgrade failed. For details, please refer to [2]. We tried the same operations on a freshly installed 4.5 cluster, but platformStatus could not be removed. So without the fix in 4.5, we're getting stuck here.

[1] https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/upgrade_CI/3931/console
[2] https://gitlab.cee.redhat.com/openshift-qe/qe-40-blog/-/blob/master/gpei/BZ%231858026_reproduce.md

Thanks @yangyang for the update; it does look like you reproduced the bug correctly. I'm wondering if there is a way to take a 4.5 CI build from the 4.5 PR. Let me try this out; I'm pretty sure I've done it in the past. Will update shortly.

I think the most expedient thing will be to merge the fix into 4.5, since it's nearly impossible to test in 4.6 and QE has a confirmed reproducer. I'm going to override the bugzilla/valid-bug based on this reasoning.

SGTM

*** Bug 1859781 has been marked as a duplicate of this bug. ***

Verified upgrade from 4.4.13 -> 4.5.0-0.nightly-2020-07-24-091850 using the reproducer of removing `platformStatus.type=None`:
```
[root@helper openshift]# oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.4.13 True False 16m Cluster version is 4.4.13
[root@helper openshift]# oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.4.13 True False False 17m
cloud-credential 4.4.13 True False False 62m
cluster-autoscaler 4.4.13 True False False 37m
console 4.4.13 True False False 20m
csi-snapshot-controller 4.4.13 True False False 22m
dns 4.4.13 True False False 45m
etcd 4.4.13 True False False 44m
image-registry 4.4.13 True False False 38m
ingress 4.4.13 True False False 22m
insights 4.4.13 True False False 38m
kube-apiserver 4.4.13 True False False 43m
kube-controller-manager 4.4.13 True False False 44m
kube-scheduler 4.4.13 True False False 43m
kube-storage-version-migrator 4.4.13 True False False 22m
machine-api 4.4.13 True False False 38m
machine-config 4.4.13 True False False 45m
marketplace 4.4.13 True False False 37m
monitoring 4.4.13 True False False 20m
network 4.4.13 True False False 46m
node-tuning 4.4.13 True False False 46m
openshift-apiserver 4.4.13 True False False 40m
openshift-controller-manager 4.4.13 True False False 37m
openshift-samples 4.4.13 True False False 36m
operator-lifecycle-manager 4.4.13 True False False 45m
operator-lifecycle-manager-catalog 4.4.13 True False False 45m
operator-lifecycle-manager-packageserver 4.4.13 True False False 40m
service-ca 4.4.13 True False False 46m
service-catalog-apiserver 4.4.13 True False False 46m
service-catalog-controller-manager 4.4.13 True False False 46m
storage 4.4.13 True False False 37m
[root@helper openshift]# oc get infrastructure -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: Infrastructure
  metadata:
    creationTimestamp: "2020-07-24T12:18:06Z"
    generation: 1
    name: cluster
    resourceVersion: "430"
    selfLink: /apis/config.openshift.io/v1/infrastructures/cluster
    uid: 09e21c21-e9ab-4686-880a-7ab31e0ac80f
  spec:
    cloudConfig:
      name: ""
  status:
    apiServerInternalURI: https://api-int.ocp4.example.com:6443
    apiServerURL: https://api.ocp4.example.com:6443
    etcdDiscoveryDomain: ocp4.example.com
    infrastructureName: ocp4-j52w2
    platform: None
    platformStatus:
      type: None
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
[root@helper openshift]# oc edit infrastructure
infrastructure.config.openshift.io/cluster edited
[root@helper openshift]# oc get infrastructure -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: Infrastructure
  metadata:
    creationTimestamp: "2020-07-24T12:18:06Z"
    generation: 2
    name: cluster
    resourceVersion: "32704"
    selfLink: /apis/config.openshift.io/v1/infrastructures/cluster
    uid: 09e21c21-e9ab-4686-880a-7ab31e0ac80f
  spec:
    cloudConfig:
      name: ""
  status:
    apiServerInternalURI: https://api-int.ocp4.example.com:6443
    apiServerURL: https://api.ocp4.example.com:6443
    etcdDiscoveryDomain: ocp4.example.com
    infrastructureName: ocp4-j52w2
    platform: None
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
[root@helper openshift]# oc adm upgrade --force --allow-explicit-upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-07-24-091850
Updating to release image registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-07-24-091850
[root@helper openshift]# watch oc get clusterversion
[root@helper openshift]# oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.4.13 True True 10s Working towards registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-07-24-091850: downloading update
[root@helper openshift]# watch oc get clusterversion
[root@helper openshift]# oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.4.13 True True 24s Unable to apply 4.5.0-0.nightly-2020-07-24-091850: the workload openshift-cluster-version/cluster-version-operator has not yet successfully rolled out
[root@helper openshift]# watch oc get clusterversion
[root@helper openshift]# oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.4.13 True False False 19m
cloud-credential 4.4.13 True False False 63m
cluster-autoscaler 4.4.13 True False False 38m
config-operator
console 4.4.13 True False False 21m
csi-snapshot-controller 4.4.13 True False False 23m
dns 4.4.13 True False False 46m
etcd 4.4.13 True False False 45m
image-registry 4.4.13 True False False 39m
ingress 4.4.13 True False False 24m
insights 4.4.13 True False False 39m
kube-apiserver 4.4.13 True False False 45m
kube-controller-manager 4.4.13 True False False 45m
kube-scheduler 4.4.13 True False False 45m
kube-storage-version-migrator 4.4.13 True False False 23m
machine-api 4.4.13 True False False 39m
machine-approver
machine-config 4.4.13 True False False 46m
marketplace 4.4.13 True False False 39m
monitoring 4.4.13 True False False 21m
network 4.4.13 True False False 48m
node-tuning 4.4.13 True False False 48m
openshift-apiserver 4.4.13 True False False 41m
openshift-controller-manager 4.4.13 True False False 39m
openshift-samples 4.4.13 True False False 38m
operator-lifecycle-manager 4.4.13 True False False 47m
operator-lifecycle-manager-catalog 4.4.13 True False False 47m
operator-lifecycle-manager-packageserver 4.4.13 True False False 42m
service-ca 4.4.13 True False False 48m
service-catalog-apiserver 4.4.13 True False False 48m
service-catalog-controller-manager 4.4.13 True False False 48m
storage 4.4.13 True False False 39m
[root@helper openshift]# watch oc get clusterversion
[root@helper openshift]# oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.5.0-0.nightly-2020-07-24-091850 True False False 38m
cloud-credential 4.5.0-0.nightly-2020-07-24-091850 True False False 82m
cluster-autoscaler 4.5.0-0.nightly-2020-07-24-091850 True False False 57m
config-operator 4.5.0-0.nightly-2020-07-24-091850 True False False 16m
console 4.5.0-0.nightly-2020-07-24-091850 True False False 8m10s
csi-snapshot-controller 4.5.0-0.nightly-2020-07-24-091850 True False False 42m
dns 4.5.0-0.nightly-2020-07-24-091850 True True False 65m
etcd 4.5.0-0.nightly-2020-07-24-091850 True False False 64m
image-registry 4.5.0-0.nightly-2020-07-24-091850 True False False 58m
ingress 4.5.0-0.nightly-2020-07-24-091850 True False False 43m
insights 4.5.0-0.nightly-2020-07-24-091850 True False False 58m
kube-apiserver 4.5.0-0.nightly-2020-07-24-091850 True False False 64m
kube-controller-manager 4.5.0-0.nightly-2020-07-24-091850 True False False 64m
kube-scheduler 4.5.0-0.nightly-2020-07-24-091850 True False False 64m
kube-storage-version-migrator 4.5.0-0.nightly-2020-07-24-091850 True False False 10m
machine-api 4.5.0-0.nightly-2020-07-24-091850 True False False 58m
machine-approver 4.5.0-0.nightly-2020-07-24-091850 True False False 11m
machine-config 4.4.13 True False False 6m30s
marketplace 4.5.0-0.nightly-2020-07-24-091850 True False False 9m13s
monitoring 4.5.0-0.nightly-2020-07-24-091850 True False False 7m39s
network 4.5.0-0.nightly-2020-07-24-091850 True False False 67m
node-tuning 4.5.0-0.nightly-2020-07-24-091850 True False False 10m
openshift-apiserver 4.5.0-0.nightly-2020-07-24-091850 True False False 10m
openshift-controller-manager 4.5.0-0.nightly-2020-07-24-091850 True False False 58m
openshift-samples 4.5.0-0.nightly-2020-07-24-091850 True False False 9m13s
operator-lifecycle-manager 4.5.0-0.nightly-2020-07-24-091850 True False False 66m
operator-lifecycle-manager-catalog 4.5.0-0.nightly-2020-07-24-091850 True False False 66m
operator-lifecycle-manager-packageserver 4.5.0-0.nightly-2020-07-24-091850 True False False 8m59s
service-ca 4.5.0-0.nightly-2020-07-24-091850 True False False 66m
service-catalog-apiserver 4.4.13 True False False 67m
service-catalog-controller-manager 4.4.13 True False False 67m
storage 4.5.0-0.nightly-2020-07-24-091850 True False False 11m
[root@helper openshift]# oc -n openshift-machine-config-operator get pods
NAME READY STATUS RESTARTS AGE
etcd-quorum-guard-54896968c-kzxpc 1/1 Running 0 65m
etcd-quorum-guard-54896968c-prcl7 1/1 Running 0 65m
etcd-quorum-guard-54896968c-xlnz2 1/1 Running 0 65m
machine-config-controller-5b89ddfc68-zd8mb 1/1 Running 1 66m
machine-config-daemon-68xgq 2/2 Running 0 67m
machine-config-daemon-7b2dx 2/2 Running 0 45m
machine-config-daemon-j6nz7 2/2 Running 0 45m
machine-config-daemon-vlglq 2/2 Running 0 67m
machine-config-daemon-vzhb6 2/2 Running 0 67m
machine-config-operator-59bbb54b9c-nb7td 1/1 Running 0 64s
machine-config-server-llnz2 1/1 Running 0 66m
machine-config-server-wpwrv 1/1 Running 0 66m
machine-config-server-z76jk 1/1 Running 0 66m
[root@helper openshift]# oc -n openshift-machine-config-operator logs -f machine-config-operator-59bbb54b9c-nb7td
I0724 13:40:37.239693 1 start.go:46] Version: 4.5.0-0.nightly-2020-07-24-091850 (Raw: v4.5.0-202007240519.p0-dirty, Hash: 99eb744f5094224edb60d88ca85d607ab151ebdf)
I0724 13:40:37.244312 1 leaderelection.go:242] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
^C
[root@helper openshift]# oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.5.0-0.nightly-2020-07-24-091850 True False 2m39s Cluster version is 4.5.0-0.nightly-2020-07-24-091850
[root@helper openshift]# oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.5.0-0.nightly-2020-07-24-091850 True False False 65m
cloud-credential 4.5.0-0.nightly-2020-07-24-091850 True False False 109m
cluster-autoscaler 4.5.0-0.nightly-2020-07-24-091850 True False False 84m
config-operator 4.5.0-0.nightly-2020-07-24-091850 True False False 43m
console 4.5.0-0.nightly-2020-07-24-091850 True False False 15m
csi-snapshot-controller 4.5.0-0.nightly-2020-07-24-091850 True False False 20m
dns 4.5.0-0.nightly-2020-07-24-091850 True False False 92m
etcd 4.5.0-0.nightly-2020-07-24-091850 True False False 91m
image-registry 4.5.0-0.nightly-2020-07-24-091850 True False False 85m
ingress 4.5.0-0.nightly-2020-07-24-091850 True False False 69m
insights 4.5.0-0.nightly-2020-07-24-091850 True False False 85m
kube-apiserver 4.5.0-0.nightly-2020-07-24-091850 True False False 91m
kube-controller-manager 4.5.0-0.nightly-2020-07-24-091850 True False False 91m
kube-scheduler 4.5.0-0.nightly-2020-07-24-091850 True False False 91m
kube-storage-version-migrator 4.5.0-0.nightly-2020-07-24-091850 True False False 17m
machine-api 4.5.0-0.nightly-2020-07-24-091850 True False False 85m
machine-approver 4.5.0-0.nightly-2020-07-24-091850 True False False 38m
machine-config 4.5.0-0.nightly-2020-07-24-091850 True False False 5m22s
marketplace 4.5.0-0.nightly-2020-07-24-091850 True False False 14m
monitoring 4.5.0-0.nightly-2020-07-24-091850 True False False 34m
network 4.5.0-0.nightly-2020-07-24-091850 True False False 93m
node-tuning 4.5.0-0.nightly-2020-07-24-091850 True False False 36m
openshift-apiserver 4.5.0-0.nightly-2020-07-24-091850 True False False 7m23s
openshift-controller-manager 4.5.0-0.nightly-2020-07-24-091850 True False False 84m
openshift-samples 4.5.0-0.nightly-2020-07-24-091850 True False False 36m
operator-lifecycle-manager 4.5.0-0.nightly-2020-07-24-091850 True False False 92m
operator-lifecycle-manager-catalog 4.5.0-0.nightly-2020-07-24-091850 True False False 92m
operator-lifecycle-manager-packageserver 4.5.0-0.nightly-2020-07-24-091850 True False False 6m58s
service-ca 4.5.0-0.nightly-2020-07-24-091850 True False False 93m
storage 4.5.0-0.nightly-2020-07-24-091850 True False False 37m
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
Description of problem:

I am upgrading from 4.4.12 to 4.5.2 and the machine-config-operator is consistently running into a CrashLoopBackOff state. I've deleted the pod a few times with no success in getting the pod to do anything different. I don't seem to have a clear way to remediate my cluster, so I'm stuck in the upgrade to 4.5.2.

Version-Release number of selected component (if applicable):

OpenShift 4.5.2 (upgrade from 4.4.12)

How reproducible:

I kicked off the upgrade on my cluster in a connected install. Prior to upgrading, all of my nodes and operators were healthy. I'm not sure if this is reproducible, as I don't have a way to re-test this.

Steps to Reproduce:
1. Press upgrade to 4.5.2 in the UI

Actual results:

My machine-config-operator pod logs show the following panic and stack trace:

```
$ oc logs machine-config-operator-594c89d579-6nz56 -p
I0716 20:13:42.464336 1 start.go:46] Version: 4.5.2 (Raw: v4.5.0-202007131801.p0-dirty, Hash: 4173030d89fbf4a7a0976d1665491a4d9a6e54f1)
I0716 20:13:42.467768 1 leaderelection.go:242] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
E0716 20:15:40.301124 1 event.go:316] Could not construct reference to: '&v1.ConfigMap{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"machine-config", GenerateName:"", Namespace:"openshift-machine-config-operator", SelfLink:"/api/v1/namespaces/openshift-machine-config-operator/configmaps/machine-config", UID:"e5e382bf-c5e4-11ea-8005-001a4a0ab023", ResourceVersion:"1029551", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63730336548, loc:(*time.Location)(0x2530700)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string{"control-plane.alpha.kubernetes.io/leader":"{\"holderIdentity\":\"machine-config-operator-594c89d579-6nz56_908a2725-b997-43a0-83e2-3cebc6a000e8\",\"leaseDurationSeconds\":90,\"acquireTime\":\"2020-07-16T20:15:40Z\",\"renewTime\":\"2020-07-16T20:15:40Z\",\"leaderTransitions\":82}"}, OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Immutable:(*bool)(nil), Data:map[string]string(nil), BinaryData:map[string][]uint8(nil)}' due to: 'no kind is registered for the type v1.ConfigMap in scheme "github.com/openshift/machine-config-operator/cmd/common/helpers.go:30"'. Will not report event: 'Normal' 'LeaderElection' 'machine-config-operator-594c89d579-6nz56_908a2725-b997-43a0-83e2-3cebc6a000e8 became leader'
I0716 20:15:40.301308 1 leaderelection.go:252] successfully acquired lease openshift-machine-config-operator/machine-config
I0716 20:15:40.933008 1 operator.go:265] Starting MachineConfigOperator
E0716 20:15:40.975285 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 276 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1577500, 0x25113c0)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x82
panic(0x1577500, 0x25113c0)
	/opt/rh/go-toolset-1.13/root/usr/lib/go-toolset-1.13-golang/src/runtime/panic.go:679 +0x1b2
github.com/openshift/machine-config-operator/pkg/operator.isCloudConfigRequired(...)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/sync.go:105
github.com/openshift/machine-config-operator/pkg/operator.(*Operator).syncCloudConfig(0xc00017b8c0, 0xc0016d0480, 0xc00073a340, 0x8, 0xe)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/sync.go:120 +0x237
github.com/openshift/machine-config-operator/pkg/operator.(*Operator).syncRenderConfig(0xc00017b8c0, 0x0, 0xc03959fad0, 0x25112261be3)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/sync.go:255 +0x865
github.com/openshift/machine-config-operator/pkg/operator.(*Operator).syncAll(0xc00017b8c0, 0xc000019c98, 0x6, 0x6, 0x43f631, 0xc0002a8600)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/sync.go:59 +0x177
github.com/openshift/machine-config-operator/pkg/operator.(*Operator).sync(0xc00017b8c0, 0xc000577170, 0x30, 0x0, 0x0)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:357 +0x37e
github.com/openshift/machine-config-operator/pkg/operator.(*Operator).processNextWorkItem(0xc00017b8c0, 0xc0005e2600)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:313 +0x102
github.com/openshift/machine-config-operator/pkg/operator.(*Operator).worker(0xc00017b8c0)
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:302 +0x2b
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc00019cda0)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00019cda0, 0x19960e0, 0xc000979080, 0xc00060e001, 0xc0000aa300)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xa3
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00019cda0, 0x3b9aca00, 0x0, 0x1, 0xc0000aa300)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0xe2
k8s.io/apimachinery/pkg/util/wait.Until(0xc00019cda0, 0x3b9aca00, 0xc0000aa300)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by github.com/openshift/machine-config-operator/pkg/operator.(*Operator).Run
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:271 +0x41f
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x13faff7]
goroutine 276 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x105
panic(0x1577500, 0x25113c0)
	/opt/rh/go-toolset-1.13/root/usr/lib/go-toolset-1.13-golang/src/runtime/panic.go:679 +0x1b2
github.com/openshift/machine-config-operator/pkg/operator.isCloudConfigRequired(...)
```
/go/src/github.com/openshift/machine-config-operator/pkg/operator/sync.go:105 github.com/openshift/machine-config-operator/pkg/operator.(*Operator).syncCloudConfig(0xc00017b8c0, 0xc0016d0480, 0xc00073a340, 0x8, 0xe) /go/src/github.com/openshift/machine-config-operator/pkg/operator/sync.go:120 +0x237 github.com/openshift/machine-config-operator/pkg/operator.(*Operator).syncRenderConfig(0xc00017b8c0, 0x0, 0xc03959fad0, 0x25112261be3) /go/src/github.com/openshift/machine-config-operator/pkg/operator/sync.go:255 +0x865 github.com/openshift/machine-config-operator/pkg/operator.(*Operator).syncAll(0xc00017b8c0, 0xc000019c98, 0x6, 0x6, 0x43f631, 0xc0002a8600) /go/src/github.com/openshift/machine-config-operator/pkg/operator/sync.go:59 +0x177 github.com/openshift/machine-config-operator/pkg/operator.(*Operator).sync(0xc00017b8c0, 0xc000577170, 0x30, 0x0, 0x0) /go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:357 +0x37e github.com/openshift/machine-config-operator/pkg/operator.(*Operator).processNextWorkItem(0xc00017b8c0, 0xc0005e2600) /go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:313 +0x102 github.com/openshift/machine-config-operator/pkg/operator.(*Operator).worker(0xc00017b8c0) /go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:302 +0x2b k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc00019cda0) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5e k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00019cda0, 0x19960e0, 0xc000979080, 0xc00060e001, 0xc0000aa300) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xa3 k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00019cda0, 0x3b9aca00, 0x0, 0x1, 0xc0000aa300) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0xe2 
k8s.io/apimachinery/pkg/util/wait.Until(0xc00019cda0, 0x3b9aca00, 0xc0000aa300)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by github.com/openshift/machine-config-operator/pkg/operator.(*Operator).Run
	/go/src/github.com/openshift/machine-config-operator/pkg/operator/operator.go:271 +0x41f

My cluster operators are stuck towards the end of the upgrade:

$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.5.2     True        False         False      2d5h
cloud-credential                           4.5.2     True        False         False      2d5h
cluster-autoscaler                         4.5.2     True        False         False      2d5h
config-operator                            4.5.2     True        False         False      6h55m
console                                    4.5.2     True        False         False      51m
csi-snapshot-controller                    4.5.2     True        False         False      53m
dns                                        4.5.2     True        False         False      2d5h
etcd                                       4.5.2     True        False         False      2d2h
image-registry                             4.5.2     True        False         False      2d4h
ingress                                    4.5.2     True        False         False      2d5h
insights                                   4.5.2     True        False         False      2d3h
kube-apiserver                             4.5.2     True        False         False      2d5h
kube-controller-manager                    4.5.2     True        False         False      2d5h
kube-scheduler                             4.5.2     True        False         False      2d2h
kube-storage-version-migrator              4.5.2     True        False         False      46m
machine-api                                4.5.2     True        False         False      2d5h
machine-approver                           4.5.2     True        False         False      6h51m
machine-config                             4.4.12    True        True          False      2d
marketplace                                4.5.2     True        False         False      47m
monitoring                                 4.5.2     True        False         False      50m
network                                    4.5.2     True        False         False      2d5h
node-tuning                                4.5.2     True        False         False      6h50m
openshift-apiserver                        4.5.2     True        False         False      2d
openshift-controller-manager               4.5.2     True        False         False      2d5h
openshift-samples                          4.5.2     True        False         False      6h50m
operator-lifecycle-manager                 4.5.2     True        False         False      2d5h
operator-lifecycle-manager-catalog         4.5.2     True        False         False      2d5h
operator-lifecycle-manager-packageserver   4.5.2     True        False         False      47m
service-ca                                 4.5.2     True        False         False      2d5h
service-catalog-apiserver                  4.4.12    True        False         False      2d5h
service-catalog-controller-manager         4.4.12    True        False         False      2d5h
storage                                    4.5.2     True        False         False      6h50m

Expected results:
The 4.5.2 machine-config-operator should not have panicked with the stack trace above, and should have rolled out the new machineconfigs to all nodes.

Additional info:
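For illustration only: the trace shows the nil pointer dereference originating in isCloudConfigRequired (sync.go:105), reached via syncCloudConfig during syncRenderConfig. The sketch below is NOT the actual MCO code; it uses hypothetical simplified types (Infrastructure, InfraSpec, CloudConfigRef are made up for this example) to show the general failure mode, where a nested optional field is dereferenced without a nil check, and the kind of guard that avoids it on clusters that have no cloud config:

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the operator's infrastructure
// config; these are NOT the real MCO/OpenShift API types.
type CloudConfigRef struct{ Name string }

type InfraSpec struct {
	// nil on platforms that provide no cloud config (e.g. bare metal)
	CloudConfig *CloudConfigRef
}

type Infrastructure struct {
	Spec *InfraSpec
}

// cloudConfigRequiredUnsafe shows the failure mode: dereferencing the
// nested pointer chain without checks panics with exactly the
// "invalid memory address or nil pointer dereference" seen above
// whenever Spec or CloudConfig is nil.
func cloudConfigRequiredUnsafe(infra *Infrastructure) bool {
	return infra.Spec.CloudConfig.Name != ""
}

// cloudConfigRequiredSafe guards every pointer on the path before
// dereferencing, so a cluster without a cloud config simply reports
// that none is required instead of crashing the operator.
func cloudConfigRequiredSafe(infra *Infrastructure) bool {
	if infra == nil || infra.Spec == nil || infra.Spec.CloudConfig == nil {
		return false
	}
	return infra.Spec.CloudConfig.Name != ""
}

func main() {
	// A platform with no cloud config: the safe variant returns false,
	// while cloudConfigRequiredUnsafe would panic here.
	bareMetal := &Infrastructure{Spec: &InfraSpec{CloudConfig: nil}}
	fmt.Println(cloudConfigRequiredSafe(bareMetal)) // false

	// A platform that names a cloud config ConfigMap.
	cloud := &Infrastructure{Spec: &InfraSpec{CloudConfig: &CloudConfigRef{Name: "cloud-provider-config"}}}
	fmt.Println(cloudConfigRequiredSafe(cloud)) // true
}
```

Whatever the precise field involved in the real sync.go:105, this class of crash is avoided by validating each optional pointer in the chain before use (or by returning a wrapped error so the sync loop degrades instead of SIGSEGV-ing).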