Bug 1785737 - Upgrade from 4.2.12 to 4.3 failed; MCO reports "controller version mismatch"
Summary: Upgrade from 4.2.12 to 4.3 failed; MCO reports "controller version mismatch"
Keywords:
Status: CLOSED DUPLICATE of bug 1778904
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Micah Abbott
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-12-20 19:47 UTC by Miciah Dashiel Butler Masters
Modified: 2020-01-30 15:34 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-30 15:34:27 UTC
Target Upstream Version:
Embargoed:



Description Miciah Dashiel Butler Masters 2019-12-20 19:47:20 UTC
Description of problem:

Upgrading from 4.2.12 to 4.3.0-0.nightly-2019-12-20-145137 got stuck at 13%:

    Cluster did not complete upgrade: timed out waiting for the condition: Working towards 4.3.0-0.nightly-2019-12-20-145137: 13% complete

The machine-config clusteroperator reported the following status conditions:

                    "conditions": [
                        {
                            "lastTransitionTime": "2019-12-20T16:16:38Z",
                            "message": "Cluster not available for 4.3.0-0.nightly-2019-12-20-145137",
                            "status": "False",
                            "type": "Available"
                        },
                        {
                            "lastTransitionTime": "2019-12-20T16:00:59Z",
                            "message": "Working towards 4.3.0-0.nightly-2019-12-20-145137",
                            "status": "True",
                            "type": "Progressing"
                        },
                        {
                            "lastTransitionTime": "2019-12-20T16:16:38Z",
                            "message": "Unable to apply 4.3.0-0.nightly-2019-12-20-145137: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-f0cd2de7cae40c363de564c65600efa1 expected 23a6e6fb37e73501bc3216183ef5e6ebb15efc7a has d780d197a9c5848ba786982c0c4aaa7487297046, retrying",
                            "reason": "RequiredPoolsFailed",
                            "status": "True",
                            "type": "Degraded"
                        },
                        {
                            "lastTransitionTime": "2019-12-20T15:22:23Z",
                            "reason": "AsExpected",
                            "status": "True",
                            "type": "Upgradeable"
                        }
                    ],

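The overall upgrade status and the conditions above can be inspected on a live cluster with something like the following (a sketch, assuming oc access to the affected cluster and, for the second command, jq):

    # Overall upgrade / cluster version status
    oc get clusterversion

    # Status conditions of the machine-config clusteroperator
    oc get clusteroperator machine-config -o json | jq '.status.conditions'
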
Two of the nodes report "Kubelet stopped posting node status." in all their status conditions.
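
A quick way to confirm the node symptom (again a sketch, assuming oc access; substitute an affected node's name for the placeholder):

    # List nodes and their Ready status
    oc get nodes

    # Show the full status conditions for one of the NotReady nodes
    oc describe node <node-name>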

See https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/13004

Additional info:

The "controller version mismatch" error looks similar to bug 1781141 ("controller version mismatch" when upgrading from 4.2.9 to 4.3).

Comment 1 Kirsten Garrison 2019-12-20 22:37:01 UTC
I don't see this error in the MCC logs: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/13004/artifacts/e2e-aws-upgrade/pods/openshift-machine-config-operator_machine-config-controller-64798bf44b-vwxfv_machine-config-controller.log

Looking at the machine config pools, it seems the cluster was in the middle of the upgrade and nothing is degraded; one worker and one master are unavailable: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/13004/artifacts/e2e-aws-upgrade/machineconfigpools.json
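
If I'm reading the mismatch error right, the MCO compares the controller hash recorded on the rendered MachineConfig against the hash the new controller expects, so something like the following should show what generated the rendered master config (a sketch; it assumes the generated-by-controller-version annotation and reuses the rendered config name from the error above):

    # Controller version recorded on the rendered master config
    oc get machineconfig rendered-master-f0cd2de7cae40c363de564c65600efa1 \
      -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/generated-by-controller-version}'

    # Current state of the master pool
    oc get machineconfigpool master -o yaml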



The problem seems to lie in the kube-apiserver:
Dec 20 16:15:09.190: INFO: cluster upgrade is Progressing: Working towards 4.3.0-0.nightly-2019-12-20-145137: 84% complete, waiting on machine-config
Dec 20 16:15:19.186: INFO: cluster upgrade is Progressing: Unable to apply 4.3.0-0.nightly-2019-12-20-145137: the cluster operator kube-apiserver is degraded
Dec 20 16:15:19.186: INFO: cluster upgrade is Failing: Cluster operator kube-apiserver is reporting a failure: NodeControllerDegraded: The master node(s) "ip-10-0-136-200.ec2.internal" not ready 
...
Dec 20 16:23:39.186: INFO: cluster upgrade is Progressing: Working towards 4.3.0-0.nightly-2019-12-20-145137: 13% complete
Dec 20 16:23:49.185: INFO: cluster upgrade is Progressing: Unable to apply 4.3.0-0.nightly-2019-12-20-145137: the cluster operator kube-apiserver is degraded
Dec 20 16:23:49.185: INFO: cluster upgrade is Failing: Cluster operator kube-apiserver is reporting a failure: NodeControllerDegraded: The master node(s) "ip-10-0-136-200.ec2.internal" not ready
Dec 20 16:23:59.186: INFO: cluster upgrade is Progressing: Unable to apply 4.3.0-0.nightly-2019-12-20-145137: the cluster operator kube-apiserver is degraded

Also seeing https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/13004:
Dec 20 16:15:18.481 E clusterversion/version changed Failing to True: ClusterOperatorDegraded: Cluster operator kube-apiserver is reporting a failure: NodeControllerDegraded: The master node(s) "ip-10-0-136-200.ec2.internal" not ready
Dec 20 16:16:38.498 E clusteroperator/machine-config changed Degraded to True: RequiredPoolsFailed: Unable to apply 4.3.0-0.nightly-2019-12-20-145137: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-f0cd2de7cae40c363de564c65600efa1 expected 23a6e6fb37e73501bc3216183ef5e6ebb15efc7a has d780d197a9c5848ba786982c0c4aaa7487297046, retrying
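
To see the kube-apiserver side of this on a live cluster (a sketch, assuming oc access; the node name is the one flagged in the logs above):

    # kube-apiserver clusteroperator conditions (the Degraded message carries NodeControllerDegraded)
    oc get clusteroperator kube-apiserver -o json | jq '.status.conditions'

    # The master node the operator is complaining about
    oc get node ip-10-0-136-200.ec2.internal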

Comment 2 Kirsten Garrison 2019-12-20 22:38:02 UTC
Actually, is this a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1778904?

Comment 3 Micah Abbott 2020-01-30 15:34:27 UTC
Per comment #2, closing as DUPLICATE

*** This bug has been marked as a duplicate of bug 1778904 ***

