Bug 1785737 - Upgrade from 4.2.12 to 4.3 failed; MCO reports "controller version mismatch"
Summary: Upgrade from 4.2.12 to 4.3 failed; MCO reports "controller version mismatch"
Keywords:
Status: CLOSED DUPLICATE of bug 1778904
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Micah Abbott
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-12-20 19:47 UTC by Miciah Dashiel Butler Masters
Modified: 2020-01-30 15:34 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-30 15:34:27 UTC
Target Upstream Version:
Embargoed:



Description Miciah Dashiel Butler Masters 2019-12-20 19:47:20 UTC
Description of problem:

Upgrading from 4.2.12 to 4.3.0-0.nightly-2019-12-20-145137 got stuck at 13%:

    Cluster did not complete upgrade: timed out waiting for the condition: Working towards 4.3.0-0.nightly-2019-12-20-145137: 13% complete

The machine-config clusteroperator reported the following status conditions:

                    "conditions": [
                        {
                            "lastTransitionTime": "2019-12-20T16:16:38Z",
                            "message": "Cluster not available for 4.3.0-0.nightly-2019-12-20-145137",
                            "status": "False",
                            "type": "Available"
                        },
                        {
                            "lastTransitionTime": "2019-12-20T16:00:59Z",
                            "message": "Working towards 4.3.0-0.nightly-2019-12-20-145137",
                            "status": "True",
                            "type": "Progressing"
                        },
                        {
                            "lastTransitionTime": "2019-12-20T16:16:38Z",
                            "message": "Unable to apply 4.3.0-0.nightly-2019-12-20-145137: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-f0cd2de7cae40c363de564c65600efa1 expected 23a6e6fb37e73501bc3216183ef5e6ebb15efc7a has d780d197a9c5848ba786982c0c4aaa7487297046, retrying",
                            "reason": "RequiredPoolsFailed",
                            "status": "True",
                            "type": "Degraded"
                        },
                        {
                            "lastTransitionTime": "2019-12-20T15:22:23Z",
                            "reason": "AsExpected",
                            "status": "True",
                            "type": "Upgradeable"
                        }
                    ],

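The overall upgrade status and the conditions above can be inspected on a live cluster with something like the following (a sketch, assuming oc access to the affected cluster and, for the second command, jq):

    # Overall upgrade / cluster version status
    oc get clusterversion

    # Status conditions of the machine-config clusteroperator
    oc get clusteroperator machine-config -o json | jq '.status.conditions'
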
Two of the nodes report "Kubelet stopped posting node status." in all their status conditions.
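
A quick way to confirm the node symptom (again a sketch, assuming oc access; substitute an affected node's name for the placeholder):

    # List nodes and their Ready status
    oc get nodes

    # Show the full status conditions for one of the NotReady nodes
    oc describe node <node-name>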

See https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/13004

Additional info:

The "controller version mismatch" error looks similar to bug 1781141 ("controller version mismatch" when upgrading from 4.2.9 to 4.3).

Comment 1 Kirsten Garrison 2019-12-20 22:37:01 UTC
I don't see this error in the MCC logs: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/13004/artifacts/e2e-aws-upgrade/pods/openshift-machine-config-operator_machine-config-controller-64798bf44b-vwxfv_machine-config-controller.log

Looking at the machine config pools, it seems the cluster was in the middle of the upgrade and nothing is degraded; one worker and one master are unavailable: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/13004/artifacts/e2e-aws-upgrade/machineconfigpools.json
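
If I'm reading the mismatch error right, the MCO compares the controller hash recorded on the rendered MachineConfig against the hash the new controller expects, so something like the following should show what generated the rendered master config (a sketch; it assumes the generated-by-controller-version annotation and reuses the rendered config name from the error above):

    # Controller version recorded on the rendered master config
    oc get machineconfig rendered-master-f0cd2de7cae40c363de564c65600efa1 \
      -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/generated-by-controller-version}'

    # Current state of the master pool
    oc get machineconfigpool master -o yaml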



The problem seems to lie in the kube-apiserver:
Dec 20 16:15:09.190: INFO: cluster upgrade is Progressing: Working towards 4.3.0-0.nightly-2019-12-20-145137: 84% complete, waiting on machine-config
Dec 20 16:15:19.186: INFO: cluster upgrade is Progressing: Unable to apply 4.3.0-0.nightly-2019-12-20-145137: the cluster operator kube-apiserver is degraded
Dec 20 16:15:19.186: INFO: cluster upgrade is Failing: Cluster operator kube-apiserver is reporting a failure: NodeControllerDegraded: The master node(s) "ip-10-0-136-200.ec2.internal" not ready 
...
Dec 20 16:23:39.186: INFO: cluster upgrade is Progressing: Working towards 4.3.0-0.nightly-2019-12-20-145137: 13% complete
Dec 20 16:23:49.185: INFO: cluster upgrade is Progressing: Unable to apply 4.3.0-0.nightly-2019-12-20-145137: the cluster operator kube-apiserver is degraded
Dec 20 16:23:49.185: INFO: cluster upgrade is Failing: Cluster operator kube-apiserver is reporting a failure: NodeControllerDegraded: The master node(s) "ip-10-0-136-200.ec2.internal" not ready
Dec 20 16:23:59.186: INFO: cluster upgrade is Progressing: Unable to apply 4.3.0-0.nightly-2019-12-20-145137: the cluster operator kube-apiserver is degraded

Also seeing https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/13004:
Dec 20 16:15:18.481 E clusterversion/version changed Failing to True: ClusterOperatorDegraded: Cluster operator kube-apiserver is reporting a failure: NodeControllerDegraded: The master node(s) "ip-10-0-136-200.ec2.internal" not ready
Dec 20 16:16:38.498 E clusteroperator/machine-config changed Degraded to True: RequiredPoolsFailed: Unable to apply 4.3.0-0.nightly-2019-12-20-145137: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-f0cd2de7cae40c363de564c65600efa1 expected 23a6e6fb37e73501bc3216183ef5e6ebb15efc7a has d780d197a9c5848ba786982c0c4aaa7487297046, retrying
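
To see the kube-apiserver side of this on a live cluster (a sketch, assuming oc access; the node name is the one flagged in the logs above):

    # kube-apiserver clusteroperator conditions (the Degraded message carries NodeControllerDegraded)
    oc get clusteroperator kube-apiserver -o json | jq '.status.conditions'

    # The master node the operator is complaining about
    oc get node ip-10-0-136-200.ec2.internal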

Comment 2 Kirsten Garrison 2019-12-20 22:38:02 UTC
Actually, is this a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1778904?

Comment 3 Micah Abbott 2020-01-30 15:34:27 UTC
Per comment #2, closing as DUPLICATE

*** This bug has been marked as a duplicate of bug 1778904 ***

