Bug 1791061

Summary: RequiredPoolsFailed: Unable to apply 4.3.0-0.ci-2020-01-14-000624, controller version mismatch for rendered-master
Product: OpenShift Container Platform
Component: Machine Config Operator
Version: 4.3.0
Target Release: 4.3.0
Status: CLOSED DUPLICATE
Severity: unspecified
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Reporter: W. Trevor King <wking>
Assignee: Antonio Murdaca <amurdaca>
QA Contact: Michael Nguyen <mnguyen>
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2020-01-14 21:20:02 UTC

Description W. Trevor King 2020-01-14 18:56:11 UTC
In 4.2.14 -> 4.3.0-0.ci-2020-01-14-000624 update CI [1]:

  Cluster did not complete upgrade: timed out waiting for the condition: Working towards 4.3.0-0.ci-2020-01-14-000624: 12% complete

with:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/14213/build-log.txt | grep 'changed Degraded'
Jan 14 01:12:37.308 E clusteroperator/openshift-samples changed Degraded to True: APIServerError: error creating samples: imagestreams.image.openshift.io "jboss-datagrid73-openshift" is forbidden: not yet ready to handle request;imagestream update error: imagestreams.image.openshift.io "jboss-datagrid73-openshift" is forbidden: not yet ready to handle request;
Jan 14 01:12:57.040 W clusteroperator/openshift-samples changed Degraded to False
Jan 14 01:13:15.183 E clusteroperator/ingress changed Degraded to True: IngressControllersDegraded: Some ingresscontrollers are degraded: default
Jan 14 01:13:23.556 W clusteroperator/ingress changed Degraded to False
Jan 14 01:13:51.517 E clusteroperator/monitoring changed Degraded to True: UpdatingconfigurationsharingFailed: Failed to rollout the stack. Error: running task Updating configuration sharing failed: failed to retrieve Thanos Querier host: getting Route object failed: routes.route.openshift.io "thanos-querier" not found
Jan 14 01:15:16.767 W clusteroperator/monitoring changed Degraded to False
Jan 14 01:26:26.278 E clusteroperator/authentication changed Degraded to True: RouteHealthDegradedFailedGet: RouteHealthDegraded: failed to GET route: net/http: TLS handshake timeout
Jan 14 01:26:30.044 W clusteroperator/authentication changed Degraded to False
Jan 14 01:35:35.143 E clusteroperator/ingress changed Degraded to True: IngressControllersDegraded: Some ingresscontrollers are degraded: default
Jan 14 01:36:25.626 W clusteroperator/ingress changed Degraded to False
Jan 14 01:36:40.340 E clusteroperator/monitoring changed Degraded to True: UpdatingPrometheusK8SFailed: Failed to rollout the stack. Error: running task Updating Prometheus-k8s failed: waiting for Prometheus Route to become ready failed: waiting for RouteReady of prometheus-k8s: the server is currently unable to handle the request (get routes.route.openshift.io prometheus-k8s)
Jan 14 01:39:03.581 E clusteroperator/kube-controller-manager changed Degraded to True: NodeControllerDegradedMasterNodesReady: NodeControllerDegraded: The master nodes not ready: node "ip-10-0-153-178.ec2.internal" not ready since 2020-01-14 01:37:03 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
Jan 14 01:39:03.587 E clusteroperator/kube-scheduler changed Degraded to True: NodeControllerDegradedMasterNodesReady: NodeControllerDegraded: The master node(s) "ip-10-0-153-178.ec2.internal" not ready
Jan 14 01:39:03.597 E clusteroperator/kube-apiserver changed Degraded to True: NodeControllerDegradedMasterNodesReady: NodeControllerDegraded: The master nodes not ready: node "ip-10-0-153-178.ec2.internal" not ready since 2020-01-14 01:37:03 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
Jan 14 01:47:04.033 E clusteroperator/machine-config changed Degraded to True: RequiredPoolsFailed: Unable to apply 4.3.0-0.ci-2020-01-14-000624: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-82fe37e8ade46de11fede8dd6e975b61 expected 311a01e83034273599889c0b778ab29c3d2d00d8 has d780d197a9c5848ba786982c0c4aaa7487297046, retrying
Jan 14 01:49:03.428 E clusteroperator/network changed Degraded to True: RolloutHung: DaemonSet "openshift-multus/multus" rollout is not making progress - last change 2020-01-14T01:37:04Z\nDaemonSet "openshift-sdn/ovs" rollout is not making progress - last change 2020-01-14T01:37:04Z\nDaemonSet "openshift-sdn/sdn" rollout is not making progress - last change 2020-01-14T01:37:04Z
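
A transition to True that is later followed by False has recovered, so the operators of interest are those whose last transition is still True. A minimal sketch of that filter, assuming the log line layout shown above; the function name still_degraded is hypothetical and not part of any OpenShift tooling:

```shell
# still_degraded: read build-log lines on stdin and print, sorted, the
# cluster operators whose most recent Degraded transition was to True.
# Hypothetical helper; assumes the "clusteroperator/NAME changed Degraded
# to STATE" layout seen in the excerpt above.
still_degraded() {
  awk '
    / changed Degraded to / {
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^clusteroperator\//) {
          name = $i
          sub(/^clusteroperator\//, "", name)
          state = $(i + 4)          # word after "changed Degraded to"
          sub(/:$/, "", state)      # "True:" -> "True"
          last[name] = state        # later lines overwrite earlier ones
          break
        }
      }
    }
    END {
      for (name in last) if (last[name] == "True") print name
    }
  ' | sort
}
```

Piping the curl output above through still_degraded instead of grep prints one operator name per line. Applied to the excerpt above, this leaves kube-apiserver, kube-controller-manager, kube-scheduler, machine-config, monitoring, and network.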

Seems similar to bug 1782152 (4.2) and bug 1782149 (4.3), but those had "rendered-master...not found", not "controller version mismatch".
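
To compare this failure against other runs, it can help to pull the rendered config name and the two hashes out of the "controller version mismatch" message. A sketch, assuming the message wording stays as in the machine-config line above; controller_mismatch is a hypothetical helper name:

```shell
# controller_mismatch: read log lines on stdin and, for each
# "controller version mismatch" message, print three fields:
#   <rendered config> <expected hash> <actual hash>
# Hypothetical helper; the pattern is taken from the message text above.
controller_mismatch() {
  sed -n 's/.*controller version mismatch for \([^ ]*\) expected \([0-9a-f]*\) has \([0-9a-f]*\).*/\1 \2 \3/p'
}
```

Lines without a mismatch message produce no output, so the function can be fed the whole build log.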

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/14213

Comment 1 Antonio Murdaca 2020-01-14 21:20:02 UTC
This is https://bugzilla.redhat.com/show_bug.cgi?id=1786993 which also requires https://bugzilla.redhat.com/show_bug.cgi?id=1789565

*** This bug has been marked as a duplicate of bug 1786993 ***

Comment 2 W. Trevor King 2020-01-15 04:51:25 UTC
Is this really a dup of bug 1786993?  Bug 1786993 was fixed by [1], and I see this same failure in 4.2.14 -> 4.3.0-0.ci-2020-01-14-234604 [2].  That target has:

$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.3.0-0.ci-2020-01-14-234604 | grep machine-config
  machine-config-operator                        https://github.com/openshift/machine-config-operator                        311a01e83034273599889c0b778ab29c3d2d00d8

Which is just before [1]:

$ git --no-pager log --oneline -2 origin/release-4.3
25bb6aeb (origin/release-4.3) Merge pull request #1359 from runcom/osimageurl-race-43
311a01e8 Merge pull request #1361 from rphillips/fixes/1787581_4.3

So 1786993 is still possible, but maybe this is a dup of the 4.2 1789565 (still POST).  I'll keep watching and see if this turns up on any later 4.3 targets...

[1]: https://github.com/openshift/machine-config-operator/pull/1359
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/14334
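
The ordering argument in comment 2 (the target's machine-config-operator commit sits just before the merge of [1]) can also be checked mechanically instead of by reading git log. A sketch, assuming a local clone of the machine-config-operator repository; includes_commit is a hypothetical helper name that wraps git merge-base --is-ancestor:

```shell
# includes_commit TARGET FIX
# Succeeds (exit 0) when FIX is reachable from TARGET, i.e. a build from
# TARGET already contains FIX; exits 1 when it does not.
# Hypothetical helper; must be run inside a clone that has both commits.
includes_commit() {
  git merge-base --is-ancestor "$2" "$1"
}
```

In such a clone, includes_commit 311a01e8 25bb6aeb asks whether the release commit already contains the merge of [1]. Given the log output in comment 2, where 25bb6aeb is the newer commit, that check would be expected to fail.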