level=info msg="Waiting up to 30m0s for the cluster at https://api.ci-op-vkz0f4ki-77109.origin-ci-int-aws.dev.rhcloud.com:6443 to initialize..." level=info msg="Cluster operator authentication Progressing is True with _WellKnownNotReady: Progressing: got '404 Not Found' status while trying to GET the OAuth well-known https://10.0.138.27:6443/.well-known/oauth-authorization-server endpoint data" level=info msg="Cluster operator authentication Available is False with : " level=info msg="Cluster operator console Progressing is True with SyncLoopRefresh_InProgress: SyncLoopRefreshProgressing: Working toward version 4.4.0-rc.4" level=info msg="Cluster operator console Available is False with Deployment_InsufficientReplicas: DeploymentAvailable: 0 pods available for console deployment" level=error msg="Cluster operator etcd Degraded is True with NodeController_MasterNodesReady: NodeControllerDegraded: The master nodes not ready: node \"ip-10-0-138-27.us-west-2.compute.internal\" not ready since 2020-03-26 01:17:35 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)" level=info msg="Cluster operator insights Disabled is False with : " level=error msg="Cluster operator kube-apiserver Degraded is True with NodeController_MasterNodesReady::NodeInstaller_InstallerPodFailed: NodeControllerDegraded: The master nodes not ready: node \"ip-10-0-138-27.us-west-2.compute.internal\" not ready since 2020-03-26 01:17:35 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)\nNodeInstallerDegraded: 1 nodes are failing on revision 3:\nNodeInstallerDegraded: static pod of revision 3 has been installed, but is not ready while new revision 5 is pending" level=info msg="Cluster operator kube-apiserver Progressing is True with NodeInstaller: NodeInstallerProgressing: 2 nodes are at revision 2; 1 nodes are at revision 5" level=error msg="Cluster operator kube-controller-manager Degraded is True with NodeController_MasterNodesReady: NodeControllerDegraded: The master nodes not ready: node \"ip-10-0-138-27.us-west-2.compute.internal\" not ready since 2020-03-26 01:17:35 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)" level=info msg="Cluster operator kube-controller-manager Progressing is True with NodeInstaller: NodeInstallerProgressing: 1 nodes are at revision 3; 1 nodes are at revision 4; 1 nodes are at revision 6; 0 nodes have achieved new revision 7" level=error msg="Cluster operator kube-scheduler Degraded is True with NodeController_MasterNodesReady: NodeControllerDegraded: The master nodes not ready: node \"ip-10-0-138-27.us-west-2.compute.internal\" not ready since 2020-03-26 01:17:35 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)" level=info msg="Cluster operator kube-scheduler Progressing is True with NodeInstaller: NodeInstallerProgressing: 1 nodes are at revision 4; 2 nodes are at revision 6" level=info msg="Cluster operator machine-config Available is False with : Cluster not available for 4.4.0-rc.4" level=error msg="Cluster operator machine-config Degraded is True with RequiredPoolsFailed: Failed to resync 4.4.0-rc.4 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: false total: 3, ready 2, updated: 3, unavailable: 1)" level=info msg="Cluster operator monitoring Available is False with : " level=info msg="Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack." 
level=error msg="Cluster operator monitoring Degraded is True with UpdatingnodeExporterFailed: Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)" level=error msg="Cluster operator network Degraded is True with RolloutHung: DaemonSet \"openshift-multus/multus\" rollout is not making progress - last change 2020-03-26T01:17:35Z\nDaemonSet \"openshift-sdn/ovs\" rollout is not making progress - last change 2020-03-26T01:17:36Z\nDaemonSet \"openshift-sdn/sdn\" rollout is not making progress - last change 2020-03-26T01:17:36Z" level=info msg="Cluster operator network Progressing is True with Deploying: DaemonSet \"openshift-multus/multus\" is not available (awaiting 1 nodes)\nDaemonSet \"openshift-sdn/ovs\" is not available (awaiting 1 nodes)\nDaemonSet \"openshift-sdn/sdn\" is not available (awaiting 1 nodes)" level=error msg="Cluster operator openshift-apiserver Degraded is True with APIServerDeployment_UnavailablePod: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable" level=fatal msg="failed to initialize the cluster: Cluster operator machine-config is reporting a failure: Failed to resync 4.4.0-rc.4 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: false total: 3, ready 2, updated: 3, unavailable: 1)" 2020/03/26 01:44:56 Container setup in pod e2e-aws-upgrade failed, exit code 1, reason Error 2020/03/26 01:58:53 Copied 124.40MB of artifacts from e2e-aws-upgrade to /logs/artifacts/e2e-aws-upgrade 2020/03/26 01:58:53 Releasing lease for "aws-quota-slice" 2020/03/26 01:58:53 No custom metadata found and prow metadata already exists. Not updating the metadata. 2020/03/26 01:58:54 Ran for 1h7m39s error: could not run steps: step e2e-aws-upgrade failed: template pod "e2e-aws-upgrade" failed: the pod ci-op-vkz0f4ki/e2e-aws-upgrade failed after 1h6m4s (failed containers: setup): ContainerFailed one or more containers exited Container setup exited with code 1, reason Error --- raded is True with RequiredPoolsFailed: Failed to resync 4.4.0-rc.4 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: false total: 3, ready 2, updated: 3, unavailable: 1)" level=info msg="Cluster operator monitoring Available is False with : " level=info msg="Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack." level=error msg="Cluster operator monitoring Degraded is True with UpdatingnodeExporterFailed: Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. 
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/23129
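For anyone re-running this triage: the fatal message points at the master MachineConfigPool. The standard way to look at the pool and nodes on a live cluster would be something like the following (generic oc commands, assuming cluster-admin access; nothing here is specific to this job):

  # Show which pool is stuck and its machine counts
  oc get machineconfigpool master
  oc describe machineconfigpool master
  # Which node is the unready one
  oc get nodes

Note the pool status in the error (total: 3, ready 2, updated: 3, unavailable: 1) already says all three masters took the new rendered config, but one of them never came back Ready.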
The master pool was ready and had completed the upgrade. From the MCC logs:

I0326 01:12:38.648998 1 status.go:82] Pool master: All nodes are updated with rendered-master-d03f1a11947f94bcb114b096e51147a3

Something happened after that, and the node (a master) is gone:

I0326 01:17:35.815305 1 node_controller.go:433] Pool master: node ip-10-0-138-27.us-west-2.compute.internal is now reporting unready: node ip-10-0-138-27.us-west-2.compute.internal is reporting OutOfDisk=Unknown

From the logs, the MCO does not appear to have caused this. All of the node's conditions flipped to Unknown at the same time:

conditions: [
  {
    lastHeartbeatTime: "2020-03-26T01:16:24Z",
    lastTransitionTime: "2020-03-26T01:17:35Z",
    message: "Kubelet stopped posting node status.",
    reason: "NodeStatusUnknown",
    status: "Unknown",
    type: "MemoryPressure"
  },
  {
    lastHeartbeatTime: "2020-03-26T01:16:24Z",
    lastTransitionTime: "2020-03-26T01:17:35Z",
    message: "Kubelet stopped posting node status.",
    reason: "NodeStatusUnknown",
    status: "Unknown",
    type: "DiskPressure"
  },
  {
    lastHeartbeatTime: "2020-03-26T01:16:24Z",
    lastTransitionTime: "2020-03-26T01:17:35Z",
    message: "Kubelet stopped posting node status.",
    reason: "NodeStatusUnknown",
    status: "Unknown",
    type: "PIDPressure"
  },
  {
    lastHeartbeatTime: "2020-03-26T01:16:24Z",
    lastTransitionTime: "2020-03-26T01:17:35Z",
    message: "Kubelet stopped posting node status.",
    reason: "NodeStatusUnknown",
    status: "Unknown",
    type: "Ready"
  }
]

How often do we see this error in CI jobs? Could it be the cloud provider failing the machine? (That has happened before, to my knowledge.)
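For reference, the same condition table can be pulled straight from the API (assuming the node object still exists, on a live cluster or against a must-gather apiserver dump):

  # One line per condition: type, status, reason
  oc get node ip-10-0-138-27.us-west-2.compute.internal \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'

All four conditions going Unknown together with reason NodeStatusUnknown just means the node lifecycle controller marked them after the kubelet missed its heartbeat window; it tells us nothing about which pressure (if any) actually occurred on the machine.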
Moving to the node team based on Antonio's assessment of a disappearing node. This general error (timed out waiting for the condition during syncRequiredMachineConfigPools) is showing up a lot: https://search.svc.ci.openshift.org/?search=timed+out+waiting+for+the+condition+during+syncRequiredMachineConfigPools&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
Job Failure: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/27442
In https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/23129/artifacts/junit_operator.xml:

Cluster operator network Degraded is True with RolloutHung:
  DaemonSet "openshift-multus/multus" rollout is not making progress - last change 2020-03-26T01:17:35Z
  DaemonSet "openshift-sdn/ovs" rollout is not making progress - last change 2020-03-26T01:17:36Z
  DaemonSet "openshift-sdn/sdn" rollout is not making progress - last change 2020-03-26T01:17:36Z
Cluster operator network Progressing is True with Deploying:
  DaemonSet "openshift-multus/multus" is not available (awaiting 1 nodes)
  DaemonSet "openshift-sdn/ovs" is not available (awaiting 1 nodes)
  DaemonSet "openshift-sdn/sdn" is not available (awaiting 1 nodes)
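All three hung DaemonSets are waiting on the same single unready node, consistent with the disappearing master above. On a live cluster the equivalent check would be along these lines (standard oc/kubectl, assuming access to the cluster):

  # Confirm the rollout is stuck rather than merely slow
  oc rollout status daemonset/sdn -n openshift-sdn --timeout=60s
  # See which node the pending pod is scheduled to
  oc get pods -n openshift-sdn -o wide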
*** This bug has been marked as a duplicate of bug 1834895 ***