Bug 1817465
| Summary: | timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Nikolaos Leandros Moraitis <nmoraiti> |
| Component: | Node | Assignee: | Harshal Patil <harpatil> |
| Status: | CLOSED DUPLICATE | QA Contact: | Sunil Choudhary <schoudha> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.5 | CC: | amurdaca, aos-bugs, bparees, harpatil, jokerman, kgarriso, periklis, rphillips, wking |
| Target Milestone: | --- | | |
| Target Release: | 4.5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-05-18 14:53:33 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Nikolaos Leandros Moraitis
2020-03-26 11:49:43 UTC
The master pool was ready and completed the upgrade. From the MCC logs:

    I0326 01:12:38.648998       1 status.go:82] Pool master: All nodes are updated with rendered-master-d03f1a11947f94bcb114b096e51147a3

Something happened after that and the node (a master) is gone:

    I0326 01:17:35.815305       1 node_controller.go:433] Pool master: node ip-10-0-138-27.us-west-2.compute.internal is now reporting unready: node ip-10-0-138-27.us-west-2.compute.internal is reporting OutOfDisk=Unknown

From looking at the logs, the MCO has not caused this. The node's conditions at the time:

    conditions: [
      {
        lastHeartbeatTime: "2020-03-26T01:16:24Z",
        lastTransitionTime: "2020-03-26T01:17:35Z",
        message: "Kubelet stopped posting node status.",
        reason: "NodeStatusUnknown",
        status: "Unknown",
        type: "MemoryPressure"
      },
      {
        lastHeartbeatTime: "2020-03-26T01:16:24Z",
        lastTransitionTime: "2020-03-26T01:17:35Z",
        message: "Kubelet stopped posting node status.",
        reason: "NodeStatusUnknown",
        status: "Unknown",
        type: "DiskPressure"
      },
      {
        lastHeartbeatTime: "2020-03-26T01:16:24Z",
        lastTransitionTime: "2020-03-26T01:17:35Z",
        message: "Kubelet stopped posting node status.",
        reason: "NodeStatusUnknown",
        status: "Unknown",
        type: "PIDPressure"
      },
      {
        lastHeartbeatTime: "2020-03-26T01:16:24Z",
        lastTransitionTime: "2020-03-26T01:17:35Z",
        message: "Kubelet stopped posting node status.",
        reason: "NodeStatusUnknown",
        status: "Unknown",
        type: "Ready"
      }
    ]

How often do we see this error in CI jobs? Could it be the cloud provider failing the machine? (That has happened before, to my knowledge.)

Moving to the node team based on Antonio's assessment of a disappearing node.

This general error (timed out waiting for the condition during syncRequiredMachineConfigPools) is showing up a lot:
https://search.svc.ci.openshift.org/?search=timed+out+waiting+for+the+condition+during+syncRequiredMachineConfigPools&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

In https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/23129/artifacts/junit_operator.xml:

    Cluster operator network Degraded is True with RolloutHung: DaemonSet "openshift-multus/multus" rollout is not making progress - last change 2020-03-26T01:17:35Z
    DaemonSet "openshift-sdn/ovs" rollout is not making progress - last change 2020-03-26T01:17:36Z
    DaemonSet "openshift-sdn/sdn" rollout is not making progress - last change 2020-03-26T01:17:36Z"
    level=info msg="Cluster operator network Progressing is True with Deploying: DaemonSet "openshift-multus/multus" is not available (awaiting 1 nodes)
    DaemonSet "openshift-sdn/ovs" is not available (awaiting 1 nodes)
    DaemonSet "openshift-sdn/sdn" is not available (awaiting 1 nodes)"

*** This bug has been marked as a duplicate of bug 1834895 ***
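For anyone retracing the triage above, here is a minimal sketch (not an artifact of this bug; it assumes client-go and a reachable kubeconfig or in-cluster config) of listing nodes whose Ready condition is no longer True, the same "Kubelet stopped posting node status" state reported for ip-10-0-138-27.us-west-2.compute.internal:

```go
// Hypothetical helper, not part of the MCO or this bug's artifacts: list nodes
// whose Ready condition is not True, the state the MCC logged for the lost master.
package main

import (
	"context"
	"fmt"
	"os"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Falls back to in-cluster config when KUBECONFIG is unset.
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		for _, cond := range node.Status.Conditions {
			// Ready=Unknown with reason NodeStatusUnknown matches the
			// "Kubelet stopped posting node status." symptom above.
			if cond.Type == corev1.NodeReady && cond.Status != corev1.ConditionTrue {
				fmt.Printf("%s: Ready=%s reason=%s message=%q\n",
					node.Name, cond.Status, cond.Reason, cond.Message)
			}
		}
	}
}
```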
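And a rough sketch of the kind of wait the "syncRequiredMachineConfigPools: error pool master is not ready" timeout refers to: polling the master MachineConfigPool until its Updated condition reports True. This is an illustration using the dynamic client, not the operator's actual code; the group/version/resource and condition names are the standard machineconfiguration.openshift.io/v1 ones, and the poll interval and timeout are arbitrary.

```go
// Illustrative only: wait for the master MachineConfigPool to report Updated=True,
// the readiness the "timed out waiting for the condition" failure never reached.
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

var mcpGVR = schema.GroupVersionResource{
	Group:    "machineconfiguration.openshift.io",
	Version:  "v1",
	Resource: "machineconfigpools",
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(config)

	// Poll until the pool reports Updated=True or the timeout expires.
	err = wait.PollImmediate(10*time.Second, 10*time.Minute, func() (bool, error) {
		pool, err := client.Resource(mcpGVR).Get(context.TODO(), "master", metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		conditions, _, err := unstructured.NestedSlice(pool.Object, "status", "conditions")
		if err != nil {
			return false, err
		}
		for _, c := range conditions {
			cond, ok := c.(map[string]interface{})
			if !ok {
				continue
			}
			if cond["type"] == "Updated" && cond["status"] == "True" {
				return true, nil
			}
		}
		return false, nil
	})
	if err != nil {
		fmt.Println("pool master is not ready:", err)
	}
}
```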