Some CI runs are failing on Azure, for example:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-4.6/1308571357872656384

The first indication is that there are no worker nodes. Upon further investigation, the machine-controller is not deployed, and this is because the machine-api-operator is not deployed.

Checking artifacts/deployment.json reveals:

    {
        "lastTransitionTime": "2020-09-23T01:15:10Z",
        "lastUpdateTime": "2020-09-23T01:15:10Z",
        "message": "Deployment does not have minimum availability.",
        "reason": "MinimumReplicasUnavailable",
        "status": "False",
        "type": "Available"
    },
    {
        "lastTransitionTime": "2020-09-23T01:15:10Z",
        "lastUpdateTime": "2020-09-23T01:15:10Z",
        "message": "pods \"machine-api-operator-5c99d74d58-\" is forbidden: unable to validate against any security context constraint: []",
        "reason": "FailedCreate",
        "status": "True",
        "type": "ReplicaFailure"
    },
    {
        "lastTransitionTime": "2020-09-23T01:25:11Z",
        "lastUpdateTime": "2020-09-23T01:25:11Z",
        "message": "ReplicaSet \"machine-api-operator-5c99d74d58\" has timed out progressing.",
        "reason": "ProgressDeadlineExceeded",
        "status": "False",
        "type": "Progressing"
    }

Many operators seem broken, including kube-apiserver:

    Operator unavailable (StaticPods_ZeroNodesActive): StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 2

I am unsure what the root cause is.

For reference, the initial 3 master machines are created by the installer and join the bootstrap cluster. The machine-api has no control over the initial creation or configuration of the master machines.
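The manual check of artifacts/deployment.json described above could be scripted roughly as follows. This is only a sketch: it assumes the artifact is a Kubernetes list object with an "items" array, and the function name failing_conditions and the sample data are hypothetical, built from the conditions quoted in this report.

```python
def failing_conditions(deployment_list, name):
    """Return (type, reason, message) for each unhealthy condition of one deployment.

    Assumption: deployment_list is the parsed JSON of a DeploymentList-style
    object, i.e. a dict with an "items" array of deployments.
    """
    problems = []
    for item in deployment_list.get("items", []):
        if item.get("metadata", {}).get("name") != name:
            continue
        for cond in item.get("status", {}).get("conditions", []):
            ctype = cond.get("type")
            status = cond.get("status")
            if ctype in ("Available", "Progressing"):
                # These conditions are healthy only when status is "True".
                unhealthy = status != "True"
            elif ctype == "ReplicaFailure":
                # This condition is a problem only when status is "True".
                unhealthy = status == "True"
            else:
                unhealthy = False
            if unhealthy:
                problems.append((ctype, cond.get("reason"), cond.get("message")))
    return problems


# Hypothetical sample mirroring the conditions quoted in this report.
sample = {
    "items": [
        {
            "metadata": {"name": "machine-api-operator"},
            "status": {
                "conditions": [
                    {"type": "Available", "status": "False",
                     "reason": "MinimumReplicasUnavailable",
                     "message": "Deployment does not have minimum availability."},
                    {"type": "ReplicaFailure", "status": "True",
                     "reason": "FailedCreate",
                     "message": "pods \"machine-api-operator-5c99d74d58-\" is forbidden: "
                                "unable to validate against any security context constraint: []"},
                    {"type": "Progressing", "status": "False",
                     "reason": "ProgressDeadlineExceeded",
                     "message": "ReplicaSet \"machine-api-operator-5c99d74d58\" has timed out progressing."},
                ]
            },
        }
    ]
}

result = failing_conditions(sample, "machine-api-operator")
for ctype, reason, message in result:
    print(f"{ctype}: {reason}: {message}")
```

To run it against a real artifact, load the file with json.load and pass the resulting dict in place of sample.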
I would like to see an [early] test that checks whether these events are present and, if so, reports a flake. I think this may happen more often on Azure: something about the LB, perhaps? Also, I think I see logs that indicate an i/o timeout from the CVO to the internal LB.
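The core of such an early check might look like the sketch below. It is not an existing test in the suite. It assumes events are available as parsed JSON dicts with a "message" field, and the names SCC_MARKER and scc_rejections are hypothetical. Matching on the message text from the deployment condition above is also an assumption about how the failure shows up in events.

```python
# Substring taken from the ReplicaFailure message quoted in this report.
SCC_MARKER = "unable to validate against any security context constraint"


def scc_rejections(events):
    """Return the events whose message shows a pod rejected by SCC validation."""
    return [e for e in events if SCC_MARKER in e.get("message", "")]


# Hypothetical events for illustration only.
events = [
    {"reason": "FailedCreate",
     "message": "Error creating: pods \"machine-api-operator-5c99d74d58-\" is forbidden: "
                "unable to validate against any security context constraint: []"},
    {"reason": "ScalingReplicaSet",
     "message": "Scaled up replica set machine-api-operator-5c99d74d58 to 1"},
]

hits = scc_rejections(events)
if hits:
    # A real test would report a flake here instead of printing.
    print(f"flake: {len(hits)} SCC rejection event(s) found")
```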
*** This bug has been marked as a duplicate of bug 1883458 ***