Three azure runs in the last day failed to create the expected number of nodes. This caused an install failure because ingress never became fully available.

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-azure-4.6

Specifically:
1. https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.6/1303312119290138624
2. https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.6/1303077792237228032
3. https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.6/1302961038026608640
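For anyone triaging similar runs against a live cluster, a quick way to confirm this failure mode is to compare the expected and actual machine/node counts and then check the ingress operator. This is only a sketch of typical triage commands, not something taken from these job artifacts; it assumes standard oc access and the default openshift-machine-api namespace:

    # Compare desired replicas against the machines actually created
    oc get machinesets -n openshift-machine-api
    oc get machines -n openshift-machine-api

    # Nodes that registered; ingress stays unavailable until enough workers join
    oc get nodes
    oc get clusteroperator ingress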
(In reply to David Eads from comment #0)
> Three azure runs in the last day failed to create the expected number of
> nodes. This caused an install failure because ingress never became fully
> available.
>
> https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-azure-4.6
>
> Specifically:
> 1. https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.6/1303312119290138624

The master hosts on this cluster must be seriously broken. There are pauses of several minutes between log lines on the machine-controller. There are also possible DNS lookup failures for authentication against the cloud.

> 2. https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.6/1303077792237228032

On this one, the machine-api-operator never successfully rolled out:

    {
        "lastTransitionTime": "2020-09-07T21:29:55Z",
        "lastUpdateTime": "2020-09-07T21:29:55Z",
        "message": "pods \"machine-api-operator-84b76f9dcd-\" is forbidden: unable to validate against any security context constraint: []",
        "reason": "FailedCreate",
        "status": "True",
        "type": "ReplicaFailure"
    },

This looks to be some authorization problem on the cluster.

> 3. https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.6/1302961038026608640

On this one, the machine-controller appeared to be working normally, but it didn't start until:

    I0907 14:11:58.463861 1 main.go:105] Watching machine-api objects only in namespace "openshift-machine-api" for reconciliation.

and the test was killed at:

    2020/09/07 14:27:22 Container setup in pod e2e-azure failed, exit code 1, reason Error

So if the test had run for a few more minutes, the machines would most likely have become nodes. I can see the approved CSR requests for the 2 node bootstrappers, within a normal time frame for Azure (instances take about 15-20 minutes from machine creation to the node joining the cluster). In fact, one worker machine did become a node:

    "creationTimestamp": "2020-09-07T14:28:04Z",

I don't see anything that implicates the machine-api in these failures at this time. These look like resource starvation in the first two cases and a slow install in the last case.
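For completeness, the two checks above can be reproduced roughly as follows. This is a sketch rather than commands pulled from the job artifacts; the deployment name is inferred from the pod name in the error message, and <csr-name> is a placeholder:

    # Surface the ReplicaFailure condition on the operator deployment
    oc get deployment machine-api-operator -n openshift-machine-api -o jsonpath='{.status.conditions}'

    # List CSRs; the node-bootstrapper requests should show Approved,Issued
    oc get csr
    oc describe csr <csr-name>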
*** Bug 1875774 has been marked as a duplicate of this bug. ***
I think this is related to https://bugzilla.redhat.com/show_bug.cgi?id=1877483, specifically this run: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.6/1303077792237228032
Moving this over to the node team to investigate poor master performance.