periodic-ci-openshift-release-master-nightly-4.9-e2e-azure is failing frequently in CI, see: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-nightly-4.9-e2e-azure

It looks like https://amd64.ocp.releases.ci.openshift.org/releasestream/4.9.0-0.nightly/release/4.9.0-0.nightly-2021-08-04-204446 is the first nightly where things started failing, and it contained a bunch of CSI driver changes.

The event filter on the failing jobs is full of FailedMount messages: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-azure/1423023995355140096/artifacts/e2e-azure/gather-must-gather/artifacts/event-filter.html
Comparing to https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-azure/1422744810065760256/artifacts/e2e-azure/gather-must-gather/artifacts/event-filter.html (the last green run), I see an abundance of events like these:

MountVolume.SetUp failed for volume "machine-api-controllers-tls" : failed to sync secret cache: timed out waiting for the condition
MountVolume.SetUp failed for volume "var-lib-tuned-profiles-data" : failed to sync configmap cache: timed out waiting for the condition

There are 66 such events in total (covering other Secrets/ConfigMaps too), while the green run has only 11 of them. The event is emitted by a node when it can't get a Secret/ConfigMap from the API server. There is nothing storage related in these errors, and I don't know whether the events are the root cause of the failures.
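In case it helps anyone redo the comparison against a live cluster (the must-gather event-filter.html above is just a static rendering of the same events), here is a rough client-go sketch that lists FailedMount events and tallies them per namespace. The file name, flag, and output format are made up for illustration; this is not part of any existing must-gather or CI tooling.

// failedmount_count.go: rough sketch, not part of any OpenShift/must-gather tooling.
// Lists FailedMount events cluster-wide and tallies them per namespace, roughly what
// eyeballing event-filter.html gives you.
package main

import (
	"context"
	"flag"
	"fmt"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	kubeconfig := flag.String("kubeconfig", filepath.Join(homedir.HomeDir(), ".kube", "config"), "path to kubeconfig")
	flag.Parse()

	cfg, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// "reason" is a supported field selector on core/v1 Events, so the API server
	// does the filtering for us.
	events, err := client.CoreV1().Events(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{
		FieldSelector: "reason=FailedMount",
	})
	if err != nil {
		panic(err)
	}

	// Count Event objects per namespace (repeated occurrences are already
	// aggregated by the API server into a single Event with a Count field).
	perNamespace := map[string]int{}
	for _, ev := range events.Items {
		perNamespace[ev.Namespace]++
	}
	fmt.Printf("FailedMount events: %d\n", len(events.Items))
	for ns, n := range perNamespace {
		fmt.Printf("  %s: %d\n", ns, n)
	}
}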
Thanks for looking. The failures all seem to be reporting SyncLoadBalancerFailed errors; I thought storage was the culprit because of those FailedMount messages plus the bad timing of the CSI driver changes. For Ingress, it's reporting this:

./registry-build01-ci-openshift-org-ci-op-24lhkg93-stable-sha256-b3482870e4d8e70ff91da69d12fe1898afbcbadbaf90acd19f9a5047067bda84/namespaces/openshift-ingress-operator/pods/ingress-operator-6b4ccff786-ckz9s/ingress-operator/ingress-operator/logs/current.log:2021-08-06T13:07:36.330587391Z 2021-08-06T13:07:36.330Z ERROR operator.ingress_controller controller/controller.go:298 got retryable error; requeueing {"after": "1m0s", "error": "IngressController is degraded: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1), LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: EnsureBackendPoolDeleted: failed to parse the VMAS ID : getAvailabilitySetNameByID: failed to parse the VMAS ID \nThe kube-controller-manager logs may contain more details.)"}

From a quick look at the other changes around the time of the failures, I don't see any code changes on our side that might explain this.
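For context on that error string: getAvailabilitySetNameByID is presumably the Azure cloud-provider helper (running in kube-controller-manager) that pulls the availability-set name out of a VMAS resource ID. Below is a simplified stand-in of mine, not the real cloud-provider code; the function name, regex, and example IDs are illustrative. It just shows that an empty or malformed ID yields exactly this kind of "failed to parse the VMAS ID" error, which would fit a VM (e.g. a leftover bootstrap node) that isn't in the cluster availability set.

// vmasid.go: simplified illustration, not the actual cloud-provider code.
// An Azure availability-set ID normally looks like:
//   /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/availabilitySets/<name>
package main

import (
	"fmt"
	"regexp"
)

// vmasIDRE captures the availability-set name at the end of the resource ID.
var vmasIDRE = regexp.MustCompile(`.*/availabilitySets/(.+)`)

// availabilitySetNameByID extracts the availability-set name and fails on anything
// that does not look like an availability-set resource ID, including "".
func availabilitySetNameByID(vmasID string) (string, error) {
	matches := vmasIDRE.FindStringSubmatch(vmasID)
	if len(matches) != 2 {
		return "", fmt.Errorf("getAvailabilitySetNameByID: failed to parse the VMAS ID %s", vmasID)
	}
	return matches[1], nil
}

func main() {
	// An empty ID (which is what you would get for a VM that is not in any
	// availability set, e.g. a leftover bootstrap node) fails to parse, matching
	// the error in the ingress-operator log above.
	if _, err := availabilitySetNameByID(""); err != nil {
		fmt.Println("error:", err)
	}

	// A well-formed ID parses fine.
	name, _ := availabilitySetNameByID("/subscriptions/sub/resourceGroups/rg/providers/Microsoft.Compute/availabilitySets/cluster-vmas")
	fmt.Println("parsed name:", name)
}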
Looks like Trevor fixed this under BZ1990916; the bootstrap was being left around during installation.

*** This bug has been marked as a duplicate of bug 1990916 ***