Bug 1990937

Summary: Azure CI is failing with ingress failures
Product: OpenShift Container Platform
Component: Networking
Sub Component: router
Version: 4.9
Target Release: 4.9.0
Severity: urgent
Priority: urgent
Status: CLOSED DUPLICATE
Reporter: Stephen Benjamin <stbenjam>
Assignee: Luigi Mario Zuccarelli <luzuccar>
QA Contact: Hongan Li <hongli>
CC: aos-bugs, aravindh, jsafrane, mmasters, sippy
Type: Bug
Last Closed: 2021-08-11 15:14:58 UTC

Description Stephen Benjamin 2021-08-06 15:46:05 UTC
periodic-ci-openshift-release-master-nightly-4.9-e2e-azure

is failing frequently in CI, see:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-nightly-4.9-e2e-azure

It looks like https://amd64.ocp.releases.ci.openshift.org/releasestream/4.9.0-0.nightly/release/4.9.0-0.nightly-2021-08-04-204446 is the first nightly where things started failing, which had a bunch of CSI driver changes. 

The event filters on the failing jobs are full of FailedMount messages.

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-azure/1423023995355140096/artifacts/e2e-azure/gather-must-gather/artifacts/event-filter.html

Comment 1 Jan Safranek 2021-08-06 16:35:54 UTC
Compared to https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-azure/1422744810065760256/artifacts/e2e-azure/gather-must-gather/artifacts/event-filter.html (the last green run), I see an abundance of these events:


MountVolume.SetUp failed for volume "machine-api-controllers-tls" : failed to sync secret cache: timed out waiting for the condition
MountVolume.SetUp failed for volume "var-lib-tuned-profiles-data" : failed to sync configmap cache: timed out waiting for the condition

There are 66 such events in total (about other secrets/config maps too), while the green run has only 11. The event is emitted by a node when it can't get a Secret/ConfigMap from the API server. There is nothing storage-related in these errors. I don't know whether the events are the root cause of the failures.
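
For context, FailedMount is the event reason the kubelet uses for MountVolume.SetUp failures, so the counts above can be cross-checked by listing events by reason. A minimal client-go sketch (an assumption for illustration: it queries a live cluster via $KUBECONFIG rather than reading the must-gather archive):

```go
// countevents.go: count FailedMount events in a cluster.
// A minimal sketch for cross-checking the event-filter numbers above;
// assumes a reachable cluster via $KUBECONFIG, not the must-gather archive.
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// FailedMount is the reason attached to MountVolume.SetUp failures.
	events, err := client.CoreV1().Events("").List(context.TODO(),
		metav1.ListOptions{FieldSelector: "reason=FailedMount"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("FailedMount events: %d\n", len(events.Items))
	for _, e := range events.Items {
		fmt.Printf("%s/%s: %s\n", e.Namespace, e.InvolvedObject.Name, e.Message)
	}
}
```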

Comment 2 Stephen Benjamin 2021-08-06 17:08:22 UTC
Thanks for looking. The failures all seem to be reporting SyncLoadBalancerFailed errors; I thought storage was the culprit because of those FailedMount messages plus the bad timing of the CSI driver changes.

For Ingress, it's reporting this:

./registry-build01-ci-openshift-org-ci-op-24lhkg93-stable-sha256-b3482870e4d8e70ff91da69d12fe1898afbcbadbaf90acd19f9a5047067bda84/namespaces/openshift-ingress-operator/pods/ingress-operator-6b4ccff786-ckz9s/ingress-operator/ingress-operator/logs/current.log:2021-08-06T13:07:36.330587391Z 2021-08-06T13:07:36.330Z	ERROR	operator.ingress_controller	controller/controller.go:298	got retryable error; requeueing	{"after": "1m0s", "error": "IngressController is degraded: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1), LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: EnsureBackendPoolDeleted: failed to parse the VMAS ID : getAvailabilitySetNameByID: failed to parse the VMAS ID \nThe kube-controller-manager logs may contain more details.)"}
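
That EnsureBackendPoolDeleted / getAvailabilitySetNameByID error reads like the cloud provider is trying to extract an availability-set name from a VM's availability-set resource ID and getting nothing to parse (note the empty ID before the colon in the message). A rough illustration of that failure mode, not the actual Azure cloud-provider code; the regexp and sample IDs here are assumptions:

```go
// Rough illustration of the kind of parsing behind
// "getAvailabilitySetNameByID: failed to parse the VMAS ID" -- not the
// actual cloud-provider code; regexp and sample IDs are assumptions.
package main

import (
	"fmt"
	"regexp"
)

var vmasIDRE = regexp.MustCompile(`(?i).*/availabilitySets/(.+)`)

// availabilitySetNameByID extracts the availability-set name from an Azure
// resource ID such as
// /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/availabilitySets/<name>.
func availabilitySetNameByID(vmasID string) (string, error) {
	m := vmasIDRE.FindStringSubmatch(vmasID)
	if len(m) != 2 {
		// An empty ID (e.g. a VM that is not in any availability set)
		// ends up here, producing the error seen in the operator log.
		return "", fmt.Errorf("failed to parse the VMAS ID %s", vmasID)
	}
	return m[1], nil
}

func main() {
	for _, id := range []string{
		"/subscriptions/0000/resourceGroups/rg/providers/Microsoft.Compute/availabilitySets/example-as",
		"", // an empty ID fails exactly like the message above
	} {
		name, err := availabilitySetNameByID(id)
		fmt.Println(name, err)
	}
}
```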


Not sure if ./registry-build01-ci-openshift-org-ci-op-24lhkg93-stable-sha256-b3482870e4d8e70ff91da69d12fe1898afbcbadbaf90acd19f9a5047067bda84/namespaces/openshift-ingress-operator/pods/ingress-operator-6b4ccff786-ckz9s/ingress-operator/ingress-operator/logs/current.log:2021-08-06T13:07:36.330587391Z 2021-08-06T13:07:36.330Z	ERROR	operator.ingress_controller	controller/controller.go:298	got retryable error; requeueing	{"after": "1m0s", "error": "IngressController is degraded: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1), LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: EnsureBackendPoolDeleted: failed to parse the VMAS ID : getAvailabilitySetNameByID: failed to parse the VMAS ID \nThe kube-controller-manager logs may contain more details.)"}


From a quick look at the other changes around the time of the failures, I don't see any code changes on our side that might explain this.

Comment 4 Stephen Benjamin 2021-08-11 15:14:58 UTC
Looks like Trevor fixed this under BZ1990916; the bootstrap was being left around during installation.

*** This bug has been marked as a duplicate of bug 1990916 ***