Bug 1990937

Summary: Azure CI is failing with ingress failures
Product: OpenShift Container Platform
Component: Networking
Sub Component: router
Version: 4.9
Target Release: 4.9.0
Severity: urgent
Priority: urgent
Status: CLOSED DUPLICATE
Reporter: Stephen Benjamin <stbenjam>
Assignee: Luigi Mario Zuccarelli <luzuccar>
QA Contact: Hongan Li <hongli>
CC: aos-bugs, aravindh, jsafrane, mmasters, sippy
Type: Bug
Last Closed: 2021-08-11 15:14:58 UTC

Description Stephen Benjamin 2021-08-06 15:46:05 UTC
periodic-ci-openshift-release-master-nightly-4.9-e2e-azure

is failing frequently in CI, see:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-nightly-4.9-e2e-azure

It looks like https://amd64.ocp.releases.ci.openshift.org/releasestream/4.9.0-0.nightly/release/4.9.0-0.nightly-2021-08-04-204446 is the first nightly where things started failing, which had a bunch of CSI driver changes. 

The event filters on the failing jobs are full of FailedMount messages.

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-azure/1423023995355140096/artifacts/e2e-azure/gather-must-gather/artifacts/event-filter.html

Comment 1 Jan Safranek 2021-08-06 16:35:54 UTC
Compared to https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-azure/1422744810065760256/artifacts/e2e-azure/gather-must-gather/artifacts/event-filter.html (the last green run), I see an abundance of these events:


MountVolume.SetUp failed for volume "machine-api-controllers-tls" : failed to sync secret cache: timed out waiting for the condition
MountVolume.SetUp failed for volume "var-lib-tuned-profiles-data" : failed to sync configmap cache: timed out waiting for the condition

There are 66 such events in total (about other secrets/config maps too), while the green run has only 11. The event is emitted by a node when it can't get a Secret/ConfigMap from the API server. There is nothing storage-related in these errors. I don't know whether the events are the root cause of the failures.
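
For context, FailedMount is the event reason the kubelet uses for MountVolume.SetUp failures, so the counts above can be cross-checked by listing events by reason. A minimal client-go sketch (an assumption for illustration: it queries a live cluster via $KUBECONFIG rather than reading the must-gather archive):

```go
// countevents.go: count FailedMount events in a cluster.
// A minimal sketch for cross-checking the event-filter numbers above;
// assumes a reachable cluster via $KUBECONFIG, not the must-gather archive.
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// FailedMount is the reason attached to MountVolume.SetUp failures.
	events, err := client.CoreV1().Events("").List(context.TODO(),
		metav1.ListOptions{FieldSelector: "reason=FailedMount"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("FailedMount events: %d\n", len(events.Items))
	for _, e := range events.Items {
		fmt.Printf("%s/%s: %s\n", e.Namespace, e.InvolvedObject.Name, e.Message)
	}
}
```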

Comment 2 Stephen Benjamin 2021-08-06 17:08:22 UTC
Thanks for looking. The failures all seem to be reporting SyncLoadBalancerFailed errors; I thought storage was the culprit because of those FailedMount messages plus the bad timing of the CSI driver changes.

For Ingress, it's reporting this:

./registry-build01-ci-openshift-org-ci-op-24lhkg93-stable-sha256-b3482870e4d8e70ff91da69d12fe1898afbcbadbaf90acd19f9a5047067bda84/namespaces/openshift-ingress-operator/pods/ingress-operator-6b4ccff786-ckz9s/ingress-operator/ingress-operator/logs/current.log:2021-08-06T13:07:36.330587391Z 2021-08-06T13:07:36.330Z	ERROR	operator.ingress_controller	controller/controller.go:298	got retryable error; requeueing	{"after": "1m0s", "error": "IngressController is degraded: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1), LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: EnsureBackendPoolDeleted: failed to parse the VMAS ID : getAvailabilitySetNameByID: failed to parse the VMAS ID \nThe kube-controller-manager logs may contain more details.)"}
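
That EnsureBackendPoolDeleted / getAvailabilitySetNameByID error reads like the cloud provider is trying to extract an availability-set name from a VM's availability-set resource ID and getting nothing to parse (note the empty ID before the colon in the message). A rough illustration of that failure mode, not the actual Azure cloud-provider code; the regexp and sample IDs here are assumptions:

```go
// Rough illustration of the kind of parsing behind
// "getAvailabilitySetNameByID: failed to parse the VMAS ID" -- not the
// actual cloud-provider code; regexp and sample IDs are assumptions.
package main

import (
	"fmt"
	"regexp"
)

var vmasIDRE = regexp.MustCompile(`(?i).*/availabilitySets/(.+)`)

// availabilitySetNameByID extracts the availability-set name from an Azure
// resource ID such as
// /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/availabilitySets/<name>.
func availabilitySetNameByID(vmasID string) (string, error) {
	m := vmasIDRE.FindStringSubmatch(vmasID)
	if len(m) != 2 {
		// An empty ID (e.g. a VM that is not in any availability set)
		// ends up here, producing the error seen in the operator log.
		return "", fmt.Errorf("failed to parse the VMAS ID %s", vmasID)
	}
	return m[1], nil
}

func main() {
	for _, id := range []string{
		"/subscriptions/0000/resourceGroups/rg/providers/Microsoft.Compute/availabilitySets/example-as",
		"", // an empty ID fails exactly like the message above
	} {
		name, err := availabilitySetNameByID(id)
		fmt.Println(name, err)
	}
}
```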


Not sure if ./registry-build01-ci-openshift-org-ci-op-24lhkg93-stable-sha256-b3482870e4d8e70ff91da69d12fe1898afbcbadbaf90acd19f9a5047067bda84/namespaces/openshift-ingress-operator/pods/ingress-operator-6b4ccff786-ckz9s/ingress-operator/ingress-operator/logs/current.log:2021-08-06T13:07:36.330587391Z 2021-08-06T13:07:36.330Z	ERROR	operator.ingress_controller	controller/controller.go:298	got retryable error; requeueing	{"after": "1m0s", "error": "IngressController is degraded: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1), LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: EnsureBackendPoolDeleted: failed to parse the VMAS ID : getAvailabilitySetNameByID: failed to parse the VMAS ID \nThe kube-controller-manager logs may contain more details.)"}


From a quick look at the other changes around the time of the failures, I don't see any code changes on our side that might explain this.

Comment 4 Stephen Benjamin 2021-08-11 15:14:58 UTC
Looks like Trevor fixed this under BZ1990916; the bootstrap was being left around during installation.

*** This bug has been marked as a duplicate of bug 1990916 ***