periodic-ci-openshift-release-master-nightly-4.9-e2e-azure is failing frequently in CI, see: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-nightly-4.9-e2e-azure

It looks like https://amd64.ocp.releases.ci.openshift.org/releasestream/4.9.0-0.nightly/release/4.9.0-0.nightly-2021-08-04-204446 is the first nightly where things started failing, and it contained a bunch of CSI driver changes.

The event filter on the failing jobs is full of FailedMount messages: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-azure/1423023995355140096/artifacts/e2e-azure/gather-must-gather/artifacts/event-filter.html
Comparing to https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-azure/1422744810065760256/artifacts/e2e-azure/gather-must-gather/artifacts/event-filter.html (the last green run), I see an abundance of events like these:

MountVolume.SetUp failed for volume "machine-api-controllers-tls" : failed to sync secret cache: timed out waiting for the condition
MountVolume.SetUp failed for volume "var-lib-tuned-profiles-data" : failed to sync configmap cache: timed out waiting for the condition

There are 66 such events in total (covering other Secrets/ConfigMaps too), while the green run has only 11 of them. The event is emitted by a node when it can't get a Secret/ConfigMap from the API server. There is nothing storage related in these errors, and I don't know whether the events are the root cause of the failures.
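In case it helps anyone redo the comparison against a live cluster (the must-gather event-filter.html above is just a static rendering of the same events), here is a rough client-go sketch that lists FailedMount events and tallies them per namespace. The file name, flag, and output format are made up for illustration; this is not part of any existing must-gather or CI tooling.

// failedmount_count.go: rough sketch, not part of any OpenShift/must-gather tooling.
// Lists FailedMount events cluster-wide and tallies them per namespace, roughly what
// eyeballing event-filter.html gives you.
package main

import (
	"context"
	"flag"
	"fmt"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	kubeconfig := flag.String("kubeconfig", filepath.Join(homedir.HomeDir(), ".kube", "config"), "path to kubeconfig")
	flag.Parse()

	cfg, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// "reason" is a supported field selector on core/v1 Events, so the API server
	// does the filtering for us.
	events, err := client.CoreV1().Events(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{
		FieldSelector: "reason=FailedMount",
	})
	if err != nil {
		panic(err)
	}

	// Count Event objects per namespace (repeated occurrences are already
	// aggregated by the API server into a single Event with a Count field).
	perNamespace := map[string]int{}
	for _, ev := range events.Items {
		perNamespace[ev.Namespace]++
	}
	fmt.Printf("FailedMount events: %d\n", len(events.Items))
	for ns, n := range perNamespace {
		fmt.Printf("  %s: %d\n", ns, n)
	}
}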
Thanks for looking. The failures all seem to be reporting SyncLoadBalancerFailed errors; I thought storage was the culprit because of those FailedMount messages plus the bad timing of the CSI driver changes. For Ingress, it's reporting this:

./registry-build01-ci-openshift-org-ci-op-24lhkg93-stable-sha256-b3482870e4d8e70ff91da69d12fe1898afbcbadbaf90acd19f9a5047067bda84/namespaces/openshift-ingress-operator/pods/ingress-operator-6b4ccff786-ckz9s/ingress-operator/ingress-operator/logs/current.log:2021-08-06T13:07:36.330587391Z 2021-08-06T13:07:36.330Z ERROR operator.ingress_controller controller/controller.go:298 got retryable error; requeueing {"after": "1m0s", "error": "IngressController is degraded: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1), LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: EnsureBackendPoolDeleted: failed to parse the VMAS ID : getAvailabilitySetNameByID: failed to parse the VMAS ID \nThe kube-controller-manager logs may contain more details.)"}

From a quick look at the other changes around the time of the failures, I don't see any code changes on our side that might explain this.
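For context on that error string: getAvailabilitySetNameByID is presumably the Azure cloud-provider helper (running in kube-controller-manager) that pulls the availability-set name out of a VMAS resource ID. Below is a simplified stand-in of mine, not the real cloud-provider code; the function name, regex, and example IDs are illustrative. It just shows that an empty or malformed ID yields exactly this kind of "failed to parse the VMAS ID" error, which would fit a VM (e.g. a leftover bootstrap node) that isn't in the cluster availability set.

// vmasid.go: simplified illustration, not the actual cloud-provider code.
// An Azure availability-set ID normally looks like:
//   /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/availabilitySets/<name>
package main

import (
	"fmt"
	"regexp"
)

// vmasIDRE captures the availability-set name at the end of the resource ID.
var vmasIDRE = regexp.MustCompile(`.*/availabilitySets/(.+)`)

// availabilitySetNameByID extracts the availability-set name and fails on anything
// that does not look like an availability-set resource ID, including "".
func availabilitySetNameByID(vmasID string) (string, error) {
	matches := vmasIDRE.FindStringSubmatch(vmasID)
	if len(matches) != 2 {
		return "", fmt.Errorf("getAvailabilitySetNameByID: failed to parse the VMAS ID %s", vmasID)
	}
	return matches[1], nil
}

func main() {
	// An empty ID (which is what you would get for a VM that is not in any
	// availability set, e.g. a leftover bootstrap node) fails to parse, matching
	// the error in the ingress-operator log above.
	if _, err := availabilitySetNameByID(""); err != nil {
		fmt.Println("error:", err)
	}

	// A well-formed ID parses fine.
	name, _ := availabilitySetNameByID("/subscriptions/sub/resourceGroups/rg/providers/Microsoft.Compute/availabilitySets/cluster-vmas")
	fmt.Println("parsed name:", name)
}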
Looks like Trevor fixed this under BZ1990916; the bootstrap was being left around during installation.

*** This bug has been marked as a duplicate of bug 1990916 ***