Description of problem:

A percentage of Azure CI jobs [https://search.ci.openshift.org/?search=Timed+out+waiting+for+node+count&maxAge=168h&context=1&type=bug%2Bjunit&name=azure&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job] are failing with partially rendered machinesets. This results in the installation stalling with the error:

  Timed out waiting for node count (5) to equal or exceed machine count (6).

Machines consistently fail to render while creating a NIC for the machine that ultimately fails to render. For example [https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.7-e2e-azure/1468665066101411840/artifacts/e2e-azure/gather-extra/artifacts/machines.json]:

  failed to create nic ci-op-79544gtd-e2fa6-l2mw7-worker-centralus1-88drt-nic for machine ci-op-79544gtd-e2fa6-l2mw7-worker-centralus1-88drt: unable to create VM network interface: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/72e3a972-58b0-4afc-bd4f-da89b39ccebd/resourceGroups/ci-op-79544gtd-e2fa6-l2mw7-rg/providers/Microsoft.Network/virtualNetworks/ci-op-79544gtd-e2fa6-l2mw7-vnet/subnets/ci-op-79544gtd-e2fa6-l2mw7-worker-subnet?api-version=2020-06-01: StatusCode=401 -- Original Error: adal: Refresh request failed. Status Code = '401'. Response body: {\"error\":\"invalid_client\",\"error_description\":\"AADSTS7000215: Invalid client secret is provided.\\r\\nTrace ID: 4633b68e-45fe-44a9-ac2c-ee95e03a4500\\r\\nCorrelation ID: 1ca8642f-9cac-4c08-b538-cb5974c1ca2b\\r\\nTimestamp: 2021-12-08 20:02:39Z\",\"error_codes\":[7000215],\"timestamp\":\"2021-12-08 20:02:39Z\",\"trace_id\":\"4633b68e-45fe-44a9-ac2c-ee95e03a4500\",\"correlation_id\":\"1ca8642f-9cac-4c08-b538-cb5974c1ca2b\",\"error_uri\":\"https://login.microsoftonline.com/error?code=7000215\"}

Version-Release number of selected component (if applicable):
4.6 and later

How reproducible:
Somewhat consistently in CI. The problem seems to span numerous CI workflows and versions of OpenShift.

Steps to Reproduce:
1.
2.
3.

Actual results:
machinesets do not consistently render all machines

Expected results:
machinesets should consistently render all machines

Additional info:
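The "adal: Refresh request failed ... AADSTS7000215" portion of the error is Azure AD rejecting the OAuth2 client-credentials grant when the machine controller tries to refresh its management-plane token. That exchange can be exercised outside the cluster; a minimal sketch, assuming the CI service principal's tenant, client ID, and current secret are in TENANT_ID, CLIENT_ID, and CLIENT_SECRET (placeholder names, not values from the job):

  # Reproduce the token refresh that adal performs against the v1 endpoint; an
  # invalid client_secret returns HTTP 401 with "invalid_client" / AADSTS7000215.
  $ curl -s -X POST "https://login.microsoftonline.com/${TENANT_ID}/oauth2/token" \
      --data-urlencode 'grant_type=client_credentials' \
      --data-urlencode "client_id=${CLIENT_ID}" \
      --data-urlencode "client_secret=${CLIENT_SECRET}" \
      --data-urlencode 'resource=https://management.azure.com/'

If this succeeds with the same credentials the cluster is using, the secret itself is valid and the intermittent 401s are more likely transient on the AAD side; if it returns AADSTS7000215, the credential material held by the machine controller is stale.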
I meant to mention: the Azure account that services the CI jobs was checked to see whether any limits were being approached, and it appeared that the account was operating within limits.
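(One way to run that kind of check, as a sketch assuming an authenticated az session against the CI subscription and using centralus as an example region from the failed job above:

  $ az vm list-usage --location centralus --output table
  $ az network list-usages --location centralus --output table

Compute and network quotas are listed separately, so both are worth checking.)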
After talking to @wking in Slack, it appears this issue is occurring primarily on 4.6/4.7. From wking:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Timed+out+waiting+for+node+count&maxAge=48h&type=junit&name=azure' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.10-e2e-azure-ovn-upgrade (all) - 97 runs, 77% failed, 1% of failures match = 1% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade (all) - 133 runs, 99% failed, 2% of failures match = 2% impact
periodic-ci-openshift-release-master-ci-4.6-e2e-azure (all) - 5 runs, 20% failed, 100% of failures match = 20% impact
periodic-ci-openshift-release-master-ci-4.7-e2e-azure-ovn (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
periodic-ci-openshift-release-master-nightly-4.6-e2e-azure (all) - 5 runs, 60% failed, 33% of failures match = 20% impact
periodic-ci-openshift-release-master-nightly-4.7-e2e-azure (all) - 5 runs, 60% failed, 33% of failures match = 20% impact
pull-ci-openshift-machine-api-provider-azure-main-e2e-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-launch-azure-modern (all) - 51 runs, 76% failed, 5% of failures match = 4% impact
Still seeing this failure within the last 6 hours; log: https://search.ci.openshift.org/?search=Timed+out+waiting+for+node+count&maxAge=6h&context=1&type=junit&name=azure&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
I've looked through the test failures over the last week; there is only one failure on 4.10, and in my opinion it is a different symptom. The current 4.10 failure shows the machine provisioned, but it failed for some reason during the ignition phase. The other failures are all on older versions into which we haven't backported the fix. I think we can move this to verified.
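When triaging similar runs, the two symptoms can be told apart from the machine phase. A sketch, assuming access to a live cluster exhibiting the failure (the custom-columns paths reference the standard Machine API status fields):

  $ oc get machines.machine.openshift.io -n openshift-machine-api \
      -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,NODE:.status.nodeRef.name

A machine stuck in Provisioning (or Failed) with the NIC/token error above matches this bug; a machine that reaches Provisioned/Running but never yields a Ready node points at an ignition-phase problem instead.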
Acknowledged @Joel, we can move this to verified. (I had mistakenly looked at 4.7.)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056