Bug 1836141 - [Azure] Worker machines go into 'Failed' phase at installation time
Summary: [Azure] Worker machines go into 'Failed' phase at installation time
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Joel Speed
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-05-15 08:53 UTC by Jianwei Hou
Modified: 2020-10-27 16:00 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When checking whether a Virtual Machine had already been created, the Virtual Machine was expected to be running.
Consequence: A Virtual Machine might not be recognised as having been created. Multiple attempts to create the VM flooded the logs with errors, and a VM could go Failed even though it was starting up correctly.
Fix: Ensure that VMs in the Creating state are recognised as having been created already (a minimal sketch of this check follows the Links table below).
Result: Fewer errors in the logs; the Machine is less likely to go Failed.
Clone Of:
Environment:
Last Closed: 2020-10-27 15:59:57 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID | Private | Priority | Status | Summary | Last Updated
Github openshift cluster-api-provider-azure pull 147 | 0 | None | closed | BUG 1836141: Assert VM exists if VM state is Creating | 2020-11-26 08:26:45 UTC
Red Hat Product Errata RHBA-2020:4196 | 0 | None | None | None | 2020-10-27 16:00:00 UTC
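
To illustrate the fix described in the Doc Text and in PR 147 above, here is a minimal Go sketch of the idea. This is not the provider's actual code: the virtualMachine type, the machineExists helper, and the handled state strings are illustrative assumptions only.

package main

import (
	"fmt"
	"strings"
)

// virtualMachine is a hypothetical, simplified view of an Azure VM as seen by
// the machine actuator; the real provider reads ProvisioningState from the
// Azure compute API.
type virtualMachine struct {
	Name              string
	ProvisioningState string
}

// machineExists mirrors the behaviour described in the Doc Text: before the
// fix, only a fully provisioned VM counted as existing, so a VM still in the
// "Creating" state triggered repeated create attempts. After the fix, a VM in
// the Creating state is asserted to exist as well.
func machineExists(vm *virtualMachine) bool {
	if vm == nil {
		return false // no VM found in the resource group
	}
	switch strings.ToLower(vm.ProvisioningState) {
	case "creating", "succeeded":
		// The VM has been created (or is still being created); the reconciler
		// should move on to the idempotent update path instead of creating again.
		return true
	default:
		return false
	}
}

func main() {
	vm := &virtualMachine{Name: "worker-0", ProvisioningState: "Creating"}
	fmt.Println(machineExists(vm)) // true: no second create attempt
}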

Comment 1 Danil Grigorev 2020-05-19 18:54:22 UTC
It seems like all your azure-credentials secrets were filled with empty strings at some point after the first machine was provisioned and became running. Did you change any of those by any chance?

Comment 2 Jianwei Hou 2020-05-25 03:16:21 UTC
(In reply to Danil Grigorev from comment #1)
> It seems like all your azure-credentials secrets were filled with empty
> strings at some point after the first machine was provisioned and became
> running. Did you change any of those by any chance?

Hi @jialiu, can you please provide some info?

Comment 3 Johnny Liu 2020-05-25 03:31:18 UTC
No, it is a CI cluster; as far as I know, no one touched it after the first machine was provisioned and became running.

Comment 4 Alberto 2020-05-29 10:32:25 UTC
This requires further investigation. Tagging with UpcomingSprint.

Comment 5 W. Trevor King 2020-06-26 18:20:14 UTC
Copying into a public comment for search engines, the redacted failure was:

Error Message:  failed to reconcile machine "REDACTED": azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/REDACTED/resourceGroups/REDACTED/providers/Microsoft.Network/virtualNetworks/REDACTED/subnets/REDACTED?api-version=2018-12-01: StatusCode=401 -- Original Error: adal: Refresh request failed. Status Code = '401'. Response body: {"error":"invalid_client","error_description":"AADSTS700026: Client application has no configured keys.\r\nTrace ID: REDACTED\r\nCorrelation ID: REDACTED\r\nTimestamp: 2020-05-15 05:49:59Z","error_codes":[700026],"timestamp":"2020-05-15 05:49:59Z","trace_id":"REDACTED","correlation_id":"REDACTED"}

Bug 1805163 has a similar error, although in that case it was a 400.

Comment 6 Alexander Demicev 2020-07-01 14:39:36 UTC
Hi,
Can you verify that the service principal you are using hasn't been deleted, hasn't been changed, and that its password hasn't expired?

Comment 7 Alberto 2020-07-02 15:40:17 UTC
Can we please also have must-gather logs?

Comment 13 Michael Gugino 2020-07-08 14:42:09 UTC
We should not set failed on invalid credentials.  Credentials do expire from time to time and need to be refreshed.  Or someone might have removed a role from a set of credentials out of band, and they could re-add them.  This is especially true for just updating machines.  If we're only setting failed on create, maybe that's okay.

https://github.com/Azure/AKS/issues/222

This issue highlights the error "azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to" listed in #c0.

Is CI or something else setting these credentials to be very short lived, and the credentials are expiring too quickly?
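
As an illustration of the approach described above (retry on credential problems instead of marking the Machine Failed), here is a rough Go sketch. It is not the actuator's real API: transientError, isCredentialError, and classify are hypothetical names, and the string matching is only one possible way to detect an auth failure.

package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// transientError is a hypothetical stand-in for a "retry later" error; the
// real machine controllers use their own requeue-after mechanism.
type transientError struct {
	After time.Duration
	Err   error
}

func (e *transientError) Error() string {
	return fmt.Sprintf("transient: %v (retry in %s)", e.Err, e.After)
}

// isCredentialError guesses, from the error text, whether an Azure API failure
// is an authentication problem (expired or rotated service principal, 401 from
// AAD) rather than a problem with the machine itself.
func isCredentialError(err error) bool {
	msg := err.Error()
	return strings.Contains(msg, "Failed to refresh the Token") ||
		strings.Contains(msg, "invalid_client")
}

// classify captures the intent of the comment above: credential failures are
// retried (credentials can be rotated or re-added out of band), rather than
// pushing the Machine into the terminal Failed phase.
func classify(err error) error {
	if isCredentialError(err) {
		return &transientError{After: 30 * time.Second, Err: err}
	}
	return err // other create-time failures may still mark the Machine Failed
}

func main() {
	authErr := errors.New("adal: Refresh request failed. Status Code = '401', error: invalid_client")
	fmt.Println(classify(authErr))
}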

Comment 15 Joel Speed 2020-07-09 10:36:52 UTC
Tagging as UpcomingSprint; I need to investigate possible fixes for the problem and won't complete that by the end of the sprint.

Comment 19 sunzhaohua 2020-08-12 09:58:09 UTC
Verified
clusterversion: 4.6.0-0.nightly-2020-08-12-003456

Created a new machine; we don't get the following error any more. Will watch CI results in case we see this again.

E0709 16:44:26.642473       1 actuator.go:78] Machine error: failed to reconcile machine "jspeed-test-8cnpg-worker-centralus3-g6htz"s: failed to create vm jspeed-test-8cnpg-worker-centralus3-g6htz: vm jspeed-test-8cnpg-worker-centralus3-g6htz is still in provisioning state Creating, reconcile

I0812 08:43:03.349919       1 controller.go:169] zhsunazure8122-6bfhc-worker-northcentralus-zhnh6: reconciling Machine
I0812 08:43:03.367456       1 controller.go:282] controller-runtime/controller "msg"="Successfully Reconciled"  "controller"="machine_controller" "request"={"Namespace":"openshift-machine-api","Name":"zhsunazure8122-6bfhc-worker-northcentralus-zhnh6"}
I0812 08:43:03.367735       1 controller.go:169] zhsunazure8122-6bfhc-worker-northcentralus-zhnh6: reconciling Machine
I0812 08:43:03.367756       1 actuator.go:201] zhsunazure8122-6bfhc-worker-northcentralus-zhnh6: actuator checking if machine exists
I0812 08:43:03.598401       1 controller.go:424] zhsunazure8122-6bfhc-worker-northcentralus-zhnh6: going into phase "Provisioning"
I0812 08:43:03.617245       1 controller.go:313] zhsunazure8122-6bfhc-worker-northcentralus-zhnh6: reconciling machine triggers idempotent create
I0812 08:43:03.617271       1 actuator.go:84] Creating machine zhsunazure8122-6bfhc-worker-northcentralus-zhnh6
I0812 08:43:03.732394       1 controller.go:282] controller-runtime/controller "msg"="Successfully Reconciled"  "controller"="machineset" "request"={"Namespace":"openshift-machine-api","Name":"zhsunazure8122-6bfhc-worker-northcentralus"}
I0812 08:43:03.732453       1 controller.go:72] controllers/MachineSet "msg"="Reconciling" "machineset"="zhsunazure8122-6bfhc-worker-northcentralus" "namespace"="openshift-machine-api" 
I0812 08:43:03.778119       1 controller.go:282] controller-runtime/controller "msg"="Successfully Reconciled"  "controller"="machineset" "request"={"Namespace":"openshift-machine-api","Name":"zhsunazure8122-6bfhc-worker-northcentralus"}
I0812 08:43:03.778163       1 controller.go:72] controllers/MachineSet "msg"="Reconciling" "machineset"="zhsunazure8122-6bfhc-worker-northcentralus" "namespace"="openshift-machine-api" 
I0812 08:43:03.841406       1 controller.go:282] controller-runtime/controller "msg"="Successfully Reconciled"  "controller"="machineset" "request"={"Namespace":"openshift-machine-api","Name":"zhsunazure8122-6bfhc-worker-northcentralus"}
I0812 08:43:04.849856       1 networkinterfaces.go:226] successfully created network interface zhsunazure8122-6bfhc-worker-northcentralus-zhnh6-nic
I0812 08:43:04.929418       1 virtualmachines.go:76] getting nic zhsunazure8122-6bfhc-worker-northcentralus-zhnh6-nic
I0812 08:43:05.068459       1 virtualmachines.go:85] got nic zhsunazure8122-6bfhc-worker-northcentralus-zhnh6-nic
I0812 08:43:05.068498       1 virtualmachines.go:87] creating vm zhsunazure8122-6bfhc-worker-northcentralus-zhnh6 
I0812 08:43:10.001707       1 machine_scope.go:160] zhsunazure8122-6bfhc-worker-northcentralus-zhnh6: patching machine
E0812 08:43:10.044656       1 actuator.go:78] Machine error: failed to reconcile machine "zhsunazure8122-6bfhc-worker-northcentralus-zhnh6"s: failed to create vm zhsunazure8122-6bfhc-worker-northcentralus-zhnh6: failed to create or get machine: compute.VirtualMachinesCreateOrUpdateFuture: asynchronous operation has not completed
W0812 08:43:10.044687       1 controller.go:315] zhsunazure8122-6bfhc-worker-northcentralus-zhnh6: failed to create machine: requeue in: 20s
I0812 08:43:10.044704       1 controller.go:405] Actuator returned requeue-after error: requeue in: 20s
I0812 08:43:10.044767       1 controller.go:169] zhsunazure8122-6bfhc-worker-northcentralus-zhnh6: reconciling Machine
I0812 08:43:10.044783       1 actuator.go:201] zhsunazure8122-6bfhc-worker-northcentralus-zhnh6: actuator checking if machine exists
I0812 08:43:10.044850       1 recorder.go:52] controller-runtime/manager/events "msg"="Warning"  "message"="CreateError: failed to reconcile machine \"zhsunazure8122-6bfhc-worker-northcentralus-zhnh6\"s: failed to create vm zhsunazure8122-6bfhc-worker-northcentralus-zhnh6: failed to create or get machine: compute.VirtualMachinesCreateOrUpdateFuture: asynchronous operation has not completed" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"zhsunazure8122-6bfhc-worker-northcentralus-zhnh6","uid":"1d8cfd28-9a7f-4da3-b164-31f775171ae7","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"52648"} "reason"="FailedCreate"
I0812 08:43:10.306552       1 reconciler.go:392] Machine 3820b466-edf9-4541-83f1-351ecc91cd55 is creating
I0812 08:43:10.306580       1 reconciler.go:398] Found vm for machine zhsunazure8122-6bfhc-worker-northcentralus-zhnh6
I0812 08:43:10.306591       1 controller.go:277] zhsunazure8122-6bfhc-worker-northcentralus-zhnh6: reconciling machine triggers idempotent update
I0812 08:43:10.306597       1 actuator.go:168] Updating machine zhsunazure8122-6bfhc-worker-northcentralus-zhnh6
I0812 08:43:10.691715       1 machine_scope.go:160] zhsunazure8122-6bfhc-worker-northcentralus-zhnh6: patching machine
E0812 08:43:10.731714       1 controller.go:284] zhsunazure8122-6bfhc-worker-northcentralus-zhnh6: instance exists but providerID or addresses has not been given to the machine yet, requeuing
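
The log above shows the intended flow after the fix: the create call starts the VM, returns a requeue when the asynchronous Azure operation has not completed, and the next reconcile finds the VM in the Creating state and takes the idempotent update path instead of erroring. Below is a minimal Go sketch of that flow under those assumptions; vmStore and reconcileOnce are hypothetical names, not the provider's actual code.

package main

import "fmt"

// vmStore is a hypothetical in-memory stand-in for the Azure API: the first
// create call starts provisioning asynchronously; later lookups see the VM
// in the "Creating" state.
type vmStore map[string]string // VM name -> provisioning state

func (s vmStore) create(name string) error {
	if _, ok := s[name]; !ok {
		s[name] = "Creating"
		// Mirrors "asynchronous operation has not completed" in the log above:
		// the caller should requeue rather than treat this as a hard failure.
		return fmt.Errorf("create %s: asynchronous operation has not completed, requeue", name)
	}
	return nil
}

func (s vmStore) exists(name string) bool {
	state := s[name]
	return state == "Creating" || state == "Succeeded"
}

// reconcileOnce sketches one pass of the create/update decision: if the VM
// already exists (including the Creating state), take the idempotent update
// path instead of attempting another create.
func reconcileOnce(s vmStore, name string) string {
	if s.exists(name) {
		return "idempotent update"
	}
	if err := s.create(name); err != nil {
		return "requeue: " + err.Error()
	}
	return "created"
}

func main() {
	s := vmStore{}
	fmt.Println(reconcileOnce(s, "worker-0")) // first pass: create starts, requeue
	fmt.Println(reconcileOnce(s, "worker-0")) // second pass: VM is Creating, so update
}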

Comment 21 errata-xmlrpc 2020-10-27 15:59:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

