Bug 1836141
Summary: | [Azure] Worker machines go into 'Failed' phase at installation time | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Jianwei Hou <jhou>
Component: | Cloud Compute | Assignee: | Joel Speed <jspeed>
Cloud Compute sub component: | Other Providers | QA Contact: | sunzhaohua <zhsun>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | high | |
Priority: | medium | CC: | agarcial, dgrigore, gpei, jialiu, jspeed, mgugino, wking
Version: | 4.5 | |
Target Milestone: | --- | |
Target Release: | 4.6.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2020-10-27 15:59:57 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Doc Text:
Cause: When checking whether a Virtual Machine had already been created, the check expected the Virtual Machine to be running.
Consequence: A Virtual Machine might not be recognised as created; repeated attempts to create the VM flooded the logs with errors, and a Machine could go 'Failed' even though the VM was starting up correctly.
Fix: Ensure that VMs in the Creating state are recognised as already created.
Result: Fewer errors in the logs, and Machines are less likely to go 'Failed'.
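For context, the Doc Text fix boils down to widening the actuator's existence check so that a VM that is still provisioning counts as created. Below is a minimal Go sketch of that idea, assuming hypothetical names (`vmExists` and the state constants) rather than the provider's actual code:

```go
package main

import "fmt"

// Hypothetical Azure VM provisioning states; the real provider reads
// these from the VirtualMachine's ProvisioningState field.
const (
	stateCreating  = "Creating"
	stateSucceeded = "Succeeded"
)

// vmExists sketches the corrected check: a VM still in the Creating
// state is treated as already created, so the actuator does not
// attempt a second create (the behaviour that flooded the logs and
// could push the Machine into the 'Failed' phase).
func vmExists(provisioningState string) bool {
	switch provisioningState {
	case stateCreating, stateSucceeded:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(vmExists(stateCreating))  // true: no duplicate create
	fmt.Println(vmExists("Deleting"))     // false
}
```

With the old check, a VM in the Creating state fell through to the create path again, producing the duplicate-create errors shown in the logs later in this report.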
Comment 1
Danil Grigorev
2020-05-19 18:54:22 UTC
(In reply to Danil Grigorev from comment #1)
> It seems like all your azure-credentials secrets were filled with empty strings at some point after the first machine was provisioned and became running. Did you change any of those by any chance?

Hi @jialiu, can you please provide some info?

No, it is a CI cluster; as far as I know, no one touched it after the first machine was provisioned and became running.

This requires further investigation. Tagging with UpcomingSprint.

Copying into a public comment for search engines, the redacted failure was:

```
Error Message: failed to reconcile machine "REDACTED": azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/REDACTED/resourceGroups/REDACTED/providers/Microsoft.Network/virtualNetworks/REDACTED/subnets/REDACTED?api-version=2018-12-01: StatusCode=401 -- Original Error: adal: Refresh request failed. Status Code = '401'. Response body: {"error":"invalid_client","error_description":"AADSTS700026: Client application has no configured keys.\r\nTrace ID: REDACTED\r\nCorrelation ID: REDACTED\r\nTimestamp: 2020-05-15 05:49:59Z","error_codes":[700026],"timestamp":"2020-05-15 05:49:59Z","trace_id":"REDACTED","correlation_id":"REDACTED"}
```

Bug 1805163 has a similar error, although in that case it was a 400.

Hi, can you verify that the service principal you are using hasn't been deleted, hasn't been changed, and that its password hasn't expired? Can we please also have must-gather logs?

We should not set the 'Failed' phase on invalid credentials. Credentials do expire from time to time and need to be refreshed. Or someone might have removed a role from a set of credentials out of band, and they could re-add it. This is especially true when just updating machines. If we're only setting 'Failed' on create, maybe that's okay.

https://github.com/Azure/AKS/issues/222 highlights the error "azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to" listed in #c0. Is CI or something else setting these credentials to be very short-lived, so that they expire too quickly?

Tagging as UpcomingSprint; I need to investigate possible fixes for the problem and won't complete that by the end of the sprint.
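The requeue-instead-of-fail behaviour discussed above (and visible as "requeue in: 20s" in the verification logs below) can be sketched as an error-classification step. This is a hedged illustration with hypothetical types (`RequeueAfterError` and the sentinel errors), not the provider's real error handling:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// RequeueAfterError is a hypothetical error type signalling the
// controller to retry later instead of marking the Machine 'Failed'.
type RequeueAfterError struct {
	After time.Duration
}

func (e *RequeueAfterError) Error() string {
	return fmt.Sprintf("requeue in: %s", e.After)
}

// Hypothetical sentinel errors for the two transient cases discussed
// in this bug: expired or rotated credentials, and an Azure create
// operation that has not finished yet.
var (
	errTokenRefresh  = errors.New("adal: Refresh request failed")
	errCreateRunning = errors.New("asynchronous operation has not completed")
)

// classify turns transient failures into requeues; anything else is
// returned as-is and may legitimately fail the Machine.
func classify(err error) error {
	if errors.Is(err, errTokenRefresh) || errors.Is(err, errCreateRunning) {
		return &RequeueAfterError{After: 20 * time.Second}
	}
	return err
}

func main() {
	fmt.Println(classify(errCreateRunning))                      // requeue in: 20s
	fmt.Println(classify(errors.New("invalid machine config"))) // passed through
}
```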
Verified on clusterversion 4.6.0-0.nightly-2020-08-12-003456. Created a new machine; we don't get the following error any more. Will watch CI results in case we meet this again.

```
E0709 16:44:26.642473 1 actuator.go:78] Machine error: failed to reconcile machine "jspeed-test-8cnpg-worker-centralus3-g6htz"s: failed to create vm jspeed-test-8cnpg-worker-centralus3-g6htz: vm jspeed-test-8cnpg-worker-centralus3-g6htz is still in provisioning state Creating, reconcile
```

The new machine's logs now show reconciliation proceeding while the VM is still creating:

```
I0812 08:43:03.349919 1 controller.go:169] zhsunazure8122-6bfhc-worker-northcentralus-zhnh6: reconciling Machine
I0812 08:43:03.367456 1 controller.go:282] controller-runtime/controller "msg"="Successfully Reconciled" "controller"="machine_controller" "request"={"Namespace":"openshift-machine-api","Name":"zhsunazure8122-6bfhc-worker-northcentralus-zhnh6"}
I0812 08:43:03.367735 1 controller.go:169] zhsunazure8122-6bfhc-worker-northcentralus-zhnh6: reconciling Machine
I0812 08:43:03.367756 1 actuator.go:201] zhsunazure8122-6bfhc-worker-northcentralus-zhnh6: actuator checking if machine exists
I0812 08:43:03.598401 1 controller.go:424] zhsunazure8122-6bfhc-worker-northcentralus-zhnh6: going into phase "Provisioning"
I0812 08:43:03.617245 1 controller.go:313] zhsunazure8122-6bfhc-worker-northcentralus-zhnh6: reconciling machine triggers idempotent create
I0812 08:43:03.617271 1 actuator.go:84] Creating machine zhsunazure8122-6bfhc-worker-northcentralus-zhnh6
I0812 08:43:03.732394 1 controller.go:282] controller-runtime/controller "msg"="Successfully Reconciled" "controller"="machineset" "request"={"Namespace":"openshift-machine-api","Name":"zhsunazure8122-6bfhc-worker-northcentralus"}
I0812 08:43:03.732453 1 controller.go:72] controllers/MachineSet "msg"="Reconciling" "machineset"="zhsunazure8122-6bfhc-worker-northcentralus" "namespace"="openshift-machine-api"
I0812 08:43:03.778119 1 controller.go:282] controller-runtime/controller "msg"="Successfully Reconciled" "controller"="machineset" "request"={"Namespace":"openshift-machine-api","Name":"zhsunazure8122-6bfhc-worker-northcentralus"}
I0812 08:43:03.778163 1 controller.go:72] controllers/MachineSet "msg"="Reconciling" "machineset"="zhsunazure8122-6bfhc-worker-northcentralus" "namespace"="openshift-machine-api"
I0812 08:43:03.841406 1 controller.go:282] controller-runtime/controller "msg"="Successfully Reconciled" "controller"="machineset" "request"={"Namespace":"openshift-machine-api","Name":"zhsunazure8122-6bfhc-worker-northcentralus"}
I0812 08:43:04.849856 1 networkinterfaces.go:226] successfully created network interface zhsunazure8122-6bfhc-worker-northcentralus-zhnh6-nic
I0812 08:43:04.929418 1 virtualmachines.go:76] getting nic zhsunazure8122-6bfhc-worker-northcentralus-zhnh6-nic
I0812 08:43:05.068459 1 virtualmachines.go:85] got nic zhsunazure8122-6bfhc-worker-northcentralus-zhnh6-nic
I0812 08:43:05.068498 1 virtualmachines.go:87] creating vm zhsunazure8122-6bfhc-worker-northcentralus-zhnh6
I0812 08:43:10.001707 1 machine_scope.go:160] zhsunazure8122-6bfhc-worker-northcentralus-zhnh6: patching machine
E0812 08:43:10.044656 1 actuator.go:78] Machine error: failed to reconcile machine "zhsunazure8122-6bfhc-worker-northcentralus-zhnh6"s: failed to create vm zhsunazure8122-6bfhc-worker-northcentralus-zhnh6: failed to create or get machine: compute.VirtualMachinesCreateOrUpdateFuture: asynchronous operation has not completed
W0812 08:43:10.044687 1 controller.go:315] zhsunazure8122-6bfhc-worker-northcentralus-zhnh6: failed to create machine: requeue in: 20s
I0812 08:43:10.044704 1 controller.go:405] Actuator returned requeue-after error: requeue in: 20s
I0812 08:43:10.044767 1 controller.go:169] zhsunazure8122-6bfhc-worker-northcentralus-zhnh6: reconciling Machine
I0812 08:43:10.044783 1 actuator.go:201] zhsunazure8122-6bfhc-worker-northcentralus-zhnh6: actuator checking if machine exists
I0812 08:43:10.044850 1 recorder.go:52] controller-runtime/manager/events "msg"="Warning" "message"="CreateError: failed to reconcile machine \"zhsunazure8122-6bfhc-worker-northcentralus-zhnh6\"s: failed to create vm zhsunazure8122-6bfhc-worker-northcentralus-zhnh6: failed to create or get machine: compute.VirtualMachinesCreateOrUpdateFuture: asynchronous operation has not completed" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"zhsunazure8122-6bfhc-worker-northcentralus-zhnh6","uid":"1d8cfd28-9a7f-4da3-b164-31f775171ae7","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"52648"} "reason"="FailedCreate"
I0812 08:43:10.306552 1 reconciler.go:392] Machine 3820b466-edf9-4541-83f1-351ecc91cd55 is creating
I0812 08:43:10.306580 1 reconciler.go:398] Found vm for machine zhsunazure8122-6bfhc-worker-northcentralus-zhnh6
I0812 08:43:10.306591 1 controller.go:277] zhsunazure8122-6bfhc-worker-northcentralus-zhnh6: reconciling machine triggers idempotent update
I0812 08:43:10.306597 1 actuator.go:168] Updating machine zhsunazure8122-6bfhc-worker-northcentralus-zhnh6
I0812 08:43:10.691715 1 machine_scope.go:160] zhsunazure8122-6bfhc-worker-northcentralus-zhnh6: patching machine
E0812 08:43:10.731714 1 controller.go:284] zhsunazure8122-6bfhc-worker-northcentralus-zhnh6: instance exists but providerID or addresses has not been given to the machine yet, requeuing
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196