Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1882169

Summary:	Machine objects don't exit Failed phase
Product:	OpenShift Container Platform	Reporter:	Jim Minter <jminter>
Component:	Cloud Compute	Assignee:	Joel Speed <jspeed>
Cloud Compute sub component:	Other Providers	QA Contact:	sunzhaohua <zhsun>
Status:	CLOSED WONTFIX	Docs Contact:
Severity:	low
Priority:	low	CC:	dramseur, ffranz, jerzhang, lorenzo.dalrio, mimccune, mjudeiki, wking
Version:	4.4	Flags:	jminter: needinfo-
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-02-01 12:13:01 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Jim Minter 2020-09-24 01:11:20 UTC

It doesn't seem to be possible to rescue a Machine object that is in Failed phase with error "Can't find created instance".

ARO saw a case where the Machine object for a 4.4 master got into that state, probably due to a temporary datacenter outage, but the instance was subsequently healthy, and the only way we found to recover was to delete and recreate the Machine object, which was painful and risky.

It would be useful if the Machine object could be recovered automatically or manually without deletion if the instance becomes available again.

Comment 1 Jim Minter 2020-09-24 01:24:48 UTC

Relevant code: https://github.com/openshift/machine-api-operator/blob/release-4.4/pkg/controller/machine/controller.go#L259-L263 , I guess

Comment 2 Yu Qi Zhang 2020-09-24 13:44:26 UTC

Machine-api is cloud compute, reassigning

Comment 4 Joel Speed 2020-09-24 14:24:13 UTC

Do you have access to this environment still? It would be interesting to see the logs from the machine-api-controllers machine-controller around the time that the Machine went into a failed state? There aren't too many reasons why the Machine should go failed after it has started, but if we can narrow down exactly why it did, we may be able to come up with some way to mitigate that (perhaps we are being too harsh on this failure scenario)

Comment 7 Jim Minter 2020-09-25 16:28:43 UTC

Here's what I found:

20:32:08.023889 1 controller.go:163] xxx-master-0: reconciling Machine
20:32:08.023929 1 actuator.go:196] Checking if machine xxx-master-0 exists
20:32:08.395272 1 reconciler.go:352] Found vm for machine xxx-master-0
20:32:08.395301 1 controller.go:419] xxx-master-0: going into phase "Failed"
20:32:08.407409 1 controller.go:163] xxx-master-0: reconciling Machine
20:32:08.407447 1 controller.go:260] xxx-master-0: machine has gone "Failed" phase. It won't reconcile

The "Found vm for machine xxx-master-0" seems interesting.  I am guessing the code went through this line:

https://github.com/openshift/cluster-api-provider-azure/blob/release-4.4/pkg/cloud/azure/actuators/machine/reconciler.go#L376

i.e. the vm state was neither succeeded nor updating (and was not logged), so it went to Failed.

Comment 8 Joel Speed 2020-09-28 12:37:16 UTC

I believe this list (https://github.com/openshift/cluster-api-provider-azure/blob/86841cbdd693d1d451055a05bbbc113a93ef9f97/pkg/apis/azureprovider/v1beta1/types.go#L320-L349) to be an exhaustive list of the different states that a VM can be in within Azure.

There are a couple of types in there that do not look to be terminal (eg migrating and running) though I've never personally seen a VM in these states (normally they are in succeeded).

Could it be that the VM went through one of those two states? I wonder if we can reproduce this somehow.

For migrating, perhaps someone changed some configuration item about the master VM which caused it to go to migrating? Is there an audit history of changes or do you know if anything was changed on the Azure side about the VM?

Comment 9 Joel Speed 2020-09-30 11:09:10 UTC

I'd like to look into this further during the next sprint.

From my perspective, there is some issue here that means that the Machine prematurely entered the failed state.
I don't think going Failed to Running is a valid transition that we want to support, so we should instead make sure that the bug that meant this Machine went failed prematurely is resolved to prevent this happening in the future.

Comment 10 Lorenzo Dalrio 2020-10-01 12:53:37 UTC

*** Bug 1884247 has been marked as a duplicate of this bug. ***

Comment 11 Lorenzo Dalrio 2020-10-01 12:56:38 UTC

(In reply to Jim Minter from comment #0)
> ARO saw a case where the Machine object for a 4.4 master got into that
> state, probably due to a temporary datacenter outage, but the instance was
> subsequently healthy, and the only way we found to recover was to delete and
> recreate the Machine object, which was painful and risky.

Besides the delete and recreate procedure is there something else we can do to solve the issue?

Comment 12 Joel Speed 2020-10-01 13:57:02 UTC

You could patch the status back to Running, this should trigger the machine controller to re-evaluate the situation. We don't normally do anything to the instances when they go into Failed (as in there's no side effects) so I think this should be relatively safe. If the Machine controller thinks there's a problem still, it will mark the machine as failed again.

To do this you will need to get some information from the admin kubeconfig (or an equally privileged user):

You will need to create a file `ca.crt` by reading the cluster definition `certificate-authority-data`, base64 decoding this and dumping it to the file.

You will need to create a file `client.crt` by reading the user definition `client-certificate-data`, base64 decoding this and dumping it to the file.

You will need to create a file `client.key` by reading the user definition `client-kay-data`, base64 decoding this and dumping it to the file.

Then you will need to export SERVER (or substitute below) with the value from `server` in the cluster definition.

Finally export FAILED_MACHINE (or substitute below) with the name of the failed machine.

Once you've done all of that, you can run the below command which should update the phase to `Running`

curl -H "Content-Type: application/merge-patch+json" --cacert ca.crt --cert client.crt --key client.key $SERVER/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/$FAILED_MACHINE/status -XPATCH -d '{"status":{"phase":"Running"}}'

Comment 13 Lorenzo Dalrio 2020-10-02 13:14:46 UTC

I will try your workaround as soon as possible and will report back on the result.

Comment 14 Lorenzo Dalrio 2020-10-05 08:35:29 UTC

(In reply to Joel Speed from comment #12)
> You could patch the status back to Running, this should trigger the machine
> controller to re-evaluate the situation. We don't normally do anything to
> the instances when they go into Failed (as in there's no side effects) so I
> think this should be relatively safe. If the Machine controller thinks
> there's a problem still, it will mark the machine as failed again.
> 
> To do this you will need to get some information from the admin kubeconfig
> (or an equally privileged user):
> 
> You will need to create a file `ca.crt` by reading the cluster definition
> `certificate-authority-data`, base64 decoding this and dumping it to the
> file.
> 
> You will need to create a file `client.crt` by reading the user definition
> `client-certificate-data`, base64 decoding this and dumping it to the file.
> 
> You will need to create a file `client.key` by reading the user definition
> `client-kay-data`, base64 decoding this and dumping it to the file.
> 
> Then you will need to export SERVER (or substitute below) with the value
> from `server` in the cluster definition.
> 
> Finally export FAILED_MACHINE (or substitute below) with the name of the
> failed machine.
> 
> Once you've done all of that, you can run the below command which should
> update the phase to `Running`
> 
> curl -H "Content-Type: application/merge-patch+json" --cacert ca.crt --cert
> client.crt --key client.key
> $SERVER/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/
> machines/$FAILED_MACHINE/status -XPATCH -d '{"status":{"phase":"Running"}}'

The workaround fixed the machine:

$ oc get machines
NAME                                            PHASE     TYPE              REGION       ZONE   AGE
ocp-dev-westeu-9qm9r-master-0                   Running   Standard_D4s_v3   westeurope   1      312d
ocp-dev-westeu-9qm9r-master-1                   Running   Standard_D4s_v3   westeurope   3      312d
ocp-dev-westeu-9qm9r-master-2                   Running   Standard_D4s_v3   westeurope   2      312d
ocp-dev-westeu-9qm9r-worker-westeurope1-mwft7   Running   Standard_D4s_v3   westeurope   1      312d
ocp-dev-westeu-9qm9r-worker-westeurope2-qk2rc   Running   Standard_D4s_v3   westeurope   2      312d
ocp-dev-westeu-9qm9r-worker-westeurope3-9npcw   Running   Standard_D4s_v3   westeurope   3      312d

Here is the output produced by the curl:

{"apiVersion":"machine.openshift.io/v1beta1","kind":"Machine","metadata":{"annotations":{"machine.openshift.io/instance-state":"Unknown"},"creationTimestamp":"2019-11-27T08:38:29Z","finalizers":["machine.machine.openshift.io"],"generation":3,"labels":{"machine.openshift.io/cluster-api-cluster":"ocp-dev-westeu-9qm9r","machine.openshift.io/cluster-api-machine-role":"master","machine.openshift.io/cluster-api-machine-type":"master","machine.openshift.io/instance-type":"Standard_D4s_v3","machine.openshift.io/region":"westeurope","machine.openshift.io/zone":"1"},"name":"ocp-dev-westeu-9qm9r-master-0","namespace":"openshift-machine-api","resourceVersion":"179396808","selfLink":"/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/ocp-dev-westeu-9qm9r-master-0/status","uid":"496450ec-10f1-11ea-a708-000d3aa9b143"},"spec":{"metadata":{},"providerID":"azure:///subscriptions/d3203858-393f-4a08-a78c-cb25c71cc699/resourceGroups/ocp-dev-westeu-9qm9r-rg/providers/Microsoft.Compute/virtualMachines/ocp-dev-westeu-9qm9r-master-0","providerSpec":{"value":{"apiVersion":"azureproviderconfig.openshift.io/v1beta1","credentialsSecret":{"name":"azure-cloud-credentials","namespace":"openshift-machine-api"},"image":{"offer":"","publisher":"","resourceID":"/resourceGroups/ocp-dev-westeu-9qm9r-rg/providers/Microsoft.Compute/images/ocp-dev-westeu-9qm9r","sku":"","version":""},"kind":"AzureMachineProviderSpec","location":"westeurope","managedIdentity":"ocp-dev-westeu-9qm9r-identity","metadata":{"creationTimestamp":null},"osDisk":{"diskSizeGB":256,"managedDisk":{"storageAccountType":"Premium_LRS"},"osType":"Linux"},"publicIP":false,"resourceGroup":"ocp-dev-westeu-9qm9r-rg","subnet":"ocp-dev-westeu-9qm9r-master-subnet","userDataSecret":{"name":"master-user-data"},"vmSize":"Standard_D4s_v3","vnet":"ocp-dev-westeu-9qm9r-vnet","zone":"1"}}},"status":{"addresses":[{"address":"ocp-dev-westeu-9qm9r-master-0","type":"Hostname"},{"address":"ocp-dev-westeu-9qm9r-master-0","type":"InternalDNS"},{"address":"ocp-dev-westeu-9qm9r-master-0.odsoql0soznuvohxlfylo4ewnf.ax.internal.cloudapp.net","type":"InternalDNS"},{"address":"172.16.105.6","type":"InternalIP"}],"errorMessage":"Can't find created instance.","lastUpdated":"2020-10-01T11:59:14Z","nodeRef":{"kind":"Node","name":"ocp-dev-westeu-9qm9r-master-0","uid":"4140f426-10f1-11ea-a708-000d3aa9b143"},"phase":"Running","providerStatus":{"conditions":[{"lastProbeTime":"2019-11-27T08:41:03Z","lastTransitionTime":"2019-11-27T08:41:03Z","message":"machine successfully created","reason":"MachineCreationSucceeded","status":"True","type":"MachineCreated"}],"metadata":{"creationTimestamp":null},"vmId":"/subscriptions/d3203858-393f-4a08-a78c-cb25c71cc699/resourceGroups/ocp-dev-westeu-9qm9r-rg/providers/Microsoft.Compute/virtualMachines/ocp-dev-westeu-9qm9r-master-0","vmState":"Updating"}}}

This is the log collected from the machine-controller container:

I1005 08:31:28.412502       1 controller.go:169] ocp-dev-westeu-9qm9r-master-0: reconciling Machine
I1005 08:31:28.412539       1 actuator.go:201] Checking if machine ocp-dev-westeu-9qm9r-master-0 exists
I1005 08:31:28.642849       1 reconciler.go:353] Found vm for machine ocp-dev-westeu-9qm9r-master-0
I1005 08:31:28.642876       1 reconciler.go:373] Machine 33bb8eed-967c-4ec6-8329-51bee47ef20f is running
I1005 08:31:28.642889       1 controller.go:277] ocp-dev-westeu-9qm9r-master-0: reconciling machine triggers idempotent update
I1005 08:31:28.642896       1 actuator.go:168] Updating machine ocp-dev-westeu-9qm9r-master-0
I1005 08:31:28.999711       1 machine_scope.go:157] ocp-dev-westeu-9qm9r-master-0: patching machine
I1005 08:31:29.036911       1 controller.go:282] controller-runtime/controller "msg"="Successfully Reconciled"  "controller"="machine_controller" "request"={"Namespace":"openshift-machine-api","Name":"ocp-dev-westeu-9qm9r-master-0"}
I1005 08:31:29.037049       1 controller.go:169] ocp-dev-westeu-9qm9r-master-0: reconciling Machine
I1005 08:31:29.037161       1 actuator.go:201] Checking if machine ocp-dev-westeu-9qm9r-master-0 exists
I1005 08:31:29.038040       1 recorder.go:52] controller-runtime/manager/events "msg"="Normal"  "message"="Updated machine \"ocp-dev-westeu-9qm9r-master-0\"" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"ocp-dev-westeu-9qm9r-master-0","uid":"496450ec-10f1-11ea-a708-000d3aa9b143","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"179396808"} "reason"="Updated"
I1005 08:31:29.188189       1 reconciler.go:353] Found vm for machine ocp-dev-westeu-9qm9r-master-0
I1005 08:31:29.188342       1 reconciler.go:373] Machine 33bb8eed-967c-4ec6-8329-51bee47ef20f is running
I1005 08:31:29.188427       1 controller.go:277] ocp-dev-westeu-9qm9r-master-0: reconciling machine triggers idempotent update
I1005 08:31:29.188491       1 actuator.go:168] Updating machine ocp-dev-westeu-9qm9r-master-0
I1005 08:31:29.439731       1 machine_scope.go:141] ocp-dev-westeu-9qm9r-master-0: status unchanged
I1005 08:31:29.439766       1 machine_scope.go:141] ocp-dev-westeu-9qm9r-master-0: status unchanged
I1005 08:31:29.439771       1 machine_scope.go:157] ocp-dev-westeu-9qm9r-master-0: patching machine
I1005 08:31:29.462268       1 controller.go:282] controller-runtime/controller "msg"="Successfully Reconciled"  "controller"="machine_controller" "request"={"Namespace":"openshift-machine-api","Name":"ocp-dev-westeu-9qm9r-master-0"}
I1005 08:31:29.462332       1 recorder.go:52] controller-runtime/manager/events "msg"="Normal"  "message"="Updated machine \"ocp-dev-westeu-9qm9r-master-0\"" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"ocp-dev-westeu-9qm9r-master-0","uid":"496450ec-10f1-11ea-a708-000d3aa9b143","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"179396813"} "reason"="Updated"

Comment 15 Joel Speed 2020-10-07 14:35:35 UTC

@Jim Minter, looking back at some you posted in C#7, I wonder if the code actually entered this block https://github.com/openshift/cluster-api-provider-azure/blob/c64d41cda1e8f4928e03e337c65125d843f6658c/pkg/cloud/azure/actuators/machine/reconciler.go#L354-L368. Do you know if the Machine did or did not have a userDataSecret? This shouldn't normally be the case but if so, we could have ended up with an error fetching the extension which resulted in the machine going to failed, rather than the source of this issue being a different phase that hasn't been seen before.

I know there was a bug that was fixed semi recently about the VM being in the creating state and failing, but I doubt that's the case for this example since master machines aren't created by Machine API normally

Comment 16 Joel Speed 2020-11-13 14:38:55 UTC

@jminter Please see https://bugzilla.redhat.com/show_bug.cgi?id=1882169#c15, I don't think there's anything we can really do with the information provided to be more tolerant to this issue. In the general case it would be bad for a machine to exit the Failed state, but we can't really work out why this one did enter Failed when it shouldn't have done so

Comment 17 Michael McCune 2020-12-04 19:33:03 UTC

adding UpcomingSprint tag, i believe we are still waiting on a response to the questions raised in https://bugzilla.redhat.com/show_bug.cgi?id=1882169#c15