Bug 1882169
| Summary: | Machine objects don't exit Failed phase | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jim Minter <jminter> |
| Component: | Cloud Compute | Assignee: | Joel Speed <jspeed> |
| Cloud Compute sub component: | Other Providers | QA Contact: | sunzhaohua <zhsun> |
| Status: | CLOSED WONTFIX | Docs Contact: | |
| Severity: | low | ||
| Priority: | low | CC: | dramseur, ffranz, jerzhang, lorenzo.dalrio, mimccune, mjudeiki, wking |
| Version: | 4.4 | Flags: | jminter:
needinfo-
|
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2021-02-01 12:13:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Jim Minter
2020-09-24 01:11:20 UTC
Relevant code: https://github.com/openshift/machine-api-operator/blob/release-4.4/pkg/controller/machine/controller.go#L259-L263 , I guess Machine-api is cloud compute, reassigning Do you have access to this environment still? It would be interesting to see the logs from the machine-api-controllers machine-controller around the time that the Machine went into a failed state? There aren't too many reasons why the Machine should go failed after it has started, but if we can narrow down exactly why it did, we may be able to come up with some way to mitigate that (perhaps we are being too harsh on this failure scenario) Here's what I found: 20:32:08.023889 1 controller.go:163] xxx-master-0: reconciling Machine 20:32:08.023929 1 actuator.go:196] Checking if machine xxx-master-0 exists 20:32:08.395272 1 reconciler.go:352] Found vm for machine xxx-master-0 20:32:08.395301 1 controller.go:419] xxx-master-0: going into phase "Failed" 20:32:08.407409 1 controller.go:163] xxx-master-0: reconciling Machine 20:32:08.407447 1 controller.go:260] xxx-master-0: machine has gone "Failed" phase. It won't reconcile The "Found vm for machine xxx-master-0" seems interesting. I am guessing the code went through this line: https://github.com/openshift/cluster-api-provider-azure/blob/release-4.4/pkg/cloud/azure/actuators/machine/reconciler.go#L376 i.e. the vm state was neither succeeded nor updating (and was not logged), so it went to Failed. I believe this list (https://github.com/openshift/cluster-api-provider-azure/blob/86841cbdd693d1d451055a05bbbc113a93ef9f97/pkg/apis/azureprovider/v1beta1/types.go#L320-L349) to be an exhaustive list of the different states that a VM can be in within Azure. There are a couple of types in there that do not look to be terminal (eg migrating and running) though I've never personally seen a VM in these states (normally they are in succeeded). Could it be that the VM went through one of those two states? I wonder if we can reproduce this somehow. For migrating, perhaps someone changed some configuration item about the master VM which caused it to go to migrating? Is there an audit history of changes or do you know if anything was changed on the Azure side about the VM? I'd like to look into this further during the next sprint. From my perspective, there is some issue here that means that the Machine prematurely entered the failed state. I don't think going Failed to Running is a valid transition that we want to support, so we should instead make sure that the bug that meant this Machine went failed prematurely is resolved to prevent this happening in the future. *** Bug 1884247 has been marked as a duplicate of this bug. *** (In reply to Jim Minter from comment #0) > ARO saw a case where the Machine object for a 4.4 master got into that > state, probably due to a temporary datacenter outage, but the instance was > subsequently healthy, and the only way we found to recover was to delete and > recreate the Machine object, which was painful and risky. Besides the delete and recreate procedure is there something else we can do to solve the issue? You could patch the status back to Running, this should trigger the machine controller to re-evaluate the situation. We don't normally do anything to the instances when they go into Failed (as in there's no side effects) so I think this should be relatively safe. If the Machine controller thinks there's a problem still, it will mark the machine as failed again.
To do this you will need to get some information from the admin kubeconfig (or an equally privileged user):
You will need to create a file `ca.crt` by reading the cluster definition `certificate-authority-data`, base64 decoding this and dumping it to the file.
You will need to create a file `client.crt` by reading the user definition `client-certificate-data`, base64 decoding this and dumping it to the file.
You will need to create a file `client.key` by reading the user definition `client-kay-data`, base64 decoding this and dumping it to the file.
Then you will need to export SERVER (or substitute below) with the value from `server` in the cluster definition.
Finally export FAILED_MACHINE (or substitute below) with the name of the failed machine.
Once you've done all of that, you can run the below command which should update the phase to `Running`
curl -H "Content-Type: application/merge-patch+json" --cacert ca.crt --cert client.crt --key client.key $SERVER/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/$FAILED_MACHINE/status -XPATCH -d '{"status":{"phase":"Running"}}'
I will try your workaround as soon as possible and will report back on the result. (In reply to Joel Speed from comment #12) > You could patch the status back to Running, this should trigger the machine > controller to re-evaluate the situation. We don't normally do anything to > the instances when they go into Failed (as in there's no side effects) so I > think this should be relatively safe. If the Machine controller thinks > there's a problem still, it will mark the machine as failed again. > > To do this you will need to get some information from the admin kubeconfig > (or an equally privileged user): > > You will need to create a file `ca.crt` by reading the cluster definition > `certificate-authority-data`, base64 decoding this and dumping it to the > file. > > You will need to create a file `client.crt` by reading the user definition > `client-certificate-data`, base64 decoding this and dumping it to the file. > > You will need to create a file `client.key` by reading the user definition > `client-kay-data`, base64 decoding this and dumping it to the file. > > Then you will need to export SERVER (or substitute below) with the value > from `server` in the cluster definition. > > Finally export FAILED_MACHINE (or substitute below) with the name of the > failed machine. > > Once you've done all of that, you can run the below command which should > update the phase to `Running` > > curl -H "Content-Type: application/merge-patch+json" --cacert ca.crt --cert > client.crt --key client.key > $SERVER/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/ > machines/$FAILED_MACHINE/status -XPATCH -d '{"status":{"phase":"Running"}}' The workaround fixed the machine: $ oc get machines NAME PHASE TYPE REGION ZONE AGE ocp-dev-westeu-9qm9r-master-0 Running Standard_D4s_v3 westeurope 1 312d ocp-dev-westeu-9qm9r-master-1 Running Standard_D4s_v3 westeurope 3 312d ocp-dev-westeu-9qm9r-master-2 Running Standard_D4s_v3 westeurope 2 312d ocp-dev-westeu-9qm9r-worker-westeurope1-mwft7 Running Standard_D4s_v3 westeurope 1 312d ocp-dev-westeu-9qm9r-worker-westeurope2-qk2rc Running Standard_D4s_v3 westeurope 2 312d ocp-dev-westeu-9qm9r-worker-westeurope3-9npcw Running Standard_D4s_v3 westeurope 3 312d Here is the output produced by the curl: {"apiVersion":"machine.openshift.io/v1beta1","kind":"Machine","metadata":{"annotations":{"machine.openshift.io/instance-state":"Unknown"},"creationTimestamp":"2019-11-27T08:38:29Z","finalizers":["machine.machine.openshift.io"],"generation":3,"labels":{"machine.openshift.io/cluster-api-cluster":"ocp-dev-westeu-9qm9r","machine.openshift.io/cluster-api-machine-role":"master","machine.openshift.io/cluster-api-machine-type":"master","machine.openshift.io/instance-type":"Standard_D4s_v3","machine.openshift.io/region":"westeurope","machine.openshift.io/zone":"1"},"name":"ocp-dev-westeu-9qm9r-master-0","namespace":"openshift-machine-api","resourceVersion":"179396808","selfLink":"/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/ocp-dev-westeu-9qm9r-master-0/status","uid":"496450ec-10f1-11ea-a708-000d3aa9b143"},"spec":{"metadata":{},"providerID":"azure:///subscriptions/d3203858-393f-4a08-a78c-cb25c71cc699/resourceGroups/ocp-dev-westeu-9qm9r-rg/providers/Microsoft.Compute/virtualMachines/ocp-dev-westeu-9qm9r-master-0","providerSpec":{"value":{"apiVersion":"azureproviderconfig.openshift.io/v1beta1","credentialsSecret":{"name":"azure-cloud-credentials","namespace":"openshift-machine-api"},"image":{"offer":"","publisher":"","resourceID":"/resourceGroups/ocp-dev-westeu-9qm9r-rg/providers/Microsoft.Compute/images/ocp-dev-westeu-9qm9r","sku":"","version":""},"kind":"AzureMachineProviderSpec","location":"westeurope","managedIdentity":"ocp-dev-westeu-9qm9r-identity","metadata":{"creationTimestamp":null},"osDisk":{"diskSizeGB":256,"managedDisk":{"storageAccountType":"Premium_LRS"},"osType":"Linux"},"publicIP":false,"resourceGroup":"ocp-dev-westeu-9qm9r-rg","subnet":"ocp-dev-westeu-9qm9r-master-subnet","userDataSecret":{"name":"master-user-data"},"vmSize":"Standard_D4s_v3","vnet":"ocp-dev-westeu-9qm9r-vnet","zone":"1"}}},"status":{"addresses":[{"address":"ocp-dev-westeu-9qm9r-master-0","type":"Hostname"},{"address":"ocp-dev-westeu-9qm9r-master-0","type":"InternalDNS"},{"address":"ocp-dev-westeu-9qm9r-master-0.odsoql0soznuvohxlfylo4ewnf.ax.internal.cloudapp.net","type":"InternalDNS"},{"address":"172.16.105.6","type":"InternalIP"}],"errorMessage":"Can't find created instance.","lastUpdated":"2020-10-01T11:59:14Z","nodeRef":{"kind":"Node","name":"ocp-dev-westeu-9qm9r-master-0","uid":"4140f426-10f1-11ea-a708-000d3aa9b143"},"phase":"Running","providerStatus":{"conditions":[{"lastProbeTime":"2019-11-27T08:41:03Z","lastTransitionTime":"2019-11-27T08:41:03Z","message":"machine successfully created","reason":"MachineCreationSucceeded","status":"True","type":"MachineCreated"}],"metadata":{"creationTimestamp":null},"vmId":"/subscriptions/d3203858-393f-4a08-a78c-cb25c71cc699/resourceGroups/ocp-dev-westeu-9qm9r-rg/providers/Microsoft.Compute/virtualMachines/ocp-dev-westeu-9qm9r-master-0","vmState":"Updating"}}} This is the log collected from the machine-controller container: I1005 08:31:28.412502 1 controller.go:169] ocp-dev-westeu-9qm9r-master-0: reconciling Machine I1005 08:31:28.412539 1 actuator.go:201] Checking if machine ocp-dev-westeu-9qm9r-master-0 exists I1005 08:31:28.642849 1 reconciler.go:353] Found vm for machine ocp-dev-westeu-9qm9r-master-0 I1005 08:31:28.642876 1 reconciler.go:373] Machine 33bb8eed-967c-4ec6-8329-51bee47ef20f is running I1005 08:31:28.642889 1 controller.go:277] ocp-dev-westeu-9qm9r-master-0: reconciling machine triggers idempotent update I1005 08:31:28.642896 1 actuator.go:168] Updating machine ocp-dev-westeu-9qm9r-master-0 I1005 08:31:28.999711 1 machine_scope.go:157] ocp-dev-westeu-9qm9r-master-0: patching machine I1005 08:31:29.036911 1 controller.go:282] controller-runtime/controller "msg"="Successfully Reconciled" "controller"="machine_controller" "request"={"Namespace":"openshift-machine-api","Name":"ocp-dev-westeu-9qm9r-master-0"} I1005 08:31:29.037049 1 controller.go:169] ocp-dev-westeu-9qm9r-master-0: reconciling Machine I1005 08:31:29.037161 1 actuator.go:201] Checking if machine ocp-dev-westeu-9qm9r-master-0 exists I1005 08:31:29.038040 1 recorder.go:52] controller-runtime/manager/events "msg"="Normal" "message"="Updated machine \"ocp-dev-westeu-9qm9r-master-0\"" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"ocp-dev-westeu-9qm9r-master-0","uid":"496450ec-10f1-11ea-a708-000d3aa9b143","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"179396808"} "reason"="Updated" I1005 08:31:29.188189 1 reconciler.go:353] Found vm for machine ocp-dev-westeu-9qm9r-master-0 I1005 08:31:29.188342 1 reconciler.go:373] Machine 33bb8eed-967c-4ec6-8329-51bee47ef20f is running I1005 08:31:29.188427 1 controller.go:277] ocp-dev-westeu-9qm9r-master-0: reconciling machine triggers idempotent update I1005 08:31:29.188491 1 actuator.go:168] Updating machine ocp-dev-westeu-9qm9r-master-0 I1005 08:31:29.439731 1 machine_scope.go:141] ocp-dev-westeu-9qm9r-master-0: status unchanged I1005 08:31:29.439766 1 machine_scope.go:141] ocp-dev-westeu-9qm9r-master-0: status unchanged I1005 08:31:29.439771 1 machine_scope.go:157] ocp-dev-westeu-9qm9r-master-0: patching machine I1005 08:31:29.462268 1 controller.go:282] controller-runtime/controller "msg"="Successfully Reconciled" "controller"="machine_controller" "request"={"Namespace":"openshift-machine-api","Name":"ocp-dev-westeu-9qm9r-master-0"} I1005 08:31:29.462332 1 recorder.go:52] controller-runtime/manager/events "msg"="Normal" "message"="Updated machine \"ocp-dev-westeu-9qm9r-master-0\"" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"ocp-dev-westeu-9qm9r-master-0","uid":"496450ec-10f1-11ea-a708-000d3aa9b143","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"179396813"} "reason"="Updated" @Jim Minter, looking back at some you posted in C#7, I wonder if the code actually entered this block https://github.com/openshift/cluster-api-provider-azure/blob/c64d41cda1e8f4928e03e337c65125d843f6658c/pkg/cloud/azure/actuators/machine/reconciler.go#L354-L368. Do you know if the Machine did or did not have a userDataSecret? This shouldn't normally be the case but if so, we could have ended up with an error fetching the extension which resulted in the machine going to failed, rather than the source of this issue being a different phase that hasn't been seen before. I know there was a bug that was fixed semi recently about the VM being in the creating state and failing, but I doubt that's the case for this example since master machines aren't created by Machine API normally @jminter Please see https://bugzilla.redhat.com/show_bug.cgi?id=1882169#c15, I don't think there's anything we can really do with the information provided to be more tolerant to this issue. In the general case it would be bad for a machine to exit the Failed state, but we can't really work out why this one did enter Failed when it shouldn't have done so adding UpcomingSprint tag, i believe we are still waiting on a response to the questions raised in https://bugzilla.redhat.com/show_bug.cgi?id=1882169#c15 |