Bug 1957349
| Summary: | [Azure] Machine object showing Failed phase even node is ready and VM is running properly | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Aditya Deshpande <adeshpan> |
| Component: | Cloud Compute | Assignee: | dmoiseev |
| Cloud Compute sub component: | Other Providers | QA Contact: | Milind Yadav <miyadav> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | ||
| Priority: | unspecified | CC: | agarcial, dmoiseev, emarquez, jspeed, mfiedler, mrbraga, oarribas, openshift-bugs-escalate |
| Version: | 4.6.z | Flags: | dmoiseev: needinfo- |
| Target Milestone: | --- | ||
| Target Release: | 4.8.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Previously, because of a strict check of the VM's 'ProvisioningState' value, a machine could undesirably go into the 'Failed' phase during its existence check. This check has been relaxed: now only machines that have actually been deleted go into the 'Failed' phase during the existence check. | Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2021-07-27 23:06:36 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1989524 | ||
This reminds me of a previous bug, but I can't find it right now. What I believe is happening here is that the VM state is going through something that we aren't aware of in our "exists" check, and therefore it is going to failed. We need to work out exactly what that state is and add it as an allowed state in https://github.com/openshift/cluster-api-provider-azure/blob/b2eda16dd665ab39459c0b686c88ce2d0b97ec6a/pkg/cloud/azure/actuators/machine/reconciler.go#L386-L400

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438
Description of problem:

In OCP cluster setup on Azure, some machine objects are showing failed phase. Checking OCP node status shows the respective node is in ready state. From Azure perspective VM is running with correct VMId(ProviderID).

~~~
# oc get machines -n openshift-machine-api
NAME                                        PHASE     TYPE               REGION       ZONE   AGE
awe-pp-xxx-ah97u-master-0                   Failed    Standard_D8s_v3    westeurope   1      88d
awe-pp-xxx-ah97u-master-1                   Running   Standard_D8s_v3    westeurope   2      88d
awe-pp-xxx-ah97u-master-2                   Running   Standard_D8s_v3    westeurope   3      88d
awe-pp-xxx-ah97u-worker-westeurope1-fmm25   Failed    Standard_F32s_v2   westeurope   1      88d
awe-pp-xxx-ah97u-worker-westeurope2-w22ln   Running   Standard_F32s_v2   westeurope   2      52d
awe-pp-xxx-ah97u-worker-westeurope3-s7vll   Running   Standard_F32s_v2   westeurope   3      88d

# oc get nodes
NAME                                        STATUS   ROLES    AGE   VERSION
awe-pp-xxx-ah97u-master-0                   Ready    master   88d   v1.19.0+2f3101c
awe-pp-xxx-ah97u-master-1                   Ready    master   88d   v1.19.0+2f3101c
awe-pp-xxx-ah97u-master-2                   Ready    master   88d   v1.19.0+2f3101c
awe-pp-xxx-ah97u-worker-westeurope1-fmm25   Ready    worker   88d   v1.19.0+2f3101c
awe-pp-xxx-ah97u-worker-westeurope2-w22ln   Ready    worker   52d   v1.19.0+2f3101c
awe-pp-xxx-ah97u-worker-westeurope3-s7vll   Ready    worker   88d   v1.19.0+2f3101c
~~~

- machine awe-pp-xxx-ah97u-master-0 is in phase: Failed

Machine YAML:
errorMessage: Can't find created instance.

- machine-api-controllers logs shows:

~~~
2021-04-26T00:58:33.013973757Z I0426 00:58:33.013892 1 controller.go:170] awe-pp-xxx-ah97u-master-0: reconciling Machine
2021-04-26T00:58:33.013973757Z W0426 00:58:33.013918 1 controller.go:267] awe-pp-xxx-ah97u-master-0: machine has gone "Failed" phase. It won't reconcile
2021-04-26T00:58:33.014001757Z I0426 00:58:33.013974 1 controller.go:261] controller "msg"="Successfully Reconciled" "controller"="machine_controller" "name"="awe-pp-xxx-ah97u-master-0" "namespace"="openshift-machine-api"
~~~

- From Azure point of view, ProviderID from node description is matching with actual VM.

Version-Release number of selected component (if applicable):
OCP 4.6.21

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
Machine is in failed phase.

Expected results:
Machine should be in Running phase.

Additional info:
Issue is seen after updating OCP from 4.6.12 to 4.6.21 at the time the node was rebooted.