Description of problem:

After an unexpected shutdown of one of the RHV hypervisors where the OCP nodes run, the Machine objects for 2 out of 3 OCP worker nodes report the Failed phase.

$ oc -n openshift-machine-api get machine -o custom-columns='NODE:.metadata.name,PHASE:.status.phase,STATUS:.status.errorMessage'
NODE                        PHASE     STATUS
ocp4-dqlhz-master-0         Running   <none>
ocp4-dqlhz-master-1         Running   <none>
ocp4-dqlhz-master-2         Running   <none>
ocp4-dqlhz-worker-0-cdr9b   Running   <none>
ocp4-dqlhz-worker-0-cfsng   Failed    Can't find created instance.
ocp4-dqlhz-worker-0-cv6b7   Failed    Can't find created instance.

But all OCP nodes are in Ready state:

$ oc get nodes
NAME                        STATUS   ROLES    AGE   VERSION
ocp4-dqlhz-master-0         Ready    master   69d   v1.18.3+b74c5ed
ocp4-dqlhz-master-1         Ready    master   69d   v1.18.3+b74c5ed
ocp4-dqlhz-master-2         Ready    master   69d   v1.18.3+b74c5ed
ocp4-dqlhz-worker-0-cdr9b   Ready    worker   69d   v1.18.3+b74c5ed
ocp4-dqlhz-worker-0-cfsng   Ready    worker   69d   v1.18.3+b74c5ed
ocp4-dqlhz-worker-0-cv6b7   Ready    worker   69d   v1.18.3+b74c5ed

The MachineSet status shows 3 readyReplicas:

...
status:
  availableReplicas: 3
  fullyLabeledReplicas: 3
  observedGeneration: 1
  readyReplicas: 3
  replicas: 3
...

The machine-api-controller pod logs show:

2020-07-22T07:46:04.882661063Z I0722 07:46:04.882568       1 controller.go:164] Reconciling Machine "ocp4-dqlhz-worker-0-cfsng"
2020-07-22T07:46:04.882661063Z W0722 07:46:04.882628       1 controller.go:273] Machine "ocp4-dqlhz-worker-0-cfsng" has gone "Failed" phase. It won't reconcile
2020-07-22T07:46:11.018940927Z I0722 07:46:11.018795       1 controller.go:164] Reconciling Machine "ocp4-dqlhz-worker-0-cfsng"
2020-07-22T07:46:11.018940927Z W0722 07:46:11.018859       1 controller.go:273] Machine "ocp4-dqlhz-worker-0-cfsng" has gone "Failed" phase. It won't reconcile

The Machine object for one of the "Failed" worker nodes shows:

...
{"kind":"Machine","apiVersion":"machine.openshift.io/v1beta1","metadata":{"name":"ocp4-dqlhz-worker-0-cfsng","generateName":"ocp4-dqlhz-worker-0-","namespace":"openshift-machine-api","selfLink":"/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/ocp4-dqlhz-worker-0-cfsng","uid":"3647054f-a202-4dad-98da-0d2995177197","resourceVersion":"92008487","generation":1,"creationTimestamp":"2020-05-14T15:20:51Z","labels":{"machine.openshift.io/cluster-api-cluster":"ocp4-dqlhz","machine.openshift.io/cluster-api-machine-role":"worker","machine.openshift.io/cluster-api-machine-type":"worker","machine.openshift.io/cluster-api-machineset":"ocp4-dqlhz-worker-0"},"annotations":{"VmId":"94e31cba-e761-44fa-81ae-b0b5fe8d26c4","instance-status":""},"ownerReferences":[{"apiVersion":"machine.openshift.io/v1beta1","kind":"MachineSet","name":"ocp4-dqlhz-worker-0","uid":"9c67d617-774e-4e8e-a06b-b25fa05a0bc2","controller":true,"blockOwnerDeletion":true}],"finalizers":["machine.machine.openshift.io"]},"spec":{"metadata":{"creationTimestamp":null},"providerSpec":{"value":{"apiVersion":"ovirtproviderconfig.openshift.io/v1beta1","cluster_id":"e8728dae-185a-11ea-a2c6-00163e39ad17","credentialsSecret":{"name":"ovirt-credentials"},"id":"","kind":"OvirtMachineProviderSpec","metadata":{"creationTimestamp":null},"name":"","template_name":"ocp-rhcos-4.4.3-tmpl01","userDataSecret":{"name":"worker-user-data"}}},"providerID":"94e31cba-e761-44fa-81ae-b0b5fe8d26c4"},"status":{"nodeRef":{"kind":"Node","name":"ocp4-dqlhz-worker-0-cfsng","uid":"a42a11d9-6b90-4176-aa74-be74c7b1e61b"},"lastUpdated":"2020-07-10T19:55:33Z","providerStatus":{"metadata":{"creationTimestamp":null},"instanceId":"ocp4-dqlhz-worker-0-cfsng","instanceState":"up","conditions":[{"type":"MachineCreated","status":"True","lastProbeTime":"2020-07-10T19:55:33Z","lastTransitionTime":"2020-07-10T19:55:33Z","reason":"machineCreationSucceedReason","message":"machineCreationSucceedMessage"}]},"addresses":[{"type":"InternalDNS","address":"ocp4-dqlhz-worker-0-cfsng"},{"type":"InternalIP","address":"10.XX.XX.XX"}],"phase":"Running"}} ... Machine ID are matching with the underying RHEV Infrastructure: ocp4-dqlhz-worker-0-cdr9b.yaml providerID":"d67ac10c-7b74-4370-92fe-9595bf6ece28" ocp4-dqlhz-worker-0-cfsng.yaml providerID":"94e31cba-e761-44fa-81ae-b0b5fe8d26c4" ocp4-dqlhz-worker-0-cv6b7.yaml providerID":"13c64810-dfea-4d0a-adbd-88235b9ce60d" Restart of the machine-api-controller pod didn't help. Version-Release number of selected component (if applicable): RHV4.3 OCP4.5.2 How reproducible: In the customer environment Steps to Reproduce: 1. 2. 3. Actual results: Machines objects for 2 out of 3 OCP worker nodes are marked as "Failed" after an outage of the RHEV Infrastructure but the OCP worker nodes are actually up and running Expected results: All machines object should be reconciled and in Running state Additional info:
machine-api is the cloud compute component. Reassigning for a first look.
OCP 4.6 is an EUS release; these fixes need backporting from OCP 4.7.
Hi, I finished backporting the 2 PRs that fix the node/machine inconsistency from 4.7 to 4.6:
- https://bugzilla.redhat.com/show_bug.cgi?id=1909990
- https://bugzilla.redhat.com/show_bug.cgi?id=1910104

I believe the issue should be resolved in the next 4.6 release. Michal, can you please verify that this issue is resolved on 4.6 when you verify the above bugs?
Verified on:
OCP - 4.6.0-0.nightly-2021-01-03-162024
RHV - 4.4.4.3-0.5

Steps:
1) On the command line, run 'oc get nodes' and verify that all the VMs are there
2) Open the RHV UI
3) In the 'Virtual Machines' screen, choose any worker virtual machine and shut it down
4) Remove the virtual machine
5) Go back to the command line and run 'oc get nodes' again - verify that the node was deleted
6) Run 'oc get machines' - verify that one machine moved to 'Failed' and, after a while, is deleted as well

Result: the VM deleted from RHV was reflected in the node and machine lists.

If you perform these steps again, it leads to a different bug - Bug 1912567:
1) Open the RHV UI
2) In the 'Virtual Machines' screen, choose any worker virtual machine and shut it down
3) Remove the virtual machine
4) Run 'oc get nodes' - verify that the node was deleted
5) Run 'oc get machines' - verify that the relevant machine moved to 'Failed'

Actual: the node moved to 'NotReady' status and the machine status does not change:

[root@mgold-ocp-engine primary]# oc get machines
NAME                           PHASE     TYPE   REGION   ZONE   AGE
ovirt10-7c7kw-master-0         Running                          4h1m
ovirt10-7c7kw-master-1         Running                          4h1m
ovirt10-7c7kw-master-2         Running                          4h1m
ovirt10-7c7kw-worker-0-9t49p   Failed                           14m
ovirt10-7c7kw-worker-0-svn7p   Running                          104m

[root@mgold-ocp-engine primary]# oc get nodes
NAME                           STATUS     ROLES    AGE     VERSION
ovirt10-7c7kw-master-0         Ready      master   3h57m   v1.19.0+9c69bdc
ovirt10-7c7kw-master-1         Ready      master   3h57m   v1.19.0+9c69bdc
ovirt10-7c7kw-master-2         Ready      master   3h57m   v1.19.0+9c69bdc
ovirt10-7c7kw-worker-0-svn7p   NotReady   worker   96m     v1.19.0+9c69bdc

Expected: the node was deleted and the relevant machine moved to 'Failed'
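For reference, the command-line checks in the steps above can also be watched continuously while the VM is removed in the RHV UI; a rough sketch, assuming the Machine objects live in the default openshift-machine-api namespace:

# Watch nodes while the worker VM is shut down and removed in the RHV UI;
# the corresponding node should disappear (or go NotReady in the buggy case)
$ oc get nodes -w

# Watch the Machine phase; the affected worker should move to 'Failed' and
# then be deleted once the machine controller reconciles it
$ oc -n openshift-machine-api get machines -w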
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633