Description of problem:

After an unexpected shutdown of one of the RHV hypervisors where the OCP nodes run, the Machine objects for 2 out of 3 OCP worker nodes report the Failed phase.

$ oc -n openshift-machine-api get machine -o custom-columns='NODE:.metadata.name,PHASE:.status.phase,STATUS:.status.errorMessage'
NODE                        PHASE     STATUS
ocp4-dqlhz-master-0         Running   <none>
ocp4-dqlhz-master-1         Running   <none>
ocp4-dqlhz-master-2         Running   <none>
ocp4-dqlhz-worker-0-cdr9b   Running   <none>
ocp4-dqlhz-worker-0-cfsng   Failed    Can't find created instance.
ocp4-dqlhz-worker-0-cv6b7   Failed    Can't find created instance.

But all OCP nodes are in Ready state:

$ oc get nodes
NAME                        STATUS   ROLES    AGE   VERSION
ocp4-dqlhz-master-0         Ready    master   69d   v1.18.3+b74c5ed
ocp4-dqlhz-master-1         Ready    master   69d   v1.18.3+b74c5ed
ocp4-dqlhz-master-2         Ready    master   69d   v1.18.3+b74c5ed
ocp4-dqlhz-worker-0-cdr9b   Ready    worker   69d   v1.18.3+b74c5ed
ocp4-dqlhz-worker-0-cfsng   Ready    worker   69d   v1.18.3+b74c5ed
ocp4-dqlhz-worker-0-cv6b7   Ready    worker   69d   v1.18.3+b74c5ed

The MachineSet status shows 3 readyReplicas:

...
status:
  availableReplicas: 3
  fullyLabeledReplicas: 3
  observedGeneration: 1
  readyReplicas: 3
  replicas: 3
...

The machine-api-controller pod logs show:

2020-07-22T07:46:04.882661063Z I0722 07:46:04.882568       1 controller.go:164] Reconciling Machine "ocp4-dqlhz-worker-0-cfsng"
2020-07-22T07:46:04.882661063Z W0722 07:46:04.882628       1 controller.go:273] Machine "ocp4-dqlhz-worker-0-cfsng" has gone "Failed" phase. It won't reconcile
2020-07-22T07:46:11.018940927Z I0722 07:46:11.018795       1 controller.go:164] Reconciling Machine "ocp4-dqlhz-worker-0-cfsng"
2020-07-22T07:46:11.018940927Z W0722 07:46:11.018859       1 controller.go:273] Machine "ocp4-dqlhz-worker-0-cfsng" has gone "Failed" phase. It won't reconcile

The Machine object for one of the "Failed" worker nodes shows:

...
{"kind":"Machine","apiVersion":"machine.openshift.io/v1beta1","metadata":{"name":"ocp4-dqlhz-worker-0-cfsng","generateName":"ocp4-dqlhz-worker-0-","namespace":"openshift-machine-api","selfLink":"/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/ocp4-dqlhz-worker-0-cfsng","uid":"3647054f-a202-4dad-98da-0d2995177197","resourceVersion":"92008487","generation":1,"creationTimestamp":"2020-05-14T15:20:51Z","labels":{"machine.openshift.io/cluster-api-cluster":"ocp4-dqlhz","machine.openshift.io/cluster-api-machine-role":"worker","machine.openshift.io/cluster-api-machine-type":"worker","machine.openshift.io/cluster-api-machineset":"ocp4-dqlhz-worker-0"},"annotations":{"VmId":"94e31cba-e761-44fa-81ae-b0b5fe8d26c4","instance-status":""},"ownerReferences":[{"apiVersion":"machine.openshift.io/v1beta1","kind":"MachineSet","name":"ocp4-dqlhz-worker-0","uid":"9c67d617-774e-4e8e-a06b-b25fa05a0bc2","controller":true,"blockOwnerDeletion":true}],"finalizers":["machine.machine.openshift.io"]},"spec":{"metadata":{"creationTimestamp":null},"providerSpec":{"value":{"apiVersion":"ovirtproviderconfig.openshift.io/v1beta1","cluster_id":"e8728dae-185a-11ea-a2c6-00163e39ad17","credentialsSecret":{"name":"ovirt-credentials"},"id":"","kind":"OvirtMachineProviderSpec","metadata":{"creationTimestamp":null},"name":"","template_name":"ocp-rhcos-4.4.3-tmpl01","userDataSecret":{"name":"worker-user-data"}}},"providerID":"94e31cba-e761-44fa-81ae-b0b5fe8d26c4"},"status":{"nodeRef":{"kind":"Node","name":"ocp4-dqlhz-worker-0-cfsng","uid":"a42a11d9-6b90-4176-aa74-be74c7b1e61b"},"lastUpdated":"2020-07-10T19:55:33Z","providerStatus":{"metadata":{"creationTimestamp":null},"instanceId":"ocp4-dqlhz-worker-0-cfsng","instanceState":"up","conditions":[{"type":"MachineCreated","status":"True","lastProbeTime":"2020-07-10T19:55:33Z","lastTransitionTime":"2020-07-10T19:55:33Z","reason":"machineCreationSucceedReason","message":"machineCreationSucceedMessage"}]},"addresses":[{"type":"InternalDNS","address":"ocp4-dqlhz-worker-0-cfsng"},{"type":"InternalIP","address":"10.XX.XX.XX"}],"phase":"Running"}} ... Machine ID are matching with the underying RHEV Infrastructure: ocp4-dqlhz-worker-0-cdr9b.yaml providerID":"d67ac10c-7b74-4370-92fe-9595bf6ece28" ocp4-dqlhz-worker-0-cfsng.yaml providerID":"94e31cba-e761-44fa-81ae-b0b5fe8d26c4" ocp4-dqlhz-worker-0-cv6b7.yaml providerID":"13c64810-dfea-4d0a-adbd-88235b9ce60d" Restart of the machine-api-controller pod didn't help. Version-Release number of selected component (if applicable): RHV4.3 OCP4.5.2 How reproducible: In the customer environment Steps to Reproduce: 1. 2. 3. Actual results: Machines objects for 2 out of 3 OCP worker nodes are marked as "Failed" after an outage of the RHEV Infrastructure but the OCP worker nodes are actually up and running Expected results: All machines object should be reconciled and in Running state Additional info:
machine-api is the cloud compute component. Reassigning for a first look.
OCP 4.6 is an EUS release; these fixes need backporting from OCP 4.7.
Hi, I finished backporting the 2 PRs that fix the node/machine inconsistency from 4.7 to 4.6:
- https://bugzilla.redhat.com/show_bug.cgi?id=1909990
- https://bugzilla.redhat.com/show_bug.cgi?id=1910104

I believe the issue should be resolved in the next 4.6 release. Michal, can you please verify that this issue is resolved on 4.6 when you verify the above bugs?
Verified on:
OCP - 4.6.0-0.nightly-2021-01-03-162024
RHV - 4.4.4.3-0.5

Steps:
1) On the command line, run 'oc get nodes' and verify that all the VMs are there
2) Open the RHV UI
3) In the 'Virtual Machines' screen, choose any worker virtual machine and shut it down
4) Remove the virtual machine
5) Go back to the command line and run 'oc get nodes' again - verify that the node was deleted
6) Run 'oc get machines' - verify that one machine moved to 'Failed' and, after a while, is deleted as well

Result: the VM deleted from RHV was reflected in the node and machine lists.

If you perform these steps again, it leads to a different bug - Bug 1912567:
1) Open the RHV UI
2) In the 'Virtual Machines' screen, choose any worker virtual machine and shut it down
3) Remove the virtual machine
4) Run 'oc get nodes' - verify that the node was deleted
5) Run 'oc get machines' - verify that the relevant machine moved to 'Failed'

Actual: the node moved to 'NotReady' status and the machine status does not change:

[root@mgold-ocp-engine primary]# oc get machines
NAME                           PHASE     TYPE   REGION   ZONE   AGE
ovirt10-7c7kw-master-0         Running                          4h1m
ovirt10-7c7kw-master-1         Running                          4h1m
ovirt10-7c7kw-master-2         Running                          4h1m
ovirt10-7c7kw-worker-0-9t49p   Failed                           14m
ovirt10-7c7kw-worker-0-svn7p   Running                          104m

[root@mgold-ocp-engine primary]# oc get nodes
NAME                           STATUS     ROLES    AGE     VERSION
ovirt10-7c7kw-master-0         Ready      master   3h57m   v1.19.0+9c69bdc
ovirt10-7c7kw-master-1         Ready      master   3h57m   v1.19.0+9c69bdc
ovirt10-7c7kw-master-2         Ready      master   3h57m   v1.19.0+9c69bdc
ovirt10-7c7kw-worker-0-svn7p   NotReady   worker   96m     v1.19.0+9c69bdc

Expected: the node was deleted and the relevant machine moved to 'Failed'
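For reference, the command-line checks in the steps above can also be watched continuously while the VM is removed in the RHV UI; a rough sketch, assuming the Machine objects live in the default openshift-machine-api namespace:

# Watch nodes while the worker VM is shut down and removed in the RHV UI;
# the corresponding node should disappear (or go NotReady in the buggy case)
$ oc get nodes -w

# Watch the Machine phase; the affected worker should move to 'Failed' and
# then be deleted once the machine controller reconciles it
$ oc -n openshift-machine-api get machines -w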
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633