Bug 1860322

Summary: [OCPv4.5.2] after unexpected shutdown of one of the RHV hypervisors, OCP worker node Machines are marked as "Failed"
Product: OpenShift Container Platform
Reporter: Angelo Gabrieli <agabriel>
Component: Cloud Compute
Assignee: Evgeny Slutsky <eslutsky>
Cloud Compute sub component: oVirt Provider
QA Contact: michal <mgold>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: high
CC: cpassare, dsevosty, gzaidman, hpopal, jerzhang, lleistne, mgold, pelauter
Version: 4.5
Keywords: Reopened
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1935120 (view as bug list)
Environment:
Last Closed: 2021-02-24 15:13:58 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1935120

Description Angelo Gabrieli 2020-07-24 10:10:27 UTC
Description of problem:

After an unexpected shutdown of one of the RHV hypervisors where the OCP nodes run, the Machine objects for 2 out of 3 OCP worker nodes report the "Failed" phase.


$ oc -n openshift-machine-api get machine -o custom-columns='NODE:.metadata.name,PHASE:.status.phase,STATUS:.status.errorMessage'
NODE                        PHASE     STATUS
ocp4-dqlhz-master-0         Running   <none>
ocp4-dqlhz-master-1         Running   <none>
ocp4-dqlhz-master-2         Running   <none>
ocp4-dqlhz-worker-0-cdr9b   Running   <none>
ocp4-dqlhz-worker-0-cfsng   Failed    Can't find created instance.
ocp4-dqlhz-worker-0-cv6b7   Failed    Can't find created instance.
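

For reference, the error fields can be read straight off a failed Machine object (a sketch; errorReason is assumed to be populated alongside the errorMessage shown above):


$ oc -n openshift-machine-api get machine ocp4-dqlhz-worker-0-cfsng -o jsonpath='{.status.errorReason}{"\n"}{.status.errorMessage}{"\n"}'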


But all OCP nodes are in the Ready state:


$ oc get nodes
NAME                        STATUS   ROLES    AGE   VERSION
ocp4-dqlhz-master-0         Ready    master   69d   v1.18.3+b74c5ed
ocp4-dqlhz-master-1         Ready    master   69d   v1.18.3+b74c5ed
ocp4-dqlhz-master-2         Ready    master   69d   v1.18.3+b74c5ed
ocp4-dqlhz-worker-0-cdr9b   Ready    worker   69d   v1.18.3+b74c5ed
ocp4-dqlhz-worker-0-cfsng   Ready    worker   69d   v1.18.3+b74c5ed
ocp4-dqlhz-worker-0-cv6b7   Ready    worker   69d   v1.18.3+b74c5ed


The MachineSet status shows 3 ready replicas:


...
status:
  availableReplicas: 3
  fullyLabeledReplicas: 3
  observedGeneration: 1
  readyReplicas: 3
  replicas: 3
...
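

For reference, this status can be pulled straight from the MachineSet (a sketch; the MachineSet name is taken from the machine labels quoted further below):


$ oc -n openshift-machine-api get machineset ocp4-dqlhz-worker-0 -o yaml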


The machine-api-controller pod logs show:


2020-07-22T07:46:04.882661063Z I0722 07:46:04.882568       1 controller.go:164] Reconciling Machine "ocp4-dqlhz-worker-0-cfsng"
2020-07-22T07:46:04.882661063Z W0722 07:46:04.882628       1 controller.go:273] Machine "ocp4-dqlhz-worker-0-cfsng" has gone "Failed" phase. It won't reconcile
2020-07-22T07:46:11.018940927Z I0722 07:46:11.018795       1 controller.go:164] Reconciling Machine "ocp4-dqlhz-worker-0-cfsng"
2020-07-22T07:46:11.018940927Z W0722 07:46:11.018859       1 controller.go:273] Machine "ocp4-dqlhz-worker-0-cfsng" has gone "Failed" phase. It won't reconcile
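

These logs can be fetched along these lines (a sketch; the deployment and container names match recent 4.x releases and may vary):


$ oc -n openshift-machine-api logs deployment/machine-api-controllers -c machine-controller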


The Machine YAML for one of the "Failed" worker nodes shows:


...
      {"kind":"Machine","apiVersion":"machine.openshift.io/v1beta1","metadata":{"name":"ocp4-dqlhz-worker-0-cfsng","generateName":"ocp4-dqlhz-worker-0-","namespace":"openshift-machine-api","selfLink":"/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/ocp4-dqlhz-worker-0-cfsng","uid":"3647054f-a202-4dad-98da-0d2995177197","resourceVersion":"92008487","generation":1,"creationTimestamp":"2020-05-14T15:20:51Z","labels":{"machine.openshift.io/cluster-api-cluster":"ocp4-dqlhz","machine.openshift.io/cluster-api-machine-role":"worker","machine.openshift.io/cluster-api-machine-type":"worker","machine.openshift.io/cluster-api-machineset":"ocp4-dqlhz-worker-0"},"annotations":{"VmId":"94e31cba-e761-44fa-81ae-b0b5fe8d26c4","instance-status":""},"ownerReferences":[{"apiVersion":"machine.openshift.io/v1beta1","kind":"MachineSet","name":"ocp4-dqlhz-worker-0","uid":"9c67d617-774e-4e8e-a06b-b25fa05a0bc2","controller":true,"blockOwnerDeletion":true}],"finalizers":["machine.machine.openshift.io"]},"spec":{"metadata":{"creationTimestamp":null},"providerSpec":{"value":{"apiVersion":"ovirtproviderconfig.openshift.io/v1beta1","cluster_id":"e8728dae-185a-11ea-a2c6-00163e39ad17","credentialsSecret":{"name":"ovirt-credentials"},"id":"","kind":"OvirtMachineProviderSpec","metadata":{"creationTimestamp":null},"name":"","template_name":"ocp-rhcos-4.4.3-tmpl01","userDataSecret":{"name":"worker-user-data"}}},"providerID":"94e31cba-e761-44fa-81ae-b0b5fe8d26c4"},"status":{"nodeRef":{"kind":"Node","name":"ocp4-dqlhz-worker-0-cfsng","uid":"a42a11d9-6b90-4176-aa74-be74c7b1e61b"},"lastUpdated":"2020-07-10T19:55:33Z","providerStatus":{"metadata":{"creationTimestamp":null},"instanceId":"ocp4-dqlhz-worker-0-cfsng","instanceState":"up","conditions":[{"type":"MachineCreated","status":"True","lastProbeTime":"2020-07-10T19:55:33Z","lastTransitionTime":"2020-07-10T19:55:33Z","reason":"machineCreationSucceedReason","message":"machineCreationSucceedMessage"}]},"addresses":[{"type":"InternalDNS","address":"ocp4-dqlhz-worker-0-cfsng"},{"type":"InternalIP","address":"10.XX.XX.XX"}],"phase":"Running"}}
...
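

The fields of interest can be extracted without the full dump (a sketch over fields visible in the object above):


$ oc -n openshift-machine-api get machine ocp4-dqlhz-worker-0-cfsng -o jsonpath='{.spec.providerID}{" "}{.status.phase}{"\n"}'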


The Machine IDs match the underlying RHV infrastructure:


ocp4-dqlhz-worker-0-cdr9b.yaml providerID":"d67ac10c-7b74-4370-92fe-9595bf6ece28"
ocp4-dqlhz-worker-0-cfsng.yaml providerID":"94e31cba-e761-44fa-81ae-b0b5fe8d26c4"
ocp4-dqlhz-worker-0-cv6b7.yaml providerID":"13c64810-dfea-4d0a-adbd-88235b9ce60d"
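

A providerID can be cross-checked against RHV through the oVirt REST API (a sketch; the engine FQDN and the credentials are placeholders):


$ curl -sk -u 'admin@internal:PASSWORD' https://rhvm.example.com/ovirt-engine/api/vms/94e31cba-e761-44fa-81ae-b0b5fe8d26c4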


Restarting the machine-api-controller pod didn't help.
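

For reference, such a restart can be done along these lines (a sketch; the deployment name matches recent 4.x releases):


$ oc -n openshift-machine-api rollout restart deployment/machine-api-controllers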


Version-Release number of selected component (if applicable):
RHV 4.3
OCP 4.5.2


How reproducible:
In the customer environment


Steps to Reproduce:
1.
2.
3.

Actual results:
The Machine objects for 2 out of 3 OCP worker nodes are marked as "Failed" after an outage of the RHV infrastructure, but the OCP worker nodes are actually up and running.


Expected results:
All Machine objects should be reconciled and in the Running state.


Additional info:

Comment 2 Yu Qi Zhang 2020-07-24 20:51:08 UTC
Machine-api is the Cloud Compute component. Reassigning for a first look.

Comment 10 Peter Lauterbach 2020-12-15 12:50:55 UTC
OCP 4.6 is an EUS release; these fixes need backporting from OCP 4.7.

Comment 11 Gal Zaidman 2020-12-23 16:44:53 UTC
Hi, I finished backporting the two PRs that fix the node/machine inconsistency from 4.7 to 4.6:

- https://bugzilla.redhat.com/show_bug.cgi?id=1909990
- https://bugzilla.redhat.com/show_bug.cgi?id=1910104

I believe the issue should be resolved in the next 4.6 release.

Michal, can you please verify that this issue is resolved on 4.6 when you verify the above bugs?

Comment 12 michal 2021-01-04 19:12:37 UTC
Verified on:
OCP: 4.6.0-0.nightly-2021-01-03-162024
RHV: 4.4.4.3-0.5

Steps (the command-line side is sketched after the list):
1) On the command line, run 'oc get nodes' and verify that all the VMs are there
2) Open the RHV UI
3) In the 'Virtual Machines' screen, choose any worker virtual machine and click 'Shutdown'
4) Remove the virtual machine
5) Back on the command line, run 'oc get nodes' again and verify that the node was deleted
6) Run 'oc get machines' and verify that one machine changed to 'Failed'; after a while it is deleted as well
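
A minimal sketch of those command-line checks, assuming the default machine-api namespace:

$ oc get nodes
$ oc -n openshift-machine-api get machines
  ... shut down and remove the worker VM in the RHV UI ...
$ oc get nodes
$ oc -n openshift-machine-api get machines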


Result:
The VM deleted from RHV was reflected in both the node and the machine lists.

Performing these steps again leads to a different bug, Bug 1912567:
1) Open the RHV UI
2) In the 'Virtual Machines' screen, choose any worker virtual machine and click 'Shutdown'
3) Remove the virtual machine
4) Check 'oc get nodes' and verify that the node was deleted
5) Check 'oc get machines' and verify that the relevant machine changed to 'Failed'

Actual:
the node changed to 'NotReady' status and the machine status did not change

[root@mgold-ocp-engine primary]# oc get machines
NAME                           PHASE     TYPE   REGION   ZONE   AGE
ovirt10-7c7kw-master-0         Running                          4h1m
ovirt10-7c7kw-master-1         Running                          4h1m
ovirt10-7c7kw-master-2         Running                          4h1m
ovirt10-7c7kw-worker-0-9t49p   Failed                           14m
ovirt10-7c7kw-worker-0-svn7p   Running                          104m
[root@mgold-ocp-engine primary]# oc get nodes
NAME                           STATUS     ROLES    AGE     VERSION
ovirt10-7c7kw-master-0         Ready      master   3h57m   v1.19.0+9c69bdc
ovirt10-7c7kw-master-1         Ready      master   3h57m   v1.19.0+9c69bdc
ovirt10-7c7kw-master-2         Ready      master   3h57m   v1.19.0+9c69bdc
ovirt10-7c7kw-worker-0-svn7p   NotReady   worker   96m     v1.19.0+9c69bdc
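

For reference, the phase/status transitions can also be followed live with --watch (a sketch):


$ oc -n openshift-machine-api get machines -w
$ oc get nodes -w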


Expected:
the node is deleted and the relevant machine changes to 'Failed'

Comment 15 errata-xmlrpc 2021-02-24 15:13:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633