Bug 1898487 - [oVirt] Node is not removed when VM has been removed from oVirt engine
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Gal Zaidman
QA Contact: michal
URL:
Whiteboard:
Depends On: 1898655
Blocks: 1910104
 
Reported: 2020-11-17 09:59 UTC by Gal Zaidman
Modified: 2021-02-24 15:34 UTC
CC: 3 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:34:16 UTC
Target Upstream Version:
Embargoed:
jspeed: needinfo-




Links:
- GitHub: openshift/cluster-api-provider-ovirt pull 77 (closed) - "Bug 1898487: Node is not removed when VM has been removed from oVirt engine" - last updated 2021-01-24 08:05:05 UTC
- Red Hat Product Errata: RHSA-2020:5633 - last updated 2021-02-24 15:34:53 UTC

Description Gal Zaidman 2020-11-17 09:59:05 UTC
Description of problem:

When the oVirt admin removes the oVirt VM that is acting as a node in the OpenShift cluster, the Machine and Node objects remain.
The Node will have status "NotReady".
The Machine object will be in PHASE="Running" and STATE (instance-state)="Down"; because the Node is still part of OpenShift, the Machine still has a node ref pointing to it.


Steps to Reproduce:
1. Create an OCP cluster on top of RHV
2. Remove a worker VM from RHV
3. Watch the machines and events in the machine-api namespace and see that the corresponding Node and Machine objects remain unchanged


Actual results:

The Node and Machine remain in OCP even though the VM has been removed in oVirt

Expected results:
The oVirt provider should detect that the VM has been removed in oVirt and remove the Node, which will lead to the removal of the Machine

Comment 1 Joel Speed 2020-11-17 10:27:11 UTC
> nodelink-controller should detect that the VM is removed in oVirt and remove the node, which will lead to the Node ref being empty or node not found, which should get the machine health check controller to remediate it.

This is not quite true, we expect the cloud-provider controller (in either kube-controller-manager or external such as cloud-provider-ovirt) to remove a Node when the VM is removed on the provider side. Nodelink is not involved in this process.

Comment 2 Gal Zaidman 2020-11-17 10:54:09 UTC
(In reply to Joel Speed from comment #1)
> > nodelink-controller should detect that the VM is removed in oVirt and remove the node, which will lead to the Node ref being empty or node not found, which should get the machine health check controller to remediate it.
> 
> This is not quite true, we expect the cloud-provider controller (in either
> kube-controller-manager or external such as cloud-provider-ovirt) to remove
> a Node when the VM is removed on the provider side. Nodelink is not involved
> in this process.

Thanks, I corrected the expected result.
So if I got it right, the oVirt provider should check whether the cloud instance exists; when it finds that the cloud instance is removed and the Machine already has a providerID and nodeRef, the Machine should be moved to the Failed phase and the machine health check controller will delete the Machine and Node objects?

Comment 3 Joel Speed 2020-11-17 11:24:00 UTC
> So if I got it right, the oVirt provider should check whether the cloud instance exists; when it finds that the cloud instance is removed and the Machine already has a providerID and nodeRef, the Machine should be moved to the Failed phase and the machine health check controller will delete the Machine and Node objects?

I think there's maybe been some confusion about terms here.

In a normal cloud environment, let's take AWS as an example. There are two sets of controllers that interact with the cloud.

Firstly, there is cluster-api-provider-aws. This is the provider which is used in the machine controller to create VMs on EC2.

Secondly, there is cloud-provider-aws (which is currently embedded in kube-controller-manager). This is responsible for the Node lifecycle, load balancers and other core Kubernetes functionality that is provided by the cloud provider.

In the oVirt case, I think this is part of https://github.com/ovirt/ovirt-openshift-extensions, under ovirt-cloud-provider.

This ovirt-cloud-provider component should monitor Node objects in the cluster and match their providerID with VMs on the oVirt API. If any Node has a providerID that no longer exists on the oVirt API, then it should remove that Node.

The Machine controller will also notice that the Machine is provisioned (ie has a providerID) but the VM doesn't exist (via the actuator Exists function) and will mark the Machine as `Failed`. This logic is already implemented and is independent of the provider implementation.

These two loops are completely independent of one another.

Assuming your Exists logic is correct (which it may not be, looking at https://github.com/openshift/cluster-api-provider-ovirt/blob/bfd553fe7dac29790f0ad1f74222ea2a02be5e36/pkg/cloud/ovirt/machine/actuator.go#L136-L157), it should return "false, nil" if the cloud provider VM is not present. In most providers we have to check that the error isn't a not-found error and return appropriately, but you aren't doing that at the moment. I expect the controller in this case is looping, saying it can't find the VM, rather than handling this as expected.
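
To make the expected shape concrete, here is a minimal sketch of an Exists implementation that handles the not-found case the way the machine controller expects. The vmGetter interface, GetVMByMachineName and errNotFound are made up for this example and are not the actual cluster-api-provider-ovirt code; the real actuator would map the oVirt SDK's not-found error instead.

package machine

import (
	"context"
	"errors"
	"fmt"
)

// errNotFound stands in for whatever "VM not found" error the oVirt SDK
// returns; the real check would inspect the SDK error type or an HTTP 404.
var errNotFound = errors.New("vm not found")

// vmGetter is a stand-in for the oVirt client the actuator uses.
type vmGetter interface {
	GetVMByMachineName(ctx context.Context, name string) (vmID string, err error)
}

// Actuator is a pared-down stand-in for the provider actuator.
type Actuator struct {
	client vmGetter
}

// Exists reports whether the VM backing the named machine still exists in
// oVirt. The key point: "VM not found" must be returned as (false, nil)
// rather than as an error, so the machine controller can move the Machine
// to the Failed phase instead of retrying forever.
func (a *Actuator) Exists(ctx context.Context, machineName string) (bool, error) {
	_, err := a.client.GetVMByMachineName(ctx, machineName)
	if errors.Is(err, errNotFound) {
		return false, nil
	}
	if err != nil {
		// Transient errors (network, auth, ...) are surfaced so the
		// controller retries rather than wrongly failing the Machine.
		return false, fmt.Errorf("error checking whether VM exists: %w", err)
	}
	return true, nil
}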

Comment 4 Gal Zaidman 2020-11-17 16:42:13 UTC
(In reply to Joel Speed from comment #3)
> > So if I got it right, the oVirt provider should check whether the cloud instance exists; when it finds that the cloud instance is removed and the Machine already has a providerID and nodeRef, the Machine should be moved to the Failed phase and the machine health check controller will delete the Machine and Node objects?
> 
> I think there's maybe been some confusion about terms here.
> 
> In a normal cloud environment, let's take AWS as an example. There are two
> sets of controllers that interact with the cloud.
> 
> Firstly, there is cluster-api-provider-aws. This is the provider which is
> used in the machine controller to create VMs on EC2.

This controller is the equivalent of cluster-api-provider-ovirt[1]

> Secondly, there is cloud-provider-aws (which is currently embedded in
> kube-controller-manager). This is responsible for the Node lifecycle, load
> balancers and other core Kubernetes functionality that is provided by the
> cloud provider.
> 
> In the oVirt case, I think this is part of
> https://github.com/ovirt/ovirt-openshift-extensions, under 
> ovirt-cloud-provider.
> 
> This ovirt-cloud-provider component should monitor Node objects in the
> cluster and match their providerID with VMs on the oVirt API. If any Node
> has a providerID that no longer exists on the oVirt API, then it should
> remove that Node.

So a few things about that: ovirt-openshift-extensions is not under development and ovirt-cloud-provider[2] is not used.
I talked with Roy Golan, who started this project and created cluster-api-provider-ovirt. Since the MAO has no mechanism for loading external cloud providers and we can't embed it into KCM like AWS, he created another controller, providerIDController[3], which handles the providerID logic.
When I look at the code I see that it doesn't have the logic you described, so I will add it.
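
For illustration, this is roughly the logic I plan to add to providerIDController (a sketch only; the ovirtVMChecker interface, VMExists and removeOrphanedNodes are names invented for this example, not the actual code):

package providerid

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ovirtVMChecker is a stand-in for the oVirt API client; the real
// controller would use the oVirt SDK and derive the VM ID from the
// Node's providerID.
type ovirtVMChecker interface {
	VMExists(ctx context.Context, providerID string) (bool, error)
}

// removeOrphanedNodes deletes any Node whose providerID no longer maps to
// a VM in the oVirt engine, which is the gap described above.
func removeOrphanedNodes(ctx context.Context, c client.Client, ovirt ovirtVMChecker) error {
	var nodes corev1.NodeList
	if err := c.List(ctx, &nodes); err != nil {
		return fmt.Errorf("failed to list nodes: %w", err)
	}
	for i := range nodes.Items {
		node := &nodes.Items[i]
		if node.Spec.ProviderID == "" {
			continue // not linked to a VM yet
		}
		exists, err := ovirt.VMExists(ctx, node.Spec.ProviderID)
		if err != nil {
			return fmt.Errorf("failed to look up VM for node %s: %w", node.Name, err)
		}
		if !exists {
			// The backing VM was removed in oVirt: delete the Node so the
			// machine controller / MHC can clean up the Machine as well.
			if err := c.Delete(ctx, node); err != nil {
				return fmt.Errorf("failed to delete node %s: %w", node.Name, err)
			}
		}
	}
	return nil
}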
 

> The Machine controller will also notice that the Machine is provisioned (ie
> has a providerID) but the VM doesn't exist (via the actuator Exists
> function) and will mark the Machine as `Failed`. This logic is already
> implemented and is independent of the provider implementation.
> 

What do you mean by "independent of the provider implementation"? Shouldn't that logic be in our provider implementation (meaning cluster-api-provider-ovirt)?

> These two loops are completely independent of one another.
> 
> Assuming your Exists logic is correct (which it may not be looking at
> https://github.com/openshift/cluster-api-provider-ovirt/blob/
> bfd553fe7dac29790f0ad1f74222ea2a02be5e36/pkg/cloud/ovirt/machine/actuator.
> go#L136-L157), it should return "false, nil" if the cloud provider VM is not
> present. In most providers we have to check the error isn't a not found
> error and return appropriately but you aren't doing that at the moment. I
> expect the controller in this case is looping saying it can't find the VM
> rather than handling this as expected

I'll look into that

[1] https://github.com/openshift/cluster-api-provider-ovirt/tree/master/cmd/manager
[2] https://github.com/oVirt/ovirt-openshift-extensions/blob/master/cmd/ovirt-cloud-provider/ovirt-cloud-provider.go
[3] https://github.com/openshift/cluster-api-provider-ovirt/blob/master/pkg/cloud/ovirt/providerIDcontroller/providerIDController.go

Comment 5 Joel Speed 2020-11-18 13:45:51 UTC
>  What do you mean by "independent of the provider implementation"? Shouldn't that logic be in our provider implementation (meaning cluster-api-provider-ovirt)?

The expectation here is that this logic is in the core of the machine controller, rather than in the provider code, so that it simplifies the interface that the cloud providers have to implement.
Based on the interface, we should be able to use Exists for any cloud provider to determine whether the machine should go Failed or not.
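
As a toy illustration of that contract (simplified, not the actual machine-api-operator code), the core controller's side looks roughly like this:

package machinecontroller

import "context"

// existsFunc mirrors the provider's Exists check discussed above.
type existsFunc func(ctx context.Context, machineName string) (bool, error)

// machineState is a pared-down view of the fields the core controller
// cares about for this decision.
type machineState struct {
	Name       string
	ProviderID string
	Phase      string
}

// reconcilePhase sketches the provider-independent rule: a Machine that
// was provisioned (has a providerID) but whose instance no longer exists,
// according to the provider's Exists, is moved to the Failed phase; a
// machine health check can then delete the Machine and its Node.
func reconcilePhase(ctx context.Context, m *machineState, exists existsFunc) error {
	if m.ProviderID == "" {
		return nil // never provisioned; nothing to fail
	}
	ok, err := exists(ctx, m.Name)
	if err != nil {
		return err // transient error: retry, don't fail the Machine
	}
	if !ok {
		m.Phase = "Failed"
	}
	return nil
}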

Comment 8 michal 2020-12-22 09:58:34 UTC
Verified on:
4.7.0-0.nightly-2020-12-20-003733

Steps:
1) In the command line, run 'oc get nodes' and verify that all VMs are listed
2) Open the RHV UI
3) In the 'Virtual Machine' screen, choose any worker virtual machine and 'Power Off' it
4) Remove the virtual machine
5) Return to the command line and run 'oc get nodes' again - verify that the node was deleted
6) Run 'oc get machines' - verify that one machine moved to 'Failed' and that after a while it is deleted as well
7) Run 'oc get machineset' - verify that the 'AVAILABLE' count is updated to match the available VMs


Result:
The VM deleted from RHV was reflected in the node and machine lists.

Comment 10 errata-xmlrpc 2021-02-24 15:34:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

