1712068 – cluster-api machines are recreated when backing instance removed

Summary: cluster-api machines are recreated when backing instance removed

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	4.3.0
Assignee:	Alberto
QA Contact:	Jianwei Hou
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-05-20 17:05 UTC by Michael Gugino
Modified:	2020-01-23 11:03 UTC
CC List:	2 users
Fixed In Version:
Doc Type:
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-01-23 11:03:45 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2020:0062	0	None	None	None	2020-01-23 11:03:57 UTC

Description Michael Gugino 2019-05-20 17:05:20 UTC

Description of problem:

If an instance is removed from the cloud-provider (either by user deletion or by a cloud-provider event of some kind), and the machine-object is reconciled again for any reason, the machine-controller may determine that the instance no longer exists and attempt to create it again.  This is undocumented behavior and should not be relied upon for workflows.  It might also interfere with current or future components, such as the node-health-checker.

We should track state to ensure the Create() function is called only once for any machine-object.  A machine-object that has had its backing instance removed should be handled by the node-health-checker or a similar out-of-band controller.

Comment 1 Michael Gugino 2019-05-20 17:33:12 UTC

Added to 4.1 issue tracker comments: https://github.com/openshift/openshift-docs/issues/12487

If we ship a fix for this prior to shipping 4.1 GA, may want to ensure it gets tracked properly there as well.

Comment 2 Alberto 2019-05-22 08:41:05 UTC

This is by design the expected behaviour for any Kubernetes controller and we shouldn't deviate from it - it reconciles existing state with desired state. If something deletes an instance out of band, the machine API will notice only once the controller resync period has expired. How machine health checking or any other component interacts with this by consuming the API is orthogonal.
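
A minimal sketch of the reconcile behaviour described in this comment, written in Python for brevity. Every name in it (`FakeCloud`, `Machine`, `reconcile`) is hypothetical and is not one of the machine API's real types; it only illustrates how a level-triggered loop would re-create an instance that was deleted out of band.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCloud:
    """Hypothetical in-memory stand-in for a cloud provider."""
    instances: set = field(default_factory=set)
    create_calls: int = 0

    def exists(self, name: str) -> bool:
        return name in self.instances

    def create(self, name: str) -> None:
        self.create_calls += 1
        self.instances.add(name)


@dataclass
class Machine:
    """Hypothetical machine object; desired state is one instance with this name."""
    name: str


def reconcile(machine: Machine, cloud: FakeCloud) -> None:
    # Level-triggered: compare desired state with observed state on every pass.
    # Nothing records that create() already ran for this machine.
    if not cloud.exists(machine.name):
        cloud.create(machine.name)


cloud = FakeCloud()
machine = Machine("worker-0")

reconcile(machine, cloud)            # first pass creates the instance
cloud.instances.discard("worker-0")  # simulated out-of-band deletion
reconcile(machine, cloud)            # the next resync creates it again
```

In this sketch `create()` runs twice for the same machine object, which is the behaviour this bug asks to change.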

Comment 3 Alberto 2019-07-26 13:50:27 UTC

Keeping this open and bumping to 4.3, as we plan to make machine objects "fire and forget" in terms of cloud instance creation.

Comment 4 Eric Rich 2019-09-03 21:10:35 UTC

(In reply to Michael Gugino from comment #0)

Why is this not expected behavior? If it's not expected behavior then why is it not part of the docs directly? 

(In reply to Alberto from comment #2)
> This is by design the expected behaviour for any Kubernetes controller and we
> shouldn't deviate from it - it reconciles existing state with desired state.

If this is true then we should close as NOT A BUG.

Comment 5 Michael Gugino 2019-09-03 23:04:42 UTC

(In reply to Eric Rich from comment #4)
> (In reply to Michael Gugino from comment #0)
> 
> Why is this not expected behavior? If it's not expected behavior then why is
> it not part of the docs directly? 
> 


@Eric

This behavior is mostly an artifact from upstream.  There's a multitude of reasons why it doesn't fit well for us, and upstream is (I believe) also switching to the 'create once' model.  We only recently decided which behavior we actually want.

> (In reply to Alberto from comment #2)
> > This is by design the expected behaviour for any Kubernetes controller and we
> > shouldn't deviate from it - it reconciles existing state with desired state.
> 
> If this is true then we should close as NOT A BUG.

This statement is outdated.  This bug might be redundant if we're tracking the feature work elsewhere, but on the other hand, it might be useful to others who consider this a bug to have the rationale here until we cover it in docs and code.

Comment 6 Alberto 2019-11-06 12:56:12 UTC

Since we introduced machine phases, this should be fixed now. If a cloud instance is deleted out of band, the backing machine should enter a failed phase, and that machine must then be deleted. https://github.com/openshift/cluster-api/pull/75
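
A rough sketch of the phase-based behaviour described in this comment, in Python for brevity and not taken from the linked pull request. The `Phase` enum, `Machine` class and `reconcile` function are hypothetical; only the 'Failed' phase name comes from this bug, and the other phase names are illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    # Only "Failed" is named in this bug; the other two are assumed for illustration.
    PROVISIONING = "Provisioning"
    PROVISIONED = "Provisioned"
    FAILED = "Failed"


@dataclass
class Machine:
    """Hypothetical machine object that remembers how far provisioning got."""
    name: str
    phase: Phase = Phase.PROVISIONING


def reconcile(machine: Machine, instances: set, create_log: list) -> None:
    if machine.phase is Phase.FAILED:
        # Terminal: nothing is re-created; the machine object has to be deleted.
        return
    if machine.name in instances:
        machine.phase = Phase.PROVISIONED
    elif machine.phase is Phase.PROVISIONING:
        # No instance has been created for this machine yet: create it once.
        instances.add(machine.name)
        create_log.append(machine.name)
        machine.phase = Phase.PROVISIONED
    else:
        # The instance existed earlier and is now gone: mark failed, do not re-create.
        machine.phase = Phase.FAILED


instances = set()
create_log = []
machine = Machine("worker-0")

reconcile(machine, instances, create_log)  # creates the instance once
instances.discard("worker-0")              # simulated out-of-band deletion
reconcile(machine, instances, create_log)  # machine moves to the Failed phase
reconcile(machine, instances, create_log)  # later passes leave it untouched
```

Under these assumptions the create step runs only once per machine object, and a missing instance leaves the machine in 'Failed' until it is deleted.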

Comment 8 Jianwei Hou 2019-11-15 06:36:54 UTC

Verified in 4.3.0-0.nightly-2019-11-13-233341.

If the instance is deleted, its backing machine enters the 'Failed' phase. The machine must then be deleted.

Comment 10 errata-xmlrpc 2020-01-23 11:03:45 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062
