Description of problem:
When scaling up a cluster using IPI, we create a bmh resource, and introspection starts and completes successfully. Afterwards, we increase --replicas to start the deployment. If for some reason the deployment fails, we delete the bmh resource and then decrease --replicas back to the previous number, and a working node is deprovisioned.

Version-Release number of selected component (if applicable):
4.3.5

How reproducible:
Always

Steps to Reproduce:
1. Create a bmh resource and wait until it's Ready
2. Scale up the machineset to start the deployment
3. Delete the bmh
4. Scale down the replicas to match the number of existing worker nodes

Actual results:
One working node is deprovisioned.

Expected results:
No working node should be deprovisioned, as the scale-down matches the actual number of existing workers.

Additional info:
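For reference, a rough sketch of the reproduction steps as commands; the MachineSet name, bmh name, and replica counts below are placeholders:

    # 1. Register the new host and wait for introspection to finish (state "ready")
    oc apply -f new-worker-bmh.yaml -n openshift-machine-api
    oc get bmh -n openshift-machine-api

    # 2. Scale up the worker machineset to start deploying onto the new host
    oc scale machineset ostest-worker -n openshift-machine-api --replicas=3

    # 3. The deployment fails, so the bmh is deleted
    oc delete bmh new-worker -n openshift-machine-api

    # 4. Scale back down to the previous replica count
    oc scale machineset ostest-worker -n openshift-machine-api --replicas=2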
The first thing to mention is that deleting the BareMetalHost is probably not what you want. You should only do that when you don't want that Host to be part of the inventory any more (i.e. you will never provision it again). The baremetal Machine actuator will delete the Machine that has provisioned a Host if the Host is deleted, *but* only once the Host reaches the 'Deleting' state. That means the Host has to be completely deprovisioned first. This is probably something that we could improve. I'm not clear on how the MachineSet decides which Machine to remove when scaling down. But I suspect that in this scenario if you scale down before the Host has reached the Deleting state, there is no way for it to know that one Machine is not in good shape.
PR: https://github.com/openshift/cluster-api-provider-baremetal/pull/57
So this issue doesn't only happen when deleting bmh's. If for any reason you end up with a replica count higher than your actual number of nodes and you scale it down to match the existing number of deployed nodes, it still deletes a working provisioned node.
That sounds like a problem with MachineSet then.
I'm not 100% sure, but I believe there was a special "machineset.clusters.k8s.io/delete-me=yes" annotation that could be added to the node so that when the machineset is scaled down, the nodes with that annotation are deleted first (https://github.com/metal3-io/cluster-api-provider-baremetal/issues/10). That's from a while ago so it may be different now, but just in case.
The problem is that if I have a machineset with 20 replicas but only 10 bmh objects, and I change the replicas to the same number as the bmh's, it shouldn't delete any node.
The MachineSet controller does not consider whether a Machine corresponds to a healthy Node when it looks for one to delete during a scale-down operation. That would be a reasonable feature to add, and it would benefit all IPI platforms.

See the first step of the process as documented upstream: https://github.com/metal3-io/metal3-docs/blob/master/design/remove-host.md

"Find the Machine that corresponds to the BareMetalHost that you want to remove. Add the annotation cluster.k8s.io/delete-machine with any value that is not an empty string. This ensures that when you later scale down the MachineSet, this Machine is the one that will be removed."

For OpenShift, the annotation key should be "machine.openshift.io/cluster-api-delete-machine". I'm not sure if that's documented, but I created this issue some time ago so we don't forget: https://github.com/openshift/cluster-api-provider-baremetal/issues/46
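A sketch of that workaround with the OpenShift annotation key; the Machine and MachineSet names here are placeholders:

    # Mark the Machine that corresponds to the BareMetalHost you want to remove
    oc annotate machine ostest-worker-2 \
        machine.openshift.io/cluster-api-delete-machine=yes \
        -n openshift-machine-api

    # Scaling down now removes the annotated Machine rather than an arbitrary one
    oc scale machineset ostest-worker -n openshift-machine-api --replicas=2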
My understanding is that the PR is not going to fix the problem, assuming the problem is: "When I scale down a MachineSet, I expect to see Machines with no Node prioritized for deletion, but I see that whether a Machine corresponds to a healthy Node has no influence over whether it's selected for deletion." The fix to the problem as reported in this issue is to enhance the machineset controller to more intelligently choose a Machine to delete when scaling down. The workaround is to use the annotation as described above to manually designate which Machine to delete.
The default behavior when scaling down a machineset is to delete a random machine. If you want a specific one to get deleted, you must annotate it first. Here is the annotation name from the machine-api-operator code:

    // DeleteNodeAnnotation marks nodes that will be given priority for deletion
    // when a machineset scales down. This annotation is given top priority on all delete policies.
    DeleteNodeAnnotation = "machine.openshift.io/cluster-api-delete-machine"

The value of this annotation can be anything; the code just checks that it is non-empty, and then chooses that Machine as the highest priority to delete during the scale-down. Zane's PR will also help: it speeds up deleting a Machine if the underlying BareMetalHost gets manually deleted. With or without that PR, please try using this annotation to ensure the correct Machine is deleted when scaling down the machineset.
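A quick way to double-check that the annotation is present (and non-empty) on a Machine before scaling down; the Machine name is a placeholder:

    oc get machine ostest-worker-2 -n openshift-machine-api -o yaml \
        | grep cluster-api-delete-machine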
But how do we fix cases where the MachineSet replica count is much higher than the number of available nodes? How do we make the MachineSet match the actual number of nodes? If we scale down from 20 replicas to 10 replicas to match the 10 available bmh's, it will still try to delete nodes instead of simply doing nothing.
(In reply to Michael Zamot from comment #12)
> But how do we fix cases where the MachineSet replica count is much higher
> than the number of available nodes? How do we make the MachineSet match the
> actual number of nodes? If we scale down from 20 replicas to 10 replicas to
> match the 10 available bmh's, it will still try to delete nodes instead of
> simply doing nothing.

If you have a MachineSet with 20 replicas, there will be 20 Machines. When you scale down from 20 to 10, you want the 10 Machines not yet associated with any BareMetalHost to be the ones deleted. That should be the case if you annotate those 10.
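A minimal sketch of that, assuming the stuck Machines can be identified as the ones that never got a Node; all names here are placeholders:

    # Machines that never became nodes show an empty NODE column (and a phase
    # such as Provisioning rather than Running)
    oc get machines -n openshift-machine-api -o wide

    # Annotate each stuck Machine so it is deleted first
    oc annotate machine ostest-worker-10 ostest-worker-11 \
        machine.openshift.io/cluster-api-delete-machine=yes \
        -n openshift-machine-api

    # Then scale down to match the 10 available hosts
    oc scale machineset ostest-worker -n openshift-machine-api --replicas=10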
*** Bug 1810430 has been marked as a duplicate of this bug. ***
*** Bug 1791057 has been marked as a duplicate of this bug. ***
Generally speaking, scaling operations are cattle. If you find yourself needing to pet a scaling operation, that is usually a signal that you shouldn't have got there in the first place and the approach should be reconsidered.

Machines are the fundamental unit of the machine API ecosystem. If you want to get rid of a machine, delete the Machine resource and your provider must ensure a graceful termination.

Please educate users about this and try to put mechanisms in place to prevent them from manipulating your lower-level objects, which might result in unpredictable scenarios.

As a user you can choose a delete policy for your MachineSets: "Random", "Newest", "Oldest".
https://github.com/openshift/machine-api-operator/blob/master/pkg/apis/machine/v1beta1/machineset_types.go#L66-L69
Any machine annotated with "machine.openshift.io/cluster-api-delete-machine" has the highest priority for deletion during a scale-down operation, regardless of your deletion policy. Please see https://docs.openshift.com/container-platform/4.4/machine_management/manually-scaling-machineset.html

All of the above describes the existing design. If you have identified strong use cases and background to propose a change that would result in a better experience for our users on baremetal, please make sure to create an RFE or a PR against https://github.com/openshift/enhancements/tree/master/enhancements/machine-api. As this is coming from the baremetal team itself, I'm assuming this has already been evaluated and that is not the case, and that the mechanisms described above are enough, so I'm closing this now.

Can we please make sure that all comments are public so anyone coming here has as much context as possible?
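For reference, a sketch of setting the delete policy on an existing MachineSet via the deletePolicy field from the linked machineset_types.go; the MachineSet name is a placeholder:

    # Valid values are Random (the default), Newest and Oldest
    oc patch machineset ostest-worker -n openshift-machine-api \
        --type merge -p '{"spec":{"deletePolicy":"Oldest"}}'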
(In reply to Alberto from comment #19)
> Generally speaking, scaling operations are cattle. If you find yourself
> needing to pet a scaling operation, that is usually a signal that you
> shouldn't have got there in the first place and the approach should be
> reconsidered.

Unfortunately this is wildly untrue of baremetal, where each Machine corresponds to one of a finite number of physical objects located in meatspace.

> Machines are the fundamental unit of the machine API ecosystem. If you want
> to get rid of a machine, delete the Machine resource and your provider must
> ensure a graceful termination.

The provider does handle this gracefully, but that doesn't prevent a new Machine from being created in its place. The only way to get rid of those new Machines, which will never finish creating because they are trying to pull from a finite pool of available hardware, is to scale down the MachineSet. But doing so usually deletes healthy Machines and leaves the zombie ones behind, unless the user jumps through hoops with a hacky manual annotation.

> Please educate users about this and try to put mechanisms in place to
> prevent them from manipulating your lower-level objects, which might result
> in unpredictable scenarios.

+1

> As a user you can choose a delete policy for your MachineSets: "Random",
> "Newest", "Oldest".
> https://github.com/openshift/machine-api-operator/blob/master/pkg/apis/machine/v1beta1/machineset_types.go#L66-L69
> Any machine annotated with "machine.openshift.io/cluster-api-delete-machine"
> has the highest priority for deletion during a scale-down operation,
> regardless of your deletion policy. Please see
> https://docs.openshift.com/container-platform/4.4/machine_management/manually-scaling-machineset.html

IMHO if you have e.g. a Machine that never finished creating correctly, it never makes sense to choose a random/newest/oldest Machine to kill on scale-down instead of that one. And this appears to be equally true regardless of whether the platform is baremetal or an actual cloud.
> IMHO if you have e.g. a Machine that never finished creating correctly, it
> never makes sense to choose a random/newest/oldest Machine to kill on
> scale-down instead of that one. And this appears to be equally true
> regardless of whether the platform is baremetal or an actual cloud.

In the cloud I see it from a different angle: if you have an unhealthy machine, you just kill it and let the upper-level controller (i.e. the MachineSet) give you a fresh healthy one. They are cattle. You have MachineHealthCheck and MachineAutoscaler resources to automate the auto-repairing process described above. You scale out/in to increase/decrease compute capacity, not to repair faulty machines.

If this does not play well enough in fairly generic baremetal scenarios for us to be concerned about our users' experience, we should work towards mitigating the problem. In particular, introducing a deletion strategy that honours e.g. node healthiness or machine phases should be fairly trivial to implement. It would rather be a case of needing strong background and fleshed-out use cases to justify the feature and ensure we are providing value.
(In reply to Alberto from comment #21)
> > IMHO if you have e.g. a Machine that never finished creating correctly, it
> > never makes sense to choose a random/newest/oldest Machine to kill on
> > scale-down instead of that one. And this appears to be equally true
> > regardless of whether the platform is baremetal or an actual cloud.
>
> In the cloud I see it from a different angle: if you have an unhealthy
> machine, you just kill it and let the upper-level controller (i.e. the
> MachineSet) give you a fresh healthy one. They are cattle.

This is a fair point. But there are edge cases: e.g. you accidentally scale up from 10 to 100 Machines, then immediately notice your mistake and fix it. Should the MachineSet (a) delete 90 Machines at random (including *all* of the Machines available to run workloads, roughly a third of the time); or (b) not?

These edge cases are comparatively rare in the cloud (the above is admittedly a fairly contrived example) but become more common on baremetal, because you have a fixed pool of resources that are typically always fully utilised, so any extra Machines will *never* become healthy. (Note that clouds behave similarly when you are operating at the limit of your quota.) What in the cloud would have been just a temporary situation (a new Machine gets created within a couple of minutes, the existing one gets deleted, there is a temporary drop in capacity but it all sorts itself out) becomes on baremetal a situation that requires manual intervention (annotate the right Machines to delete, lest you suffer ~30 minutes of downtime).

> You have MachineHealthCheck and MachineAutoscaler resources to automate the
> auto-repairing process described above. You scale out/in to
> increase/decrease compute capacity, not to repair faulty machines.

MachineHealthCheck doesn't really help because it just deletes the Machine that will never finish creating (after a large delay), only for it to be replaced immediately with another that has the same problem. IIUC the MachineAutoscaler just scales the MachineSet, so that doesn't help either.

> If this does not play well enough in fairly generic baremetal scenarios for
> us to be concerned about our users' experience, we should work towards
> mitigating the problem. In particular, introducing a deletion strategy that
> honours e.g. node healthiness or machine phases should be fairly trivial to
> implement. It would rather be a case of needing strong background and
> fleshed-out use cases to justify the feature and ensure we are providing
> value.

I believe that taking the Machine phases into account should be sufficient here (it shouldn't need to look at the Node or handle the kinds of things that MachineHealthCheck can usually be relied on to handle). I'd suggest the priorities for deletion should be:

1. Machines in the Deleting or Failed phase
2. Machines with the cluster.k8s.io/delete-machine annotation
3. Oldest or Newest, if one of these policies is specified
4. Machines in the Provisioning phase
5. Random, if this policy is specified