Description of problem:
When scaling up a cluster using IPI, we create a bmh resource, and introspection starts and completes successfully. Afterwards, we increase --replicas to start the deployment. If for some reason the deployment fails, we delete the bmh resource and then decrease --replicas back to the previous number, and a working node is deprovisioned.

Version-Release number of selected component (if applicable):
4.3.5

How reproducible:
Always

Steps to Reproduce:
1. Create a bmh resource and wait until it's Ready
2. Scale up the machineset to start the deployment
3. Delete the bmh
4. Scale down the replicas to match the number of existing worker nodes

Actual results:
One working node is deprovisioned.

Expected results:
No working node should be deprovisioned, as the scale-down matches the actual number of existing workers.

Additional info:
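For reference, a rough sketch of the reproduction steps as commands; the MachineSet name, bmh name, and replica counts below are placeholders:

    # 1. Register the new host and wait for introspection to finish (state "ready")
    oc apply -f new-worker-bmh.yaml -n openshift-machine-api
    oc get bmh -n openshift-machine-api

    # 2. Scale up the worker machineset to start deploying onto the new host
    oc scale machineset ostest-worker -n openshift-machine-api --replicas=3

    # 3. The deployment fails, so the bmh is deleted
    oc delete bmh new-worker -n openshift-machine-api

    # 4. Scale back down to the previous replica count
    oc scale machineset ostest-worker -n openshift-machine-api --replicas=2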
The first thing to mention is that deleting the BareMetalHost is probably not what you want. You should only do that when you don't want that Host to be part of the inventory any more (i.e. you will never provision it again). The baremetal Machine actuator will delete the Machine that has provisioned a Host if the Host is deleted, *but* only once the Host reaches the 'Deleting' state. That means the Host has to be completely deprovisioned first. This is probably something that we could improve. I'm not clear on how the MachineSet decides which Machine to remove when scaling down. But I suspect that in this scenario if you scale down before the Host has reached the Deleting state, there is no way for it to know that one Machine is not in good shape.
PR: https://github.com/openshift/cluster-api-provider-baremetal/pull/57
So this issue doesn't only happen when deleting bmh's. If for any reason you end up with a replica count higher than your actual number of nodes and you scale it down to match the existing number of deployed nodes, it still deletes a working provisioned node.
That sounds like a problem with MachineSet then.
I'm not 100% sure, but I believe there was a special "machineset.clusters.k8s.io/delete-me=yes" annotation that could be added to the node so that when the machineset is scaled down, the nodes with that annotation are deleted first (https://github.com/metal3-io/cluster-api-provider-baremetal/issues/10). That's from a while ago so it may be different now, but just in case.
The problem is that if I have a machineset with 20 replicas but only 10 bmh objects, and I change the replicas to the same number as the bmh's, it shouldn't delete any node.
The MachineSet controller does not consider whether a Machine corresponds to a healthy Node when it looks for one to delete during a scale-down operation. That would be a reasonable feature to add, and it would benefit all IPI platforms.

See the first step of the process as documented upstream: https://github.com/metal3-io/metal3-docs/blob/master/design/remove-host.md

"Find the Machine that corresponds to the BareMetalHost that you want to remove. Add the annotation cluster.k8s.io/delete-machine with any value that is not an empty string. This ensures that when you later scale down the MachineSet, this Machine is the one that will be removed."

For OpenShift, the annotation key should be "machine.openshift.io/cluster-api-delete-machine". I'm not sure if that's documented, but I created this issue some time ago so we don't forget: https://github.com/openshift/cluster-api-provider-baremetal/issues/46
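A sketch of that workaround with the OpenShift annotation key; the Machine and MachineSet names here are placeholders:

    # Mark the Machine that corresponds to the BareMetalHost you want to remove
    oc annotate machine ostest-worker-2 \
        machine.openshift.io/cluster-api-delete-machine=yes \
        -n openshift-machine-api

    # Scaling down now removes the annotated Machine rather than an arbitrary one
    oc scale machineset ostest-worker -n openshift-machine-api --replicas=2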
My understanding is that the PR is not going to fix the problem, assuming the problem is: "When I scale down a MachineSet, I expect to see Machines with no Node prioritized for deletion, but I see that whether a Machine corresponds to a healthy Node has no influence over whether it's selected for deletion." The fix to the problem as reported in this issue is to enhance the machineset controller to more intelligently choose a Machine to delete when scaling down. The workaround is to use the annotation as described above to manually designate which Machine to delete.
The default behavior when scaling down a machineset is to delete a random machine. If you want a specific one to get deleted, you must annotate it first. Here is the annotation name from the machine-api-operator code:

    // DeleteNodeAnnotation marks nodes that will be given priority for deletion
    // when a machineset scales down. This annotation is given top priority on all delete policies.
    DeleteNodeAnnotation = "machine.openshift.io/cluster-api-delete-machine"

The value of this annotation can be anything; the code just checks that it is non-empty, and then chooses that Machine as the highest priority to delete during the scale-down. Zane's PR will also help: it speeds up deleting a Machine if the underlying BareMetalHost gets manually deleted. With or without that PR, please try using this annotation to ensure the correct Machine is deleted when scaling down the machineset.
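A quick way to double-check that the annotation is present (and non-empty) on a Machine before scaling down; the Machine name is a placeholder:

    oc get machine ostest-worker-2 -n openshift-machine-api -o yaml \
        | grep cluster-api-delete-machine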
But how do we fix cases where the MachineSet replica count is much higher than the number of available nodes? How do we make the MachineSet match the actual number of nodes? If we scale down from 20 replicas to 10 replicas to match the 10 available bmh's, it will still try to delete nodes instead of simply doing nothing.
(In reply to Michael Zamot from comment #12)
> But how do we fix cases where the MachineSet replica count is much higher
> than the number of available nodes? How do we make the MachineSet match the
> actual number of nodes? If we scale down from 20 replicas to 10 replicas to
> match the 10 available bmh's, it will still try to delete nodes instead of
> simply doing nothing.

If you have a MachineSet with 20 replicas, there will be 20 Machines. When you scale down from 20 to 10, you want the 10 Machines not yet associated with any BareMetalHost to be the ones deleted. That should be the case if you annotate those 10.
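A minimal sketch of that, assuming the stuck Machines can be identified as the ones that never got a Node; all names here are placeholders:

    # Machines that never became nodes show an empty NODE column (and a phase
    # such as Provisioning rather than Running)
    oc get machines -n openshift-machine-api -o wide

    # Annotate each stuck Machine so it is deleted first
    oc annotate machine ostest-worker-10 ostest-worker-11 \
        machine.openshift.io/cluster-api-delete-machine=yes \
        -n openshift-machine-api

    # Then scale down to match the 10 available hosts
    oc scale machineset ostest-worker -n openshift-machine-api --replicas=10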
*** Bug 1810430 has been marked as a duplicate of this bug. ***
*** Bug 1791057 has been marked as a duplicate of this bug. ***
Generally speaking, scaling operations are cattle. If you find yourself needing to pet a scaling operation, that is usually a signal that you shouldn't have got there in the first place and the approach should be reconsidered.

Machines are the fundamental unit of the machine API ecosystem. If you want to get rid of a machine, delete the Machine resource and your provider must ensure a graceful termination.

Please educate users about this and try to put mechanisms in place to prevent them from manipulating your lower-level objects, which might result in unpredictable scenarios.

As a user you can choose a delete policy for your MachineSets: "Random", "Newest", "Oldest".
https://github.com/openshift/machine-api-operator/blob/master/pkg/apis/machine/v1beta1/machineset_types.go#L66-L69
Any machine annotated with "machine.openshift.io/cluster-api-delete-machine" has the highest priority for deletion during a scale-down operation, regardless of your deletion policy. Please see https://docs.openshift.com/container-platform/4.4/machine_management/manually-scaling-machineset.html

All of the above describes the existing design. If you have identified strong use cases and background to propose a change that would result in a better experience for our users on baremetal, please make sure to create an RFE or a PR against https://github.com/openshift/enhancements/tree/master/enhancements/machine-api. As this is coming from the baremetal team itself, I'm assuming this has already been evaluated and that is not the case, and that the mechanisms described above are enough, so I'm closing this now.

Can we please make sure that all comments are public so anyone coming here has as much context as possible?
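For reference, a sketch of setting the delete policy on an existing MachineSet via the deletePolicy field from the linked machineset_types.go; the MachineSet name is a placeholder:

    # Valid values are Random (the default), Newest and Oldest
    oc patch machineset ostest-worker -n openshift-machine-api \
        --type merge -p '{"spec":{"deletePolicy":"Oldest"}}'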
(In reply to Alberto from comment #19)
> Generally speaking, scaling operations are cattle. If you find yourself
> needing to pet a scaling operation, that is usually a signal that you
> shouldn't have got there in the first place and the approach should be
> reconsidered.

Unfortunately this is wildly untrue of baremetal, where each Machine corresponds to one of a finite number of physical objects located in meatspace.

> Machines are the fundamental unit of the machine API ecosystem. If you want
> to get rid of a machine, delete the Machine resource and your provider must
> ensure a graceful termination.

The provider does handle this gracefully, but that doesn't prevent a new Machine from being created in its place. The only way to get rid of those new Machines, which will never finish creating because they are trying to pull from a finite pool of available hardware, is to scale down the MachineSet. But doing so usually deletes healthy Machines and leaves the zombie ones behind, unless the user jumps through hoops with a hacky manual annotation.

> Please educate users about this and try to put mechanisms in place to
> prevent them from manipulating your lower-level objects, which might result
> in unpredictable scenarios.

+1

> As a user you can choose a delete policy for your MachineSets: "Random",
> "Newest", "Oldest".
> https://github.com/openshift/machine-api-operator/blob/master/pkg/apis/machine/v1beta1/machineset_types.go#L66-L69
> Any machine annotated with "machine.openshift.io/cluster-api-delete-machine"
> has the highest priority for deletion during a scale-down operation,
> regardless of your deletion policy. Please see
> https://docs.openshift.com/container-platform/4.4/machine_management/manually-scaling-machineset.html

IMHO if you have e.g. a Machine that never finished creating correctly, it never makes sense to choose a random/newest/oldest Machine to kill on scale-down instead of that one. And this appears to be equally true regardless of whether the platform is baremetal or an actual cloud.
> IMHO if you have e.g. a Machine that never finished creating correctly, it
> never makes sense to choose a random/newest/oldest Machine to kill on
> scale-down instead of that one. And this appears to be equally true
> regardless of whether the platform is baremetal or an actual cloud.

In the cloud I see it from a different angle: if you have an unhealthy machine, you just kill it and let the upper-level controller (i.e. the MachineSet) give you a fresh healthy one. They are cattle. You have MachineHealthCheck and MachineAutoscaler resources to automate the auto-repairing process described above. You scale out/in to increase/decrease compute capacity, not to repair faulty machines.

If this does not play well enough in fairly generic baremetal scenarios for us to be concerned about our users' experience, we should work towards mitigating the problem. In particular, introducing a deletion strategy that honours e.g. node healthiness or machine phases should be fairly trivial to implement. It would rather be a case of needing strong background and fleshed-out use cases to justify the feature and ensure we are providing value.
(In reply to Alberto from comment #21)
> > IMHO if you have e.g. a Machine that never finished creating correctly, it
> > never makes sense to choose a random/newest/oldest Machine to kill on
> > scale-down instead of that one. And this appears to be equally true
> > regardless of whether the platform is baremetal or an actual cloud.
>
> In the cloud I see it from a different angle: if you have an unhealthy
> machine, you just kill it and let the upper-level controller (i.e. the
> MachineSet) give you a fresh healthy one. They are cattle.

This is a fair point. But there are edge cases: e.g. you accidentally scale up from 10 to 100 Machines, then immediately notice your mistake and fix it. Should the MachineSet (a) delete 90 Machines at random (including *all* of the Machines available to run workloads, roughly a third of the time); or (b) not?

These edge cases are comparatively rare in the cloud (the above is admittedly a fairly contrived example) but become more common on baremetal, because you have a fixed pool of resources that are typically always fully utilised, so any extra Machines will *never* become healthy. (Note that clouds behave similarly when you are operating at the limit of your quota.) What in the cloud would have been just a temporary situation (a new Machine gets created within a couple of minutes, the existing one gets deleted, there is a temporary drop in capacity but it all sorts itself out) becomes on baremetal a situation that requires manual intervention (annotate the right Machines to delete, lest you suffer ~30 minutes of downtime).

> You have MachineHealthCheck and MachineAutoscaler resources to automate the
> auto-repairing process described above. You scale out/in to
> increase/decrease compute capacity, not to repair faulty machines.

MachineHealthCheck doesn't really help because it just deletes the Machine that will never finish creating (after a large delay), only for it to be replaced immediately with another that has the same problem. IIUC the MachineAutoscaler just scales the MachineSet, so that doesn't help either.

> If this does not play well enough in fairly generic baremetal scenarios for
> us to be concerned about our users' experience, we should work towards
> mitigating the problem. In particular, introducing a deletion strategy that
> honours e.g. node healthiness or machine phases should be fairly trivial to
> implement. It would rather be a case of needing strong background and
> fleshed-out use cases to justify the feature and ensure we are providing
> value.

I believe that taking the Machine phases into account should be sufficient here (it shouldn't need to look at the Node or handle the kinds of things that MachineHealthCheck can usually be relied on to handle). I'd suggest the priorities for deletion should be:

1. Machines in the Deleting or Failed phase
2. Machines with the cluster.k8s.io/delete-machine annotation
3. Oldest or Newest, if one of these policies is specified
4. Machines in the Provisioning phase
5. Random, if this policy is specified