Bug 2005052 - Adding a MachineSet selector matchLabel causes orphaned Machines
Summary: Adding a MachineSet selector matchLabel causes orphaned Machines
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.8
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ---
: 4.10.0
Assignee: Joel Speed
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-09-16 15:55 UTC by Michael McCune
Modified: 2022-03-10 16:11 UTC (History)
0 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Modifying a selector changes the list of Machines that the MachineSet observes Consequence: This can cause leaks as the MachineSet loses track of Machines it has already created Fix: Ensure that the selector is immutable once created Result: The MachineSet will now always list the correct Machines
Clone Of:
Environment:
Last Closed: 2022-03-10 16:11:32 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-api-operator pull 944 0 None open Bug 2005052: Deny selector updates via webhook to prevent leaked machines 2021-11-01 10:23:44 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-10 16:11:58 UTC

Description Michael McCune 2021-09-16 15:55:04 UTC
Description of problem: When adding new values to a MachineSet's `.spec.selector.matchLabels` new Machines are created for the MachineSet but the old Machines become orphaned from the MachineSet.


How reproducible: always


Steps to Reproduce:
1. Create a new cluster
2. Edit one of the worker MachineSets to add a new label in the `.spec.selector.matchLabels`, eg `test-label: ""`
3. Observe the Machines and MachineSet replica counts

Actual results: a new Machine is created which takes the new selector match label, and the old Machine becomes orphaned from the MachineSet. the old Machine still contains the annotations, labels, and ownerRef that make it appear as part of the MachineSet, but the MachineSet no longer lists it.


Expected results: I expected the MachineSet to accept the new selector without issue. If a new Machine should be created to fit the matchLabel, then the old Machine should be cleaned up.


Additional info: I'm not sure what the best solution to this issue should be, we have several options:
1. don't allow updating of selectors on MachineSets
2. warn the user about the update and orphan Machine(s)
3. automatically delete the orphaned Machine(s)
4. update the Machine(s) in the MachineSet to take the new matchLabel

There is another effect from this that could be deleterious for users, aside from the orphaned Machines, when the user deletes the MachineSet it will also delete the orphaned Machines. This also could lead to unintentional effects.

Comment 1 Joel Speed 2021-10-29 11:38:31 UTC
My vote for resolving this issue would be to work on adding immutability support for CRDs and then making the selector for a MachineSet immutable. The other options have technical challenges that make them too complicated to implement, and going this route matches that of deployments/daemonsets/statefulsets, so should be familiar to users.

In the mean time, we could use webhooks to prevent users from changing this, that would be quick and easy, but not an ideal solution as webhooks can in theory fail.

I'd be tempted to mirror this into Jira for a long term fix and make the webhook fix as the short term fix, and also document in the Machine docs that the selector is expected to be immutable.

Comment 2 Michael McCune 2021-10-29 17:45:59 UTC
(In reply to Joel Speed from comment #1)
> My vote for resolving this issue would be to work on adding immutability
> support for CRDs and then making the selector for a MachineSet immutable.
> The other options have technical challenges that make them too complicated
> to implement, and going this route matches that of
> deployments/daemonsets/statefulsets, so should be familiar to users.
> 
> In the mean time, we could use webhooks to prevent users from changing this,
> that would be quick and easy, but not an ideal solution as webhooks can in
> theory fail.
> 

+1, i think this is a reasonable short term fix

> I'd be tempted to mirror this into Jira for a long term fix and make the
> webhook fix as the short term fix, and also document in the Machine docs
> that the selector is expected to be immutable.

+1, that will help us scope it. thanks

Comment 4 Joel Speed 2021-11-02 08:45:06 UTC
I created https://issues.redhat.com/browse/OCPCLOUD-1345 so that we don't forget to remove this in the future once we have immutability support

Comment 7 sunzhaohua 2021-11-04 08:50:13 UTC
verified
clusterversion: 4.10.0-0.nightly-2021-11-03-181048

update the selector on machineset will be denied. No new machine will be created.
$ oc edit machineset zhsun114-7k4tl-worker-us-east-2c                            [16:47:41]
error: machinesets.machine.openshift.io "zhsun114-7k4tl-worker-us-east-2c" could not be patched: admission webhook "validation.machineset.machine.openshift.io" denied the request: [spec.selector: Forbidden: selector is immutable, spec.template.metadata.labels: Invalid value: map[string]string{"machine.openshift.io/cluster-api-cluster":"zhsun114-7k4tl", "machine.openshift.io/cluster-api-machine-role":"worker", "machine.openshift.io/cluster-api-machine-type":"worker", "machine.openshift.io/cluster-api-machineset":"zhsun114-7k4tl-worker-us-east-2c"}: `selector` does not match template `labels`]

Comment 10 errata-xmlrpc 2022-03-10 16:11:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056


Note You need to log in before you can comment on or make changes to this bug.