Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1857175

Summary:	[AWS] Machineset creates an infinite (large) number of machines when a machineset with default values is used whose labels and selectors are not related to the cluster
Product:	OpenShift Container Platform	Reporter:	Milind Yadav <miyadav>
Component:	Cloud Compute	Assignee:	Alberto <agarcial>
Cloud Compute sub component:	Other Providers	QA Contact:	Milind Yadav <miyadav>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	unspecified
Version:	4.6
Target Milestone:	---
Target Release:	4.6.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-10-27 16:14:38 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Milind Yadav 2020-07-15 10:41:04 UTC

Description of problem: Machineset creates an infinite (large) number of machines when a machineset with default values is used whose labels and selectors are not related to the cluster.


Version-Release number of selected component (if applicable): Cluster version is 4.6.0-0.nightly-2020-07-15-031221



How reproducible:
Always

Steps:
1. Create a machineset with the YAML below:
http://pastebin.test.redhat.com/884420

2. oc get machineset
NAME                                    DESIRED   CURRENT   READY   AVAILABLE   AGE
miyadav-j892-69gxs-worker-us-east-2a    1         1         1       1           36m
miyadav-j892-69gxs-worker-us-east-2b    1         1         1       1           36m
miyadav-j892-69gxs-worker-us-east-2c    1         1         1       1           36m
pmali1307-bls8p-worker-us-east-2a-new   1         1                             5s

3. oc get machines
NAME                                          PHASE          TYPE        REGION      ZONE         AGE
miyadav-j892-69gxs-master-0                   Running        m5.xlarge   us-east-2   us-east-2a   36m
miyadav-j892-69gxs-master-1                   Running        m5.xlarge   us-east-2   us-east-2b   36m
miyadav-j892-69gxs-master-2                   Running        m5.xlarge   us-east-2   us-east-2c   36m
miyadav-j892-69gxs-worker-us-east-2a-5nxnp    Running        m5.large    us-east-2   us-east-2a   24m
miyadav-j892-69gxs-worker-us-east-2b-wqtfh    Running        m5.large    us-east-2   us-east-2b   24m
miyadav-j892-69gxs-worker-us-east-2c-rrbh4    Running        m5.large    us-east-2   us-east-2c   24m
pmali1307-bls8p-worker-us-east-2a-new-5xbjz                                                       1s
pmali1307-bls8p-worker-us-east-2a-new-6ql7p   Provisioning   m4.large    us-east-2   us-east-2a   7s
pmali1307-bls8p-worker-us-east-2a-new-hvbfh   Provisioning   m4.large    us-east-2   us-east-2a   5s
pmali1307-bls8p-worker-us-east-2a-new-jpk9p   Provisioning                                        3s
pmali1307-bls8p-worker-us-east-2a-new-lpgvx   Provisioning   m4.large    us-east-2   us-east-2a   9s
pmali1307-bls8p-worker-us-east-2a-new-p8tgc   Provisioning   m4.large    us-east-2   us-east-2a   11s
pmali1307-bls8p-worker-us-east-2a-new-ztjxc   Provisioning   m4.large    us-east-2   us-east-2a   11s

Expected: No machines should be created, as the YAML is not valid w.r.t. the cluster.
Actual: A large number of machines were created.

Additional info:

oc get machineset <machine-setname> -o yaml
http://pastebin.test.redhat.com/884441
Machine controller logs:
.
.

E0715 09:45:52.745034       1 actuator.go:66] pmali1307-bls8p-worker-us-east-2a-new-9zz5w error: pmali1307-bls8p-worker-us-east-2a-new-9zz5w: reconciler failed to Update machine: requeue in: 20s
E0715 09:45:52.745084       1 controller.go:287] pmali1307-bls8p-worker-us-east-2a-new-9zz5w: error updating machine: pmali1307-bls8p-worker-us-east-2a-new-9zz5w: reconciler failed to Update machine: requeue in: 20s
I0715 09:45:52.745129       1 controller.go:172] pmali1307-bls8p-worker-us-east-2a-new-zg9mz: reconciling Machine
I0715 09:45:52.745149       1 actuator.go:100] pmali1307-bls8p-worker-us-east-2a-new-zg9mz: actuator checking if machine exists
I0715 09:45:52.745212       1 recorder.go:52] controller-runtime/manager/events "msg"="Warning"  "message"="pmali1307-bls8p-worker-us-east-2a-new-9zz5w: reconciler failed to Update machine: requeue in: 20s" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"pmali1307-bls8p-worker-us-east-2a-new-9zz5w","uid":"6c9e3bb9-a039-409b-b5e5-5ba5e9e007d2","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"33099"} "reason"="FailedUpdate"

Will try to send must-gather logs shortly.

Comment 1 Alberto 2020-07-15 10:47:53 UTC

>an machineset with default values is used that has labels and selectors not related to cluster

could you be more specific on what this means?

Comment 4 Milind Yadav 2020-07-15 13:48:56 UTC

>an machineset with default values is used that has labels and selectors not related to cluster

I meant that the data below was not relevant to the existing cluster:

  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: pmali1307-bls8p
      machine.openshift.io/cluster-api-machineset: pmali1307-bls8p-worker-us-east-2a
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: pmali1307-bls8p
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: pmali1307-bls8p-worker-us-east-2a

The labels we usually use are in line with the ones that come with the installation.
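
The mismatch described above can be checked mechanically. The following is a minimal sketch in Python, assuming the machineset has already been loaded into a dict (for example from `oc get machineset <name> -o json`). The function name `find_foreign_cluster_labels` is hypothetical and not part of any OpenShift tooling, and the cluster ID `miyadav-j892-69gxs` is inferred from the machine names in the description rather than stated in the report.

```python
CLUSTER_LABEL = "machine.openshift.io/cluster-api-cluster"


def find_foreign_cluster_labels(machineset, cluster_id):
    """Return (path, value) pairs where the cluster label names another cluster."""
    spec = machineset.get("spec", {})
    candidates = {
        "spec.selector.matchLabels": spec.get("selector", {}).get("matchLabels", {}),
        "spec.template.metadata.labels": spec.get("template", {})
        .get("metadata", {})
        .get("labels", {}),
    }
    problems = []
    for path, labels in candidates.items():
        value = labels.get(CLUSTER_LABEL)
        if value is not None and value != cluster_id:
            problems.append((path, value))
    return problems


# Labels copied from the snippet above; everything else is omitted.
machineset = {
    "spec": {
        "selector": {
            "matchLabels": {
                CLUSTER_LABEL: "pmali1307-bls8p",
                "machine.openshift.io/cluster-api-machineset": "pmali1307-bls8p-worker-us-east-2a",
            }
        },
        "template": {
            "metadata": {
                "labels": {
                    CLUSTER_LABEL: "pmali1307-bls8p",
                    "machine.openshift.io/cluster-api-machine-role": "worker",
                    "machine.openshift.io/cluster-api-machine-type": "worker",
                    "machine.openshift.io/cluster-api-machineset": "pmali1307-bls8p-worker-us-east-2a",
                }
            }
        },
    }
}

# Assumed cluster ID, inferred from the existing machine names.
for path, value in find_foreign_cluster_labels(machineset, "miyadav-j892-69gxs"):
    print(f"{path}: {CLUSTER_LABEL}={value} does not match the cluster")
```

Both the selector and the template labels are reported here, because both carry the ID of a different cluster.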

Comment 5 Alberto 2020-07-15 14:19:43 UTC

Thanks Milind. I had a quick look, and I believe this happens because with https://github.com/openshift/machine-api-operator/pull/608/files we introduced an unfortunate discrepancy with the labels used by the machineSet to decide ownership over the machines (https://github.com/openshift/machine-api-operator/blob/5688547505e7963783f04ad0737740cfac4b6457/pkg/controller/machineset/controller.go#L377) in the scenario where a bad machine.openshift.io/cluster-api-cluster is set by the user.
We might want to include the same logic in the machineSet controller, and additionally maybe go back to enforcing this via webhooks as well: https://github.com/openshift/machine-api-operator/pull/610/files
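
To make the suspected failure mode concrete, here is a toy simulation in Python. It is not the controller code. It assumes, as the comment above suggests, that the `machine.openshift.io/cluster-api-cluster` label on each machine gets rewritten to the real cluster ID while the machineSet controller keeps matching machines against the user-supplied selector. The names `reconcile_machineset`, `relabel_machines`, and `simulate` are illustrative only.

```python
CLUSTER_LABEL = "machine.openshift.io/cluster-api-cluster"


def matches(selector, labels):
    """A machine is owned only if every selector label is present with the same value."""
    return all(labels.get(key) == value for key, value in selector.items())


def reconcile_machineset(selector, template_labels, replicas, machines):
    """Create machines from the template until enough machines match the selector."""
    owned = [m for m in machines if matches(selector, m)]
    for _ in range(replicas - len(owned)):
        machines.append(dict(template_labels))


def relabel_machines(machines, real_cluster_id):
    """Assumed behaviour: the cluster label is overwritten with the real cluster ID."""
    for machine in machines:
        machine[CLUSTER_LABEL] = real_cluster_id


def simulate(selector_cluster, real_cluster, rounds=5):
    """Run a few reconcile rounds and return how many machines exist afterwards."""
    selector = {CLUSTER_LABEL: selector_cluster}
    machines = []
    for _ in range(rounds):
        reconcile_machineset(selector, selector, 1, machines)
        relabel_machines(machines, real_cluster)
    return len(machines)


print(simulate("pmali1307-bls8p", "miyadav-j892-69gxs"))     # 5: one new machine per round
print(simulate("miyadav-j892-69gxs", "miyadav-j892-69gxs"))  # 1: replica count honored
```

With a foreign cluster ID in the selector, no relabelled machine ever matches, so every round sees zero owned machines and creates another one. That is consistent with the steadily growing machine list in the description.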

Comment 8 Alberto 2020-07-21 15:20:08 UTC

We need to revendor the changes in the actuator for this to pass. Moving back to assigned.

Comment 10 Milind Yadav 2020-07-23 05:38:57 UTC

Validated for AWS on:

NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-07-22-214212   True        False         26m     Cluster version is 4.6.0-0.nightly-2020-07-22-214212

Steps:
1. Create a machineset using the YAML below:
http://pastebin.test.redhat.com/884420

2. oc create -f <filename>.yaml
The machineset was created successfully.

3. Check machines, machinesets, and nodes:

[miyadav@miyadav ~]$ oc get machineset
NAME                                   DESIRED   CURRENT   READY   AVAILABLE   AGE
miyadav-2307-zpd9c-new                 1         1         1       1           17m
miyadav-awsb-dxjkx-worker-us-east-2a   1         1         1       1           59m
miyadav-awsb-dxjkx-worker-us-east-2b   1         1         1       1           59m
miyadav-awsb-dxjkx-worker-us-east-2c   1         1         1       1           59m
[miyadav@miyadav ~]$ oc get machines -o wide
NAME                                         PHASE     TYPE        REGION      ZONE         AGE   NODE                                         PROVIDERID                              STATE
miyadav-2307-zpd9c-new-dhzvv                 Running   m4.large    us-east-2   us-east-2a   17m   ip-10-0-145-136.us-east-2.compute.internal   aws:///us-east-2a/i-06bc31b3f0e0fd5a2   running
miyadav-awsb-dxjkx-master-0                  Running   m5.xlarge   us-east-2   us-east-2a   59m   ip-10-0-133-14.us-east-2.compute.internal    aws:///us-east-2a/i-063970939f1f7e42d   running
miyadav-awsb-dxjkx-master-1                  Running   m5.xlarge   us-east-2   us-east-2b   59m   ip-10-0-180-32.us-east-2.compute.internal    aws:///us-east-2b/i-0d68493a62e2412e7   running
miyadav-awsb-dxjkx-master-2                  Running   m5.xlarge   us-east-2   us-east-2c   59m   ip-10-0-200-43.us-east-2.compute.internal    aws:///us-east-2c/i-0b02e6d92f49085d6   running
miyadav-awsb-dxjkx-worker-us-east-2a-6m6bf   Running   m5.large    us-east-2   us-east-2a   45m   ip-10-0-138-50.us-east-2.compute.internal    aws:///us-east-2a/i-0aaaf5c1335568126   running
miyadav-awsb-dxjkx-worker-us-east-2b-f2mrv   Running   m5.large    us-east-2   us-east-2b   45m   ip-10-0-191-230.us-east-2.compute.internal   aws:///us-east-2b/i-0fcd744e92eb26168   running
miyadav-awsb-dxjkx-worker-us-east-2c-rxck7   Running   m5.large    us-east-2   us-east-2c   45m   ip-10-0-202-75.us-east-2.compute.internal    aws:///us-east-2c/i-0532a17ad48f52b33   running


Expected and actual:
The machineset YAML was updated with the correct values and the replica count was honored.
.
.
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: miyadav-awsb-dxjkx
      machine.openshift.io/cluster-api-machineset: miyadav-2307-zpd9c-new
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: miyadav-awsb-dxjkx
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: miyadav-2307-zpd9c-new
.
.
.
Moving to VERIFIED
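
The verified outcome (the submitted YAML ends up carrying the running cluster's ID in its selector and template labels) can be sketched as a defaulting step. This Python snippet only illustrates the observed result; where and how the actual fix applies it is not shown in this report. `default_cluster_label` is a hypothetical name, and `some-other-cluster` is a placeholder for whatever value the submitted YAML contained.

```python
import copy

CLUSTER_LABEL = "machine.openshift.io/cluster-api-cluster"


def default_cluster_label(machineset, cluster_id):
    """Return a copy whose selector and template labels carry the given cluster ID."""
    result = copy.deepcopy(machineset)
    spec = result.setdefault("spec", {})
    selector = spec.setdefault("selector", {}).setdefault("matchLabels", {})
    labels = (
        spec.setdefault("template", {})
        .setdefault("metadata", {})
        .setdefault("labels", {})
    )
    selector[CLUSTER_LABEL] = cluster_id
    labels[CLUSTER_LABEL] = cluster_id
    return result


# Placeholder input: the real submitted YAML is only available via the pastebin link.
submitted = {
    "spec": {
        "selector": {"matchLabels": {CLUSTER_LABEL: "some-other-cluster"}},
        "template": {"metadata": {"labels": {CLUSTER_LABEL: "some-other-cluster"}}},
    }
}
defaulted = default_cluster_label(submitted, "miyadav-awsb-dxjkx")
print(defaulted["spec"]["selector"]["matchLabels"][CLUSTER_LABEL])
```

Because the selector and the template labels are rewritten together, machines created from the template still match the selector, which is what lets the replica count be honored.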

Additional info:
Will also run this on GCP and Azure and update the bug if those fail.

Comment 12 errata-xmlrpc 2020-10-27 16:14:38 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196