Description of problem:
A machineset creates an unbounded (large) number of machines when a machineset with default values is used that has labels and selectors not related to the cluster.

Version-Release number of selected component (if applicable):
Cluster version is 4.6.0-0.nightly-2020-07-15-031221

How reproducible:
Always

Steps:
1. Create a machineset with the YAML below:
http://pastebin.test.redhat.com/884420

2. oc get machineset
NAME                                    DESIRED   CURRENT   READY   AVAILABLE   AGE
miyadav-j892-69gxs-worker-us-east-2a    1         1         1       1           36m
miyadav-j892-69gxs-worker-us-east-2b    1         1         1       1           36m
miyadav-j892-69gxs-worker-us-east-2c    1         1         1       1           36m
pmali1307-bls8p-worker-us-east-2a-new   1         1                             5s

3. oc get machines
NAME                                          PHASE          TYPE        REGION      ZONE         AGE
miyadav-j892-69gxs-master-0                   Running        m5.xlarge   us-east-2   us-east-2a   36m
miyadav-j892-69gxs-master-1                   Running        m5.xlarge   us-east-2   us-east-2b   36m
miyadav-j892-69gxs-master-2                   Running        m5.xlarge   us-east-2   us-east-2c   36m
miyadav-j892-69gxs-worker-us-east-2a-5nxnp    Running        m5.large    us-east-2   us-east-2a   24m
miyadav-j892-69gxs-worker-us-east-2b-wqtfh    Running        m5.large    us-east-2   us-east-2b   24m
miyadav-j892-69gxs-worker-us-east-2c-rrbh4    Running        m5.large    us-east-2   us-east-2c   24m
pmali1307-bls8p-worker-us-east-2a-new-5xbjz                                                       1s
pmali1307-bls8p-worker-us-east-2a-new-6ql7p   Provisioning   m4.large    us-east-2   us-east-2a   7s
pmali1307-bls8p-worker-us-east-2a-new-hvbfh   Provisioning   m4.large    us-east-2   us-east-2a   5s
pmali1307-bls8p-worker-us-east-2a-new-jpk9p   Provisioning                                        3s
pmali1307-bls8p-worker-us-east-2a-new-lpgvx   Provisioning   m4.large    us-east-2   us-east-2a   9s
pmali1307-bls8p-worker-us-east-2a-new-p8tgc   Provisioning   m4.large    us-east-2   us-east-2a   11s
pmali1307-bls8p-worker-us-east-2a-new-ztjxc   Provisioning   m4.large    us-east-2   us-east-2a   11s

Expected:
No machines should be created, as the YAML is not valid with respect to the cluster.

Actual:
A large number of machines were created.

Additional info:
oc get machineset <machine-setname> -o yaml
http://pastebin.test.redhat.com/884441

Machine controller logs:
.
.
E0715 09:45:52.745034 1 actuator.go:66] pmali1307-bls8p-worker-us-east-2a-new-9zz5w error: pmali1307-bls8p-worker-us-east-2a-new-9zz5w: reconciler failed to Update machine: requeue in: 20s
E0715 09:45:52.745084 1 controller.go:287] pmali1307-bls8p-worker-us-east-2a-new-9zz5w: error updating machine: pmali1307-bls8p-worker-us-east-2a-new-9zz5w: reconciler failed to Update machine: requeue in: 20s
I0715 09:45:52.745129 1 controller.go:172] pmali1307-bls8p-worker-us-east-2a-new-zg9mz: reconciling Machine
I0715 09:45:52.745149 1 actuator.go:100] pmali1307-bls8p-worker-us-east-2a-new-zg9mz: actuator checking if machine exists
I0715 09:45:52.745212 1 recorder.go:52] controller-runtime/manager/events "msg"="Warning" "message"="pmali1307-bls8p-worker-us-east-2a-new-9zz5w: reconciler failed to Update machine: requeue in: 20s" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"pmali1307-bls8p-worker-us-east-2a-new-9zz5w","uid":"6c9e3bb9-a039-409b-b5e5-5ba5e9e007d2","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"33099"} "reason"="FailedUpdate"

Will try sending must-gather logs in a while.
>an machineset with default values is used that has labels and selectors not related to cluster

Could you be more specific about what this means?
>an machineset with default values is used that has labels and selectors not related to cluster

I meant that the data below was not relevant to the existing cluster:

  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: pmali1307-bls8p
      machine.openshift.io/cluster-api-machineset: pmali1307-bls8p-worker-us-east-2a
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: pmali1307-bls8p
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: pmali1307-bls8p-worker-us-east-2a

The labels we usually use are in line with the ones that come with the installation.
Thanks Milind. I had a quick look. I believe this happens because with https://github.com/openshift/machine-api-operator/pull/608/files we introduced an unfortunate discrepancy with the labels used by the machineSet to decide ownership over the machines (https://github.com/openshift/machine-api-operator/blob/5688547505e7963783f04ad0737740cfac4b6457/pkg/controller/machineset/controller.go#L377) in the scenario where a bad machine.openshift.io/cluster-api-cluster is set by the user. We might want to include the same logic in the machineSet controller, and additionally maybe go back to enforcing it via webhooks as well: https://github.com/openshift/machine-api-operator/pull/610/files
We need to revendor the changes in the actuator for this to pass. Moving back to ASSIGNED.
Validated for AWS on:

NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-07-22-214212   True        False         26m     Cluster version is 4.6.0-0.nightly-2020-07-22-214212

Steps:
1. Create a machineset using the YAML below:
http://pastebin.test.redhat.com/884420

2. oc create -f <filename>.yaml
machineset created successfully

3. Check machines, machinesets and nodes

[miyadav@miyadav ~]$ oc get machineset
NAME                                   DESIRED   CURRENT   READY   AVAILABLE   AGE
miyadav-2307-zpd9c-new                 1         1         1       1           17m
miyadav-awsb-dxjkx-worker-us-east-2a   1         1         1       1           59m
miyadav-awsb-dxjkx-worker-us-east-2b   1         1         1       1           59m
miyadav-awsb-dxjkx-worker-us-east-2c   1         1         1       1           59m

[miyadav@miyadav ~]$ oc get machines -o wide
NAME                                         PHASE     TYPE        REGION      ZONE         AGE   NODE                                         PROVIDERID                              STATE
miyadav-2307-zpd9c-new-dhzvv                 Running   m4.large    us-east-2   us-east-2a   17m   ip-10-0-145-136.us-east-2.compute.internal   aws:///us-east-2a/i-06bc31b3f0e0fd5a2   running
miyadav-awsb-dxjkx-master-0                  Running   m5.xlarge   us-east-2   us-east-2a   59m   ip-10-0-133-14.us-east-2.compute.internal    aws:///us-east-2a/i-063970939f1f7e42d   running
miyadav-awsb-dxjkx-master-1                  Running   m5.xlarge   us-east-2   us-east-2b   59m   ip-10-0-180-32.us-east-2.compute.internal    aws:///us-east-2b/i-0d68493a62e2412e7   running
miyadav-awsb-dxjkx-master-2                  Running   m5.xlarge   us-east-2   us-east-2c   59m   ip-10-0-200-43.us-east-2.compute.internal    aws:///us-east-2c/i-0b02e6d92f49085d6   running
miyadav-awsb-dxjkx-worker-us-east-2a-6m6bf   Running   m5.large    us-east-2   us-east-2a   45m   ip-10-0-138-50.us-east-2.compute.internal    aws:///us-east-2a/i-0aaaf5c1335568126   running
miyadav-awsb-dxjkx-worker-us-east-2b-f2mrv   Running   m5.large    us-east-2   us-east-2b   45m   ip-10-0-191-230.us-east-2.compute.internal   aws:///us-east-2b/i-0fcd744e92eb26168   running
miyadav-awsb-dxjkx-worker-us-east-2c-rxck7   Running   m5.large    us-east-2   us-east-2c   45m   ip-10-0-202-75.us-east-2.compute.internal    aws:///us-east-2c/i-0532a17ad48f52b33   running

Expected and actual:
The machineset YAML was updated with the correct values and the replica count was honored.
.
.
.
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: miyadav-awsb-dxjkx
      machine.openshift.io/cluster-api-machineset: miyadav-2307-zpd9c-new
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: miyadav-awsb-dxjkx
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: miyadav-2307-zpd9c-new
.
.
.

Moving to VERIFIED.

Additional info:
Will execute for GCP and Azure as well and update in case those fail.
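The verified spec above carries the running cluster's ID in both the selector and the template labels, which is consistent with the cluster label being defaulted on the machineset. A minimal Go sketch of such defaulting follows, purely for illustration. The MachineSet struct and defaultClusterLabel are hypothetical stand-ins, not the operator's actual types or webhook code, and the way the real cluster ID is obtained is left out.

```go
package main

import "fmt"

const clusterLabel = "machine.openshift.io/cluster-api-cluster"

// MachineSet is a pared-down, hypothetical stand-in for the real API type;
// it keeps only the two label maps that matter here.
type MachineSet struct {
	Selector       map[string]string // spec.selector.matchLabels
	TemplateLabels map[string]string // spec.template.metadata.labels
}

// defaultClusterLabel forces the cluster label in both maps to clusterID and
// reports whether anything had to change.
func defaultClusterLabel(ms *MachineSet, clusterID string) bool {
	changed := false
	for _, labels := range []*map[string]string{&ms.Selector, &ms.TemplateLabels} {
		if *labels == nil {
			*labels = map[string]string{}
		}
		if (*labels)[clusterLabel] != clusterID {
			(*labels)[clusterLabel] = clusterID
			changed = true
		}
	}
	return changed
}

func main() {
	ms := &MachineSet{
		Selector:       map[string]string{clusterLabel: "pmali1307-bls8p"},
		TemplateLabels: map[string]string{clusterLabel: "pmali1307-bls8p"},
	}
	fmt.Println(defaultClusterLabel(ms, "miyadav-awsb-dxjkx")) // true: labels were rewritten
	fmt.Println(ms.Selector[clusterLabel])                     // miyadav-awsb-dxjkx
	fmt.Println(defaultClusterLabel(ms, "miyadav-awsb-dxjkx")) // false: already correct
}
```

Because the selector and the template are rewritten together, machines created from the template keep matching the selector, so a single machine satisfies replicas: 1.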
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196