Created attachment 1557283 [details]
badmachineset.yaml

Description of problem:
When a new MachineSet is created with an improperly formatted label for the new nodes, no new Machines can be created for any MachineSet.

Version-Release number of selected component (if applicable):
4.0.0-0.11

How reproducible:
100%

Steps to Reproduce:
1. Deploy a cluster and create a new MachineSet based on the attached badmachineset.yaml.
2. Run oc get machineset or oc get machines and see that no new machines are being provisioned.
3. Run oc logs clusterapi-manager-controllers-XXXX -c controller-manager to see the errors clogging up the controller.
4. Create a new machineset (based on goodmachineset.yaml) or scale an existing worker machineset and watch nothing happen.
5. Correct or delete the label in badmachineset and everything kicks into gear.

Actual results:
No new machines are provisioned for _any_ machineset (bad or existing). The controller-manager container in the clusterapi-manager-controllers pod shows this error every second (see attached log):

I0422 22:26:20.379498 1 reflector.go:169] Listing and watching *v1beta1.MachineSet from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126
E0422 22:26:20.382097 1 reflector.go:134] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to list *v1beta1.MachineSet: v1beta1.MachineSetList.Items: []v1beta1.MachineSet: v1beta1.MachineSet.Spec: v1beta1.MachineSetSpec.Template: v1beta1.MachineTemplateSpec.Spec: v1beta1.MachineSpec.ObjectMeta: v1.ObjectMeta.Labels: ReadMapCB: expect { or n, but found ", error found in #10 byte of ...|"labels":"ssd:\"true|..., bigger context ...|":"badmachineset"}},"spec":{"metadata":{"labels":"ssd:\"true\""},"providerSpec":{"value":{"ami":{"id|...

Expected results:
The bad machineset would be ignored, with no machines created and a reasonable error logged. Good and existing machinesets would continue functioning normally. Better yet, OpenShift wouldn't even let you create the badmachineset to begin with.

Additional info:
badmachine.yaml:25 for the incorrect label
goodmachine.yaml:25-26 for the correct labels
controller-manager.log for full logs. Note that the end of this log shows the point where I delete the badmachineset and everything starts working.
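To make the malformed value concrete, the fragment quoted in the error above can be inspected on its own. The following is a minimal Python sketch, not the controller's Go code: the bad document is reconstructed from the log excerpt (closing braces added), while the well-formed counterpart and the helper name describe_labels are assumptions made for illustration.

```python
import json

# Reconstructed from the controller log: "labels" is a single JSON string.
BAD = r'{"metadata": {"labels": "ssd:\"true\""}}'
# Assumed well-formed counterpart: "labels" is an object of string pairs.
GOOD = r'{"metadata": {"labels": {"ssd": "true"}}}'


def describe_labels(raw: str) -> str:
    """Return the Python type name of metadata.labels in a JSON document."""
    labels = json.loads(raw)["metadata"]["labels"]
    return type(labels).__name__


if __name__ == "__main__":
    print("bad :", describe_labels(BAD))
    print("good:", describe_labels(GOOD))
```

The bad document is valid JSON, which is why it can be stored; only its labels value has the wrong shape (a string where a map is expected).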
Created attachment 1557284 [details] goodmachineset.yaml
Created attachment 1557285 [details] logs from controller-manager container
The problem exists at the level of the machineset CRD definition [1]. The generated CRD defines the metadata field as type `object` instead of providing a full definition of what is allowed.

[1] https://github.com/kubernetes-sigs/cluster-api/blob/master/config/crds/cluster_v1alpha1_machineset.yaml#L68-L70
It's even hardcoded in the generator itself: https://github.com/kubernetes-sigs/controller-tools/blob/master/pkg/internal/codegen/parse/crd.go#L169
In short, this is what the issue is about: the cluster allows an invalid machineset to be created (due to weak constraints on the metadata field in the generated CRD). Once an invalid machineset exists, the informer cannot list machineset objects because the decoder is unable to deserialize the invalid machineset(s) into MachineSet object(s). This makes the machineset controller inoperable.
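The failure mode can be modelled in a few lines. This is a hedged Python sketch of the behaviour described above, not the actual controller-runtime or apiserver code: weak_schema_accepts, decode_machineset, decode_list, and the make helper are hypothetical names, and the object layout only mirrors the spec.template.spec.metadata.labels path seen in the log.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class MachineSet:
    name: str
    labels: Dict[str, str]


def weak_schema_accepts(obj: Dict[str, Any]) -> bool:
    """Model of the weak CRD check: metadata only has to be an object."""
    metadata = obj["spec"]["template"]["spec"].get("metadata", {})
    return isinstance(metadata, dict)


def decode_machineset(obj: Dict[str, Any]) -> MachineSet:
    """Strictly decode one item; labels must map strings to strings."""
    name = obj["metadata"]["name"]
    labels = obj["spec"]["template"]["spec"].get("metadata", {}).get("labels", {})
    if not isinstance(labels, dict):
        raise TypeError(f"{name}: labels must be an object, got {type(labels).__name__}")
    for key, value in labels.items():
        if not isinstance(value, str):
            raise TypeError(f"{name}: label {key!r} must be a string")
    return MachineSet(name=name, labels=labels)


def decode_list(items: List[Dict[str, Any]]) -> List[MachineSet]:
    """All-or-nothing list decode: one bad item fails the whole call."""
    return [decode_machineset(item) for item in items]


def make(name: str, labels: Any) -> Dict[str, Any]:
    """Build a minimal machineset-shaped dict (illustrative only)."""
    return {
        "metadata": {"name": name},
        "spec": {"template": {"spec": {"metadata": {"labels": labels}}}},
    }


GOOD = make("goodmachineset", {"ssd": "true"})
BAD = make("badmachineset", 'ssd:"true"')

if __name__ == "__main__":
    print("weak schema accepts bad object:", weak_schema_accepts(BAD))
    try:
        decode_list([GOOD, BAD])
    except TypeError as err:
        print("list failed:", err)
```

In this model the weak check admits the bad object, and the strict list decode then fails for every caller, including those that only care about the good machineset.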
Related upstream issue: https://github.com/kubernetes-sigs/controller-tools/issues/167
Upstream PR to extend the CRD generator with metadata validation: https://github.com/kubernetes-sigs/controller-tools/pull/195
PR for machine-api-operator updating CRDs: https://github.com/openshift/machine-api-operator/pull/297
This is still an issue within our APIs. Upstream has a workaround for this: embedding a subset of the metadata object within their types (https://github.com/kubernetes-sigs/cluster-api/pull/1062). We could potentially do the same; we would then have proper validation for the metadata fields, which would prevent this scenario.
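As a rough illustration of what validating an embedded metadata subset could look like, here is a Python sketch. It is not the upstream implementation: the choice of fields (labels and annotations), the function names, and the error wording are all assumptions, and the default path simply mirrors the field path from the original error.

```python
from typing import Any, List


def validate_string_map(value: Any, path: str) -> List[str]:
    """Check that value is an object whose values are all strings."""
    if not isinstance(value, dict):
        return [f"{path}: must be an object"]
    return [
        f"{path}.{key}: must be a string"
        for key, item in value.items()
        if not isinstance(item, str)
    ]


def validate_object_meta(meta: Any, path: str = "spec.template.spec.metadata") -> List[str]:
    """Validate an assumed metadata subset: labels and annotations only."""
    if not isinstance(meta, dict):
        return [f"{path}: must be an object"]
    errors: List[str] = []
    for field in ("labels", "annotations"):
        if field in meta:
            errors.extend(validate_string_map(meta[field], f"{path}.{field}"))
    return errors


if __name__ == "__main__":
    for candidate in ({"labels": {"ssd": "true"}}, {"labels": 'ssd:"true"'},
                      {"labels": {"ssd": True}}, None):
        print(candidate, "->", validate_object_meta(candidate))
```

Rejecting such objects at creation time means an invalid machineset never reaches storage, so the list decode in the controller has nothing to trip over.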
Validated on:

[miyadav@miyadav bug1702089]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-04-09-005126   True        False         56m     Cluster version is 4.5.0-0.nightly-2020-04-09-005126

[miyadav@miyadav bug1702089]$ oc project openshift-machine-api
Now using project "openshift-machine-api" on server "https://api.miyadav-0904.qe.devcluster.openshift.com:6443".

[miyadav@miyadav bug1702089]$ oc get machineset
NAME                                   DESIRED   CURRENT   READY   AVAILABLE   AGE
miyadav-0904-ctx99-worker-us-east-2a   1         1         1       1           37m
miyadav-0904-ctx99-worker-us-east-2b   1         1         1       1           37m
miyadav-0904-ctx99-worker-us-east-2c   1         1         1       1           37m

[miyadav@miyadav bug1702089]$ oc get machineset miyadav-0904-ctx99-worker-us-east-2c -o yaml > badmachineset.yml
[miyadav@miyadav bug1702089]$ vi badmachineset.yml

Edited machineset with multiple invalid values for metadata

[miyadav@miyadav bug1702089]$ oc create -f badmachineset.yml
The MachineSet "badmachineset" is invalid: spec.template.metadata.labels: Invalid value: "string": spec.template.metadata.labels in body must be of type object: "string"
[miyadav@miyadav bug1702089]$ oc get machineset miyadav-0904-ctx99-worker-us-east-2c -o yaml > badmachineset.yml
[miyadav@miyadav bug1702089]$ vi badmachineset.yml
[miyadav@miyadav bug1702089]$ vi badmachineset.yml
[miyadav@miyadav bug1702089]$ oc create -f badmachineset.yml
The MachineSet "miyadav-0904-ctx99-badmachineset-us-east-2c" is invalid: spec.template.spec.metadata.labels: Invalid value: "string": spec.template.spec.metadata.labels in body must be of type object: "string"
[miyadav@miyadav bug1702089]$ vi badmachineset.yml
[miyadav@miyadav bug1702089]$ oc create -f badmachineset.yml
The MachineSet "miyadav-0904-ctx99-badmachineset-us-east-2c" is invalid: spec.template.spec.metadata.labels: Invalid value: "string": spec.template.spec.metadata.labels in body must be of type object: "string"
[miyadav@miyadav bug1702089]$ vi badmachineset_2.yml
[miyadav@miyadav bug1702089]$ vi badmachineset.yml
[miyadav@miyadav bug1702089]$ oc create -f badmachineset.yml
The MachineSet "miyadav-0904-ctx99-badmachineset-us-east-2c" is invalid: spec.template.spec.metadata.labels: Invalid value: "string": spec.template.spec.metadata.labels in body must be of type object: "string"
[miyadav@miyadav bug1702089]$ vi badmachineset.yml
[miyadav@miyadav bug1702089]$ oc create -f badmachineset.yml
The MachineSet "miyadav-0904-ctx99-badmachineset-us-east-2c" is invalid: spec.template.spec.metadata.labels.ssd: Invalid value: "boolean": spec.template.spec.metadata.labels.ssd in body must be of type string: "boolean"
[miyadav@miyadav bug1702089]$ oc create -f badmachineset.yml
The MachineSet "miyadav-0904-ctx99-badmachineset-us-east-2c" is invalid: spec.template.spec.metadata: Invalid value: "null": spec.template.spec.metadata in body must be of type object: "null"

LOGS:

Invalid id: "ami-0e8fa6e37e7" status code: 400, request id: 08cd9e1b-7277-43bc-9470-6bc34556315c
W0409 04:06:01.666476 1 controller.go:311] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: failed to create machine: failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: "ami-0e8fa6e37e7" status code: 400, request id: 08cd9e1b-7277-43bc-9470-6bc34556315c
E0409 04:06:01.666653 1 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: \"ami-0e8fa6e37e7\"\n\tstatus code: 400, request id: 08cd9e1b-7277-43bc-9470-6bc34556315c" "controller"="machine_controller" "request"={"Namespace":"openshift-machine-api","Name":"miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql"}
I0409 04:06:01.666702 1 recorder.go:52] controller-runtime/manager/events "msg"="Warning" "message"="failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: \"ami-0e8fa6e37e7\"\n\tstatus code: 400, request id: 08cd9e1b-7277-43bc-9470-6bc34556315c" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql","uid":"acfa460c-c9f8-4db7-a222-ac9cb52c305b","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"34103"} "reason"="FailedCreate"
I0409 04:06:02.666903 1 controller.go:165] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: reconciling Machine
I0409 04:06:02.666935 1 actuator.go:97] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: actuator checking if machine exists
I0409 04:06:02.709807 1 reconciler.go:211] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: Instance does not exist
I0409 04:06:02.709843 1 controller.go:309] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: reconciling machine triggers idempotent create
I0409 04:06:02.709854 1 actuator.go:74] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: actuator creating machine
I0409 04:06:02.711065 1 reconciler.go:38] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: creating machine
E0409 04:06:02.711093 1 reconciler.go:221] NodeRef not found in machine miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql
I0409 04:06:02.742938 1 instances.go:47] No stopped instances found for machine miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql
I0409 04:06:02.743044 1 instances.go:145] Using AMI ami-0e8fa6e37e7
I0409 04:06:02.743095 1 instances.go:77] Describing security groups based on filters
I0409 04:06:02.962568 1 instances.go:122] Describing subnets based on filters
E0409 04:06:03.139728 1 instances.go:195] Error describing AMI: InvalidAMIID.Malformed: Invalid id: "ami-0e8fa6e37e7" status code: 400, request id: fbd05464-1a73-4cc4-ac06-52257d430ed2
E0409 04:06:03.139780 1 reconciler.go:69] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: error creating machine: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: "ami-0e8fa6e37e7" status code: 400, request id: fbd05464-1a73-4cc4-ac06-52257d430ed2
I0409 04:06:03.139792 1 machine_scope.go:134] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: Updating status
I0409 04:06:03.139801 1 machine_scope.go:155] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: finished calculating AWS status
I0409 04:06:03.139828 1 machine_scope.go:80] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: patching machine
E0409 04:06:03.153935 1 actuator.go:65] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql error: failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: "ami-0e8fa6e37e7" status code: 400, request id: fbd05464-1a73-4cc4-ac06-52257d430ed2
W0409 04:06:03.154515 1 controller.go:311] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: failed to create machine: failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: "ami-0e8fa6e37e7" status code: 400, request id: fbd05464-1a73-4cc4-ac06-52257d430ed2
E0409 04:06:03.154572 1 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: \"ami-0e8fa6e37e7\"\n\tstatus code: 400, request id: fbd05464-1a73-4cc4-ac06-52257d430ed2" "controller"="machine_controller" "request"={"Namespace":"openshift-machine-api","Name":"miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql"}
I0409 04:06:03.154727 1 recorder.go:52] controller-runtime/manager/events "msg"="Warning" "message"="failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: \"ami-0e8fa6e37e7\"\n\tstatus code: 400, request id: fbd05464-1a73-4cc4-ac06-52257d430ed2" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql","uid":"acfa460c-c9f8-4db7-a222-ac9cb52c305b","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"34112"} "reason"="FailedCreate"

Edited machineset (badmachineset yaml) with an invalid ami-id

[miyadav@miyadav bug1702089]$ oc get machineset
NAME                                          DESIRED   CURRENT   READY   AVAILABLE   AGE
miyadav-0904-ctx99-badmachineset-us-east-2c   1         1                             4m56s
miyadav-0904-ctx99-worker-us-east-2a          1         1         1       1           59m
miyadav-0904-ctx99-worker-us-east-2b          1         1         1       1           59m
miyadav-0904-ctx99-worker-us-east-2c          1         1         1       1           59m

[miyadav@miyadav bug1702089]$ oc edit machineset miyadav-0904-ctx99-worker-us-east-2c

Edited valid machineset by changing replicas to 2

machineset.machine.openshift.io/miyadav-0904-ctx99-worker-us-east-2c edited

[miyadav@miyadav bug1702089]$ oc get machineset -w
NAME                                          DESIRED   CURRENT   READY   AVAILABLE   AGE
miyadav-0904-ctx99-badmachineset-us-east-2c   1         1                             5m35s
miyadav-0904-ctx99-worker-us-east-2a          1         1         1       1           60m
miyadav-0904-ctx99-worker-us-east-2b          1         1         1       1           60m
miyadav-0904-ctx99-worker-us-east-2c          2         2         1       1           60m
.
.

The new machine provisioned successfully:

miyadav-0904-ctx99-worker-us-east-2c-j7sc9 Running m4.large us-east-2 us-east-2c 19m ip-10-0-175-225.us-east-2.compute.internal aws:///us-east-2c/i-005e8bdbe0d58c8ff running

Actual & Expected: the valid machinesets worked as they should, even while an invalid machineset existed in the cluster. For many invalid values, appropriate error messages are returned.

Waiting for Doc text Verification
Moving to VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409