Bug 1702089 - Bad MachineSet prevent all other MachineSets from scaling activities
Summary: Bad MachineSet prevent all other MachineSets from scaling activities
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.5.0
Assignee: Joel Speed
QA Contact: Milind Yadav
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-04-22 23:04 UTC by nate stephany
Modified: 2020-07-23 11:12 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The metadata field within the Machine/MachineSet Spec is not validated on create/update.
Consequence: Invalid metadata causes unmarshalling errors within controllers, leading to controllers not being able to process objects as expected.
Fix: Enable validation on the metadata field within the Machine/MachineSet Spec.
Result: Errors in the metadata field are now returned to the user at create/update.
Clone Of:
Environment:
Last Closed: 2020-07-13 17:11:03 UTC
Target Upstream Version:


Attachments
badmachineset.yaml (1.93 KB, text/plain)
2019-04-22 23:04 UTC, nate stephany
goodmachineset.yaml (1.95 KB, text/plain)
2019-04-22 23:05 UTC, nate stephany
logs from controller-manager container (1.33 MB, text/plain)
2019-04-22 23:05 UTC, nate stephany


Links
System ID Priority Status Summary Last Updated
Github openshift machine-api-operator pull 550 None closed Bug 1702089: Move embedded ObjectMeta to machine api to provide open api schema 2020-09-10 14:45:22 UTC
Red Hat Knowledge Base (Solution) 5175041 None None None 2020-06-23 00:21:29 UTC
Red Hat Product Errata RHBA-2020:2409 None None None 2020-07-13 17:11:16 UTC

Description nate stephany 2019-04-22 23:04:39 UTC
Created attachment 1557283
badmachineset.yaml

Description of problem:
After a new MachineSet is created with an improperly formatted label for the new nodes, no new Machines (for any MachineSet) can be created.

Version-Release number of selected component (if applicable):
4.0.0-0.11

How reproducible:
100%

Steps to Reproduce:
1. Deploy a cluster and create a new MachineSet based on attached badmachineset.yaml
2. oc get machineset or oc get machines and see no new machines being provisioned
3. oc logs clusterapi-manager-controllers-XXXX -c controller-manager to see the errors clogging up the controller
4. Create a new machineset (based on goodmachineset.yaml) or scale an existing worker machineset and watch nothing happen.
5. Correct or delete the label in badmachineset and everything kicks into gear

Actual results:
No new machines are provisioned for _any_ machineset (bad or existing).

The controller-manager container in the clusterapi-manager-controllers pod shows this error every second (see attached log):

I0422 22:26:20.379498       1 reflector.go:169] Listing and watching *v1beta1.MachineSet from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126
E0422 22:26:20.382097       1 reflector.go:134] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to list *v1beta1.MachineSet: v1beta1.MachineSetList.Items: []v1beta1.MachineSet: v1beta1.MachineSet.Spec: v1beta1.MachineSetSpec.Template: v1beta1.MachineTemplateSpec.Spec: v1beta1.MachineSpec.ObjectMeta: v1.ObjectMeta.Labels: ReadMapCB: expect { or n, but found ", error found in #10 byte of ...|"labels":"ssd:\"true|..., bigger context ...|":"badmachineset"}},"spec":{"metadata":{"labels":"ssd:\"true\""},"providerSpec":{"value":{"ami":{"id|...


Expected results:
The bad machineset would be ignored, with no machines created and a reasonable error logged. Good and existing machinesets would continue functioning normally. Better yet, OpenShift wouldn't even let you create the bad machineset to begin with.

Additional info:
badmachineset.yaml:25 for the incorrect label
goodmachineset.yaml:25-26 for the correct labels
controller-manager.log for full logs. Note at the end of this log is when I delete the badmachineset and everything starts working.

Comment 1 nate stephany 2019-04-22 23:05:13 UTC
Created attachment 1557284
goodmachineset.yaml

Comment 2 nate stephany 2019-04-22 23:05:43 UTC
Created attachment 1557285
logs from controller-manager container

Comment 3 Jan Chaloupka 2019-04-23 11:13:26 UTC
The problem exists at the level of the machineset CRD definition [1]. The generated CRD defines the metadata field as type `object` instead of providing a full definition of what is allowed.

[1] https://github.com/kubernetes-sigs/cluster-api/blob/master/config/crds/cluster_v1alpha1_machineset.yaml#L68-L70
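To illustrate the difference between a bare `object` type and a fuller definition, here is a minimal sketch. The `validate` function and both schemas are hypothetical and written only for this illustration; this is not the Kubernetes OpenAPI validator, and the error strings only loosely mimic the API server's wording.

```python
# Minimal, hypothetical schema checker. NOT the Kubernetes OpenAPI validator.
def validate(value, schema, path="spec.template.spec.metadata"):
    """Return a list of error strings for `value` checked against `schema`."""
    errors = []
    kind = schema.get("type")
    if kind == "object":
        if not isinstance(value, dict):
            return [f"{path} must be of type object"]
        properties = schema.get("properties", {})
        # Check named properties, if the schema declares any.
        for key, sub in properties.items():
            if key in value:
                errors += validate(value[key], sub, f"{path}.{key}")
        # Check every other value of a free-form map such as labels.
        extra = schema.get("additionalProperties")
        if isinstance(extra, dict):
            for key, item in value.items():
                if key not in properties:
                    errors += validate(item, extra, f"{path}.{key}")
    elif kind == "string":
        if not isinstance(value, str):
            errors.append(f"{path} must be of type string")
    return errors


# Comparable to the generated CRD: any object at all is accepted.
WEAK_SCHEMA = {"type": "object"}

# Fuller definition (assumed shape): labels must map strings to strings.
STRICT_SCHEMA = {
    "type": "object",
    "properties": {
        "labels": {"type": "object", "additionalProperties": {"type": "string"}},
    },
}

bad_metadata = {"labels": 'ssd:"true"'}   # the value seen in the controller log
good_metadata = {"labels": {"ssd": "true"}}

print(validate(bad_metadata, WEAK_SCHEMA))     # empty list: the bad object is admitted
print(validate(bad_metadata, STRICT_SCHEMA))   # one error for the labels field
print(validate(good_metadata, STRICT_SCHEMA))  # empty list
```

With the weak schema the malformed labels value passes, because the only requirement is that metadata is some object. Only the fuller schema can reject it at create/update time.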

Comment 4 Jan Chaloupka 2019-04-23 11:33:48 UTC
It's even hardcoded in the generator itself: https://github.com/kubernetes-sigs/controller-tools/blob/master/pkg/internal/codegen/parse/crd.go#L169

Comment 5 Jan Chaloupka 2019-04-23 11:38:39 UTC
In short, this issue is that the cluster allows an invalid machineset to be created (due to weak constraints on the metadata field in the generated CRD). Once an invalid machineset exists, the informer can no longer list machineset objects, because the decoder cannot deserialize the invalid machineset(s) into MachineSet object(s). This makes the machineset controller inoperable.
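A rough sketch of why a single invalid object blocks every machineset: a typed decoder either returns the whole list or an error. The code below is a Python stand-in, not the controller's actual Go decoder; `DecodeError`, `decode_labels`, `decode_machineset_list` and the `machineset` helper are hypothetical names, and the payload is invented for illustration.

```python
import json


class DecodeError(Exception):
    """Raised when a JSON document does not match the expected typed shape."""


def decode_labels(raw, path):
    # Labels must be a map of string to string (map[string]string in Go).
    if raw is None:
        return {}
    if not isinstance(raw, dict):
        raise DecodeError(f"{path}: expected an object, found {type(raw).__name__}")
    for key, value in raw.items():
        if not isinstance(value, str):
            raise DecodeError(f"{path}.{key}: expected a string")
    return dict(raw)


def decode_machineset_list(payload):
    """Decode every item or fail as a whole, as a typed list decoder does."""
    decoded = []
    for index, item in enumerate(json.loads(payload)["items"]):
        metadata = item["spec"]["template"]["spec"].get("metadata") or {}
        path = f"items[{index}].spec.template.spec.metadata.labels"
        decoded.append({
            "name": item["metadata"]["name"],
            "labels": decode_labels(metadata.get("labels"), path),
        })
    return decoded


def machineset(name, labels):
    # Hypothetical helper that builds a heavily trimmed MachineSet object.
    return {
        "metadata": {"name": name},
        "spec": {"template": {"spec": {"metadata": {"labels": labels}}}},
    }


# Invented list response: one valid MachineSet and one with string labels.
payload = json.dumps({"items": [
    machineset("goodmachineset", {"ssd": "true"}),
    machineset("badmachineset", 'ssd:"true"'),
]})

try:
    machinesets = decode_machineset_list(payload)
except DecodeError as err:
    # The valid item is lost too: the caller receives no list at all.
    machinesets = None
    print(f"Failed to list MachineSets: {err}")
```

Because the list call fails as a unit, the controller never sees goodmachineset either, which matches the observation that no machineset can scale until the bad one is corrected or deleted.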

Comment 6 Jan Chaloupka 2019-04-23 11:40:30 UTC
Related upstream issue: https://github.com/kubernetes-sigs/controller-tools/issues/167

Comment 7 Jan Chaloupka 2019-04-23 13:31:13 UTC
Upstream PR to extend the CRD generator with metadata validation: https://github.com/kubernetes-sigs/controller-tools/pull/195

Comment 8 Jan Chaloupka 2019-04-24 08:09:21 UTC
PR for machine-api-operator updating CRDs: https://github.com/openshift/machine-api-operator/pull/297

Comment 9 Joel Speed 2020-04-03 12:59:47 UTC
This is still an issue within our APIs. Upstream works around this issue by embedding a subset of the metadata object within their types (https://github.com/kubernetes-sigs/cluster-api/pull/1062). We could potentially do the same, which would give us proper validation for the metadata fields and prevent this scenario.

Comment 12 Milind Yadav 2020-04-09 04:30:55 UTC
Validated on :
[miyadav@miyadav bug1702089]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-04-09-005126   True        False         56m     Cluster version is 4.5.0-0.nightly-2020-04-09-005126

 

[miyadav@miyadav bug1702089]$ oc project openshift-machine-api
Now using project "openshift-machine-api" on server "https://api.miyadav-0904.qe.devcluster.openshift.com:6443".
[miyadav@miyadav bug1702089]$ oc get machineset
NAME                                   DESIRED   CURRENT   READY   AVAILABLE   AGE
miyadav-0904-ctx99-worker-us-east-2a   1         1         1       1           37m
miyadav-0904-ctx99-worker-us-east-2b   1         1         1       1           37m
miyadav-0904-ctx99-worker-us-east-2c   1         1         1       1           37m
[miyadav@miyadav bug1702089]$ oc get machineset  miyadav-0904-ctx99-worker-us-east-2c -o yaml > badmachineset.yml
[miyadav@miyadav bug1702089]$ vi badmachineset.yml 

Edited machineset with multiple invalid values for metadata

[miyadav@miyadav bug1702089]$ oc create -f badmachineset.yml 
The MachineSet "badmachineset" is invalid: spec.template.metadata.labels: Invalid value: "string": spec.template.metadata.labels in body must be of type object: "string"
[miyadav@miyadav bug1702089]$ oc get machineset  miyadav-0904-ctx99-worker-us-east-2c -o yaml > badmachineset.yml
[miyadav@miyadav bug1702089]$ vi badmachineset.yml 
[miyadav@miyadav bug1702089]$ vi badmachineset.yml 
[miyadav@miyadav bug1702089]$ oc create -f badmachineset.yml 
The MachineSet "miyadav-0904-ctx99-badmachineset-us-east-2c" is invalid: spec.template.spec.metadata.labels: Invalid value: "string": spec.template.spec.metadata.labels in body must be of type object: "string"
[miyadav@miyadav bug1702089]$ vi badmachineset.yml 
[miyadav@miyadav bug1702089]$ oc create -f badmachineset.yml 
The MachineSet "miyadav-0904-ctx99-badmachineset-us-east-2c" is invalid: spec.template.spec.metadata.labels: Invalid value: "string": spec.template.spec.metadata.labels in body must be of type object: "string"
[miyadav@miyadav bug1702089]$ vi badmachineset_2.yml
[miyadav@miyadav bug1702089]$ vi badmachineset.yml 
[miyadav@miyadav bug1702089]$ oc create -f badmachineset.yml 
The MachineSet "miyadav-0904-ctx99-badmachineset-us-east-2c" is invalid: spec.template.spec.metadata.labels: Invalid value: "string": spec.template.spec.metadata.labels in body must be of type object: "string"
[miyadav@miyadav bug1702089]$ vi badmachineset.yml 
[miyadav@miyadav bug1702089]$ oc create -f badmachineset.yml 
The MachineSet "miyadav-0904-ctx99-badmachineset-us-east-2c" is invalid: spec.template.spec.metadata.labels.ssd: Invalid value: "boolean": spec.template.spec.metadata.labels.ssd in body must be of type string: "boolean"

[miyadav@miyadav bug1702089]$ oc create -f badmachineset.yml 
The MachineSet "miyadav-0904-ctx99-badmachineset-us-east-2c" is invalid: spec.template.spec.metadata: Invalid value: "null": spec.template.spec.metadata in body must be of type object: "null"

LOGS:
Invalid id: "ami-0e8fa6e37e7"
	status code: 400, request id: 08cd9e1b-7277-43bc-9470-6bc34556315c
W0409 04:06:01.666476       1 controller.go:311] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: failed to create machine: failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: "ami-0e8fa6e37e7"
	status code: 400, request id: 08cd9e1b-7277-43bc-9470-6bc34556315c
E0409 04:06:01.666653       1 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: \"ami-0e8fa6e37e7\"\n\tstatus code: 400, request id: 08cd9e1b-7277-43bc-9470-6bc34556315c"  "controller"="machine_controller" "request"={"Namespace":"openshift-machine-api","Name":"miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql"}
I0409 04:06:01.666702       1 recorder.go:52] controller-runtime/manager/events "msg"="Warning"  "message"="failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: \"ami-0e8fa6e37e7\"\n\tstatus code: 400, request id: 08cd9e1b-7277-43bc-9470-6bc34556315c" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql","uid":"acfa460c-c9f8-4db7-a222-ac9cb52c305b","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"34103"} "reason"="FailedCreate"
I0409 04:06:02.666903       1 controller.go:165] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: reconciling Machine
I0409 04:06:02.666935       1 actuator.go:97] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: actuator checking if machine exists
I0409 04:06:02.709807       1 reconciler.go:211] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: Instance does not exist
I0409 04:06:02.709843       1 controller.go:309] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: reconciling machine triggers idempotent create
I0409 04:06:02.709854       1 actuator.go:74] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: actuator creating machine
I0409 04:06:02.711065       1 reconciler.go:38] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: creating machine
E0409 04:06:02.711093       1 reconciler.go:221] NodeRef not found in machine miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql
I0409 04:06:02.742938       1 instances.go:47] No stopped instances found for machine miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql
I0409 04:06:02.743044       1 instances.go:145] Using AMI ami-0e8fa6e37e7
I0409 04:06:02.743095       1 instances.go:77] Describing security groups based on filters
I0409 04:06:02.962568       1 instances.go:122] Describing subnets based on filters
E0409 04:06:03.139728       1 instances.go:195] Error describing AMI: InvalidAMIID.Malformed: Invalid id: "ami-0e8fa6e37e7"
	status code: 400, request id: fbd05464-1a73-4cc4-ac06-52257d430ed2
E0409 04:06:03.139780       1 reconciler.go:69] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: error creating machine: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: "ami-0e8fa6e37e7"
	status code: 400, request id: fbd05464-1a73-4cc4-ac06-52257d430ed2
I0409 04:06:03.139792       1 machine_scope.go:134] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: Updating status
I0409 04:06:03.139801       1 machine_scope.go:155] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: finished calculating AWS status
I0409 04:06:03.139828       1 machine_scope.go:80] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: patching machine
E0409 04:06:03.153935       1 actuator.go:65] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql error: failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: "ami-0e8fa6e37e7"
	status code: 400, request id: fbd05464-1a73-4cc4-ac06-52257d430ed2
W0409 04:06:03.154515       1 controller.go:311] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: failed to create machine: failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: "ami-0e8fa6e37e7"
	status code: 400, request id: fbd05464-1a73-4cc4-ac06-52257d430ed2
E0409 04:06:03.154572       1 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: \"ami-0e8fa6e37e7\"\n\tstatus code: 400, request id: fbd05464-1a73-4cc4-ac06-52257d430ed2"  "controller"="machine_controller" "request"={"Namespace":"openshift-machine-api","Name":"miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql"}
I0409 04:06:03.154727       1 recorder.go:52] controller-runtime/manager/events "msg"="Warning"  "message"="failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: \"ami-0e8fa6e37e7\"\n\tstatus code: 400, request id: fbd05464-1a73-4cc4-ac06-52257d430ed2" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql","uid":"acfa460c-c9f8-4db7-a222-ac9cb52c305b","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"34112"} "reason"="FailedCreate"


Edited the machineset (badmachineset yaml) to use an invalid AMI ID

[miyadav@miyadav bug1702089]$ oc get machineset
NAME                                          DESIRED   CURRENT   READY   AVAILABLE   AGE
miyadav-0904-ctx99-badmachineset-us-east-2c   1         1                             4m56s
miyadav-0904-ctx99-worker-us-east-2a          1         1         1       1           59m
miyadav-0904-ctx99-worker-us-east-2b          1         1         1       1           59m
miyadav-0904-ctx99-worker-us-east-2c          1         1         1       1           59m
[miyadav@miyadav bug1702089]$ oc edit machineset miyadav-0904-ctx99-worker-us-east-2c 

Edited valid machineset by changing replicas to 2 
 
machineset.machine.openshift.io/miyadav-0904-ctx99-worker-us-east-2c edited
[miyadav@miyadav bug1702089]$ oc get machineset -w
NAME                                          DESIRED   CURRENT   READY   AVAILABLE   AGE
miyadav-0904-ctx99-badmachineset-us-east-2c   1         1                             5m35s
miyadav-0904-ctx99-worker-us-east-2a          1         1         1       1           60m
miyadav-0904-ctx99-worker-us-east-2b          1         1         1       1           60m
miyadav-0904-ctx99-worker-us-east-2c          2         2         1       1           60m
.
.
the new machine provisioned successfully 
miyadav-0904-ctx99-worker-us-east-2c-j7sc9   Running   m4.large    us-east-2   us-east-2c   19m   ip-10-0-175-225.us-east-2.compute.internal   aws:///us-east-2c/i-005e8bdbe0d58c8ff   running
[miyadav@miyadav bug1702089]$ 


Actual & Expected: the valid machinesets worked as they should, even with an invalid machineset existing in the cluster. For many invalid values we get appropriate error messages.

Waiting for Doc text Verification

Comment 13 Milind Yadav 2020-04-09 07:27:56 UTC
Moving to VERIFIED

Comment 15 errata-xmlrpc 2020-07-13 17:11:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

