Bug 1702089

Summary: Bad MachineSet prevent all other MachineSets from scaling activities
Product: OpenShift Container Platform Reporter: nate stephany <nstephan>
Component: Cloud Compute Assignee: Joel Speed <jspeed>
Cloud Compute sub component: Other Providers QA Contact: Milind Yadav <miyadav>
Status: CLOSED ERRATA Docs Contact:
Severity: low    
Priority: unspecified CC: agarcial, aos-bugs, eparis, gblomqui, jokerman, mmccomas, rbost
Version: 4.1.0   
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: metadata field within Machine/MachineSet Spec is not validated on create/update
Consequence: Invalid metadata causes unmarshalling errors within controllers leading to controllers not being able to process objects as expected
Fix: Enable validation on metadata field within Machine/MachineSet Spec
Result: Errors in the metadata field are now returned to the user at create/update
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-07-13 17:11:03 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description                              Flags
badmachineset.yaml                       none
goodmachineset.yaml                      none
logs from controller-manager container   none

Description nate stephany 2019-04-22 23:04:39 UTC
Created attachment 1557283 [details]
badmachineset.yaml

Description of problem:
When a new MachineSet is created with an improperly formatted label for the new nodes, no new Machines (for any MachineSet) can be created.

Version-Release number of selected component (if applicable):
4.0.0-0.11

How reproducible:
100%

Steps to Reproduce:
1. Deploy a cluster and create a new MachineSet based on attached badmachineset.yaml
2. oc get machineset or oc get machines and see no new machines being provisioned
3. oc logs clusterapi-manager-controllers-XXXX -c controller-manager to see the errors clogging up the controller
4. Create a new machineset (based on goodmachineset.yaml) or scale an existing worker machineset and watch nothing happen.
5. Correct or delete the label in badmachineset and everything kicks into gear

Actual results:
No new machines are provisioned for _any_ machineset (bad or existing).

The controller-manager container in the clusterapi-manager-controllers pod logs this error every second (see attached log):

I0422 22:26:20.379498       1 reflector.go:169] Listing and watching *v1beta1.MachineSet from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126
E0422 22:26:20.382097       1 reflector.go:134] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to list *v1beta1.MachineSet: v1beta1.MachineSetList.Items: []v1beta1.MachineSet: v1beta1.MachineSet.Spec: v1beta1.MachineSetSpec.Template: v1beta1.MachineTemplateSpec.Spec: v1beta1.MachineSpec.ObjectMeta: v1.ObjectMeta.Labels: ReadMapCB: expect { or n, but found ", error found in #10 byte of ...|"labels":"ssd:\"true|..., bigger context ...|":"badmachineset"}},"spec":{"metadata":{"labels":"ssd:\"true\""},"providerSpec":{"value":{"ami":{"id|...


Expected results:
The bad machineset would be ignored, with no machines created and a reasonable error logged. Good and existing machinesets would continue functioning normally. Better yet, OpenShift wouldn't even let you create the badmachineset to begin with.

Additional info:
badmachineset.yaml:25 for the incorrect label
goodmachineset.yaml:25-26 for the correct labels
controller-manager.log for full logs. Note that the end of this log shows the point where I deleted the badmachineset and everything started working.

Comment 1 nate stephany 2019-04-22 23:05:13 UTC
Created attachment 1557284 [details]
goodmachineset.yaml

Comment 2 nate stephany 2019-04-22 23:05:43 UTC
Created attachment 1557285 [details]
logs from controller-manager container

Comment 3 Jan Chaloupka 2019-04-23 11:13:26 UTC
The problem exists at the level of the machineset CRD definition [1]. The generated CRD defines the metadata field as type `object` instead of providing a full definition of what is allowed.

[1] https://github.com/kubernetes-sigs/cluster-api/blob/master/config/crds/cluster_v1alpha1_machineset.yaml#L68-L70
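
To illustrate the gap, here is a minimal Python sketch. It is not the real OpenAPI validator used by the API server: the `validate` helper and both schema dicts are hypothetical and support only a tiny subset of schema keywords. Under those assumptions, a bare `object` schema accepts the malformed labels from this report, while a fuller definition rejects them.

```python
# Hypothetical, simplified schema checker: supports only "type",
# "properties" and "additionalProperties". Not the Kubernetes validator.

def validate(value, schema, path="metadata"):
    """Return a list of error strings; an empty list means the value is accepted."""
    expected = schema.get("type")
    if expected == "object":
        if not isinstance(value, dict):
            return [f"{path} must be of type object"]
    elif expected == "string":
        if not isinstance(value, str):
            return [f"{path} must be of type string"]
        return []
    errors = []
    if isinstance(value, dict):
        props = schema.get("properties", {})
        extra = schema.get("additionalProperties")
        for key, item in value.items():
            if key in props:
                errors += validate(item, props[key], f"{path}.{key}")
            elif isinstance(extra, dict):
                errors += validate(item, extra, f"{path}.{key}")
    return errors

# Loose schema: any object is accepted, nothing inside it is checked.
LOOSE = {"type": "object"}

# Assumed fuller definition: labels must be a map of string to string.
STRICT = {
    "type": "object",
    "properties": {
        "labels": {"type": "object", "additionalProperties": {"type": "string"}},
    },
}

bad_metadata = {"labels": 'ssd:"true"'}      # labels given as a string, as in the report
good_metadata = {"labels": {"ssd": "true"}}

print(validate(bad_metadata, LOOSE))    # []
print(validate(bad_metadata, STRICT))   # ['metadata.labels must be of type object']
print(validate(good_metadata, STRICT))  # []
```

The loose schema lets the bad object through at create time, which is why the error only surfaces later inside the controller.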

Comment 4 Jan Chaloupka 2019-04-23 11:33:48 UTC
It's even hardcoded in the generator itself: https://github.com/kubernetes-sigs/controller-tools/blob/master/pkg/internal/codegen/parse/crd.go#L169

Comment 5 Jan Chaloupka 2019-04-23 11:38:39 UTC
In short, this is what the issue is about: the cluster allows an invalid machineset to be created (due to weak constraints on the metadata field in the generated CRD). Once an invalid machineset exists, the informer cannot list machineset objects because the decoder cannot deserialize the invalid machineset(s) into MachineSet object(s). This makes the machineset controller inoperable.
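
The failure mode can be sketched in a few lines of Python. This is only an illustration under assumptions: `decode_machineset`, `list_machinesets` and `machineset` are hypothetical stand-ins for the typed decoder and the informer's list call, not the controller-runtime code. The point is that the list is decoded as a whole, so one undecodable item hides every other item.

```python
import json

def decode_machineset(raw):
    """Hypothetical typed decode: labels must be a map of string to string."""
    labels = raw["spec"]["template"]["spec"]["metadata"].get("labels", {})
    if not isinstance(labels, dict):
        raise TypeError(f"{raw['metadata']['name']}: labels is not a map")
    return {"name": raw["metadata"]["name"], "labels": labels}

def list_machinesets(payload):
    """Decode the whole list in one go; any bad item fails the entire call."""
    items = json.loads(payload)["items"]
    return [decode_machineset(item) for item in items]

def machineset(name, labels):
    # Builds a minimal MachineSet-shaped dict for the example.
    return {"metadata": {"name": name},
            "spec": {"template": {"spec": {"metadata": {"labels": labels}}}}}

good = machineset("goodmachineset", {"ssd": "true"})
bad = machineset("badmachineset", 'ssd:"true"')

# With only valid objects the list call succeeds.
print(list_machinesets(json.dumps({"items": [good]})))

try:
    list_machinesets(json.dumps({"items": [good, bad]}))
except TypeError as err:
    # The good MachineSet is never returned to the caller either.
    print("list failed:", err)
```

In this sketch the caller gets nothing at all once the bad object is present, which mirrors why scaling stopped for every MachineSet and not just the invalid one.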

Comment 6 Jan Chaloupka 2019-04-23 11:40:30 UTC
Related upstream issue: https://github.com/kubernetes-sigs/controller-tools/issues/167

Comment 7 Jan Chaloupka 2019-04-23 13:31:13 UTC
Upstream PR to extend the CRD generator with metadata validation: https://github.com/kubernetes-sigs/controller-tools/pull/195

Comment 8 Jan Chaloupka 2019-04-24 08:09:21 UTC
PR for machine-api-operator updating CRDs: https://github.com/openshift/machine-api-operator/pull/297

Comment 9 Joel Speed 2020-04-03 12:59:47 UTC
This is still an issue within our APIs. Upstream have a workaround for this issue: they embed a subset of the metadata object within their types (https://github.com/kubernetes-sigs/cluster-api/pull/1062). We could potentially do the same, and then we would have proper validation for the metadata fields, preventing this scenario.

Comment 12 Milind Yadav 2020-04-09 04:30:55 UTC
Validated on:
[miyadav@miyadav bug1702089]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-04-09-005126   True        False         56m     Cluster version is 4.5.0-0.nightly-2020-04-09-005126

 

[miyadav@miyadav bug1702089]$ oc project openshift-machine-api
Now using project "openshift-machine-api" on server "https://api.miyadav-0904.qe.devcluster.openshift.com:6443".
[miyadav@miyadav bug1702089]$ oc get machineset
NAME                                   DESIRED   CURRENT   READY   AVAILABLE   AGE
miyadav-0904-ctx99-worker-us-east-2a   1         1         1       1           37m
miyadav-0904-ctx99-worker-us-east-2b   1         1         1       1           37m
miyadav-0904-ctx99-worker-us-east-2c   1         1         1       1           37m
[miyadav@miyadav bug1702089]$ oc get machineset  miyadav-0904-ctx99-worker-us-east-2c -o yaml > badmachineset.yml
[miyadav@miyadav bug1702089]$ vi badmachineset.yml 

Edited machineset with multiple invalid values for metadata

[miyadav@miyadav bug1702089]$ oc create -f badmachineset.yml 
The MachineSet "badmachineset" is invalid: spec.template.metadata.labels: Invalid value: "string": spec.template.metadata.labels in body must be of type object: "string"
[miyadav@miyadav bug1702089]$ oc get machineset  miyadav-0904-ctx99-worker-us-east-2c -o yaml > badmachineset.yml
[miyadav@miyadav bug1702089]$ vi badmachineset.yml 
[miyadav@miyadav bug1702089]$ vi badmachineset.yml 
[miyadav@miyadav bug1702089]$ oc create -f badmachineset.yml 
The MachineSet "miyadav-0904-ctx99-badmachineset-us-east-2c" is invalid: spec.template.spec.metadata.labels: Invalid value: "string": spec.template.spec.metadata.labels in body must be of type object: "string"
[miyadav@miyadav bug1702089]$ vi badmachineset.yml 
[miyadav@miyadav bug1702089]$ oc create -f badmachineset.yml 
The MachineSet "miyadav-0904-ctx99-badmachineset-us-east-2c" is invalid: spec.template.spec.metadata.labels: Invalid value: "string": spec.template.spec.metadata.labels in body must be of type object: "string"
[miyadav@miyadav bug1702089]$ vi badmachineset_2.yml
[miyadav@miyadav bug1702089]$ vi badmachineset.yml 
[miyadav@miyadav bug1702089]$ oc create -f badmachineset.yml 
The MachineSet "miyadav-0904-ctx99-badmachineset-us-east-2c" is invalid: spec.template.spec.metadata.labels: Invalid value: "string": spec.template.spec.metadata.labels in body must be of type object: "string"
[miyadav@miyadav bug1702089]$ vi badmachineset.yml 
[miyadav@miyadav bug1702089]$ oc create -f badmachineset.yml 
The MachineSet "miyadav-0904-ctx99-badmachineset-us-east-2c" is invalid: spec.template.spec.metadata.labels.ssd: Invalid value: "boolean": spec.template.spec.metadata.labels.ssd in body must be of type string: "boolean"

[miyadav@miyadav bug1702089]$ oc create -f badmachineset.yml 
The MachineSet "miyadav-0904-ctx99-badmachineset-us-east-2c" is invalid: spec.template.spec.metadata: Invalid value: "null": spec.template.spec.metadata in body must be of type object: "null"

LOGS:
Invalid id: "ami-0e8fa6e37e7"
	status code: 400, request id: 08cd9e1b-7277-43bc-9470-6bc34556315c
W0409 04:06:01.666476       1 controller.go:311] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: failed to create machine: failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: "ami-0e8fa6e37e7"
	status code: 400, request id: 08cd9e1b-7277-43bc-9470-6bc34556315c
E0409 04:06:01.666653       1 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: \"ami-0e8fa6e37e7\"\n\tstatus code: 400, request id: 08cd9e1b-7277-43bc-9470-6bc34556315c"  "controller"="machine_controller" "request"={"Namespace":"openshift-machine-api","Name":"miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql"}
I0409 04:06:01.666702       1 recorder.go:52] controller-runtime/manager/events "msg"="Warning"  "message"="failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: \"ami-0e8fa6e37e7\"\n\tstatus code: 400, request id: 08cd9e1b-7277-43bc-9470-6bc34556315c" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql","uid":"acfa460c-c9f8-4db7-a222-ac9cb52c305b","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"34103"} "reason"="FailedCreate"
I0409 04:06:02.666903       1 controller.go:165] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: reconciling Machine
I0409 04:06:02.666935       1 actuator.go:97] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: actuator checking if machine exists
I0409 04:06:02.709807       1 reconciler.go:211] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: Instance does not exist
I0409 04:06:02.709843       1 controller.go:309] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: reconciling machine triggers idempotent create
I0409 04:06:02.709854       1 actuator.go:74] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: actuator creating machine
I0409 04:06:02.711065       1 reconciler.go:38] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: creating machine
E0409 04:06:02.711093       1 reconciler.go:221] NodeRef not found in machine miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql
I0409 04:06:02.742938       1 instances.go:47] No stopped instances found for machine miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql
I0409 04:06:02.743044       1 instances.go:145] Using AMI ami-0e8fa6e37e7
I0409 04:06:02.743095       1 instances.go:77] Describing security groups based on filters
I0409 04:06:02.962568       1 instances.go:122] Describing subnets based on filters
E0409 04:06:03.139728       1 instances.go:195] Error describing AMI: InvalidAMIID.Malformed: Invalid id: "ami-0e8fa6e37e7"
	status code: 400, request id: fbd05464-1a73-4cc4-ac06-52257d430ed2
E0409 04:06:03.139780       1 reconciler.go:69] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: error creating machine: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: "ami-0e8fa6e37e7"
	status code: 400, request id: fbd05464-1a73-4cc4-ac06-52257d430ed2
I0409 04:06:03.139792       1 machine_scope.go:134] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: Updating status
I0409 04:06:03.139801       1 machine_scope.go:155] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: finished calculating AWS status
I0409 04:06:03.139828       1 machine_scope.go:80] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: patching machine
E0409 04:06:03.153935       1 actuator.go:65] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql error: failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: "ami-0e8fa6e37e7"
	status code: 400, request id: fbd05464-1a73-4cc4-ac06-52257d430ed2
W0409 04:06:03.154515       1 controller.go:311] miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql: failed to create machine: failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: "ami-0e8fa6e37e7"
	status code: 400, request id: fbd05464-1a73-4cc4-ac06-52257d430ed2
E0409 04:06:03.154572       1 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: \"ami-0e8fa6e37e7\"\n\tstatus code: 400, request id: fbd05464-1a73-4cc4-ac06-52257d430ed2"  "controller"="machine_controller" "request"={"Namespace":"openshift-machine-api","Name":"miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql"}
I0409 04:06:03.154727       1 recorder.go:52] controller-runtime/manager/events "msg"="Warning"  "message"="failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: \"ami-0e8fa6e37e7\"\n\tstatus code: 400, request id: fbd05464-1a73-4cc4-ac06-52257d430ed2" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"miyadav-0904-ctx99-badmachineset-us-east-2c-rdfql","uid":"acfa460c-c9f8-4db7-a222-ac9cb52c305b","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"34112"} "reason"="FailedCreate"


Edited machineset (badmachineset yaml) with invalid ami-id

[miyadav@miyadav bug1702089]$ oc get machineset
NAME                                          DESIRED   CURRENT   READY   AVAILABLE   AGE
miyadav-0904-ctx99-badmachineset-us-east-2c   1         1                             4m56s
miyadav-0904-ctx99-worker-us-east-2a          1         1         1       1           59m
miyadav-0904-ctx99-worker-us-east-2b          1         1         1       1           59m
miyadav-0904-ctx99-worker-us-east-2c          1         1         1       1           59m
[miyadav@miyadav bug1702089]$ oc edit machineset miyadav-0904-ctx99-worker-us-east-2c 

Edited valid machineset by changing replicas to 2 
 
machineset.machine.openshift.io/miyadav-0904-ctx99-worker-us-east-2c edited
[miyadav@miyadav bug1702089]$ oc get machineset -w
NAME                                          DESIRED   CURRENT   READY   AVAILABLE   AGE
miyadav-0904-ctx99-badmachineset-us-east-2c   1         1                             5m35s
miyadav-0904-ctx99-worker-us-east-2a          1         1         1       1           60m
miyadav-0904-ctx99-worker-us-east-2b          1         1         1       1           60m
miyadav-0904-ctx99-worker-us-east-2c          2         2         1       1           60m
.
.
the new machine provisioned successfully 
miyadav-0904-ctx99-worker-us-east-2c-j7sc9   Running   m4.large    us-east-2   us-east-2c   19m   ip-10-0-175-225.us-east-2.compute.internal   aws:///us-east-2c/i-005e8bdbe0d58c8ff   running
[miyadav@miyadav bug1702089]$ 


Actual & Expected: the valid machinesets worked as they should, even while an invalid machineset exists in the cluster, and for many invalid values we now get appropriate error messages.

Waiting for Doc text Verification

Comment 13 Milind Yadav 2020-04-09 07:27:56 UTC
Moving to VERIFIED

Comment 15 errata-xmlrpc 2020-07-13 17:11:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409