Created attachment 1552066
Description of problem:
When the replicas of a machineset are increased, the machine-api controller starts scaling the cluster. However, if it hits a problem (e.g. the AWS instance limit), the administrator does not get this feedback in the resources where they might expect it:
1) It is not reflected in the machine api operator status
2) It is not reflected in messages or status of the machineset
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. oc edit machineset to have replicas > your ec2 instance limit in AWS
2. Observe the machineset resource. In this case, it sits indefinitely at less than the 'desired' count:
NAME                            DESIRED   CURRENT   READY   AVAILABLE   AGE
int-1-hhxh8-worker-us-east-1a   20        20        18      18          4h15m
and no description of the gating issue is present in the machineset status.
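The gap in the listing above can be computed from the machineset object itself. The following is only a rough sketch: it assumes the JSON form of the machineset (e.g. from `oc get machineset -o json`) exposes `spec.replicas`, `status.replicas`, `status.readyReplicas` and `status.availableReplicas`, and the helper name `replica_shortfall` is made up for illustration.

```python
def replica_shortfall(machineset: dict) -> dict:
    """Compare desired replicas with what the machineset status reports."""
    desired = machineset.get("spec", {}).get("replicas", 0)
    status = machineset.get("status", {})
    ready = status.get("readyReplicas", 0)
    return {
        "desired": desired,
        "current": status.get("replicas", 0),
        "ready": ready,
        "available": status.get("availableReplicas", 0),
        # Number of replicas that were requested but never became ready.
        "missing_ready": max(desired - ready, 0),
    }


# Values taken from the listing above; the dict layout is an assumption.
sample = {
    "spec": {"replicas": 20},
    "status": {"replicas": 20, "readyReplicas": 18, "availableReplicas": 18},
}
summary = replica_shortfall(sample)
print(summary["missing_ready"])
```

This only shows that replicas are missing; it says nothing about why, which is the information this report asks for.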
As with a Deployment, DaemonSet, or similar object, I would expect an overall status message to be fed back to the high-level object.
There are ways to uncover the problem (events / messages in individual 'machine' objects), but a summary at the top level resource (and even at the operator level) would seem more consistent with Kube/OpenShift.
A high level API for assessing whether my configuration change is progressing or failing (and why) would be useful for administrators.
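To illustrate the kind of top-level summary meant here, a minimal sketch follows. It assumes that each machine object carries its failure text in `status.errorMessage`; on a real cluster the detail may instead appear in events or in provider-specific conditions. The function name `summarize_machine_errors` and the sample messages are hypothetical.

```python
from collections import Counter


def summarize_machine_errors(machines: list) -> list:
    """Group identical per-machine error messages into summary lines."""
    counts = Counter()
    for machine in machines:
        message = machine.get("status", {}).get("errorMessage")
        if message:
            counts[message] += 1
    # Most frequent message first.
    return [f"{n} machine(s): {msg}" for msg, n in counts.most_common()]


# Hypothetical machine objects; the error text is a placeholder.
machines = [
    {"metadata": {"name": "worker-a"}, "status": {"errorMessage": "instance limit reached"}},
    {"metadata": {"name": "worker-b"}, "status": {"errorMessage": "instance limit reached"}},
    {"metadata": {"name": "worker-c"}, "status": {}},
]
lines = summarize_machine_errors(machines)
print(lines)
```

A summary like this, attached to the machineset status, would let an administrator see the gating issue without inspecting every machine.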
See attachments for full machineset yaml / describe.
The machine config operator component is NOT the right component for Machine API issues. Moving to Cloud Compute.
Once https://github.com/openshift/cluster-api/pull/23 is merged in the cluster-api repo, the vendored cluster-api in the aws-actuator repo will have to be bumped to get this fix into the actuator.
1. The scope of the Machine API Operator extends only to managing the lifecycle of the different controllers and the cluster-api provider (AWS in this case). Reporting the internal status of the cluster-api stack components is not expected from the MAO.
2. The above PR will enable reporting of any machineset<->machines reconciliation failures as events on the machineset object. Further details can also be found in the `InstanceState` field, https://github.com/openshift/cluster-api-provider-aws/blob/4d953241bc7f62785e0ff9f759315f386e790ba2/pkg/apis/awsproviderconfig/v1beta1/awsmachineproviderconfig_types.go#L46, of a particular machine object.
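As a rough sketch of checking the `InstanceState` field across machines, the snippet below assumes the field is serialized as `status.providerStatus.instanceState` in the machine's JSON. The helper name `instance_states` and the sample objects are illustrative only.

```python
def instance_states(machines: list) -> dict:
    """Map each machine name to the instance state in its providerStatus."""
    states = {}
    for machine in machines:
        name = machine.get("metadata", {}).get("name", "<unnamed>")
        provider_status = machine.get("status", {}).get("providerStatus") or {}
        # A machine whose instance was never launched has no state recorded.
        states[name] = provider_status.get("instanceState", "unknown")
    return states


# Hypothetical machine objects for illustration.
machines = [
    {"metadata": {"name": "worker-a"},
     "status": {"providerStatus": {"instanceState": "running"}}},
    {"metadata": {"name": "worker-b"}, "status": {}},
]
states = instance_states(machines)
print(states)
```

Machines reported as `unknown` here are the ones whose events or logs are worth inspecting first.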
Verified in 4.1.0-0.nightly-2019-04-23-223857
The InstanceState from the machine's providerStatus shows information about the instance the machine is associated with. The instance limit of my account is 700, so I cannot hit it at the moment. Instead, I set a wrong AMI in the machineSet and let it scale. The error is properly reported:
error launching instance: error getting blockDeviceMappings: error
describing AMI: InvalidAMIID.Malformed: Invalid id:
The error is also recorded in the machine-controller log.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.