Bug 1696407

Summary: non-intuitive to detect a machineset scaling issue
Product: OpenShift Container Platform
Reporter: Justin Pierce <jupierce>
Component: Cloud Compute
Assignee: Vikas Choudhary <vichoudh>
Status: CLOSED ERRATA
QA Contact: Jianwei Hou <jhou>
Severity: low
Priority: low
Version: 4.1.0
CC: agarcial, vichoudh
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-06-04 10:47:03 UTC
Type: Bug
Attachments: listings

Description Justin Pierce 2019-04-04 19:18:39 UTC
Created attachment 1552066: listings

Description of problem:
When the replicas of a machineset are increased, the machine-api controller goes to work scaling the cluster. However, if it hits an issue (e.g. the AWS instance limit), the administrator does not receive this feedback in the resources they might expect:

1) It is not reflected in the machine api operator status
2) It is not reflected in messages or status of the machineset

Version-Release number of selected component (if applicable):
4.0.0-0.9

How reproducible:
100%

Steps to Reproduce:
1. oc edit machineset and set replicas to a value greater than your EC2 instance limit in AWS

Actual results:
Observe the machineset resource. In this case, READY and AVAILABLE will sit indefinitely below DESIRED:
NAME                            DESIRED   CURRENT   READY   AVAILABLE   AGE
int-1-hhxh8-worker-us-east-1a   20        20        18      18          4h15m

and no description of the gating issue is present in the machineset status.

status:
  availableReplicas: 18
  fullyLabeledReplicas: 20
  observedGeneration: 2
  readyReplicas: 18
  replicas: 20
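
The gap visible in the status above can be checked programmatically. The following is a minimal sketch, not part of the machine-api itself: it assumes the status block has already been loaded into a Python dict (for example from `oc get machineset -o json`), and the function name `scaling_gaps` is hypothetical. It only shows that a shortfall exists; it cannot say why, which is the point of this report.

```python
def scaling_gaps(status):
    """Return how far each readiness counter lags behind status['replicas'].

    `status` is the .status mapping of a machineset, as shown in this report.
    Missing counters are treated as 0.
    """
    desired = status.get("replicas", 0)
    gaps = {}
    for key in ("readyReplicas", "availableReplicas"):
        current = status.get(key, 0)
        if current < desired:
            gaps[key] = desired - current
    return gaps


# Values copied from the status block in this report.
reported_status = {
    "availableReplicas": 18,
    "fullyLabeledReplicas": 20,
    "observedGeneration": 2,
    "readyReplicas": 18,
    "replicas": 20,
}

print(scaling_gaps(reported_status))
# prints {'readyReplicas': 2, 'availableReplicas': 2}
```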


Expected results:
As with a Deployment, DaemonSet, or similar object, I would expect an overall status message to be fed back to the high level object.
There are ways to uncover the problem (events / messages in individual 'machine' objects), but a summary at the top level resource (and even at the operator level) would seem more consistent with Kube/OpenShift.

A high level API for assessing whether my configuration change is progressing or failing (and why) would be useful for administrators. 

Additional info:
See attachments for full machineset yaml / describe.

Comment 1 Antonio Murdaca 2019-04-04 19:29:03 UTC
The machine config operator component is NOT the right component for Machine API issues. Moving to Cloud Compute.

Comment 2 Vikas Choudhary 2019-04-08 07:21:00 UTC
Once https://github.com/openshift/cluster-api/pull/23 is merged in the cluster-api repo, the vendored cluster-api in the aws-actuator repo will have to be bumped to get this fix into the actuator.

Comment 4 Vikas Choudhary 2019-04-08 10:38:21 UTC
1. The scope of the Machine API Operator (MAO) is limited to managing the lifecycle of the different controllers and the cluster-api provider (AWS in this case). Reporting the internal status of the cluster-api stack components is not expected from the MAO.
2. The above PR will enable reporting of any machineset<->machines reconciliation failures as events on the machineset object. Further details can be found in the `InstanceState` field, https://github.com/openshift/cluster-api-provider-aws/blob/4d953241bc7f62785e0ff9f759315f386e790ba2/pkg/apis/awsproviderconfig/v1beta1/awsmachineproviderconfig_types.go#L46, of a particular machine object.

Comment 6 Jianwei Hou 2019-04-24 09:13:49 UTC
Verified in 4.1.0-0.nightly-2019-04-23-223857

The InstanceState in the machine's providerStatus shows the instance info the machine is associated with. The instance limit of my account is 700, so I cannot hit it at the moment. Instead, I set a wrong AMI in the machineSet and let it scale. The error is properly reported.


```
error launching instance: error getting blockDeviceMappings: error
        describing AMI: InvalidAMIID.Malformed: Invalid id:
```

The error is also recorded in the machine-controller log.
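
For anyone who wants to collect this per-machine information in one place, here is a rough sketch. It assumes the machine objects have been loaded as Python dicts (for example from `oc get machines -o json`), that the state lives at `status.providerStatus.instanceState`, and that failure text appears in the `message` of entries under `status.providerStatus.conditions`. Those field paths are assumptions that should be checked against the provider version in use. The function name `stalled_machines` and the sample data are hypothetical.

```python
def stalled_machines(machines, healthy_state="running"):
    """List (name, instance_state, messages) for machines not in healthy_state.

    Field paths are assumptions about the AWS provider status layout:
      status.providerStatus.instanceState
      status.providerStatus.conditions[].message
    """
    report = []
    for machine in machines:
        name = machine.get("metadata", {}).get("name", "<unknown>")
        provider_status = machine.get("status", {}).get("providerStatus") or {}
        state = provider_status.get("instanceState")
        if state == healthy_state:
            continue
        messages = [
            cond["message"]
            for cond in provider_status.get("conditions", [])
            if cond.get("message")
        ]
        report.append((name, state, messages))
    return report


# Hypothetical sample data, shaped like the assumed layout above.
sample = [
    {"metadata": {"name": "worker-a"},
     "status": {"providerStatus": {"instanceState": "running"}}},
    {"metadata": {"name": "worker-b"},
     "status": {"providerStatus": {
         "conditions": [{"message": "example failure message"}]}}},
]

for name, state, messages in stalled_machines(sample):
    print(name, state, messages)
```

A machine with no instance at all (as in the wrong-AMI case) has no instanceState, so it is reported with a state of None together with whatever condition messages are present.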

Comment 8 errata-xmlrpc 2019-06-04 10:47:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758