Bug 1696407 - non-intuitive to detect a machineset scaling issue
Summary: non-intuitive to detect a machineset scaling issue
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.1.0
Assignee: Vikas Choudhary
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-04-04 19:18 UTC by Justin Pierce
Modified: 2019-06-04 10:47 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:47:03 UTC
Target Upstream Version:


Attachments
listings (5.44 KB, text/plain)
2019-04-04 19:18 UTC, Justin Pierce


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 0 None None None 2019-06-04 10:47:11 UTC

Description Justin Pierce 2019-04-04 19:18:39 UTC
Created attachment 1552066
listings

Description of problem:
When the replicas of a machineset are increased, the machine-api controller goes to work scaling the cluster. However, if an issue is hit (e.g. AWS instance limit), the administrator does not receive this feedback in the resources they might expect:

For example:
1) It is not reflected in the machine api operator status.
2) It is not reflected in the messages or status of the machineset.

Version-Release number of selected component (if applicable):
4.0.0-0.9

How reproducible:
100%

Steps to Reproduce:
1. Run `oc edit machineset` and set replicas higher than your EC2 instance limit in AWS.

Actual results:
Observe the machineset resource. In this case, it will sit indefinitely with fewer READY replicas than DESIRED.
NAME                            DESIRED   CURRENT   READY   AVAILABLE   AGE
int-1-hhxh8-worker-us-east-1a   20        20        18      18          4h15m

and no description of the gating issue is present in the machineset status.

status:
  availableReplicas: 18
  fullyLabeledReplicas: 20
  observedGeneration: 2
  readyReplicas: 18
  replicas: 20
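
For illustration, the gap visible in the status above can be computed directly from its fields. This is a minimal Python sketch, not part of the machine-api: the helper name `replica_shortfall` is hypothetical, and the input is assumed to be the status block above loaded as a dict. It only quantifies the shortfall; it cannot say why the replicas are stuck, which is the missing feedback this report is about.

```python
def replica_shortfall(status):
    """Return how many replicas exist but are not yet ready or available.

    `status` is assumed to be a machineset's .status loaded as a dict.
    Missing counters are treated as 0.
    """
    replicas = status.get("replicas", 0)
    return {
        "not_ready": replicas - status.get("readyReplicas", 0),
        "not_available": replicas - status.get("availableReplicas", 0),
    }


# The status block from this report.
status = {
    "availableReplicas": 18,
    "fullyLabeledReplicas": 20,
    "observedGeneration": 2,
    "readyReplicas": 18,
    "replicas": 20,
}

shortfall = replica_shortfall(status)
print(shortfall)
```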


Expected results:
Like a Deployment/DaemonSet/etc object, I would expect an overall status message to be fed back to the high level object. 
There are ways to uncover the problem (events / messages in individual 'machine' objects), but a summary at the top level resource (and even at the operator level) would seem more consistent with Kube/OpenShift.

A high level API for assessing whether my configuration change is progressing or failing (and why) would be useful for administrators. 

Additional info:
See attachments for full machineset yaml / describe.

Comment 1 Antonio Murdaca 2019-04-04 19:29:03 UTC
The machine config operator component is NOT the right component for Machine API issues. Moving to Cloud Compute.

Comment 2 Vikas Choudhary 2019-04-08 07:21:00 UTC
Once https://github.com/openshift/cluster-api/pull/23 is merged in the cluster-api repo, we will have to bump the vendored cluster-api in the aws-actuator repo to get this fix into the actuator.

Comment 4 Vikas Choudhary 2019-04-08 10:38:21 UTC
1. The scope of the Machine API Operator (MAO) is limited to managing the lifecycle of the different controllers and the cluster-api provider (AWS in this case). Reporting the internal functional status of the cluster-api stack components is not expected from the MAO.
2. The above PR will enable reporting of any machineset<->machines reconciliation failures as events on the machineset object. Further details can be discovered by looking at the `InstanceState` field, https://github.com/openshift/cluster-api-provider-aws/blob/4d953241bc7f62785e0ff9f759315f386e790ba2/pkg/apis/awsproviderconfig/v1beta1/awsmachineproviderconfig_types.go#L46, in a particular machine object.
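
As a rough illustration of the per-machine inspection described in point 2, the sketch below filters machine objects whose provider status does not report a running instance. It is a minimal Python sketch under stated assumptions: the helper name `unhealthy_machines` is hypothetical, the JSON layout (`status.providerStatus.instanceState` and `status.providerStatus.conditions[].message`) is assumed rather than confirmed by this report, and the sample machines and message are made up.

```python
def unhealthy_machines(machines):
    """Return (name, instance_state, messages) for machines not reported as running.

    `machines` is assumed to be a list of machine objects loaded as dicts.
    The providerStatus field names used here are assumptions.
    """
    results = []
    for machine in machines:
        name = machine.get("metadata", {}).get("name", "<unknown>")
        provider_status = machine.get("status", {}).get("providerStatus") or {}
        state = provider_status.get("instanceState")
        if state == "running":
            continue
        # Collect any non-empty condition messages for context.
        messages = [
            cond["message"]
            for cond in provider_status.get("conditions", [])
            if cond.get("message")
        ]
        results.append((name, state, messages))
    return results


# Hypothetical sample data, not taken from a real cluster.
machines = [
    {
        "metadata": {"name": "worker-a"},
        "status": {"providerStatus": {"instanceState": "running"}},
    },
    {
        "metadata": {"name": "worker-b"},
        "status": {
            "providerStatus": {
                "conditions": [{"message": "sample launch failure"}],
            }
        },
    },
]

for name, state, messages in unhealthy_machines(machines):
    print(name, state, messages)
```

A machine with no `instanceState` at all is treated as unhealthy here, on the assumption that a failed launch leaves the field unset.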

Comment 6 Jianwei Hou 2019-04-24 09:13:49 UTC
Verified in 4.1.0-0.nightly-2019-04-23-223857

The InstanceState from the machine's providerStatus shows information about the instance the machine is associated with. The instance limit of my account is 700, so I cannot hit it at the moment. Instead, I set a wrong AMI in the machineSet and let it scale. The error is properly reported.


```
error launching instance: error getting blockDeviceMappings: error
        describing AMI: InvalidAMIID.Malformed: Invalid id:
```

The error is also recorded in the machine-controller log.

Comment 8 errata-xmlrpc 2019-06-04 10:47:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

