1696407 – non-intuitive to detect a machineset scaling issue

Summary: non-intuitive to detect a machineset scaling issue

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	low
Target Milestone:	---
Target Release:	4.1.0
Assignee:	Vikas Choudhary
QA Contact:	Jianwei Hou
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-04-04 19:18 UTC by Justin Pierce
Modified:	2019-06-04 10:47 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-06-04 10:47:03 UTC
Target Upstream Version:
Embargoed:

Attachments
listings (5.44 KB, text/plain) 2019-04-04 19:18 UTC, Justin Pierce

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2019:0758	0	None	None	None	2019-06-04 10:47:11 UTC

Description Justin Pierce 2019-04-04 19:18:39 UTC

Created attachment 1552066
listings

Description of problem:
When the replicas of a machineset are increased, the machine-api controller starts scaling the cluster. However, if it hits a problem (e.g. the AWS instance limit), the administrator gets no feedback in the resources where they would expect it. For example:

1) It is not reflected in the machine-api-operator status.
2) It is not reflected in the messages or status of the machineset.

Version-Release number of selected component (if applicable):
4.0.0-0.9

How reproducible:
100%

Steps to Reproduce:
1. Run `oc edit machineset` and set replicas higher than your EC2 instance limit in AWS.

Actual results:
Observe the machineset resource. In this case, it will sit indefinitely with fewer ready replicas than 'desired'.
NAME                            DESIRED   CURRENT   READY   AVAILABLE   AGE
int-1-hhxh8-worker-us-east-1a   20        20        18      18          4h15m

and no description of the gating issue is present in the machineset status.

status:
  availableReplicas: 18
  fullyLabeledReplicas: 20
  observedGeneration: 2
  readyReplicas: 18
  replicas: 20


Expected results:
As with a Deployment, DaemonSet, or similar object, I would expect an overall status message to be fed back to the high-level object.
There are ways to uncover the problem (events / messages in individual 'machine' objects), but a summary at the top-level resource (and even at the operator level) would be more consistent with Kube/OpenShift.

A high-level API for assessing whether my configuration change is progressing or failing (and why) would be useful for administrators.
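
To illustrate the gap, here is a minimal sketch of the check an administrator currently has to do by hand, using only the replica counts shown under "Actual results". It assumes the machineset has been loaded as a dict (for instance from `oc get machineset <name> -o json`) and that the desired count lives in `spec.replicas`. The function name `summarize_machineset` is hypothetical and not part of the Machine API.

```python
def summarize_machineset(machineset):
    """Return a one-line progress summary for a machineset dict.

    Hypothetical helper: it only compares replica counts and cannot
    explain why the machineset has not converged.
    """
    desired = machineset.get("spec", {}).get("replicas", 0)
    status = machineset.get("status", {})
    current = status.get("replicas", 0)
    ready = status.get("readyReplicas", 0)
    available = status.get("availableReplicas", 0)

    if ready >= desired and available >= desired:
        return f"ok: {ready}/{desired} replicas ready"
    return (
        f"not converged: desired={desired} current={current} "
        f"ready={ready} available={available}"
    )


# Status values copied from the report; spec.replicas matches DESIRED.
example = {
    "spec": {"replicas": 20},
    "status": {
        "availableReplicas": 18,
        "fullyLabeledReplicas": 20,
        "observedGeneration": 2,
        "readyReplicas": 18,
        "replicas": 20,
    },
}
print(summarize_machineset(example))
```

Such a check can say that the machineset has stalled, but not why. The missing reason is what this report asks for.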

Additional info:
See attachments for full machineset yaml / describe.

Comment 1 Antonio Murdaca 2019-04-04 19:29:03 UTC

The machine config operator component is NOT the right component for Machine API issues. Moving to Cloud Compute.

Comment 2 Vikas Choudhary 2019-04-08 07:21:00 UTC

Once https://github.com/openshift/cluster-api/pull/23 is merged in the cluster-api repo, the vendored cluster-api in the aws-actuator repo will have to be bumped to get this fix into the actuator.

Comment 3 Vikas Choudhary 2019-04-08 09:54:58 UTC

https://github.com/kubernetes-sigs/cluster-api/pull/880/files

Comment 4 Vikas Choudhary 2019-04-08 10:38:21 UTC

1. The scope of the Machine API Operator (MAO) is limited to managing the lifecycle of the different controllers and the cluster-api provider (AWS in this case). Reporting the status of inner functionality of the cluster-api stack components is not expected from the MAO.
2. The above PR will enable reporting of any machineset<->machines reconciliation failures as events on the machineset object. Further details can be found in the `InstanceState` field, https://github.com/openshift/cluster-api-provider-aws/blob/4d953241bc7f62785e0ff9f759315f386e790ba2/pkg/apis/awsproviderconfig/v1beta1/awsmachineproviderconfig_types.go#L46, of a particular machine object.
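
As a rough sketch of the per-machine lookup mentioned in point 2, the snippet below tallies `InstanceState` across a list of machine objects. It assumes the machines have been loaded as dicts and that the field is serialized as `status.providerStatus.instanceState`; that JSON path is an assumption here. The helper name `count_instance_states` and the sample data are made up for illustration.

```python
from collections import Counter


def count_instance_states(machines):
    """Tally machines by instance state (hypothetical helper).

    Machines with no providerStatus or no instanceState are
    counted under "unknown".
    """
    states = Counter()
    for machine in machines:
        provider_status = machine.get("status", {}).get("providerStatus") or {}
        states[provider_status.get("instanceState") or "unknown"] += 1
    return dict(states)


# Made-up sample: two machines with running instances, one with no instance.
machines = [
    {"status": {"providerStatus": {"instanceState": "running"}}},
    {"status": {"providerStatus": {"instanceState": "running"}}},
    {"status": {}},
]
print(count_instance_states(machines))
```

A machine stuck under "unknown" would be the one to inspect further, together with the events on the machineset.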

Comment 6 Jianwei Hou 2019-04-24 09:13:49 UTC

Verified in 4.1.0-0.nightly-2019-04-23-223857

The InstanceState in the machine's providerStatus shows information about the instance the machine is associated with. The instance limit of my account is 700, so I cannot hit it at the moment. Instead, I set a wrong AMI in the machineSet and let it scale. The error is properly reported.


```
error launching instance: error getting blockDeviceMappings: error
        describing AMI: InvalidAMIID.Malformed: Invalid id:
```

The error is also recorded in the machine-controller log.

Comment 8 errata-xmlrpc 2019-06-04 10:47:03 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
