Bug 1696407
Summary: | non-intuitive to detect a machineset scaling issue | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Justin Pierce <jupierce>
Component: | Cloud Compute | Assignee: | Vikas Choudhary <vichoudh>
Status: | CLOSED ERRATA | QA Contact: | Jianwei Hou <jhou>
Severity: | low | Priority: | low
Version: | 4.1.0 | Target Release: | 4.1.0
Hardware: | Unspecified | OS: | Unspecified
CC: | agarcial, vichoudh | Type: | Bug
Doc Type: | If docs needed, set a value | Last Closed: | 2019-06-04 10:47:03 UTC
Attachments: | listings (attachment 1552066) | |
The machine config operator component is NOT the right component for Machine API issues. Moving to Cloud Compute.

Once https://github.com/openshift/cluster-api/pull/23 is merged in the cluster-api repo, the vendored cluster-api in the aws-actuator repo will have to be bumped to get this fix into the actuator.

1. The scope of the Machine API Operator is limited to managing the lifecycle of the different controllers and the cluster-api provider (AWS in this case). Reporting internal status from the cluster-api stack components is not expected from the MAO.
2. The PR above will enable reporting of any machineset<->machines reconciliation failures as events on the machineset object. Further details can be discovered by looking at the `InstanceState` field (https://github.com/openshift/cluster-api-provider-aws/blob/4d953241bc7f62785e0ff9f759315f386e790ba2/pkg/apis/awsproviderconfig/v1beta1/awsmachineproviderconfig_types.go#L46) in a particular machine object.

Verified in 4.1.0-0.nightly-2019-04-23-223857.

The `InstanceState` in a machine's providerStatus shows the state of the instance the machine is associated with. The instance limit of my account is 700, so I cannot hit it at the moment. I am able to set a wrong AMI in the machineSet and let it scale; the error is properly reported:

```
error launching instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id:
```

The error is also recorded in the machine-controller log.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
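For illustration, the signals described in the verification above could be inspected with commands along these lines (resource names are placeholders, and the `instanceState` field path is an assumption based on the linked provider types, not output captured in this bug):

```
# Reconciliation failures should surface as events on the machineset object
oc describe machineset <machineset-name> -n openshift-machine-api

# InstanceState reported in an individual machine's providerStatus
# (field path assumed from the AWSMachineProviderStatus type linked above)
oc get machine <machine-name> -n openshift-machine-api \
  -o jsonpath='{.status.providerStatus.instanceState}'
```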
Created attachment 1552066 [details] listings

Description of problem:
When the replicas of a machineset are increased, the machine-api controller goes to work scaling the cluster. However, if an issue is hit (e.g. the AWS instance limit), the administrator does not receive this feedback in the resources they might expect. For example:
1) The issue is not reflected in the machine api operator status
2) It is not reflected in messages or status of the machineset

Version-Release number of selected component (if applicable):
4.0.0-0.9

How reproducible:
100%

Steps to Reproduce:
1. oc edit machineset to have replicas > your ec2 instance limit in AWS

Actual results:
Observe the machineset resource. In this case, it will sit indefinitely at less than the 'desired' count:

```
NAME                            DESIRED   CURRENT   READY   AVAILABLE   AGE
int-1-hhxh8-worker-us-east-1a   20        20        18      18          4h15m
```

and no description of the gating issue is present in the machineset status:

```
status:
  availableReplicas: 18
  fullyLabeledReplicas: 20
  observedGeneration: 2
  readyReplicas: 18
  replicas: 20
```

Expected results:
Like a Deployment/DaemonSet/etc. object, I would expect an overall status message to be fed back to the high-level object. There are ways to uncover the problem (events / messages in individual 'machine' objects; see the illustrative commands below), but a summary at the top-level resource (and even at the operator level) would be more consistent with Kube/OpenShift. A high-level API for assessing whether my configuration change is progressing or failing (and why) would be useful for administrators.

Additional info:
See attachments for full machineset yaml / describe.
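As a rough sketch of the per-machine workaround mentioned in the expected results (object names are placeholders; this is illustrative, not output from the attached listings):

```
# List the individual machines backing the machineset; instances that
# failed to launch will not reach a Running state
oc get machines -n openshift-machine-api

# Events on a specific machine carry the underlying cloud-provider error
oc describe machine <machine-name> -n openshift-machine-api

# Or filter namespace events down to Machine objects
oc get events -n openshift-machine-api --field-selector involvedObject.kind=Machine
```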