Bug 1745939
| Summary: | [IPI] [OSP] Scale up stops when OSP return error on creating instance | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | David Sanz <dsanzmor> |
| Component: | Cloud Compute | Assignee: | egarcia |
| Status: | CLOSED NOTABUG | QA Contact: | David Sanz <dsanzmor> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.2.0 | CC: | agarcial, eduen, egarcia, jchaloup, mfedosin, mpatel, pprinett |
| Target Milestone: | --- | Flags: | dsanzmor: needinfo- |
| Target Release: | 4.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | osp | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-03-06 18:10:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
David Sanz
2019-08-27 09:48:18 UTC
Verified on 4.2.0-0.nightly-2019-09-01-224700. Scaling up the cluster until it fails: instances in ERROR state are not removed or relaunched, and the machineset condition is never reached.

Moved it to 4.3, because we can't fix it until the healthcheck controller is enabled in machine-api-operator. For 4.2 the workaround is to manually remove the machine and start the scale-up process again. This is explained in the troubleshooting doc.

The machine health check controller does not run in 4.2. Also, that's orthogonal to the title and description of this issue: an error creating a machine should not freeze the scale operation; if that happens you probably want to check the machine controller and machineSet logs. However:

> Scaling up is frozen, no more instances are created, and the failed one is not deleted or retried.

    $ oc get machineset
    NAME                       DESIRED   CURRENT   READY   AVAILABLE   AGE
    morenod-ocp-ctcj9-worker   3         3         1       1           116m

Based on that output I can't see how it's frozen, as desired=current.

> Expected results: the scaling process detects the failed instance, destroys it, and continues until the desired state is reached.

In 4.2 nothing will delete a failed instance for you but the actuator itself, if it chooses to implement that logic.

I was under the assumption that this bug was still occurring on 4.3 and 4.4. @David Sanz, can you confirm that this is still a bug in these versions?

If the bug is "Scale up stops when OSP return error on creating instance", then regardless of the release you'd want to check the machine controller and machineSet logs before anything else. However, based on the output shared above (current 3 = desired 3), I'd say this bug is wrongly named and the intent of the bug may be "failed machines are not remediated automatically". If that's the case, the answer is as above: nothing but the actuator will do that in 4.2, or the machine can be manually deleted so the machineSet reconciles again against the expected number of replicas. In >=4.3, IF you have a machineHealthCheck monitoring your machines, failed machines will be remediated based on your nodeConditions criteria or if they time out missing a node. This is opt-in at the moment.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
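As a sketch of the manual workaround mentioned above (delete the failed machine so the machineSet reconciles back to its desired replica count), assuming machines live in the usual openshift-machine-api namespace; the machine name below is hypothetical:

    # List machines and their phases; a failed instance shows a Failed phase.
    $ oc get machines -n openshift-machine-api

    # Delete the failed machine (name is hypothetical); the owning machineSet
    # then creates a replacement to get back to the desired replica count.
    $ oc delete machine morenod-ocp-ctcj9-worker-xxxxx -n openshift-machine-api

After the deletion the machineSet should create a new machine to reach DESIRED again; if it does not, check the machine controller and machineSet logs as suggested above.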
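And a minimal, hedged sketch of the opt-in machineHealthCheck mentioned for >=4.3: it selects a set of machines and remediates them when the backing node matches the listed conditions for too long, or never appears at all. The name, selector label value, timeouts and maxUnhealthy value are assumptions to adapt to your machineSet, and exact field availability may vary by release:

    $ cat <<'EOF' | oc apply -f -
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineHealthCheck
    metadata:
      name: worker-healthcheck            # illustrative name
      namespace: openshift-machine-api
    spec:
      selector:
        matchLabels:
          # must match the labels on the machines to monitor (machineSet name assumed)
          machine.openshift.io/cluster-api-machineset: morenod-ocp-ctcj9-worker
      unhealthyConditions:
      - type: Ready
        status: "False"
        timeout: 300s
      - type: Ready
        status: "Unknown"
        timeout: 300s
      maxUnhealthy: "40%"                 # stop remediating if too many machines are unhealthy at once
      nodeStartupTimeout: 10m             # remediate machines whose node never appears
    EOF

With something like this in place, failed or stuck machines matching the selector should be deleted and replaced by their owning machineSet instead of requiring the manual deletion shown earlier.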