Description of problem:
Having a healthy cluster, tried to scale up with more workers by editing the machineset:

# oc edit machineset -n openshift-machine-api

New machines are created:

# oc get machines -A
NAME                             STATE    TYPE           REGION      ZONE   AGE
morenod-ocp-ctcj9-master-0       ACTIVE   ci.m1.xlarge   regionOne   nova   115m
morenod-ocp-ctcj9-master-1       ACTIVE   ci.m1.xlarge   regionOne   nova   115m
morenod-ocp-ctcj9-master-2       ACTIVE   ci.m1.xlarge   regionOne   nova   115m
morenod-ocp-ctcj9-worker-2zgcn                                              8m46s
morenod-ocp-ctcj9-worker-p5zff                                              8m46s
morenod-ocp-ctcj9-worker-qblhj   ACTIVE   ci.m1.xlarge   regionOne   nova   25m

The first machine to be created on OSP returned "500: No valid host was found. There are not enough hosts available."

Scaling up freezes: no more instances are created, and the failed one is neither deleted nor retried.

$ oc get machineset
NAME                       DESIRED   CURRENT   READY   AVAILABLE   AGE
morenod-ocp-ctcj9-worker   3         3         1       1           116m

$ oc get nodes
NAME                             STATUS   ROLES    AGE    VERSION
morenod-ocp-ctcj9-master-0       Ready    master   118m   v1.14.0+b985ea310
morenod-ocp-ctcj9-master-1       Ready    master   118m   v1.14.0+b985ea310
morenod-ocp-ctcj9-master-2       Ready    master   118m   v1.14.0+b985ea310
morenod-ocp-ctcj9-worker-qblhj   Ready    worker   14m    v1.14.0+b985ea310

Version-Release number of the following components:
4.2.0-0.nightly-2019-08-27-061931

How reproducible:

Steps to Reproduce:
1. Install IPI on OSP
2. Force an error on OSP (if possible, for example by disabling openstack-nova-compute)
3. Scale up workers using the machineset: `oc edit machineset -n openshift-machine-api`
4. Check the status of machines, nodes and machinesets

Actual results:
If any instance fails to be created, the scale-up process freezes and does not continue.

Expected results:
The scale-up process detects the failed instance, destroys it, and continues until the desired state is reached.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag
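For step 3, an equivalent non-interactive way to bump the replica count is `oc scale` (the machineset name below is taken from the output above; replica count is an example):

```shell
# Scale the worker machineset without opening an editor
oc scale machineset morenod-ocp-ctcj9-worker --replicas=3 -n openshift-machine-api

# Then watch the new machines and nodes come up
oc get machines -n openshift-machine-api
oc get nodes
```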
Verified on 4.2.0-0.nightly-2019-09-01-224700 by scaling up the cluster until it fails: ERROR instances are not removed or relaunched, and the machineset's desired condition is never reached.
Moved it to 4.3, because we can't fix it until the health-check controller is enabled in machine-api-operator. For 4.2 the workaround is to manually remove the failed machine and start the scale-up process again. This is explained in the troubleshooting doc.
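The 4.2 workaround could look like the following sketch (the machine name is taken from the failed-machine output in the description; substitute your own failed machine):

```shell
# Identify the machine stuck without an instance/state
oc get machines -n openshift-machine-api

# Delete the failed machine; the machineSet controller will create a
# replacement to reconcile back toward the desired replica count
oc delete machine morenod-ocp-ctcj9-worker-2zgcn -n openshift-machine-api

# Watch the machineSet converge until READY matches DESIRED
oc get machineset -n openshift-machine-api -w
```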
The machine health check controller does not run in 4.2. Also, that's orthogonal to the title and description of this issue: an error creating a machine should not freeze the scale operation; if that happens you probably want to check the machine controller and machineSet logs. However:

> Scaling up is freeze, no more instances are created, and failed one is not deleted, or retried.
> $ oc get machineset
> NAME                       DESIRED   CURRENT   READY   AVAILABLE   AGE
> morenod-ocp-ctcj9-worker   3         3         1       1           116m

Based on that output I can't see how it's frozen, as desired = current.

> Expected results: Scalation process detects failed instance, destroy it, and continues until desired state

In 4.2 nothing will delete a failed instance for you except the actuator itself, if it chooses to implement that logic.
I was under the assumption that this bug was still occurring on 4.3 and 4.4. @David Sans, can you confirm that this is still a bug in these versions?
If the bug is "Scale up stops when OSP returns an error on creating an instance", then regardless of the release you'd want to check the machine controller and machineSet logs before anything else. However, based on the output shared above (current 3 = desired 3), I'd say this bug is wrongly named and its actual intent may be "failed machines are not remediated automatically". If that's the case, the answer is as above: in 4.2 nothing but the actuator would do that, or the machine can be manually deleted so the machineSet reconciles again toward the expected number of replicas. In >= 4.3, IF you have a MachineHealthCheck monitoring your machines, failed machines will be remediated based on your nodeConditions criteria or if they time out without producing a node. This is opt-in at the moment.
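For reference, an opt-in MachineHealthCheck for >= 4.3 might look roughly like this; the resource name, label selector, timeouts, and maxUnhealthy value are illustrative placeholders, so check the release's documentation for the exact schema:

```shell
oc apply -f - <<EOF
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-healthcheck          # placeholder name
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:              # node conditions that trigger remediation
  - type: Ready
    status: "False"
    timeout: 300s
  - type: Ready
    status: Unknown
    timeout: 300s
  maxUnhealthy: 40%                 # stop remediating above this threshold
EOF
```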
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days