Bug 1745939
| Summary: | [IPI] [OSP] Scale up stops when OSP return error on creating instance | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | David Sanz <dsanzmor> |
| Component: | Cloud Compute | Assignee: | egarcia |
| Status: | CLOSED NOTABUG | QA Contact: | David Sanz <dsanzmor> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.2.0 | CC: | agarcial, eduen, egarcia, jchaloup, mfedosin, mpatel, pprinett |
| Target Milestone: | --- | Flags: | dsanzmor: needinfo- |
| Target Release: | 4.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | osp | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-03-06 18:10:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
David Sanz
2019-08-27 09:48:18 UTC
Verified on 4.2.0-0.nightly-2019-09-01-224700. Scaling up the cluster until it fails: instances in ERROR state are not removed or relaunched, and the machineset condition is never reached.

Moved it to 4.3, because we can't fix it until the healthcheck controller is enabled in machine-api-operator. For 4.2 the workaround is to manually remove the machine and start the scale-up process again. This is explained in the troubleshooting doc.

The machine health check controller does not run in 4.2. Also, that's orthogonal to the title and description of this issue: an error creating a machine should not freeze the scale operation; if that happens you probably want to check the machine controller and machineSet logs. However:

> Scaling up is frozen, no more instances are created, and the failed one is not deleted or retried.

    $ oc get machineset
    NAME                       DESIRED   CURRENT   READY   AVAILABLE   AGE
    morenod-ocp-ctcj9-worker   3         3         1       1           116m

Based on that output I can't see how it's frozen, as desired=current.

> Expected results: the scaling process detects the failed instance, destroys it, and continues until the desired state is reached.

In 4.2 nothing will delete a failed instance for you but the actuator itself, if it chooses to implement that logic.

I was under the assumption that this bug was still occurring on 4.3 and 4.4. @David Sanz, can you confirm that this is still a bug in these versions?

If the bug is "Scale up stops when OSP return error on creating instance", then regardless of the release you'd want to check the machine controller and machineSet logs before anything else. However, based on the output shared above (current 3 = desired 3), I'd say this bug is wrongly named and the intent of the bug may be "failed machines are not remediated automatically". If that's the case, the answer is as above: nothing but the actuator will do that in 4.2, or the machine can be manually deleted so the machineSet reconciles again against the expected number of replicas. In >=4.3, IF you have a machineHealthCheck monitoring your machines, failed machines will be remediated based on your nodeConditions criteria or if they time out missing a node. This is opt-in at the moment.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
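As a sketch of the manual workaround mentioned above (delete the failed machine so the machineSet reconciles back to its desired replica count), assuming machines live in the usual openshift-machine-api namespace; the machine name below is hypothetical:

    # List machines and their phases; a failed instance shows a Failed phase.
    $ oc get machines -n openshift-machine-api

    # Delete the failed machine (name is hypothetical); the owning machineSet
    # then creates a replacement to get back to the desired replica count.
    $ oc delete machine morenod-ocp-ctcj9-worker-xxxxx -n openshift-machine-api

After the deletion the machineSet should create a new machine to reach DESIRED again; if it does not, check the machine controller and machineSet logs as suggested above.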
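And a minimal, hedged sketch of the opt-in machineHealthCheck mentioned for >=4.3: it selects a set of machines and remediates them when the backing node matches the listed conditions for too long, or never appears at all. The name, selector label value, timeouts and maxUnhealthy value are assumptions to adapt to your machineSet, and exact field availability may vary by release:

    $ cat <<'EOF' | oc apply -f -
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineHealthCheck
    metadata:
      name: worker-healthcheck            # illustrative name
      namespace: openshift-machine-api
    spec:
      selector:
        matchLabels:
          # must match the labels on the machines to monitor (machineSet name assumed)
          machine.openshift.io/cluster-api-machineset: morenod-ocp-ctcj9-worker
      unhealthyConditions:
      - type: Ready
        status: "False"
        timeout: 300s
      - type: Ready
        status: "Unknown"
        timeout: 300s
      maxUnhealthy: "40%"                 # stop remediating if too many machines are unhealthy at once
      nodeStartupTimeout: 10m             # remediate machines whose node never appears
    EOF

With something like this in place, failed or stuck machines matching the selector should be deleted and replaced by their owning machineSet instead of requiring the manual deletion shown earlier.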