Description of problem:
Having a healthy cluster, tried to scale up with more workers by editing the machineset:

# oc edit machineset -n openshift-machine-api

New machines are created:

# oc get machines -A
NAME                             STATE    TYPE           REGION      ZONE   AGE
morenod-ocp-ctcj9-master-0       ACTIVE   ci.m1.xlarge   regionOne   nova   115m
morenod-ocp-ctcj9-master-1       ACTIVE   ci.m1.xlarge   regionOne   nova   115m
morenod-ocp-ctcj9-master-2       ACTIVE   ci.m1.xlarge   regionOne   nova   115m
morenod-ocp-ctcj9-worker-2zgcn                                              8m46s
morenod-ocp-ctcj9-worker-p5zff                                              8m46s
morenod-ocp-ctcj9-worker-qblhj   ACTIVE   ci.m1.xlarge   regionOne   nova   25m

The first machine to be created on OSP returned "500: No valid host was found. There are not enough hosts available."

Scaling up freezes: no more instances are created, and the failed one is neither deleted nor retried.

$ oc get machineset
NAME                       DESIRED   CURRENT   READY   AVAILABLE   AGE
morenod-ocp-ctcj9-worker   3         3         1       1           116m

$ oc get nodes
NAME                             STATUS   ROLES    AGE    VERSION
morenod-ocp-ctcj9-master-0       Ready    master   118m   v1.14.0+b985ea310
morenod-ocp-ctcj9-master-1       Ready    master   118m   v1.14.0+b985ea310
morenod-ocp-ctcj9-master-2       Ready    master   118m   v1.14.0+b985ea310
morenod-ocp-ctcj9-worker-qblhj   Ready    worker   14m    v1.14.0+b985ea310

Version-Release number of the following components:
4.2.0-0.nightly-2019-08-27-061931

How reproducible:

Steps to Reproduce:
1. Install IPI on OSP
2. Force an error on OSP (if possible, for example by disabling openstack-nova-compute)
3. Scale up workers using the machineset: `oc edit machineset -n openshift-machine-api`
4. Check the status of machines, nodes and machinesets

Actual results:
If any instance fails to be created, the scale-up process freezes and does not continue.

Expected results:
The scale-up process detects the failed instance, destroys it, and continues until the desired state is reached.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag
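For step 3, an equivalent non-interactive way to bump the replica count is `oc scale` (the machineset name below is taken from the output above; replica count is an example):

```shell
# Scale the worker machineset without opening an editor
oc scale machineset morenod-ocp-ctcj9-worker --replicas=3 -n openshift-machine-api

# Then watch the new machines and nodes come up
oc get machines -n openshift-machine-api
oc get nodes
```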
Verified on 4.2.0-0.nightly-2019-09-01-224700 by scaling up the cluster until it fails: ERROR instances are not removed or relaunched, and the machineset's desired condition is never reached.
Moved it to 4.3, because we can't fix it until the health-check controller is enabled in machine-api-operator. For 4.2 the workaround is to manually remove the failed machine and start the scale-up process again. This is explained in the troubleshooting doc.
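The 4.2 workaround could look like the following sketch (the machine name is taken from the failed-machine output in the description; substitute your own failed machine):

```shell
# Identify the machine stuck without an instance/state
oc get machines -n openshift-machine-api

# Delete the failed machine; the machineSet controller will create a
# replacement to reconcile back toward the desired replica count
oc delete machine morenod-ocp-ctcj9-worker-2zgcn -n openshift-machine-api

# Watch the machineSet converge until READY matches DESIRED
oc get machineset -n openshift-machine-api -w
```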
The machine health check controller does not run in 4.2. Also, that's orthogonal to the title and description of this issue: an error creating a machine should not freeze the scale operation; if that happens you probably want to check the machine controller and machineSet logs. However:

> Scaling up is freeze, no more instances are created, and failed one is not deleted, or retried.
> $ oc get machineset
> NAME                       DESIRED   CURRENT   READY   AVAILABLE   AGE
> morenod-ocp-ctcj9-worker   3         3         1       1           116m

Based on that output I can't see how it's frozen, as desired = current.

> Expected results: Scalation process detects failed instance, destroy it, and continues until desired state

In 4.2 nothing will delete a failed instance for you except the actuator itself, if it chooses to implement that logic.
I was under the assumption that this bug was still occurring on 4.3 and 4.4. @David Sans, can you confirm that this is still a bug in these versions?
If the bug is "Scale up stops when OSP returns an error on creating an instance", then regardless of the release you'd want to check the machine controller and machineSet logs before anything else. However, based on the output shared above (current 3 = desired 3), I'd say this bug is wrongly named and its actual intent may be "failed machines are not remediated automatically". If that's the case, the answer is as above: in 4.2 nothing but the actuator would do that, or the machine can be manually deleted so the machineSet reconciles again toward the expected number of replicas. In >= 4.3, IF you have a MachineHealthCheck monitoring your machines, failed machines will be remediated based on your nodeConditions criteria or if they time out without producing a node. This is opt-in at the moment.
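For reference, an opt-in MachineHealthCheck for >= 4.3 might look roughly like this; the resource name, label selector, timeouts, and maxUnhealthy value are illustrative placeholders, so check the release's documentation for the exact schema:

```shell
oc apply -f - <<EOF
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-healthcheck          # placeholder name
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:              # node conditions that trigger remediation
  - type: Ready
    status: "False"
    timeout: 300s
  - type: Ready
    status: Unknown
    timeout: 300s
  maxUnhealthy: 40%                 # stop remediating above this threshold
EOF
```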
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days