Bug 1701334 - Nova's setting and weighing of failed_builds is highly problematic
Summary: Nova's setting and weighing of failed_builds is highly problematic
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: OSP DFG:Compute
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Duplicates: 1703110 1705930 1722201 1728335
Depends On:
Blocks:
 
Reported: 2019-04-18 16:21 UTC by vivek koul
Modified: 2023-12-15 16:27 UTC
CC: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-10 23:03:00 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Issue Tracker OSP-3129 (last updated 2022-03-10 17:12:18 UTC)

Description vivek koul 2019-04-18 16:21:53 UTC
Description of problem:
Nova scheduler does not place instances evenly across hosts

Version-Release number of selected component (if applicable):
openstack-nova-api-17.0.7-5.el7ost.noarch                   Wed Jan  9 19:28:51 2019
openstack-nova-common-17.0.7-5.el7ost.noarch                Wed Jan  9 19:27:47 2019
openstack-nova-compute-17.0.7-5.el7ost.noarch               Wed Jan  9 19:28:10 2019
openstack-nova-conductor-17.0.7-5.el7ost.noarch             Wed Jan  9 19:28:50 2019
openstack-nova-console-17.0.7-5.el7ost.noarch               Wed Jan  9 19:28:51 2019
openstack-nova-migration-17.0.7-5.el7ost.noarch             Wed Jan  9 19:28:41 2019
openstack-nova-novncproxy-17.0.7-5.el7ost.noarch            Wed Jan  9 19:28:51 2019
openstack-nova-placement-api-17.0.7-5.el7ost.noarch         Wed Jan  9 19:28:50 2019
openstack-nova-scheduler-17.0.7-5.el7ost.noarch             Wed Jan  9 19:28:51 2019
puppet-nova-12.4.0-14.el7ost.noarch                         Wed Jan  9 19:32:09 2019
python2-novaclient-10.1.0-1.el7ost.noarch                   Wed Jan  9 19:27:04 2019
python-nova-17.0.7-5.el7ost.noarch                          Wed Jan  9 19:27:46 2019


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
Instances get spawned on only two compute nodes

Expected results:
Instances should be spawned across all compute nodes


Additional info:

Comment 13 Matthew Booth 2019-05-24 13:32:25 UTC
*** Bug 1703110 has been marked as a duplicate of this bug. ***

Comment 14 Matthew Booth 2019-05-24 14:10:06 UTC
Apparently on master we're filtering some failed_build causes upstream, but it sounds like we should look at extending it.

Comment 15 melanie witt 2019-05-24 15:24:24 UTC
(In reply to Matthew Booth from comment #14)
> Apparently on master we're filtering some failed_build causes upstream, but
> it sounds like we should look at extending it.

Turns out I was mistaken about this -- there is no such filtering on master either. I was thinking of a past patch attempt [1] that ended up abandoned because of complexity and maintainability concerns around such a whitelist.

The recommended way to handle this issue in an affected environment is to disable the BuildFailureWeigher in config by setting the option:

  [filter_scheduler]build_failure_weight_multiplier = 0

[1] https://review.opendev.org/568953
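
For reference, a minimal nova.conf fragment for that workaround looks like the following (a sketch; apply it wherever nova-scheduler reads its configuration and restart the scheduler service for it to take effect):

  [filter_scheduler]
  # 0 disables the BuildFailureWeigher's penalty without removing the weigher
  build_failure_weight_multiplier = 0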

Comment 17 melanie witt 2019-06-05 16:52:50 UTC
(In reply to vivek koul from comment #16)
> Hello,
> 
> After restarting services my Cu is again seeing failed_builds for their
> compute nodes.
> I have suggested that the Cu disable the BuildFailureWeigher, but they want to
> know why they are experiencing build failures on hosts.
> 
> So for that, I did some tests on my test env.
> 
> I resized one of my test instances and it failed to resize (I did that
> intentionally). That particular instance got migrated to a different compute
> node and went into an error state.
> So my question is: should there be any failed_build for that compute node?

For the scenario you describe (intentionally causing a failed resize), it is expected that there will be a failed_build recorded for the related compute node. This is why the BuildFailureWeigher can be problematic: it does not differentiate between user-caused build failures and compute node-related build failures. Any situation where a request goes to a compute node and the instance fails to build there (even a reschedule) will cause a failed_build to be tracked by the BuildFailureWeigher.

The failed_build counter for a compute node is reset (cleared out) whenever a successful build occurs on that node. So the weigher does some self-healing, but it will still result in inconsistent instance placement whenever build failures occur. If the customer environment requires consistent placement of instances across compute nodes, it is best to disable the BuildFailureWeigher by setting [filter_scheduler]build_failure_weight_multiplier = 0.
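
To make the mechanics described above concrete, here is a simplified, hypothetical sketch (not the actual Nova source; the names and the multiplier value are illustrative) of how a per-host failed_builds counter combined with a negative weight skews host selection:

    # Simplified, illustrative sketch of the behavior described above -- not the
    # actual Nova code. Class/function names and the multiplier are hypothetical.

    BUILD_FAILURE_WEIGHT_MULTIPLIER = 1000000.0  # 0 would disable the penalty


    class HostState:
        """Minimal stand-in for the scheduler's per-compute-node state."""

        def __init__(self, name):
            self.name = name
            self.failed_builds = 0

        def record_build_result(self, succeeded):
            # Any failed build on this node (including a failed reschedule or a
            # failed resize that landed here) bumps the counter; any successful
            # build clears it.
            if succeeded:
                self.failed_builds = 0
            else:
                self.failed_builds += 1


    def build_failure_weight(host):
        # More recorded failures -> more negative weight -> the host sorts below
        # otherwise-equal hosts, regardless of why the builds failed.
        return -BUILD_FAILURE_WEIGHT_MULTIPLIER * host.failed_builds


    if __name__ == "__main__":
        a, b = HostState("compute-0"), HostState("compute-1")
        b.record_build_result(succeeded=False)  # e.g. the intentionally failed resize
        ranked = sorted([a, b], key=build_failure_weight, reverse=True)
        print([h.name for h in ranked])  # compute-0 is preferred until compute-1
                                         # records a successful build again

The key point the sketch illustrates is that the counter is agnostic to the cause of the failure, which is why even a user-triggered resize failure penalizes the node in subsequent scheduling decisions.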

Comment 21 Artom Lifshitz 2019-06-14 15:17:23 UTC
*** Bug 1705930 has been marked as a duplicate of this bug. ***

Comment 22 Matthew Booth 2019-06-21 12:09:05 UTC
*** Bug 1722201 has been marked as a duplicate of this bug. ***

Comment 23 Matthew Booth 2019-07-11 15:20:37 UTC
*** Bug 1728335 has been marked as a duplicate of this bug. ***

Comment 30 Red Hat Bugzilla 2023-09-18 00:16:02 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

