Description of problem:
Nova scheduler does not place instances evenly across hosts.

Version-Release number of selected component (if applicable):
openstack-nova-api-17.0.7-5.el7ost.noarch            Wed Jan 9 19:28:51 2019
openstack-nova-common-17.0.7-5.el7ost.noarch         Wed Jan 9 19:27:47 2019
openstack-nova-compute-17.0.7-5.el7ost.noarch        Wed Jan 9 19:28:10 2019
openstack-nova-conductor-17.0.7-5.el7ost.noarch      Wed Jan 9 19:28:50 2019
openstack-nova-console-17.0.7-5.el7ost.noarch        Wed Jan 9 19:28:51 2019
openstack-nova-migration-17.0.7-5.el7ost.noarch      Wed Jan 9 19:28:41 2019
openstack-nova-novncproxy-17.0.7-5.el7ost.noarch     Wed Jan 9 19:28:51 2019
openstack-nova-placement-api-17.0.7-5.el7ost.noarch  Wed Jan 9 19:28:50 2019
openstack-nova-scheduler-17.0.7-5.el7ost.noarch      Wed Jan 9 19:28:51 2019
puppet-nova-12.4.0-14.el7ost.noarch                  Wed Jan 9 19:32:09 2019
python2-novaclient-10.1.0-1.el7ost.noarch            Wed Jan 9 19:27:04 2019
python-nova-17.0.7-5.el7ost.noarch                   Wed Jan 9 19:27:46 2019

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
Instances get spawned on only two computes.

Expected results:
Instances should be spawned on all compute nodes.

Additional info:
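As a quick way to confirm the uneven placement described above, the number of instances per compute host can be counted with the one-liner below. This is a hedged example: it requires admin credentials, and the exact column names can vary between python-openstackclient versions.

    $ openstack server list --all-projects --long -c Host -f value | sort | uniq -c

A heavily skewed count (e.g. almost all instances on two hosts) is consistent with the behaviour reported in this bug.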
*** Bug 1703110 has been marked as a duplicate of this bug. ***
Apparently on master we're filtering some failed_build causes upstream, but it sounds like we should look at extending it.
(In reply to Matthew Booth from comment #14)
> Apparently on master we're filtering some failed_build causes upstream, but
> it sounds like we should look at extending it.

Turns out I was mistaken about this -- there is no such filtering on master either. I was thinking of a past patch attempt [1] that ended up abandoned because of complexity and maintainability concerns around having such a whitelist. The recommended way to handle this issue in an affected environment is to disable the BuildFailureWeigher by setting the config option [filter_scheduler]build_failure_weight_multiplier = 0 (see the snippet below).

[1] https://review.opendev.org/568953
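For reference, in nova.conf on the node(s) running nova-scheduler the setting would look like the snippet below; the nova-scheduler service needs a restart afterwards for the change to take effect.

    [filter_scheduler]
    build_failure_weight_multiplier = 0.0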
(In reply to vivek koul from comment #16)
> Hello,
>
> After restarting services my Cu is again seeing failed_builds for their
> compute nodes.
> I have suggested the Cu disable the BuildFailureWeigher, but they want to
> know the reason why they are experiencing build failures on hosts.
>
> So for that, I did some tests on my test env.
>
> I resized one of my test instances and it failed to resize (I did that
> intentionally). That particular instance got migrated to a different
> compute and went into an error state.
> So my question is: should there be a failed_build for that compute node?

For the scenario you describe (intentionally causing a failed resize), it is expected that a failed_build will be recorded for the related compute node. This is why the BuildFailureWeigher can be problematic: it does not differentiate between user-caused build failures and compute node-related build failures. Any request that reaches a compute node and fails to build the instance (even a reschedule) causes a failed_build to be tracked by the BuildFailureWeigher.

The failed_build counter is reset (cleared out) for a compute node when any build succeeds on that node. So the weigher does some self-healing, but it will still produce inconsistent instance placement whenever build failures occur. If the customer environment requires consistent placement of instances across compute nodes, it is best to disable the BuildFailureWeigher by setting [filter_scheduler]build_failure_weight_multiplier = 0, as the sketch below illustrates.
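To make the behaviour concrete, here is a minimal self-contained sketch of the weighing logic. This is a simplification, not the upstream code (the real weigher lives in nova/scheduler/weights/compute.py), and 1000000.0 is assumed as the default multiplier; verify the default against your release.

    # Simplified sketch of how the BuildFailureWeigher biases placement.
    # The multiplier is negated, so every recorded build failure pushes
    # a host further down the weighed list.

    DEFAULT_MULTIPLIER = 1000000.0  # assumed default for build_failure_weight_multiplier

    def build_failure_weight(failed_builds, multiplier=DEFAULT_MULTIPLIER):
        # failed_builds is the count of consecutive failed builds reported
        # by the compute node; it is reset to 0 by the next successful
        # build on that node.
        return -1.0 * multiplier * failed_builds

    # One failed build with the default multiplier dwarfs the other
    # weighers, so the scheduler avoids the host until a build succeeds
    # there. With the multiplier set to 0, the contribution vanishes.
    print(build_failure_weight(1))                # -1000000.0
    print(build_failure_weight(1, multiplier=0))  # -0.0 (effectively disabled)

This also shows why placement stays skewed even after services are restarted: as long as a host's failed_builds count is non-zero and the multiplier is non-zero, that host loses to every host with a zero count.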
*** Bug 1705930 has been marked as a duplicate of this bug. ***
*** Bug 1722201 has been marked as a duplicate of this bug. ***
*** Bug 1728335 has been marked as a duplicate of this bug. ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days