Bug 1189906 - Force move operations don't call nova scheduler filters
Summary: Force move operations don't call nova scheduler filters
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 6.0 (Juno)
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Sylvain Bauza
QA Contact: Joe H. Rahme
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-02-05 19:01 UTC by Joe Talerico
Modified: 2021-12-10 14:35 UTC
CC List: 13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-15 09:16:34 UTC
Target Upstream Version:
Embargoed:




Links
Launchpad 1427772 (last updated: never)
Red Hat Issue Tracker OSP-11258 (last updated: 2021-12-10 14:35:22 UTC)

Description Joe Talerico 2015-02-05 19:01:33 UTC
Description of problem:
Setting the overcommit ratios to 1 for CPU and memory, and loading a single compute node with guests having hw:numa_nodes=1, I was expecting to see node0 fill with guests and then spill over to node1; however, the guests never switched to node1.


How reproducible:
100%

Steps to Reproduce:
1. Set the CPU and memory overcommit ratios to 1.
2. Enable the scheduler filters, including NUMATopologyFilter (see the sketch below these steps).
3. Restart the nova scheduler.
4. Launch guests with hw:numa_nodes=1 on a single host.
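A minimal sketch of the configuration used, assuming the filter list from comment 3 below; the flavor and image names are placeholders:

# /etc/nova/nova.conf
cpu_allocation_ratio = 1.0
ram_allocation_ratio = 1.0
scheduler_default_filters=RetryFilter,AvailabilityZoneFilter,RamFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,NUMATopologyFilter

# restart the scheduler (RHEL 7 service name), then boot guests with a NUMA-aware flavor
$ systemctl restart openstack-nova-scheduler
$ nova flavor-key <flavor> set hw:numa_nodes=1
$ nova boot --image <image> --flavor <flavor> guest-1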

Actual results:
https://gist.github.com/jtaleric/9f7b8b1b9db82fd3dbb5

Comment 3 Nikola Dipanov 2015-02-11 10:25:28 UTC
@Joe - so trying this out on a recent master checkout does not seem to reproduce the problem. It is of course possible that this is a RHOS 6.0-only issue, but we want to align our testing strategies. Here are some more details on the testing strategy I used, as it is slightly different from the one reported, and we should make sure we are testing the same thing:

I have a VM set up to use nested virt, as follows (virsh dumpxml snippet):

  <cpu mode='host-passthrough'>
    <numa>
      <cell id='0' cpus='0' memory='2097152'/>
      <cell id='1' cpus='1' memory='2097152'/>
    </numa>
  </cpu>

So then make sure that the following lines are in your /etc/nova/nova.conf file:

scheduler_default_filters=RetryFilter,AvailabilityZoneFilter,RamFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,NUMATopologyFilter
cpu_allocation_ratio = 2.0

And restart the Nova scheduler service.

And make sure that the extra specs are set on the flavor you want to use, as such:

$ nova flavor-key 42 set hw:numa_nodes=1
$ nova flavor-show 42

+----------------------------+------------------------+
| Property                   | Value                  |
+----------------------------+------------------------+
| OS-FLV-DISABLED:disabled   | False                  |
| OS-FLV-EXT-DATA:ephemeral  | 0                      |
| disk                       | 0                      |
| extra_specs                | {"hw:numa_nodes": "1"} |
| id                         | 42                     |
| name                       | m1.nano                |
| os-flavor-access:is_public | True                   |
| ram                        | 64                     |
| rxtx_factor                | 1.0                    |
| swap                       |                        |
| vcpus                      | 1                      |
+----------------------------+------------------------+

So then I go on to boot instances (they are going to fill up the NUMA CPU limit; with cpu_allocation_ratio = 2.0 and one CPU per NUMA cell, only 4 single-vCPU guests fit):

$ nova boot --image cirros-0.3.2-x86_64-uec --flavor 42 testnuma

The results are as expected: instance 5 fails to boot, and the NUMA data in the booted instances' XML is as expected. I use the following one-liner:

$ virsh list --all | grep instance | awk '{ print $1 }' | xargs -I {} sh -c 'virsh dumpxml {} | grep '$xmltoken' && echo "=="'

And then setting xmltoken to 'memnode' gives me:

    <memnode cellid='0' mode='strict' nodeset='0'/>
==
    <memnode cellid='0' mode='strict' nodeset='0'/>
==
    <memnode cellid='0' mode='strict' nodeset='1'/>
==
    <memnode cellid='0' mode='strict' nodeset='1'/>
==

And setting it to vcpupin gives me:

    <vcpupin vcpu='0' cpuset='0'/>
==
    <vcpupin vcpu='0' cpuset='0'/>
==
    <vcpupin vcpu='0' cpuset='1'/>
==
    <vcpupin vcpu='0' cpuset='1'/>
==

Both of which are exactly as expected: the instances get stacked on the first node, then moved to the next node, and then the NUMATopologyFilter stops scheduling them, as per the policy set.

I will now proceed to test this with RHOS 6.0 just to make sure, but auditing the code shows very little difference in the key bits. It would be great if you could also re-run the same tests as above, just to make sure we are checking the same thing.

Comment 4 Joe Talerico 2015-02-13 11:43:22 UTC
Packages on my compute node:
openstack-nova-compute-2014.2.1-14.el7ost.noarch
python-novaclient-2.20.0-1.el7ost.noarch
python-nova-2014.2.1-14.el7ost.noarch
openstack-nova-common-2014.2.1-14.el7ost.noarch

Output from running the example above:
https://gist.github.com/jtaleric/4f4b9dd65b47982a30f9

Comment 5 Nikola Dipanov 2015-02-13 17:52:39 UTC
After taking a look at the test environment that Joe was using, it turns out that he was using the --availability-zone flag on boot with a host specified, which results in force_host being set.

This in turn completely bypasses the scheduling logic (a well-known misfeature of the Nova scheduler). Normally this is fine, because the claims logic on the compute node will reject the request if the claim does not pass.

This, however, breaks down for NUMA instances: the claiming logic there relies on overcommit limits that are set by the filters in the scheduler code, but since the filters are never run, the limits are never set. The claims logic on compute nodes treats missing limits as unlimited, and since NUMA guests are "stacked" by default, the result is that we keep stacking the first NUMA node indefinitely. (Note that requests that do go through the scheduler are scheduled properly and spread onto other nodes as expected.)
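For illustration, a boot request that forces a specific host through the availability-zone syntax (and therefore skips the scheduler filters) looks roughly like the following; the image, flavor and host names are placeholders:

$ nova boot --image <image> --flavor <flavor> --availability-zone nova:compute-0.localdomain forced-guest

A request that names only the zone (no host) still passes through the filters and is placed normally:

$ nova boot --image <image> --flavor <flavor> --availability-zone nova normal-guest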

Comment 6 Joe Talerico 2015-02-13 20:32:06 UTC
Validated the above: disabling a compute node and running the same test results in node0 filling up and then spilling over to node1.

[root@macb8ca3a60ff54 ~]# virsh list --all | grep instance | awk '{ print $1 }' | xargs -I {} sh -c 'virsh dumpxml {} | grep '$xmltoken' && echo "=="'
    <vcpupin vcpu='0' cpuset='0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30'/>
==
    <vcpupin vcpu='0' cpuset='0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30'/>
==
    <vcpupin vcpu='0' cpuset='0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30'/>
==
    <vcpupin vcpu='0' cpuset='0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30'/>
==
    <vcpupin vcpu='0' cpuset='0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30'/>
==
    <vcpupin vcpu='0' cpuset='0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30'/>
==
    <vcpupin vcpu='0' cpuset='0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30'/>
==
    <vcpupin vcpu='0' cpuset='0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30'/>
==
    <vcpupin vcpu='0' cpuset='0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30'/>
==
    <vcpupin vcpu='0' cpuset='0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30'/>
==
    <vcpupin vcpu='0' cpuset='0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30'/>
==
    <vcpupin vcpu='0' cpuset='0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30'/>
==
    <vcpupin vcpu='0' cpuset='0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30'/>
==
    <vcpupin vcpu='0' cpuset='0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30'/>
==
    <vcpupin vcpu='0' cpuset='0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30'/>
==
    <vcpupin vcpu='0' cpuset='0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30'/>
==
    <vcpupin vcpu='0' cpuset='1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31'/>
==
    <vcpupin vcpu='0' cpuset='1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31'/>
==
    <vcpupin vcpu='0' cpuset='1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31'/>
==
    <vcpupin vcpu='0' cpuset='1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31'/>
==
    <vcpupin vcpu='0' cpuset='1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31'/>
==
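(For reference, the second compute node was presumably taken out of scheduling with something along these lines; the host name is a placeholder:)

$ nova service-disable compute-1.localdomain nova-compute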

Comment 7 Nikola Dipanov 2015-03-11 13:13:42 UTC
There is now a bug reported upstream to track the work needed to make sure the filters are re-run as required.

Comment 14 Stan Toporek 2018-10-22 21:24:00 UTC
Spawning instances is limited to one or two at a time.

So each compute has a total of 56 threads (28 pCores), of which 52 are usable for pinned vCPUs (1 pCore, i.e. 2 vCPUs, is reserved per NUMA zone). This means that each NUMA zone has 26 vCPUs (threads). The 4 virtual machines that are being spawned use the following flavor:

[stack@cougar-director ~]$ openstack flavor show UUID
+----------------------------+---------------------------------+
| Field                      | Value                           |
+----------------------------+---------------------------------+
| OS-FLV-DISABLED:disabled   | False                           |
| OS-FLV-EXT-DATA:ephemeral  | 0                               |
| access_project_ids         | None                            |
| disk                       | 10                              |
| id                         | UUID                            |
| name                       | vMTAS_VM_60                     |
| os-flavor-access:is_public | True                            |
| properties                 | hw:cpu_policy='dedicated',      |
|                            | hw:cpu_thread_policy='require', |
|                            | hw:mem_page_size='2048'         |
| ram                        | 61440                           |
| rxtx_factor                | 1.0                             |
| swap                       |                                 |
| vcpus                      | 20                              |
+----------------------------+---------------------------------+

So as you can see the vCPU count here is 20 per VM, which means we should be able to fit 4 of them onto the two hosts (one per NUMA zone). However, when attempting to spawn 3-4 of these VMs at once, the scheduler fails and all VMs end up with a "No valid host was found" error. If the VMs are spawned in a 2+2 fashion, or one at a time, all 4 can be spawned without issue.

With this behavior it looks like the Nova scheduler is only checking the first NUMA zone for placement, which is not correct behavior. Even if the scheduler cannot find a proper placement for the second set of VMs, the entire spawn request should not end up in a failed state. If you have some time we can do a quick WebEx and I can show you the issue; it should only take about 10 minutes to demonstrate the two cases (a rough command sketch follows the list):

1. Spawn 3-4 VMs with the flavor above, which causes a failure to spawn for all VMs.
2. Spawn the VMs in a 2+2 fashion (4 total), which succeeds.
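A rough sketch of the two cases, assuming the flavor shown above; the image, network and server names are placeholders:

# Case 1: request 4 servers in a single call -- they all fail with "No valid host was found"
$ openstack server create --flavor vMTAS_VM_60 --image <image> --nic net-id=<network> --min 4 --max 4 vmtas

# Case 2: request them two at a time (or one by one) -- all four spawn successfully
$ openstack server create --flavor vMTAS_VM_60 --image <image> --nic net-id=<network> --min 2 --max 2 vmtas-a
$ openstack server create --flavor vMTAS_VM_60 --image <image> --nic net-id=<network> --min 2 --max 2 vmtas-b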

Comment 19 Matthew Booth 2019-10-15 09:16:34 UTC
I am closing this bug as it has not been addressed for a very long time. Please feel free to reopen if it is still relevant.

