Bug 1952073

Summary: [OSP16.1] Failed to schedule VMs with minimum bandwidth - SR-IOV bandwidth aware scheduling
Product: Red Hat OpenStack
Reporter: Vadim Khitrin <vkhitrin>
Component: openstack-nova
Assignee: OSP DFG:Compute <osp-dfg-compute>
Status: CLOSED DUPLICATE
QA Contact: OSP DFG:Compute <osp-dfg-compute>
Severity: high
Docs Contact:
Priority: unspecified
Version: 16.1 (Train)
CC: dasmith, eglynn, fbaudin, hakhande, jhakimra, jparker, kchamart, oblaut, sbauza, sgordon, smooney, vromanso
Target Milestone: ---
Keywords: Regression
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-05-05 19:09:15 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Vadim Khitrin 2021-04-21 12:58:21 UTC
Description of problem:
Previously (in OSP16 release) we were able to use bandwidth aware scheduling in order to schedule an instance with minimum bandwidth.
It appears that this is no longer working in our 16.1 deployments.

We have supplied the correct 'NeutronSriovResourceProviderBandwidths' TripleO parameter and confirmed that the correct 'resource_provider_bandwidths' values were populated in the SR-IOV NIC agent configuration.

The error that we are receiving in the log is:
nova-scheduler.log:2021-04-21 10:04:24.296 23 ERROR nova.scheduler.client.report [req-91a26cb1-2823-433a-9ef7-22f5d997f789 42b9e8e4fc67453aaecb179e1417bda0 61f0d411b9c4486c9cec8144367f3b6e - default default] Failed to retrieve allocation candidates from placement API for filters: RequestGroup(aggregates=[],forbidden_aggregates=set([]),forbidden_traits=set(['COMPUTE_STATUS_DISABLED']),in_tree=None,provider_uuids=[],requester_id=None,required_traits=set(['COMPUTE_IMAGE_TYPE_QCOW2']),resources={DISK_GB=20,MEMORY_MB=8192,PCPU=6},use_same_provider=False), RequestGroup(aggregates=[],forbidden_aggregates=set([]),forbidden_traits=set([]),in_tree=None,provider_uuids=[],requester_id='a0f133d7-60fb-4477-8c69-21b70a9c4c57',required_traits=set(['CUSTOM_PHYSNET_MELLANOX_SRIOV_1','CUSTOM_VNIC_TYPE_DIRECT']),resources={NET_BW_EGR_KILOBIT_PER_SEC=25000000},use_same_provider=True)


Version-Release number of selected component (if applicable):
RHOS-16.1-RHEL-8-20210415.n.0

How reproducible:
In all of the setups we have tried (OVS/OVN backends and even 16.2 which is out of scope of this bug).

Steps to Reproduce:
1. Deploy an environment with the required bandwidth parameters
2. Attempt to spawn an instance with SR-IOV minimum bandwidth.

Actual results:
Instance fails to spawn.

Expected results:
Instance spawns successfully.

Additional info:
Will provide SOS report in comments.

Comment 2 Vadim Khitrin 2021-04-27 11:32:56 UTC
I think this is a misconfiguration in my deployment. After looking at the documentation again (https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html/configuration_reference/neutron_2), I noticed that I am missing the 'resource_provider_hypervisors' parameter, which is not yet exposed through TripleO (BZ#1936383).

Regardless, I will try to add the parameter manually and will update this bug.

Comment 3 smooney 2021-04-28 15:58:05 UTC
You might be hitting https://bugzilla.redhat.com/show_bug.cgi?id=1949385. I have not fully looked at your sos reports to confirm, but the default hostnames should match.

You should not need to update resource_provider_hypervisors, just resource_provider_bandwidths.

Comment 4 smooney 2021-05-05 15:45:47 UTC

Looking at the SR-IOV agent config, I can see that the bandwidth is correctly listed:

[sriov_nic]
physical_device_mappings=sriov-1:enp6s0f2,sriov-2:enp6s0f3,mellanox-sriov-1:enp4s0f0,mellanox-sriov-2:enp4s0f1
resource_provider_bandwidths=enp6s0f2:10000000:10000000,enp6s0f3:10000000:10000000,enp4s0f0:40000000:40000000,enp4s0f1:40000000:40000000
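
As a rough cross-check of the config above against the failing request (CUSTOM_PHYSNET_MELLANOX_SRIOV_1 with NET_BW_EGR_KILOBIT_PER_SEC=25000000), the following Python sketch parses the two options and lists the devices that could satisfy it on paper. The function names are hypothetical, the bandwidth format is assumed to be device:egress:ingress in kbit/s, and the trait normalization (upper-case, non-alphanumerics replaced by "_") is an assumption about how the physnet name becomes a trait. It ignores existing allocations and reserved capacity, so it only shows whether the configured inventory is large enough.

```python
import re

# Values copied from the [sriov_nic] section quoted in this comment.
PHYSICAL_DEVICE_MAPPINGS = (
    "sriov-1:enp6s0f2,sriov-2:enp6s0f3,"
    "mellanox-sriov-1:enp4s0f0,mellanox-sriov-2:enp4s0f1"
)
RESOURCE_PROVIDER_BANDWIDTHS = (
    "enp6s0f2:10000000:10000000,enp6s0f3:10000000:10000000,"
    "enp4s0f0:40000000:40000000,enp4s0f1:40000000:40000000"
)


def parse_device_mappings(value):
    """Return {device: physnet} from a 'physnet:device,...' string."""
    mappings = {}
    for item in value.split(","):
        physnet, device = item.strip().split(":")
        mappings[device] = physnet
    return mappings


def parse_bandwidths(value):
    """Return {device: (egress_kbps, ingress_kbps)}.

    Assumes the 'device:egress:ingress' format; an empty field is
    treated here as 0 (no inventory in that direction).
    """
    bandwidths = {}
    for item in value.split(","):
        device, egress, ingress = item.strip().split(":")
        bandwidths[device] = (int(egress or 0), int(ingress or 0))
    return bandwidths


def physnet_trait(physnet):
    """Assumed normalization of a physnet name into a custom trait."""
    return "CUSTOM_PHYSNET_" + re.sub(r"[^0-9A-Z]+", "_", physnet.upper())


def devices_satisfying(trait, egress_kbps):
    """List devices whose physnet trait matches and whose configured
    egress bandwidth is at least the requested amount."""
    devices = parse_device_mappings(PHYSICAL_DEVICE_MAPPINGS)
    bandwidths = parse_bandwidths(RESOURCE_PROVIDER_BANDWIDTHS)
    return [
        device
        for device, physnet in devices.items()
        if physnet_trait(physnet) == trait
        and bandwidths.get(device, (0, 0))[0] >= egress_kbps
    ]


if __name__ == "__main__":
    print(devices_satisfying("CUSTOM_PHYSNET_MELLANOX_SRIOV_1", 25000000))
```

If a device is returned here, the configured inventory is large enough for the request, which points away from the agent config and toward how the resource providers were registered in placement.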

Looking at /etc/hostname, the hostname is "computeovsdpdksriov-tigon10-0",

so we would expect the placement RP to be named "computeovsdpdksriov-tigon10-0".

Can you list the resource providers in the environment so we can review them?
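
Once the resource provider names are available (for example from `openstack resource provider list`), a small Python sketch like the one below could flag which expected providers are absent. The helper names are hypothetical, and the child provider naming ("<hostname>:NIC Switch agent" and "<hostname>:NIC Switch agent:<device>") is an assumption about how the SR-IOV agent registers itself, so treat it as a starting point for the review, not as the definitive naming rule.

```python
# Assumed agent type string used in the child resource provider names.
AGENT_TYPE = "NIC Switch agent"


def expected_provider_names(hostname, devices):
    """Build the provider names we would expect for one compute host:
    the compute node RP, the agent RP, and one RP per SR-IOV device."""
    names = [hostname, f"{hostname}:{AGENT_TYPE}"]
    names.extend(f"{hostname}:{AGENT_TYPE}:{device}" for device in devices)
    return names


def missing_providers(hostname, devices, listed_names):
    """Return the expected provider names that do not appear in the list
    reported by placement."""
    listed = set(listed_names)
    return [
        name
        for name in expected_provider_names(hostname, devices)
        if name not in listed
    ]


if __name__ == "__main__":
    # Hostname and devices taken from this comment; pass the real
    # provider names from the environment as the third argument.
    host = "computeovsdpdksriov-tigon10-0"
    devices = ["enp6s0f2", "enp6s0f3", "enp4s0f0", "enp4s0f1"]
    print(missing_providers(host, devices, listed_names=[]))
```

An empty result would mean every expected provider is present under the short hostname. Any names reported as missing are worth comparing by hand against the listed ones, since a hostname mismatch would show up as a provider registered under a different name.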

Comment 5 smooney 2021-05-05 16:21:50 UTC
This is likely a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1900500, actually.

Comment 6 smooney 2021-05-05 19:09:15 UTC

*** This bug has been marked as a duplicate of bug 1900500 ***