Bug 1952073 - [OSP16.1] Failed to schedule VMs with minimum bandwidth - SR-IOV bandwidth aware scheduling
Status: CLOSED DUPLICATE of bug 1900500
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: OSP DFG:Compute
QA Contact: OSP DFG:Compute
Reported: 2021-04-21 12:58 UTC by Vadim Khitrin
Modified: 2023-03-21 19:41 UTC (History)

Last Closed: 2021-05-05 19:09:15 UTC

Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-3218 0 None None None 2022-08-23 18:36:05 UTC

Description Vadim Khitrin 2021-04-21 12:58:21 UTC
Description of problem:
Previously (in the OSP16 release) we were able to use bandwidth-aware scheduling to schedule an instance with a minimum bandwidth guarantee.
This no longer appears to work in our 16.1 deployments.

We have supplied the correct 'NeutronSriovResourceProviderBandwidths' TripleO parameter and verified that the correct 'resource_provider_bandwidths' values were populated in the SR-IOV NIC agent configuration.

The error that we are receiving in the log is:
nova-scheduler.log:2021-04-21 10:04:24.296 23 ERROR nova.scheduler.client.report [req-91a26cb1-2823-433a-9ef7-22f5d997f789 42b9e8e4fc67453aaecb179e1417bda0 61f0d411b9c4486c9cec8144367f3b6e - default default] Failed to retrieve allocation candidates from placement API for filters: RequestGroup(aggregates=[],forbidden_aggregates=set([]),forbidden_traits=set(['COMPUTE_STATUS_DISABLED']),in_tree=None,provider_uuids=[],requester_id=None,required_traits=set(['COMPUTE_IMAGE_TYPE_QCOW2']),resources={DISK_GB=20,MEMORY_MB=8192,PCPU=6},use_same_provider=False), RequestGroup(aggregates=[],forbidden_aggregates=set([]),forbidden_traits=set([]),in_tree=None,provider_uuids=[],requester_id='a0f133d7-60fb-4477-8c69-21b70a9c4c57',required_traits=set(['CUSTOM_PHYSNET_MELLANOX_SRIOV_1','CUSTOM_VNIC_TYPE_DIRECT']),resources={NET_BW_EGR_KILOBIT_PER_SEC=25000000},use_same_provider=True)
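
To make the failing request easier to read, the second request group in the error above can be pulled apart into its traits and resources. The following is only a minimal sketch that assumes the textual `RequestGroup(...)` form shown in this log line; the function name `summarize_request_group` and the regular expressions are illustrative and are not part of nova.

```python
import re

# Second RequestGroup copied from the scheduler log line above.
LOG_FRAGMENT = (
    "RequestGroup(aggregates=[],forbidden_aggregates=set([]),"
    "forbidden_traits=set([]),in_tree=None,provider_uuids=[],"
    "requester_id='a0f133d7-60fb-4477-8c69-21b70a9c4c57',"
    "required_traits=set(['CUSTOM_PHYSNET_MELLANOX_SRIOV_1',"
    "'CUSTOM_VNIC_TYPE_DIRECT']),"
    "resources={NET_BW_EGR_KILOBIT_PER_SEC=25000000},use_same_provider=True)"
)


def summarize_request_group(text):
    """Extract required traits and resource amounts from a logged RequestGroup.

    Hypothetical helper: it only understands the repr format seen in this log.
    """
    traits_match = re.search(r"required_traits=set\(\[(.*?)\]\)", text)
    traits = re.findall(r"'([^']+)'", traits_match.group(1)) if traits_match else []

    resources = {}
    res_match = re.search(r"resources=\{(.*?)\}", text)
    if res_match and res_match.group(1):
        for pair in res_match.group(1).split(","):
            name, amount = pair.split("=")
            resources[name] = int(amount)

    return {"required_traits": sorted(traits), "resources": resources}


if __name__ == "__main__":
    print(summarize_request_group(LOG_FRAGMENT))
```

Read this way, the port asks for 25000000 kbit/s of egress bandwidth from a single provider that carries both the physnet trait and the direct vNIC type trait.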


Version-Release number of selected component (if applicable):
RHOS-16.1-RHEL-8-20210415.n.0

How reproducible:
In all of the setups we have tried (OVS and OVN backends, and even 16.2, which is out of scope for this bug).

Steps to Reproduce:
1. Deploy an environment with the required bandwidth parameters.
2. Attempt to spawn an instance with SR-IOV minimum bandwidth.

Actual results:
Instance fails to spawn.

Expected results:
Instance spawns successfully.

Additional info:
Will provide SOS report in comments.

Comment 2 Vadim Khitrin 2021-04-27 11:32:56 UTC
I think this is a misconfiguration in my deployment. After re-reading the documentation (https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html/configuration_reference/neutron_2), I noticed that I appear to be missing the 'resource_provider_hypervisors' parameter, which is not yet exposed through TripleO (BZ#1936383).

Regardless, I will try to add the parameter manually and will update this bug.

Comment 3 smooney 2021-04-28 15:58:05 UTC
you might be hitting https://bugzilla.redhat.com/show_bug.cgi?id=1949385. i have not fully looked at your sos reports to confirm, but the default hostnames should match.

you should not need to update resource_provider_hypervisors, just resource_provider_bandwidths.

Comment 4 smooney 2021-05-05 15:45:47 UTC

looking at the sriov agent config i can see that the bandwidth is correctly listed.

[sriov_nic]
physical_device_mappings=sriov-1:enp6s0f2,sriov-2:enp6s0f3,mellanox-sriov-1:enp4s0f0,mellanox-sriov-2:enp4s0f1
resource_provider_bandwidths=enp6s0f2:10000000:10000000,enp6s0f3:10000000:10000000,enp4s0f0:40000000:40000000,enp4s0f1:40000000:40000000
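
As a sanity check on the values above, the two options can be cross-referenced against the 25000000 kbit/s egress request from the scheduler log. This is a rough sketch, assuming each `resource_provider_bandwidths` entry is `device:egress:ingress` in kbit/s and that the trait CUSTOM_PHYSNET_MELLANOX_SRIOV_1 corresponds to the physnet `mellanox-sriov-1`; the helper names are made up for illustration and are not neutron code.

```python
PHYSICAL_DEVICE_MAPPINGS = (
    "sriov-1:enp6s0f2,sriov-2:enp6s0f3,"
    "mellanox-sriov-1:enp4s0f0,mellanox-sriov-2:enp4s0f1"
)
RESOURCE_PROVIDER_BANDWIDTHS = (
    "enp6s0f2:10000000:10000000,enp6s0f3:10000000:10000000,"
    "enp4s0f0:40000000:40000000,enp4s0f1:40000000:40000000"
)


def parse_mappings(value):
    """Parse 'physnet:device,...' into {device: physnet}."""
    result = {}
    for item in value.split(","):
        physnet, device = item.split(":")
        result[device] = physnet
    return result


def parse_bandwidths(value):
    """Parse 'device:egress:ingress,...' into {device: (egress, ingress)}.

    Assumption: the first number is egress and the second is ingress, in kbit/s.
    """
    result = {}
    for item in value.split(","):
        device, egress, ingress = item.split(":")
        result[device] = (int(egress), int(ingress))
    return result


def devices_with_egress(mappings, bandwidths, physnet, min_egress_kbps):
    """Return devices on `physnet` whose configured egress covers the request."""
    return sorted(
        device
        for device, net in mappings.items()
        if net == physnet and bandwidths.get(device, (0, 0))[0] >= min_egress_kbps
    )


if __name__ == "__main__":
    mappings = parse_mappings(PHYSICAL_DEVICE_MAPPINGS)
    bandwidths = parse_bandwidths(RESOURCE_PROVIDER_BANDWIDTHS)
    print(devices_with_egress(mappings, bandwidths, "mellanox-sriov-1", 25000000))
```

Under those assumptions enp4s0f0 has enough configured egress bandwidth for the request, which is consistent with the configuration itself not being the problem.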

looking at /etc/hostname, it is "computeovsdpdksriov-tigon10-0"

so we would expect the placement RP to be named "computeovsdpdksriov-tigon10-0"

can you list the resource providers in the environment so we can review them?
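
Once the list of resource provider names is available, the review could be sketched roughly as below. This assumes the naming scheme described in the upstream neutron documentation, where the SR-IOV agent's device providers are named `<hostname>:NIC Switch agent:<device>` beneath a compute root provider named after the host. The function `review_providers` is hypothetical, and the provider names in the example are invented for illustration; they are not taken from this environment.

```python
def review_providers(names, hostname, devices, agent_type="NIC Switch agent"):
    """Report which expected resource provider names are present.

    Assumption: device providers are named '<hostname>:<agent_type>:<device>'
    and the compute root provider is named exactly '<hostname>'.
    """
    present = set(names)
    report = {"compute_root": hostname in present}
    for device in devices:
        report[device] = f"{hostname}:{agent_type}:{device}" in present
    return report


if __name__ == "__main__":
    # Invented sample data: the compute root uses a longer name than the host
    # name the agent reports, so the lookup for the root fails.
    sample_names = [
        "computeovsdpdksriov-tigon10-0.localdomain",
        "computeovsdpdksriov-tigon10-0:NIC Switch agent",
        "computeovsdpdksriov-tigon10-0:NIC Switch agent:enp4s0f0",
    ]
    print(
        review_providers(
            sample_names,
            "computeovsdpdksriov-tigon10-0",
            ["enp4s0f0", "enp4s0f1"],
        )
    )
```

A missing compute root or missing device provider in such a report would point at a hostname mismatch or at bandwidth inventory that was never reported to placement.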

Comment 5 smooney 2021-05-05 16:21:50 UTC
actually, this is likely a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1900500.

Comment 6 smooney 2021-05-05 19:09:15 UTC

*** This bug has been marked as a duplicate of bug 1900500 ***

