Bug 1801054 - [OSP16] Fail to schedule a guest with minimum QoS policy rule applied to port before guest creation
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 16.0 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Rodolfo Alonso
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On: 1788974
Blocks:
 
Reported: 2020-02-10 06:34 UTC by Vadim Khitrin
Modified: 2020-02-13 07:57 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-02-10 17:00:48 UTC
Target Upstream Version:
Embargoed:



Description Vadim Khitrin 2020-02-10 06:34:42 UTC
Description of problem:

When attempting to spawn a guest instance on top of an SR-IOV enabled Mellanox ConnectX-5 NIC with a minimum bandwidth QoS policy, scheduling fails.
It is important to note that this does not occur if we attach the QoS policy after the instance is spawned; once the policy is attached post spawn, it is applied to the VF and behaves correctly.
The issue only occurs when attempting to schedule the initial guest with the QoS policy already attached to the port.

Create network:
openstack network create --provider-network-type vlan --provider-physical-network sriov-3 sriov-mx5-net
Create subnet:
openstack subnet create --subnet-range 60.0.20.0/24 --dhcp --network sriov-mx5-net sriov-mx5-net_subnet
Create QoS policy:
openstack network qos policy create sriov-qos
Create QoS minimum bandwidth rule:
openstack network qos rule create --type minimum-bandwidth --min-kbps 4000000 --egress sriov-qos
Create port:
openstack port create --network sriov-mx5-net --vnic-type direct --qos-policy sriov-qos sriov-mx5-direct-port
Create server:
openstack server create --flavor m1.medium.huge_pages_cpu_pinning_numa_node-0 --image rhel-guest-image-7-6-210-x86-64-qcow2 --nic port-id=sriov-mx5-direct-port TEST_SERVER
Wait for the server to enter the 'ERROR' state and grab the fault:
openstack server show -c status -c fault TEST_SERVER -c id -f shell
fault="{'code': 500, 'created': '2020-02-10T06:12:00Z', 'message': 'No valid host was found. ', 'details': 'Traceback (most recent call last):\n  File "/usr/lib/python3.6/site-packages/nova/conductor/manager.py", line 1333, in schedule_and_build_instances\n    instance_uuids, return_alternates=True)\n  File "/usr/lib/python3.6/site-packages/nova/conductor/manager.py", line 839, in _schedule_instances\n    return_alternates=return_alternates)\n  File "/usr/lib/python3.6/site-packages/nova/scheduler/client/query.py", line 42, in select_destinations\n    instance_uuids, return_objects, return_alternates)\n  File "/usr/lib/python3.6/site-packages/nova/scheduler/rpcapi.py", line 160, in select_destinations\n    return cctxt.call(ctxt, \'select_destinations\', **msg_args)\n  File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 181, in call\n    transport_options=self.transport_options)\n  File "/usr/lib/python3.6/site-packages/oslo_messaging/transport.py", line 129, in _send\n    transport_options=transport_options)\n  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 646, in send\n    transport_options=transport_options)\n  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 636, in _send\n    raise result\nnova.exception_Remote.NoValidHost_Remote: No valid host was found. \nTraceback (most recent call last):\n\n  File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 235, in inner\n    return func(*args, **kwargs)\n\n  File "/usr/lib/python3.6/site-packages/nova/scheduler/manager.py", line 199, in select_destinations\n    raise exception.NoValidHost(reason="")\n\nnova.exception.NoValidHost: No valid host was found. \n\n'}"
id="36da87fb-2363-417f-bcda-b4f4b2e8d2bc"
status="ERROR"
The following error appears on the controller node:
2020-02-10 06:11:58.841 22 DEBUG nova.scheduler.request_filter [req-ea61557d-9ac1-4699-b83f-67ee88778162 c9f98d7a25e44fd58646792151e8c297 90af442de83f4404a9b87580b869dcb9 - default default] Request filter 'compute_status_filter' took 0.0 seconds wrapper /usr/lib/python3.6/site-packages/nova/scheduler/request_filter.py:44
2020-02-10 06:11:58.842 22 DEBUG nova.scheduler.utils [req-ea61557d-9ac1-4699-b83f-67ee88778162 c9f98d7a25e44fd58646792151e8c297 90af442de83f4404a9b87580b869dcb9 - default default] Translating request for VCPU=6 to VCPU=0,PCPU=6 _translate_pinning_policies /usr/lib/python3.6/site-packages/nova/scheduler/utils.py:246
2020-02-10 06:12:00.272 22 ERROR nova.scheduler.client.report [req-ea61557d-9ac1-4699-b83f-67ee88778162 c9f98d7a25e44fd58646792151e8c297 90af442de83f4404a9b87580b869dcb9 - default default] Failed to retrieve allocation candidates from placement API for filters: RequestGroup(aggregates=[],forbidden_aggregates=set([]),forbidden_traits=set(['COMPUTE_STATUS_DISABLED']),in_tree=None,provider_uuids=[],requester_id=None,required_traits=set(['COMPUTE_IMAGE_TYPE_QCOW2']),resources={DISK_GB=20,MEMORY_MB=8192,PCPU=6},use_same_provider=False), RequestGroup(aggregates=[],forbidden_aggregates=set([]),forbidden_traits=set([]),in_tree=None,provider_uuids=[],requester_id='9fcdf4dd-e774-4e8a-a38e-b941744a8c88',required_traits=set(['CUSTOM_VNIC_TYPE_DIRECT','CUSTOM_PHYSNET_SRIOV_6']),resources={NET_BW_EGR_KILOBIT_PER_SEC=4000000},use_same_provider=True)
Got 400: {"errors": [{"status": 400, "title": "Bad Request", "detail": "The server could not comply with the request since it is either malformed or otherwise incorrect.\n\n No such trait(s): CUSTOM_PHYSNET_SRIOV_3.  ", "code": "placement.undefined_code", "request_id": "req-6d8139cf-bcda-4df0-b86e-736c73f596af"}]}.
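(Not part of the original report: a quick way to cross-check which physnet traits placement actually knows about, assuming the osc-placement CLI plugin is installed and admin credentials are loaded; the provider UUID below is a placeholder.)
List the custom physnet traits known to placement:
openstack --os-placement-api-version 1.6 trait list | grep CUSTOM_PHYSNET
List the resource providers and the traits reported on one of them:
openstack resource provider list
openstack --os-placement-api-version 1.6 resource provider trait list <provider-uuid>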
We can see that the port has additional values set in its 'resource_request' attribute:
openstack port create --network sriov-mx5-net --vnic-type direct --qos-policy sriov-qos sriov-mx5-direct-port -c resource_request -f shell
resource_request="{'required': ['CUSTOM_PHYSNET_SRIOV_3', 'CUSTOM_VNIC_TYPE_DIRECT'], 'resources': {'NET_BW_EGR_KILOBIT_PER_SEC': 4000000}}"

Version-Release number of selected component (if applicable):
RHOS_TRUNK-16.0-RHEL-8-20200204.n.1

How reproducible:
Always

Steps to Reproduce:
1. Deploy an overcloud with SR-IOV capabilities on hardware that supports minimum bandwidth QoS
2. Attempt to spawn a guest instance with a minimum bandwidth QoS policy attached to the port

Actual results:
Guest instance fails to spawn.

Expected results:
Guest instance is scheduled successfully.

Additional info:
Will attach SOS reports in comments.

Comment 2 Vadim Khitrin 2020-02-10 11:43:26 UTC
Sorry, I posted an incorrect command to view the port info; here is the command I meant to post:

openstack port show sriov-mx5-direct-port -c resource_request -f shell
resource_request="{'required': ['CUSTOM_PHYSNET_SRIOV_3', 'CUSTOM_VNIC_TYPE_DIRECT'], 'resources': {'NET_BW_EGR_KILOBIT_PER_SEC': 4000000}}"

Comment 6 Vadim Khitrin 2020-02-10 17:00:48 UTC
In the Train release (OSP16) the behaviour of minimum bandwidth QoS has changed.
This was a misconfiguration on my side. Following the documentation at https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.0/html/networking_guide/QoS#proc-guaranteed-min-bw I was able to spawn a guest with a minimum bandwidth QoS policy attached to the port.
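(Not part of the original comment: a minimal sketch of the agent-side setting the documented procedure relies on, with a hypothetical interface name and illustrative bandwidth values.) The guaranteed minimum bandwidth feature only works if the SR-IOV agent reports its available bandwidth to placement, e.g. in sriov_agent.ini:

[sriov_nic]
# Map the physical network used above to the SR-IOV capable interface (interface name is hypothetical)
physical_device_mappings = sriov-3:enp6s0f0
# Available egress:ingress bandwidth per interface, in kbps (values are illustrative)
resource_provider_bandwidths = enp6s0f0:40000000:40000000

Without resource_provider_bandwidths, the bandwidth resource providers and their CUSTOM_PHYSNET_* traits are not created in placement, which is consistent with the "No such trait(s)" error seen in the description.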

