Bug 2104804

Summary: VM creation fails due to VCPU allocation issues when the request reaches placement api
Product: Red Hat OpenStack
Reporter: Ketan Mehta <kmehta>
Component: openstack-placement
Assignee: OSP DFG:Compute <osp-dfg-compute>
Status: CLOSED DUPLICATE
QA Contact: OSP DFG:Compute <osp-dfg-compute>
Severity: urgent
Priority: urgent
Version: 17.0 (Wallaby)
CC: astillma, bgibizer, dciabrin, jparker, jraju, jschluet, kchamart, sbauza, schari, smooney, spower, stchen
Target Milestone: ga
Keywords: Triaged
Target Release: 17.0
Hardware: x86_64
OS: Linux
Doc Type: No Doc Update
Clone Of:
: 2109813 (view as bug list)
Last Closed: 2022-09-07 10:50:04 UTC
Type: Bug
Bug Depends On: 2096274, 2109813

Description Ketan Mehta 2022-07-07 08:17:32 UTC
Description of problem:

Placement API reports incorrect usage of VCPU when requesting an allocation.

At the moment I have 2 VMs running on the compute node, with 12 vCPUs per VM, and the host has 64 CPUs available and online.

However, on every compute node, VM creation fails once 2 VMs have been created.

The nova-compute resource agent reports 24 vCPUs in use on the compute node, yet a new VM creation request fails because VCPU cannot be allocated to the VM.

+++
RESP BODY: {"errors": [{"status": 409, "title": "Conflict", "detail": "There was a conflict when trying to complete your request.\n\n Unable to allocate inventory: Unable to create allocation for 'VCPU' on resource provider '65929119-23f6-4ba2-b98b-4eab5884633f'. The requested amount would exceed the capacity.  ", "request_id": "req-f31a74af-6bf3-4547-a41b-f29bfcd9b0f0"}]}
+++
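
For readers triaging a similar failure, the response body above can be unpacked to pull out the status, request id, and a one-line detail for matching against the placement logs. This is only a minimal sketch: the body is copied from the excerpt above, and the helper name `summarize_errors` is made up for illustration.

```python
import json

# Response body copied verbatim from the failed request above.
resp_body = r'''{"errors": [{"status": 409, "title": "Conflict", "detail": "There was a conflict when trying to complete your request.\n\n Unable to allocate inventory: Unable to create allocation for 'VCPU' on resource provider '65929119-23f6-4ba2-b98b-4eab5884633f'. The requested amount would exceed the capacity.  ", "request_id": "req-f31a74af-6bf3-4547-a41b-f29bfcd9b0f0"}]}'''


def summarize_errors(body):
    """Return (status, request_id, detail) for each entry in the error list.

    Hypothetical helper: whitespace in the detail is collapsed so it fits
    on a single line.
    """
    doc = json.loads(body)
    return [
        (err["status"], err["request_id"], " ".join(err["detail"].split()))
        for err in doc["errors"]
    ]


for status, request_id, detail in summarize_errors(resp_body):
    print(status, request_id)
    print(detail)
```

The request id printed here is the value to grep for in placement.log on the controllers.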

Do note that cpu_shared_set and cpu_dedicated_set were not specified, so the cpu_allocation_ratio would be 16.0 by default.

Here is the issue as seen from the Placement API, with respect to one resource provider:

+--------------------------------------+-----------------------+------------+
| uuid                                 | name                  | generation |
+--------------------------------------+-----------------------+------------+
| 32c47b84-3bd6-4022-8455-867d1b819dd3 | compute-4.localdomain |         17 |
| 84b1755d-8ffd-4196-a3a6-c6218970307e | compute-2.localdomain |         18 |
| 65929119-23f6-4ba2-b98b-4eab5884633f | compute-5.localdomain |         15 |
| 6664ea69-8737-459c-af0a-e42108a6dcf7 | compute-0.localdomain |         15 |
| a9af1312-abbc-4151-a27f-beb901fb638b | compute-3.localdomain |         15 |
| 4c3b2cae-8ff1-4230-bc43-6f95ff70506c | compute-6.localdomain |         17 |
| 920eaa5d-ea15-4c05-8910-0b34f66b7b92 | compute-1.localdomain |         17 |
+--------------------------------------+-----------------------+------------+

Let's take compute-5 as the resource provider (rp) in the next few commands.

# openstack resource provider show 65929119-23f6-4ba2-b98b-4eab5884633f --allocation

+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field       | Value                                                                                                                                                                                                            |
+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| uuid        | 65929119-23f6-4ba2-b98b-4eab5884633f                                                                                                                                                                             |
| name        | compute-5.localdomain                                                                                                                                                                                            |
| generation  | 15                                                                                                                                                                                                               |
| allocations | {'5d939491-fca1-4c67-98b4-6e0d1bb8eac8': {'resources': {'VCPU': 12, 'MEMORY_MB': 8192, 'DISK_GB': 100}}, '5f37eb85-bbdb-4008-9d8f-5394d12ffb66': {'resources': {'VCPU': 12, 'MEMORY_MB': 8192, 'DISK_GB': 100}}} |
+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

# openstack resource provider usage show 65929119-23f6-4ba2-b98b-4eab5884633f
 
+----------------+-------+
| resource_class | usage |
+----------------+-------+
| VCPU           |    24 |
| MEMORY_MB      | 16384 |
| DISK_GB        |   200 |
+----------------+-------+
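
As a cross-check, the per-consumer allocations listed earlier can be summed and compared with the usage table. The following is a minimal sketch with the values copied from the output above; the helper name `total_usage` is made up for illustration.

```python
from collections import Counter

# Allocations copied from the "resource provider show --allocation" output.
allocations = {
    "5d939491-fca1-4c67-98b4-6e0d1bb8eac8": {
        "resources": {"VCPU": 12, "MEMORY_MB": 8192, "DISK_GB": 100}
    },
    "5f37eb85-bbdb-4008-9d8f-5394d12ffb66": {
        "resources": {"VCPU": 12, "MEMORY_MB": 8192, "DISK_GB": 100}
    },
}


def total_usage(allocs):
    """Sum the per-consumer amounts for each resource class (hypothetical helper)."""
    totals = Counter()
    for consumer in allocs.values():
        totals.update(consumer["resources"])
    return dict(totals)


usage = total_usage(allocations)
print(usage)
```

The sums agree with the usage table (VCPU 24, MEMORY_MB 16384, DISK_GB 200), so the API-visible allocations and usage are at least consistent with each other.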

So, based on this output, we would assume that resources are available. Let's request an allocation.

# openstack resource provider allocation set --allocation rp=65929119-23f6-4ba2-b98b-4eab5884633f,VCPU=4,DISK_GB=100,MEMORY_MB=8192 65929119-23f6-4ba2-b98b-4eab5884633f --debug

It fails with the same error as mentioned above. Here is an excerpt from the placement API logs for request id req-f31a74af-6bf3-4547-a41b-f29bfcd9b0f0:

+++
RESP BODY: {"errors": [{"status": 409, "title": "Conflict", "detail": "There was a conflict when trying to complete your request.\n\n Unable to allocate inventory: Unable to create allocation for 'VCPU' on resource provider '65929119-23f6-4ba2-b98b-4eab5884633f'. The requested amount would exceed the capacity.  ", "request_id": "req-1032c99c-5e57-4982-b838-2c0263ee5fb1"}]}
PUT call to placement for http://192.16.0.51:8778/placement/allocations/65929119-23f6-4ba2-b98b-4eab5884633f used request id req-1032c99c-5e57-4982-b838-2c0263ee5fb1
Request returned failure status: 409
Unable to allocate inventory: Unable to create allocation for 'VCPU' on resource provider '65929119-23f6-4ba2-b98b-4eab5884633f'. The requested amount would exceed the capacity. (HTTP 409)
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/osc_placement/http.py", line 32, in _wrap_http_exceptions
    yield
  File "/usr/lib/python3.9/site-packages/osc_placement/http.py", line 59, in request
    return self.session.request(url, method,
  File "/usr/lib/python3.9/site-packages/keystoneauth1/session.py", line 986, in request
    raise exceptions.from_response(resp, method, url)
keystoneauth1.exceptions.http.Conflict: Conflict (HTTP 409) (Request-ID: req-1032c99c-5e57-4982-b838-2c0263ee5fb1)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/cliff/app.py", line 401, in run_subcommand
    result = cmd.run(parsed_args)
  File "/usr/lib/python3.9/site-packages/osc_lib/command/command.py", line 39, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python3.9/site-packages/cliff/display.py", line 115, in run
    column_names, data = self.take_action(parsed_args)
  File "/usr/lib/python3.9/site-packages/osc_placement/resources/allocation.py", line 139, in take_action
    http.request('PUT', url, json=payload)
  File "/usr/lib/python3.9/site-packages/osc_placement/http.py", line 59, in request
    return self.session.request(url, method,
  File "/usr/lib64/python3.9/contextlib.py", line 137, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/lib/python3.9/site-packages/osc_placement/http.py", line 39, in _wrap_http_exceptions
    six.raise_from(exc_class(exc.http_status, msg), exc)
  File "<string>", line 3, in raise_from
osc_lib.exceptions.Conflict: Unable to allocate inventory: Unable to create allocation for 'VCPU' on resource provider '65929119-23f6-4ba2-b98b-4eab5884633f'. The requested amount would exceed the capacity. (HTTP 409)
clean_up SetAllocation: Unable to allocate inventory: Unable to create allocation for 'VCPU' on resource provider '65929119-23f6-4ba2-b98b-4eab5884633f'. The requested amount would exceed the capacity. (HTTP 409)
+++

+++
controller-1 | CHANGED | rc=0 >>
/var/log/containers/placement/placement.log:2022-07-07 08:11:27.601 16 DEBUG placement.requestlog [req-1032c99c-5e57-4982-b838-2c0263ee5fb1 - - - - -] Starting request: 192.17.1.95 "PUT /placement/allocations/65929119-23f6-4ba2-b98b-4eab5884633f" __call__ /usr/lib/python3.9/site-packages/placement/requestlog.py:55
/var/log/containers/placement/placement.log:2022-07-07 08:11:27.729 16 WARNING placement.objects.allocation [req-1032c99c-5e57-4982-b838-2c0263ee5fb1 98e716dbc1af4bf695e0b6ffc41a7569 0bfd001369604c33bfa8ca01814cff04 - default default] Over capacity for VCPU on resource provider 65929119-23f6-4ba2-b98b-4eab5884633f. Needed: 12, Used: 16608, Capacity: 1024.0
/var/log/containers/placement/placement.log:2022-07-07 08:11:27.736 16 DEBUG placement.handlers.allocation [req-1032c99c-5e57-4982-b838-2c0263ee5fb1 98e716dbc1af4bf695e0b6ffc41a7569 0bfd001369604c33bfa8ca01814cff04 - default default] Deleted auto-created consumer with consumer UUID 65929119-23f6-4ba2-b98b-4eab5884633f after failed allocation delete_consumers /usr/lib/python3.9/site-packages/placement/handlers/allocation.py:364
/var/log/containers/placement/placement.log:2022-07-07 08:11:27.737 16 DEBUG placement.wsgi_wrapper [req-1032c99c-5e57-4982-b838-2c0263ee5fb1 98e716dbc1af4bf695e0b6ffc41a7569 0bfd001369604c33bfa8ca01814cff04 - default default] Placement API returning an error response: Unable to allocate inventory: Unable to create allocation for 'VCPU' on resource provider '65929119-23f6-4ba2-b98b-4eab5884633f'. The requested amount would exceed the capacity. call_func /usr/lib/python3.9/site-packages/placement/wsgi_wrapper.py:31
/var/log/containers/placement/placement.log:2022-07-07 08:11:27.739 16 INFO placement.requestlog [req-1032c99c-5e57-4982-b838-2c0263ee5fb1 98e716dbc1af4bf695e0b6ffc41a7569 0bfd001369604c33bfa8ca01814cff04 - default default] 192.17.1.95 "PUT /placement/allocations/65929119-23f6-4ba2-b98b-4eab5884633f" status: 409 len: 364 microversion: 1.0
+++
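
The WARNING line in the excerpt above carries the numbers that matter. A minimal sketch for extracting them from such a line follows; the regular expression is written against the message text shown above and is an assumption about its shape, not a documented log format, and `parse_over_capacity` is a made-up helper name.

```python
import re

# Message portion of the WARNING line from placement.log above.
line = (
    "Over capacity for VCPU on resource provider "
    "65929119-23f6-4ba2-b98b-4eab5884633f. "
    "Needed: 12, Used: 16608, Capacity: 1024.0"
)

# Pattern derived from the message above (assumed shape, not a stable format).
PATTERN = re.compile(
    r"Over capacity for (?P<rc>\w+) on resource provider (?P<rp>[0-9a-f-]+)\. "
    r"Needed: (?P<needed>\d+), Used: (?P<used>\d+), Capacity: (?P<capacity>[\d.]+)"
)


def parse_over_capacity(text):
    """Return the fields of an over-capacity warning, or None if no match."""
    m = PATTERN.search(text)
    if m is None:
        return None
    return {
        "resource_class": m["rc"],
        "provider": m["rp"],
        "needed": int(m["needed"]),
        "used": int(m["used"]),
        "capacity": float(m["capacity"]),
    }


print(parse_over_capacity(line))
```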

I'm not sure how the used VCPU count is being reported as 16608 when just 2 VMs with 12 vCPUs each are running on the mentioned compute node with 64 CPUs. The maximum VCPU capacity seems fine (64*16 = 1024).
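
To make the arithmetic explicit, here is a small sketch of the capacity check. It assumes capacity is computed as (total - reserved) * allocation_ratio, and that reserved is 0 on this host (the reserved value is not shown in this report). The function names are made up for illustration.

```python
def capacity(total, reserved, allocation_ratio):
    """Capacity of one resource class, assuming (total - reserved) * ratio."""
    return (total - reserved) * allocation_ratio


def fits(used, requested, cap):
    """True if the request fits within the remaining capacity."""
    return used + requested <= cap


# 64 host CPUs and ratio 16.0 come from the report; reserved=0 is an assumption.
cap = capacity(total=64, reserved=0, allocation_ratio=16.0)

# Usage reported by the API (24) versus the value in the placement warning (16608).
print(cap, fits(24, 12, cap), fits(16608, 12, cap))
```

With the API-reported usage of 24, a 12-VCPU request fits comfortably within a capacity of 1024.0; it is only the 16608 figure from the warning that pushes the request over capacity.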

Version-Release number of selected component (if applicable):

[root@controller-1 /]# rpm -qa |grep -i placement
python3-placement-5.0.1-0.20210813021511.adf525a.el9ost.noarch
openstack-placement-common-5.0.1-0.20210813021511.adf525a.el9ost.noarch
openstack-placement-api-5.0.1-0.20210813021511.adf525a.el9ost.noarch

[root@controller-1 /]# rpm -qa |grep -i nova
python3-novaclient-17.4.0-0.20210812172018.54d4da1.el9ost.noarch

Actual results:

VM creation fails.

Expected results:

VM creation should succeed, and requesting an allocation should work.

Additional info:
Environment details can be shared for review.

Comment 3 smooney 2022-07-18 17:53:09 UTC
Is this from an upstream CI job, a downstream CI job, or an issue you hit directly?

If this is from a deployment you have access to, can you provide a set of sos reports?
If this is from a CI run, can you provide the link to the failing job?

This might be a RHEL bug, in which case we will either need to change the component or close this as can't fix and file a separate bug.

Comment 7 smooney 2022-07-22 09:44:11 UTC
We have identified the cause as a bug in MariaDB that is being fixed by https://bugzilla.redhat.com/show_bug.cgi?id=2096274

I'm going to triage this as urgent/urgent for now, although we likely will not need to do anything once the new package is available and the container is rebuilt.