Bug 1774243

Summary: When trying to migrate VMs from a failed compute node, we receive a ResourceProviderCreationFailed traceback
Product: Red Hat OpenStack
Component: openstack-nova
Version: 13.0 (Queens)
Status: CLOSED NEXTRELEASE
Severity: medium
Priority: medium
Reporter: Brendan Shephard <bshephar>
Assignee: melanie witt <mwitt>
QA Contact: OSP DFG:Compute <osp-dfg-compute>
CC: dasmith, eglynn, jhakimra, jparker, kchamart, lyarwood, mwitt, osp-dfg-compute, sbauza, sgordon, stephenfin, vromanso
Keywords: Triaged, ZStream
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2024-08-28 22:47:28 UTC

Description Brendan Shephard 2019-11-19 20:54:58 UTC
Description of problem:
We have a failed compute node. When we try to migrate VMs from it, we receive the following traceback:

{u'message': u'Failed to create resource provider compute-0', u'code': 500, u'details': u'Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 202, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2920, in rebuild_instance
    migration=migration)
  File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 274, in inner
    return f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 246, in rebuild_claim
    limits=limits, image_meta=image_meta)
  File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 355, in _move_claim
    self._update(elevated, cn)
  File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 904, in _update
    inv_data,
  File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 68, in set_inventory_for_provider
    parent_provider_uuid=parent_provider_uuid,
  File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 37, in __run_method
    return getattr(self.instance, __name)(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/nova/scheduler/client/report.py", line 1104, in set_inventory_for_provider
    parent_provider_uuid=parent_provider_uuid)
  File "/usr/lib/python2.7/site-packages/nova/scheduler/client/report.py", line 673, in _ensure_resource_provider
    name=name or uuid)
ResourceProviderCreationFailed: Failed to create resource provider compute-0
', u'created': u'2019-11-12T22:51:43Z'}

Version-Release number of selected component (if applicable):
RHOSP13 z7.
(I'll get the exact nova version and provide it as a comment)

How reproducible:
Every time we try to migrate in this environment

Steps to Reproduce:
1. nova evacuate UUID  (from the failed hypervisor)
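
For reference, the evacuate command takes the UUID of a server on the failed host and, optionally, a target host. The lines below are only an illustration of the syntax with placeholder variables (the scheduler picks a destination when no host is given):

nova evacuate $server_uuid
nova evacuate $server_uuid $target_host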

Actual results:

Same ResourceProviderCreationFailed traceback as shown in the description above.

Expected results:
It should be able to find the resource provider and migrate the VM.

Additional info:
We fail here:
https://opendev.org/openstack/nova/src/branch/stable/queens/nova/scheduler/client/report.py#L659-L673

So I assume it fails to get/refresh the existing resource provider and falls into the 'if not rps_to_refresh:' branch:

        rps_to_refresh = self._get_providers_in_tree(context, uuid)
        if not rps_to_refresh:

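A rough paraphrase of the failing path (simplified; not the exact upstream code from the link above): the GET for the provider tree returns nothing, the subsequent create collides with a provider that placement already knows about (for example, an old provider still registered under the name compute-0), and the client raises the error seen in the traceback.

    # Simplified paraphrase of _ensure_resource_provider() in the Queens report client, not verbatim code.
    rps_to_refresh = self._get_providers_in_tree(context, uuid)   # GET returns [] for this compute node UUID
    if not rps_to_refresh:
        created_rp = self._create_resource_provider(
            context, uuid, name or uuid,
            parent_provider_uuid=parent_provider_uuid)             # POST fails (e.g. a conflict on the name)
        if created_rp is None:
            raise exception.ResourceProviderCreationFailed(name=name or uuid)
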
Comment 2 melanie witt 2019-11-22 06:33:04 UTC
Apologies for the delayed response -- the placement service is not my usual area of expertise and I had to do a lot of research to assemble the recovery steps.

There's not much info in this bug report as to what actions were taken prior to the error described here, so I'm going to go through a couple of possibilities and their mitigations.

(1) Did the customer delete a compute service while its nova-compute process was running on the physical compute host? If so, that will result in the compute node record and placement resource provider being deleted out from under nova-compute and when subsequent server actions are attempted, they will fail with ResourceProviderCreationFailed [1]. Per the documentation, the nova-compute process should be *stopped* before deleting the service via the nova API [2].

If this ^ is the case, restart the nova-compute process to recover from the problem. You may also need to run 'nova-manage cell_v2 discover_hosts' if the compute service was deleted more recently than the CONF.scheduler.discover_hosts_in_cells_interval (that is, if servers are going to ERROR state on the compute host with "Host 'compute-0' is not mapped to any cell").
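
As a rough sketch of that recovery path (the restart command depends on how nova-compute is deployed; the container and unit names below are typical but are assumptions, so adjust to the environment):

# On the affected compute host, restart nova-compute (pick the variant that matches the deployment):
docker restart nova_compute                   # containerized OSP13
systemctl restart openstack-nova-compute     # package-based installs

# From a controller, re-map the host to its cell if servers are failing with
# "Host 'compute-0' is not mapped to any cell":
nova-manage cell_v2 discover_hosts --verbose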

(2) Did the customer delete a compute service that had (a) previously evacuated instances whose original host never recovered, and/or (b) unconfirmed migrations? If so, that will result in a ResourceProviderInUse failure when the placement resource provider associated with the service is deleted, because the cascading delete misses the placement allocations related to the evacuations and migrations.

If this ^ is the case, recovery will involve deleting the placement resource provider using the osc-placement CLI. There is a doc upstream [4] related to this, with the caveat that the 'nova-manage heal_allocations' CLI is not available in Queens (OSP13). It was introduced in Rocky (OSP14). We will have to backport that tool to OSP13 separately.

Ultimately, the goal with situation (2) is to delete the colliding resource provider that is causing ResourceProviderCreationFailed. In order to delete the placement resource provider, we must first delete any placement allocations related to it. But before deleting those allocations, we must record what they currently are so that we can restore them after deleting the resource provider. We need to restore them, or else any instances hosted on the compute host will no longer have their resources properly tracked.

You will need to get the osc-placement CLI from its own package [5].

Here are the steps to recover (a consolidated shell sketch of the whole sequence follows the numbered steps):

1. Get the resource provider UUID using the compute host name [6]:
rp_uuid=$(openstack resource provider list --name compute-0 -f value -c uuid)

2. View the allocations for the compute host [7]. This will show you all of the servers' allocations:
openstack resource provider show --allocations $rp_uuid

3. Get a list of all servers running on the host [8]:
openstack server list --host compute-0

4. For each server in the list:

4a. Get and **SAVE** the allocations for the server [9]. You will need to restore them later:
openstack --os-placement-api-version 1.12 resource provider allocation show $server_uuid

4b. Delete the allocation for the server for compute-0:

If there was only one resource provider in the allocations from step 4a, delete all of the server's allocations (this removes its allocations on every resource provider) [10]:
openstack resource provider allocation delete $server_uuid

If there were allocations against multiple resource providers in step 4a, remove only the compute-0 allocation by re-setting the server's allocations without $rp_uuid [11]:
openstack --os-placement-api-version 1.12 resource provider allocation set $server_uuid \
    --project-id <project uuid from step 4a> \
    --user-id <user uuid from step 4a> \
    --allocation rp=<other rp uuid than compute host>,<resource class from step 4a>=<value from step 4a> \
    (repeat --allocation for each remaining allocation against resource providers other than the compute host)

5. Once you have **SAVED** and deleted all the allocations for the servers on compute-0, delete the resource provider [12]:
openstack resource provider delete $rp_uuid

6. Restart nova-compute. The resource providers should be re-created successfully.

7. Get the (new) resource provider UUID using the compute host name [6]:
new_rp_uuid=$(openstack resource provider list --name compute-0 -f value -c uuid)

8. Restore all of the allocations for the servers on compute-0:

For each server in the list:

8a. Create (restore) the same allocations for the server on compute-0 that you **SAVED** from step 4a [11]:
openstack --os-placement-api-version 1.12 resource provider allocation set $server_uuid \
    --project-id <project uuid from step 4a> \
    --user-id <user uuid from step 4a> \
    --allocation rp=$new_rp_uuid,<resource class from step 4a>=<value from step 4a> \
    (repeat --allocation for each allocation you saved in step 4a)
9. The compute host should now be recovered.
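
Putting the above together, a consolidated shell sketch of the sequence might look like the following. Treat it as illustrative only: the filenames and the while-read loop are assumptions, the --allocation values must come from the output saved in step 4a, and VCPU/MEMORY_MB/DISK_GB are just the typical resource classes.

# Step 1: UUID of the old resource provider
rp_uuid=$(openstack resource provider list --name compute-0 -f value -c uuid)

# Steps 2-3: inspect the allocations and list the servers on the host
openstack resource provider show --allocations $rp_uuid
openstack server list --host compute-0 -f value -c ID > servers_on_compute-0.txt

# Step 4a: SAVE every server's current allocations before deleting anything
while read -r server_uuid; do
    openstack --os-placement-api-version 1.12 resource provider allocation show \
        $server_uuid | tee allocations_${server_uuid}.txt
done < servers_on_compute-0.txt

# Step 4b: delete (or re-set without $rp_uuid) each server's allocations as described above

# Step 5: delete the now-unreferenced resource provider
openstack resource provider delete $rp_uuid

# Steps 6-7: restart nova-compute, then fetch the newly created provider's UUID
new_rp_uuid=$(openstack resource provider list --name compute-0 -f value -c uuid)

# Step 8: restore each server's saved allocations against the new provider, for example:
openstack --os-placement-api-version 1.12 resource provider allocation set $server_uuid \
    --project-id <project uuid from step 4a> \
    --user-id <user uuid from step 4a> \
    --allocation rp=$new_rp_uuid,VCPU=<value from step 4a> \
    --allocation rp=$new_rp_uuid,MEMORY_MB=<value from step 4a> \
    --allocation rp=$new_rp_uuid,DISK_GB=<value from step 4a>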

[1] https://bugs.launchpad.net/nova/+bug/1817833
[2] https://docs.openstack.org/api-ref/compute/?expanded=delete-compute-service-detail#delete-compute-service
[3] https://bugs.launchpad.net/nova/+bug/1829479
[4] https://docs.openstack.org/nova/latest/admin/troubleshooting/orphaned-allocations.html
[5] https://brewweb.engineering.redhat.com/brew/packageinfo?packageID=70403
[6] https://docs.openstack.org/osc-placement/queens/cli/index.html#resource-provider-list
[7] https://docs.openstack.org/osc-placement/queens/cli/index.html#resource-provider-show
[8] https://docs.openstack.org/python-openstackclient/queens/cli/command-objects/server.html#server-list
[9] https://docs.openstack.org/osc-placement/queens/cli/index.html#resource-provider-allocation-show
[10] https://docs.openstack.org/osc-placement/queens/cli/index.html#resource-provider-allocation-delete
[11] https://docs.openstack.org/osc-placement/queens/cli/index.html#resource-provider-allocation-set
[12] https://docs.openstack.org/osc-placement/queens/cli/index.html#resource-provider-delete

Comment 3 melanie witt 2019-11-23 02:05:57 UTC
(In reply to melanie witt from comment #2)
> (1) Did the customer delete a compute service while its nova-compute process
> was running on the physical compute host? If so, that will result in the
> compute node record and placement resource provider being deleted out from
> under nova-compute and when subsequent server actions are attempted, they
> will fail with ResourceProviderCreationFailed [1]. Per the documentation,
> the nova-compute process should be *stopped* before deleting the service via
> the nova API [2].
> 
> If this ^ is the case, restart the nova-compute process to recover the
> problem. You may also need to run 'nova-manage cell_v2 discover_hosts' if
> the compute service was deleted more recently than the
> CONF.scheduler.discover_hosts_in_cells_interval (if servers are going to
> ERROR state on the compute host with "Host 'compute-0' is not mapped to any
> cell").

I went through the upstream bug again today and realized that the bug report [1] doesn't seem to make sense. The report shows an empty placement resource_providers table after deleting the compute service, so it's unclear how ResourceProviderCreationFailed could possibly be raised if there are no existing resource_providers to collide with.

So please ignore the quoted portion of comment 2 above.

The recovery steps listed under scenario (2) are the recommended way to recover the compute host.

[1] https://bugs.launchpad.net/nova/+bug/1817833

Comment 4 Brendan Shephard 2019-11-26 06:03:46 UTC
Hey, so the physical compute node failed in this case. We needed to migrate the VMs that were allocated on that node off onto one of the other nodes. We did this by executing: nova evacuate HOST

This is where the traceback is coming from. It's complaining that it can't migrate the VMs onto compute-0.


In parallel to this, we also completed the compute node scale-down procedure to remove the failed node. Then we added another compute node to replace it. Once the new compute node was built, we found that we were unable to delete the failed compute node's resource provider entry, as described in:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/director_installation_and_usage/sect-scaling_the_overcloud#sect-Removing_Compute_Nodes

That procedure asks you to run: openstack resource provider delete UUID_OF_REMOVED_NODE

This is also failing, so I think the two issues are likely related. When they tried to remove the old resource provider, it says that allocations exist, but all of the allocations have been removed:

$ openstack resource provider delete 8ae56eb0-bd33-4523-b996-1e9f0ae408cf
Unable to delete resource provider 8ae56eb0-bd33-4523-b996-1e9f0ae408cf: Resource provider has allocations. (HTTP 409)

They tried manually deleting the allocations as well, but still got the same issue:

for VM in $(openstack server list --all-projects -f value -c ID)
do
  if openstack resource provider allocation show -f value $VM | grep -q 8ae56eb0-bd33-4523-b996-1e9f0ae408cf
  then
    echo $VM
  fi
done
3795fd73-7bf8-46c6-9b7c-c30d8edfd94b
a137f82e-8117-47ed-969f-2ae70175fcc5
b5e6ffba-c53e-4741-8aef-367828323265
c5f2f3d8-d86f-4506-8409-f02aa718b053
3702af60-a120-489b-b74f-cbd43c3efaf6
d8159835-b82d-4a33-bcbb-bde07a481258
$openstack resource provider allocation delete 3795fd73-7bf8-46c6-9b7c-c30d8edfd94b
$openstack resource provider allocation delete a137f82e-8117-47ed-969f-2ae70175fcc5
$openstack resource provider allocation delete b5e6ffba-c53e-4741-8aef-367828323265
$ openstack resource provider allocation delete c5f2f3d8-d86f-4506-8409-f02aa718b053
$ openstack resource provider allocation delete 3702af60-a120-489b-b74f-cbd43c3efaf6
$ openstack resource provider allocation delete d8159835-b82d-4a33-bcbb-bde07a481258
$ openstack resource provider delete 8ae56eb0-bd33-4523-b996-1e9f0ae408cf
Unable to delete resource provider 8ae56eb0-bd33-4523-b996-1e9f0ae408cf: Resource provider has allocations. (HTTP 409)
$ for VM in $(openstack server list --all-projects -f value -c ID)
do
  if openstack resource provider allocation show -f value $VM | grep -q 8ae56eb0-bd33-4523-b996-1e9f0ae408cf
  then
    echo $VM
  fi
done
$ openstack resource provider delete 8ae56eb0-bd33-4523-b996-1e9f0ae408cf
Unable to delete resource provider 8ae56eb0-bd33-4523-b996-1e9f0ae408cf: Resource provider has allocations. (HTTP 409)


So this is more-or-less what you have suggested above. The only thing that hasn't been done is recreating the allocations back on the other node.

Comment 5 melanie witt 2019-11-27 02:25:38 UTC
Okay, thanks for explaining the situation details.

What you have done so far is find and delete the allocations related to all instances in the deployment. (Take caution doing something like this: it's important to save the allocations for later restoration, and listing *all* instances in the deployment and deleting all of their allocations could touch a very large number of instances.)

So you have done that, but there are still allocations related to the resource provider you need to delete.

To view the remaining allocations on the resource provider, run the following command: openstack resource provider show --allocations <rp_uuid>

In the 'allocations' field, you will see the allocations associated with the resource provider keyed by consumer UUID (which is the instance UUID).

Chances are, the instance UUIDs for those allocations no longer exist in the deployment and were orphaned at some point in the past. Verify whether those instance UUIDs are still around (openstack server show). If they are not, you can delete their allocations too.

Once all of the allocations shown in 'openstack resource provider show --allocations' are deleted, you should be able to delete the resource provider and move on with migrating the VMs.
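
A rough sketch of that check follows. It assumes the allocations field is available as JSON and that jq is installed to pull out the consumer (instance) UUIDs; adjust the parsing to the actual output if not:

rp_uuid=8ae56eb0-bd33-4523-b996-1e9f0ae408cf

# List the consumer (instance) UUIDs that still hold allocations on the old provider.
openstack resource provider show --allocations $rp_uuid -f json \
    | jq -r '.allocations | keys[]' > consumers.txt

# For each consumer, check whether the instance still exists; if it does not,
# the allocation is orphaned and can be deleted.
while read -r consumer; do
    if ! openstack server show $consumer > /dev/null 2>&1; then
        echo "orphaned consumer: $consumer"
        openstack resource provider allocation delete $consumer
    fi
done < consumers.txt

# Once no allocations remain, the resource provider itself can be deleted.
openstack resource provider delete $rp_uuid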

Comment 10 melanie witt 2024-08-28 22:47:28 UTC
There is a fair chance the root cause of this issue is the same as: https://bugzilla.redhat.com/show_bug.cgi?id=1982051 and the fix for that will be released in 17.1.4, so closing this NEXTRELEASE.