Bug 1413010 - unable to unshelve instances [NEEDINFO]
Summary: unable to unshelve instances
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 8.0 (Liberty)
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: zstream
Target Release: 8.0 (Liberty)
Assignee: Vladik Romanovsky
QA Contact: awaugama
URL:
Whiteboard:
Depends On: 1409356
Blocks: 1414965
Reported: 2017-01-13 12:08 UTC by Pratik Pravin Bandarkar
Modified: 2020-06-11 13:11 UTC (History)

Fixed In Version: openstack-nova-12.0.6-14.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1414965 (view as bug list)
Environment:
Last Closed: 2017-10-25 17:10:24 UTC
Target Upstream Version:
ccollett: needinfo? (vromanso)
mlopes: needinfo? (vromanso)



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Knowledge Base (Solution) | 2983771 | 0 | None | None | None | 2017-03-28 11:23:14 UTC
Red Hat Product Errata | RHBA-2017:3068 | 0 | normal | SHIPPED_LIVE | openstack-nova bug fix advisory | 2017-10-25 21:05:11 UTC

Description Pratik Pravin Bandarkar 2017-01-13 12:08:24 UTC
Description of problem:
Instances sometimes cannot be unshelved.

Details:

- Spawned 7 SR-IOV instances in the "prod-az" availability zone.


[stack@ibm-x3630m4-5 ~]$ nova aggregate-list
+----+------+-------------------+
| Id | Name | Availability Zone |
+----+------+-------------------+
| 2  | prod | prod-az           |
| 5  | dev  | dev-az            |
+----+------+-------------------+

[stack@ibm-x3630m4-5 ~]$ nova aggregate-details 2
+----+------+-------------------+-----------------------------------+-----------------------------+
| Id | Name | Availability Zone | Hosts                             | Metadata                    |
+----+------+-------------------+-----------------------------------+-----------------------------+
| 2  | prod | prod-az           | 'overcloud-compute-0.localdomain' | 'availability_zone=prod-az' |
+----+------+-------------------+-----------------------------------+-----------------------------+
[stack@ibm-x3630m4-5 ~]$ nova aggregate-details 5
+----+------+-------------------+-----------------------------------+----------------------------+
| Id | Name | Availability Zone | Hosts                             | Metadata                   |
+----+------+-------------------+-----------------------------------+----------------------------+
| 5  | dev  | dev-az            | 'overcloud-compute-1.localdomain' | 'availability_zone=dev-az' |
+----+------+-------------------+-----------------------------------+----------------------------+


[stack@ibm-x3630m4-5 ~]$ y=0; for i in $(neutron port-list |grep -i sr- |awk {'print $2'}); do ((y++)) ;nova boot --image RHEL7 --flavor 11  --nic port-id=$i --availability-zone prod-az pbandark-$y ; done 

[stack@ibm-x3630m4-5 ~]$ nova list
+--------------------------------------+------------+--------+------------+-------------+---------------------+
| ID                                   | Name       | Status | Task State | Power State | Networks            |
+--------------------------------------+------------+--------+------------+-------------+---------------------+
| 398ea645-03e8-4920-90c8-532980021cbe | pbandark-1 | ACTIVE | -          | Running     | sriov=10.65.199.199 |
| 4983827f-bf98-4749-80fb-ec2e64c5619a | pbandark-2 | ACTIVE | -          | Running     | sriov=10.65.199.201 |
| 34c6c4e5-cf35-440e-b4c1-d9246d97da2d | pbandark-3 | ACTIVE | -          | Running     | sriov=10.65.199.202 |
| 0511f414-0392-4188-b3af-b5697525d19b | pbandark-4 | ACTIVE | -          | Running     | sriov=10.65.199.197 |
| fab4c569-2e21-4616-894c-2e6d35eb2e0d | pbandark-5 | ACTIVE | -          | Running     | sriov=10.65.199.196 |
| 8691b590-e5fb-4904-a615-0cd5c17fc7fc | pbandark-6 | ACTIVE | -          | Running     | sriov=10.65.199.200 |
| 895d65bb-9f33-4baf-a729-954a8d276339 | pbandark-7 | ACTIVE | -          | Running     | sriov=10.65.199.203 |
+--------------------------------------+------------+--------+------------+-------------+---------------------+


[stack@ibm-x3630m4-5 ~]$  nova list |awk {'print $4'}|egrep -v '^$|Name' |xargs -i nova show {} |egrep -i zone |awk  {'print $4'}
prod-az
prod-az
prod-az
prod-az
prod-az
prod-az
prod-az
prod-az
prod-az

- Shelved all instances:

[stack@ibm-x3630m4-5 ~]$ nova list |awk {'print $2'} |egrep -v '^$|ID' |xargs -i nova shelve {}
[stack@ibm-x3630m4-5 ~]$ 
[stack@ibm-x3630m4-5 ~]$ nova list
+--------------------------------------+------------+-------------------+------------+-------------+---------------------+
| ID                                   | Name       | Status            | Task State | Power State | Networks            |
+--------------------------------------+------------+-------------------+------------+-------------+---------------------+
| 398ea645-03e8-4920-90c8-532980021cbe | pbandark-1 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.199 |
| 4983827f-bf98-4749-80fb-ec2e64c5619a | pbandark-2 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.201 |
| 34c6c4e5-cf35-440e-b4c1-d9246d97da2d | pbandark-3 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.202 |
| 0511f414-0392-4188-b3af-b5697525d19b | pbandark-4 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.197 |
| fab4c569-2e21-4616-894c-2e6d35eb2e0d | pbandark-5 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.196 |
| 8691b590-e5fb-4904-a615-0cd5c17fc7fc | pbandark-6 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.200 |
| 895d65bb-9f33-4baf-a729-954a8d276339 | pbandark-7 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.203 |
| 6bb91bd7-1495-4de9-96df-d2e15a90886e | pbandark-8 | ERROR             | -          | NOSTATE     |                     |
+--------------------------------------+------------+-------------------+------------+-------------+---------------------+

- Unshelved the instances:

[stack@ibm-x3630m4-5 ~]$ nova list |awk {'print $2'} |egrep -v '^$|ID' |xargs -i nova unshelve {}^C
[stack@ibm-x3630m4-5 ~]$ nova list
+--------------------------------------+------------+-------------------+------------+-------------+---------------------+
| ID                                   | Name       | Status            | Task State | Power State | Networks            |
+--------------------------------------+------------+-------------------+------------+-------------+---------------------+
| 398ea645-03e8-4920-90c8-532980021cbe | pbandark-1 | ACTIVE            | -          | Running     | sriov=10.65.199.199 |
| 4983827f-bf98-4749-80fb-ec2e64c5619a | pbandark-2 | ACTIVE            | -          | Running     | sriov=10.65.199.201 |
| 34c6c4e5-cf35-440e-b4c1-d9246d97da2d | pbandark-3 | ACTIVE            | -          | Running     | sriov=10.65.199.202 |
| 0511f414-0392-4188-b3af-b5697525d19b | pbandark-4 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.197 |
| fab4c569-2e21-4616-894c-2e6d35eb2e0d | pbandark-5 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.196 |
| 8691b590-e5fb-4904-a615-0cd5c17fc7fc | pbandark-6 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.200 |
| 895d65bb-9f33-4baf-a729-954a8d276339 | pbandark-7 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.203 |
| 6bb91bd7-1495-4de9-96df-d2e15a90886e | pbandark-8 | ERROR             | -          | NOSTATE     |                     |
+--------------------------------------+------------+-------------------+------------+-------------+---------------------+


^^^^^ The operation failed for a few instances: pbandark-4 through pbandark-7 remained in SHELVED_OFFLOADED.

- From compute logs:

079s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:265
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [req-1848197b-3436-465e-9a0d-6e68285f0d2d cb39646c878442868eec409a98126fc5 a448234ace054e5ab0635dd6e12d0992 - - -] [instance: 0511f414-0392-4188-b3af-b5697525d19b] Instance failed to spawn
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b] Traceback (most recent call last):
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 4346, in _unshelve_instance
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b]     with rt.instance_claim(context, instance, limits):
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b]   File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 254, in inner
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b]     return f(*args, **kwargs)
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b]   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 173, in instance_claim
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b]     overhead=overhead, limits=limits)
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b]   File "/usr/lib/python2.7/site-packages/nova/compute/claims.py", line 90, in __init__
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b]     self._claim_test(resources, limits)
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b]   File "/usr/lib/python2.7/site-packages/nova/compute/claims.py", line 147, in _claim_test
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b]     "; ".join(reasons))
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b] ComputeResourcesUnavailable: Insufficient compute resources: Claim pci failed..


2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher [req-1848197b-3436-465e-9a0d-6e68285f0d2d cb39646c878442868eec409a98126fc5 a448234ace054e5ab0635dd6e12d0992 - - -] Exception during message handling: Insufficient compute resources: Claim pci failed..
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher Traceback (most recent call last):
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 142, in _dispatch_and_reply
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     executor_callback))
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 186, in _dispatch
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     executor_callback)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 129, in _do_dispatch
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     result = func(ctxt, **new_args)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/exception.py", line 89, in wrapped
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     payload)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 204, in __exit__
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     six.reraise(self.type_, self.value, self.tb)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/exception.py", line 72, in wrapped
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     return f(self, context, *args, **kw)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 350, in decorated_function
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     LOG.warning(msg, e, instance=instance)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 204, in __exit__
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     six.reraise(self.type_, self.value, self.tb)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 323, in decorated_function
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     return function(self, context, *args, **kwargs)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 400, in decorated_function
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     return function(self, context, *args, **kwargs)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 378, in decorated_function
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     kwargs['instance'], e, sys.exc_info())
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 204, in __exit__
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     six.reraise(self.type_, self.value, self.tb)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 366, in decorated_function
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     return function(self, context, *args, **kwargs)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 4299, in unshelve_instance
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     do_unshelve_instance()
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 254, in inner
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     return f(*args, **kwargs)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 4298, in do_unshelve_instance
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     filter_properties, node)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 4355, in _unshelve_instance
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     instance=instance)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 204, in __exit__
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     six.reraise(self.type_, self.value, self.tb)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 4346, in _unshelve_instance
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     with rt.instance_claim(context, instance, limits):
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 254, in inner
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     return f(*args, **kwargs)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 173, in instance_claim
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     overhead=overhead, limits=limits)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/claims.py", line 90, in __init__
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     self._claim_test(resources, limits)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/claims.py", line 147, in _claim_test
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     "; ".join(reasons))
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher ComputeResourcesUnavailable: Insufficient compute resources: Claim pci failed..
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher 
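
When triaging compute logs like the ones above, it can help to extract which instances hit the failed claim. The following is a minimal sketch, assuming the log line layout shown in this report; the regular expression and the `failed_claims` helper are illustrative and not part of nova.

```python
import re

# Assumes nova-compute lines shaped like the ones quoted in this report:
#   ... [instance: <uuid>] ComputeResourcesUnavailable: <reason>
PATTERN = re.compile(
    r"\[instance: (?P<uuid>[0-9a-f-]{36})\] "
    r"ComputeResourcesUnavailable: (?P<reason>.*)$"
)


def failed_claims(lines):
    """Map instance UUID -> failure reason for every failed resource claim."""
    result = {}
    for line in lines:
        match = PATTERN.search(line)
        if match:
            result[match.group("uuid")] = match.group("reason").strip()
    return result


# Two lines copied from the traceback above; only the second one matches.
sample = [
    '2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager '
    '[instance: 0511f414-0392-4188-b3af-b5697525d19b]     "; ".join(reasons))',
    '2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager '
    '[instance: 0511f414-0392-4188-b3af-b5697525d19b] '
    'ComputeResourcesUnavailable: Insufficient compute resources: Claim pci failed..',
]

for uuid, reason in failed_claims(sample).items():
    print(uuid, "->", reason)
```

In practice the same function could be fed the lines of a nova-compute log file instead of the inline sample.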




Version-Release number of selected component (if applicable):
RHOS8

How reproducible:



Actual results:
The instance unshelve operation sometimes fails.

Expected results:
The unshelve operation should succeed.

Additional info:

Comment 2 Vladik Romanovsky 2017-01-26 02:28:03 UTC
Hello,

Apologies for the delay.
Unfortunately, I didn't find any mention of the provided traces in the logs.
These traces are from 2017-01-03, but the attached logs contain only activity from 2017-01-13.

Would it be possible to reproduce the issue and capture the logs right after it occurs?

Also, in OSP8 we do not allocate new PCI devices during unshelve/rebuild/evacuate operations.
Nor do we update the neutron port binding, which holds the PCI address of the device; the nova libvirt driver uses that address to configure the virtual interfaces. The upstream change [1] does both.
 
However, that change relies on work done across two cycles (Mitaka and Newton), which introduced a migration context object and made resources be claimed and allocated during the above operations ([2] and [3]).
These patches are not backportable because of RPC and object changes.


[1] https://review.openstack.org/#/c/242573
[2] https://review.openstack.org/#/q/topic:bug/1417667
[3] https://review.openstack.org/#/q/topic:bp/migration-fix-resource-tracking

Comment 17 Sahid Ferdjaoui 2017-04-05 11:08:37 UTC
I may have a path to follow...

As Vladik indicated in comment #14, the filter 'pci_passthrough_filter.py' returns true if it does not find any pci_requests attached to the instance.

My thinking is that when the API loads the instance and passes it to the compute API, the conductor, and finally the scheduler, the instance does not have the 'pci_requests' attribute loaded. As a result, the instance can be scheduled onto a compute node that cannot satisfy the request.

This is the patch I would propose. I can still provide a test build if the customer prefers.

diff --git a/nova/api/openstack/compute/shelve.py b/nova/api/openstack/compute/shelve.py
index 6f9f8ae..2f31554 100644
--- a/nova/api/openstack/compute/shelve.py
+++ b/nova/api/openstack/compute/shelve.py
@@ -59,7 +59,8 @@ class ShelveController(wsgi.Controller):
         context = req.environ["nova.context"]
         authorize(context, action='shelve_offload')
 
-        instance = common.get_instance(self.compute_api, context, id)
+        instance = common.get_instance(
+            self.compute_api, context, id, expected_attrs=['pci_requests'])
         try:
             self.compute_api.shelve_offload(context, instance)
         except exception.InstanceUnknownCell as e:
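
The suspected failure mode can be illustrated with a small, self-contained sketch. This is not the Nova code: `FakeHost`, `host_passes`, and the dict-based instances are hypothetical stand-ins, and the only behaviour taken from this comment is that the filter returns true when it finds no pci_requests on the instance.

```python
# Hypothetical stand-in for the scheduler's PCI filter; not Nova's actual code.
class FakeHost:
    def __init__(self, name, free_vfs):
        self.name = name
        self.free_vfs = free_vfs  # number of free SR-IOV virtual functions


def host_passes(host, instance):
    """Return True if the host can satisfy the instance's PCI requests."""
    pci_requests = instance.get("pci_requests")
    if not pci_requests:
        # No requests visible: the filter accepts every host.
        return True
    return host.free_vfs >= len(pci_requests)


full_host = FakeHost("compute-0", free_vfs=0)

# Instance loaded without its PCI requests (the suspected situation).
without_requests = {"uuid": "example", "pci_requests": None}
# Same instance with the attribute populated.
with_requests = {"uuid": "example", "pci_requests": [{"count": 1}]}

print(host_passes(full_host, without_requests))  # True
print(host_passes(full_host, with_requests))     # False
```

With the attribute missing, a host with no free devices is accepted by the filter, and the failure would only surface later, when the compute node tries to claim the device.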

Comment 18 Sahid Ferdjaoui 2017-04-05 11:19:34 UTC
Another way to "fix" the issue (if that is really the root cause) would be to replace that part of the code [0] with a database call that fetches the pci_requests for the instance being scheduled. I would say that creates a larger overhead, though, since the database would be hit for each instance scheduled.

[0] https://code.engineering.redhat.com/gerrit/gitweb?p=nova.git;a=blob;f=nova/scheduler/filter_scheduler.py;h=ec986252f49f60640b8d75f8162b6a39aa640fd1;hb=refs/heads/rhos-8.0-patches#l114

Comment 19 Sylvain Bauza 2017-04-05 13:30:20 UTC
As I said in comment #16, the problem is that the instance we get when calling unshelve does not have the pci_requests field set.

So, yes, I definitely agree with the proposal in comment #17 to load the PCI bits when getting the instance.
To be clear, this issue is not present in OSP9, because there we should get the original RequestSpec record, which includes the pci_requests field, when calling unshelve. However, the upstream Gerrit change I commented on in comment #16 is not backportable, given the many RPC changes and DB modifications that feature involves.

About comment #18, I disagree with making such a modification in the filter. Conceptually, and mostly for performance reasons, we don't want to query the Nova DB when we run through the filters (in particular the instances table, which is very large).
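
The trade-off between the two proposals can be sketched roughly as follows. Everything here is hypothetical (`FakeDB`, `pick_host`, and both `schedule_*` functions are illustrative, not Nova code), and the sketch counts only lookups made from the scheduling path; it does not model the cost of loading the attribute at the API layer.

```python
# Hypothetical sketch; none of these names exist in Nova.
class FakeDB:
    """Counts how many times the scheduling path asks for PCI requests."""

    def __init__(self, rows):
        self.rows = rows
        self.hits = 0

    def get_pci_requests(self, uuid):
        self.hits += 1
        return self.rows.get(uuid, [])


def pick_host(hosts, pci_requests):
    """Return the first host with enough free devices, or None."""
    for name, free in hosts.items():
        if free >= len(pci_requests):
            return name
    return None


def schedule_with_lookup(db, uuids, hosts):
    # Comment #18 approach: fetch the requests from the database per instance.
    return [pick_host(hosts, db.get_pci_requests(u)) for u in uuids]


def schedule_with_preloaded(instances, hosts):
    # Comment #17 approach: the requests arrive already loaded on the instance.
    return [pick_host(hosts, i["pci_requests"]) for i in instances]


hosts = {"compute-0": 0, "compute-1": 2}
rows = {"vm-%d" % n: [{"count": 1}] for n in range(1, 8)}

db = FakeDB(rows)
looked_up = schedule_with_lookup(db, sorted(rows), hosts)
preloaded = schedule_with_preloaded(
    [{"uuid": u, "pci_requests": rows[u]} for u in sorted(rows)], hosts)

print(looked_up == preloaded, db.hits)
```

Both variants pick the same hosts, but the lookup variant records one database hit per scheduled instance (seven in this example), which is the overhead being objected to.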

Comment 38 errata-xmlrpc 2017-10-25 17:10:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3068

