Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1413010

Summary:	unable to unshelve instances
Product:	Red Hat OpenStack	Reporter:	Pratik Pravin Bandarkar <pbandark>
Component:	openstack-nova	Assignee:	Vladik Romanovsky <vromanso>
Status:	CLOSED ERRATA	QA Contact:	awaugama
Severity:	high	Docs Contact:
Priority:	high
Version:	8.0 (Liberty)	CC:	aguetta, berrange, ccollett, cshastri, dasmith, eglynn, jhakimra, jjoyce, kchamart, mbooth, mlopes, mschuppe, pbandark, sbauza, sferdjao, sgordon, srevivo, vaggarwa, vromanso
Target Milestone:	zstream	Keywords:	OtherQA, TestOnly, Triaged, ZStream
Target Release:	8.0 (Liberty)
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	openstack-nova-12.0.6-14.el7ost	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1414965		Environment:
Last Closed:	2017-10-25 17:10:24 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1409356
Bug Blocks:	1414965

Description Pratik Pravin Bandarkar 2017-01-13 12:08:24 UTC

Description of problem:
Unshelving instances sometimes fails.

Details:

- Spawned 7 sriov instances in "prod-az" availability zone.


[stack@ibm-x3630m4-5 ~]$ nova aggregate-list
+----+------+-------------------+
| Id | Name | Availability Zone |
+----+------+-------------------+
| 2  | prod | prod-az           |
| 5  | dev  | dev-az            |
+----+------+-------------------+

[stack@ibm-x3630m4-5 ~]$ nova aggregate-details 2
+----+------+-------------------+-----------------------------------+-----------------------------+
| Id | Name | Availability Zone | Hosts                             | Metadata                    |
+----+------+-------------------+-----------------------------------+-----------------------------+
| 2  | prod | prod-az           | 'overcloud-compute-0.localdomain' | 'availability_zone=prod-az' |
+----+------+-------------------+-----------------------------------+-----------------------------+
[stack@ibm-x3630m4-5 ~]$ nova aggregate-details 5
+----+------+-------------------+-----------------------------------+----------------------------+
| Id | Name | Availability Zone | Hosts                             | Metadata                   |
+----+------+-------------------+-----------------------------------+----------------------------+
| 5  | dev  | dev-az            | 'overcloud-compute-1.localdomain' | 'availability_zone=dev-az' |
+----+------+-------------------+-----------------------------------+----------------------------+


[stack@ibm-x3630m4-5 ~]$ y=0; for i in $(neutron port-list |grep -i sr- |awk {'print $2'}); do ((y++)) ;nova boot --image RHEL7 --flavor 11  --nic port-id=$i --availability-zone prod-az pbandark-$y ; done 

[stack@ibm-x3630m4-5 ~]$ nova list
+--------------------------------------+------------+--------+------------+-------------+---------------------+
| ID                                   | Name       | Status | Task State | Power State | Networks            |
+--------------------------------------+------------+--------+------------+-------------+---------------------+
| 398ea645-03e8-4920-90c8-532980021cbe | pbandark-1 | ACTIVE | -          | Running     | sriov=10.65.199.199 |
| 4983827f-bf98-4749-80fb-ec2e64c5619a | pbandark-2 | ACTIVE | -          | Running     | sriov=10.65.199.201 |
| 34c6c4e5-cf35-440e-b4c1-d9246d97da2d | pbandark-3 | ACTIVE | -          | Running     | sriov=10.65.199.202 |
| 0511f414-0392-4188-b3af-b5697525d19b | pbandark-4 | ACTIVE | -          | Running     | sriov=10.65.199.197 |
| fab4c569-2e21-4616-894c-2e6d35eb2e0d | pbandark-5 | ACTIVE | -          | Running     | sriov=10.65.199.196 |
| 8691b590-e5fb-4904-a615-0cd5c17fc7fc | pbandark-6 | ACTIVE | -          | Running     | sriov=10.65.199.200 |
| 895d65bb-9f33-4baf-a729-954a8d276339 | pbandark-7 | ACTIVE | -          | Running     | sriov=10.65.199.203 |
+--------------------------------------+------------+--------+------------+-------------+---------------------+


[stack@ibm-x3630m4-5 ~]$  nova list |awk {'print $4'}|egrep -v '^$|Name' |xargs -i nova show {} |egrep -i zone |awk  {'print $4'}
prod-az
prod-az
prod-az
prod-az
prod-az
prod-az
prod-az
prod-az
prod-az

- shelved all instances:

[stack@ibm-x3630m4-5 ~]$ nova list |awk {'print $2'} |egrep -v '^$|ID' |xargs -i nova shelve {}
[stack@ibm-x3630m4-5 ~]$ 
[stack@ibm-x3630m4-5 ~]$ nova list
+--------------------------------------+------------+-------------------+------------+-------------+---------------------+
| ID                                   | Name       | Status            | Task State | Power State | Networks            |
+--------------------------------------+------------+-------------------+------------+-------------+---------------------+
| 398ea645-03e8-4920-90c8-532980021cbe | pbandark-1 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.199 |
| 4983827f-bf98-4749-80fb-ec2e64c5619a | pbandark-2 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.201 |
| 34c6c4e5-cf35-440e-b4c1-d9246d97da2d | pbandark-3 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.202 |
| 0511f414-0392-4188-b3af-b5697525d19b | pbandark-4 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.197 |
| fab4c569-2e21-4616-894c-2e6d35eb2e0d | pbandark-5 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.196 |
| 8691b590-e5fb-4904-a615-0cd5c17fc7fc | pbandark-6 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.200 |
| 895d65bb-9f33-4baf-a729-954a8d276339 | pbandark-7 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.203 |
| 6bb91bd7-1495-4de9-96df-d2e15a90886e | pbandark-8 | ERROR             | -          | NOSTATE     |                     |
+--------------------------------------+------------+-------------------+------------+-------------+---------------------+

- unshelved the instances:

[stack@ibm-x3630m4-5 ~]$ nova list |awk {'print $2'} |egrep -v '^$|ID' |xargs -i nova unshelve {}^C
[stack@ibm-x3630m4-5 ~]$ nova list
+--------------------------------------+------------+-------------------+------------+-------------+---------------------+
| ID                                   | Name       | Status            | Task State | Power State | Networks            |
+--------------------------------------+------------+-------------------+------------+-------------+---------------------+
| 398ea645-03e8-4920-90c8-532980021cbe | pbandark-1 | ACTIVE            | -          | Running     | sriov=10.65.199.199 |
| 4983827f-bf98-4749-80fb-ec2e64c5619a | pbandark-2 | ACTIVE            | -          | Running     | sriov=10.65.199.201 |
| 34c6c4e5-cf35-440e-b4c1-d9246d97da2d | pbandark-3 | ACTIVE            | -          | Running     | sriov=10.65.199.202 |
| 0511f414-0392-4188-b3af-b5697525d19b | pbandark-4 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.197 |
| fab4c569-2e21-4616-894c-2e6d35eb2e0d | pbandark-5 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.196 |
| 8691b590-e5fb-4904-a615-0cd5c17fc7fc | pbandark-6 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.200 |
| 895d65bb-9f33-4baf-a729-954a8d276339 | pbandark-7 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.203 |
| 6bb91bd7-1495-4de9-96df-d2e15a90886e | pbandark-8 | ERROR             | -          | NOSTATE     |                     |
+--------------------------------------+------------+-------------------+------------+-------------+---------------------+


^^^^^ The unshelve operation failed for some instances (pbandark-4 through pbandark-7 remain SHELVED_OFFLOADED).
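
To see at a glance which instances are still shelved after an unshelve attempt, the `nova list` table can be parsed. The following is only a minimal sketch: the helper names `parse_nova_list` and `still_shelved` are hypothetical, and the sample text is a subset of the rows shown above.

```python
def parse_nova_list(table_text):
    """Return (name, status) pairs from the text of a `nova list` table."""
    rows = []
    for line in table_text.splitlines():
        line = line.strip()
        # Skip border lines ("+----+") and anything that is not a table row.
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the header row and malformed rows.
        if len(cells) < 3 or cells[0] == "ID":
            continue
        rows.append((cells[1], cells[2]))
    return rows


def still_shelved(table_text):
    """Names of instances whose status is still SHELVED_OFFLOADED."""
    return [name for name, status in parse_nova_list(table_text)
            if status == "SHELVED_OFFLOADED"]


# Subset of the table from this report, used as sample input.
SAMPLE = """\
| ID                                   | Name       | Status            | Task State | Power State | Networks            |
| 34c6c4e5-cf35-440e-b4c1-d9246d97da2d | pbandark-3 | ACTIVE            | -          | Running     | sriov=10.65.199.202 |
| 0511f414-0392-4188-b3af-b5697525d19b | pbandark-4 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.197 |
| 6bb91bd7-1495-4de9-96df-d2e15a90886e | pbandark-8 | ERROR             | -          | NOSTATE     |                     |
"""

print(still_shelved(SAMPLE))
```

Run against the full table above, this would list the instances for which unshelve did not complete.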

- From compute logs:

079s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:265
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [req-1848197b-3436-465e-9a0d-6e68285f0d2d cb39646c878442868eec409a98126fc5 a448234ace054e5ab0635dd6e12d0992 - - -] [instance: 0511f414-0392-4188-b3af-b5697525d19b] Instance failed to spawn
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b] Traceback (most recent call last):
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 4346, in _unshelve_instance
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b]     with rt.instance_claim(context, instance, limits):
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b]   File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 254, in inner
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b]     return f(*args, **kwargs)
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b]   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 173, in instance_claim
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b]     overhead=overhead, limits=limits)
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b]   File "/usr/lib/python2.7/site-packages/nova/compute/claims.py", line 90, in __init__
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b]     self._claim_test(resources, limits)
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b]   File "/usr/lib/python2.7/site-packages/nova/compute/claims.py", line 147, in _claim_test
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b]     "; ".join(reasons))
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b] ComputeResourcesUnavailable: Insufficient compute resources: Claim pci failed..


2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher [req-1848197b-3436-465e-9a0d-6e68285f0d2d cb39646c878442868eec409a98126fc5 a448234ace054e5ab0635dd6e12d0992 - - -] Exception during message handling: Insufficient compute resources: Claim pci failed..
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher Traceback (most recent call last):
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 142, in _dispatch_and_reply
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     executor_callback))
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 186, in _dispatch
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     executor_callback)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 129, in _do_dispatch
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     result = func(ctxt, **new_args)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/exception.py", line 89, in wrapped
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     payload)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 204, in __exit__
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     six.reraise(self.type_, self.value, self.tb)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/exception.py", line 72, in wrapped
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     return f(self, context, *args, **kw)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 350, in decorated_function
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     LOG.warning(msg, e, instance=instance)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 204, in __exit__
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     six.reraise(self.type_, self.value, self.tb)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 323, in decorated_function
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     return function(self, context, *args, **kwargs)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 400, in decorated_function
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     return function(self, context, *args, **kwargs)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 378, in decorated_function
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     kwargs['instance'], e, sys.exc_info())
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 204, in __exit__
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     six.reraise(self.type_, self.value, self.tb)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 366, in decorated_function
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     return function(self, context, *args, **kwargs)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 4299, in unshelve_instance
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     do_unshelve_instance()
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 254, in inner
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     return f(*args, **kwargs)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 4298, in do_unshelve_instance
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     filter_properties, node)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 4355, in _unshelve_instance
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     instance=instance)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 204, in __exit__
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     six.reraise(self.type_, self.value, self.tb)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 4346, in _unshelve_instance
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     with rt.instance_claim(context, instance, limits):
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 254, in inner
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     return f(*args, **kwargs)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 173, in instance_claim
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     overhead=overhead, limits=limits)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/claims.py", line 90, in __init__
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     self._claim_test(resources, limits)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/claims.py", line 147, in _claim_test
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     "; ".join(reasons))
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher ComputeResourcesUnavailable: Insufficient compute resources: Claim pci failed..
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher 
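
The traceback ends in `_claim_test`, which joins a list of failure reasons with "; " and raises `ComputeResourcesUnavailable`. The sketch below is a simplified model of that pattern, not nova's code: the class is a stand-in, the message format is an assumption inferred from the log line, and `requested_vfs` / `free_vfs` are hypothetical parameters.

```python
class ComputeResourcesUnavailable(Exception):
    """Stand-in for the nova exception of the same name.

    The message format (with its trailing period) is an assumption
    inferred from the log output in this report.
    """

    def __init__(self, reason):
        super().__init__("Insufficient compute resources: %s." % reason)


def claim_test(requested_vfs, free_vfs):
    """Collect every failed check, then raise once with all reasons joined."""
    reasons = []
    # Hypothetical PCI check: more devices requested than are free on the host.
    if requested_vfs > free_vfs:
        reasons.append("Claim pci failed.")
    if reasons:
        raise ComputeResourcesUnavailable("; ".join(reasons))


try:
    claim_test(requested_vfs=1, free_vfs=0)
except ComputeResourcesUnavailable as exc:
    print(exc)
```

Under these assumptions, the doubled period in "Claim pci failed.." would simply be the reason's own period followed by the one in the message format.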




Version-Release number of selected component (if applicable):
RHOS8

How reproducible:



Actual results:
The instance unshelve operation sometimes fails.

Expected results:
The unshelve operation should succeed.

Additional info:

Comment 2 Vladik Romanovsky 2017-01-26 02:28:03 UTC

Hello,

Apologies for the delay.
Unfortunately, I didn't find any mention of the provided traces in the logs.
These traces are from 2017-01-03, but the attached logs contain only activity from 2017-01-13.

Would it be possible to reproduce the issue and capture the logs right after it occurs?

Also, in OSP8 we didn't allocate new PCI devices during unshelve/rebuild/evacuate operations.
We also don't update the neutron port binding, which holds the PCI address of the device; the nova libvirt driver uses it to configure the virtual interfaces. This is done in [1].

However, that change relies on work done across two cycles (Mitaka and Newton) that introduced a migration context object and caused resources to be claimed and allocated during the above operations ([2] and [3]).
These patches are not backportable, due to RPC and object changes.


[1] https://review.openstack.org/#/c/242573
[2] https://review.openstack.org/#/q/topic:bug/1417667
[3] https://review.openstack.org/#/q/topic:bp/migration-fix-resource-tracking

Comment 17 Sahid Ferdjaoui 2017-04-05 11:08:37 UTC

I may have a path to follow...

As Vladik indicated in comment #14, the filter 'pci_passthrough_filter.py' returns true if it does not find any pci_requests attached to the instance.

My thinking is that when the API loads the instance and passes it to the compute API, the conductor, and finally the scheduler, the instance does not have the 'pci_requests' attribute loaded. As a result, the instance can be scheduled to a compute node which can't accept the request.

This is the patch I would propose. I could still provide a test build if the customer prefers.

diff --git a/nova/api/openstack/compute/shelve.py b/nova/api/openstack/compute/shelve.py
index 6f9f8ae..2f31554 100644
--- a/nova/api/openstack/compute/shelve.py
+++ b/nova/api/openstack/compute/shelve.py
@@ -59,7 +59,8 @@ class ShelveController(wsgi.Controller):
         context = req.environ["nova.context"]
         authorize(context, action='shelve_offload')
 
-        instance = common.get_instance(self.compute_api, context, id)
+        instance = common.get_instance(
+            self.compute_api, context, id, expected_attrs=['pci_requests'])
         try:
             self.compute_api.shelve_offload(context, instance)
         except exception.InstanceUnknownCell as e:
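
The effect described above can be illustrated with a simplified model of the filter decision. This is only a sketch and not the code of 'pci_passthrough_filter.py': the function `pci_filter_passes` and its parameters are hypothetical, and PCI requests are reduced to a list of requested device counts.

```python
def pci_filter_passes(host_free_devices, pci_requests):
    """Simplified model of a PCI passthrough filter decision.

    pci_requests is a list of requested device counts, or None when the
    attribute was never loaded on the instance.
    """
    if not pci_requests:
        # Nothing requested as far as the filter can tell: every host passes.
        return True
    return sum(pci_requests) <= host_free_devices


# A host with no free devices passes when pci_requests was not loaded...
print(pci_filter_passes(host_free_devices=0, pci_requests=None))
# ...and is rejected once the request is actually visible to the filter.
print(pci_filter_passes(host_free_devices=0, pci_requests=[1]))
```

In this model, loading 'pci_requests' together with the instance is what allows the filter to reject a host that has no free device.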

Comment 18 Sahid Ferdjaoui 2017-04-05 11:19:34 UTC

Another way to "fix" the issue (if that is really the root cause) would be to replace that part of the code [0] with a call to the database to get the pci_requests related to the scheduled instance. However, I would say that creates a larger overhead, since the database is hit for each instance scheduled.

[0] https://code.engineering.redhat.com/gerrit/gitweb?p=nova.git;a=blob;f=nova/scheduler/filter_scheduler.py;h=ec986252f49f60640b8d75f8162b6a39aa640fd1;hb=refs/heads/rhos-8.0-patches#l114

Comment 19 Sylvain Bauza 2017-04-05 13:30:20 UTC

As I said in comment #16, the problem is that the instance we get when calling unshelve does not have the pci_requests field set.

So, yeah, I definitely agree with the proposal in comment #17 to load the PCI bits when getting the instance.
To be clear, this issue is not present in OSP9, because there we should get the original RequestSpec record, which includes the pci_requests field, when calling unshelve. *But* the upstream Gerrit change I mentioned in comment #16 is not backportable, given the many RPC changes and DB modifications that feature involves.

About comment #18, I disagree with making such a modification in the filter. Conceptually, we don't want to query the Nova DB when running the filters, mostly for performance reasons (in particular the instances table, which is very large).

Comment 38 errata-xmlrpc 2017-10-25 17:10:24 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3068

Comment 39 Red Hat Bugzilla 2023-09-14 03:37:24 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days