Description of problem:
Sometimes unable to unshelve instances.

Details:
- Spawned 7 SR-IOV instances in the "prod-az" availability zone.

[stack@ibm-x3630m4-5 ~]$ nova aggregate-list
+----+------+-------------------+
| Id | Name | Availability Zone |
+----+------+-------------------+
| 2  | prod | prod-az           |
| 5  | dev  | dev-az            |
+----+------+-------------------+

[stack@ibm-x3630m4-5 ~]$ nova aggregate-details 2
+----+------+-------------------+-----------------------------------+-----------------------------+
| Id | Name | Availability Zone | Hosts                             | Metadata                    |
+----+------+-------------------+-----------------------------------+-----------------------------+
| 2  | prod | prod-az           | 'overcloud-compute-0.localdomain' | 'availability_zone=prod-az' |
+----+------+-------------------+-----------------------------------+-----------------------------+

[stack@ibm-x3630m4-5 ~]$ nova aggregate-details 5
+----+------+-------------------+-----------------------------------+----------------------------+
| Id | Name | Availability Zone | Hosts                             | Metadata                   |
+----+------+-------------------+-----------------------------------+----------------------------+
| 5  | dev  | dev-az            | 'overcloud-compute-1.localdomain' | 'availability_zone=dev-az' |
+----+------+-------------------+-----------------------------------+----------------------------+

[stack@ibm-x3630m4-5 ~]$ y=0; for i in $(neutron port-list |grep -i sr- |awk {'print $2'}); do ((y++)) ;nova boot --image RHEL7 --flavor 11 --nic port-id=$i --availability-zone prod-az pbandark-$y ; done

[stack@ibm-x3630m4-5 ~]$ nova list
+--------------------------------------+------------+--------+------------+-------------+---------------------+
| ID                                   | Name       | Status | Task State | Power State | Networks            |
+--------------------------------------+------------+--------+------------+-------------+---------------------+
| 398ea645-03e8-4920-90c8-532980021cbe | pbandark-1 | ACTIVE | -          | Running     | sriov=10.65.199.199 |
| 4983827f-bf98-4749-80fb-ec2e64c5619a | pbandark-2 | ACTIVE | -          | Running     | sriov=10.65.199.201 |
| 34c6c4e5-cf35-440e-b4c1-d9246d97da2d | pbandark-3 | ACTIVE | -          | Running     | sriov=10.65.199.202 |
| 0511f414-0392-4188-b3af-b5697525d19b | pbandark-4 | ACTIVE | -          | Running     | sriov=10.65.199.197 |
| fab4c569-2e21-4616-894c-2e6d35eb2e0d | pbandark-5 | ACTIVE | -          | Running     | sriov=10.65.199.196 |
| 8691b590-e5fb-4904-a615-0cd5c17fc7fc | pbandark-6 | ACTIVE | -          | Running     | sriov=10.65.199.200 |
| 895d65bb-9f33-4baf-a729-954a8d276339 | pbandark-7 | ACTIVE | -          | Running     | sriov=10.65.199.203 |
+--------------------------------------+------------+--------+------------+-------------+---------------------+

[stack@ibm-x3630m4-5 ~]$ nova list |awk {'print $4'}|egrep -v '^$|Name' |xargs -i nova show {} |egrep -i zone |awk {'print $4'}
prod-az
prod-az
prod-az
prod-az
prod-az
prod-az
prod-az
prod-az
prod-az

- Shelved all instances:

[stack@ibm-x3630m4-5 ~]$ nova list |awk {'print $2'} |egrep -v '^$|ID' |xargs -i nova shelve {}
[stack@ibm-x3630m4-5 ~]$
[stack@ibm-x3630m4-5 ~]$ nova list
+--------------------------------------+------------+-------------------+------------+-------------+---------------------+
| ID                                   | Name       | Status            | Task State | Power State | Networks            |
+--------------------------------------+------------+-------------------+------------+-------------+---------------------+
| 398ea645-03e8-4920-90c8-532980021cbe | pbandark-1 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.199 |
| 4983827f-bf98-4749-80fb-ec2e64c5619a | pbandark-2 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.201 |
| 34c6c4e5-cf35-440e-b4c1-d9246d97da2d | pbandark-3 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.202 |
| 0511f414-0392-4188-b3af-b5697525d19b | pbandark-4 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.197 |
| fab4c569-2e21-4616-894c-2e6d35eb2e0d | pbandark-5 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.196 |
| 8691b590-e5fb-4904-a615-0cd5c17fc7fc | pbandark-6 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.200 |
| 895d65bb-9f33-4baf-a729-954a8d276339 | pbandark-7 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.203 |
| 6bb91bd7-1495-4de9-96df-d2e15a90886e | pbandark-8 | ERROR             | -          | NOSTATE     |                     |
+--------------------------------------+------------+-------------------+------------+-------------+---------------------+

- Unshelved the instances:

[stack@ibm-x3630m4-5 ~]$ nova list |awk {'print $2'} |egrep -v '^$|ID' |xargs -i nova unshelve {}^C
[stack@ibm-x3630m4-5 ~]$ nova list
+--------------------------------------+------------+-------------------+------------+-------------+---------------------+
| ID                                   | Name       | Status            | Task State | Power State | Networks            |
+--------------------------------------+------------+-------------------+------------+-------------+---------------------+
| 398ea645-03e8-4920-90c8-532980021cbe | pbandark-1 | ACTIVE            | -          | Running     | sriov=10.65.199.199 |
| 4983827f-bf98-4749-80fb-ec2e64c5619a | pbandark-2 | ACTIVE            | -          | Running     | sriov=10.65.199.201 |
| 34c6c4e5-cf35-440e-b4c1-d9246d97da2d | pbandark-3 | ACTIVE            | -          | Running     | sriov=10.65.199.202 |
| 0511f414-0392-4188-b3af-b5697525d19b | pbandark-4 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.197 |
| fab4c569-2e21-4616-894c-2e6d35eb2e0d | pbandark-5 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.196 |
| 8691b590-e5fb-4904-a615-0cd5c17fc7fc | pbandark-6 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.200 |
| 895d65bb-9f33-4baf-a729-954a8d276339 | pbandark-7 | SHELVED_OFFLOADED | -          | Shutdown    | sriov=10.65.199.203 |
| 6bb91bd7-1495-4de9-96df-d2e15a90886e | pbandark-8 | ERROR             | -          | NOSTATE     |                     |
+--------------------------------------+------------+-------------------+------------+-------------+---------------------+

^^^^^ The operation failed for a few of the instances.
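(Editor's note, not part of the original report.) A minimal python-novaclient sketch of the same shelve/unshelve cycle, which also prints the fault message of any instance that ends up in ERROR. It assumes admin credentials and the instance names used above; the auth values are placeholders and the sleeps are a crude stand-in for proper state polling.

import time

from novaclient import client

# Placeholder credentials -- replace with real values or read the OS_* env vars.
nova = client.Client('2', 'admin', 'PASSWORD', 'admin',
                     'http://192.0.2.1:5000/v2.0')

servers = [s for s in nova.servers.list() if s.name.startswith('pbandark-')]

for s in servers:
    nova.servers.shelve(s)

# Crude wait for shelve offloading to finish before unshelving.
time.sleep(120)

for s in servers:
    if nova.servers.get(s.id).status == 'SHELVED_OFFLOADED':
        nova.servers.unshelve(s)

time.sleep(120)
for s in nova.servers.list():
    if s.status == 'ERROR':
        # 'fault' is populated by Nova when the last operation on the server failed.
        print("%s: %s" % (s.name, getattr(s, 'fault', {}).get('message')))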
- From compute logs:

079s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:265
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [req-1848197b-3436-465e-9a0d-6e68285f0d2d cb39646c878442868eec409a98126fc5 a448234ace054e5ab0635dd6e12d0992 - - -] [instance: 0511f414-0392-4188-b3af-b5697525d19b] Instance failed to spawn
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b] Traceback (most recent call last):
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 4346, in _unshelve_instance
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b]     with rt.instance_claim(context, instance, limits):
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b]   File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 254, in inner
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b]     return f(*args, **kwargs)
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b]   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 173, in instance_claim
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b]     overhead=overhead, limits=limits)
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b]   File "/usr/lib/python2.7/site-packages/nova/compute/claims.py", line 90, in __init__
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b]     self._claim_test(resources, limits)
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b]   File "/usr/lib/python2.7/site-packages/nova/compute/claims.py", line 147, in _claim_test
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b]     "; ".join(reasons))
2017-01-03 16:07:01.031 11944 ERROR nova.compute.manager [instance: 0511f414-0392-4188-b3af-b5697525d19b] ComputeResourcesUnavailable: Insufficient compute resources: Claim pci failed..
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher [req-1848197b-3436-465e-9a0d-6e68285f0d2d cb39646c878442868eec409a98126fc5 a448234ace054e5ab0635dd6e12d0992 - - -] Exception during message handling: Insufficient compute resources: Claim pci failed..
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher Traceback (most recent call last):
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 142, in _dispatch_and_reply
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     executor_callback))
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 186, in _dispatch
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     executor_callback)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 129, in _do_dispatch
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     result = func(ctxt, **new_args)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/exception.py", line 89, in wrapped
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     payload)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 204, in __exit__
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     six.reraise(self.type_, self.value, self.tb)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/exception.py", line 72, in wrapped
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     return f(self, context, *args, **kw)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 350, in decorated_function
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     LOG.warning(msg, e, instance=instance)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 204, in __exit__
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     six.reraise(self.type_, self.value, self.tb)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 323, in decorated_function
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     return function(self, context, *args, **kwargs)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 400, in decorated_function
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     return function(self, context, *args, **kwargs)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 378, in decorated_function
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     kwargs['instance'], e, sys.exc_info())
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 204, in __exit__
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     six.reraise(self.type_, self.value, self.tb)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 366, in decorated_function
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     return function(self, context, *args, **kwargs)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 4299, in unshelve_instance
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     do_unshelve_instance()
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 254, in inner
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     return f(*args, **kwargs)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 4298, in do_unshelve_instance
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     filter_properties, node)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 4355, in _unshelve_instance
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     instance=instance)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 204, in __exit__
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     six.reraise(self.type_, self.value, self.tb)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 4346, in _unshelve_instance
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     with rt.instance_claim(context, instance, limits):
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 254, in inner
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     return f(*args, **kwargs)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 173, in instance_claim
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     overhead=overhead, limits=limits)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/claims.py", line 90, in __init__
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     self._claim_test(resources, limits)
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/compute/claims.py", line 147, in _claim_test
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher     "; ".join(reasons))
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher ComputeResourcesUnavailable: Insufficient compute resources: Claim pci failed..
2017-01-03 16:07:01.213 11944 ERROR oslo_messaging.rpc.dispatcher

Version-Release number of selected component (if applicable):
RHOS8

How reproducible:

Actual results:
Sometimes the instance unshelve operation fails.

Expected results:
The unshelve operation should be successful.

Additional info:
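(Editor's note.) For readers not familiar with the resource tracker, the "Claim pci failed" at the bottom of the trace boils down to the check sketched below: the claim asks whether the host's free PCI device pools can satisfy every PCI request of the instance. This is a simplified illustration with made-up dict structures, not the actual nova/compute/claims.py code.

# Simplified illustration of a PCI claim test (assumed structures, not Nova source).
def pci_claim_ok(free_pools, pci_requests):
    """free_pools: list of dicts like {'vendor_id': ..., 'product_id': ..., 'count': N}
    pci_requests: list of dicts like {'vendor_id': ..., 'product_id': ..., 'count': M}"""
    # Work on a copy so a failed claim does not consume devices.
    pools = [dict(p) for p in free_pools]
    for req in pci_requests:
        needed = req['count']
        for pool in pools:
            if (pool['vendor_id'] == req['vendor_id']
                    and pool['product_id'] == req['product_id']):
                taken = min(needed, pool['count'])
                pool['count'] -= taken
                needed -= taken
        if needed > 0:
            # Corresponds to "ComputeResourcesUnavailable: ... Claim pci failed"
            return False
    return True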
Hello,

Apologies for the delay. Unfortunately, I didn't find any mention of the provided traces in the logs: the traces are from 2017-01-03, but the attached logs only contain activity from 2017-01-13. Would it be possible to reproduce the issue and capture the logs right after it occurs?

Also, in OSP8 we do not allocate new PCI devices during unshelve/rebuild/evacuate operations, and we do not update the Neutron port binding that holds the PCI address of the device (the Nova libvirt driver uses it to configure the virtual interfaces), as is done in [1]. However, that change relies on work done across two cycles (Mitaka and Newton) that introduced a migration context object and made resources be claimed and allocated during the above operations ([2] and [3]). Those patches are not backportable due to RPC and object changes.

[1] https://review.openstack.org/#/c/242573
[2] https://review.openstack.org/#/q/topic:bug/1417667
[3] https://review.openstack.org/#/q/topic:bp/migration-fix-resource-tracking
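(Editor's note.) One way to see the port binding referred to above is to dump the binding:profile of the SR-IOV ports with python-neutronclient; on OSP8 this profile still carries the PCI address ('pci_slot') claimed on the original host, which is what the libvirt driver reuses after an unshelve. A diagnostic sketch assuming the usual OS_* environment variables and admin credentials (binding attributes are admin-only):

import os

from neutronclient.v2_0 import client as neutron_client

neutron = neutron_client.Client(
    username=os.environ['OS_USERNAME'],
    password=os.environ['OS_PASSWORD'],
    tenant_name=os.environ['OS_TENANT_NAME'],
    auth_url=os.environ['OS_AUTH_URL'])

for port in neutron.list_ports()['ports']:
    if port.get('binding:vnic_type') == 'direct':  # SR-IOV ports
        # binding:profile contains e.g. the 'pci_slot' of the VF on the host.
        print("%s %s %s" % (port['id'],
                            port.get('binding:host_id'),
                            port.get('binding:profile')))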
I may have a path to follow... As Vladik indicated in comment #14, the filter 'pci_passthrough_filter.py' returns True when it does not find any pci_requests attached to the instance. My thinking is that when the API loads the instance to pass it to the compute API, the conductor, and finally the scheduler, the instance does not have the 'pci_requests' attribute loaded, with the result that the instance can be placed on a compute node which cannot accept the request.

This is the patch I would propose. I can still provide a test build if the customer prefers.

diff --git a/nova/api/openstack/compute/shelve.py b/nova/api/openstack/compute/shelve.py
index 6f9f8ae..2f31554 100644
--- a/nova/api/openstack/compute/shelve.py
+++ b/nova/api/openstack/compute/shelve.py
@@ -59,7 +59,8 @@ class ShelveController(wsgi.Controller):
         context = req.environ["nova.context"]
         authorize(context, action='shelve_offload')
 
-        instance = common.get_instance(self.compute_api, context, id)
+        instance = common.get_instance(
+            self.compute_api, context, id, expected_attrs=['pci_requests'])
         try:
             self.compute_api.shelve_offload(context, instance)
         except exception.InstanceUnknownCell as e:
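(Editor's note.) For reference, the filter behaviour described at the top of this comment is roughly the following; this is a paraphrase of the Liberty-era PciPassthroughFilter, not the exact source:

# Paraphrase of the pass-through filter logic (not the exact Liberty source).
class PciPassthroughFilter(object):
    def host_passes(self, host_state, filter_properties):
        pci_requests = filter_properties.get('pci_requests')
        if not pci_requests:
            # Nothing to check -> every host passes. This is the branch taken
            # when the instance was loaded without 'pci_requests', which is
            # how it can land on a node without free VFs.
            return True
        return host_state.pci_stats.support_requests(pci_requests.requests)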
Another way to "fix" the issue (if that is really the root cause) would be to replace that part of the code [0] with a call to the database to get the pci_requests related to the instance being scheduled, but I would say that would create a larger overhead, since the database would be hit for every instance scheduled.

[0] https://code.engineering.redhat.com/gerrit/gitweb?p=nova.git;a=blob;f=nova/scheduler/filter_scheduler.py;h=ec986252f49f60640b8d75f8162b6a39aa640fd1;hb=refs/heads/rhos-8.0-patches#l114
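(Editor's note, only to make the trade-off concrete; this is a sketch of the alternative above, not a proposed fix.) The filter-side variant would have to fetch the requests itself through the Nova objects layer, i.e. one extra DB round-trip per instance on every scheduling pass:

from nova import objects

def host_passes_with_db_lookup(context, host_state, instance_uuid):
    # Extra DB hit for every instance being scheduled -- the overhead
    # referred to above.
    pci_requests = objects.InstancePCIRequests.get_by_instance_uuid(
        context, instance_uuid)
    if not pci_requests.requests:
        return True
    return host_state.pci_stats.support_requests(pci_requests.requests)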
Like I said in comment #16, the problem is that the instance we get when calling unshelve does not have the pci_requests field set. So, yes, I definitely agree with the proposal in comment #17 to load the PCI bits when getting the instance.

To be clear, that issue is not present in OSP9, because there we should get the original RequestSpec record, which includes the pci_requests field, when calling unshelve; *but* the upstream Gerrit change I commented on in comment #16 is not backportable, given the many RPC changes and DB modifications involved by that feature.

About comment #18, I disagree with providing such a modification in the filter. Conceptually, we don't want (mostly for performance reasons) to query the Nova DB when we run the filters, in particular against the instances table, which is very large.
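(Editor's note, an illustration rather than the actual unshelve code path.) The OSP9 behaviour mentioned above relies on the RequestSpec persisted at boot time being reloadable with its PCI requests, roughly:

from nova import objects

def pci_requests_for_unshelve(context, instance_uuid):
    # The spec saved when the instance was first scheduled keeps the original
    # PCI requests, so the scheduler sees them again on unshelve.
    spec = objects.RequestSpec.get_by_instance_uuid(context, instance_uuid)
    return spec.pci_requests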
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:3068
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days