Bug 1211716 - [Nokia OSP 6.0 Defect] Several concurent scheduling requests for CPU pinning may fail due to racy host_state handling
Summary: [Nokia OSP 6.0 Defect] Several concurent scheduling requests for CPU pinning ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: async
: 6.0 (Juno)
Assignee: Sahid Ferdjaoui
QA Contact: nlevinki
URL:
Whiteboard:
Depends On:
Blocks: 1321780 1342588 1342592 1348905 1517272 1517276 1517278 1537041
TreeView+ depends on / blocked
 
Reported: 2015-04-14 17:03 UTC by Stephen Gordon
Modified: 2020-05-14 14:58 UTC (History)
14 users (show)

Fixed In Version: openstack-nova-2014.2.3-61.el7ost
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1321780 (view as bug list)
Environment:
Last Closed: 2016-03-29 06:52:09 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1417667 0 None None None 2016-04-20 10:51:10 UTC
Red Hat Knowledge Base (Solution) 2181171 0 None None None 2016-03-01 11:24:58 UTC
Red Hat Product Errata RHBA-2016:0500 0 normal SHIPPED_LIVE openstack-nova bug fix advisory 2016-03-23 18:25:29 UTC

Description Stephen Gordon 2015-04-14 17:03:01 UTC
Cloned from launchpad bug 1438238.

Description:

The issue happens when multiple scheduling attempts that request CPU pinning are done in parallel.
 

015-03-25T14:18:00.222 controller-0 nova-scheduler err Exception during message handling: Cannot pin/unpin cpus [4] from the following pinned set [3, 4, 5, 6, 7, 8, 9]

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 134, in _dispatch_and_reply

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher     incoming.message))

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 177, in _dispatch

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher     return self._do_dispatch(endpoint, method, ctxt, args)

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 123, in _do_dispatch

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher     result = getattr(endpoint, method)(ctxt, **new_args)

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/server.py", line 139, in inner

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher     return func(*args, **kwargs)

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "./usr/lib64/python2.7/site-packages/nova/scheduler/manager.py", line 86, in select_destinations

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "./usr/lib64/python2.7/site-packages/nova/scheduler/filter_scheduler.py", line 80, in select_destinations

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "./usr/lib64/python2.7/site-packages/nova/scheduler/filter_scheduler.py", line 241, in _schedule

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "./usr/lib64/python2.7/site-packages/nova/scheduler/host_manager.py", line 266, in consume_from_instance

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "./usr/lib64/python2.7/site-packages/nova/virt/hardware.py", line 1472, in get_host_numa_usage_from_instance

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "./usr/lib64/python2.7/site-packages/nova/virt/hardware.py", line 1344, in numa_usage_from_instances

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "./usr/lib64/python2.7/site-packages/nova/objects/numa.py", line 91, in pin_cpus

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher CPUPinningInvalid: Cannot pin/unpin cpus [4] from the following pinned set [3, 4, 5, 6, 7, 8, 9]

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher

What is likely happening is:

* nova-scheduler is handling several RPC calls to select_destinations at the same time, in multiple greenthreads

* greenthread 1 runs the NUMATopologyFilter and selects a cpu on a particular compute node, updating host_state.instance_numa_topology

* greenthread 1 then blocks for some reason

* greenthread 2 runs the NUMATopologyFilter and selects the same cpu on the same compute node, updating host_state.instance_numa_topology. This also seems like an issue if a different cpu was selected, as it would be overwriting the instance_numa_topology selected by greenthread 1.

* greenthread 2 then blocks for some reason

* greenthread 1 gets scheduled and calls consume_from_instance, which consumes the numa resources based on what is in host_state.instance_numa_topology

*  greenthread 1 completes the scheduling operation

* greenthread 2 gets scheduled and calls consume_from_instance, which consumes the numa resources based on what is in host_state.instance_numa_topology - since the resources were already consumed by greenthread 1, we get the exception above

Specification URL (additional information):

https://bugs.launchpad.net/nova/+bug/1438238

Comment 15 errata-xmlrpc 2016-03-23 14:26:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0500.html


Note You need to log in before you can comment on or make changes to this bug.