Cloned from Launchpad bug 1438238.
Description:
The issue happens when multiple scheduling requests that ask for CPU pinning are processed in parallel.
2015-03-25T14:18:00.222 controller-0 nova-scheduler err Exception during message handling: Cannot pin/unpin cpus [4] from the following pinned set [3, 4, 5, 6, 7, 8, 9]
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 134, in _dispatch_and_reply
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher incoming.message))
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 177, in _dispatch
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher return self._do_dispatch(endpoint, method, ctxt, args)
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 123, in _do_dispatch
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher result = getattr(endpoint, method)(ctxt, **new_args)
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/server.py", line 139, in inner
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher return func(*args, **kwargs)
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher File "./usr/lib64/python2.7/site-packages/nova/scheduler/manager.py", line 86, in select_destinations
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher File "./usr/lib64/python2.7/site-packages/nova/scheduler/filter_scheduler.py", line 80, in select_destinations
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher File "./usr/lib64/python2.7/site-packages/nova/scheduler/filter_scheduler.py", line 241, in _schedule
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher File "./usr/lib64/python2.7/site-packages/nova/scheduler/host_manager.py", line 266, in consume_from_instance
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher File "./usr/lib64/python2.7/site-packages/nova/virt/hardware.py", line 1472, in get_host_numa_usage_from_instance
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher File "./usr/lib64/python2.7/site-packages/nova/virt/hardware.py", line 1344, in numa_usage_from_instances
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher File "./usr/lib64/python2.7/site-packages/nova/objects/numa.py", line 91, in pin_cpus
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher CPUPinningInvalid: Cannot pin/unpin cpus [4] from the following pinned set [3, 4, 5, 6, 7, 8, 9]
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher
What is likely happening is:
* nova-scheduler handles several RPC calls to select_destinations at the same time, in multiple greenthreads
* greenthread 1 runs the NUMATopologyFilter and selects a CPU on a particular compute node, updating host_state.instance_numa_topology
* greenthread 1 then blocks for some reason
* greenthread 2 runs the NUMATopologyFilter and selects the same CPU on the same compute node, updating host_state.instance_numa_topology. Selecting a different CPU also seems like a problem, since it would overwrite the instance_numa_topology chosen by greenthread 1.
* greenthread 2 then blocks for some reason
* greenthread 1 gets scheduled and calls consume_from_instance, which consumes the NUMA resources based on what is in host_state.instance_numa_topology
* greenthread 1 completes the scheduling operation
* greenthread 2 gets scheduled and calls consume_from_instance, which consumes the NUMA resources based on what is in host_state.instance_numa_topology. Since greenthread 1 already consumed those resources, we get the exception above.
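The suspected interleaving can be illustrated with a small standalone sketch. This is not nova code: HostState, run_filter and run_interleaved are hypothetical stand-ins, the "lowest free CPU" selection is an assumption made for illustration, and plain generators take the place of eventlet greenthreads so that the switch points are explicit.

```python
class CPUPinningInvalid(Exception):
    """Stand-in for the exception seen in the traceback."""


class HostState(object):
    """Hypothetical, simplified per-host scheduler state (not nova's class)."""

    def __init__(self, cpus):
        self.free_cpus = set(cpus)
        self.pinned_cpus = set()
        # Shared slot: written by the filter step, read later by consume.
        self.instance_numa_topology = None

    def run_filter(self):
        # Assumed selection policy: take the lowest free CPU.
        self.instance_numa_topology = {min(self.free_cpus)}

    def consume_from_instance(self):
        cpus = self.instance_numa_topology
        if cpus & self.pinned_cpus:
            raise CPUPinningInvalid(
                "Cannot pin/unpin cpus %s from the following pinned set %s"
                % (sorted(cpus), sorted(self.pinned_cpus)))
        self.pinned_cpus |= cpus
        self.free_cpus -= cpus


def schedule_request(host):
    """One scheduling request; the yield models the greenthread blocking."""
    host.run_filter()
    yield
    host.consume_from_instance()


def run_interleaved(host):
    """Drive two requests in the order described in the list above."""
    gt1 = schedule_request(host)
    gt2 = schedule_request(host)
    next(gt1)  # greenthread 1 filters, then blocks
    next(gt2)  # greenthread 2 filters, picks the same CPU, then blocks
    for gt in (gt1, gt2):
        try:
            next(gt)  # resume: consume_from_instance runs
        except StopIteration:
            pass  # request finished normally


if __name__ == "__main__":
    host = HostState(range(4))
    try:
        run_interleaved(host)
    except CPUPinningInvalid as exc:
        print("second request failed: %s" % exc)
```

In this sketch the first request pins its CPU successfully and the second one raises, because both read a selection that was made before either had consumed anything. If each request ran its filter and consume steps without yielding in between, the second request would see the updated free set and pick a different CPU.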
Specification URL (additional information):
https://bugs.launchpad.net/nova/+bug/1438238
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://rhn.redhat.com/errata/RHBA-2016-0500.html