Bug 1211716

Summary: [Nokia OSP 6.0 Defect] Several concurrent scheduling requests for CPU pinning may fail due to racy host_state handling
Product: Red Hat OpenStack
Component: openstack-nova
Version: 6.0 (Juno)
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Stephen Gordon <sgordon>
Assignee: Sahid Ferdjaoui <sferdjao>
QA Contact: nlevinki <nlevinki>
CC: berrange, brault, dasmith, dmaley, eglynn, kchamart, mschuppe, ndipanov, sbauza, scorcora, sferdjao, sgordon, vromanso, yeylon
Keywords: Reopened, ZStream
Target Milestone: async
Target Release: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-nova-2014.2.3-61.el7ost
Doc Type: Bug Fix
Clones: 1321780
Last Closed: 2016-03-29 06:52:09 UTC
Bug Blocks: 1321780, 1342588, 1342592, 1348905, 1517272, 1517276, 1517278, 1537041

Description Stephen Gordon 2015-04-14 17:03:01 UTC
Cloned from launchpad bug 1438238.

Description:

The issue occurs when multiple scheduling requests that ask for CPU pinning are processed in parallel.
 

2015-03-25T14:18:00.222 controller-0 nova-scheduler err Exception during message handling: Cannot pin/unpin cpus [4] from the following pinned set [3, 4, 5, 6, 7, 8, 9]

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 134, in _dispatch_and_reply
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher     incoming.message))
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 177, in _dispatch
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher     return self._do_dispatch(endpoint, method, ctxt, args)
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 123, in _do_dispatch
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher     result = getattr(endpoint, method)(ctxt, **new_args)
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/server.py", line 139, in inner
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher     return func(*args, **kwargs)
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "./usr/lib64/python2.7/site-packages/nova/scheduler/manager.py", line 86, in select_destinations
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "./usr/lib64/python2.7/site-packages/nova/scheduler/filter_scheduler.py", line 80, in select_destinations
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "./usr/lib64/python2.7/site-packages/nova/scheduler/filter_scheduler.py", line 241, in _schedule
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "./usr/lib64/python2.7/site-packages/nova/scheduler/host_manager.py", line 266, in consume_from_instance
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "./usr/lib64/python2.7/site-packages/nova/virt/hardware.py", line 1472, in get_host_numa_usage_from_instance
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "./usr/lib64/python2.7/site-packages/nova/virt/hardware.py", line 1344, in numa_usage_from_instances
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "./usr/lib64/python2.7/site-packages/nova/objects/numa.py", line 91, in pin_cpus
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher CPUPinningInvalid: Cannot pin/unpin cpus [4] from the following pinned set [3, 4, 5, 6, 7, 8, 9]
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher

What is likely happening is:

* nova-scheduler handles several RPC calls to select_destinations at the same time, in multiple greenthreads.

* Greenthread 1 runs the NUMATopologyFilter and selects a CPU on a particular compute node, updating host_state.instance_numa_topology.

* Greenthread 1 then blocks for some reason.

* Greenthread 2 runs the NUMATopologyFilter and selects the same CPU on the same compute node, updating host_state.instance_numa_topology. Selecting a different CPU would also be a problem, because greenthread 2 would overwrite the instance_numa_topology stored by greenthread 1.

* Greenthread 2 then blocks for some reason.

* Greenthread 1 is scheduled again and calls consume_from_instance, which consumes the NUMA resources recorded in host_state.instance_numa_topology.

* Greenthread 1 completes the scheduling operation.

* Greenthread 2 is scheduled again and calls consume_from_instance, which consumes the NUMA resources recorded in host_state.instance_numa_topology. Because greenthread 1 already consumed those resources, the exception above is raised.
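
The interleaving listed above can be replayed deterministically with a small sketch. This is not nova code: HostState, numa_topology_filter, schedule and run_interleaved are hypothetical, heavily simplified stand-ins, a generator's yield is assumed to model the point where a greenthread blocks, and only the exception name and message format are borrowed from the traceback.

```python
class CPUPinningInvalid(Exception):
    """Raised when a requested CPU is already pinned (name taken from the traceback)."""


class HostState:
    """Hypothetical, simplified stand-in for the scheduler's per-host state."""

    def __init__(self, cpus):
        self.cpus = set(cpus)
        self.pinned_cpus = set()
        # Shared slot: every scheduling request writes its selection here.
        self.instance_numa_topology = None


def numa_topology_filter(host_state):
    """Pick the lowest unpinned CPU and record it on the shared host_state."""
    free = host_state.cpus - host_state.pinned_cpus
    host_state.instance_numa_topology = {min(free)}


def consume_from_instance(host_state):
    """Pin whatever selection is currently stored on host_state."""
    requested = host_state.instance_numa_topology
    if requested & host_state.pinned_cpus:
        raise CPUPinningInvalid(
            "Cannot pin/unpin cpus %s from the following pinned set %s"
            % (sorted(requested), sorted(host_state.pinned_cpus)))
    host_state.pinned_cpus |= requested


def schedule(host_state):
    """One scheduling request; the yield models the greenthread blocking."""
    numa_topology_filter(host_state)
    yield
    consume_from_instance(host_state)


def run_interleaved():
    """Replay the interleaving described above and return the error message."""
    host = HostState(range(4))
    g1, g2 = schedule(host), schedule(host)
    next(g1)            # greenthread 1 filters, selects CPU 0, then blocks
    next(g2)            # greenthread 2 filters, also selects CPU 0, then blocks
    next(g1, None)      # greenthread 1 consumes CPU 0 and completes
    try:
        next(g2, None)  # greenthread 2 tries to consume the same CPU
    except CPUPinningInvalid as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(run_interleaved())
```

In this model the failure goes away if the filter step and the consume step of one request cannot be interleaved with those of another request, or if the selection is not kept in a slot shared between requests. Whether the shipped fix takes either approach is not stated in this report.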

Specification URL (additional information):

https://bugs.launchpad.net/nova/+bug/1438238

Comment 15 errata-xmlrpc 2016-03-23 14:26:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0500.html