1211716 – [Nokia OSP 6.0 Defect] Several concurrent scheduling requests for CPU pinning may fail due to racy host_state handling

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-nova
Sub Component:
Version:	6.0 (Juno)
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	async
Target Release:	6.0 (Juno)
Assignee:	Sahid Ferdjaoui
QA Contact:	nlevinki
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1321780 1342588 1342592 1348905 1517272 1517276 1517278 1537041

Reported:	2015-04-14 17:03 UTC by Stephen Gordon
Modified:	2020-05-14 14:58 UTC
CC List:	14 users
Fixed In Version:	openstack-nova-2014.2.3-61.el7ost
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	1321780
Environment:
Last Closed:	2016-03-29 06:52:09 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Launchpad	1417667	None	None	None	2016-04-20 10:51:10 UTC
Red Hat Knowledge Base (Solution)	2181171	None	None	None	2016-03-01 11:24:58 UTC
Red Hat Product Errata	RHBA-2016:0500	normal	SHIPPED_LIVE	openstack-nova bug fix advisory	2016-03-23 18:25:29 UTC

Description Stephen Gordon 2015-04-14 17:03:01 UTC

Cloned from launchpad bug 1438238.

Description:

The issue happens when multiple scheduling attempts that request CPU pinning are done in parallel.
 

2015-03-25T14:18:00.222 controller-0 nova-scheduler err Exception during message handling: Cannot pin/unpin cpus [4] from the following pinned set [3, 4, 5, 6, 7, 8, 9]

2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 134, in _dispatch_and_reply
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher     incoming.message))
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 177, in _dispatch
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher     return self._do_dispatch(endpoint, method, ctxt, args)
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 123, in _do_dispatch
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher     result = getattr(endpoint, method)(ctxt, **new_args)
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "/usr/lib64/python2.7/site-packages/oslo/messaging/rpc/server.py", line 139, in inner
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher     return func(*args, **kwargs)
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "./usr/lib64/python2.7/site-packages/nova/scheduler/manager.py", line 86, in select_destinations
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "./usr/lib64/python2.7/site-packages/nova/scheduler/filter_scheduler.py", line 80, in select_destinations
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "./usr/lib64/python2.7/site-packages/nova/scheduler/filter_scheduler.py", line 241, in _schedule
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "./usr/lib64/python2.7/site-packages/nova/scheduler/host_manager.py", line 266, in consume_from_instance
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "./usr/lib64/python2.7/site-packages/nova/virt/hardware.py", line 1472, in get_host_numa_usage_from_instance
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "./usr/lib64/python2.7/site-packages/nova/virt/hardware.py", line 1344, in numa_usage_from_instances
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher   File "./usr/lib64/python2.7/site-packages/nova/objects/numa.py", line 91, in pin_cpus
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher CPUPinningInvalid: Cannot pin/unpin cpus [4] from the following pinned set [3, 4, 5, 6, 7, 8, 9]
2015-03-25 14:18:00.221 34127 TRACE oslo.messaging.rpc.dispatcher

What is likely happening is:

* nova-scheduler handles several RPC calls to select_destinations at the same time, each in its own greenthread

* greenthread 1 runs the NUMATopologyFilter and selects a CPU on a particular compute node, updating host_state.instance_numa_topology

* greenthread 1 then blocks for some reason

* greenthread 2 runs the NUMATopologyFilter and selects the same CPU on the same compute node, updating host_state.instance_numa_topology. This would also be a problem if a different CPU were selected, because greenthread 2 would overwrite the instance_numa_topology set by greenthread 1.

* greenthread 2 then blocks for some reason

* greenthread 1 gets scheduled and calls consume_from_instance, which consumes the NUMA resources based on what is in host_state.instance_numa_topology

* greenthread 1 completes the scheduling operation

* greenthread 2 gets scheduled and calls consume_from_instance, which consumes the NUMA resources based on what is in host_state.instance_numa_topology. Because greenthread 1 already consumed those resources, we get the exception above.
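
The interleaving described in the steps above can be replayed deterministically with a small sketch. This is not nova code: HostState, numa_topology_filter, schedule and run_interleaved are hypothetical, heavily simplified stand-ins, and plain Python generators are used in place of greenthreads so that the switch points are explicit. Only the names host_state, instance_numa_topology, consume_from_instance and CPUPinningInvalid are taken from the report.

```python
class CPUPinningInvalid(Exception):
    """Stand-in for the exception named in the traceback."""


class HostState:
    """Hypothetical, simplified model of the shared per-host scheduler state."""

    def __init__(self, cpus):
        self.cpus = set(cpus)
        self.pinned_cpus = set()
        # Shared scratch field written by the filter and read later by
        # consume_from_instance; this is the racy part.
        self.instance_numa_topology = None


def numa_topology_filter(host_state):
    """Pick the lowest free CPU and record the choice on the shared host state."""
    free = host_state.cpus - host_state.pinned_cpus
    host_state.instance_numa_topology = [min(free)]


def consume_from_instance(host_state):
    """Pin whatever CPUs are currently recorded on the shared host state."""
    wanted = set(host_state.instance_numa_topology)
    if wanted & host_state.pinned_cpus:
        raise CPUPinningInvalid(
            "Cannot pin/unpin cpus %s from the following pinned set %s"
            % (sorted(wanted), sorted(host_state.pinned_cpus)))
    host_state.pinned_cpus |= wanted


def schedule(host_state):
    """One scheduling request; the yield models a greenthread switch."""
    numa_topology_filter(host_state)
    yield
    consume_from_instance(host_state)


def run_interleaved():
    """Replay the interleaving from the list above; return the error message, if any."""
    host_state = HostState(cpus=range(4))
    g1, g2 = schedule(host_state), schedule(host_state)
    next(g1)          # greenthread 1 runs the filter, then blocks
    next(g2)          # greenthread 2 runs the filter, picks the same CPU, blocks
    next(g1, None)    # greenthread 1 consumes and completes
    try:
        next(g2, None)  # greenthread 2 tries to consume an already pinned CPU
    except CPUPinningInvalid as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(run_interleaved())
```

In this model the failure disappears if the filter and consume steps for one request run without a switch in between, or if the chosen topology travels with the request instead of being stored on the shared host state. Whether either approach matches the shipped fix is not stated in this report.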

Specification URL (additional information):

https://bugs.launchpad.net/nova/+bug/1438238

Comment 13 nlevinki 2016-03-21 08:32:24 UTC

The fix is in
openstack-nova-api-2014.2.3-65.el7ost.noarch

Automation passed:
https://rhos-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/RHOS/view/RHOS6/job/rhos-jenkins-rhos-6.0-puddle-rhel-7.2-multi-node-packstack-neutron-ml2-vxlan-rabbitmq-tempest-git-all/17/

Comment 15 errata-xmlrpc 2016-03-23 14:26:07 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0500.html
