Bug 1750781

Summary:	Containers not pinned to host cpus
Product:	Red Hat OpenStack	Reporter:	Christophe Fontaine <cfontain>
Component:	openstack-tripleo-heat-templates	Assignee:	Emilien Macchi <emacchi>
Status:	CLOSED ERRATA	QA Contact:	David Rosenfeld <drosenfe>
Severity:	high	Docs Contact:
Priority:	medium
Version:	13.0 (Queens)	CC:	aschultz, djuran, drosenfe, eelena, emacchi, fbaudin, fherrman, fiezzi, gconsalv, hakhande, jraju, marjones, mburns, mschuppe, ndeevy, owalsh, supadhya
Target Milestone:	---	Keywords:	Triaged, ZStream
Target Release:	---
Hardware:	All
OS:	All
Whiteboard:
Fixed In Version:	python-paunch-2.5.0-9.el7ost openstack-tripleo-heat-templates-8.4.1-24.el7ost	Doc Type:	If docs needed, set a value
Doc Text:	Before this update, all OpenStack containers floated across all system CPUs, ignoring tuned cpu-partitioning profiles and the `isolcpus` boot parameter. This meant that containers could preempt CPUs that were dedicated to VMs (vCPUs) or OVS-DPDK, resulting in packet loss on VNFs or on OVS-DPDK. This bug affected mainly NFV and other use cases that required isolated vCPUs. With this update, the issue is resolved.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-03-10 11:22:02 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Christophe Fontaine 2019-09-10 13:14:26 UTC

For NFV workloads, in order to achieve zero packet loss, Linux processes, OVS-DPDK (if applicable) and VMs are isolated from each other using kernel arguments (isolcpus) and tuned profiles (cpu-partitioning).

However, all Docker containers run without CPU isolation, because the "cpuset-cpus" parameter is not set.

For example:
  49582 ?        Sl     0:00      \_ /usr/bin/docker-containerd-shim-current c35beb0c9708a114cac3110ba8bf0235eb10864d0f13b653feebdf5b9e509677 /var/run/docker/libcontainerd/c35beb0c9708a114cac3110ba8bf0235eb10
  49600 ?        Ss     0:00      |   \_ /bin/bash /neutron_ovs_agent_launcher.sh
  49714 ?        S     24:59      |       \_ /usr/bin/python2 /usr/bin/neutron-openvswitch-agent --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/n

# cat /proc/49582/status | grep Cpus_allowed_list
Cpus_allowed_list:      0,12
--> docker-containerd-shim-current is correctly isolated

# cat /proc/49600/status | grep Cpus_allowed_list
Cpus_allowed_list:      0-23
--> The process is allowed to run on ALL CPUs instead of only 0,12.
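The per-PID check above can be wrapped in a small helper. This is a minimal sketch, assuming bash on Linux and that the shell running it is itself confined to the housekeeping (non-isolated) CPUs, which is what the /proc/self/status commands in this report also assume. The function names cpus_allowed and check_pinning are hypothetical.

```shell
#!/usr/bin/env bash
# Print the Cpus_allowed_list of a process ("self" or a PID).
cpus_allowed() {
  awk '/^Cpus_allowed_list:/ {print $2}' "/proc/$1/status" 2>/dev/null
}

# Compare a process's CPU affinity with that of the current shell.
# Assumption: this shell runs on the housekeeping (non-isolated) CPUs.
# Returns 2 if the process does not exist.
check_pinning() {
  local host proc
  host=$(cpus_allowed "$$")
  proc=$(cpus_allowed "$1") || return 2
  if [ "$proc" = "$host" ]; then
    echo "pid $1: pinned ($proc)"
  else
    echo "pid $1: NOT pinned ($proc, expected $host)"
  fi
}

# Example: check_pinning 49600
```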

This can also be reproduced on a machine with the same tuning (isolcpus and cpu-partitioning) by starting a simple container:
# docker run --rm -ti centos:7 cat /proc/1/status | grep Cpus_allowed_list
Cpus_allowed_list:      0-23

To keep these containers isolated from both OVS-DPDK AND the virtual machines, "CpusetCpus" has to be set when the container is started:
# docker run --rm --cpuset-cpus=$(cat /proc/self/status | awk '/Cpus_allowed_list/ {print $2}') -ti centos:7 cat /proc/1/status | grep Cpus_allowed_list
Cpus_allowed_list:      0,12

For all containers, "CpusetCpus" must be set to the non-isolated CPUs to avoid any interruption of packet processing (in OVS-DPDK and in the VNFs).

For existing deployments, run the following command to re-pin the running containers:
# docker ps -q | xargs docker update --cpuset-cpus=$(cat /proc/self/status | awk '/Cpus_allowed_list/ {print $2}')
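Before re-pinning, it may help to record what each container currently has, so the change can be compared afterwards. A sketch, assuming bash and the Docker CLI used in this report; report_cpusets is a hypothetical helper name. It reads HostConfig.CpusetCpus through docker inspect --format, and an empty value means the container was started without --cpuset-cpus.

```shell
#!/usr/bin/env bash
# Print "<name> <CpusetCpus>" for every running container;
# "<unset>" means the container was started without --cpuset-cpus.
report_cpusets() {
  local ids
  ids=$(docker ps -q)
  [ -n "$ids" ] || return 0
  # $ids is left unquoted on purpose so each container ID becomes one argument.
  docker inspect --format '{{.Name}} {{.HostConfig.CpusetCpus}}' $ids |
    awk '{ sub(/^\//, "", $1); print $1, ($2 == "" ? "<unset>" : $2) }'
}

# Example: report_cpusets > cpusets-before.txt
```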

Comment 1 Franck Baudin 2019-09-10 19:00:01 UTC

If the workaround above is applied, VMs can no longer be started (output of "openstack server show" for a failed VM):

| fault                               | {u'message': u"Unable to write to '/sys/fs/cgroup/cpuset/machine.slice/machine-qemu\\x2d1\\x2dinstance\\x2d00000001.scope/vcpu0/cpuset.cpus': Permission denied", u'code': 500, u'details': u'  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1858, in _do_build_and_run_instance\n    filter_properties, request_spec)\n  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2142, in _build_and_run_instance\n    instance_uuid=instance.uuid, reason=six.text_type(e))\n', u'created': u'2019-09-10T18:44:28Z'} |
| flavor                              | vnfc (9b09d8fa-b32f-466c-909d-4aff9afe2d00)                                   


Libvirt will need to be excluded from the workaround above.

Comment 3 Christophe Fontaine 2019-09-12 11:36:48 UTC

Indeed, pinning the nova containers can lead to issues. Here is the corrected command line, which re-pins all containers except the nova* ones:
docker ps -q | grep -v -E $(docker ps -q --filter='name=nova' | paste -sd "|" - ) | xargs docker update --cpuset-cpus=$(cat /proc/self/status | awk '/Cpus_allowed_list/ {print $2}') 
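The grep -v -E $(...) form needs at least one nova container to exist: with none, grep is called without a pattern, fails, and nothing gets re-pinned. A sketch of the same idea that filters on container names instead, assuming bash; repin_non_nova and the DRY_RUN variable are hypothetical. Like the command above, it skips every container whose name contains "nova", which covers the libvirt container that comment 1 says must be excluded.

```shell
#!/usr/bin/env bash
# Re-pin every running container whose name does not contain "nova" to the
# CPUs this shell is allowed to use (awk inherits the shell's affinity, as in
# the commands above). With DRY_RUN=1 the docker update commands are printed
# instead of executed.
repin_non_nova() {
  local cpus id name
  cpus=$(awk '/^Cpus_allowed_list:/ {print $2}' /proc/self/status)
  docker ps --format '{{.ID}} {{.Names}}' |
    while read -r id name; do
      # nova_compute, nova_libvirt, ... must keep access to the VM CPUs.
      case "$name" in *nova*) continue ;; esac
      if [ "${DRY_RUN:-0}" = 1 ]; then
        echo docker update --cpuset-cpus="$cpus" "$id"
      else
        docker update --cpuset-cpus="$cpus" "$id"
      fi
    done
}

# Example: DRY_RUN=1 repin_non_nova   # review, then run without DRY_RUN
```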

To list, before and after, the processes that are still allowed to run on all CPUs:
for s in  /proc/[0-9]* ; do if [[ $(grep $(lscpu  | awk '/On-line/ {print $4}') $s/status) ]]; then cat $s/cmdline ; echo ''; fi ;done  | sort -u
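A variant of that check, as a sketch assuming bash on Linux: it compares the Cpus_allowed_list line itself with the online CPU list in /sys/devices/system/cpu/online (which should match what lscpu reports as "On-line CPU(s) list") instead of grepping the whole status file, and turns the NUL-separated /proc/PID/cmdline into spaces so the arguments stay readable. list_floating is a hypothetical name.

```shell
#!/usr/bin/env bash
# Print the command line of every process whose Cpus_allowed_list equals the
# given CPU list (default: all online CPUs), i.e. processes still floating
# on every CPU, isolated ones included.
list_floating() {
  local online="${1:-$(cat /sys/devices/system/cpu/online)}"
  local d list
  for d in /proc/[0-9]*; do
    # Skip processes that exited while we were iterating.
    list=$(awk '/^Cpus_allowed_list:/ {print $2}' "$d/status" 2>/dev/null) || continue
    [ "$list" = "$online" ] || continue
    # cmdline is NUL-separated; kernel threads have an empty one.
    tr '\0' ' ' 2>/dev/null < "$d/cmdline"
    echo
  done | sort -u
}

# Example: list_floating > floating-before.txt   # compare after re-pinning
```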

Comment 10 David Rosenfeld 2020-02-19 18:24:02 UTC

On compute-0 server:

[heat-admin@compute-0 ~]$ cat /proc/self/status | grep Cpus_allowed_list
Cpus_allowed_list:    0-1

[heat-admin@compute-0 ~]$ sudo docker inspect nova_compute |grep CpusetCpus
            "CpusetCpus": "0,1",

Containers on compute-0 are pinned.


On controller-0 (to verify it was not broken):

[heat-admin@controller-0 ~]$ cat /proc/self/status | grep Cpus_allowed_list
Cpus_allowed_list:	0-7

[heat-admin@controller-0 ~]$ sudo docker inspect nova_api_db_sync |grep CpusetCpus
            "CpusetCpus": "0,1,2,3,4,5,6,7",

Containers on controller-0 are not pinned.
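The host lists above (0-1, 0-7) and the Docker values (0,1 and 0,1,2,3,4,5,6,7) use different notations, so a plain string comparison would report a mismatch. A sketch of a check that normalizes both before comparing, assuming bash and GNU-compatible sort and paste; normalize_cpulist and check_container are hypothetical names. On a node without isolated CPUs, such as controller-0, it reports a match because the container's cpuset then simply covers every CPU.

```shell
#!/usr/bin/env bash
# Normalize a CPU list ("0-1", "0,1", "12,0") to a sorted comma-separated form.
normalize_cpulist() {
  echo "$1" | tr ',' '\n' |
    awk -F- '/^[0-9]+(-[0-9]+)?$/ {
      hi = (NF == 2 ? $2 : $1) + 0
      for (c = $1 + 0; c <= hi; c++) print c
    }' |
    sort -n -u | paste -s -d, -
}

# Check that a container's CpusetCpus matches the CPUs this shell may use
# (awk inherits the shell's affinity). Returns 1 on mismatch or if unset.
check_container() {
  local want got
  want=$(awk '/^Cpus_allowed_list:/ {print $2}' /proc/self/status)
  got=$(docker inspect --format '{{.HostConfig.CpusetCpus}}' "$1") || return 2
  if [ -n "$got" ] && [ "$(normalize_cpulist "$got")" = "$(normalize_cpulist "$want")" ]; then
    echo "$1: CpusetCpus matches host affinity ($got)"
  else
    echo "$1: CpusetCpus='$got' does not match host affinity ($want)"
    return 1
  fi
}

# Example: check_container nova_compute
```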

Comment 12 errata-xmlrpc 2020-03-10 11:22:02 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0760

Comment 13 Emilien Macchi 2020-03-19 21:58:27 UTC

For the record, the initial implementation introduced a regression on PPC; see https://bugzilla.redhat.com/show_bug.cgi?id=1813091