Bug 1817861
| Summary: | [HCI-DPDK] OSDs are using only one CPU, despite providing multiple CPUs in THT param ceph_osd_docker_cpuset_cpus | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Karthik Sundaravel <ksundara> |
| Component: | openstack-tripleo-heat-templates | Assignee: | Karthik Sundaravel <ksundara> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | David Rosenfeld <drosenfe> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | Doc Type: | If docs needed, set a value |
| Version: | 13.0 (Queens) | CC: | atheurer, bengland, ccopello, cfontain, cswanson, ekuric, fbaudin, hakhande, jdurgin, jmario, johfulto, kfida, ksundara, lhh, marjones, mburns, srangach, supadhya, vchundur, vkhitrin, yrachman |
| Target Milestone: | z11 | Target Release: | --- |
| Hardware: | Unspecified | OS: | Unspecified |

Doc Text:
For HCI-DPDK deployments the following changes are recommended.

Before:

- `KernelArgs: default_hugepagesz=1GB hugepagesz=1G hugepages=<number of hugepages> intel_iommu=on iommu=pt isolcpus=<all CPUs excluding OvsDpdkCoreList>`
- `IsolCpusList: <all CPUs excluding OvsDpdkCoreList>`
- `OvsPmdCoreList + NovaVcpuPinSet`

Now:

- `KernelArgs: default_hugepagesz=1GB hugepagesz=1G hugepages=<number of hugepages> intel_iommu=on iommu=pt isolcpus=<OvsPmdCoreList + NovaVcpuPinSet>`
- `IsolCpusList: <OvsPmdCoreList + NovaVcpuPinSet>`
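The recommended settings above map onto TripleO role parameters roughly as follows. This is an illustrative sketch only: the core lists (2,3,26,27 for PMDs, 8-23 for vCPU pinning, 4-7,40-43 for the OSDs) and the hugepage count are assumed example values, not values taken from this bug.

```yaml
# Illustrative HCI-DPDK role parameters (example core lists, not from this BZ).
# isolcpus / IsolCpusList cover only the DPDK PMD cores and the pinned vCPU
# cores; the Ceph OSD cpuset is deliberately left out, so the kernel scheduler
# can still balance the multi-threaded OSD processes across those cores.
parameter_defaults:
  ComputeHCIParameters:
    KernelArgs: "default_hugepagesz=1GB hugepagesz=1G hugepages=64 intel_iommu=on iommu=pt isolcpus=2,3,26,27,8-23"
    IsolCpusList: "2,3,26,27,8-23"
    OvsPmdCoreList: "2,3,26,27"
    NovaVcpuPinSet: "8-23"
  CephAnsibleExtraConfig:
    ceph_osd_docker_cpuset_cpus: "4-7,40-43"
```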
| | | | |
|---|---|---|---|
| Story Points: | --- | Last Closed: | 2020-05-05 09:34:23 UTC |
| Type: | Bug | Regression: | --- |
| Bug Blocks: | 1716326 | | |
Director only passes the parameters through to ceph-ansible, which in turn only passes them through to docker (OSP13) or podman (OSP16) [1]. Maybe it's more accurate to open the bug on the container program (docker/podman) for the --cpuset-cpus flag. Or, if the container program's --cpuset-cpus flag is working as designed, then we change how we achieve the original goal of isolation, e.g. should we recommend using numactl instead [2]? Input from the performance team on whether we can use numactl in its place should probably be the next step.

[1] https://github.com/ceph/ceph-ansible/commit/8cba44262cf7291091b2318b563a28380e5049fd
[2] https://github.com/ceph/ceph-ansible/commit/b3eb9206fada05df811602217d8770db854e0adf

Or we have director or ceph-ansible implement what's described here: https://github.com/moby/moby/issues/31086#issuecomment-323363442

> While this script is running, when we check the CPU usage in each of the HCI compute nodes,
> only CPU 4 is used 100% with 0% idle. All other CPUs "40,5,41,6,42,7,43" are 100% idle.
>
> Even if the above script is made to run in parallel for /dev/vdc and /dev/vdd, the CPU utilisation
> does not go beyond 1 CPU.

Isolcpus is working as advertised: you get no scheduler load balancing with isolcpus.

If you need quiet CPUs (like what isolcpus gives you) but you also need the kernel scheduler's load balancing, have you looked at the cpu-partitioning tuned profile? It gives you exactly that.
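For reference, the cpu-partitioning profile Joe mentions is driven by `/etc/tuned/cpu-partitioning-variables.conf`. The core lists below are illustrative assumptions, not values from this deployment:

```ini
# /etc/tuned/cpu-partitioning-variables.conf (illustrative values)
# isolated_cores: shielded from systemd, irqbalance, and some kernel threads,
# but still subject to kernel scheduler load balancing
isolated_cores=2-23,26-47
# no_balance_cores (a subset of isolated_cores): additionally removed from
# scheduler load balancing, similar to isolcpus -- e.g. the DPDK PMD cores
no_balance_cores=2,3,26,27
```

The profile is activated with `tuned-adm profile cpu-partitioning` followed by a reboot.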
(In reply to Joe Mario from comment #7)
> Isolcpus is working as advertised. You get no scheduler load balancing with isolcpus.
>
> If you need quiet cpus (like what isolcpus gives you) but you also need the kernel scheduler's load balancing, have you looked at the cpu-partitioning tuned profile. It gives you exactly that.

For DPDK deployments we use both isolcpus and the cpu-partitioning tuned profile.

Chris/Franck, should we consider using only the cpu-partitioning profile?

Karthik:
> For DPDK deployments we use both isolcpus and cpu-partitioning tuned profile.

Have you checked whether the CPUs you list for "isolcpus" go beyond what you need for the PMD threads? Without seeing all the details of your setup, it feels like that's what is happening.

Doesn't the "no_balance_cores" feature in the cpu-partitioning-variables.conf file give you the "isolcpus" behavior you need for the subset of CPUs needed for the DPDK PMDs?
(In reply to Joe Mario from comment #9)
> Have you checked whether the CPUs you list for "isolcpus" go beyond what you need for the PMD threads? Without seeing all the details of your setup, it feels like that's what is happening.

Yes, isolcpus currently includes the CPUs for guests + DPDK PMDs + the CPUs for the Ceph OSDs. Should I try with isolcpus = CPUs for guests + DPDK PMDs?

> Doesn't the "no_balance_cores" feature in the cpu-partitioning-variables.conf file give you the "isolcpus" behavior you need for the subset of CPUs needed for the DPDK PMDs?

We are not using this feature now. I am not sure of the differences between isolcpus and no_balance_cores.

(In reply to Karthik Sundaravel from comment #10)
> Yes, isolcpus currently includes the CPUs for guests + DPDK PMDs + the CPUs for the Ceph OSDs.

Yes, this is why you only have 1 CPU used.

> Should I try with isolcpus = CPUs for guests + DPDK PMDs?

Yes.

> Doesn't the "no_balance_cores" feature in the cpu-partitioning-variables.conf file give you the "isolcpus" behavior you need for the subset of CPUs needed for the DPDK PMDs?

It removes the load-balancing from those CPUs, but I don't think we are recommending using it instead of isolcpus yet. We need to fix another part of cpu-partitioning, to disable timer-migration instead of having it enabled (as it is today).
(In reply to Andrew Theurer from comment #11)
> > Should I try with isolcpus = CPUs for guests + DPDK PMDs?
>
> Yes.

With this, the OSDs use all 8 cores.

In the cpu-partitioning tuned profile, should ``isolated_cores`` be the same as ``isolcpus``, or should ``isolated_cores`` include the Ceph cpuset as well?
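The working split Karthik arrived at (OSDs on all cores once the Ceph cpuset is removed from isolcpus) can be sketched as simple list arithmetic. The core lists here are hypothetical example values, not the ones from this deployment:

```shell
# Hypothetical core lists (example values only, not from this BZ):
OVS_PMD_CORE_LIST="2,3,26,27"   # DPDK PMD threads (pinned 1:1)
NOVA_VCPU_PIN_SET="8-23"        # guest vCPU pinning (pinned 1:1)
CEPH_OSD_CPUSET="4-7,40-43"     # ceph_osd_docker_cpuset_cpus

# isolcpus (and the tuned profile's isolated_cores) should contain only the
# 1:1-pinned workloads; the multi-threaded OSDs stay on load-balanced CPUs.
ISOLCPUS="${OVS_PMD_CORE_LIST},${NOVA_VCPU_PIN_SET}"
echo "isolcpus=${ISOLCPUS}"
echo "osd cpuset (not isolated)=${CEPH_OSD_CPUSET}"
```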
> in cpu partitioning tuned profile, should the ``isolated_core`` be same as
> ``isolcpus`` ? or isolated core shall include ceph cpuset as well ?
isolated_cores only ensures these CPUs are not used by systemd, irqbalance, and some kernel threads, unless a program (like OVS, Nova, or perhaps Ceph OSD) chooses to change its cpumask to something that includes those isolated CPUs. It is not equivalent to isolcpus, because isolcpus always disables load-balancing on those CPUs.

If you have a program which does not need load balancing (OVS PMD threads, Nova pinning vcpu threads to pcpus), then not having load balancing is probably fine. If you have a program which does not follow a 1:1 thread:CPU model, the lack of load balancing will cause you to use just 1 CPU.
Thanks Andrew. So if I understand correctly, it all boils down to whether the CPUs used for the Ceph OSDs need the features of isolated_cores in the cpu-partitioning tuned profile. John/Ben, can you please advise?

What is the current behaviour of HCI+SRIOV? My take is that we should run with the same kind of deployment as HCI+SRIOV, i.e. running Ceph on non-isolated CPUs.

(In reply to Christophe Fontaine from comment #15)
> What is the current behaviour of HCI+SRIOV?

I couldn't get hold of a document explaining the Isolcpus parameter of an HCI + SRIOV deployment. I understand that for HCI or HCI-SRIOV deployments, a specific list of CPUs is not provided; instead the ceph_osd_docker_cpu_limit parameter specifies the number of CPUs per OSD. In the case of HCI-DPDK, we specify the list of CPUs available for OSDs via ceph_osd_docker_cpuset_cpus. Also, the steps to derive/determine the Isolcpus and the isolated_cores of the cpu-partitioning profile are not covered in our official guide [1].

> My take is that we should run with the same kind of deployment as HCI+SRIOV, i.e. running Ceph on non-isolated CPUs.

So IMHO, we should go with Isolcpus (KernelArgs) containing the CPUs for guests + PMDs. The CPUs specified in ceph_osd_docker_cpuset_cpus should not be added to it. Also, the isolated_cores (cpu-partitioning tuned profile) should be the same as Isolcpus (KernelArgs).

Chris, please let me know if you think otherwise.

Yariv/Sanjay, has this issue been found in HCI-SR-IOV deployments?

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.0/html-single/hyperconverged_infrastructure_guide/index#reserve-resources-compute

(In reply to Karthik Sundaravel from comment #16)
> So IMHO, we should go with Isolcpus (KernelArgs) containing the CPUs for guests + PMDs. The CPUs specified in ceph_osd_docker_cpuset_cpus should not be added to it.
> Also, the isolated_cores (cpu-partitioning tuned profile) should be the same as Isolcpus (KernelArgs).

+1. With that said, we may have to set ceph_osd_docker_cpuset_cpus to the unpinned CPUs: could you check the affinity of the process in the container, so it doesn't overlap with isolated CPUs? (Refer to BZ 1750781.)

(In reply to Christophe Fontaine from comment #18)
> Could you check the affinity of the process in the container, so it doesn't overlap with isolated CPUs?

Yes, Cpus_allowed_list does not overlap with the isolated CPUs.

(In reply to Karthik Sundaravel from comment #16)
> Yariv/Sanjay, has this issue been found in HCI-SR-IOV deployments?

Vadim, could you please update if we saw this issue in our deployment?

Hi Chuck, I created a doc BZ https://bugzilla.redhat.com/show_bug.cgi?id=1828134 with the details. Please let me know if you have the details to work on it.

Hi Karthik, thank you for the doc BZ. I think we have what we need to start.

There are no code changes required; only the documentation needs to change. I'll close the BZ having provided the documentation changes.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
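The overlap check Christophe asked for (comment #18) can be scripted. The Cpus_allowed_list sample and the isolated set below are illustrative assumptions; in a real deployment the allowed list would be read from /proc/<osd-pid>/status inside the container (e.g. via `podman exec`):

```shell
#!/usr/bin/env bash
# Expand a kernel cpulist string (e.g. "4-7,40,43") to one CPU per line.
expand() {
  echo "$1" | tr ',' '\n' | while IFS=- read -r lo hi; do
    seq "$lo" "${hi:-$lo}"
  done
}

ISOLATED="2,3,8-23,26,27"   # isolcpus / isolated_cores (example values)
OSD_ALLOWED="4-7,40-43"     # sample Cpus_allowed_list of an OSD process

# Any CPU present in both expanded lists indicates a misconfiguration.
overlap=$(comm -12 <(expand "$ISOLATED" | sort) <(expand "$OSD_ALLOWED" | sort))
if [ -z "$overlap" ]; then
  echo "no overlap"
else
  echo "overlap on CPUs: $overlap"
fi
```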
In an HCI-DPDK deployment (OSP13-Z10), the THT param ceph_osd_docker_cpuset_cpus specifies the list of CPUs available and reserved for the OSDs:

ceph_osd_docker_cpuset_cpus: "4,40,5,41,6,42,7,43"

After the system is deployed, the script below is used to generate stress in Ceph:

==========================================
echo "[vms]" > ips
openstack server list -c Name -c Networks -f value | egrep "vm[0-9]" | sed s/tenant=//g | awk {'print $2'} >> ips
namespace=$(ip netns | grep "(id: 0)" | awk {'print $1'})
common_params='/usr/local/bin/fio --name=karthik --filename=/dev/vdb --ramp_time=5 --startdelay=5'
sudo ip netns exec $namespace ansible --ssh-extra-args "-o StrictHostKeyChecking=no" --private-key test.pem -u cloud-user -i ips -f 30 -b -m shell -a "$common_params --rw=randrw --bs=1k --direct=1 --size=40G --runtime=600 --time_based=1 --output=/home/cloud-user/rand-write.log" vms
==========================================

While this script is running, when we check the CPU usage on each of the HCI compute nodes, only CPU 4 is used 100% with 0% idle. All other CPUs "40,5,41,6,42,7,43" are 100% idle.

Even if the above script is made to run in parallel for /dev/vdc and /dev/vdd, the CPU utilisation does not go beyond 1 CPU.

Note: the storage and storage management NICs are set up on a 25 Gbps network.

Expectation: the CPU load should be shared with the other CPUs as well.