Bug 1817861
| Summary: | [HCI-DPDK] OSDs are using only one CPU, despite providing multiple CPUs in THT param ceph_osd_docker_cpuset_cpus | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Karthik Sundaravel <ksundara> |
| Component: | openstack-tripleo-heat-templates | Assignee: | Karthik Sundaravel <ksundara> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | David Rosenfeld <drosenfe> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | Doc Type: | If docs needed, set a value |
| Version: | 13.0 (Queens) | CC: | atheurer, bengland, ccopello, cfontain, cswanson, ekuric, fbaudin, hakhande, jdurgin, jmario, johfulto, kfida, ksundara, lhh, marjones, mburns, srangach, supadhya, vchundur, vkhitrin, yrachman |
| Target Milestone: | z11 | Target Release: | --- |
| Hardware: | Unspecified | OS: | Unspecified |

Doc Text:
For HCI-DPDK deployments the following changes are recommended.

Before:

- `KernelArgs: default_hugepagesz=1GB hugepagesz=1G hugepages=<number of hugepages> intel_iommu=on iommu=pt isolcpus=<all CPUs excluding OvsDpdkCoreList>`
- `IsolCpusList: <all CPUs excluding OvsDpdkCoreList>`
- `OvsPmdCoreList + NovaVcpuPinSet`

Now:

- `KernelArgs: default_hugepagesz=1GB hugepagesz=1G hugepages=<number of hugepages> intel_iommu=on iommu=pt isolcpus=<OvsPmdCoreList + NovaVcpuPinSet>`
- `IsolCpusList: <OvsPmdCoreList + NovaVcpuPinSet>`
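The recommended settings above map onto TripleO role parameters roughly as follows. This is an illustrative sketch only: the core lists (2,3,26,27 for PMDs, 8-23 for vCPU pinning, 4-7,40-43 for the OSDs) and the hugepage count are assumed example values, not values taken from this bug.

```yaml
# Illustrative HCI-DPDK role parameters (example core lists, not from this BZ).
# isolcpus / IsolCpusList cover only the DPDK PMD cores and the pinned vCPU
# cores; the Ceph OSD cpuset is deliberately left out, so the kernel scheduler
# can still balance the multi-threaded OSD processes across those cores.
parameter_defaults:
  ComputeHCIParameters:
    KernelArgs: "default_hugepagesz=1GB hugepagesz=1G hugepages=64 intel_iommu=on iommu=pt isolcpus=2,3,26,27,8-23"
    IsolCpusList: "2,3,26,27,8-23"
    OvsPmdCoreList: "2,3,26,27"
    NovaVcpuPinSet: "8-23"
  CephAnsibleExtraConfig:
    ceph_osd_docker_cpuset_cpus: "4-7,40-43"
```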
| | | | |
|---|---|---|---|
| Story Points: | --- | Last Closed: | 2020-05-05 09:34:23 UTC |
| Type: | Bug | Regression: | --- |
| Bug Blocks: | 1716326 | | |
Director only passes the parameters through to ceph-ansible, which in turn only passes them through to docker (OSP13) or podman (OSP16) [1]. Maybe it's more accurate to open the bug on the container program (docker/podman) for the --cpuset-cpus flag. Or, if the container program's --cpuset-cpus flag is working as designed, then we change how we achieve the original goal of isolation, e.g. should we recommend using numactl instead [2]? Input from the performance team on whether we can use numactl in its place should probably be the next step.

[1] https://github.com/ceph/ceph-ansible/commit/8cba44262cf7291091b2318b563a28380e5049fd
[2] https://github.com/ceph/ceph-ansible/commit/b3eb9206fada05df811602217d8770db854e0adf

Or we have director or ceph-ansible implement what's described here: https://github.com/moby/moby/issues/31086#issuecomment-323363442

> While this script is running, when we check the CPU usage in each of the HCI compute nodes,
> only CPU 4 is used 100% with 0% idle. All other CPUs "40,5,41,6,42,7,43" are 100% idle.
>
> Even if the above script is made to run in parallel for /dev/vdc and /dev/vdd, the CPU utilisation
> does not go beyond 1 CPU.

Isolcpus is working as advertised: you get no scheduler load balancing with isolcpus.

If you need quiet CPUs (like what isolcpus gives you) but you also need the kernel scheduler's load balancing, have you looked at the cpu-partitioning tuned profile? It gives you exactly that.
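For reference, the cpu-partitioning profile Joe mentions is driven by `/etc/tuned/cpu-partitioning-variables.conf`. The core lists below are illustrative assumptions, not values from this deployment:

```ini
# /etc/tuned/cpu-partitioning-variables.conf (illustrative values)
# isolated_cores: shielded from systemd, irqbalance, and some kernel threads,
# but still subject to kernel scheduler load balancing
isolated_cores=2-23,26-47
# no_balance_cores (a subset of isolated_cores): additionally removed from
# scheduler load balancing, similar to isolcpus -- e.g. the DPDK PMD cores
no_balance_cores=2,3,26,27
```

The profile is activated with `tuned-adm profile cpu-partitioning` followed by a reboot.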
(In reply to Joe Mario from comment #7)
> Isolcpus is working as advertised. You get no scheduler load balancing with isolcpus.
>
> If you need quiet cpus (like what isolcpus gives you) but you also need the kernel scheduler's load balancing, have you looked at the cpu-partitioning tuned profile. It gives you exactly that.

For DPDK deployments we use both isolcpus and the cpu-partitioning tuned profile.

Chris/Franck, should we consider using only the cpu-partitioning profile?

Karthik:
> For DPDK deployments we use both isolcpus and cpu-partitioning tuned profile.

Have you checked whether the CPUs you list for "isolcpus" go beyond what you need for the PMD threads? Without seeing all the details of your setup, it feels like that's what is happening.

Doesn't the "no_balance_cores" feature in the cpu-partitioning-variables.conf file give you the "isolcpus" behavior you need for the subset of CPUs needed for the DPDK PMDs?
(In reply to Joe Mario from comment #9)
> Have you checked whether the CPUs you list for "isolcpus" go beyond what you need for the PMD threads? Without seeing all the details of your setup, it feels like that's what is happening.

Yes, isolcpus currently includes the CPUs for guests + DPDK PMDs + the CPUs for the Ceph OSDs. Should I try with isolcpus = CPUs for guests + DPDK PMDs?

> Doesn't the "no_balance_cores" feature in the cpu-partitioning-variables.conf file give you the "isolcpus" behavior you need for the subset of CPUs needed for the DPDK PMDs?

We are not using this feature now. I am not sure of the differences between isolcpus and no_balance_cores.

(In reply to Karthik Sundaravel from comment #10)
> Yes, isolcpus currently includes the CPUs for guests + DPDK PMDs + the CPUs for the Ceph OSDs.

Yes, this is why you only have 1 CPU used.

> Should I try with isolcpus = CPUs for guests + DPDK PMDs?

Yes.

> Doesn't the "no_balance_cores" feature in the cpu-partitioning-variables.conf file give you the "isolcpus" behavior you need for the subset of CPUs needed for the DPDK PMDs?

It removes the load-balancing from those CPUs, but I don't think we are recommending using it instead of isolcpus yet. We need to fix another part of cpu-partitioning, to disable timer-migration instead of having it enabled (as it is today).
(In reply to Andrew Theurer from comment #11)
> > Should I try with isolcpus = CPUs for guests + DPDK PMDs?
>
> Yes.

With this, the OSDs use all 8 cores.

In the cpu-partitioning tuned profile, should ``isolated_cores`` be the same as ``isolcpus``, or should ``isolated_cores`` include the Ceph cpuset as well?
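The working split Karthik arrived at (OSDs on all cores once the Ceph cpuset is removed from isolcpus) can be sketched as simple list arithmetic. The core lists here are hypothetical example values, not the ones from this deployment:

```shell
# Hypothetical core lists (example values only, not from this BZ):
OVS_PMD_CORE_LIST="2,3,26,27"   # DPDK PMD threads (pinned 1:1)
NOVA_VCPU_PIN_SET="8-23"        # guest vCPU pinning (pinned 1:1)
CEPH_OSD_CPUSET="4-7,40-43"     # ceph_osd_docker_cpuset_cpus

# isolcpus (and the tuned profile's isolated_cores) should contain only the
# 1:1-pinned workloads; the multi-threaded OSDs stay on load-balanced CPUs.
ISOLCPUS="${OVS_PMD_CORE_LIST},${NOVA_VCPU_PIN_SET}"
echo "isolcpus=${ISOLCPUS}"
echo "osd cpuset (not isolated)=${CEPH_OSD_CPUSET}"
```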
> in cpu partitioning tuned profile, should the ``isolated_core`` be same as
> ``isolcpus`` ? or isolated core shall include ceph cpuset as well ?
isolated_cores only ensures these CPUs are not used by systemd, irqbalance, and some kernel threads, unless a program (like OVS, Nova, or perhaps Ceph OSD) chooses to change its cpumask to something that includes those isolated CPUs. It is not equivalent to isolcpus, because isolcpus always disables load-balancing on those CPUs.

If you have a program which does not need load balancing (OVS PMD threads, Nova pinning vcpu threads to pcpus), then not having load balancing is probably fine. If you have a program which does not follow a 1:1 thread:CPU model, the lack of load balancing will cause you to use just 1 CPU.
Thanks Andrew. So if I understand correctly, it all boils down to whether the CPUs used for the Ceph OSDs need the features of isolated_cores in the cpu-partitioning tuned profile. John/Ben, can you please advise?

What is the current behaviour of HCI+SRIOV? My take is that we should run with the same kind of deployment as HCI+SRIOV, i.e. running Ceph on non-isolated CPUs.

(In reply to Christophe Fontaine from comment #15)
> What is the current behaviour of HCI+SRIOV?

I couldn't get hold of a document explaining the Isolcpus parameter of an HCI + SRIOV deployment. I understand that for HCI or HCI-SRIOV deployments, a specific list of CPUs is not provided; instead the ceph_osd_docker_cpu_limit parameter specifies the number of CPUs per OSD. In the case of HCI-DPDK, we specify the list of CPUs available for OSDs via ceph_osd_docker_cpuset_cpus. Also, the steps to derive/determine the Isolcpus and the isolated_cores of the cpu-partitioning profile are not covered in our official guide [1].

> My take is that we should run with the same kind of deployment as HCI+SRIOV, i.e. running Ceph on non-isolated CPUs.

So IMHO, we should go with Isolcpus (KernelArgs) containing the CPUs for guests + PMDs. The CPUs specified in ceph_osd_docker_cpuset_cpus should not be added to it. Also, the isolated_cores (cpu-partitioning tuned profile) should be the same as Isolcpus (KernelArgs).

Chris, please let me know if you think otherwise.

Yariv/Sanjay, has this issue been found in HCI-SR-IOV deployments?

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.0/html-single/hyperconverged_infrastructure_guide/index#reserve-resources-compute

(In reply to Karthik Sundaravel from comment #16)
> So IMHO, we should go with Isolcpus (KernelArgs) containing the CPUs for guests + PMDs. The CPUs specified in ceph_osd_docker_cpuset_cpus should not be added to it.
> Also, the isolated_cores (cpu-partitioning tuned profile) should be the same as Isolcpus (KernelArgs).

+1. With that said, we may have to set ceph_osd_docker_cpuset_cpus to the unpinned CPUs: could you check the affinity of the process in the container, so it doesn't overlap with isolated CPUs? (Refer to BZ 1750781.)

(In reply to Christophe Fontaine from comment #18)
> Could you check the affinity of the process in the container, so it doesn't overlap with isolated CPUs?

Yes, Cpus_allowed_list does not overlap with the isolated CPUs.

(In reply to Karthik Sundaravel from comment #16)
> Yariv/Sanjay, has this issue been found in HCI-SR-IOV deployments?

Vadim, could you please update if we saw this issue in our deployment?

Hi Chuck, I created a doc BZ https://bugzilla.redhat.com/show_bug.cgi?id=1828134 with the details. Please let me know if you have the details to work on it.

Hi Karthik, thank you for the doc BZ. I think we have what we need to start.

There are no code changes required; only the documentation needs to change. I'll close the BZ having provided the documentation changes.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
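The overlap check Christophe asked for (comment #18) can be scripted. The Cpus_allowed_list sample and the isolated set below are illustrative assumptions; in a real deployment the allowed list would be read from /proc/<osd-pid>/status inside the container (e.g. via `podman exec`):

```shell
#!/usr/bin/env bash
# Expand a kernel cpulist string (e.g. "4-7,40,43") to one CPU per line.
expand() {
  echo "$1" | tr ',' '\n' | while IFS=- read -r lo hi; do
    seq "$lo" "${hi:-$lo}"
  done
}

ISOLATED="2,3,8-23,26,27"   # isolcpus / isolated_cores (example values)
OSD_ALLOWED="4-7,40-43"     # sample Cpus_allowed_list of an OSD process

# Any CPU present in both expanded lists indicates a misconfiguration.
overlap=$(comm -12 <(expand "$ISOLATED" | sort) <(expand "$OSD_ALLOWED" | sort))
if [ -z "$overlap" ]; then
  echo "no overlap"
else
  echo "overlap on CPUs: $overlap"
fi
```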
In an HCI-DPDK deployment (OSP13-Z10), the THT param ceph_osd_docker_cpuset_cpus specifies the list of CPUs available and reserved for the OSDs:

ceph_osd_docker_cpuset_cpus: "4,40,5,41,6,42,7,43"

After the system is deployed, the script below is used to generate stress in Ceph:

==========================================
echo "[vms]" > ips
openstack server list -c Name -c Networks -f value | egrep "vm[0-9]" | sed s/tenant=//g | awk {'print $2'} >> ips
namespace=$(ip netns | grep "(id: 0)" | awk {'print $1'})
common_params='/usr/local/bin/fio --name=karthik --filename=/dev/vdb --ramp_time=5 --startdelay=5'
sudo ip netns exec $namespace ansible --ssh-extra-args "-o StrictHostKeyChecking=no" --private-key test.pem -u cloud-user -i ips -f 30 -b -m shell -a "$common_params --rw=randrw --bs=1k --direct=1 --size=40G --runtime=600 --time_based=1 --output=/home/cloud-user/rand-write.log" vms
==========================================

While this script is running, when we check the CPU usage on each of the HCI compute nodes, only CPU 4 is used 100% with 0% idle. All other CPUs "40,5,41,6,42,7,43" are 100% idle.

Even if the above script is made to run in parallel for /dev/vdc and /dev/vdd, the CPU utilisation does not go beyond 1 CPU.

Note: the storage and storage management NICs are set up on a 25 Gbps network.

Expectation: the CPU load should be shared with the other CPUs as well.