Bug 1870804

Summary: balancer module does not balance reads, only writes
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Ben England <bengland>
Component: RADOS
Assignee: Laura Flores <lflores>
Status: CLOSED ERRATA
QA Contact: Pawan <pdhiran>
Severity: high
Docs Contact:
Priority: high
Version: 4.2
CC: abose, aclewett, akraj, akupczyk, bbenshab, bhubbard, bniver, bperkins, ceph-eng-bugs, ekuric, flucifre, jdurgin, jhopper, jsalomon, kramdoss, lflores, nojha, owasserm, pdhiran, rpollack, rzarzyns, shberry, sseshasa, tserlin, twilkins, vumrao
Target Milestone: ---
Keywords: Performance
Target Release: 8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ceph-19.1.1-8.el9cp
Doc Type: Enhancement
Doc Text:
.Balanced primary placement groups can now be observed in a cluster
Previously, users could only balance primaries with the offline `osdmaptool`. With this enhancement, autobalancing is available with the `upmap` balancer. Users can now choose between the `upmap-read` and `read` modes. The `upmap-read` mode offers simultaneous upmap and read optimization. The `read` mode can only be used to optimize reads. For more information, see operations/proc_mgr_using-the-ceph-manager-balancer-module.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2024-11-25 08:58:38 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2317218
Attachments: screenshot of grafana dashboard for OCS 4.5 GA cluster

Description Ben England 2020-08-20 18:59:19 UTC
Description of problem:

The balancer module doesn't balance reads because the distribution of primary OSDs among PGs is not even across the set of OSDs. This was first observed by Manoj Pillai in bz 1790500, but that bz was closed with an increase in the default PG count.

This is significant because it affects any Ceph cluster, whether Kubernetes-based or bare metal, private or public cloud. Many workloads are read-mostly, so this is a significant performance reduction for them.


Version-Release number of selected component (if applicable):

OCS 4.5.0-521 (rc?)
RHCOS 4.5 CoreOS 45.82.202008010929-0
OCP 4.5
quay.io/rhceph-dev/rhceph@sha256:df67e134c9037707118a8533670b2f77647355a4af2fdc1247d79679cb6bb676
rook 1.3
Ceph version 14.2.8-81.el8cp (0336e23b7404496341b988c8057538b8185ca5ec) nautilus 


How reproducible:

We need a couple more tests with more OSDs to assess the scope and severity of the problem, but it shouldn't be happening at all.


Steps to Reproduce:
1.  install OCS cluster in the usual way
2.  run ripsaw fio workload that fills 70% of storage
3.  observe primary OSD distribution in PGs


Actual results:

One OSD generates about 30% less throughput than the others, which results in roughly a 10% reduction in aggregate throughput.


Expected results:

For an fio read workload with uniform random access to the data, I expect all OSDs to receive approximately equal numbers of read requests.


Additional info:

I am still seeing the fio read test hit one OSD less than the other two in my cluster, so the balancing problem that Manoj Pillai pointed out in bz 1790500 is still not completely fixed in OCS 4.5. Here's the grafana graph that shows it:

http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/public/ceph/rhocs/cnv/OSD-read-imbalance-2020-08-20.png

In this graph, the white portions are the reads, and the purple and blue portions are the random writes.   The test prefills fio files, then runs randread followed by randwrite for these I/O sizes (KiB) in order: 4096, 1024, 256, 64, 16, 4, 1

Here's the raw data, which shows that this is the root cause: PGs are perfectly balanced across OSDs, but primary PGs are not. The master-0 host contains osd.2, which has only about three quarters as many primary PGs as the busiest OSD, and sure enough, the read throughput is correspondingly asymmetric. For example, during the 4-MiB random read test, I see the following /dev/nvme0n1 read throughput for these hosts:

host        GB/s           OSD     primary PGs
master-0    1.240          2       49
master-1    1.8 +- 10%     0       69
master-2    1.78 +- 10%    1       58

estimated throughput loss = (1.8 - 1.240) x 100.0 / (1.24 + 1.8 + 1.78)
                          = 11%

Whereas for writes, throughput is perfectly balanced, as are the PGs. This is with just 3 OSDs; how bad could the problem be if there were 100 OSDs? Maybe Elko's OSD scaling tests could tell us that.

While this is not as bad as before, it could still be better, perhaps just by fixing the balancer.


# ceph pg dump
...
OSD_STAT USED    AVAIL   USED_RAW TOTAL   HB_PEERS PG_SUM PRIMARY_PG_SUM
2        2.1 TiB 876 GiB  2.1 TiB 2.9 TiB    [0,1]    176             49
1        2.1 TiB 876 GiB  2.1 TiB 2.9 TiB    [0,2]    176             58
0        2.1 TiB 876 GiB  2.1 TiB 2.9 TiB    [1,2]    176             69
sum      6.2 TiB 2.6 TiB  6.2 TiB 8.7 TiB                                
...
sh-4.4# ceph osd tree
ID  CLASS WEIGHT  TYPE NAME             STATUS REWEIGHT PRI-AFF
 -1       8.73299 root default                                  
-12       2.91100     rack rack0                                
-11       2.91100         host master-0                        
  2   ssd 2.91100             osd.2         up  1.00000 1.00000
 -8       2.91100     rack rack1                                
 -7       2.91100         host master-1                        
  0   ssd 2.91100             osd.0         up  1.00000 1.00000
 -4       2.91100     rack rack2                                
 -3       2.91100         host master-2                        
  1   ssd 2.91100             osd.1         up  1.00000 1.00000

Comment 1 Yaniv Kaul 2020-08-25 11:42:28 UTC
I'd argue that with more OSDs the impact is going to be less than with 3, which is probably the worst-case scenario?

Comment 2 Ben England 2020-09-22 19:25:41 UTC
Well, I'm not convinced it's an uncommon case, but this configuration is particularly visible to customers when they first encounter OCS, because they start out with the low-end configuration to save money and work their way up from there.

I would be happy to try it out with more OSDs when time permits. There is a workaround: increase the PG count from the default of 32, which is far too low for this case. But this workaround is not practical for clusters with higher OSD counts, so you really do want to fix the balancer to handle this if at all possible.
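
For reference, here is a rough sketch of the commonly cited PG sizing rule of thumb (roughly 100 PGs per OSD divided by the replica count, rounded up to a power of two). This is general Ceph guidance rather than a number taken from this bug, and the function name and defaults are only illustrative:

def suggested_pg_num(num_osds, replica_count, target_pgs_per_osd=100):
    # ~100 PGs per OSD divided by the replica count, rounded up to a power of two
    raw = num_osds * target_pgs_per_osd / replica_count
    pg_num = 1
    while pg_num < raw:
        pg_num *= 2
    return pg_num

# For the 3-OSD, 3-replica cluster described above this suggests 128 PGs,
# versus the default of 32:
print(suggested_pg_num(3, 3))   # 128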

The next attachment shows it is still happening in OCS 4.5 GA.

Comment 3 Ben England 2020-09-22 19:26:42 UTC
Created attachment 1715745 [details]
screenshot of grafana dashboard for OCS 4.5 GA cluster

Here is a screenshot of a grafana dashboard from an OCS 4.5 test on baremetal with rook-ceph, displaying block device IOPS, that illustrates how extreme this can be. In this screenshot, we see a sequence of fio tests: 4096-KiB-randwrite, 4096-KiB-randread, 1024-KiB-randwrite, 1024-KiB-randread, etc., with size decreasing by a factor of 4 each round. The graphs for block device throughput in MB/s show the same behavior. For the write tests, you can see that throughput is the same for all OSDs, but for reads, master-0 IOPS is roughly half of master-1 and master-2 IOPS.

Comment 4 Josh Durgin 2021-01-22 23:37:32 UTC
This isn't going to make the initial pacific release.

Comment 5 Ben England 2021-03-18 16:01:39 UTC
cc'ing Annette Clewett and others. I have major concerns about this: it is an old problem, it is impacting at least one major customer now, and we need it sooner rather than later. Efficiency becomes much more important as you scale out, because the cost savings are amplified. And we now have OCS customers with over 40 nodes (not much by Ceph standards, but enough to make this a significant impact for read-intensive workloads). This ongoing balancing issue gives Ceph a black eye; customers really care whether they can fully utilize the resources they are paying for. BTW, if you don't balance read I/O, this will adversely impact writes as well, because most workloads are a mixture of reads and writes, so it will defeat the ability of the ceph-mgr balancer module to balance writes across OSDs.

Does this really have anything to do with the Pacific release? Or is it strictly that no one thinks this change is critical to sales of RHCS and OCS? The initial example above is from a very small cluster; the effects may be much more severe in large clusters. Do you need data on this?

Comment 6 Josh Durgin 2021-03-18 16:20:06 UTC
(In reply to Ben England from comment #5)

Hey Ben, we are working on this and understand its importance. It's not in Pacific due to implementation complexity. Once it's working in master we'll likely backport to Pacific and potentially earlier releases.

Comment 7 Ben England 2021-05-28 20:11:44 UTC
That's great Josh, thank you.   I think this will be a significant performance boost for Ceph in general.

Comment 9 Shekhar Berry 2021-08-09 15:48:32 UTC
Hi,

I am evaluating ODF managed service performance on the AWS platform.

--------------------------------------------------------------------
Here's my version details:

oc get csv 
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
ocs-operator.v4.7.2                       OpenShift Container Storage   4.7.2                                                       Succeeded
ocs-osd-deployer.v1.0.0                   OCS OSD Deployer              1.0.0                                                       Succeeded
prometheusoperator.0.47.0                 Prometheus Operator           0.47.0            prometheusoperator.0.37.0                 Succeeded
route-monitor-operator.v0.1.353-20cc01d   Route Monitor Operator        0.1.353-20cc01d   route-monitor-operator.v0.1.351-26e4120   Succeeded

Server Version: 4.8.2
Kubernetes Version: v1.21.1+051ac4f

ceph version
ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)

------------------------------------------------------------------------

Now when I look at ceph pg dump

----
OSD_STAT USED    AVAIL   USED_RAW TOTAL HB_PEERS PG_SUM PRIMARY_PG_SUM 
2        300 GiB 723 GiB  301 GiB 1 TiB    [0,1]    192             73 
0        300 GiB 723 GiB  301 GiB 1 TiB    [1,2]    192             67 
1        300 GiB 723 GiB  301 GiB 1 TiB    [0,2]    192             52 

If you look at PRIMARY_PG_SUM, the primary PGs are not balanced at all. This leads to a ~38% performance difference between the OSD with 73 primary PGs and the OSD with 52 primary PGs. I ran a random read test with 8 KiB, 16 KiB, and 64 KiB block sizes, and I am attaching the IOPS numbers seen on the NVMe devices (see attachment). To summarize the performance:

OSD.0 : 2632 IOPS
OSD.1 : 1907 IOPS
OSD.2 : 3072 IOPS

Estimated IOPS loss: (3072 - 1907) * 100 / (3072 + 1907 + 2632) = ~15%

ceph osd tree
ID  CLASS WEIGHT  TYPE NAME                                  STATUS REWEIGHT PRI-AFF 
 -1       3.00000 root default                                                       
 -6       3.00000     region us-east-1                                               
 -5       3.00000         zone us-east-1a                                            
 -4       1.00000             rack rack0                                             
 -3       1.00000                 host default-1-data-0959ww                         
  0   ssd 1.00000                     osd.0                      up  1.00000 1.00000 
-16       1.00000             rack rack1                                             
-15       1.00000                 host default-2-data-0z6v89                         
  2   ssd 1.00000                     osd.2                      up  1.00000 1.00000 
-12       1.00000             rack rack2                                             
-11       1.00000                 host default-0-data-0r7wwc                         
  1   ssd 1.00000                     osd.1                      up  1.00000 1.00000

Comment 11 Yaniv Kaul 2021-08-10 12:18:49 UTC
Raising severity/priority to High/High due to the impact on OCS.
Not sure if it's worth repeating on Pacific-based OCS, btw.

Comment 12 Josh Salomon 2021-08-11 11:08:05 UTC
CC: jsalomon

Comment 21 Ben England 2022-03-08 21:34:35 UTC
Is there an upstream tracker?

Comment 23 Ben England 2022-03-29 20:57:26 UTC
In grad school about 15 years ago, I studied an optimization method called "simulated annealing"; this problem is like that, only simpler.

https://en.wikipedia.org/wiki/Simulated_annealing

The metaphor is that you heat up metal (i.e. random molecular motion) to make it liquid so you can pour it into a mold, and then you cool it down, temper it, and pound it to get strength and shape. With optimization methods, you can use a randomized algorithm to get an initial rough distribution that approximates what you want (i.e. randomly assign PGs to OSDs using CRUSH constraints), and then use hill-climbing or greedy optimization to make the initial distribution better in some simple way. For PG distribution across OSDs, a greedy algorithm should work fine.

https://en.wikipedia.org/wiki/Greedy_algorithm

Specifically, you can't make things worse for reads by moving the primary role away from the most heavily loaded OSD (the one with the most primary PGs), and it's pretty easy to see how you can make things better for reads (move the primary role to a less loaded OSD). However, this alters PG placement and so it affects the distribution of PGs (including non-primary copies) across OSDs, possibly adversely impacting writes. So you have to optimize the primary OSD placement first, and then, once this is out of the way, optimize the secondary OSD placement, which doesn't directly impact read performance. I suspect this would be an algorithm that can be proven to converge to an optimal solution in a bounded number of steps (I would guess O(N), where N is the OSD count).
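
To make the greedy idea concrete, here is a minimal sketch (my own illustration, not the actual ceph-mgr balancer code; the input format, names, and pgids are made up): repeatedly hand one primary role from the OSD with the most primary PGs to the least-loaded OSD that already holds a replica of one of its PGs, so no data has to move, and stop when no move reduces the spread.

from collections import Counter

def greedy_primary_balance(acting):
    # acting: {pgid: [osd, ...]} where the first OSD in each list is the primary.
    # Returns {pgid: new_primary_osd} overrides that flatten the per-OSD
    # primary-PG counts; the new primary is always chosen from the PG's
    # existing acting set, so only the primary role moves, not the data.
    primaries = Counter(osds[0] for osds in acting.values())
    for osds in acting.values():           # count OSDs with zero primaries too
        for osd in osds:
            primaries.setdefault(osd, 0)
    overrides = {}
    while True:
        busiest = max(primaries, key=primaries.get)
        best = None
        for pgid, osds in acting.items():
            if osds[0] != busiest:
                continue
            # least-loaded replica of this PG that could take over as primary
            alt = min(osds[1:], key=primaries.get, default=None)
            if alt is not None and (best is None or primaries[alt] < primaries[best[1]]):
                best = (pgid, alt)
        if best is None or primaries[best[1]] + 1 >= primaries[busiest]:
            break                          # no move improves the spread
        pgid, alt = best
        overrides[pgid] = alt
        acting[pgid] = [alt] + [o for o in acting[pgid] if o != alt]
        primaries[busiest] -= 1
        primaries[alt] += 1
    return overrides

# Toy run shaped like the 3-OSD example above (pgids are invented):
toy = {"1.%x" % i: ([0, 1, 2] if i % 3 else [2, 0, 1]) for i in range(12)}
print(greedy_primary_balance(toy))   # hands several primary roles off osd.0

Each accepted move strictly reduces the imbalance, so the loop terminates; whether the real balancer should use this exact greedy rule is of course up to the implementers.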

However, another problem is: how do we continue to provide access to the PG while changing its primary OSD, since the primary is the one all the RADOS clients are trying to talk to? Surely there must already be a way to handle loss of the primary OSD, so I would think the problem of moving a PG's primary OSD should not be much harder than the problem of losing a PG's primary OSD.

Comment 24 Ben England 2022-03-30 12:55:24 UTC
Perhaps one automated test for the PG autoscaler and balancer could be to force them to run using "ceph osd pool set your-pool target_size_ratio 0.99" and evaluate the evenness of the PG distribution when they are done, like Mark Nelson's https://github.com/ceph/cbt/blob/master/tools/readpgdump.py script does, rather than waiting for it to happen and then running an I/O test. No data needs to be written if you use target_size_ratio, so in this case the PG autoscaler and balancer can run very fast. If the PG distribution is good (and it's painfully obvious when it's not), then in my experience the I/O across OSDs should be evenly balanced.
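
A rough sketch of what such a check could look like, counting primary PGs per OSD from "ceph pg dump -f json" and failing if the spread is too wide. This is only an illustration, not readpgdump.py: it assumes the dump exposes an acting_primary field per PG under pg_map/pg_stats (the exact JSON layout varies by release), and the spread threshold is arbitrary.

import json
import subprocess
from collections import Counter

def primary_pg_counts():
    # parse "ceph pg dump -f json"; assumes pg_stats[].acting_primary is present
    out = subprocess.run(["ceph", "pg", "dump", "-f", "json"],
                         capture_output=True, text=True, check=True).stdout
    dump = json.loads(out)
    pg_stats = dump.get("pg_map", dump).get("pg_stats", [])
    return Counter(pg["acting_primary"] for pg in pg_stats)

def check_primary_balance(max_spread=5):
    counts = primary_pg_counts()
    spread = max(counts.values()) - min(counts.values())
    print("primary PGs per OSD:", dict(counts), "spread:", spread)
    return spread <= max_spread

if __name__ == "__main__":
    raise SystemExit(0 if check_primary_balance() else 1)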

Comment 65 errata-xmlrpc 2024-11-25 08:58:38 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 8.0 security, bug fix, and enhancement updates), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:10216

Comment 66 Red Hat Bugzilla 2025-03-26 04:25:04 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days