Bug 1870804
| Summary: | balancer module does not balance reads, only writes | | |
| --- | --- | --- | --- |
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Ben England <bengland> |
| Component: | RADOS | Assignee: | Laura Flores <lflores> |
| Status: | CLOSED ERRATA | QA Contact: | Pawan <pdhiran> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.2 | CC: | abose, aclewett, akraj, akupczyk, bbenshab, bhubbard, bniver, bperkins, ceph-eng-bugs, ekuric, flucifre, jdurgin, jhopper, jsalomon, kramdoss, lflores, nojha, owasserm, pdhiran, rpollack, rzarzyns, shberry, sseshasa, tserlin, twilkins, vumrao |
| Target Milestone: | --- | Keywords: | Performance |
| Target Release: | 8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | ceph-19.1.1-8.el9cp | Doc Type: | Enhancement |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2024-11-25 08:58:38 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2317218 | | |
| Attachments: | | | |

Doc Text:

.Balanced primary placement groups can now be observed in a cluster
Previously, users could only balance primaries with the offline `osdmaptool`.
With this enhancement, autobalancing is available with the `upmap` balancer. Users can now choose between either the `upmap-read` or `read` mode. The `upmap-read` mode offers simultaneous upmap and read optimization. The `read` mode can only be used to optimize reads.
For more information, see operations/proc_mgr_using-the-ceph-manager-balancer-module.
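Based on the Doc Text above, enabling the new read autobalancing on a release that carries the fix might look like the following. Treat this as a sketch built from the documented mode names, not a verified procedure; prerequisites such as the minimum client compatibility level required for primary upmaps should be checked in the balancer documentation referenced above.

# Sketch: switch the ceph-mgr balancer to the combined upmap + read mode.
ceph balancer on
ceph balancer mode upmap-read    # or "read" to optimize reads only
ceph balancer status             # confirm the active mode and last optimization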
Description
Ben England
2020-08-20 18:59:19 UTC
I'd argue that for more OSDs the impact is presumably going to be less than with 3, which is probably the worst-case scenario? Well, I'm not convinced it's an uncommon case, but this configuration is particularly visible to customers when they first encounter OCS, because they start out with the low-end configuration to save money and work their way up from there. I would be happy to try it out with more OSDs and check for you when time permits.

There is a workaround: increase the PG count from the default of 32, which is far too low for this case. But this workaround is not practical for clusters with higher OSD counts, so you really do want to fix the balancer to handle this if at all possible.

The next attachment shows it is still happening in OCS 4.5 GA.

Created attachment 1715745 [details]
screenshot of grafana dashboard for OCS 4.5 GA cluster
Here is a screenshot of a grafana dashboard from an OCS 4.5 test on bare metal with rook-ceph, displaying block device IOPS; it illustrates how extreme this can be. In this screenshot we see a sequence of fio tests: 4096-KiB-randwrite, 4096-KiB-randread, 1024-KiB-randwrite, 1024-KiB-randread, etc., with block size decreasing by a factor of 4 each round. The graphs for block device throughput in MB/s show the same behavior. For the write tests, you can see that throughput is the same for all OSDs, but for reads, master-0 IOPS is roughly half of master-1 and master-2 IOPS.
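To make the test pattern concrete, one round of the random-read phase could be reproduced with something like the fio invocation below. This is a sketch only: the target device, runtime, and queue depth are illustrative assumptions, not the job parameters actually used in the test above.

# Hypothetical example of a single 1024-KiB randread round; /dev/rbd0, the
# runtime, and the iodepth are placeholders, not the original test settings.
fio --name=randread-1024k \
    --filename=/dev/rbd0 \
    --rw=randread --bs=1024k \
    --ioengine=libaio --direct=1 \
    --iodepth=16 --time_based --runtime=300 \
    --group_reporting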
This isn't going to make the initial Pacific release.

cc'ing Annette Clewett and others. I have major concerns about this; it is an ancient problem, it is impacting at least one major customer now, and we need it sooner rather than later. Efficiency becomes much more important as you scale out, because the cost savings are amplified. And we now have OCS customers with over 40 nodes (not much by Ceph standards, but enough for this to have a significant impact on read-intensive workloads). This ongoing balancing issue gives Ceph a black eye; customers really care whether they can fully utilize the resources they are paying for. BTW, if you don't balance read I/O, then this will adversely impact writes as well, because most workloads are a mixture of reads and writes, so it will defeat the ability of the ceph-mgr balancer module to balance writes across OSDs.

Does this really have anything to do with the Pacific release? Or is it strictly that no one thinks this change is critical to sales of RHCS and OCS? The initial example above is for a very small cluster; the effects may be much more severe in large clusters. Do you need data on this?

(In reply to Ben England from comment #5)
> cc'ing Annette Clewett and others. I have major concerns about this, it is
> an ancient problem and it is impacting at least one major customer now and
> we need it sooner rather than later. Efficiency becomes much more
> important as you scale out, because the cost savings are amplified. And we
> now have OCS customers with over 40 nodes (not much by Ceph standards but
> enough to make this significant impact for read-intensive workloads). This
> ongoing balancing issue gives Ceph a black eye - customers really care if
> you can fully utilize the resources that they are paying for. BTW if you
> don't balance read I/O, then this will adversely impact writes as well
> because most workloads are a mixture of reads and writes, so it will defeat
> the ability of ceph-mgr balancer module to balance writes across OSDs.
>
> does this really have anything to do with Pacific release? Or is it
> strictly that no one thinks this change is critical to sales of RHCS and
> OCS? The initial example above is for a very small cluster, the effects
> may be much more severe in large clusters, do you need data on this?

Hey Ben, we are working on this and understand its importance. It's not in Pacific due to implementation complexity. Once it's working in master we'll likely backport to Pacific and potentially earlier releases.

That's great Josh, thank you. I think this will be a significant performance boost for Ceph in general.

Hi, I am evaluating ODF managed service performance on the AWS platform.
--------------------------------------------------------------------
Here are my version details:

oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
ocs-operator.v4.7.2                       OpenShift Container Storage   4.7.2                                                       Succeeded
ocs-osd-deployer.v1.0.0                   OCS OSD Deployer              1.0.0                                                       Succeeded
prometheusoperator.0.47.0                 Prometheus Operator           0.47.0            prometheusoperator.0.37.0                 Succeeded
route-monitor-operator.v0.1.353-20cc01d   Route Monitor Operator        0.1.353-20cc01d   route-monitor-operator.v0.1.351-26e4120   Succeeded

Server Version: 4.8.2
Kubernetes Version: v1.21.1+051ac4f

ceph version
ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)
------------------------------------------------------------------------

Now when I look at ceph pg dump:

OSD_STAT  USED     AVAIL    USED_RAW  TOTAL  HB_PEERS  PG_SUM  PRIMARY_PG_SUM
2         300 GiB  723 GiB  301 GiB   1 TiB   [0,1]     192     73
0         300 GiB  723 GiB  301 GiB   1 TiB   [1,2]     192     67
1         300 GiB  723 GiB  301 GiB   1 TiB   [0,2]     192     52

If you look at PRIMARY_PG_SUM, the primary PGs are not balanced at all. This actually led to a ~38% performance drop between the OSD with 73 primary PGs and the OSD with 52. I ran a random-read test with 8 KiB, 16 KiB, and 64 KiB block sizes and I am attaching the IOPS numbers seen on the NVMe devices (see attachment). To summarize the performance:

OSD.0: 2632 IOPS
OSD.1: 1907 IOPS
OSD.2: 3072 IOPS

Estimated IOPS loss: (3072 - 1907) * 100 / (3072 + 1907 + 2632) = ~15%

ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME                              STATUS  REWEIGHT  PRI-AFF
 -1          3.00000  root default
 -6          3.00000      region us-east-1
 -5          3.00000          zone us-east-1a
 -4          1.00000              rack rack0
 -3          1.00000                  host default-1-data-0959ww
  0    ssd   1.00000                      osd.0                  up   1.00000  1.00000
-16          1.00000              rack rack1
-15          1.00000                  host default-2-data-0z6v89
  2    ssd   1.00000                      osd.2                  up   1.00000  1.00000
-12          1.00000              rack rack2
-11          1.00000                  host default-0-data-0r7wwc
  1    ssd   1.00000                      osd.1                  up   1.00000  1.00000

Raising severity/priority to High/High due to the impact on OCS. Not sure if it's worth repeating on Pacific-based OCS, btw.

CC: jsalomon

Is there an upstream tracker?

In grad school about 15 years ago I studied an optimization method called "simulated annealing"; this problem is like that, only simpler. https://en.wikipedia.org/wiki/Simulated_annealing The metaphor is that you heat up metal (i.e. random molecular motion) to make it liquid so you can pour it into a mold, and then you cool it down, temper it, and pound it to get strength and shape. With optimization methods, you can use a random algorithm to get an initial rough distribution that approximates what you want (i.e. randomly assign PGs to OSDs using CRUSH constraints), and then you use hill-climbing or greedy optimization to make the initial distribution better in some simple way. For PG distribution across OSDs, a greedy algorithm should work fine: https://en.wikipedia.org/wiki/Greedy_algorithm

Specifically, you can't make things worse for reads if you move a PG's primary off the most heavily loaded OSD (the one with the most primary PGs), and it's pretty easy to see how you can make things better for reads (move the primary to a less loaded OSD). However, this alters PG placement and so it impacts the distribution of PGs (including non-primary OSDs) across OSDs, possibly adversely impacting writes. So you have to optimize the primary OSD placement first, and then, once this is out of the way, optimize the secondary OSD placement, which doesn't directly impact read performance.
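To connect the greedy idea above to what the cluster exposes, a single manual step of it could look roughly like the sketch below, on a Ceph release new enough to support primary upmaps (Reef or later, not the Nautilus cluster shown above). The PG id 1.2f and the chosen target OSD are made-up placeholders; this is not the balancer module's actual implementation.

# Sketch only: pick one PG whose acting primary is the busiest OSD (osd.2 in
# the output above) and remap its primary to the least-loaded OSD (osd.1).
ceph pg dump osds                     # PRIMARY_PG_SUM per OSD, as shown above
ceph pg dump pgs_brief                # lists each PG with its acting primary
ceph osd pg-upmap-primary 1.2f 1      # 1.2f is a hypothetical PG id in pool 1
ceph osd rm-pg-upmap-primary 1.2f     # removes the override if it did not help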
I suspect this would be an algorithm that can be proven to converge to an optimal solution in a bounded number of steps (I would guess O(N), where N is the OSD count). However, another problem is: how do we continue to provide access to the PG while changing the primary OSD, since that's the one the RADOS clients are all trying to talk to? Surely there must already be a way to handle loss of the primary OSD, so I would think the problem of moving a PG's primary OSD would be not much harder than the problem of losing a PG's primary OSD.

Perhaps one automated test for the PG autoscaler and balancer could be to force them to run using "ceph osd pool set your-pool target_size_ratio 0.99" and evaluate the evenness of the PG distribution when they are done, like Mark Nelson's https://github.com/ceph/cbt/blob/master/tools/readpgdump.py script does, rather than waiting for it to happen and then running an I/O test. No data needs to be written if you use target_size_ratio, so in this case the PG autoscaler and balancer can run very fast. If the PG distribution is good (and it's painfully obvious when it's not), then the I/O across OSDs should be evenly balanced, in my experience.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 8.0 security, bug fix, and enhancement updates), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:10216

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days