Description of problem:
-----------------------
My setup consists of 3 servers, 1 MON and 3 clients. Each server has 12 OSDs.

4k librbd random reads are slower on the latest 3.0 developer bits:

CEPH 2.4 Gold       : 7322 kB/sec
CEPH 3.0 2:12.2.0-2 : 5708 kB/sec
Regression          : ~23%

I ran the same test on the same hardware on the 2.4 bits to rule out slower disks etc. The minute I upgrade to 3.0, I see a drop on rand reads.

Exact workload from CBT:

/usr/local/bin/fio --ioengine=rbd --clientname=admin --pool=cbt-librbdfio --rbdname=cbt-librbdfio-`hostname -s`-0 --invalidate=0 --rw=randread --runtime=300 --ramp_time=None --numjobs=1 --direct=1 --bs=4096B --iodepth=128 --end_fsync=0 --write_iops_log=/tmp/cbt/00000000/LibrbdFio/osd_ra-00004096/op_size-00004096/concurrent_procs-003/iodepth-128/randread/output.0 --write_bw_log=/tmp/cbt/00000000/LibrbdFio/osd_ra-00004096/op_size-00004096/concurrent_procs-003/iodepth-128/randread/output.0 --write_lat_log=/tmp/cbt/00000000/LibrbdFio/osd_ra-00004096/op_size-00004096/concurrent_procs-003/iodepth-128/randread/output.0 --log_avg_msec=100 --name=librbdfio-`hostname -s`-0 > /tmp/cbt/00000000/LibrbdFio/osd_ra-00004096/op_size-00004096/concurrent_procs-003/iodepth-128/randread/output.0

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
ceph-base.x86_64 2:12.2.0-2

How reproducible:
-----------------
Every which way I try.

Actual results:
---------------
Perf drop of ~23%

Expected results:
-----------------
Regression threshold between releases/dev builds for perf tests is +/-10%
@Ambarish: can you please isolate the potential perf drop by testing a RHCS 3.x librbd against a RHCS 2.x cluster -- and vice-versa?
(In reply to Jason Dillaman from comment #2)
> @Ambarish: can you please isolate the potential perf drop by testing a RHCS
> 3.x librbd against a RHCS 2.x cluster -- and vice-versa?

Hi Jason,

I ran my tests on CEPH 2.4_async, which is my baseline, reimaged my machines to CEPH 3.0 and ran CBT again, and compiled the results here:

https://docs.google.com/a/redhat.com/spreadsheets/d/14WEAGz2TeL9xJ5AJYipISTCTqMxOeBdg0LJFv63g14w/edit?usp=sharing

(This has historical data shared by Ben Turner as well.)

2:12.2.0-2 is what I tested for CEPH 3 => this was tested multiple times on fresh installs to ensure consistency. I could repro the regression on rand reads each time.

I did install CEPH 2.4 again on my machines to rule out slower disks and other env problems, but each time I got 7000-ish kB/s rand reads.

I've been doing perf regressions for Gluster for a while now. I am not sure what data would be helpful to a CEPH developer for an RCA, to be honest.

Would it be helpful if I narrow it down at the build level (I think there are a couple of builds for 3.0; I tested only 2:12.2.0-2) to find which build introduced the regression?
@Ambarish: I am interested in first figuring out if the performance regression is in the OSDs or in the client. That's why I was wondering if you had tested using RHCS 2.x OSDs and a RHCS 3.x client.
(In reply to Jason Dillaman from comment #5)
> @Ambarish: I am interested in first figuring out if the performance
> regression is in the OSDs or in the client. That's why I was wondering if
> you had tested using RHCS 2.x OSDs and a RHCS 3.x client.

Ah, thanks for clearing that up. I'll get that ASAP.
Hi Jason,

Just to reiterate..

*CEPH 2.4 Testbed* : 7322 kB/sec

*CEPH 3.0 Testbed* : 5708 kB/sec

And these are rand read throughputs on the combinations you asked me to test:

*CEPH 3.0 Cluster/CEPH 2.4 Clients* : 7572 kB/sec

*CEPH 2.4 Cluster/CEPH 3.0 Clients* : 5594 kB/s

It looks like the perf regression is coming from the OSDs.
@Ambarish: am I reading the results incorrectly? To me, it seems like the results show that the client is responsible for the slowdown, since a 2.4 cluster w/ 3.0 clients shows the slower speed.

If that is the case, the only major change from 2.4 to 3.0 was the switch from the original simple messenger layer to the new async messenger layer. Would it be possible for you to run a test where the "ceph.conf" had "ms type = simple" in the "[global]" section?
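For reference, a sketch of the change being requested on the client side; the file location and section layout are assumed to be the stock ones, and only the "ms type" line comes from the suggestion above:

    [global]
    # revert the client from the new async messenger to the old simple
    # messenger for one test run, then re-run the CBT/fio rand-read job
    ms type = simple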
(In reply to Ambarish from comment #7)
> Hi Jason,
>
> Just to reiterate..
>
> *CEPH 2.4 Testbed* : 7322 kB/sec
>
> *CEPH 3.0 Testbed* : 5708 kB/sec
>
> And these are rand read throughputs on the combinations you asked me to test:
>
> *CEPH 3.0 Cluster/CEPH 2.4 Clients* : 7572 kB/sec
>
> *CEPH 2.4 Cluster/CEPH 3.0 Clients* : 5594 kB/s
>
> It looks like the perf regression is coming from the OSDs.

ACCCCKKKKKK! My apologies for the typo, Jason. I meant to write "clients" there.

Sure, I'll get that data and update the BZ.
I can repeat this locally on my development machine:

Jewel v10.2.10:
  bw ( KiB/s): min=32864, max=66240, per=99.96%, avg=42622.84, stdev=3557.04, samples=491
  iops       : min= 8216, max=16560, avg=10655.70, stdev=889.26, samples=491

Luminous v12.2.1:
  bw ( KiB/s): min=28312, max=39184, per=99.93%, avg=33852.23, stdev=2500.48, samples=53
  iops       : min= 7078, max= 9796, avg=8463.06, stdev=625.08, samples=53
In my environment, the slowdown is caused by "lockdep" being enabled in my vstart'd development cluster. When I disable "lockdep" in my ceph.conf (it is disabled by default), my Luminous results are consistently better than my Jewel results:

Jewel v10.2.10 w/ lockdep disabled:
  bw ( KiB/s): min=35256, max=56120, per=99.99%, avg=44213.17, stdev=3607.10, samples=120
  iops       : min= 8814, max=14030, avg=11053.28, stdev=901.78, samples=120

Luminous v12.2.1 w/ lockdep disabled:
  bw ( KiB/s): min=32424, max=55312, per=99.99%, avg=45659.46, stdev=4059.56, samples=120
  iops       : min= 8106, max=13828, avg=11414.86, stdev=1014.90, samples=120

@Ambarish: can you confirm whether or not lockdep is enabled in your client's ceph.conf?
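For anyone checking their own setup: since lockdep defaults to off, it only matters if it was explicitly enabled. A client ceph.conf that would reproduce the slow numbers above would contain something along these lines (sketch; section placement is an assumption):

    [global]
    # lock-dependency (deadlock-detection) instrumentation; adds bookkeeping
    # to every mutex operation on the librbd/librados client I/O path
    lockdep = true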
Ambarish, this is one of the most serious open issues; please confirm it is resolved - we are running out of time to work on bugs.
Hi Jason, Federico,

My sincere apologies; I am currently OOO for Diwali and hence could not reply sooner.

Lockdep is disabled. I did not manually change/edit anything differently for 3.0. The conf was exactly the same between releases.

I've triggered a run with ceph-12.2.1-15 to check the latest stats. Will share an update.
@Ambarish: if you get a chance, if the results are the same on the retest, can you manually add "lockdep = false" under the ceph.conf's "[global]" section? My performance deltas basically mimic your results (percentage-wise) for the various block sizes when I have lockdep enabled and are consistently faster than RHCS 2.x when it's disabled.
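A sketch of the explicit override being asked for, assuming the stock single /etc/ceph/ceph.conf on the client:

    [global]
    # force lockdep off for the retest, even though it should already
    # default to false when unset
    lockdep = false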
(In reply to Jason Dillaman from comment #15)
> @Ambarish: if you get a chance, if the results are the same on the retest,
> can you manually add "lockdep = false" under the ceph.conf's "[global]"
> section? My performance deltas basically mimic your results
> (percentage-wise) for the various block sizes when I have lockdep enabled
> and are consistently faster than RHCS 2.x when it's disabled.

Jason,

I do not see much improvement on my 3.0 testbed with lockdep disabled.

lockdep enabled  : 5640 kB/sec
lockdep disabled : 5990 kB/sec

The throughput is still not close to my baseline (close to 7k kB/s).
Testing on lab hardware shows that the issue appears to be within RHCS 3.0 librados:

RHCS 3.0 librados w/ RHCS 3.0 librbd:
  bw ( KiB/s): min=81136, max=100240, per=100.00%, avg=93541.35, stdev=3007.49, samples=600
  iops       : min=20284, max=25060, avg=23385.31, stdev=751.86, samples=600

RHCS 3.0 librados w/ RHCS 2.3 librbd:
  bw ( KiB/s): min=77632, max=100520, per=100.00%, avg=93293.47, stdev=3271.76, samples=600
  iops       : min=19408, max=25130, avg=23323.35, stdev=817.94, samples=600

RHCS 2.3 librados w/ RHCS 3.0 librbd:
  bw ( KiB/s): min=95528, max=113920, per=100.00%, avg=108487.70, stdev=2943.85, samples=600
  iops       : min=23882, max=28480, avg=27121.89, stdev=735.96, samples=600
Disabling messenger logging (via "debug ms = 0") restores most of the performance delta (-5%):

RHCS 3.0:
  bw ( KiB/s): min=132216, max=161992, per=100.00%, avg=150387.60, stdev=4656.73, samples=379
  iops       : min=33056, max=40498, avg=37596.87, stdev=1164.15, samples=379

RHCS 2.3:
  bw ( KiB/s): min=148936, max=172664, per=100.00%, avg=158056.64, stdev=5141.18, samples=132
  iops       : min=37234, max=43166, avg=39514.15, stdev=1285.23, samples=132
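For reference, the messenger-logging change described above would look roughly like this on the librados/librbd client (sketch; section placement is an assumption):

    [global]
    # turn off messenger debug logging for the client
    debug ms = 0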
Tweaking the new RHCS 3.0 backoff logic resulted in another ~8% performance gain:

RHCS 3.0 w/ "debug ms = 0" + backoff tweak:
  bw ( KiB/s): min=158688, max=165488, per=100.00%, avg=162691.57, stdev=1706.77, samples=30
  iops       : min=39672, max=41372, avg=40672.87, stdev=426.71, samples=30
Upstream PR: https://github.com/ceph/ceph/pull/18418
Upstream PR: https://github.com/ceph/ceph/pull/18427
Comparable throughput to the baseline on ceph-12.2.1-24. Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:3387