Bug 1496674 - [Perf] : 4k Random reads are off target by ~23%.
Summary: [Perf] : 4k Random reads are off target by ~23%.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RBD
Version: 3.0
Hardware: x86_64
OS: Linux
Priority: low
Severity: high
Target Milestone: rc
Target Release: 3.0
Assignee: Jason Dillaman
QA Contact: Ambarish
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-09-28 05:58 UTC by Ambarish
Modified: 2017-12-05 23:46 UTC
CC List: 10 users

Fixed In Version: RHEL: ceph-12.2.1-24.el7cp Ubuntu: ceph_12.2.1-27redhat1xenial
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-05 23:46:00 UTC
Embargoed:




Links
Ceph Project Bug Tracker 21844 (last updated 2017-10-19 02:48:50 UTC)
Ceph Project Bug Tracker 21845 (last updated 2017-10-19 02:48:33 UTC)
Ceph Project Bug Tracker 21846 (last updated 2017-10-19 02:48:18 UTC)
Ceph Project Bug Tracker 21860 (last updated 2017-10-19 22:51:09 UTC)
Red Hat Product Errata RHBA-2017:3387 (SHIPPED_LIVE): Red Hat Ceph Storage 3.0 bug fix and enhancement update (last updated 2017-12-06 03:03:45 UTC)

Description Ambarish 2017-09-28 05:58:56 UTC
Description of problem:
-----------------------

My setup consists of 3 servers, 1 MON, and 3 clients. Each server has 12 OSDs.

4k librbd random reads are noticeably slower on the latest 3.0 developer bits.

CEPH 2.4 Gold : 7322 kB/sec
CEPH 3.0 2:12.2.0-2 : 5708 kB/sec

Regression : ~23%


I ran the same test on the same hardware on 2.4 bits to rule out slower disks, etc. The minute I upgrade to 3.0, I see a drop on random reads.


Exact workload from CBT:

/usr/local/bin/fio --ioengine=rbd --clientname=admin --pool=cbt-librbdfio \
    --rbdname=cbt-librbdfio-`hostname -s`-0 --invalidate=0 --rw=randread \
    --runtime=300 --ramp_time=None --numjobs=1 --direct=1 --bs=4096B \
    --iodepth=128 --end_fsync=0 \
    --write_iops_log=/tmp/cbt/00000000/LibrbdFio/osd_ra-00004096/op_size-00004096/concurrent_procs-003/iodepth-128/randread/output.0 \
    --write_bw_log=/tmp/cbt/00000000/LibrbdFio/osd_ra-00004096/op_size-00004096/concurrent_procs-003/iodepth-128/randread/output.0 \
    --write_lat_log=/tmp/cbt/00000000/LibrbdFio/osd_ra-00004096/op_size-00004096/concurrent_procs-003/iodepth-128/randread/output.0 \
    --log_avg_msec=100 --name=librbdfio-`hostname -s`-0 \
    > /tmp/cbt/00000000/LibrbdFio/osd_ra-00004096/op_size-00004096/concurrent_procs-003/iodepth-128/randread/output.0

Version-Release number of selected component (if applicable):
-------------------------------------------------------------

ceph-base.x86_64 2:12.2.0-2   

How reproducible:
-----------------

Every which way I try.


Actual results:
---------------

Performance drop of ~23% on 4k random reads.


Expected results:
-----------------

Regression threshold between releases/dev builds for perf tests is ±10%.

Comment 2 Jason Dillaman 2017-09-28 12:37:39 UTC
@Ambarish: can you please isolate the potential perf drop by testing a RHCS 3.x librbd against a RHCS 2.x cluster -- and vice-versa?

Comment 3 Ambarish 2017-09-29 04:12:25 UTC
(In reply to Jason Dillaman from comment #2)
> @Ambarish: can you please isolate the potential perf drop by testing a RHCS
> 3.x librbd against a RHCS 2.x cluster -- and vice-versa?

Hi Jason,

I ran my tests on CEPH 2.4_async, which is my baseline, then reimaged my machines to CEPH 3.0, ran CBT again, and compiled the results here:

https://docs.google.com/a/redhat.com/spreadsheets/d/14WEAGz2TeL9xJ5AJYipISTCTqMxOeBdg0LJFv63g14w/edit?usp=sharing

(This has historical data shared by Ben Turner as well)

2:12.2.0-2 is what I tested for CEPH 3.0; it was tested multiple times on fresh installs to ensure consistency. I could reproduce the regression on random reads each time.

I did install CEPH 2.4 again on my machines to rule out slower disks and other environment problems, and each time I got roughly 7000 kB/s on random reads.


I've been doing perf regressions for Gluster for a while now. To be honest, I am not sure what data would be helpful to a Ceph developer for an RCA.

Would it be helpful if I narrowed it down at the build level, i.e., which build introduced the regression? (I think there are a couple of builds for 3.0; I tested only 2:12.2.0-2.)

Comment 5 Jason Dillaman 2017-09-29 12:34:17 UTC
@Ambarish: I am interested in first figuring out if the performance regression is in the OSDs or in the client. That's why I was wondering if you had tested using RHCS 2.x OSDs and a RHCS 3.x client.

Comment 6 Ambarish 2017-09-30 05:01:45 UTC
(In reply to Jason Dillaman from comment #5)
> @Ambarish: I am interested in first figuring out if the performance
> regression is in the OSDs or in the client. That's why I was wondering if
> you had tested using RHCS 2.x OSDs and a RHCS 3.x client.

Ah, thanks for clearing that up.

I'll get that ASAP.

Comment 7 Ambarish 2017-10-03 11:29:53 UTC
Hi Jason,

Just to reiterate..

*CEPH 2.4 Testbed* : 7322 kB/sec

*CEPH 3.0 Testbed* : 5708 kB/sec


And these are rand read throughputs on the combinations you asked me to test:

*CEPH 3.0 Cluster/CEPH 2.4 Clients* : 7572 kB/sec

*CEPH 2.4 Cluster/CEPH 3.0 Clients* : 5594 kB/s

It Looks like the perf regression is coming from the OSDs.

Comment 8 Jason Dillaman 2017-10-03 15:39:14 UTC
@Ambarish: am I reading the results incorrectly? To me, it seems like the results show that the client is responsible for the slowdown since a 2.4 cluster w/ 3.0 clients shows the slower speed. 

If that is the case, the only major change from 2.4 to 3.0 was the switch from the original simple messenger layer to the new async messenger layer. Would it be possible for you to run a test where the "ceph.conf" had "ms type = simple" in the "[global]" section?
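
For reference, a minimal sketch of that ceph.conf change on the fio client host (only the relevant lines are shown; the rest of the file is assumed to stay unchanged):

   [global]
   # revert the client messenger to the pre-3.0 simple implementation for this test
   ms type = simple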

Comment 9 Ambarish 2017-10-03 16:11:21 UTC
(In reply to Ambarish from comment #7)
> Hi Jason,
> 
> Just to reiterate..
> 
> *CEPH 2.4 Testbed* : 7322 kB/sec
> 
> *CEPH 3.0 Testbed* : 5708 kB/sec
> 
> 
> And these are rand read throughputs on the combinations you asked me to test:
> 
> *CEPH 3.0 Cluster/CEPH 2.4 Clients* : 7572 kB/sec
> 
> *CEPH 2.4 Cluster/CEPH 3.0 Clients* : 5594 kB/s
> 
> It Looks like the perf regression is coming from the OSDs.

ACCCCKKKKKK!

My apologies for the typo, Jason.

I meant to write clients there.

Sure, I'll have the data and update the BZ.

Comment 11 Jason Dillaman 2017-10-12 14:23:54 UTC
I can repeat this locally on my development machine:

Jewel v10.2.10:
   bw (  KiB/s): min=32864, max=66240, per=99.96%, avg=42622.84, stdev=3557.04, samples=491
   iops        : min= 8216, max=16560, avg=10655.70, stdev=889.26, samples=491

Luminous v12.2.1:
   bw (  KiB/s): min=28312, max=39184, per=99.93%, avg=33852.23, stdev=2500.48, samples=53
   iops        : min= 7078, max= 9796, avg=8463.06, stdev=625.08, samples=53

Comment 12 Jason Dillaman 2017-10-12 16:20:57 UTC
In my environment, the slowdown is caused by "lockdep" being enabled in my vstart'd development cluster. When I disable "lockdep" in my ceph.conf (which is disabled by default), my Luminous results are consistently better than my Jewel results:

Jewel v10.2.10 w/ lockdep disabled:
   bw (  KiB/s): min=35256, max=56120, per=99.99%, avg=44213.17, stdev=3607.10, samples=120
   iops        : min= 8814, max=14030, avg=11053.28, stdev=901.78, samples=120


Luminous v12.2.1 w/ lockdep disabled:
   bw (  KiB/s): min=32424, max=55312, per=99.99%, avg=45659.46, stdev=4059.56, samples=120
   iops        : min= 8106, max=13828, avg=11414.86, stdev=1014.90, samples=120


@Ambarish: can you confirm whether or not lockdep is enabled in your client's ceph.conf?

Comment 13 Federico Lucifredi 2017-10-13 21:07:12 UTC
Ambarish, this is one of the most serious open issues; please confirm it is resolved. We are running out of time to work on bugs.

Comment 14 Ambarish 2017-10-16 17:58:39 UTC
Hi Jason, Federico,

My sincere apologies.

I am currently OOO for Diwali and hence could not reply sooner.

Lockdep is disabled.

I did not manually change/edit anything differently for 3.0. The conf was exactly the same between releases.

I've triggered a run with ceph-12.2.1-15 to check the latest stats. Will share an update.

Comment 15 Jason Dillaman 2017-10-16 18:07:08 UTC
@Ambarish: if you get a chance, if the results are the same on the retest, can you manually add "lockdep = false" under the ceph.conf's "[global]" section? My performance deltas basically mimic your results (percentage-wise) for the various block sizes when I have lockdep enabled and are consistently faster than RHCS 2.x when it's disabled.
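
For reference, a minimal sketch of the suggested retest setting in the client's ceph.conf (only the relevant lines are shown; the remainder of the file is assumed unchanged):

   [global]
   # explicitly turn lockdep off for the librbd/librados client, even though it
   # should already default to disabled
   lockdep = false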

Comment 17 Ambarish 2017-10-17 07:44:40 UTC
(In reply to Jason Dillaman from comment #15)
> @Ambarish: if you get a chance, if the results are the same on the retest,
> can you manually add "lockdep = false" under the ceph.conf's "[global]"
> section? My performance deltas basically mimic your results
> (percentage-wise) for the various block sizes when I have lockdep enabled
> and are consistently faster than RHCS 2.x when it's disabled.

Jason,

I do not see much improvement on my 3.0 testbed with lockdep disabled.

lockdep enabled : 5640 kB/sec
lockdep disabled : 5990 kB/sec

The throughput is still not close to my baseline (roughly 7000 kB/s).

Comment 20 Jason Dillaman 2017-10-18 20:34:24 UTC
Testing on lab hardware shows that the issue appears to be within RHCS 3.0 librados:

RHCS 3.0 librados w/ RHCS 3.0 librbd:

   bw (  KiB/s): min=81136, max=100240, per=100.00%, avg=93541.35, stdev=3007.49, samples=600
   iops        : min=20284, max=25060, avg=23385.31, stdev=751.86, samples=600

RHCS 3.0 librados w/ RHCS 2.3 librbd:

   bw (  KiB/s): min=77632, max=100520, per=100.00%, avg=93293.47, stdev=3271.76, samples=600
   iops        : min=19408, max=25130, avg=23323.35, stdev=817.94, samples=600

RHCS 2.3 librados w/ RHCS 3.0 librbd:

   bw (  KiB/s): min=95528, max=113920, per=100.00%, avg=108487.70, stdev=2943.85, samples=600
   iops        : min=23882, max=28480, avg=27121.89, stdev=735.96, samples=600

Comment 21 Jason Dillaman 2017-10-18 21:10:13 UTC
Disabling messenger logging (via "debug ms = 0") recovers most of the performance delta (now -5%):

RHCS 3.0:

   bw (  KiB/s): min=132216, max=161992, per=100.00%, avg=150387.60, stdev=4656.73, samples=379
   iops        : min=33056, max=40498, avg=37596.87, stdev=1164.15, samples=379

RHCS 2.3:

   bw (  KiB/s): min=148936, max=172664, per=100.00%, avg=158056.64, stdev=5141.18, samples=132
   iops        : min=37234, max=43166, avg=39514.15, stdev=1285.23, samples=132
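
For reference, a minimal sketch of the messenger-logging setting used for this comparison, as it would appear in the client's ceph.conf (the [global] placement is an assumption; only the relevant lines are shown):

   [global]
   # silence messenger debug logging on the client for the 4k random read runs
   debug ms = 0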

Comment 22 Jason Dillaman 2017-10-19 02:47:17 UTC
Tweaking the new RHCS 3.0 backoff logic resulted in another ~8% performance gain:

RHCS 3.0 w/ "debug ms = 0" + backoff tweak:

   bw (  KiB/s): min=158688, max=165488, per=100.00%, avg=162691.57, stdev=1706.77, samples=30
   iops        : min=39672, max=41372, avg=40672.87, stdev=426.71, samples=30

Comment 23 Jason Dillaman 2017-10-20 13:27:14 UTC
Upstream PR: https://github.com/ceph/ceph/pull/18418

Comment 24 Jason Dillaman 2017-10-20 13:34:36 UTC
Upstream PR: https://github.com/ceph/ceph/pull/18427

Comment 28 Ambarish 2017-10-30 11:08:05 UTC
Comparable throughput to the baseline on ceph-12.2.1-24. Verified.

Comment 31 errata-xmlrpc 2017-12-05 23:46:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3387

