Bug 1294662

Summary: Performance of a KRBD device from an SSD pool is not optimal compared to raw SSD disk performance
Product: Red Hat Ceph Storage
Component: RBD
Version: 1.3.0
Hardware: x86_64
OS: Linux
Status: CLOSED NOTABUG
Severity: high
Priority: high
Reporter: Vikhyat Umrao <vumrao>
Assignee: Josh Durgin <jdurgin>
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
CC: ceph-eng-bugs, flucifre, jdillama, kchai, rvijayan, sputhenp
Target Milestone: rc
Target Release: 1.3.3
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-01-27 05:43:51 UTC

Description Vikhyat Umrao 2015-12-29 14:11:16 UTC
Description of problem:

Performance of a KRBD device from an SSD pool is not optimal compared to raw SSD disk performance.

- Test environment:
 - 3 physical nodes with 2 SSDs each
 - two 10Gb networks
 - RHCS 1.3.1
 - The SSD pool has 3 OSDs: on each node, one SSD serves as the OSD data disk and the other as the journal. The pool is configured with 2 replicas.
- Test Tool: fio
- Methods and Result:
   1) test the raw SSD performance with fio, 4k block size, libaio ioengine, random write, which shows 1 SSD's IOPS is about 50000
   2) test the krbd's performance, which shows IOPS is about 3000


Version-Release number of selected component (if applicable):
Red Hat Ceph Storage 1.3.1 

How reproducible:
Reproducible in my testbed.

Comment 1 Vikhyat Umrao 2015-12-29 14:13:40 UTC
In my testbed:

##################################################
fio --filename=/dev/sda3 --direct=1 --ioengine=libaio --iodepth=16 --rw=randwrite --bs=4k --size=10G --numjobs=16 --runtime=300 --group_reporting --name=randw-4k
randw-4k: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=16
...
fio-2.2.8
Starting 16 processes
Jobs: 16 (f=16): [w(16)] [100.0% done] [0KB/142.7MB/0KB /s] [0/36.6K/0 iops] [eta 00m:00s]
randw-4k: (groupid=0, jobs=16): err= 0: pid=3403826: Tue Dec 29 15:34:38 2015
  write: io=39949MB, bw=136357KB/s, iops=34089, runt=300005msec
    slat (usec): min=2, max=525820, avg=222.07, stdev=2173.03
    clat (usec): min=100, max=690557, avg=7286.60, stdev=12666.64
     lat (usec): min=107, max=690888, avg=7508.79, stdev=12871.43
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    5], 50.00th=[    5], 60.00th=[    8],
     | 70.00th=[   11], 80.00th=[   12], 90.00th=[   13], 95.00th=[   13],
     | 99.00th=[   14], 99.50th=[   15], 99.90th=[  249], 99.95th=[  281],
     | 99.99th=[  578]
    bw (KB  /s): min=   40, max=10347, per=6.32%, avg=8615.83, stdev=1526.34
    lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=31.73%, 10=35.62%, 20=32.54%, 50=0.01%
    lat (msec) : 250=0.02%, 500=0.08%, 750=0.02%
  cpu          : usr=0.37%, sys=1.43%, ctx=5765332, majf=0, minf=528
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=10226941/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: io=39949MB, aggrb=136356KB/s, minb=136356KB/s, maxb=136356KB/s, mint=300005msec, maxt=300005msec

Disk stats (read/write):
  sda: ios=0/10221411, merge=0/0, ticks=0/42782523, in_queue=42794280, util=100.00%

==========================================================================================

# rbd create ssd/data-disk1 -s 204800 --image-format 2
# rbd -p ssd ls -l
NAME                                                                                       SIZE PARENT FMT PROT LOCK 
data-disk1                                                                                 200G          2    


fio --filename=/dev/rbd0 --direct=1 --ioengine=libaio --iodepth=16 --rw=randwrite --bs=4k --size=10G --numjobs=16 --runtime=300 --group_reporting --name=randw-4k
randw-4k: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=16
...
fio-2.2.8
Starting 16 processes
Jobs: 16 (f=16): [w(16)] [100.0% done] [0KB/17136KB/0KB /s] [0/4284/0 iops] [eta 00m:00s]
randw-4k: (groupid=0, jobs=16): err= 0: pid=7010: Tue Dec 29 15:44:02 2015
  write: io=4631.9MB, bw=15808KB/s, iops=3952, runt=300032msec
    slat (usec): min=1, max=819598, avg=2625.54, stdev=13087.39
    clat (msec): min=1, max=900, avg=62.14, stdev=41.26
     lat (msec): min=1, max=910, avg=64.77, stdev=42.99
    clat percentiles (msec):
     |  1.00th=[   10],  5.00th=[   15], 10.00th=[   18], 20.00th=[   31],
     | 30.00th=[   51], 40.00th=[   57], 50.00th=[   61], 60.00th=[   66],
     | 70.00th=[   71], 80.00th=[   80], 90.00th=[   98], 95.00th=[  114],
     | 99.00th=[  208], 99.50th=[  251], 99.90th=[  486], 99.95th=[  635],
     | 99.99th=[  824]
    bw (KB  /s): min=    7, max= 1564, per=6.30%, avg=995.24, stdev=179.54
    lat (msec) : 2=0.01%, 4=0.03%, 10=1.23%, 20=14.31%, 50=13.97%
    lat (msec) : 100=61.17%, 250=8.77%, 500=0.42%, 750=0.06%, 1000=0.03%
  cpu          : usr=0.09%, sys=0.19%, ctx=324229, majf=0, minf=528
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=1185755/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: io=4631.9MB, aggrb=15808KB/s, minb=15808KB/s, maxb=15808KB/s, mint=300032msec, maxt=300032msec

Disk stats (read/write):
  rbd0: ios=91/1184944, merge=0/0, ticks=39/42422287, in_queue=42426816, util=100.00%
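The krbd result above looks latency-bound rather than device-bound: with 16 jobs at iodepth 16 there are 256 IOs in flight, and with the ~64.8 ms average latency fio reports, Little's law predicts almost exactly the observed IOPS. A quick sanity check, using only numbers taken from the fio output above:

```python
# Little's law sanity check: IOPS ~= outstanding IOs / average latency.
# All inputs are taken from the krbd fio run above.
jobs = 16
iodepth = 16
avg_lat_s = 0.06477           # 64.77 ms average completion latency

outstanding = jobs * iodepth  # 256 IOs in flight across all jobs
predicted_iops = outstanding / avg_lat_s

print(round(predicted_iops))  # prints 3952, matching fio's iops=3952
```

This is why the raw SSD (avg latency ~7.5 ms at the same queue depth) sustains ~34k IOPS while the same device behind krbd does not: each extra layer adds latency, and at a fixed queue depth IOPS falls proportionally.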

Comment 2 Vikhyat Umrao 2015-12-29 14:15:16 UTC
After decreasing the pool's replicated size to 2, performance increased; the customer is also running with a replicated size of 2.


# fio --filename=/dev/rbd0 --direct=1 --ioengine=libaio --iodepth=16 --rw=randwrite --bs=4k --size=10G --numjobs=16 --runtime=300 --group_reporting --name=randw-4k
randw-4k: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=16
...
fio-2.2.8
Starting 16 processes
Jobs: 9 (f=0): [w(3),_(1),E(2),w(3),E(1),w(1),_(1),E(2),w(2)] [33.4% done] [0KB/23448KB/0KB /s] [0/5862/0 iops] [eta 09m:59s]
randw-4k: (groupid=0, jobs=16): err= 0: pid=61612: Tue Dec 29 18:41:43 2015
  write: io=6355.4MB, bw=21687KB/s, iops=5421, runt=300078msec
    slat (usec): min=1, max=741815, avg=1961.81, stdev=11582.04
    clat (msec): min=1, max=796, avg=45.25, stdev=41.45
     lat (msec): min=1, max=802, avg=47.21, stdev=42.83
    clat percentiles (usec):
     |  1.00th=[ 1736],  5.00th=[ 2992], 10.00th=[ 4320], 20.00th=[ 9152],
     | 30.00th=[15808], 40.00th=[33024], 50.00th=[47872], 60.00th=[54528],
     | 70.00th=[60160], 80.00th=[68096], 90.00th=[86528], 95.00th=[102912],
     | 99.00th=[152576], 99.50th=[218112], 99.90th=[518144], 99.95th=[577536],
     | 99.99th=[716800]
    bw (KB  /s): min=    6, max= 2064, per=6.31%, avg=1368.08, stdev=254.61
    lat (msec) : 2=1.63%, 4=7.19%, 10=12.40%, 20=15.59%, 50=16.26%
    lat (msec) : 100=41.23%, 250=5.34%, 500=0.23%, 750=0.13%, 1000=0.01%
  cpu          : usr=0.13%, sys=0.30%, ctx=689854, majf=0, minf=545
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=1626960/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: io=6355.4MB, aggrb=21687KB/s, minb=21687KB/s, maxb=21687KB/s, mint=300078msec, maxt=300078msec

Disk stats (read/write):
  rbd0: ios=91/1625836, merge=0/0, ticks=38/42239755, in_queue=42243677, util=100.00%
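The gain is roughly in line with what a replica-count change predicts. Assuming the size was reduced from 3 to 2 (the environment description and this comment are not fully consistent on the starting value), write amplification drops by a factor of 1.5; the observed gain is a bit below that, which is plausible since replica writes are pipelined rather than strictly serial. A rough check:

```python
# Rough replica-scaling check. Assumption: pool size went from 3 to 2;
# the starting size is not stated unambiguously in the bug.
iops_size3 = 3952   # first krbd fio run (Comment 1)
iops_size2 = 5421   # run after reducing the replicated size (Comment 2)

observed_gain = iops_size2 / iops_size3
ideal_gain = 3 / 2  # if write cost scaled linearly with replica count

print(f"observed {observed_gain:.2f}x vs ideal {ideal_gain:.2f}x")
# prints: observed 1.37x vs ideal 1.50x
```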

Comment 3 Vikhyat Umrao 2015-12-29 14:25:01 UTC
Kefu and I discussed this issue; we will focus on the output of the commands below, from the customer environment as well as from our test environment.

1. rados bench seq and rand
2. rbd bench-write seq and rand 
3. rbd.fio
[global]
ioengine=rbd
clientname=admin
pool=ssd
rbdname=data-disk1
rw=randwrite
bs=4k
[rbd_iodepth32]
iodepth=32

4. iostat -x
5. ceph daemon osd.x perf dump

Comment 4 Vikhyat Umrao 2015-12-29 14:55:49 UTC
Performance Data from our test environment 
#########################################

# rados bench -p ssd 10 write --no-cleanup
 Maintaining 16 concurrent writes of 4194304 bytes for up to 10 seconds or 0 objects
 Object prefix: benchmark_data_dell-per630-13.gsslab.pnq2.re_3423984
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1      16        28        12   47.9883        48  0.919265  0.609388
     2      16        48        32   63.9888        80  0.437292  0.768286
     3      16        64        48   63.9901        64   1.41216  0.784257
     4      16        76        60   59.9916        48   1.40439  0.872014
     5      16        85        69   55.1925        36   1.78703  0.988722
     6      16       103        87   57.9926        72  0.861156   1.03164
     7      16       116       100   57.1356        52   1.03079   1.01897
     8      16       133       117   58.4927        68  0.981331   1.01232
     9      16       148       132   58.6594        60   1.13716   1.02272
    10      16       157       141    56.393        36   1.00005   1.03103
    11      16       158       142     51.63         4   2.05458   1.03824
    12      16       158       142   47.3275         0         -   1.03824
    13      16       158       142   43.6869         0         -   1.03824
 Total time run:         13.707367
Total writes made:      158
Write size:             4194304
Bandwidth (MB/sec):     46.107 

Stddev Bandwidth:       28.7956
Max bandwidth (MB/sec): 80
Min bandwidth (MB/sec): 0
Average Latency:        1.37976
Stddev Latency:         1.09322
Max latency:            5.02589
Min latency:            0.251668


# rados bench -p ssd 10 seq
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1      16        72        56   223.954       224  0.496107  0.201843
     2      16       109        93    185.97       148 0.00683427  0.285011
     3      16       150       134    178.64       164  0.768358  0.318389
 Total time run:        3.715525
Total reads made:     158
Read size:            4194304
Bandwidth (MB/sec):    170.097 

Average Latency:       0.366777
Max latency:           1.12778
Min latency:           0.00555107

# rados bench -p ssd 10 rand
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1      16        62        46   183.962       184   0.81561  0.216174
     2      16       104        88   175.971       168 0.00618282  0.280205
     3      16       147       131   174.641       172 0.00602811  0.319645
     4      16       195       179   178.975       192  0.647227  0.327878
     5      16       228       212   169.577       132  0.742245   0.35018
     6      16       269       253   168.643       164  0.776957  0.357648
     7      16       311       295   168.549       168  0.998243   0.35966
     8      16       359       343   171.477       192 0.0793933  0.355814
     9      16       399       383   170.199       160  0.699738   0.36026
    10      16       447       431   172.377       192  0.107735  0.353402
 Total time run:        10.561860
Total reads made:     448
Read size:            4194304
Bandwidth (MB/sec):    169.667 

Average Latency:       0.373629
Max latency:           1.4776
Min latency:           0.00578113

============================

$ ceph osd map ssd data-disk1
osdmap e523 pool 'ssd' (1) object 'data-disk1' -> pg 1.214a5c88 (1.8) -> up ([6,8], p6) acting ([6,8], p6)

$ rbd bench-write ssd/data-disk1 --io-size 4096 --io-threads 16 --io-total 10000000000 --io-pattern rand
bench-write  io_size 4096 io_threads 16 bytes 10000000000 pattern rand
  SEC       OPS   OPS/SEC   BYTES/SEC
    1    102517  102536.53  419989643.36
    2    204584  102301.52  419027012.28
    3    304691  101570.03  416030822.44
    4    405830  101462.33  415589714.74
    5    505676  101139.12  414265825.86
    6    606379  100614.88  412118536.05
    7    706234  100330.01  410951737.70
    8    807049  100471.68  411532010.64
    9    907766  100387.08  411185486.74
   10   1008782  100621.17  412144312.22
   11   1110309  100943.96  413466459.98
   12   1210524  100858.06  413114602.24
   13   1311350  100860.07  413122844.29
   14   1410969  100640.75  412224516.18
   15   1511387  100521.00  411734030.08
   16   1610369  100011.94  409648916.21
   17   1709624  99819.93  408862428.89
   18   1808832  99496.44  407537418.39
   19   1908772  99560.45  407799617.81
   20   2008219  99366.34  407004544.36
   21   2108790  99684.20  408306477.40
   22   2208608  99796.87  408767966.43
   23   2308675  99968.57  409471279.55
   24   2409279  100101.43  410015465.69
elapsed:    24  ops:  2441407  ops/sec: 100239.90  bytes/sec: 410582617.04

$ ceph daemon osd.6 perf dump > osd.6_dump_perf_rbd_bench-write_rand.txt 2>&1

^^ captured while the rand benchmark was still running.

=> I will attach "osd.6_dump_perf_rbd_bench-write_rand.txt" of primary osd.6.
 

$ rbd bench-write ssd/data-disk1 --io-size 4096 --io-threads 16 --io-total 10000000000 --io-pattern seq
bench-write  io_size 4096 io_threads 16 bytes 10000000000 pattern seq
  SEC       OPS   OPS/SEC   BYTES/SEC
    1    101507  101529.63  415865382.31
    2    203166  101594.48  416130991.82
    3    304550  101524.29  415843482.76
    4    405401  101355.74  415153124.04
    5    506183  101236.24  414663652.50
    6    605318  100762.18  412721884.20
    7    704868  100340.31  410993892.39
    8    802683  99626.63  408070694.71
    9    902760  99471.96  407437168.09
   10   1002389  99246.04  406511774.36
   11   1102805  99497.37  407541231.07
   12   1202628  99552.13  407765506.76
   13   1303217  100106.69  410037000.94
   14   1402660  99979.94  409517844.53
   15   1502070  99936.17  409338571.47
   16   1601864  99811.89  408829507.21
   17   1702195  99913.32  409244964.32
   18   1801235  99603.68  407976684.55
   19   1902273  99922.70  409283372.99
   20   2002655  100117.07  410079505.52
   21   2102955  100218.15  410493535.71
   22   2202947  100150.44  410216210.69
   23   2302726  100298.11  410821053.60
   24   2402406  100026.53  409708668.77
elapsed:    24  ops:  2441407  ops/sec: 99961.54  bytes/sec: 409442464.86

$ ceph daemon osd.6 perf dump > osd.6_dump_perf_rbd_bench-write_seq.txt 2>&1

=> I will attach "osd.6_dump_perf_rbd_bench-write_seq.txt" of primary osd.6.

==============================

Comment 7 Vikhyat Umrao 2015-12-29 15:08:00 UTC
$ iostat -x
Linux 3.10.0-229.14.1.el7.x86_64 (dell-per630-11.gsslab.pnq2.redhat.com) 	12/29/2015 	_x86_64_	(6 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.56    0.00    0.26    0.01    0.00   98.17

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.05    0.02    1.22     0.16    43.07    69.54     0.01    5.06    0.45    5.14   0.14   0.02
sdc               0.00     0.01    0.06    0.06    12.72     8.23   340.45     0.00   38.61   22.95   52.63   3.40   0.04
sdb               0.00     0.08    0.04    0.71     7.23    46.39   143.45     0.01   12.48   23.84   11.92   1.15   0.09
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     7.98     0.00  130.50  152.80   23.47   7.19   0.00
dm-1              0.00     0.00    0.00    0.02     0.07     0.81    80.79     0.00   85.88   55.04   91.33   2.51   0.01
dm-4              0.00     0.00    0.00    0.68     0.02    38.91   113.98     0.00    4.71   87.14    4.57   0.78   0.05

# cat /proc/scsi/scsi | grep -i ssd
  Vendor: ATA      Model: INTEL SSDSC2BB12 Rev: DL13

Here "sda" is the SSD device.

Comment 8 Vikhyat Umrao 2015-12-29 15:28:38 UTC
$ cat rbd.fio 
[global]
ioengine=rbd
clientname=admin
pool=ssd
rbdname=data-disk1
rw=randwrite
bs=4k
[rbd_iodepth32]
iodepth=32


$ fio rbd.fio 
rbd_iodepth32: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
fio-2.2.8
Starting 1 process
rbd engine: RBD version: 0.1.9
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/16291KB/0KB /s] [0/4072/0 iops] [eta 00m:00s]
rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=96955: Tue Dec 29 20:46:58 2015
  write: io=10240MB, bw=16727KB/s, iops=4181, runt=626866msec
    slat (usec): min=0, max=525, avg= 1.06, stdev= 1.74
    clat (msec): min=1, max=824, avg= 7.62, stdev=11.12
     lat (msec): min=1, max=824, avg= 7.62, stdev=11.12
    clat percentiles (msec):
     |  1.00th=[    4],  5.00th=[    4], 10.00th=[    4], 20.00th=[    5],
     | 30.00th=[    5], 40.00th=[    5], 50.00th=[    6], 60.00th=[    6],
     | 70.00th=[    8], 80.00th=[    9], 90.00th=[   12], 95.00th=[   16],
     | 99.00th=[   55], 99.50th=[   60], 99.90th=[  114], 99.95th=[  167],
     | 99.99th=[  392]
    bw (KB  /s): min= 1233, max=23864, per=100.00%, avg=16827.12, stdev=3030.88
    lat (msec) : 2=0.02%, 4=11.60%, 10=75.13%, 20=10.46%, 50=1.46%
    lat (msec) : 100=1.24%, 250=0.08%, 500=0.02%, 750=0.01%, 1000=0.01%
  cpu          : usr=1.05%, sys=0.14%, ctx=185896, majf=0, minf=10
  IO depths    : 1=0.7%, 2=2.4%, 4=7.6%, 8=23.4%, 16=61.0%, 32=4.9%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=96.2%, 8=0.1%, 16=0.4%, 32=3.3%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=2621440/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: io=10240MB, aggrb=16727KB/s, minb=16727KB/s, maxb=16727KB/s, mint=626866msec, maxt=626866msec

Disk stats (read/write):
    dm-0: ios=0/12, merge=0/0, ticks=0/136, in_queue=136, util=0.02%, aggrios=0/5337, aggrmerge=0/2333, aggrticks=0/21884, aggrin_queue=21882, aggrutil=2.05%
  sdc: ios=0/5337, merge=0/2333, ticks=0/21884, in_queue=21882, util=2.05%

Comment 17 Josh Durgin 2016-01-05 01:44:15 UTC
A few comments on the setup first:

1. Running an OSD and a kernel client (like krbd) on the same host has the potential for deadlock in low-memory conditions, just like loopback NFS. A separate machine that does not host OSDs should be used as the client.

2. For SSD setups, it does not help to put the journal on a separate device, since the speed of the data disk is not the bottleneck. The journal should be on the same device as the data; for simplicity, it can just be a file in the OSD data partition.

3. To get more parallelism and reduce locking overhead, multiple OSDs can be run on one high-speed SSD. In this case, it looks like the SSDs could handle at least 2 OSDs per SSD. To make sure the 2 replicas end up on different devices, each SSD should be added to the CRUSH hierarchy as a container between host and OSD.

When optimizing for IOPS, the limiting factor is generally latency. Each layer above the raw device (OSD daemon, network, krbd client) adds latency and reduces the potential IOPS. For high-speed devices, the overhead of debug logging and authentication on the OSDs becomes significant. This can be turned off via ceph.conf options, as shown on slide 19 of http://www.slideshare.net/Inktank_Ceph/accelerating-cassandra-workloads-on-ceph-with-allflash-pcie-ssds

When using more than one RBD image with krbd, passing the '-o noshare' option to 'rbd map' is important so that each device gets its own Ceph client instance, increasing parallelism.

I'd suggest trying these optimizations, and also keeping an eye on CPU usage on the OSD hosts when re-running benchmarks. For rbd bench-write runs, you should use the --no-rbd-cache option to be more comparable to krbd.
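A minimal sketch of the kind of ceph.conf overrides referred to above, disabling debug logging and cephx authentication for benchmarking. The exact option set is an illustration, not taken from this bug; verify the names against your Ceph release, and disable cephx only on a trusted test network:

```ini
# Illustrative ceph.conf tuning for SSD/IOPS benchmarking -- not from this
# bug report; check option names against your Ceph version before using.
[global]
debug_ms = 0
debug_osd = 0
debug_filestore = 0
debug_journal = 0
debug_auth = 0
auth_cluster_required = none   ; only on a trusted test network
auth_service_required = none
auth_client_required = none
```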

Comment 19 Vikhyat Umrao 2016-01-27 05:43:51 UTC
######### Update I suggested to the customer #######

- include Josh's suggestions from Comment#17

- include Neil's suggestion from the case: "Upgrade this setup to RHEL 7.2, which has gperftools-2.4 (tcmalloc) with a default thread-cache size of 128 MB, as compared to gperftools-2.1 in RHEL 7.1"
   
- I also suggested removing the RAID layer from in front of the SSDs

After all of these suggestions, the customer's setup was:

**RHEL 7.2 running Red Hat Ceph Storage 1.3.1 (0.94.3).**

Customer reply :

"After removing the RAID card from the machine and updated OS to RHEL7.2, the performance is really improved, looks like the same level as in your lab testing system. The IOPS is 7347 in one test. Attached is the test data."

############# fio Results for random write ###########

[root@node1 fio]# ./fio ../rbd.fio 
rbd_iodepth16: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=16
...
fio-2.2.10
Starting 16 processes
rbd engine: RBD version: 0.1.9
rbd engine: RBD version: 0.1.9
rbd engine: RBD version: 0.1.9
rbd engine: RBD version: 0.1.9
rbd engine: RBD version: 0.1.9
rbd engine: RBD version: 0.1.9
rbd engine: RBD version: 0.1.9
rbd engine: RBD version: 0.1.9
rbd engine: RBD version: 0.1.9
rbd engine: RBD version: 0.1.9
rbd engine: RBD version: 0.1.9
rbd engine: RBD version: 0.1.9
rbd engine: RBD version: 0.1.9
rbd engine: RBD version: 0.1.9
rbd engine: RBD version: 0.1.9
rbd engine: RBD version: 0.1.9
Jobs: 16 (f=16): [w(16)] [100.0% done] [0KB/29440KB/0KB /s] [0/7360/0 iops] [eta 00m:00s]
rbd_iodepth16: (groupid=0, jobs=16): err= 0: pid=12516: Fri Jan 15 19:33:00 2016
  write: io=8612.4MB, bw=29389KB/s, iops=7347, runt=300080msec
    slat (usec): min=0, max=11866, avg= 4.17, stdev=58.09
    clat (usec): min=972, max=393859, avg=34793.12, stdev=23156.15
     lat (usec): min=976, max=393860, avg=34797.28, stdev=23156.31
    clat percentiles (msec):
     |  1.00th=[    4],  5.00th=[    8], 10.00th=[   12], 20.00th=[   19],
     | 30.00th=[   23], 40.00th=[   26], 50.00th=[   31], 60.00th=[   35],
     | 70.00th=[   41], 80.00th=[   49], 90.00th=[   62], 95.00th=[   77],
     | 99.00th=[  119], 99.50th=[  141], 99.90th=[  194], 99.95th=[  219],
     | 99.99th=[  273]
    bw (KB  /s): min=  869, max= 3033, per=6.26%, avg=1840.04, stdev=231.73
    lat (usec) : 1000=0.01%
    lat (msec) : 2=0.25%, 4=1.59%, 10=6.06%, 20=16.61%, 50=57.22%
    lat (msec) : 100=16.38%, 250=1.87%, 500=0.02%
  cpu          : usr=0.17%, sys=0.04%, ctx=257824, majf=0, minf=18
  IO depths    : 1=1.5%, 2=5.5%, 4=19.5%, 8=63.7%, 16=9.8%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=92.5%, 8=1.7%, 16=5.8%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=2204768/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: io=8612.4MB, aggrb=29389KB/s, minb=29389KB/s, maxb=29389KB/s, mint=300080msec, maxt=300080msec

Disk stats (read/write):
    dm-0: ios=0/2356, merge=0/0, ticks=0/19608, in_queue=19608, util=2.47%, aggrios=0/1355, aggrmerge=0/1001, aggrticks=0/12121, aggrin_queue=12120, aggrutil=2.47%
  sde: ios=0/1355, merge=0/1001, ticks=0/12121, in_queue=12120, util=2.47%

##############

Closing for now with NOTABUG.