Bug 994209 - libgfapi has poor sequential performance at small transfer sizes
Summary: libgfapi has poor sequential performance at small transfer sizes
Keywords:
Status: CLOSED EOL
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: samba
Version: 2.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: rhs-smb@redhat.com
QA Contact: storage-qa-internal@redhat.com
URL:
Whiteboard: perf
Depends On:
Blocks:
 
Reported: 2013-08-06 18:08 UTC by Ben England
Modified: 2015-12-03 17:18 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-12-03 17:18:59 UTC
Embargoed:


Attachments

Description Ben England 2013-08-06 18:08:41 UTC
Description of problem:

Overall libgfapi is performing well, but here is a read case where libgfapi should outperform FUSE yet loses badly.  This also means we are not maximizing the benefit of libgfapi on writes.  That defeats the whole purpose of libgfapi, which is to reduce Gluster filesystem overhead.

Version-Release number of selected component (if applicable):

glusterfs-3.4.0.12
kernel-2.6.32-358

How reproducible:

always

Steps to Reproduce:
1. Download and compile glfs_io_test.c (see the comments at the top) from this URL:

http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/rhs/gfapi/glfs_io_test.c

2. Create a Gluster volume like the one shown below, with a 10-GbE link between client and server.  Replication is not needed for this test.

3. Create a 16-GB file in the volume, then read it with the glfs_io_test program using the parameters shown below; a sketch of the read path the benchmark exercises follows the command.

# GFAPI_HOSTNAME=perf86-ib GFAPI_VOLNAME=nossd GFAPI_FSZ=16384 GFAPI_RECSZ=1 GFAPI_LOAD=seq-rd ./glfs_io_test
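
For context, here is a minimal sketch of that read path (illustration only, not the actual glfs_io_test.c; the host, volume, and file names simply mirror the command line above):

/* seq_read_sketch.c - minimal libgfapi sequential read loop, for illustration.
   Build roughly as:  gcc -o seq_read_sketch seq_read_sketch.c -lgfapi        */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <glusterfs/api/glfs.h>   /* adjust the include path to wherever glfs.h lives */

int main(void)
{
    const char *host = "perf86-ib";   /* GFAPI_HOSTNAME */
    const char *vol  = "nossd";       /* GFAPI_VOLNAME  */
    size_t recsz = 1024;              /* GFAPI_RECSZ=1 (KB) */
    char *buf = malloc(recsz);
    ssize_t n;
    long long total = 0;

    glfs_t *fs = glfs_new(vol);
    glfs_set_volfile_server(fs, "tcp", host, 24007);
    if (glfs_init(fs) != 0) { perror("glfs_init"); return 1; }

    glfs_fd_t *fd = glfs_open(fs, "x.tmp", O_RDONLY);
    if (fd == NULL) { perror("glfs_open"); return 1; }

    /* Each 1-KB glfs_read() should be satisfied from the read-ahead
       translator's buffer already in this process's address space;
       the per-call CPU cost inside libgfapi is what this bug is about. */
    while ((n = glfs_read(fd, buf, recsz, 0)) > 0)
        total += n;

    printf("read %lld bytes\n", total);
    glfs_close(fd);
    glfs_fini(fs);
    free(buf);
    return 0;
}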

Actual results:

GLUSTER: vol=nossd xport=tcp host=perf86-ib port=24007 fuse?No
WORKLOAD: type = seq-rd , file name = x.tmp , file size = 16384 MB, record size = 1 KB
total transfers = 16777216
elapsed time    = 144.59    sec
throughput      = 113.31    MB/sec
IOPS            = 116033.97 (sequential read)

Expected results:

Should be able to run at line speed, because the read-ahead buffer is already in the user process's address space and no context switching is required.

Additional info:

"top" with the "H" option (per-thread display) shows that one glfs_io_test process thread is at 99% utilization, so we have a CPU bottleneck.
The perf top screenshot at this URL shows where the hotspot is in libgfapi:

http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/rhs/gfapi/seq-rd-1kb.jpeg

Here are graphs of libgfapi performance; note that sequential writes also fall short of the desired performance at small transfer sizes.

http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/rhs/gfapi/ssd-tests.ods


This gluster volume profile shows that the read-ahead translator is doing its job and all reads across the wire are at the maximum RPC size of 128 KB.  The file is cached in memory on the brick server.

Interval 6 Stats:
   Block Size:             131072b+
 No. of Reads:                31392
No. of Writes:                    0
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     140.98 us      37.00 us   60440.00 us          31392        READ
 
    Duration: 35 seconds
   Data Read: 4114612224 bytes
Data Written: 0 bytes
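
A quick sanity check on that interval: 31,392 reads × 131,072 bytes = 4,114,612,224 bytes, which matches the Data Read figure exactly, and over the 35-second interval that works out to roughly 112 MB/s, consistent with the throughput the benchmark reports, so the wire traffic really is all 128 KB RPCs.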


Here are the volume parameters:

Volume Name: nossd
Type: Distribute
Volume ID: e8b8997f-d5b6-4c05-ac7b-2283402e0640
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: perf86-ib:/nossd/brick
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.write-behind-window-size: 1048576
performance.read-ahead-page-count: 16
performance.read-ahead: on
performance.stat-prefetch: on
performance.open-behind: off
performance.write-behind: on
performance.io-cache: off
performance.quick-read: on
cluster.eager-lock: on
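
For reference, the non-default options above were applied with the usual volume-set syntax, for example:

# gluster volume set nossd performance.read-ahead-page-count 16
# gluster volume set nossd performance.open-behind off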

Comment 3 Ben England 2013-10-17 13:42:56 UTC
May be affected by the upcoming fix for bz 1009134; needs retest.

Comment 4 Ben England 2013-10-29 16:06:55 UTC
Folks, this bug still exists in RHS 2.1 U2.  I see the same behavior on reads as before with small record sizes.  I think I now know more about what it's doing.  See the files at

http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/rhs/gfapi/rhs21u2/

r.log contains throughput for a single-threaded gfapi sequential read of an 8-GB file as a function of record size.  It hasn't changed since the initial post.

s.log is a system call trace with a 1-KB record size.  You'll note that the following sequence is repeated roughly 128 times between RPCs (writev/readv):

31499 geteuid()                         = 0
31499 getegid()                         = 0
31499 getgroups(200, [0])               = 1

I think this is what happens every time the app reads 1 KB.  Why does it need to poll security info that often?  I could see doing that once per RPC, but once the data is already in the user's address space, the battle is lost!
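
For illustration only, this is the kind of per-RPC (or one-time) caching that would take the syscall triple out of the per-read path; the struct and function names below are hypothetical, not gluster internals:

#include <unistd.h>
#include <sys/types.h>

struct cached_id {
    int   valid;          /* 0 until the first lookup */
    uid_t euid;
    gid_t egid;
    gid_t groups[200];
    int   ngroups;
};

/* Refresh the caller's identity once (or once per RPC) instead of on every
   1-KB read, so the geteuid/getegid/getgroups triple disappears from the
   per-read fast path. */
static void get_caller_id(struct cached_id *id)
{
    if (id->valid)
        return;
    id->euid    = geteuid();
    id->egid    = getegid();
    id->ngroups = getgroups(200, id->groups);
    id->valid   = 1;
}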

In rhs21u2-gfapi-perf-top-rsz1k.jpg you'll see the perf top output.  There is some sort of interaction with Gluster logging; I suspect you need to avoid calling the gluster logging routine (which forces construction of its arguments?) unless the DEBUG log level is established.
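
For illustration, the usual fix is a level check before the message is ever built; this is a generic sketch, not the actual gluster logging macro, and all names here are invented:

#include <stdio.h>

enum log_level { LOG_ERROR, LOG_WARNING, LOG_INFO, LOG_DEBUG };
static enum log_level current_level = LOG_INFO;

/* The format arguments are only evaluated when the level is enabled, so a
   DEBUG message in the read fast path costs one integer compare.
   Usage:  LOG_AT(LOG_DEBUG, "read %zd bytes at offset %lld", n, off);      */
#define LOG_AT(lvl, fmt, ...)                          \
    do {                                               \
        if ((lvl) <= current_level)                    \
            fprintf(stderr, fmt "\n", ##__VA_ARGS__);  \
    } while (0)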

I'm running rpms from build at

baseurl=http://download.lab.bos.redhat.com/nightly/RHSS-2.1u2-20131027.n.0/2.1u2/RHS/x86_64/os

on client:

[root@gprfc093 ~]# rpm -qa | grep glusterfs
glusterfs-3.4.0.35.1u2rhs-1.el6rhs.x86_64
glusterfs-fuse-3.4.0.35.1u2rhs-1.el6rhs.x86_64
glusterfs-api-3.4.0.35.1u2rhs-1.el6rhs.x86_64
glusterfs-libs-3.4.0.35.1u2rhs-1.el6rhs.x86_64

and on server:

[root@gprfc093 ~]# ssh gprfs048 rpm -qa | grep glusterfs
glusterfs-geo-replication-3.4.0.35.1u2rhs-1.el6rhs.x86_64
glusterfs-libs-3.4.0.35.1u2rhs-1.el6rhs.x86_64
glusterfs-fuse-3.4.0.35.1u2rhs-1.el6rhs.x86_64
glusterfs-server-3.4.0.35.1u2rhs-1.el6rhs.x86_64
glusterfs-api-3.4.0.35.1u2rhs-1.el6rhs.x86_64
glusterfs-3.4.0.35.1u2rhs-1.el6rhs.x86_64
glusterfs-rdma-3.4.0.35.1u2rhs-1.el6rhs.x86_64

Since libgfapi's glfs.h is no longer in the devel RPM, I have to pull it from the source RPM to compile my test program, which hasn't changed.
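
For reference, that workaround build looks roughly like this; the header directory name is just a placeholder for wherever glfs.h was copied out of the source RPM:

# gcc -I./gfapi-headers -o glfs_io_test glfs_io_test.c -lgfapi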

Comment 5 Ben England 2014-05-30 11:27:46 UTC
This benchmark is now open-source at https://github.com/bengland2/parallel-libgfapi .

Comment 6 Vivek Agarwal 2015-12-03 17:18:59 UTC
Thank you for submitting this issue for consideration in Red Hat Gluster Storage. The release you requested us to review is now End of Life. Please see https://access.redhat.com/support/policy/updates/rhs/

If you can reproduce this bug against a currently maintained version of Red Hat Gluster Storage, please feel free to file a new report against the current release.

