Bug 994209 - libgfapi has poor sequential performance at small transfer sizes
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: samba
Priority: medium  Severity: medium
Assigned To: rhs-smb@redhat.com
Reported: 2013-08-06 14:08 EDT by Ben England
Modified: 2015-12-03 12:18 EST

Doc Type: Bug Fix
Last Closed: 2015-12-03 12:18:59 EST
Type: Bug

Attachments: None
Description Ben England 2013-08-06 14:08:41 EDT
Description of problem:

Overall libgfapi is performing well, but on reads here is a case where libgfapi should outperform FUSE yet loses badly.  This means we are not maximizing the benefit of libgfapi on writes either.  This defeats the whole purpose of libgfapi, which is to reduce Gluster filesystem overhead.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Download and compile glfs_io_test.c (see comments at the top), available at this URL:


2. Create a Gluster volume like the one shown below, with a 10-GbE link between client and server.  Replication is not needed for this test.

3. Create a 16-GB file in the volume, then read it with the glfs_io_test program using the parameters shown below.

# GFAPI_HOSTNAME=perf86-ib GFAPI_VOLNAME=nossd GFAPI_FSZ=16384 GFAPI_RECSZ=1 GFAPI_LOAD=seq-rd ./glfs_io_test

Actual results:

GLUSTER: vol=nossd xport=tcp host=perf86-ib port=24007 fuse?No
WORKLOAD: type = seq-rd , file name = x.tmp , file size = 16384 MB, record size = 1 KB
total transfers = 16777216
elapsed time    = 144.59    sec
throughput      = 113.31    MB/sec
IOPS            = 116033.97 (sequential read)

Expected results:

It should be able to run at line speed, because the readahead buffer is already in the user process's address space and no context switching is required.

Additional info:

"top" with the "H" option (per-thread display) shows that one glfs_io_test process thread is at 99% utilization, so we have a CPU bottleneck.
The attached screenshot of perf top shows where the hotspot is in libgfapi, at this URL:


Here are graphs of libgfapi performance; note that sequential writes do not achieve the desired performance at small transfer sizes either.


This gluster volume profile shows that the readahead translator is doing its job and all reads across the wire are the max RPC size of 128 KB.  The file is cached in memory on the brick server.

Interval 6 Stats:
   Block Size:             131072b+
 No. of Reads:                31392
No. of Writes:                    0
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     140.98 us      37.00 us   60440.00 us          31392        READ
    Duration: 35 seconds
   Data Read: 4114612224 bytes
Data Written: 0 bytes

Here are the volume parameters:

Volume Name: nossd
Type: Distribute
Volume ID: e8b8997f-d5b6-4c05-ac7b-2283402e0640
Status: Started
Number of Bricks: 1
Transport-type: tcp
Brick1: perf86-ib:/nossd/brick
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.write-behind-window-size: 1048576
performance.read-ahead-page-count: 16
performance.read-ahead: on
performance.stat-prefetch: on
performance.open-behind: off
performance.write-behind: on
performance.io-cache: off
performance.quick-read: on
cluster.eager-lock: on
Comment 3 Ben England 2013-10-17 09:42:56 EDT
This may be affected by the upcoming fix to bz 1009134; needs retest.
Comment 4 Ben England 2013-10-29 12:06:55 EDT
Folks, this bug still exists in RHS 2.1 U2.  I see the same behavior on reads as before with small record sizes.  I think I know something more about what it is doing.  See the files at


r.log contains throughput for a single-threaded gfapi sequential read of an 8-GB file as a function of record size.  It hasn't changed since the initial post.

s.log is a system call trace with a 1-KB record size.  You'll note that this sequence is repeated roughly 128 times between RPCs (writev/readv):

31499 geteuid()                         = 0
31499 getegid()                         = 0
31499 getgroups(200, [0])               = 1

I think this is what happens every time the app reads 1 KB.  Why does it need to poll security info this often?  I can see doing that once per RPC, but once the data is already in the user's address space, the battle is lost!

In rhs21u2-gfapi-perf-top-rsz1k.jpg you'll see perf top output.  There is some sort of interaction with Gluster logging; I suspect you need to avoid calling the Gluster logging routine (which forces construction of its arguments?) unless the DEBUG log level is established.

I'm running rpms from build at


on client:

[root@gprfc093 ~]# rpm -qa | grep glusterfs

and on server:

[root@gprfc093 ~]# ssh gprfs048 rpm -qa | grep glusterfs

Since glfs.h is no longer in the libgfapi -devel RPM, I have to pull it from the source RPM to compile my test program, which hasn't changed.
Comment 5 Ben England 2014-05-30 07:27:46 EDT
This benchmark is now open-source at https://github.com/bengland2/parallel-libgfapi .
Comment 6 Vivek Agarwal 2015-12-03 12:18:59 EST
Thank you for submitting this issue for consideration in Red Hat Gluster Storage. The release for which you requested review is now End of Life. Please see https://access.redhat.com/support/policy/updates/rhs/

If you can reproduce this bug against a currently maintained version of Red Hat Gluster Storage, please feel free to file a new report against the current release.
