Red Hat Bugzilla – Bug 994209
libgfapi has poor sequential performance at small transfer sizes
Last modified: 2015-12-03 12:18:59 EST
Description of problem:
Overall libgfapi is performing well, but on reads here is a case where libgfapi should outperform FUSE yet loses badly. This means we are not maximizing the benefit of libgfapi on writes either. This defeats the whole purpose of libgfapi, which is to reduce Gluster filesystem overhead.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Download and compile glfs_io_test.c (see the comments at its top) from this URL:
2. Create a Gluster volume like the one shown below, with a 10-GbE link between client and server. Replication is not needed for this test.
3. Create a 16-GB file in the volume, then read it with the glfs_io_test program using the parameters shown below.
# GFAPI_HOSTNAME=perf86-ib GFAPI_VOLNAME=nossd GFAPI_FSZ=16384 GFAPI_RECSZ=1 GFAPI_LOAD=seq-rd ./glfs_io_test
GLUSTER: vol=nossd xport=tcp host=perf86-ib port=24007 fuse?No
WORKLOAD: type = seq-rd , file name = x.tmp , file size = 16384 MB, record size = 1 KB
total transfers = 16777216
elapsed time = 144.59 sec
throughput = 113.31 MB/sec
IOPS = 116033.97 (sequential read)
should be able to run at line speed because readahead buffer is already in user process's address space, no context switching required.
"top" with the "H" option (per-thread display) shows that one glfs_io_test thread is at 99% utilization, so we have a CPU bottleneck.
The attached screenshot of perf top shows where the hotspot is in libgfapi, at this URL:
Here are graphs of libgfapi performance; note that sequential writes also fall short of the desired performance at small transfer sizes.
This gluster volume profile shows that the readahead translator is doing its job: all reads across the wire are at the maximum RPC size of 128 KB. The file is cached in memory on the brick server.
Interval 6 Stats:
Block Size: 131072b+
No. of Reads: 31392
No. of Writes: 0
%-latency Avg-latency Min-Latency Max-Latency No. of calls Fop
--------- ----------- ----------- ----------- ------------ ----
100.00 140.98 us 37.00 us 60440.00 us 31392 READ
Duration: 35 seconds
Data Read: 4114612224 bytes
Data Written: 0 bytes
Here are the volume parameters:
Volume Name: nossd
Volume ID: e8b8997f-d5b6-4c05-ac7b-2283402e0640
Number of Bricks: 1
This may be affected by the upcoming fix for bz 1009134; needs retest.
Folks, this bug still exists in RHS 2.1 U2. I see the same behavior on reads as before with small record sizes. I think I know something more about what it's doing. See files at
r.log contains throughput for a single-threaded gfapi sequential read of an 8-GB file as a function of record size. It hasn't changed since initial post.
s.log is a system call trace with a 1-KB record size. You'll note that the following sequence is repeated roughly 128 times between RPCs (writev/readv):
31499 geteuid() = 0
31499 getegid() = 0
31499 getgroups(200, ) = 1
I think this is what happens every time the app reads 1 KB. Why does it need to poll security info this often? I could see doing that once per RPC, but once the data is already in the user's address space, the battle is lost!
In rhs21u2-gfapi-perf-top-rsz1k.jpg you'll see perf top output. There is some sort of interaction with Gluster logging; I suspect you need to avoid calling the Gluster logging routine (which forces construction of its arguments?) unless the DEBUG log level is established.
I'm running RPMs from the build at
[root@gprfc093 ~]# rpm -qa | grep glusterfs
and on server:
[root@gprfc093 ~]# ssh gprfs048 rpm -qa | grep glusterfs
Since libgfapi's glfs.h is no longer in the -devel RPM, I have to pull it from the source RPM to compile my test program, which hasn't changed.
This benchmark is now open-source at https://github.com/bengland2/parallel-libgfapi .
Thank you for submitting this issue for consideration in Red Hat Gluster Storage. The release you requested us to review is now End of Life. Please see https://access.redhat.com/support/policy/updates/rhs/
If you can reproduce this bug against a currently maintained version of Red Hat Gluster Storage, please feel free to file a new report against the current release.