Description of problem:

In an OCS environment, when a MongoDB workload was run on a gluster file volume and a gluster block volume respectively, the gluster block volume outperformed the gluster file volume by 35%. Both the gluster block volume and the gluster file volume were backed by a replica 3 volume carved out of an NVMe drive.

Commands used:

oc exec mongodb-1-8jpzk -- scl enable rh-mongodb32 -- mongo -u redhat -p redhat 10.131.0.29:27017/testdb --eval "db.usertable.remove({})"
oc exec ycsb-1-wtrgx -- ./bin/ycsb load mongodb -s -threads 20 -P workloads/workloadf -p mongodb.url=mongodb://redhat:redhat@10.131.0.29:27017/testdb -p recordcount=1000000 -p operationcount=3000000
oc exec ycsb-1-wtrgx -- ./bin/ycsb run mongodb -s -threads 20 -P workloads/workloadf -p mongodb.url=mongodb://redhat:redhat@10.131.0.29:27017/testdb -p recordcount=1000000 -p operationcount=3000000

The first command deletes previously created records, the second loads the records, and the third runs operations against them. The YCSB tool was used to generate workloadf. (Note: YCSB supports various workloads -- workloada, workloadb, etc.; details here: https://github.com/brianfrankcooper/YCSB/wiki/Core-Workloads)

The OPS achieved were as follows:
-----------------------------------
Gluster Block Volume: 18006 OPS (average over 10 runs)
Gluster File Volume : 11732 OPS (average over 10 runs)

Version-Release number of selected component (if applicable):

Server-side gluster version
---------------------------
glusterfs-libs-3.12.2-24.el7rhgs.x86_64
glusterfs-cli-3.12.2-24.el7rhgs.x86_64
glusterfs-fuse-3.12.2-24.el7rhgs.x86_64
glusterfs-server-3.12.2-24.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-24.el7rhgs.x86_64
gluster-block-0.2.1-26.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-24.el7rhgs.x86_64
glusterfs-3.12.2-24.el7rhgs.x86_64
glusterfs-api-3.12.2-24.el7rhgs.x86_64
python2-gluster-3.12.2-24.el7rhgs.x86_64

Client-side gluster version
---------------------------
glusterfs-3.12.2-24.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-24.el7rhgs.x86_64
glusterfs-fuse-3.12.2-24.el7rhgs.x86_64
glusterfs-libs-3.12.2-24.el7rhgs.x86_64

oc version
----------
oc v3.10.45
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://gprfs018.sbu.lab.eng.bos.redhat.com:8443
openshift v3.10.45
kubernetes v1.10.0+b81c8f8

MongoDB version
---------------
MongoDB 3.2

How reproducible:
Always

Additional info:
Gluster volume profiles were captured for both the gluster file and the gluster block volume during the MongoDB run phase. Profiles are here: http://perf1.perf.lab.eng.bos.redhat.com/pub/shberry/mongo/
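For reference, a minimal sketch of how a volume profile like the ones linked above is typically captured during the run phase (the volume name vol_mongo below is a placeholder, not taken from this report):

# start collecting profile statistics on the volume
gluster volume profile vol_mongo start

# ... run the YCSB load/run phases against MongoDB here ...

# dump the I/O statistics (per-FOP latencies, block sizes, bytes read/written)
gluster volume profile vol_mongo info

# stop profiling once the run is complete
gluster volume profile vol_mongo stop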
On a first quick analysis, I see the following amounts of data read/written:

Block:
             172.17.40.33    172.17.40.34    172.17.40.35
    time:    00:03:08        00:03:08        00:03:08
    read:    12.686 MiB      16.719 MiB      22.766 MiB
    written: 12.861 GiB      12.861 GiB      12.861 GiB

File:
             172.17.40.33    172.17.40.34    172.17.40.35
    time:    00:04:45        00:04:45        00:04:45
    read:    20.931 KiB      20.110 GiB      21.485 MiB
    written: 26.957 GiB      26.957 GiB      26.957 GiB

So we see more than twice the amount of data written on gluster file. We also see 20 GiB read from one of the bricks (.34), while another one (.35) shows an amount similar to gluster block and the third one (.33) shows minimal reads.

Looking at the profile I also see INODELK requests for gluster file. These requests are typically made by self-heal (regular operations use FINODELK). However, the read/write block size for self-heal should be 128 KiB, and we have very few requests of that size.

It would be interesting to check the self-heal and mount logs to see whether a self-heal was running. If not, we definitely need to know what caused the read of 20 GiB of data.

One possibility is that this amount of data is read because of a less efficient cache (gluster only keeps data in the kernel cache for 1 second by default). If that's the case, we can try a higher timeout, just to see if it helps and how much.
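As an illustration of the timeout experiment suggested above (a sketch only; the server, volume name, mount point, and values are placeholders; attribute-timeout and entry-timeout are the standard glusterfs FUSE mount options, both 1 second by default):

# remount the volume with longer kernel attribute/entry cache timeouts
mount -t glusterfs -o attribute-timeout=60,entry-timeout=60 <server>:/<volname> /mnt/test

# equivalent when starting the client directly
glusterfs --attribute-timeout=60 --entry-timeout=60 --volfile-server=<server> --volfile-id=<volname> /mnt/test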
Can we test with the latest OCS (3.11) and OCP (3.11) releases? Is that converged mode or independent mode? If the latter, note that RHGS 3.4 batch update 1 has been released as well (I vaguely remember the INODELK issues -- was it in https://bugzilla.redhat.com/show_bug.cgi?id=1630688 ?).

Please set the severity as well.
Hi Xavi, when you say "gluster only keeps data in kernel cache for 1 second by default", which cache do you mean? There are three kernel caches we usually interact with -- the attribute cache, the entry cache, and the page cache. The attribute cache stores file metadata, the entry cache stores path resolution data, and the page cache stores file content.

From the context I infer you are most likely talking about the page cache. However, the 1 second timeout -- and even the fact that the given type of cache is subject to a timeout at all -- holds only for the attribute and entry caches. The page cache is handled differently. The following page-cache-related options exist in the FUSE protocol:

- FUSE_AUTO_INVAL_DATA: overall fuse option, enabling in-kernel invalidation logic
- FOPEN_KEEP_CACHE: per-open flag, requesting not to drop the page cache on a new open (dropping is the default behavior)
- FOPEN_DIRECT_IO: per-open flag, requesting avoidance of the page cache

Glusterfs implements a common logic for managing the first two, controllable via the "--fopen-keep-cache" option (see https://review.gluster.org/c/glusterfs/+/5770), and there is "--direct-io-mode" for controlling the latter. There is also the FUSE_NOTIFY_INVAL_INODE reverse message, which can be used to trigger inode data invalidation on demand. It's wrapped in the inode_invalidate() libglusterfs function (and is in use by certain xlators).

TL;DR: to extend the longevity of the page cache you can try passing the "--fopen-keep-cache" glusterfs option (or the "-ofopen-keep-cache" mount flag) -- but whether it has any effect depends on the I/O pattern.
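A minimal sketch of how the options mentioned above could be passed (server, volume name, and mount point are placeholders; whether this helps depends on the workload's I/O pattern):

# keep the kernel page cache across opens instead of dropping it on each open
mount -t glusterfs -o fopen-keep-cache <server>:/<volname> /mnt/test

# the same option when starting the client directly
glusterfs --fopen-keep-cache --volfile-server=<server> --volfile-id=<volname> /mnt/test

# the opposite experiment: bypass the page cache entirely
mount -t glusterfs -o direct-io-mode=enable <server>:/<volname> /mnt/test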
Amar, what's the next step here? Who owns this?
> Amar, what's the next step here? Who owns this?

Yaniv, this is currently a known issue: the pattern with which the application accesses the storage makes it better suited to a setup where the client cache is solid (in this case gluster-block). We need aggressive caching implemented in glusterfs to solve this (https://github.com/gluster/glusterfs/issues/436).

Right now we are trying to keep all the performance-related bugs in one place and then analyze them as a group. The plan is to have a group of people working on this; some effort has already started, with a few people. This is significant work, as we know the bottlenecks.
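Until that aggressive-caching work lands, the existing client-side cache tunables can at least be inspected; a hedged sketch (the volume name vol_mongo is a placeholder and the values are examples, not a recommendation coming out of this BZ):

# list the current performance xlator settings for the volume
gluster volume get vol_mongo all | grep performance.

# examples of existing client-side cache knobs
gluster volume set vol_mongo performance.cache-size 256MB
gluster volume set vol_mongo performance.write-behind-window-size 4MB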
Elvir, is that a dup of your BZ on perf. of block vs file?
(In reply to Yaniv Kaul from comment #8)
> Elvir, is that a dup of your BZ on perf. of block vs file?

This BZ was opened after a test I ran on an environment that is different from the block-vs-file comparison we are doing now in the scale lab. The plan is to run the same test as part of the "gluster-block / ceph rbd / glusterfs" comparison that is in progress in the scale lab, so this BZ could be visible and relevant for that test too -- that is why I linked it in the test document.
Can we add a new profile, say mongo-perf-profile, to the gluster repo? It would help us keep the different profiles in one place and also allow us to provide better results to users.
Now that the consistency issues are fixed upstream (which was the primary reason for evolving db-profile), I don't see a specific reason to keep having different profiles. The default configuration (which has write-behind enabled) should do the job, and there is nothing specific to MongoDB itself. Note that the only dependency is bz 1648781, which again benefits broader workloads too. My primary motivation is to discourage having too many configurations.

So, to summarize, we _can_ have a MongoDB-specific profile, but I don't see the need to maintain one.
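For context, a sketch of how such a group profile would be applied if one were added (mongo-perf-profile is the hypothetical name proposed above; on the servers, group profiles are plain key=value option files under /var/lib/glusterd/groups/):

# apply all options listed in /var/lib/glusterd/groups/mongo-perf-profile to the volume
gluster volume set vol_mongo group mongo-perf-profile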
Anything further to be done on bz, or can we close this?
Closing.