Description of problem:

In an OCS environment, when a MongoDB workload was run on a gluster file volume and a gluster block volume respectively, the gluster block volume outperformed the gluster file volume by 35%. Both the gluster block volume and the gluster file volume were backed by a replica 3 volume carved out of an NVMe drive.

Commands used:

oc exec mongodb-1-8jpzk -- scl enable rh-mongodb32 -- mongo -u redhat -p redhat 10.131.0.29:27017/testdb --eval "db.usertable.remove({})"
oc exec ycsb-1-wtrgx -- ./bin/ycsb load mongodb -s -threads 20 -P workloads/workloadf -p mongodb.url=mongodb://redhat:redhat@10.131.0.29:27017/testdb -p recordcount=1000000 -p operationcount=3000000
oc exec ycsb-1-wtrgx -- ./bin/ycsb run mongodb -s -threads 20 -P workloads/workloadf -p mongodb.url=mongodb://redhat:redhat@10.131.0.29:27017/testdb -p recordcount=1000000 -p operationcount=3000000

The first command deletes previously created records, the second loads the records, and the third runs operations against them. The YCSB tool was used to generate workloadf. (Note: YCSB supports various workloads -- workloada, workloadb, etc.; details here: https://github.com/brianfrankcooper/YCSB/wiki/Core-Workloads)

The OPS achieved were as follows:
-----------------------------------
Gluster Block Volume: 18006 OPS (average over 10 runs)
Gluster File Volume : 11732 OPS (average over 10 runs)

Version-Release number of selected component (if applicable):

Server-side gluster version
---------------------------
glusterfs-libs-3.12.2-24.el7rhgs.x86_64
glusterfs-cli-3.12.2-24.el7rhgs.x86_64
glusterfs-fuse-3.12.2-24.el7rhgs.x86_64
glusterfs-server-3.12.2-24.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-24.el7rhgs.x86_64
gluster-block-0.2.1-26.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-24.el7rhgs.x86_64
glusterfs-3.12.2-24.el7rhgs.x86_64
glusterfs-api-3.12.2-24.el7rhgs.x86_64
python2-gluster-3.12.2-24.el7rhgs.x86_64

Client-side gluster version
---------------------------
glusterfs-3.12.2-24.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-24.el7rhgs.x86_64
glusterfs-fuse-3.12.2-24.el7rhgs.x86_64
glusterfs-libs-3.12.2-24.el7rhgs.x86_64

oc version
----------
oc v3.10.45
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://gprfs018.sbu.lab.eng.bos.redhat.com:8443
openshift v3.10.45
kubernetes v1.10.0+b81c8f8

MongoDB version
---------------
MongoDB 3.2

How reproducible:
Always

Additional info:
Gluster volume profiles were captured for both the gluster file and the gluster block volume during the MongoDB run phase. Profiles are here: http://perf1.perf.lab.eng.bos.redhat.com/pub/shberry/mongo/
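For reference, a minimal sketch of how a volume profile like the ones linked above is typically captured during the run phase (the volume name vol_mongo below is a placeholder, not taken from this report):

# start collecting profile statistics on the volume
gluster volume profile vol_mongo start

# ... run the YCSB load/run phases against MongoDB here ...

# dump the I/O statistics (per-FOP latencies, block sizes, bytes read/written)
gluster volume profile vol_mongo info

# stop profiling once the run is complete
gluster volume profile vol_mongo stop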
On a first quick analysis, I see the following amounts of data read/written:

Block:
             172.17.40.33    172.17.40.34    172.17.40.35
    time:    00:03:08        00:03:08        00:03:08
    read:    12.686 MiB      16.719 MiB      22.766 MiB
    written: 12.861 GiB      12.861 GiB      12.861 GiB

File:
             172.17.40.33    172.17.40.34    172.17.40.35
    time:    00:04:45        00:04:45        00:04:45
    read:    20.931 KiB      20.110 GiB      21.485 MiB
    written: 26.957 GiB      26.957 GiB      26.957 GiB

So we see more than twice the amount of data written on gluster file. We also see 20 GiB read from one of the bricks (.34), while another one (.35) shows an amount similar to gluster block and the third one (.33) shows minimal reads.

Looking at the profile I also see INODELK requests for gluster file. These requests are typically made by self-heal (regular operations use FINODELK). However, the read/write block size for self-heal should be 128 KiB, and we have very few requests of that size.

It would be interesting to check the self-heal and mount logs to see whether a self-heal was running. If not, we definitely need to know what caused the read of 20 GiB of data.

One possibility is that this amount of data is read because of a less efficient cache (gluster only keeps data in the kernel cache for 1 second by default). If that's the case, we can try a higher timeout, just to see if it helps and how much.
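As an illustration of the timeout experiment suggested above (a sketch only; the server, volume name, mount point, and values are placeholders; attribute-timeout and entry-timeout are the standard glusterfs FUSE mount options, both 1 second by default):

# remount the volume with longer kernel attribute/entry cache timeouts
mount -t glusterfs -o attribute-timeout=60,entry-timeout=60 <server>:/<volname> /mnt/test

# equivalent when starting the client directly
glusterfs --attribute-timeout=60 --entry-timeout=60 --volfile-server=<server> --volfile-id=<volname> /mnt/test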
Can we test with the latest OCS (3.11) and OCP (3.11) releases? Is that converged mode or independent mode? If the latter, note that RHGS 3.4 batch update 1 has been released as well (I vaguely remember the INODELK issues -- was it in https://bugzilla.redhat.com/show_bug.cgi?id=1630688 ?).

Please set the severity as well.
Hi Xavi, when you say "gluster only keeps data in kernel cache for 1 second by default", which cache do you mean? There are three kernel caches we usually interact with -- the attribute cache, the entry cache, and the page cache. The attribute cache stores file metadata, the entry cache stores path resolution data, and the page cache stores file content.

From the context I infer you are most likely talking about the page cache. However, the 1 second timeout -- and even the fact that the given type of cache is subject to a timeout at all -- holds only for the attribute and entry caches. The page cache is handled differently. The following page-cache-related options exist in the FUSE protocol:

- FUSE_AUTO_INVAL_DATA: overall fuse option, enabling in-kernel invalidation logic
- FOPEN_KEEP_CACHE: per-open flag, requesting not to drop the page cache on a new open (dropping is the default behavior)
- FOPEN_DIRECT_IO: per-open flag, requesting avoidance of the page cache

Glusterfs implements a common logic for managing the first two, controllable via the "--fopen-keep-cache" option (see https://review.gluster.org/c/glusterfs/+/5770), and there is "--direct-io-mode" for controlling the latter. There is also the FUSE_NOTIFY_INVAL_INODE reverse message, which can be used to trigger inode data invalidation on demand. It's wrapped in the inode_invalidate() libglusterfs function (and is in use by certain xlators).

TL;DR: to extend the longevity of the page cache you can try passing the "--fopen-keep-cache" glusterfs option (or the "-ofopen-keep-cache" mount flag) -- but whether it has any effect depends on the I/O pattern.
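A minimal sketch of how the options mentioned above could be passed (server, volume name, and mount point are placeholders; whether this helps depends on the workload's I/O pattern):

# keep the kernel page cache across opens instead of dropping it on each open
mount -t glusterfs -o fopen-keep-cache <server>:/<volname> /mnt/test

# the same option when starting the client directly
glusterfs --fopen-keep-cache --volfile-server=<server> --volfile-id=<volname> /mnt/test

# the opposite experiment: bypass the page cache entirely
mount -t glusterfs -o direct-io-mode=enable <server>:/<volname> /mnt/test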
Amar, what's the next step here? Who owns this?
> Amar, what's the next step here? Who owns this?

Yaniv, this is currently a known issue: the pattern with which the application accesses the storage makes it better suited to a setup where the client cache is solid (in this case gluster-block). We need aggressive caching implemented in glusterfs to solve this (https://github.com/gluster/glusterfs/issues/436).

Right now we are trying to keep all the performance-related bugs in one place and then analyze them as a group. The plan is to have a group of people working on this; some effort has already started, with a few people. This is significant work, as we know the bottlenecks.
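Until that aggressive-caching work lands, the existing client-side cache tunables can at least be inspected; a hedged sketch (the volume name vol_mongo is a placeholder and the values are examples, not a recommendation coming out of this BZ):

# list the current performance xlator settings for the volume
gluster volume get vol_mongo all | grep performance.

# examples of existing client-side cache knobs
gluster volume set vol_mongo performance.cache-size 256MB
gluster volume set vol_mongo performance.write-behind-window-size 4MB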
Elvir, is that a dup of your BZ on perf. of block vs file?
(In reply to Yaniv Kaul from comment #8)
> Elvir, is that a dup of your BZ on perf. of block vs file?

This BZ was opened after a test I ran on an environment that is different from the block-vs-file comparison we are doing now in the scale lab. The plan is to run the same test as part of the "gluster-block / ceph rbd / glusterfs" comparison that is in progress in the scale lab, so this BZ could be visible and relevant for that test too -- that is why I linked it in the test document.
Can we add a new profile, say mongo-perf-profile, to the gluster repo? It would help us keep the different profiles in one place and also allow us to provide better results to users.
Now that the consistency issues are fixed upstream (which was the primary reason for evolving db-profile), I don't see a specific reason to keep having different profiles. The default configuration (which has write-behind enabled) should do the job, and there is nothing specific to MongoDB itself. Note that the only dependency is bz 1648781, which again benefits broader workloads too. My primary motivation is to discourage having too many configurations.

So, to summarize, we _can_ have a MongoDB-specific profile, but I don't see the need to maintain one.
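For context, a sketch of how such a group profile would be applied if one were added (mongo-perf-profile is the hypothetical name proposed above; on the servers, group profiles are plain key=value option files under /var/lib/glusterd/groups/):

# apply all options listed in /var/lib/glusterd/groups/mongo-perf-profile to the volume
gluster volume set vol_mongo group mongo-perf-profile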
Anything further to be done on bz, or can we close this?
Closing.