Description of problem:

By default, fuse invalidates the page cache for an inode on every file open. This is generally inefficient, particularly for read-only or read-mostly workloads.

Version-Release number of selected component (if applicable): 3.3

How reproducible: 100%

Steps to Reproduce:
1. Create a largish file on a glusterfs volume, but small enough to fit into the local page cache (e.g., 1GB).
2. Repeatedly cat the file. On a single VM, this takes a few seconds for each complete file read.
3. Alternatively, observe the cached memory drop and repopulate on each read (via free or top).

Expected results:

Repeated reads should ideally read file data from the local page cache. This reduces the total read time to the order of milliseconds and eliminates the need to pass further read requests down into gluster and over the network.
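The reproduction steps can be approximated with a small timing script. This is only a sketch: the helper names (time_repeated_reads, make_sample_file) and the 4MB sample size are hypothetical stand-ins, and to reproduce the report the path would need to point at a file on a glusterfs fuse mount rather than a temporary file.

```python
import os
import tempfile
import time


def time_repeated_reads(path, repeats=5, chunk_size=1 << 20):
    """Read the whole file `repeats` times; return elapsed seconds per pass."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        # Each pass re-opens the file, as repeated `cat` invocations would.
        with open(path, "rb") as f:
            while f.read(chunk_size):
                pass
        timings.append(time.perf_counter() - start)
    return timings


def make_sample_file(size_bytes=4 << 20):
    """Create a throwaway file (hypothetical stand-in for the 1GB test file)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(size_bytes))
    return path


if __name__ == "__main__":
    sample = make_sample_file()
    try:
        for i, secs in enumerate(time_repeated_reads(sample), 1):
            print(f"pass {i}: {secs:.4f}s")
    finally:
        os.remove(sample)
```

If the page cache survives the open, passes after the first should be noticeably faster than the first; if it is invalidated on every open, all passes should take about the same time.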
The fuse kernel module provides the FOPEN_KEEP_CACHE flag to bypass the invalidation on open. I've prototyped integration of this flag into mount/fuse via the 'fopen-keep-cache' mount option (or the glusterfs '--fopen-keep-cache' command line option). This change includes extra validation of locally cached inode attributes against newly received attributes to detect remote changes. The end result is that unconditional local cache invalidations are replaced with conditional invalidations, issued only when we know the remote side has been modified.

I have run a 16-thread kernel compile job against a single brick volume to test the effects of improved local caching. The job is read-only with respect to the volume (i.e., object files go to local storage). The gluster brick is an XFS formatted ramdisk. The results, in terms of time to complete, are as follows:

- gluster NFS: 1:47
- Default glusterfs graph: 7:53
- No client-side cache xlators: 9:32
- No client-side cache xlators, fopen-keep-cache enabled: 6:01
- "" + fuse hacks to disable atime* invalidations: 5:19

* - FUSE appears to unconditionally invalidate cached attributes on read operations to pick up atime changes. This assumes the user cares to track atime in the first place. Disabling these invalidations has a further positive effect on this test, but this is something we'll have to try and address in fuse...
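The idea of validating cached attributes against newly received ones, and invalidating only on a mismatch, can be sketched as follows. This is an illustration and not the code from the patch: the class and function names are hypothetical, and the choice of mtime and size as the compared attributes is an assumption, since the comment does not say which attributes are checked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attrs:
    """Subset of inode attributes used for the comparison (an assumption)."""
    mtime_ns: int
    size: int


def needs_invalidation(cached, fresh):
    """Return True when cached pages should be dropped for this inode."""
    if cached is None:
        # Nothing is cached yet, so there is nothing to invalidate.
        return False
    return cached.mtime_ns != fresh.mtime_ns or cached.size != fresh.size


class AttrCache:
    """Hypothetical per-mount cache of the last attributes seen per inode."""

    def __init__(self):
        self._attrs = {}
        self.invalidations = []  # inodes for which an invalidation was issued

    def update(self, inode, fresh):
        # Invalidate only when the attributes show a remote modification.
        if needs_invalidation(self._attrs.get(inode), fresh):
            self.invalidations.append(inode)
        self._attrs[inode] = fresh


if __name__ == "__main__":
    cache = AttrCache()
    cache.update(1, Attrs(mtime_ns=100, size=10))  # first sighting
    cache.update(1, Attrs(mtime_ns=100, size=10))  # unchanged, keep pages
    cache.update(1, Attrs(mtime_ns=200, size=10))  # changed remotely
    print(cache.invalidations)
```

With an unconditional policy, every open would count as an invalidation; here only the third update does.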
The proposed fix has been posted for review: http://review.gluster.com/3584
CHANGE: http://review.gluster.com/3584 (fuse/md-cache: add support for the 'fopen-keep-cache' mount option) merged in master by Anand Avati (avati)
(In reply to comment #3)
> The proposed fix has been posted for review:
>
> http://review.gluster.com/3584

Brian, the patch is very comprehensive. I see just one minor issue with it: if the kernel provides FUSE < 7.12, the invalidation functionality is not available, and the invalidate callback will silently become a no-op. AFAICS, with --fopen-keep-cache, invalidation is not just a hint to the kernel about disposable memory; correct operation relies on it. So it would be better to fail if --fopen-keep-cache is used with such a kernel.
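The suggested check amounts to a version comparison at mount time. A minimal sketch, assuming the negotiated FUSE protocol version is available as a (major, minor) pair; the names below are hypothetical, and the fix that was eventually merged may handle the unsupported case differently than raising an error:

```python
# Oldest FUSE protocol version with invalidation support, per the comment above.
MIN_INVAL_PROTO = (7, 12)


class MountConfigError(Exception):
    """Raised when a mount option cannot be honoured safely (hypothetical)."""


def inval_supported(proto_major, proto_minor):
    """True when the kernel's FUSE protocol can deliver invalidations."""
    return (proto_major, proto_minor) >= MIN_INVAL_PROTO


def check_keep_cache_support(proto_major, proto_minor, fopen_keep_cache):
    """Refuse fopen-keep-cache on kernels where invalidation is a no-op."""
    if fopen_keep_cache and not inval_supported(proto_major, proto_minor):
        raise MountConfigError(
            "fopen-keep-cache requires FUSE >= 7.12, got "
            f"{proto_major}.{proto_minor}"
        )


if __name__ == "__main__":
    check_keep_cache_support(7, 13, fopen_keep_cache=True)   # accepted
    check_keep_cache_support(7, 11, fopen_keep_cache=False)  # option unused
    try:
        check_keep_cache_support(7, 11, fopen_keep_cache=True)
    except MountConfigError as err:
        print(err)
```

Failing loudly avoids the case where stale pages are served because an invalidation was silently dropped.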
Hmm, yes --fopen-keep-cache depends on the fuse invalidation functionality. I'll look into fixing that up. Thanks for the review Csaba.
Fix posted to address Csaba's comment: http://review.gluster.com/3690
CHANGE: http://review.gluster.com/3690 (mount/fuse: check for fuse inval notify support when fopen-keep-cache enabled) merged in master by Anand Avati (avati)
Verified the fix on the build:
===============================
glusterfs 3.4.0.23rhs built on Aug 26 2013 09:03:20

================================================================================
Test Case:
================================================================================
1. Create a 1 x 2 replicate volume. Start the volume.
2. Create a fuse mount.
3. Create a 512MB file from the mount point:
   dd if=/dev/urandom of=./test_file bs=1M count=512
4. On the client node, perform the following:
   a. Record the cache memory drop and repopulate on every read. Execute:
      free -m -s 1
   b. Record the time taken by each read. From the mount point, execute:
      for i in `seq 1 10`; do time cat ./test_file > /dev/null ; done

Repeat the above test case for the following scenarios
================================================================================
Scenario 1: No options while creating the fuse mount.
Scenario 2: Use the "fopen-keep-cache" mount option while creating the fuse mount.

Expected Result:-
==================
Scenario 1: The time taken by each read after the first is in almost the same range as the first read.
Scenario 2: The time taken by each read after the first should be much lower when the fopen-keep-cache mount option is set.
Actual Result:-
==================
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Scenario 1 :-
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
root@darrel [Aug-27-2013-14:34:26] >for i in `seq 1 10`; do time cat ./test_file > /dev/null ; sleep 1 ; done
real 0m1.113s user 0m0.006s sys 0m0.338s
real 0m1.503s user 0m0.013s sys 0m0.534s
real 0m1.123s user 0m0.002s sys 0m0.414s
real 0m1.445s user 0m0.004s sys 0m0.535s
real 0m1.055s user 0m0.006s sys 0m0.338s
real 0m1.194s user 0m0.004s sys 0m0.361s
real 0m1.223s user 0m0.008s sys 0m0.443s
real 0m1.052s user 0m0.007s sys 0m0.371s
real 0m1.207s user 0m0.006s sys 0m0.400s
real 0m1.064s user 0m0.007s sys 0m0.392s
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Scenario 2:-
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
root@darrel [Aug-27-2013-15:01:14] >for i in `seq 1 10`; do time cat ./test_file > /dev/null ; sleep 1 ; done
real 0m1.196s user 0m0.005s sys 0m0.347s
real 0m0.179s user 0m0.002s sys 0m0.173s
real 0m0.148s user 0m0.003s sys 0m0.141s
real 0m0.143s user 0m0.000s sys 0m0.139s
real 0m0.146s user 0m0.002s sys 0m0.140s
real 0m0.147s user 0m0.000s sys 0m0.141s
real 0m0.144s user 0m0.001s sys 0m0.139s
real 0m0.147s user 0m0.001s sys 0m0.141s
real 0m0.147s user 0m0.000s sys 0m0.143s
real 0m0.146s user 0m0.001s sys 0m0.141s

Bug is fixed. Moving it to Verified state.
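To compare the two scenarios numerically, the "real" times above can be averaged over the warm reads, i.e. every read after the first, cache-populating one. The sketch below copies the values from the output above; the helper names are hypothetical.

```python
import re
from statistics import mean

_TIME_RE = re.compile(r"(\d+)m([\d.]+)s")


def to_seconds(stamp):
    """Convert a `time`-style stamp such as '0m1.113s' to seconds."""
    match = _TIME_RE.fullmatch(stamp)
    if match is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# "real" times copied from the two runs above.
SCENARIO_1 = ["0m1.113s", "0m1.503s", "0m1.123s", "0m1.445s", "0m1.055s",
              "0m1.194s", "0m1.223s", "0m1.052s", "0m1.207s", "0m1.064s"]
SCENARIO_2 = ["0m1.196s", "0m0.179s", "0m0.148s", "0m0.143s", "0m0.146s",
              "0m0.147s", "0m0.144s", "0m0.147s", "0m0.147s", "0m0.146s"]


def warm_mean(samples):
    """Mean of every read after the first (cache-populating) one."""
    return mean(to_seconds(s) for s in samples[1:])


if __name__ == "__main__":
    default = warm_mean(SCENARIO_1)
    keep_cache = warm_mean(SCENARIO_2)
    print(f"default mount:    {default:.3f}s per warm read")
    print(f"fopen-keep-cache: {keep_cache:.3f}s per warm read")
    print(f"speedup:          {default / keep_cache:.1f}x")
```

On these numbers, warm reads with fopen-keep-cache come out at about 0.15s against about 1.2s without it, roughly an 8x difference on this single-client test.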
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1262.html