Description of problem:

On a simple single-brick distribute volume, I'm running tests to validate the glusterfs-fuse client's use of the page cache. The tests are indicating that a read following a write is reading from the brick, not from client cache. In contrast, a second read gets data from the client cache.

Version-Release number of selected component (if applicable):
glusterfs-*5.2-1.el7.x86_64
kernel-3.10.0-957.el7.x86_64 (RHEL 7.6)

How reproducible:
Consistently

Steps to Reproduce:
1. Use fio to create a data set that would fit easily in the page cache. My client has 128 GB RAM; I'll create a 64 GB data set:

fio --name=initialwrite --ioengine=sync --rw=write \
  --direct=0 --create_on_open=1 --end_fsync=1 --bs=128k \
  --directory=/mnt/glustervol/ --filename_format=f.\$jobnum.\$filenum \
  --filesize=16g --size=16g --numjobs=4

2. Run an fio read test that reads the data set from step 1, without invalidating the page cache:

fio --name=readtest --ioengine=sync --rw=read --invalidate=0 \
  --direct=0 --bs=128k --directory=/mnt/glustervol/ \
  --filename_format=f.\$jobnum.\$filenum --filesize=16g \
  --size=16g --numjobs=4

Read throughput is much lower than it would be if reading from page cache:

READ: bw=573MiB/s (601MB/s), 143MiB/s-144MiB/s (150MB/s-150MB/s), io=64.0GiB (68.7GB), run=114171-114419msec

Reads are going over the 10GbE network as shown in (edited) sar output:

05:01:04 AM     IFACE   rxpck/s   txpck/s     rxkB/s    txkB/s   rxcmp/s
05:01:06 AM       em1 755946.26  40546.26 1116287.75   3987.24      0.00

[There is some read amplification here: application is getting lower throughput than what client is reading over the n/w. More on that later]

3. Run the read test in step 2 again. This time read throughput is really high, indicating read from cache, rather than over the network:

READ: bw=14.8GiB/s (15.9GB/s), 3783MiB/s-4270MiB/s (3967MB/s-4477MB/s), io=64.0GiB (68.7GB), run=3837-4331msec

Expected results:
The read test in step 2 should be reading from page cache, and should be giving throughput close to what we get in step 3.

Additional Info:

gluster volume info:

Volume Name: perfvol
Type: Distribute
Volume ID: 7033539b-0331-44b1-96cf-46ddc6ee2255
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: 172.16.70.128:/mnt/rhs_brick1
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
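For anyone reproducing this: a minimal way to check whether step 2 is served from the client's page cache is to compare page-cache usage before and after the read test. The commands below assume the mount point and ~64 GB data-set size used above; drop_caches must be run as root and flushes the whole client cache:

# optional cold-cache baseline before step 2
sync; echo 3 > /proc/sys/vm/drop_caches

# page-cache size; run before and after the read test
grep ^Cached: /proc/meminfo

If the reads went over the network rather than from cache, Cached grows by roughly the data-set size (~64 GB) during step 2; if they were served from cache, it stays roughly flat and the fio bandwidth is much higher.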
(In reply to Manoj Pillai from comment #0) [...] > 1. use fio to create a data set that would fit easily in the page cache. My > client has 128 GB RAM; I'll create a 64 GB data set: > > fio --name=initialwrite --ioengine=sync --rw=write \ > --direct=0 --create_on_open=1 --end_fsync=1 --bs=128k \ > --directory=/mnt/glustervol/ --filename_format=f.\$jobnum.\$filenum \ > --filesize=16g --size=16g --numjobs=4 > Memory usage on the client while the write test is running: <excerpt> # sar -r 5 Linux 3.10.0-957.el7.x86_64 (c09-h08-r630.rdu.openstack.engineering.redhat.com) 01/10/2019 _x86_64_ (56 CPU) 05:35:36 AM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty 05:35:41 AM 126671972 4937712 3.75 0 2974352 256704 0.18 1878020 1147776 36 05:35:46 AM 126671972 4937712 3.75 0 2974352 256704 0.18 1878020 1147776 36 05:35:51 AM 126666904 4942780 3.76 0 2974324 259900 0.19 1879948 1147772 16 05:35:56 AM 126665820 4943864 3.76 0 2974348 261300 0.19 1880304 1147776 24 05:36:01 AM 126663136 4946548 3.76 0 2974348 356356 0.25 1881500 1147772 20 05:36:06 AM 126663028 4946656 3.76 0 2974348 356356 0.25 1881540 1147772 20 05:36:11 AM 126664444 4945240 3.76 0 2974388 356356 0.25 1880648 1147788 32 05:36:16 AM 126174984 5434700 4.13 0 3449508 930284 0.66 1892912 1622536 32 05:36:21 AM 120539884 11069800 8.41 0 9076076 930284 0.66 1893784 7247852 32 05:36:26 AM 114979592 16630092 12.64 0 14620932 930284 0.66 1893796 12793472 32 05:36:31 AM 109392488 22217196 16.88 0 20192112 930284 0.66 1893796 18365764 32 05:36:36 AM 104113900 27495784 20.89 0 25457272 930284 0.66 1895152 23630336 32 05:36:41 AM 98713688 32895996 25.00 0 30842800 930284 0.66 1895156 29015400 32 05:36:46 AM 93355560 38254124 29.07 0 36190264 930688 0.66 1897548 34361664 32 05:36:51 AM 87640900 43968784 33.41 0 41885972 930688 0.66 1897556 40057860 32 05:36:56 AM 81903068 49706616 37.77 0 47626388 930688 0.66 1897004 45798848 0 05:37:01 AM 76209860 55399824 42.09 0 53303272 930688 0.66 1897004 51475716 0 05:37:06 AM 70540340 61069344 46.40 0 58956264 930688 0.66 1897004 57128836 0 05:37:11 AM 64872776 66736908 50.71 0 64609648 930688 0.66 1897000 62782624 0 05:37:16 AM 59376144 72233540 54.88 0 70096880 930688 0.66 1897368 68270084 0 05:37:21 AM 71333376 60276308 45.80 0 58169584 356740 0.25 1891388 56342848 0 05:37:26 AM 126653336 4956348 3.77 0 2974476 356740 0.25 1891392 1148348 0 05:37:31 AM 126654360 4955324 3.77 0 2974388 356740 0.25 1891380 1147784 0 05:37:36 AM 126654376 4955308 3.77 0 2974388 356740 0.25 1891380 1147784 0 05:37:41 AM 126654376 4955308 3.77 0 2974388 356740 0.25 1891380 1147784 0 </excerpt> So as the write test progresses, kbcached steadily increases. But looks like the cached data is dropped subsequently.
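To confirm per file that the written data is no longer resident after the write test, a tool like vmtouch (a third-party utility, if installed on the client) can be pointed at one of the fio files; the file name below assumes the filename_format from comment #0:

# reports how many pages of this file are currently resident in the page cache
vmtouch /mnt/glustervol/f.0.0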
When I run the same sequence of tests on an XFS file system on the server, I get the expected results: both steps 2 and 3 of comment #0 report high read throughput (15+ GiB/s), indicating that data is read from the page cache.
(In reply to Manoj Pillai from comment #0)
[...]
>
> Read throughput is much lower than it would be if reading from page cache:
> READ: bw=573MiB/s (601MB/s), 143MiB/s-144MiB/s (150MB/s-150MB/s), io=64.0GiB
> (68.7GB), run=114171-114419msec
>
> Reads are going over the 10GbE network as shown in (edited) sar output:
> 05:01:04 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s
> 05:01:06 AM em1 755946.26 40546.26 1116287.75 3987.24 0.00
>
> [There is some read amplification here: application is getting lower
> throughput than what client is reading over the n/w. More on that later]
>

This turned out to be primarily read-ahead related. Opened a new bug for it:
https://bugzilla.redhat.com/show_bug.cgi?id=1665029
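Not the topic of this bug, but for anyone who hits the same amplification: one quick way to check whether the client-side read-ahead translator is contributing is to disable it and re-run step 2 (volume name from comment #0; see the new bug above for the actual analysis):

gluster volume set perfvol performance.read-ahead off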
From preliminary tests I see two reasons for this:

1. inode invalidations triggered by md-cache
2. FUSE auto-invalidations

With a hacky fix removing both of the above, I can see read-after-write being served from the kernel page cache.

I'll update the bug with more details discussing the validity/limitations of the above two approaches later.
(In reply to Raghavendra G from comment #4)
> From preliminary tests I see two reasons for this:
> 1. inode invalidations triggered by md-cache
> 2. FUSE auto-invalidations

Tried with kernel NFS, another distributed fs solution. I see that the cache is retained at the end of the write test, and both read-after-write and read-after-read are served from the page cache.

In principle, if kNFS can do it, FUSE should be able to do it. I think :D.
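For context, the kNFS comparison was simply the same fio write/read sequence from comment #0 run against a kernel NFS mount of an export on the same server; roughly the following, with the export path being illustrative rather than the exact one used:

# on the client; default NFS close-to-open caching semantics
mkdir -p /mnt/nfs
mount -t nfs 172.16.70.128:/mnt/xfs_export /mnt/nfs
# then re-run the fio jobs from comment #0 with --directory=/mnt/nfs/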
REVIEW: https://review.gluster.org/22109 (mount/fuse: expose fuse-auto-invalidation as a mount option) posted (#1) for review on master by Raghavendra G
(In reply to Manoj Pillai from comment #5)
> (In reply to Raghavendra G from comment #4)
> > From preliminary tests I see two reasons for this:
> > 1. inode invalidations triggered by md-cache
> > 2. FUSE auto-invalidations
>
> Tried with kernel NFS, another distributed fs solution. I see that the cache
> is retained at the end of the write test, and both read-after-write and
> read-after-read are served from the page cache.
>
> In principle, if kNFS can do it, FUSE should be able to do it. I think :D.

kNFS and FUSE have different invalidation policies.

* kNFS provides close-to-open consistency. To quote from their FAQ [1]:

"Linux implements close-to-open cache consistency by comparing the results of a GETATTR operation done just after the file is closed to the results of a GETATTR operation done when the file is next opened. If the results are the same, the client will assume its data cache is still valid; otherwise, the cache is purged."

For the workload used in this bz, the file is not changed between close and open. Hence the two stat values fetched - at close and at open - match, and the page cache is retained.

* FUSE auto-invalidation compares the timestamps of the cached stat with the values obtained from the underlying filesystem implementation in all codepaths where stat is fetched. This means the comparison happens in the lookup, (f)stat, (f)setattr etc. codepaths. Since lookup and (f)stat can happen asynchronously and concurrently with writes, they will end up detecting a delta between the two stat values, resulting in a cache purge.

Please note that the consistency offered by FUSE is stronger than close-to-open consistency: it provides close-to-open consistency along with consistency in codepaths like lookup, fstat etc.

We have the following options:

* Disable auto-invalidation and use an invalidation policy custom-designed for glusterfs. That policy can be the same as NFS close-to-open consistency, or something stronger.

* Check whether the current (stricter) form of auto-invalidation provides any useful benefits over close-to-open consistency. If not, change FUSE auto-invalidation to close-to-open consistency.

[1] http://nfs.sourceforge.net/#faq_a8
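To make the close-to-open check concrete, here is a rough shell analogy of the comparison described in the FAQ quote above (purely illustrative; the real NFS client compares the GETATTR results it caches internally, not stat(1) output):

f=/mnt/glustervol/f.0.0

# attributes recorded when the file was last closed
stat_at_close=$(stat -c '%Y %Z %s' "$f")

# ... file is closed, time passes, file is opened again ...

# attributes fetched at the next open
stat_at_open=$(stat -c '%Y %Z %s' "$f")

# close-to-open: keep the cached pages only if the two match
if [ "$stat_at_close" = "$stat_at_open" ]; then
    echo "attrs unchanged: cached data kept"
else
    echo "attrs changed: cache purged"
fi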
Miklos,

It would be helpful if you could comment on comment #7.

regards,
Raghavendra
Note that a lease-based invalidation policy would be a complete solution, but it will take some time to implement and get working in Glusterfs.
REVIEW: https://review.gluster.org/22109 (mount/fuse: expose auto-invalidation as a mount option) merged (#13) on master by Amar Tumballi
The underlying problem is that auto-invalidation cannot differentiate local and remote modification based on mtime alone.

What NFS apparently does is refresh attributes immediately after a write (not sure how often it does this; I guess not after each individual write). Maybe FUSE should do this if auto-invalidation is enabled, but if the filesystem can do its own invalidation, possibly based on better information than c/mtime, then that seems to be a better option.
REVIEW: https://review.gluster.org/22178 (mount/fuse: fix bug related to --auto-invalidation in mount script) posted (#1) for review on master by Raghavendra G
REVIEW: https://review.gluster.org/22178 (mount/fuse: fix bug related to --auto-invalidation in mount script) merged (#2) on master by Raghavendra G
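With the patches above, auto-invalidation can now be turned off at mount time. Something like the following should work on a build that contains them; the option name is taken from the patch titles, so please verify against glusterfs --help on that build (the mount script should expose an equivalent option after this fix; check mount.glusterfs for its exact spelling):

# via the client binary, using the volume/server from comment #0
glusterfs --auto-invalidation=no --volfile-server=172.16.70.128 \
    --volfile-id=perfvol /mnt/glustervol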
REVIEW: https://review.gluster.org/22111 (performance/md-cache: introduce an option to control invalidation of inodes) merged (#18) on master by Raghavendra G
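The md-cache side can then be controlled per volume. Assuming the option keeps the name used in the patch above (please confirm with `gluster volume set help` on a build that has it), disabling md-cache-triggered invalidation would look like:

gluster volume set perfvol performance.global-cache-invalidation off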
REVIEW: https://review.gluster.org/22193 (performance/md-cache: change the op-version of "global-cache-invalidation") posted (#1) for review on master by Raghavendra G
REVIEW: https://review.gluster.org/22193 (performance/md-cache: change the op-version of "global-cache-invalidation") merged (#2) on master by Amar Tumballi
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-6.0, please open a new bug report.

glusterfs-6.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2019-March/000120.html
[2] https://www.gluster.org/pipermail/gluster-users/