Bug 1670710 - glusterfs-fuse client not benefiting from page cache on read after write
Summary: glusterfs-fuse client not benefiting from page cache on read after write
Keywords:
Status: CLOSED DUPLICATE of bug 1676468
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: fuse
Version: rhgs-3.4
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Csaba Henk
QA Contact: Rahul Hinduja
URL:
Whiteboard:
Depends On: 1664934 1674364
Blocks: 1629589
 
Reported: 2019-01-30 08:54 UTC by Raghavendra G
Modified: 2019-12-31 07:28 UTC (History)
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1664934
Environment:
Last Closed: 2019-04-22 12:14:39 UTC
Embargoed:



Description Raghavendra G 2019-01-30 08:54:57 UTC
+++ This bug was initially created as a clone of Bug #1664934 +++

Description of problem:
On a simple single-brick distribute volume, I'm running tests to validate the glusterfs-fuse client's use of the page cache. The tests indicate that a read following a write is served from the brick, not from the client cache. In contrast, a second read gets its data from the client cache.

Version-Release number of selected component (if applicable):

glusterfs-*5.2-1.el7.x86_64
kernel-3.10.0-957.el7.x86_64 (RHEL 7.6)

How reproducible:

Consistently

Steps to Reproduce:
1. use fio to create a data set that would fit easily in the page cache. My client has 128 GB RAM; I'll create a 64 GB data set:

fio --name=initialwrite --ioengine=sync --rw=write \
--direct=0 --create_on_open=1 --end_fsync=1 --bs=128k \
--directory=/mnt/glustervol/ --filename_format=f.\$jobnum.\$filenum \
--filesize=16g --size=16g --numjobs=4

2. run an fio read test that reads the data set from step 1, without invalidating the page cache:

fio --name=readtest --ioengine=sync --rw=read --invalidate=0 \
--direct=0 --bs=128k --directory=/mnt/glustervol/ \
--filename_format=f.\$jobnum.\$filenum --filesize=16g \
--size=16g --numjobs=4

Read throughput is much lower than it would be if reading from page cache:
READ: bw=573MiB/s (601MB/s), 143MiB/s-144MiB/s (150MB/s-150MB/s), io=64.0GiB (68.7GB), run=114171-114419msec

Reads are going over the 10GbE network as shown in (edited) sar output:
05:01:04 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s 
05:01:06 AM       em1 755946.26  40546.26 1116287.75   3987.24      0.00

[There is some read amplification here: the application is getting lower throughput than what the client is reading over the network. More on that later.]

3. Run the read test in step 2 again. This time read throughput is really high, indicating read from cache, rather than over the network:
READ: bw=14.8GiB/s (15.9GB/s), 3783MiB/s-4270MiB/s (3967MB/s-4477MB/s), io=64.0GiB (68.7GB), run=3837-4331msec


Expected results:

The read test in step 2 should be reading from page cache, and should be giving throughput close to what we get in step 3.
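One way to cross-check whether the data set from step 1 is still resident in the client's page cache before running step 2 (not part of the original report; vmtouch is an optional extra tool, /proc/meminfo works everywhere):

# After the write test (step 1), check how much data is cached on the client.
# With a retained cache, Cached should be on the order of the 64 GiB data set.
grep -E '^(Cached|Dirty):' /proc/meminfo

# If vmtouch is installed, per-file residency can be inspected directly:
vmtouch -v /mnt/glustervol/f.*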

Additional Info:

gluster volume info:

Volume Name: perfvol
Type: Distribute
Volume ID: 7033539b-0331-44b1-96cf-46ddc6ee2255
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: 172.16.70.128:/mnt/rhs_brick1
Options Reconfigured:
transport.address-family: inet
nfs.disable: on

--- Additional comment from Manoj Pillai on 2019-01-10 05:43:53 UTC ---

(In reply to Manoj Pillai from comment #0)
[...]
> 1. use fio to create a data set that would fit easily in the page cache. My
> client has 128 GB RAM; I'll create a 64 GB data set:
> 
> fio --name=initialwrite --ioengine=sync --rw=write \
> --direct=0 --create_on_open=1 --end_fsync=1 --bs=128k \
> --directory=/mnt/glustervol/ --filename_format=f.\$jobnum.\$filenum \
> --filesize=16g --size=16g --numjobs=4
> 

Memory usage on the client while the write test is running:

<excerpt>
# sar -r 5
Linux 3.10.0-957.el7.x86_64 (c09-h08-r630.rdu.openstack.engineering.redhat.com)         01/10/2019      _x86_64_ (56 CPU)

05:35:36 AM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
05:35:41 AM 126671972   4937712      3.75         0   2974352    256704      0.18   1878020   1147776        36
05:35:46 AM 126671972   4937712      3.75         0   2974352    256704      0.18   1878020   1147776        36
05:35:51 AM 126666904   4942780      3.76         0   2974324    259900      0.19   1879948   1147772        16
05:35:56 AM 126665820   4943864      3.76         0   2974348    261300      0.19   1880304   1147776        24
05:36:01 AM 126663136   4946548      3.76         0   2974348    356356      0.25   1881500   1147772        20
05:36:06 AM 126663028   4946656      3.76         0   2974348    356356      0.25   1881540   1147772        20
05:36:11 AM 126664444   4945240      3.76         0   2974388    356356      0.25   1880648   1147788        32
05:36:16 AM 126174984   5434700      4.13         0   3449508    930284      0.66   1892912   1622536        32
05:36:21 AM 120539884  11069800      8.41         0   9076076    930284      0.66   1893784   7247852        32
05:36:26 AM 114979592  16630092     12.64         0  14620932    930284      0.66   1893796  12793472        32
05:36:31 AM 109392488  22217196     16.88         0  20192112    930284      0.66   1893796  18365764        32
05:36:36 AM 104113900  27495784     20.89         0  25457272    930284      0.66   1895152  23630336        32
05:36:41 AM  98713688  32895996     25.00         0  30842800    930284      0.66   1895156  29015400        32
05:36:46 AM  93355560  38254124     29.07         0  36190264    930688      0.66   1897548  34361664        32
05:36:51 AM  87640900  43968784     33.41         0  41885972    930688      0.66   1897556  40057860        32
05:36:56 AM  81903068  49706616     37.77         0  47626388    930688      0.66   1897004  45798848         0
05:37:01 AM  76209860  55399824     42.09         0  53303272    930688      0.66   1897004  51475716         0
05:37:06 AM  70540340  61069344     46.40         0  58956264    930688      0.66   1897004  57128836         0
05:37:11 AM  64872776  66736908     50.71         0  64609648    930688      0.66   1897000  62782624         0
05:37:16 AM  59376144  72233540     54.88         0  70096880    930688      0.66   1897368  68270084         0
05:37:21 AM  71333376  60276308     45.80         0  58169584    356740      0.25   1891388  56342848         0
05:37:26 AM 126653336   4956348      3.77         0   2974476    356740      0.25   1891392   1148348         0
05:37:31 AM 126654360   4955324      3.77         0   2974388    356740      0.25   1891380   1147784         0
05:37:36 AM 126654376   4955308      3.77         0   2974388    356740      0.25   1891380   1147784         0
05:37:41 AM 126654376   4955308      3.77         0   2974388    356740      0.25   1891380   1147784         0
</excerpt>

So, as the write test progresses, kbcached steadily increases. But it looks like the cached data is dropped subsequently.
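A quick way to watch this live on the client, complementing the sar -r output above (plain /proc/meminfo, nothing gluster-specific):

# Watch page-cache and dirty-page usage every 5 seconds while the write test runs.
watch -n 5 "grep -E '^(Cached|Dirty):' /proc/meminfo"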

--- Additional comment from Manoj Pillai on 2019-01-10 05:52:14 UTC ---

When I run the same sequence of tests on an XFS file system on the server, I get the expected results: both step 2 and step 3 of comment #0 report high read throughput (15+ GiB/s), indicating that data is read from the page cache.

--- Additional comment from Manoj Pillai on 2019-01-10 11:01:23 UTC ---

(In reply to Manoj Pillai from comment #0)
[...]
> 
> Read throughput is much lower than it would be if reading from page cache:
> READ: bw=573MiB/s (601MB/s), 143MiB/s-144MiB/s (150MB/s-150MB/s), io=64.0GiB
> (68.7GB), run=114171-114419msec
> 
> Reads are going over the 10GbE network as shown in (edited) sar output:
> 05:01:04 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s 
> 05:01:06 AM       em1 755946.26  40546.26 1116287.75   3987.24      0.00
> 
> [There is some read amplification here: the application is getting lower
> throughput than what the client is reading over the network. More on that later.]
> 

This turned out to be primarily read-ahead related. Opened a new bug for it: https://bugzilla.redhat.com/show_bug.cgi?id=1665029.

--- Additional comment from Raghavendra G on 2019-01-23 13:04:54 UTC ---

From preliminary tests I see two reasons for this:
1. inode-invalidations triggered by md-cache
2. Fuse auto invalidations

With a hacky fix removing both of the above, I can see read-after-write being served from the kernel page cache. I'll update the bug later with more details discussing the validity/limitations of the above two approaches.

--- Additional comment from Manoj Pillai on 2019-01-24 04:43:40 UTC ---

(In reply to Raghavendra G from comment #4)
> From preliminary tests I see two reasons for this:
> 1. inode-invalidations triggered by md-cache
> 2. Fuse auto invalidations

Trying with kernel NFS, another distributed fs solution. I see that cache is retained at the end of the write test, and both read-after-write and read-after-read are served from the page cache.

In principle, if kNFS can do it, FUSE should be able to do it. I think :D.
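For reference, the kNFS comparison boils down to something like the following; the export path, options and mount point here are illustrative, not taken from the original test setup:

# On the server: export the brick's backend filesystem over kernel NFS
# (export path and options are illustrative).
echo '/mnt/rhs_brick1 *(rw,no_root_squash)' >> /etc/exports
exportfs -ra

# On the client: mount over kNFS and re-run the fio write/read sequence
# from comment #0 against this mount point instead of /mnt/glustervol.
mkdir -p /mnt/nfsvol
mount -t nfs 172.16.70.128:/mnt/rhs_brick1 /mnt/nfsvol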

--- Additional comment from Worker Ant on 2019-01-29 03:15:45 UTC ---

REVIEW: https://review.gluster.org/22109 (mount/fuse: expose fuse-auto-invalidation as a mount option) posted (#1) for review on master by Raghavendra G
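Assuming the option lands in the form proposed in that review, disabling FUSE auto-invalidation on a client mount would look roughly like this; the option name and syntax are taken from the patch under review, so treat them as tentative:

# Tentative: mount the volume with the proposed auto-invalidation knob turned off
# (exact option name/syntax depends on the final form of the patch).
glusterfs --auto-invalidation=no \
    --volfile-server=172.16.70.128 --volfile-id=perfvol /mnt/glustervol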

--- Additional comment from Raghavendra G on 2019-01-30 05:41:39 UTC ---

(In reply to Manoj Pillai from comment #5)
> (In reply to Raghavendra G from comment #4)
> > From preliminary tests I see two reasons for this:
> > 1. inode-invalidations triggered by md-cache
> > 2. Fuse auto invalidations
> 
> Trying with kernel NFS, another distributed fs solution. I see that cache is
> retained at the end of the write test, and both read-after-write and
> read-after-read are served from the page cache.
> 
> In principle, if kNFS can do it, FUSE should be able to do it. I think :D.

kNFS and FUSE have different invalidation policies.

* kNFS provides close-to-open consistency. To quote from their FAQ [1]

"Linux implements close-to-open cache consistency by comparing the results of a GETATTR operation done just after the file is closed to the results of a GETATTR operation done when the file is next opened. If the results are the same, the client will assume its data cache is still valid; otherwise, the cache is purged."

For the workload used in this bz, the file is not changed between close and open. Hence the two stat values fetched - at close and at open - match, and the page cache is retained (a small illustration of this check follows at the end of this comment).

* FUSE auto-invalidation compares the cached stat times with the values obtained from the underlying filesystem implementation on every codepath where a stat is fetched. This means the comparison happens in the lookup, (f)stat, (f)setattr etc. codepaths. Since (f)stat and lookup can happen asynchronously and concurrently with writes, they end up detecting a delta between the two stat values, resulting in a cache purge. Please note that the consistency offered by FUSE is stronger than close-to-open consistency: it provides close-to-open consistency along with consistency in codepaths like lookup, fstat etc.

We have the following options:

* Disable auto-invalidation and use a custom invalidation policy designed for glusterfs. That policy can be the same as NFS close-to-open consistency, or something stronger.
* Check whether the current form of auto-invalidation (though stricter) provides any useful benefit over close-to-open consistency. If not, change FUSE auto-invalidation to close-to-open consistency.

[1] http://nfs.sourceforge.net/#faq_a8
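To make the close-to-open idea above concrete, here is a small shell sketch; it is purely illustrative (not gluster or kernel code), and the file name is just one of the fio files from comment #0. The client remembers the attributes it saw when the file was closed and compares them with the attributes fetched at the next open; cached pages are kept only if they match.

# Illustrative only: emulate the close-to-open check with stat(1).
FILE=/mnt/glustervol/f.0.0

# Attributes recorded when the file was last closed (mtime, ctime, size, inode).
at_close=$(stat --format='%Y %Z %s %i' "$FILE")

# ... time passes; another client may or may not modify the file ...

# Attributes fetched when the file is opened again.
at_open=$(stat --format='%Y %Z %s %i' "$FILE")

if [ "$at_close" = "$at_open" ]; then
    echo "unchanged: cached pages for $FILE can be reused"
else
    echo "changed: cached pages for $FILE must be purged"
fi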

--- Additional comment from Raghavendra G on 2019-01-30 05:45:23 UTC ---

Miklos,

It would be helpful if you can comment on comment #7.

regards,
Raghavendra

--- Additional comment from Raghavendra G on 2019-01-30 05:59:06 UTC ---

Note that a lease-based invalidation policy would be a complete solution, but it will take some time to implement it and get it working in glusterfs.

Comment 3 Yaniv Kaul 2019-04-22 12:14:39 UTC

*** This bug has been marked as a duplicate of bug 1676468 ***

