Bug 1664934

Summary: glusterfs-fuse client not benefiting from page cache on read after write
Product: [Community] GlusterFS
Component: fuse
Version: 5
Hardware: x86_64
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Keywords: Performance
Type: Bug
Reporter: Manoj Pillai <mpillai>
Assignee: Csaba Henk <csaba>
CC: amukherj, bugs, csaba, guillaume.pavese, mszeredi, pasik, rgowdapp, shberry
Fixed In Version: glusterfs-6.0
Last Closed: 2019-03-25 16:33:00 UTC
Clones: 1670710 1674364 1676468
Bug Blocks: 1670710, 1674364, 1676468

Description Manoj Pillai 2019-01-10 05:17:17 UTC

Description of problem:
On a simple single brick distribute volume, I'm running tests to validate glusterfs-fuse client's use of page cache. The tests are indicating that a read following a write is reading from the brick, not from client cache. In contrast, a 2nd read gets data from the client cache.

Version-Release number of selected component (if applicable):

glusterfs-*5.2-1.el7.x86_64
kernel-3.10.0-957.el7.x86_64 (RHEL 7.6)

How reproducible:

Consistently

Steps to Reproduce:
1. use fio to create a data set that would fit easily in the page cache. My client has 128 GB RAM; I'll create a 64 GB data set:

fio --name=initialwrite --ioengine=sync --rw=write \
--direct=0 --create_on_open=1 --end_fsync=1 --bs=128k \
--directory=/mnt/glustervol/ --filename_format=f.\$jobnum.\$filenum \
--filesize=16g --size=16g --numjobs=4

2. run an fio read test that reads the data set from step 1, without invalidating the page cache:

fio --name=readtest --ioengine=sync --rw=read --invalidate=0 \
--direct=0 --bs=128k --directory=/mnt/glustervol/ \
--filename_format=f.\$jobnum.\$filenum --filesize=16g \
--size=16g --numjobs=4

Read throughput is much lower than it would be if reading from page cache:
READ: bw=573MiB/s (601MB/s), 143MiB/s-144MiB/s (150MB/s-150MB/s), io=64.0GiB (68.7GB), run=114171-114419msec

Reads are going over the 10GbE network as shown in (edited) sar output:
05:01:04 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s 
05:01:06 AM       em1 755946.26  40546.26 1116287.75   3987.24      0.00

[There is some read amplification here: application is getting lower throughput than what client is reading over the n/w. More on that later]      

3. Run the read test in step 2 again. This time read throughput is really high, indicating read from cache, rather than over the network:
READ: bw=14.8GiB/s (15.9GB/s), 3783MiB/s-4270MiB/s (3967MB/s-4477MB/s), io=64.0GiB (68.7GB), run=3837-4331msec
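
As a sanity check on the numbers in steps 2 and 3, the arithmetic can be redone from the reported run times and the sar sample. This is only a sketch: it assumes fio's aggregate bandwidth is total I/O divided by the longest job run time, and that sar's kB means 1024 bytes.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

DATA_BYTES = 64 * GIB  # 4 jobs x 16 GiB, as created in step 1


def mib_per_s(nbytes: int, seconds: float) -> float:
    """Throughput in MiB/s for nbytes transferred in the given time."""
    return nbytes / MIB / seconds


# Longest job run times reported by fio (msec converted to seconds).
step2_bw = mib_per_s(DATA_BYTES, 114.419)  # read after write
step3_bw = mib_per_s(DATA_BYTES, 4.331)    # read after read

# Network receive rate from the sar sample (assumption: kB = 1024 bytes).
rx_mib_per_s = 1116287.75 / 1024

# How much more data crosses the network than the application receives.
read_amplification = rx_mib_per_s / step2_bw

# How much faster the cached read is than the uncached one.
cache_speedup = step3_bw / step2_bw

print(f"step 2: {step2_bw:.0f} MiB/s, step 3: {step3_bw / 1024:.1f} GiB/s")
print(f"read amplification: {read_amplification:.2f}x")
print(f"cache speedup: {cache_speedup:.1f}x")
```

Under those assumptions, step 2 works out to about 573 MiB/s against roughly 1090 MiB/s received on the wire (about 1.9x read amplification), and step 3 is about 26x faster than step 2.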


Expected results:

The read test in step 2 should be reading from page cache, and should be giving throughput close to what we get in step 3.
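
For a quick check without fio, the same write, read, read-again sequence can be approximated at small scale with the Python standard library. This is a rough sketch, not a replacement for the fio runs: the function names are hypothetical, the file name merely imitates fio's f.$jobnum.$filenum pattern, and timings on small files are noisy.

```python
import os
import tempfile
import time
from typing import Tuple


def timed_read(path: str, block_size: int) -> Tuple[int, float]:
    """Read the whole file sequentially; return (bytes_read, seconds)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 only disables Python's userspace buffer;
    # the kernel page cache is still used (no O_DIRECT).
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start


def write_then_read_twice(directory: str, size: int,
                          block_size: int = 128 * 1024):
    """Write `size` bytes, fsync, then read the file twice and time each read."""
    path = os.path.join(directory, "f.0.0")
    block = b"\0" * block_size
    with open(path, "wb") as f:
        for _ in range(size // block_size):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # plays the role of fio's --end_fsync=1
    first = timed_read(path, block_size)   # read after write (step 2)
    second = timed_read(path, block_size)  # read after read (step 3)
    os.remove(path)
    return first, second


if __name__ == "__main__":
    # Point this at the glusterfs mount to compare against a local filesystem.
    with tempfile.TemporaryDirectory() as d:
        (n1, t1), (n2, t2) = write_then_read_twice(d, 64 * 1024 * 1024)
        print(f"read after write: {n1 / t1 / 1e6:.0f} MB/s")
        print(f"read after read:  {n2 / t2 / 1e6:.0f} MB/s")
```

If the page cache is retained after the write, the two reads should report similar throughput; a large gap between them is the symptom described here.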

Additional Info:

gluster volume info:

Volume Name: perfvol
Type: Distribute
Volume ID: 7033539b-0331-44b1-96cf-46ddc6ee2255
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: 172.16.70.128:/mnt/rhs_brick1
Options Reconfigured:
transport.address-family: inet
nfs.disable: on

Comment 1 Manoj Pillai 2019-01-10 05:43:53 UTC

(In reply to Manoj Pillai from comment #0)
[...]
> 1. use fio to create a data set that would fit easily in the page cache. My
> client has 128 GB RAM; I'll create a 64 GB data set:
> 
> fio --name=initialwrite --ioengine=sync --rw=write \
> --direct=0 --create_on_open=1 --end_fsync=1 --bs=128k \
> --directory=/mnt/glustervol/ --filename_format=f.\$jobnum.\$filenum \
> --filesize=16g --size=16g --numjobs=4
> 

Memory usage on the client while the write test is running:

<excerpt>
# sar -r 5
Linux 3.10.0-957.el7.x86_64 (c09-h08-r630.rdu.openstack.engineering.redhat.com)         01/10/2019      _x86_64_ (56 CPU)

05:35:36 AM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
05:35:41 AM 126671972   4937712      3.75         0   2974352    256704      0.18   1878020   1147776        36
05:35:46 AM 126671972   4937712      3.75         0   2974352    256704      0.18   1878020   1147776        36
05:35:51 AM 126666904   4942780      3.76         0   2974324    259900      0.19   1879948   1147772        16
05:35:56 AM 126665820   4943864      3.76         0   2974348    261300      0.19   1880304   1147776        24
05:36:01 AM 126663136   4946548      3.76         0   2974348    356356      0.25   1881500   1147772        20
05:36:06 AM 126663028   4946656      3.76         0   2974348    356356      0.25   1881540   1147772        20
05:36:11 AM 126664444   4945240      3.76         0   2974388    356356      0.25   1880648   1147788        32
05:36:16 AM 126174984   5434700      4.13         0   3449508    930284      0.66   1892912   1622536        32
05:36:21 AM 120539884  11069800      8.41         0   9076076    930284      0.66   1893784   7247852        32
05:36:26 AM 114979592  16630092     12.64         0  14620932    930284      0.66   1893796  12793472        32
05:36:31 AM 109392488  22217196     16.88         0  20192112    930284      0.66   1893796  18365764        32
05:36:36 AM 104113900  27495784     20.89         0  25457272    930284      0.66   1895152  23630336        32
05:36:41 AM  98713688  32895996     25.00         0  30842800    930284      0.66   1895156  29015400        32
05:36:46 AM  93355560  38254124     29.07         0  36190264    930688      0.66   1897548  34361664        32
05:36:51 AM  87640900  43968784     33.41         0  41885972    930688      0.66   1897556  40057860        32
05:36:56 AM  81903068  49706616     37.77         0  47626388    930688      0.66   1897004  45798848         0
05:37:01 AM  76209860  55399824     42.09         0  53303272    930688      0.66   1897004  51475716         0
05:37:06 AM  70540340  61069344     46.40         0  58956264    930688      0.66   1897004  57128836         0
05:37:11 AM  64872776  66736908     50.71         0  64609648    930688      0.66   1897000  62782624         0
05:37:16 AM  59376144  72233540     54.88         0  70096880    930688      0.66   1897368  68270084         0
05:37:21 AM  71333376  60276308     45.80         0  58169584    356740      0.25   1891388  56342848         0
05:37:26 AM 126653336   4956348      3.77         0   2974476    356740      0.25   1891392   1148348         0
05:37:31 AM 126654360   4955324      3.77         0   2974388    356740      0.25   1891380   1147784         0
05:37:36 AM 126654376   4955308      3.77         0   2974388    356740      0.25   1891380   1147784         0
05:37:41 AM 126654376   4955308      3.77         0   2974388    356740      0.25   1891380   1147784         0
</excerpt>

So as the write test progresses, kbcached steadily increases. But it looks like the cached data is dropped once the test ends.
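
To put numbers on that, here is a small calculation over a few kbcached samples copied from the excerpt above. It is a sketch that assumes sar reports these values in units of 1024 bytes.

```python
# (time, kbcached) samples copied from the sar -r excerpt
samples = [
    ("05:36:11", 2974388),   # last sample before the cache starts growing
    ("05:36:16", 3449508),
    ("05:37:16", 70096880),  # highest value seen during the write test
    ("05:37:21", 58169584),
    ("05:37:26", 2974476),   # back to the baseline
]

KB_PER_GIB = 1024 ** 2  # assumption: sar's kB is 1024 bytes

baseline = samples[0][1]
peak_time, peak = max(samples, key=lambda s: s[1])
final = samples[-1][1]

# Cache growth during the write test, and what is left once it ends.
growth_gib = (peak - baseline) / KB_PER_GIB
retained_kb = final - baseline

print(f"peak at {peak_time}: +{growth_gib:.2f} GiB over baseline")
print(f"retained after the test: {retained_kb} kB")
```

The growth comes to about 64 GiB, which matches the size of the data set, while only 88 kB above the baseline remains ten seconds after the peak.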

Comment 2 Manoj Pillai 2019-01-10 05:52:14 UTC

When I run the same sequence of tests on an XFS file system on the server, I get expected results: both step 2. and step 3. of comment #0 report high read throughput (15+GiB/s) indicating data is read from the page cache.

Comment 3 Manoj Pillai 2019-01-10 11:01:23 UTC

(In reply to Manoj Pillai from comment #0)
[...]
> 
> Read throughput is much lower than it would be if reading from page cache:
> READ: bw=573MiB/s (601MB/s), 143MiB/s-144MiB/s (150MB/s-150MB/s), io=64.0GiB
> (68.7GB), run=114171-114419msec
> 
> Reads are going over the 10GbE network as shown in (edited) sar output:
> 05:01:04 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s 
> 05:01:06 AM       em1 755946.26  40546.26 1116287.75   3987.24      0.00
> 
> [There is some read amplification here: application is getting lower
> throughput than what client is reading over the n/w. More on that later]    
> 

This turned out to be primarily read-ahead related. Opened a new bug for it: https://bugzilla.redhat.com/show_bug.cgi?id=1665029.

Comment 4 Raghavendra G 2019-01-23 13:04:54 UTC

From preliminary tests I see two reasons for this:
1. inode-invalidations triggered by md-cache
2. Fuse auto invalidations

With a hacky fix removing both of the above, I can see read after write being served from kernel page-cache. I'll update the bug with more details discussing validity/limitations with the above two approaches later.

Comment 5 Manoj Pillai 2019-01-24 04:43:40 UTC

(In reply to Raghavendra G from comment #4)
> From preliminary tests I see two reasons for this:
> 1. inode-invalidations triggered by md-cache
> 2. Fuse auto invalidations

Trying with kernel NFS, another distributed fs solution. I see that cache is retained at the end of the write test, and both read-after-write and read-after-read are served from the page cache.

In principle, if kNFS can do it, FUSE should be able to do it. I think :D.

Comment 6 Worker Ant 2019-01-29 03:15:45 UTC

REVIEW: https://review.gluster.org/22109 (mount/fuse: expose fuse-auto-invalidation as a mount option) posted (#1) for review on master by Raghavendra G

Comment 7 Raghavendra G 2019-01-30 05:41:39 UTC

(In reply to Manoj Pillai from comment #5)
> (In reply to Raghavendra G from comment #4)
> > From preliminary tests I see two reasons for this:
> > 1. inode-invalidations triggered by md-cache
> > 2. Fuse auto invalidations
> 
> Trying with kernel NFS, another distributed fs solution. I see that cache is
> retained at the end of the write test, and both read-after-write and
> read-after-read are served from the page cache.
> 
> In principle, if kNFS can do it, FUSE should be able to do it. I think :D.

kNFS and FUSE have different invalidation policies.

* kNFS provides close-to-open consistency. To quote from their FAQ [1]

"Linux implements close-to-open cache consistency by comparing the results of a GETATTR operation done just after the file is closed to the results of a GETATTR operation done when the file is next opened. If the results are the same, the client will assume its data cache is still valid; otherwise, the cache is purged."

For the workload used in this bz, the file is not changed between close and open. Hence the two stat values, fetched at close and at open, match and the page cache is retained.

* FUSE auto-invalidation compares the times in the cached stat with the values got from the underlying filesystem implementation in every codepath where a stat is fetched. This means the comparison happens in the lookup, (f)stat, (f)setattr etc. codepaths. Since (f)stat and lookup can happen asynchronously and concurrently with writes, they'll end up seeing a delta between the two stat values, resulting in a cache purge. Please note that the consistency offered by FUSE is stronger than close-to-open consistency, which means it also provides close-to-open consistency along with consistency in codepaths like lookup, fstat etc.

We have following options:

* disable auto-invalidation and use a custom-designed glusterfs invalidation policy. The invalidation policy can be the same as NFS close-to-open consistency or something stronger.
* check whether the current form of auto-invalidation (though stricter) provides any added benefits to close-to-open consistency which are useful. If no, change FUSE auto-invalidation to close-to-open consistency.

[1] http://nfs.sourceforge.net/#faq_a8
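
The difference between the two policies described above can be illustrated with a toy model. This is a simplified sketch, not kernel or glusterfs code: ToyServer, ToyClient and run are hypothetical names, mtime is a plain counter, and the page cache is just a page count.

```python
from dataclasses import dataclass


@dataclass
class ToyServer:
    """Stands in for the brick: every write bumps mtime."""
    mtime: int = 0

    def write(self) -> None:
        self.mtime += 1


@dataclass
class ToyClient:
    """Toy page cache governed by one of two invalidation policies."""
    policy: str           # "close-to-open" or "auto-invalidate"
    server: ToyServer
    cached_pages: int = 0
    known_mtime: int = 0  # mtime last seen by the validity check
    purges: int = 0

    def _purge(self) -> None:
        self.cached_pages = 0
        self.purges += 1

    def write(self) -> None:
        # The written page stays in the client cache; the server mtime moves.
        self.server.write()
        self.cached_pages += 1

    def stat(self) -> None:
        # lookup/(f)stat: only auto-invalidate compares mtimes here.
        if self.policy == "auto-invalidate":
            if self.server.mtime != self.known_mtime:
                self._purge()
            self.known_mtime = self.server.mtime

    def close(self) -> None:
        # Close-to-open remembers the attributes fetched right after close.
        if self.policy == "close-to-open":
            self.known_mtime = self.server.mtime

    def open(self) -> None:
        if self.policy == "close-to-open":
            if self.server.mtime != self.known_mtime:
                self._purge()
            self.known_mtime = self.server.mtime
        else:
            self.stat()  # open implies a lookup, which refreshes attributes


def run(policy: str, writes: int = 8, stat_every: int = 3) -> ToyClient:
    """Write test with interleaved stats, then reopen the file for reading."""
    client = ToyClient(policy=policy, server=ToyServer())
    client.open()
    for i in range(1, writes + 1):
        client.write()
        if i % stat_every == 0:
            client.stat()  # a lookup/fstat landing in the middle of the writes
    client.close()
    client.open()  # the read test opens the file again
    return client


if __name__ == "__main__":
    for policy in ("close-to-open", "auto-invalidate"):
        c = run(policy)
        print(f"{policy}: cached_pages={c.cached_pages} purges={c.purges}")
```

In this model the close-to-open client still holds all 8 written pages when the file is reopened, while the auto-invalidate client has purged its cache at each stat and again at the reopen, so the read has nothing cached to use.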

Comment 8 Raghavendra G 2019-01-30 05:45:23 UTC

Miklos,

It would be helpful if you can comment on comment #7.

regards,
Raghavendra

Comment 9 Raghavendra G 2019-01-30 05:59:06 UTC

Note that a lease based invalidation policy would be a complete solution, but it will take some time to implement that and get it working in Glusterfs.

Comment 10 Worker Ant 2019-02-02 03:08:22 UTC

REVIEW: https://review.gluster.org/22109 (mount/fuse: expose auto-invalidation as a mount option) merged (#13) on master by Amar Tumballi

Comment 11 Miklos Szeredi 2019-02-04 09:53:18 UTC

The underlying problem is that auto invalidate cannot differentiate local and remote modification based on mtime alone.

What NFS apparently does is refresh attributes immediately after a write (not sure how often it does this, I guess not after each individual write).

FUSE should maybe do this if auto invalidation is enabled, but if the filesystem can do its own invalidation, possibly based on better information than c/mtime, then that seems to be a better option.
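
The point about mtime not distinguishing local from remote modification can be shown with a toy model. This is only a sketch under simplifying assumptions: ToyAttrCache and demo are hypothetical names, the server is a dict holding an mtime counter, and the attribute refresh after a write is modelled as atomic with the write.

```python
class ToyAttrCache:
    """Toy client cache that validates its pages by comparing mtimes."""

    def __init__(self, refresh_after_write: bool) -> None:
        self.refresh_after_write = refresh_after_write
        self.known_mtime = 0
        self.pages = 0

    def local_write(self, server: dict) -> None:
        server["mtime"] += 1
        self.pages += 1
        if self.refresh_after_write:
            # Attribute the mtime change to our own write.
            self.known_mtime = server["mtime"]

    def revalidate(self, server: dict) -> bool:
        """Compare mtimes as auto invalidation would; return True if purged."""
        purged = server["mtime"] != self.known_mtime
        if purged:
            self.pages = 0
        self.known_mtime = server["mtime"]
        return purged


def demo(refresh_after_write: bool, remote_write: bool) -> int:
    """Return how many pages survive a revalidation after four local writes."""
    server = {"mtime": 0}
    cache = ToyAttrCache(refresh_after_write)
    for _ in range(4):
        cache.local_write(server)
    if remote_write:
        server["mtime"] += 1  # another client modifies the file
    cache.revalidate(server)
    return cache.pages


if __name__ == "__main__":
    for refresh in (False, True):
        for remote in (False, True):
            print(f"refresh={refresh} remote={remote} -> "
                  f"pages={demo(refresh, remote)}")
```

Without the refresh, the client's own writes look like a remote change and the cache is purged. With the refresh, the cache survives local writes and is still purged when another client modifies the file. A real implementation would also have to deal with a remote write landing between the local write and the attribute refresh, which this model ignores.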

Comment 12 Worker Ant 2019-02-08 12:14:58 UTC

REVIEW: https://review.gluster.org/22178 (mount/fuse: fix bug related to --auto-invalidation in mount script) posted (#1) for review on master by Raghavendra G

Comment 13 Worker Ant 2019-02-09 18:41:54 UTC

REVIEW: https://review.gluster.org/22178 (mount/fuse: fix bug related to --auto-invalidation in mount script) merged (#2) on master by Raghavendra G

Comment 14 Worker Ant 2019-02-11 11:16:18 UTC

REVIEW: https://review.gluster.org/22111 (performance/md-cache: introduce an option to control invalidation of inodes) merged (#18) on master by Raghavendra G

Comment 15 Worker Ant 2019-02-12 03:46:47 UTC

REVIEW: https://review.gluster.org/22193 (performance/md-cache: change the op-version of "global-cache-invalidation") posted (#1) for review on master by Raghavendra G

Comment 16 Worker Ant 2019-02-12 12:40:24 UTC

REVIEW: https://review.gluster.org/22193 (performance/md-cache: change the op-version of "global-cache-invalidation") merged (#2) on master by Amar Tumballi

Comment 17 Shyamsundar 2019-03-25 16:33:00 UTC

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-6.0, please open a new bug report.

glusterfs-6.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2019-March/000120.html
[2] https://www.gluster.org/pipermail/gluster-users/