Created attachment 1024653 [details]
Example of gluster reading data from a file more than once.

Description of problem:

While trying to get the SAS Grid Calibration workload to run successfully in a Gluster environment, I noticed we were doing an enormous amount of read I/O compared to other shared-file-system environments. The tools we use to capture I/O rates and aggregate I/O statistics show we are reading 2-3 times more data than we should.

This was verified by running a simple test that drops the page cache on a client and reads a file from the Gluster volume. Monitoring network data rates and page cache utilization, we were able to show more than 2x the expected network read traffic. Additionally, the way the page cache was being filled implies we are reading the same blocks on the Gluster server more than once.

To get the SAS Calibration workload to run without errors, we have had to turn off virtually all of the Gluster translators; only the read-ahead translator is on.

Using the attached test case, which shows the problem in its simplest form, I learned that the problem stems from having the read-ahead translator on and the stat-prefetch translator off. If both are on, or both are off, everything works fine.

The interesting part of the attachment is the cache and ETHiA (accumulated) values. They should be virtually the same, but ETHiA is much higher. The dstat tool confirms this.

Version-Release number of selected component (if applicable):

RHEL 6.6 -504 kernel on clients and servers
Gluster version glusterfs-3.6.0.53-1.el6rhs.x86_64

How reproducible:
Every time.

Steps to Reproduce:
1. See attachment.

Actual results:

Network reads were 13GB. The cache and ETHiA (accumulated) values in the attachment should be virtually the same. The dstat tool confirms this behavior as well.

Expected results:

Network reads should have been around 5.3GB.

Additional info:

Gluster config ...
Volume Name: vol0
Type: Distributed-Replicate
Volume ID: f7f43d39-0607-4e27-8917-83f3ecb4709d
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gprfs025-b-10ge:/brick/vol0
Brick2: gprfs026-b-10ge:/brick/vol0
Brick3: gprfs027-b-10ge:/brick/vol0
Brick4: gprfs028-b-10ge:/brick/vol0
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.readdir-ahead: off
server.allow-insecure: on
performance.io-cache: off
performance.stat-prefetch: off
performance.quick-read: off
performance.read-ahead: on
performance.write-behind: off
cluster.read-hash-mode: 1
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256

Clients mount with gid-timeout=2,entry-timeout=0 as well.
Thank you for your report. This bug is filed against a component for which no further development is being undertaken. If you find the bug relevant in higher versions as well, please feel free to reopen it.
(In reply to Barry Marson from comment #0)
> Using the attached test case, which shows the problem in its simplest form,
> I learned that the problem stems from having the read-ahead translator on
> and the stat-prefetch translator off. If both are on, or both are off,
> everything works fine.

I have a hypothesis that explains this behaviour. read-ahead flushes its cache on (f)stat; the only reason I can think of for this behaviour is atime updates. With stat-prefetch turned on, stats are absorbed by stat-prefetch and read-ahead never sees them, so its cache is not flushed.
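The hypothesis above can be illustrated with a toy model (illustrative Python only, not GlusterFS source; all class and variable names here are made up): a read-ahead layer that prefetches the next block on every read and discards its cache whenever it sees an (f)stat. Interleaving stats with sequential reads forces every prefetched block to be fetched from the backend twice, roughly doubling network reads, which matches the 13GB-vs-5.3GB observation.

```python
# Toy model of a read-ahead translator that flushes its cache on (f)stat.
# Illustrative only; this is not GlusterFS code.

class ReadAhead:
    def __init__(self):
        self.cache = {}          # block number -> data
        self.backend_reads = 0   # counts reads that hit the "network"

    def _fetch(self, block):
        self.backend_reads += 1
        return f"data-{block}"

    def read(self, block):
        data = self.cache.pop(block, None)
        if data is None:
            data = self._fetch(block)
        # prefetch the next block into the cache
        self.cache[block + 1] = self._fetch(block + 1)
        return data

    def fstat(self):
        # read-ahead discards cached data on (f)stat,
        # presumably to keep attributes such as atime coherent
        self.cache.clear()

# Sequential read of 100 blocks with no interleaved stats:
ra_seq = ReadAhead()
for b in range(100):
    ra_seq.read(b)
print(ra_seq.backend_reads)   # 101: every prefetched block is consumed

# Same workload, but an fstat arrives between consecutive reads:
ra = ReadAhead()
for b in range(100):
    ra.read(b)
    ra.fstat()
print(ra.backend_reads)       # 200: every prefetched block is thrown away
```

With stats interleaved, each block is fetched once as a prefetch (then discarded by the flush) and once again for the application's actual read.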
(In reply to Raghavendra G from comment #5)
> I have a hypothesis that explains this behaviour. read-ahead flushes its
> cache on (f)stat; the only reason I can think of for this behaviour is
> atime updates. With stat-prefetch turned on, stats are absorbed by
> stat-prefetch and read-ahead never sees them, so its cache is not flushed.

To (dis)prove the hypothesis, we need to either:

1. Capture the fops seen by read-ahead while the test (cat) is being run. This can be done by loading a debug/trace xlator on top of read-ahead. It is also useful to see which fops go out of read-ahead, so loading another instance of debug/trace as a child of read-ahead will help.
2. Capture a fusedump.

I think 1 is sufficient to debug the issue.
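The trace-sandwich idea in step 1 can be mimicked in a toy model (illustrative Python only; real debug/trace xlators are loaded via the volfile, not like this): one trace layer records the fops entering read-ahead, a second records the fops it forwards, so a flush-induced re-read shows up as the same block being fetched from the backend twice.

```python
# Toy model of sandwiching a caching layer between two trace layers, the
# way debug/trace would be loaded above and below read-ahead.
# Illustrative only; not Gluster code.

class Trace:
    def __init__(self, name, child, log):
        self.name, self.child, self.log = name, child, log

    def read(self, block):
        self.log.append((self.name, "read", block))
        return self.child.read(block)

    def fstat(self):
        self.log.append((self.name, "fstat"))
        self.child.fstat()

class Backend:
    def read(self, block):
        return f"data-{block}"
    def fstat(self):
        pass

class ReadAhead:
    """Prefetches the next block; flushes its cache on fstat."""
    def __init__(self, child):
        self.child, self.cache = child, {}

    def read(self, block):
        data = self.cache.pop(block, None)
        if data is None:
            data = self.child.read(block)
        self.cache[block + 1] = self.child.read(block + 1)  # prefetch
        return data

    def fstat(self):
        self.cache.clear()
        self.child.fstat()

log = []
below = Trace("below-ra", Backend(), log)
stack = Trace("above-ra", ReadAhead(below), log)

stack.read(0)   # one read in, two reads out (read + prefetch of block 1)
stack.fstat()   # flushes the prefetched block
stack.read(1)   # prefetched data is gone: block 1 is fetched again

for entry in log:
    print(entry)
```

The "below-ra" entries show block 1 being read from the backend twice, which is exactly the signature to look for in a real trace.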
Looks to be a duplicate of bz 1214489
(In reply to Raghavendra G from comment #6)
> To (dis)prove the hypothesis, we need to either:
> 1. Capture the fops seen by read-ahead while the test (cat) is being run.
> 2. Capture a fusedump.
>
> I think 1 is sufficient to debug the issue.

From the client logs attached to bz 1393419, I can see that reads from the kernel are interspersed with attr calls. These fstat calls flush the read-ahead cache, so data is read more than once: once by read-ahead, and again when the application actually issues the read. This explains the extra data read over the network.

From the same logs, the read-ahead logic also looks a bit aggressive, which makes the problem more prominent. Had there been no fstat calls from the kernel, the prefetched data would eventually have been consumed as cache hits and this would not have been a problem.

Even with stat-prefetch/md-cache turned on we can hit this bug, since the default timeout for md-cache is 1s and there is a very good chance the cache has timed out by the time it actually sees an fstat. So, if we are using md-cache we need to turn on the "group metadata-cache" profile, which sets larger timeouts on md-cache and uses upcall to handle cache-coherency issues.
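The md-cache timeout caveat can be added to the same kind of toy model (again illustrative Python, not the real translator; names and the timeout mechanics are simplified assumptions): an md-cache layer absorbs fstat calls only while its cached attributes are fresher than its timeout, so a large timeout shields read-ahead from almost all flushes, while a timeout shorter than the stat interval shields it from none.

```python
# Toy model: md-cache (stat-prefetch) in front of a read-ahead cache that
# flushes on fstat. Illustrative only; not Gluster code.

class ReadAhead:
    def __init__(self):
        self.cache = {}
        self.backend_reads = 0

    def _fetch(self, block):
        self.backend_reads += 1
        return f"data-{block}"

    def read(self, block):
        data = self.cache.pop(block, None)
        if data is None:
            data = self._fetch(block)
        self.cache[block + 1] = self._fetch(block + 1)  # prefetch
        return data

    def fstat(self):
        self.cache.clear()   # flush on (f)stat

class MdCache:
    """Absorbs fstat calls while its cached attributes are still fresh."""
    def __init__(self, child, timeout):
        self.child = child
        self.timeout = timeout
        self.attr_time = None   # when attributes were last cached

    def read(self, block):
        return self.child.read(block)

    def fstat(self, now):
        if self.attr_time is None or now - self.attr_time > self.timeout:
            self.child.fstat()      # cache expired: fstat reaches read-ahead
            self.attr_time = now
        # otherwise the stat is served from md-cache and absorbed

# Large timeout (the "group metadata-cache" idea): stats mostly absorbed.
ra = ReadAhead()
md = MdCache(ra, timeout=600)
for t, b in enumerate(range(100)):
    md.read(b)
    md.fstat(now=t)
print(ra.backend_reads)   # near the ideal ~101

# Timeout shorter than the stat interval: every fstat reaches read-ahead.
ra2 = ReadAhead()
md2 = MdCache(ra2, timeout=0.5)
for t, b in enumerate(range(100)):
    md2.read(b)
    md2.fstat(now=t)
print(ra2.backend_reads)  # back to ~2x the data
```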
An upstream bug has been opened for this (1595088). Any further work on it will be taken up upstream.