Description of problem:
The open-behind xlator is turned on by default when creating a new volume. This appears to prevent read-ahead from working.

Version-Release number of selected component (if applicable):
release-3.4 branch.

How reproducible:

Steps to Reproduce:
1. Create a volume called vol4:

[root@bd-vm ~]# mkdir /test/vol4
[root@bd-vm ~]# gluster volume create vol4 bd-vm:/test/vol4 force
volume create: vol4: success: please start the volume to access data
[root@bd-vm ~]# gluster volume start vol4
volume start: vol4: success
[root@bd-vm ~]# gluster volume info vol4

Volume Name: vol4
Type: Distribute
Volume ID: 85af878b-0119-4f99-b01f-caf4577cb4d4
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: bd-vm:/test/vol4

2. Mount the volume:

[root@bd-vm ~]# mkdir /mnt4
[root@bd-vm ~]# mount -t glusterfs localhost:/vol4 /mnt4

3. Write a 4GB file (= RAM size):

[root@bd-vm fio]# dd if=/dev/zero of=/mnt4/4g bs=1M count=4K
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 23.0355 s, 186 MB/s

4. First read, with read-ahead-page-count = 1: throughput 99 MB/s.

[root@bd-vm ~]# gluster volume set vol4 performance.read-ahead-page-count 1
volume set: success
[root@bd-vm ~]# dd if=/mnt4/4g bs=1M of=/dev/null
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 43.0906 s, 99.7 MB/s

5. Second read, read-ahead-page-count = 16: throughput 107 MB/s, not much difference.

[root@bd-vm ~]# gluster volume set vol4 performance.read-ahead-page-count 16
volume set: success
[root@bd-vm fio]# dd if=/mnt4/4g bs=1M of=/dev/null
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 40.1117 s, 107 MB/s

6. Third read, read-ahead-page-count = 16, open-behind = off: throughput 269 MB/s.

[root@bd-vm ~]# gluster volume set vol4 performance.open-behind off
volume set: success
[root@bd-vm fio]# dd if=/mnt4/4g bs=1M of=/dev/null
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 15.982 s, 269 MB/s

Actual results:
read-ahead has no impact on sequential read and re-read.

Expected results:
read-ahead should improve sequential re-read.

Additional info:
I built gluster from git source as of Mar 25, 2014, branch release-3.4.
To assess priority, how many folks are using the open-behind volume option? The open-behind translator is an optimization for small-file workloads, correct? Has anyone measured the performance of open-behind on vs. off? Does it help?
@Poornima/Anuradha, can you take a look at this bug?

regards,
Raghavendra
I think the issue is because of open-behind using anonymous fds. See the following option in open-behind:

    { .key  = {"read-after-open"},
      .type = GF_OPTION_TYPE_BOOL,
      .default_value = "no",
      .description = "read is sent only after actual open happens and real "
                     "fd is obtained, instead of doing on anonymous fd "
                     "(similar to write)",
    },

The read-ahead cache is per-fd and stored in the context of the fd. If open-behind is using anonymous fds for doing reads, the read is never sent on the fd which read-ahead has seen (during the open from the application). So, there is no read-ahead cache.

Can you retry the tests by setting the option "read-after-open" in open-behind to "yes"?

[root@unused glusterfs]# gluster volume set dist-rep performance.read-after-open on
volume set: success
[root@unused glusterfs]# gluster volume info

Volume Name: dist-rep
Type: Distributed-Replicate
Volume ID: 201492ff-9eb8-48f9-a647-59b89853e3d3
Status: Created
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: booradley:/home/export-2/dist-rep1
Brick2: booradley:/home/export-2/dist-rep2
Brick3: booradley:/home/export-2/dist-rep3
Brick4: booradley:/home/export-2/dist-rep4
Options Reconfigured:
performance.read-after-open: on
performance.readdir-ahead: on
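[Editorial note: for readers who haven't looked at the xlator internals, here is a minimal hedged sketch of the per-fd-context mechanism described above. ra_file_t, attach_cache() and lookup_cache() are illustrative stand-ins for read-ahead's real structures and call sites; fd_ctx_set() and fd_ctx_get() are the actual libglusterfs per-fd-context calls.]

    #include <stdlib.h>
    #include "fd.h"          /* fd_t, fd_ctx_set, fd_ctx_get */
    #include "xlator.h"      /* xlator_t */

    /* Illustrative stand-in for read-ahead's per-fd cache structure. */
    typedef struct {
            size_t page_count;
    } ra_file_t;

    /* Open path: create the cache and attach it to the fd's context,
     * keyed by this xlator. */
    static int
    attach_cache (xlator_t *this, fd_t *fd)
    {
            ra_file_t *file = calloc (1, sizeof (*file));

            if (!file)
                    return -1;

            return fd_ctx_set (fd, this, (uint64_t) (unsigned long) file);
    }

    /* Read path: recover the cache from the fd.  An anonymous fd never
     * went through the open path, so fd_ctx_get() fails and the read
     * proceeds without any read-ahead cache. */
    static ra_file_t *
    lookup_cache (xlator_t *this, fd_t *fd)
    {
            uint64_t value = 0;

            if (fd_ctx_get (fd, this, &value) != 0)
                    return NULL;

            return (ra_file_t *) (unsigned long) value;
    }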
(In reply to Raghavendra G from comment #3)
> I think the issue is because of open-behind using anonymous fds. See the
> following option in open-behind:
>
>     { .key  = {"read-after-open"},
>       .type = GF_OPTION_TYPE_BOOL,
>       .default_value = "no",
>       .description = "read is sent only after actual open happens and real "
>                      "fd is obtained, instead of doing on anonymous fd "
>                      "(similar to write)",
>     },
>
> The read-ahead cache is per-fd and stored in the context of the fd. If
> open-behind is using anonymous fds for doing reads, the read is never sent
> on the fd which read-ahead has seen (during the open from the application).
> So, there is no read-ahead cache.

This RCA is not valid. During the read request the fd is stored in the frame's local, and in the response the cache is stored on the fd held in local. So, even though open-behind sends the read on an anonymous fd, read-ahead stores the cache in the fd passed to the application/kernel.

> Can you retry the tests by setting the option "read-after-open" in
> open-behind to "yes"?
>
> [root@unused glusterfs]# gluster volume set dist-rep performance.read-after-open on
> volume set: success
> [root@unused glusterfs]# gluster volume info
>
> Volume Name: dist-rep
> Type: Distributed-Replicate
> Volume ID: 201492ff-9eb8-48f9-a647-59b89853e3d3
> Status: Created
> Number of Bricks: 2 x 2 = 4
> Transport-type: tcp
> Bricks:
> Brick1: booradley:/home/export-2/dist-rep1
> Brick2: booradley:/home/export-2/dist-rep2
> Brick3: booradley:/home/export-2/dist-rep3
> Brick4: booradley:/home/export-2/dist-rep4
> Options Reconfigured:
> performance.read-after-open: on
> performance.readdir-ahead: on
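[Editorial note: to spell out the wind/unwind pattern this comment relies on, here is a hedged sketch. ra_local_t and the function names are illustrative, not the actual read-ahead source; the fop and callback prototypes follow the release-3.4 readv signatures.]

    #include <errno.h>
    #include <stdlib.h>
    #include "glusterfs.h"   /* core types (iovec, iatt, iobref, dict_t) */
    #include "xlator.h"      /* xlator_t, call_frame_t, STACK_WIND */

    /* Illustrative stand-in for read-ahead's per-call local state. */
    typedef struct {
            fd_t *fd;
    } ra_local_t;

    int sketch_readv_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                          int32_t op_ret, int32_t op_errno,
                          struct iovec *vector, int32_t count,
                          struct iatt *stbuf, struct iobref *iobref,
                          dict_t *xdata);

    /* Request path: remember the application's fd in frame->local
     * before winding the read down the graph. */
    int
    sketch_readv (call_frame_t *frame, xlator_t *this, fd_t *fd,
                  size_t size, off_t offset, uint32_t flags, dict_t *xdata)
    {
            ra_local_t *local = calloc (1, sizeof (*local));

            if (!local) {
                    STACK_UNWIND_STRICT (readv, frame, -1, ENOMEM,
                                         NULL, 0, NULL, NULL, NULL);
                    return 0;
            }

            local->fd    = fd;      /* the fd the application opened */
            frame->local = local;

            STACK_WIND (frame, sketch_readv_cbk,
                        FIRST_CHILD (this), FIRST_CHILD (this)->fops->readv,
                        fd, size, offset, flags, xdata);
            return 0;
    }

    /* Response path: the cache is populated against local->fd, the
     * application's fd, no matter which fd (possibly anonymous) the
     * read actually travelled on below this xlator. */
    int
    sketch_readv_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                      int32_t op_ret, int32_t op_errno, struct iovec *vector,
                      int32_t count, struct iatt *stbuf, struct iobref *iobref,
                      dict_t *xdata)
    {
            ra_local_t *local = frame->local;

            /* ... store read-ahead pages in local->fd's context ... */
            (void) local;

            STACK_UNWIND_STRICT (readv, frame, op_ret, op_errno,
                                 vector, count, stbuf, iobref, xdata);
            return 0;
    }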
(In reply to Raghavendra G from comment #4)
> (In reply to Raghavendra G from comment #3)
> > I think the issue is because of open-behind using anonymous fds. See the
> > following option in open-behind:
> >
> >     { .key  = {"read-after-open"},
> >       .type = GF_OPTION_TYPE_BOOL,
> >       .default_value = "no",
> >       .description = "read is sent only after actual open happens and real "
> >                      "fd is obtained, instead of doing on anonymous fd "
> >                      "(similar to write)",
> >     },
> >
> > The read-ahead cache is per-fd and stored in the context of the fd. If
> > open-behind is using anonymous fds for doing reads, the read is never sent
> > on the fd which read-ahead has seen (during the open from the application).
> > So, there is no read-ahead cache.
>
> This RCA is not valid. During the read request the fd is stored in the
> frame's local, and in the response the cache is stored on the fd held in
> local. So, even though open-behind sends the read on an anonymous fd,
> read-ahead stores the cache in the fd passed to the application/kernel.

Well, the core of the RCA - read-ahead is disabled because of open-behind using anonymous fds - is still valid :). What was wrong was the mechanism through which read-ahead is turned off.

In our current configuration read-ahead is loaded below open-behind. So, with "read-after-open" turned off, read-ahead never receives an open. Without an open, read-ahead doesn't create a context in the fd, which is where all the cache is stored.

There are two solutions to this problem (see the illustrative graph fragment below):

1. Load read-ahead as an ancestor of open-behind. This way read-ahead sees the open sent by the application before open-behind absorbs it.
2. Turn the "read-after-open" option on, so that open-behind performs a real open before reads are sent.
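[Editorial note: to make the ordering concrete, here is an illustrative, hand-written client-graph fragment. It is not glusterd output, and the other performance xlators that sit between these in a real graph are elided. In a volfile, a volume that names another via "subvolumes" sits above it in the graph, closer to the application.]

    volume vol4-client-0              # stub for the protocol/client xlator
        type protocol/client
        option remote-host bd-vm
        option remote-subvolume /test/vol4
    end-volume

    volume vol4-read-ahead
        type performance/read-ahead
        option page-count 16          # performance.read-ahead-page-count
        subvolumes vol4-client-0
    end-volume

    volume vol4-open-behind           # above read-ahead: sees open() first
        type performance/open-behind
        option read-after-open no     # default; reads go over anonymous fds
        subvolumes vol4-read-ahead
    end-volume

With this stacking, open-behind absorbs the application's open and read-ahead below it never creates an fd context. Solution 1 would move performance/read-ahead above performance/open-behind in such a graph; solution 2 is the performance.read-after-open setting already shown in comment #3.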
https://review.gluster.org/#/c/20511/
The above patch has been merged in mainline and will be in release-5.0.