Bug 1393419 - read-ahead not working if open-behind is turned on
Summary: read-ahead not working if open-behind is turned on
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: read-ahead
Version: mainline
Hardware: All
OS: All
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Milind Changire
QA Contact:
URL:
Whiteboard:
Depends On: 1084508
Blocks: 1426044
 
Reported: 2016-11-09 13:57 UTC by Mohammed Rafi KC
Modified: 2019-05-11 13:05 UTC (History)
CC List: 11 users

Fixed In Version:
Clone Of: 1084508
Clones: 1426044
Environment:
Last Closed: 2019-05-11 13:05:51 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
perf numbers with patch (1.13 KB, text/plain)
2017-07-26 13:02 UTC, Milind Changire
perf numbers without patch (931 bytes, text/plain)
2017-07-26 13:13 UTC, Milind Changire

Description Mohammed Rafi KC 2016-11-09 13:57:29 UTC
+++ This bug was initially created as a clone of Bug #1084508 +++

Description of problem:

open-behind xlator is turned on by default when creating a new volume. This appears to prevent read-ahead from working.

Version-Release number of selected component (if applicable):

release-3.4 branch.

How reproducible:


Steps to Reproduce:

1. create a volume called vol4
[root@bd-vm ~]# mkdir /test/vol4
[root@bd-vm ~]# gluster volume create vol4 bd-vm:/test/vol4 force
volume create: vol4: success: please start the volume to access data
[root@bd-vm ~]# gluster volume start vol4
volume start: vol4: success
[root@bd-vm ~]# gluster volume info vol4
 
Volume Name: vol4
Type: Distribute
Volume ID: 85af878b-0119-4f99-b01f-caf4577cb4d4
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: bd-vm:/test/vol4

2. mount the volume

[root@bd-vm ~]# mkdir /mnt4
[root@bd-vm ~]# mount -t glusterfs localhost:/vol4 /mnt4

3. write a 4GB file (= RAM size)
[root@bd-vm fio]# dd if=/dev/zero of=/mnt4/4g bs=1M count=4K
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 23.0355 s, 186 MB/s

4. First read, with read-ahead-page-count = 1: throughput 99 MB/s

[root@bd-vm ~]# gluster volume set vol4 performance.read-ahead-page-count 1
volume set: success
[root@bd-vm ~]# dd if=/mnt4/4g bs=1M of=/dev/null
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 43.0906 s, 99.7 MB/s

5. Second read, with read-ahead-page-count = 16: throughput 107 MB/s, not much difference

[root@bd-vm ~]# gluster volume set vol4 performance.read-ahead-page-count 16
volume set: success
[root@bd-vm fio]# dd if=/mnt4/4g bs=1M of=/dev/null
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 40.1117 s, 107 MB/s

6. Third read, with read-ahead-page-count = 16 and open-behind = off: throughput 269 MB/s

[root@bd-vm ~]# gluster volume set vol4 performance.open-behind off
volume set: success
[root@bd-vm fio]# dd if=/mnt4/4g bs=1M of=/dev/null
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 15.982 s, 269 MB/s


Actual results:

read-ahead has no impact on sequential read and re-read

Expected results:

read-ahead should improve sequential re-read

Additional info:

I built gluster from git source as of Mar 25, 2014, branch release-3.4.

--- Additional comment from Ben England on 2014-05-30 06:34:51 EDT ---

To assess priority, how many folks are using the open-behind volume option?  The open-behind translator is an optimization for small-file workloads, correct?  Has anyone measured performance with open-behind on vs. off?  Does it help?

--- Additional comment from Raghavendra G on 2016-01-20 02:49:43 EST ---

@Poornima/Anuradha,

Can you take a look at this bug?

regards,
Raghavendra

--- Additional comment from Raghavendra G on 2016-04-26 00:47:56 EDT ---

I think the issue is that open-behind uses anonymous fds. See the following option in open-behind:

        { .key  = {"read-after-open"},
          .type = GF_OPTION_TYPE_BOOL,
          .default_value = "no",
	  .description = "read is sent only after actual open happens and real "
          "fd is obtained, instead of doing on anonymous fd (similar to write)",
        },

The read-ahead cache is per-fd and stored in the fd's context. If open-behind uses anonymous fds for reads, the read is never sent on the fd that read-ahead saw during the application's open. So, there is no read-ahead cache.
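
To illustrate the idea, here is a minimal standalone C model (hypothetical names, not the actual read-ahead xlator code): the cache hangs off a context attached to the fd at open time, so a read arriving on a different, anonymous fd finds no context and bypasses the cache.

/* Toy model of a per-fd read-ahead context (hypothetical names, not the
 * real GlusterFS read-ahead code). The cache is keyed by the fd object
 * itself, so a read that arrives on an fd which never saw the open finds
 * no context. */
#include <stdio.h>
#include <stdlib.h>

typedef struct ra_ctx { size_t cached_bytes; } ra_ctx_t;
typedef struct fd { ra_ctx_t *ctx; } fd_t;

/* read-ahead sees the application's open and attaches a context */
static void ra_open(fd_t *fd) { fd->ctx = calloc(1, sizeof(ra_ctx_t)); }

/* reads are prefetched/cached only if the fd carries a context */
static void ra_readv(fd_t *fd, size_t size) {
    if (fd->ctx) {
        fd->ctx->cached_bytes += size;
        printf("cached read, %zu bytes prefetched so far\n",
               fd->ctx->cached_bytes);
    } else {
        printf("anonymous fd: no context, read passes through uncached\n");
    }
}

int main(void) {
    fd_t app_fd = { 0 }, anon_fd = { 0 };

    ra_open(&app_fd);            /* open wound through read-ahead */
    ra_readv(&app_fd, 131072);   /* cached */
    ra_readv(&anon_fd, 131072);  /* open-behind-style anonymous fd: uncached */

    free(app_fd.ctx);
    return 0;
}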

Can you retry the tests by setting option "read-after-open" in open-behind to "yes"?

[root@unused glusterfs]# gluster volume set dist-rep performance.read-after-open on
volume set: success
[root@unused glusterfs]# gluster volume info
 
Volume Name: dist-rep
Type: Distributed-Replicate
Volume ID: 201492ff-9eb8-48f9-a647-59b89853e3d3
Status: Created
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: booradley:/home/export-2/dist-rep1
Brick2: booradley:/home/export-2/dist-rep2
Brick3: booradley:/home/export-2/dist-rep3
Brick4: booradley:/home/export-2/dist-rep4
Options Reconfigured:
performance.read-after-open: on
performance.readdir-ahead: on

--- Additional comment from Raghavendra G on 2016-04-26 09:27:15 EDT ---

(In reply to Raghavendra G from comment #3)
> I think the issue is because of open-behind using anonymous-fd. See the
> following option in open-behind:
> 
>         { .key  = {"read-after-open"},
>           .type = GF_OPTION_TYPE_BOOL,
>           .default_value = "no",
> 	  .description = "read is sent only after actual open happens and real "
>           "fd is obtained, instead of doing on anonymous fd (similar to
> write)",
>         },
> 
> The read-ahead cache is per-fd and stored in the context of fd. If
> open-behind is using anonymous fds for doing reads, read is never sent on
> the fd which read-ahead has seen (during open from application). So, there
> is no read-ahead cache.

This RCA is not valid. The reason is that during the read request the fd is stored in local, and in the response path the cache is stored on the fd saved in local. So, even though open-behind sends the read on an anonymous fd, read-ahead stores the cache on the fd passed by the application/kernel.
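
As a toy sketch of this argument (hypothetical names, not the real code): the request path saves the incoming fd in a per-call "local", and the response path stores the cache on that saved fd, so the cache ends up on whichever fd the read request itself carried.

/* Toy model of the request/response pattern described above
 * (hypothetical names). */
#include <stdio.h>
#include <stddef.h>

typedef struct ra_ctx { size_t cached_bytes; } ra_ctx_t;
typedef struct fd    { ra_ctx_t *ctx; } fd_t;
typedef struct local { fd_t *fd; } local_t;   /* per-call state, like frame->local */

/* request path: remember which fd this read came in on */
static local_t ra_readv_request(fd_t *fd) {
    local_t l = { .fd = fd };
    return l;
}

/* response path: cache the returned data on the fd saved in local */
static void ra_readv_response(local_t *l, size_t size) {
    if (l->fd->ctx)
        l->fd->ctx->cached_bytes += size;
    printf("cache %s on the fd stored in local\n",
           l->fd->ctx ? "stored" : "skipped (no context)");
}

int main(void) {
    ra_ctx_t ctx = { 0 };
    fd_t app_fd = { .ctx = &ctx };

    local_t l = ra_readv_request(&app_fd);   /* request wound down */
    ra_readv_response(&l, 131072);           /* response unwound up */
    return 0;
}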

> 
> Can you retry the tests by setting option "read-after-open" in open-behind
> to "yes"?
> 
> [root@unused glusterfs]# gluster volume set dist-rep
> performance.read-after-open on
> volume set: success
> [root@unused glusterfs]# gluster volume info
>  
> Volume Name: dist-rep
> Type: Distributed-Replicate
> Volume ID: 201492ff-9eb8-48f9-a647-59b89853e3d3
> Status: Created
> Number of Bricks: 2 x 2 = 4
> Transport-type: tcp
> Bricks:
> Brick1: booradley:/home/export-2/dist-rep1
> Brick2: booradley:/home/export-2/dist-rep2
> Brick3: booradley:/home/export-2/dist-rep3
> Brick4: booradley:/home/export-2/dist-rep4
> Options Reconfigured:
> performance.read-after-open: on
> performance.readdir-ahead: on

--- Additional comment from Raghavendra G on 2016-04-26 14:27:44 EDT ---

(In reply to Raghavendra G from comment #4)
> (In reply to Raghavendra G from comment #3)
> > I think the issue is because of open-behind using anonymous-fd. See the
> > following option in open-behind:
> > 
> >         { .key  = {"read-after-open"},
> >           .type = GF_OPTION_TYPE_BOOL,
> >           .default_value = "no",
> > 	  .description = "read is sent only after actual open happens and real "
> >           "fd is obtained, instead of doing on anonymous fd (similar to
> > write)",
> >         },
> > 
> > The read-ahead cache is per-fd and stored in the context of fd. If
> > open-behind is using anonymous fds for doing reads, read is never sent on
> > the fd which read-ahead has seen (during open from application). So, there
> > is no read-ahead cache.
> 
> This RCA is not valid. The reason is during read-request, fd is stored in
> local and in response cache is stored on the fd stored in local. So, even
> though open-behind sends read on anonymous fd, read-ahead stores the cache
> in the fd passed to application/kernel.

Well, the core of the RCA - read-ahead is disabled because of open-behind using anonymous fds - is still valid :). What was wrong was the mechanism through which read-ahead gets turned off. In our current configuration read-ahead is loaded below open-behind. So, with "read-after-open" turned off, read-ahead never receives an open. Without an open, read-ahead doesn't create a context in the fd, which is where all the cache is stored.

There are two solutions to this problem (see the sketch after this list):
1. Load read-ahead as one of the ancestors of open-behind. This way read-ahead witnesses the open sent by the application before open-behind stops it.
2. Turn the "read-after-open" option on, so that open-behind actually performs the open.
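
To illustrate why either solution works, here is a toy C model (hypothetical names, not the actual xlator stack): with read-ahead below open-behind, an open that open-behind short-circuits never reaches read-ahead, so no per-fd context is created; either moving read-ahead above open-behind or forcing the real open restores the context.

/* Toy model of xlator ordering (hypothetical names). */
#include <stdio.h>
#include <stdbool.h>

typedef struct fd { bool ra_ctx_created; } fd_t;

static void ra_open(fd_t *fd) { fd->ra_ctx_created = true; }

/* open-behind either winds the real open down the stack or fakes it */
static void open_behind_open(fd_t *fd, bool wind_down, void (*next)(fd_t *)) {
    if (wind_down && next)
        next(fd);   /* real open continues to the xlator below */
    /* else: open is faked; nothing below ever sees it */
}

int main(void) {
    fd_t a = { false }, b = { false }, c = { false };

    /* current order: read-ahead below open-behind, open faked */
    open_behind_open(&a, false, ra_open);
    printf("read-ahead below, open faked  -> ctx created: %d\n", a.ra_ctx_created);

    /* solution 2: read-after-open=yes, the real open is wound down */
    open_behind_open(&b, true, ra_open);
    printf("read-ahead below, real open   -> ctx created: %d\n", b.ra_ctx_created);

    /* solution 1: read-ahead loaded above open-behind, sees the open first */
    ra_open(&c);
    open_behind_open(&c, false, NULL);
    printf("read-ahead above open-behind  -> ctx created: %d\n", c.ra_ctx_created);
    return 0;
}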

Comment 1 Worker Ant 2016-11-09 14:05:37 UTC
REVIEW: http://review.gluster.org/15811 (glusterd/volgen: Changing the order of read-ahead xlator) posted (#1) for review on master by mohammed rafi  kc (rkavunga)

Comment 2 Prashanth Pai 2016-11-22 12:56:29 UTC
I don't know if this is relevant but just a heads up - I remember the option "root-squash" was dependent on or tied to the option "read-after-open". If you plan to change "read-after-open", please do take a look at the code that handles "server.root-squash" too.

Comment 3 Worker Ant 2017-07-24 09:31:40 UTC
REVIEW: https://review.gluster.org/15811 (perf/read-ahead: read-ahead before open-behind) posted (#2) for review on master by Milind Changire (mchangir)

Comment 4 Worker Ant 2017-07-25 04:47:56 UTC
REVIEW: https://review.gluster.org/15811 (perf/read-ahead: read-ahead before open-behind) posted (#3) for review on master by Milind Changire (mchangir)

Comment 5 Milind Changire 2017-07-25 11:26:21 UTC
here's output of a single run as per test case in comment #0
(without dropping kernel caches):
(all nodes are virtual machines on a single laptop with SSD)

16:17:46 [root@f25node1 ~] # gluster volume info dist-rep
 
Volume Name: dist-rep
Type: Distributed-Replicate
Volume ID: 54b4340d-cb1f-4742-a0ca-3d3ac499d84c
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: f25node1:/glustervols/dist-rep/brick
Brick2: f25node2:/glustervols/dist-rep/brick
Brick3: f25node3:/glustervols/dist-rep/brick
Brick4: f25node4:/glustervols/dist-rep/brick
Brick5: f25node5:/glustervols/dist-rep/brick
Brick6: f25node6:/glustervols/dist-rep/brick
Options Reconfigured:
performance.open-behind: off
performance.read-ahead-page-count: 16
transport.address-family: inet
nfs.disable: on
cluster.brick-multiplex: off

15:11:17 [root@f25client1 ~] # mount -t glusterfs f25node1:/dist-rep /mnt/dist-rep

performance.read-ahead-page-count       4                                      

15:13:46 [root@f25client1 ~] # dd if=/dev/zero of=/mnt/dist-rep/1g bs=1M count=1K
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 263.88 s, 4.1 MB/s

performance.read-ahead-page-count       1

16:07:16 [root@f25client1 ~] # dd if=/mnt/dist-rep/1g bs=1M of=/dev/null
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 10.1683 s, 106 MB/s

performance.read-ahead-page-count       16

16:14:31 [root@f25client1 ~] # dd if=/mnt/dist-rep/1g bs=1M of=/dev/null
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.162729 s, 6.6 GB/s

performance.open-behind                 off
performance.read-ahead-page-count       16

16:18:38 [root@f25client1 ~] # dd if=/mnt/dist-rep/1g bs=1M of=/dev/null
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 6.43123 s, 167 MB/s

Comment 6 Raghavendra G 2017-07-25 17:01:01 UTC
(In reply to Milind Changire from comment #5)
> here's output of a single run as per test case in comment #0
> (without dropping kernel caches):

It's better to drop caches. If we are doing a fresh mount for each test, I guess it's ok.

> performance.open-behind: off

Was open-behind off in all runs of dd?

Comment 7 Milind Changire 2017-07-26 03:33:59 UTC
(In reply to Raghavendra G from comment #6)
> (In reply to Milind Changire from comment #5)
> > here's output of a single run as per test case in comment #0
> > (without dropping kernel caches):
> 
> It's better to drop caches. If we are doing a fresh mount for each test, I
> guess it's ok.

I'll do another round of tests, doing a fresh mount before each run of "dd".

> 
> > performance.open-behind: off
> 
> Was open-behind off in all runs of dd?

open-behind was turned "off" only for the last run of "dd"

Comment 8 Milind Changire 2017-07-26 04:26:15 UTC
Test re-run by doing fresh mount before every "dd" run:

09:05:47 [root@f25client1 ~] # mount -t glusterfs f25node1:/dist-rep /mnt/dist-rep
09:05:52 [root@f25client1 ~] # dd if=/dev/zero of=/mnt/dist-rep/1g bs=1M count=1K
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 77.2741 s, 13.9 MB/s

09:09:04 [root@f25client1 ~] # umount /mnt/dist-rep 

09:06:12 [root@f25node1 ~] # gluster volume set dist-rep performance.read-ahead-page-count 1
volume set: success
09:06:39 [root@f25node1 ~] # gluster volume set dist-rep performance.open-behind on
volume set: success

09:11:12 [root@f25client1 ~] # mount -t glusterfs f25node1:/dist-rep /mnt/dist-rep
09:11:25 [root@f25client1 ~] # dd if=/mnt/dist-rep/1g bs=1M of=/dev/null
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 8.16453 s, 132 MB/s

09:12:07 [root@f25client1 ~] # umount /mnt/dist-rep 

09:06:54 [root@f25node1 ~] # gluster volume set dist-rep performance.read-ahead-page-count 16

09:13:40 [root@f25client1 ~] # mount -t glusterfs f25node1:/dist-rep /mnt/dist-rep
09:13:44 [root@f25client1 ~] # dd if=/mnt/dist-rep/1g bs=1M of=/dev/null
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 20.1027 s, 53.4 MB/s

09:13:21 [root@f25node1 ~] # gluster volume set dist-rep performance.open-behind off
volume set: success

09:16:32 [root@f25client1 ~] # umount /mnt/dist-rep 

09:40:07 [root@f25client1 ~] # mount -t glusterfs f25node1:/dist-rep /mnt/dist-rep
09:40:31 [root@f25client1 ~] # dd if=/mnt/dist-rep/1g bs=1M of=/dev/null
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 10.2474 s, 105 MB/s

Comment 9 Raghavendra G 2017-07-26 05:26:36 UTC
@Milind,

I am curious to know whether you see the same behavior by changing the read-ahead page-count without applying the fix. Can you try that?

regards,
Raghavendra

Comment 10 Milind Changire 2017-07-26 08:34:25 UTC
(In reply to Raghavendra G from comment #9)
> @Milind,
> 
> I am curious to know whether you see the same behavior by changing the
> read-ahead page-count without applying the fix. Can you try that?
> 
> regards,
> Raghavendra

Following is the test run *without* the patch:

13:50:23 [root@f25client1 ~] # dd if=/dev/zero of=/mnt/dist-rep/1g bs=1M count=1K
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 101.623 s, 10.6 MB/s

13:53:11 [root@f25client1 ~] # umount /mnt/dist-rep 

13:54:30 [root@f25client1 ~] # mount -t glusterfs f25node1:/dist-rep /mnt/dist-rep

13:54:36 [root@f25client1 ~] # dd if=/mnt/dist-rep/1g bs=1M of=/dev/null
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 8.44487 s, 127 MB/s

13:50:40 [root@f25node1 ~] # gluster volume set dist-rep performance.read-ahead-page-count 16

13:56:10 [root@f25client1 ~] # umount /mnt/dist-rep 
13:56:35 [root@f25client1 ~] # mount -t glusterfs f25node1:/dist-rep /mnt/dist-rep
13:56:59 [root@f25client1 ~] # dd if=/mnt/dist-rep/1g bs=1M of=/dev/null
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 5.23783 s, 205 MB/s

13:56:52 [root@f25node1 ~] # gluster volume set dist-rep performance.open-behind off

13:57:34 [root@f25client1 ~] # umount /mnt/dist-rep 
13:57:57 [root@f25client1 ~] # mount -t glusterfs f25node1:/dist-rep /mnt/dist-rep
13:58:00 [root@f25client1 ~] # dd if=/mnt/dist-rep/1g bs=1M of=/dev/null
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 5.52333 s, 194 MB/s

Comment 11 Milind Changire 2017-07-26 13:02:53 UTC
Created attachment 1304787 [details]
perf numbers with patch

Comment 12 Milind Changire 2017-07-26 13:13:37 UTC
Created attachment 1304790 [details]
perf numbers without patch

Comment 13 Raghavendra G 2017-07-27 04:15:40 UTC
We can see that performance is affected by the read-ahead page-count in both scenarios - with and without the fix. Milind will run the same tests with the client log-level set to TRACE and attach the logs to the bz. Hence setting needinfo on him.

Once the logs are attached, csaba has agreed to take ownership of the bz. Hence assigning the bz to csaba.

Comment 14 Raghavendra G 2017-07-27 04:18:08 UTC
(In reply to Raghavendra G from comment #13)
> We can see that performance gets affected by the number of read-ahead page
> count in both scenarios - with/without fix.

1. If read-ahead is indeed turned off, tweaking the page-count should have no effect on performance.
2. When read-ahead is enabled, performance with page-count 1 seems to be better than with a higher page-count.

Both of the above behaviors need to be explained.

Comment 15 Milind Changire 2017-07-27 04:31:19 UTC
(In reply to Raghavendra G from comment #13)
> We can see that performance gets affected by the number of read-ahead page
> count in both scenarios - with/without fix. Milind will do the same tests by
> setting client log-level to TRACE and attach logs to the bz. Hence setting
> needinfo on him.
> 
> Once logs are attached, csaba has agreed to take the ownership of bz. Hence
> assigning the bz to csaba.

Perf numbers for the TRACE-level logging test run are available as attachments.

TRACE level logs are available at:
1. https://nofile.io/f/is0XjWnaLFK/bz-1393419-test-run-without-patch-client-TRACE.log.gz
2. https://nofile.io/f/Oh3lkzaxMSe/bz-1393419-test-run-with-patch-client-TRACE.log.gz

Comment 16 Worker Ant 2017-09-07 10:02:15 UTC
REVIEW: https://review.gluster.org/15811 (perf/read-ahead: read-ahead before open-behind) posted (#4) for review on master by mohammed rafi  kc (rkavunga)

Comment 17 Worker Ant 2017-09-07 10:13:34 UTC
REVIEW: https://review.gluster.org/15811 (perf/read-ahead: read-ahead before open-behind) posted (#5) for review on master by mohammed rafi  kc (rkavunga)

Comment 18 Raghavendra G 2018-06-11 04:38:29 UTC
(In reply to Milind Changire from comment #15)
> (In reply to Raghavendra G from comment #13)
> > We can see that performance gets affected by the number of read-ahead page
> > count in both scenarios - with/without fix. 

Without the fix, there is some percentage of opens that open-behind actually winds down, and hence read-ahead will be working for those fds. I guess the variance due to changing the page-count can be attributed to this.

Also note that, as found in bz 1214489, reads are interspersed with getattrs, so the cache is thrown away. With a large page-count more pages are wasted, and hence performance degrades as the page-count grows.
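
As a rough toy calculation (assuming a 128 KiB read-ahead page size; the real value may differ), the amount of prefetched data discarded per cache flush scales directly with the page-count:

/* Toy arithmetic for the claim above: if a getattr between reads flushes
 * the read-ahead cache, each flush can discard up to page_count pages of
 * prefetched data (128 KiB page size is an assumption). */
#include <stdio.h>

int main(void) {
    const long page_size = 128 * 1024;   /* assumption */
    const long flushes   = 1000;         /* getattrs interspersed with reads */

    for (int page_count = 1; page_count <= 16; page_count *= 4) {
        long wasted = (long)page_count * page_size * flushes;
        printf("page-count %2d: up to %ld MiB of prefetched data discarded\n",
               page_count, wasted / (1024 * 1024));
    }
    return 0;
}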

I verified from the attached logs that patch https://review.gluster.org/15811 makes sure that all opens are witnessed by read-ahead, and hence it is a fix for the current bug.

To summarize, patch #15811 is a necessary fix and solves the problem at hand. Problems with read-ahead performance are tracked by other bzs like bz 1214489.

> > Milind will do the same tests by
> > setting client log-level to TRACE and attach logs to the bz. Hence setting
> > needinfo on him.
> > 
> > Once logs are attached, csaba has agreed to take the ownership of bz. Hence
> > assigning the bz to csaba.
> 
> Perf numbers for the TRACEL level logging test run are available as
> attachments.
> 
> TRACE level logs are available at:
> 1.
> https://nofile.io/f/is0XjWnaLFK/bz-1393419-test-run-without-patch-client-
> TRACE.log.gz
> 2.
> https://nofile.io/f/Oh3lkzaxMSe/bz-1393419-test-run-with-patch-client-TRACE.
> log.gz

Comment 19 Worker Ant 2018-06-11 05:54:30 UTC
REVIEW: https://review.gluster.org/15811 (perf/read-ahead: read-ahead before open-behind) posted (#6) for review on master by Milind Changire

Comment 20 Yaniv Kaul 2019-04-24 09:16:26 UTC
Status?

Comment 21 Milind Changire 2019-04-24 09:19:35 UTC
Raghavendra G should be able to answer this aptly.
Redirecting needinfo to him.

Comment 22 Raghavendra G 2019-05-11 13:05:51 UTC
This is fixed by https://review.gluster.org/r/Ifa52d8ff017f115e83247f3396b9d27f0295ce3f

