+++ This bug was initially created as a clone of Bug #1084508 +++

Description of problem:
The open-behind xlator is turned on by default when creating a new volume. This appears to prevent read-ahead from working.

Version-Release number of selected component (if applicable):
release-3.4 branch.

How reproducible:

Steps to Reproduce:
1. create a volume called vol4

[root@bd-vm ~]# mkdir /test/vol4
[root@bd-vm ~]# gluster volume create vol4 bd-vm:/test/vol4 force
volume create: vol4: success: please start the volume to access data
[root@bd-vm ~]# gluster volume start vol4
volume start: vol4: success
[root@bd-vm ~]# gluster volume info vol4

Volume Name: vol4
Type: Distribute
Volume ID: 85af878b-0119-4f99-b01f-caf4577cb4d4
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: bd-vm:/test/vol4

2. mount the volume

[root@bd-vm ~]# mkdir /mnt4
[root@bd-vm ~]# mount -t glusterfs localhost:/vol4 /mnt4

3. write a 4GB file (= RAM size)

[root@bd-vm fio]# dd if=/dev/zero of=/mnt4/4g bs=1M count=4K
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 23.0355 s, 186 MB/s

4. first read, with read-ahead-page-count = 1: throughput 99.7 MB/s

[root@bd-vm ~]# gluster volume set vol4 performance.read-ahead-page-count 1
volume set: success
[root@bd-vm ~]# dd if=/mnt4/4g bs=1M of=/dev/null
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 43.0906 s, 99.7 MB/s

5. second read, with read-ahead-page-count = 16: throughput 107 MB/s, not much difference

[root@bd-vm ~]# gluster volume set vol4 performance.read-ahead-page-count 16
volume set: success
[root@bd-vm fio]# dd if=/mnt4/4g bs=1M of=/dev/null
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 40.1117 s, 107 MB/s

6. third read, with read-ahead-page-count = 16 and open-behind = off: throughput 269 MB/s

[root@bd-vm ~]# gluster volume set vol4 performance.open-behind off
volume set: success
[root@bd-vm fio]# dd if=/mnt4/4g bs=1M of=/dev/null
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 15.982 s, 269 MB/s

Actual results:
read-ahead has no impact on sequential read and re-read.

Expected results:
read-ahead should improve sequential re-read.

Additional info:
I built gluster from git source as of Mar 25, 2014, branch release-3.4.

--- Additional comment from Ben England on 2014-05-30 06:34:51 EDT ---

To assess priority: how many folks are using the open-behind volume option? The open-behind translator is an optimization for small-file workloads, correct? Has anyone measured performance with open-behind on vs. off? Does it help?

--- Additional comment from Raghavendra G on 2016-01-20 02:49:43 EST ---

@Poornima/Anuradha,

Can you take a look at this bug?

regards,
Raghavendra

--- Additional comment from Raghavendra G on 2016-04-26 00:47:56 EDT ---

I think the issue is because of open-behind using anonymous fds. See the following option in open-behind:

{ .key = {"read-after-open"},
  .type = GF_OPTION_TYPE_BOOL,
  .default_value = "no",
  .description = "read is sent only after actual open happens and real "
                 "fd is obtained, instead of doing on anonymous fd (similar to write)",
},

The read-ahead cache is per-fd and stored in the context of the fd. If open-behind uses anonymous fds for reads, the read is never sent on the fd which read-ahead has seen (during the open from the application). So, there is no read-ahead cache.

Can you retry the tests with the open-behind option "read-after-open" set to "yes"?
[root@unused glusterfs]# gluster volume set dist-rep performance.read-after-open on
volume set: success
[root@unused glusterfs]# gluster volume info

Volume Name: dist-rep
Type: Distributed-Replicate
Volume ID: 201492ff-9eb8-48f9-a647-59b89853e3d3
Status: Created
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: booradley:/home/export-2/dist-rep1
Brick2: booradley:/home/export-2/dist-rep2
Brick3: booradley:/home/export-2/dist-rep3
Brick4: booradley:/home/export-2/dist-rep4
Options Reconfigured:
performance.read-after-open: on
performance.readdir-ahead: on

--- Additional comment from Raghavendra G on 2016-04-26 09:27:15 EDT ---

(In reply to Raghavendra G from comment #3)
> I think the issue is because of open-behind using anonymous fds. See the
> following option in open-behind:
>
> { .key = {"read-after-open"},
>   .type = GF_OPTION_TYPE_BOOL,
>   .default_value = "no",
>   .description = "read is sent only after actual open happens and real "
>                  "fd is obtained, instead of doing on anonymous fd (similar to write)",
> },
>
> The read-ahead cache is per-fd and stored in the context of the fd. If
> open-behind uses anonymous fds for reads, the read is never sent on the fd
> which read-ahead has seen (during the open from the application). So, there
> is no read-ahead cache.

This RCA is not valid. During a read request the fd is stored in local, and in the response path the cache is stored on the fd held in local. So, even though open-behind sends the read on an anonymous fd, read-ahead stores the cache in the fd passed to it by the application/kernel.

> Can you retry the tests with the open-behind option "read-after-open" set
> to "yes"?
>
> [root@unused glusterfs]# gluster volume set dist-rep performance.read-after-open on
> volume set: success
> [root@unused glusterfs]# gluster volume info
>
> Volume Name: dist-rep
> Type: Distributed-Replicate
> Volume ID: 201492ff-9eb8-48f9-a647-59b89853e3d3
> Status: Created
> Number of Bricks: 2 x 2 = 4
> Transport-type: tcp
> Bricks:
> Brick1: booradley:/home/export-2/dist-rep1
> Brick2: booradley:/home/export-2/dist-rep2
> Brick3: booradley:/home/export-2/dist-rep3
> Brick4: booradley:/home/export-2/dist-rep4
> Options Reconfigured:
> performance.read-after-open: on
> performance.readdir-ahead: on

--- Additional comment from Raghavendra G on 2016-04-26 14:27:44 EDT ---

(In reply to Raghavendra G from comment #4)
> (In reply to Raghavendra G from comment #3)
> > I think the issue is because of open-behind using anonymous fds. See the
> > following option in open-behind:
> >
> > { .key = {"read-after-open"},
> >   .type = GF_OPTION_TYPE_BOOL,
> >   .default_value = "no",
> >   .description = "read is sent only after actual open happens and real "
> >                  "fd is obtained, instead of doing on anonymous fd (similar to write)",
> > },
> >
> > The read-ahead cache is per-fd and stored in the context of the fd. If
> > open-behind uses anonymous fds for reads, the read is never sent on the fd
> > which read-ahead has seen (during the open from the application). So, there
> > is no read-ahead cache.
>
> This RCA is not valid. During a read request the fd is stored in local, and
> in the response path the cache is stored on the fd held in local. So, even
> though open-behind sends the read on an anonymous fd, read-ahead stores the
> cache in the fd passed to it by the application/kernel.

Well, the core of the RCA - read-ahead is disabled because of open-behind using anonymous fds - is still valid :). What was wrong was the mechanism through which read-ahead gets turned off.
In our current configuration read-ahead is loaded below open-behind. So, with "read-after-open" turned off, read-ahead never receives an open. Without an open, read-ahead doesn't create a context in the fd, and that context is where all of its cache is stored. There are two solutions to this problem:

1. Load read-ahead as an ancestor of open-behind, so that read-ahead witnesses the open sent by the application before open-behind swallows it.
2. Turn the "read-after-open" option on, so that open-behind performs an actual open.
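For reference, a rough sketch of what the two solutions above look like operationally. This is illustrative only and not taken from the patch; the volfile path shown is an assumption (a typical glusterd location) and the exact filename can differ between releases.

# Solution 1 (what the posted volgen change aims for): read-ahead becomes a
# parent of open-behind in the FUSE client graph. Inspecting the generated
# client volfile shows which of the two xlators lists the other as its
# subvolume; path below is assumed and may vary.
grep -E -B1 -A3 'type performance/(read-ahead|open-behind)' \
    /var/lib/glusterd/vols/vol4/trusted-vol4.tcp-fuse.vol

# Solution 2 (no graph change): make open-behind perform a real open before
# serving reads, so read-ahead sees the open and can build its fd context.
gluster volume set vol4 performance.read-after-open on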
REVIEW: http://review.gluster.org/15811 (glusterd/volgen: Changing the order of read-ahead xlator) posted (#1) for review on master by mohammed rafi kc (rkavunga)
I don't know if this is relevant, but just a heads-up: I remember that the "root-squash" option was dependent on, or tied to, the "read-after-open" option. If you plan to change "read-after-open", please also take a look at the code that handles "server.root-squash".
REVIEW: https://review.gluster.org/15811 (perf/read-ahead: read-ahead before open-behind) posted (#2) for review on master by Milind Changire (mchangir)
REVIEW: https://review.gluster.org/15811 (perf/read-ahead: read-ahead before open-behind) posted (#3) for review on master by Milind Changire (mchangir)
here's output of a single run as per the test case in comment #0 (without dropping kernel caches):
(all nodes are virtual machines on a single laptop with SSD)

16:17:46 [root@f25node1 ~] # gluster volume info dist-rep

Volume Name: dist-rep
Type: Distributed-Replicate
Volume ID: 54b4340d-cb1f-4742-a0ca-3d3ac499d84c
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: f25node1:/glustervols/dist-rep/brick
Brick2: f25node2:/glustervols/dist-rep/brick
Brick3: f25node3:/glustervols/dist-rep/brick
Brick4: f25node4:/glustervols/dist-rep/brick
Brick5: f25node5:/glustervols/dist-rep/brick
Brick6: f25node6:/glustervols/dist-rep/brick
Options Reconfigured:
performance.open-behind: off
performance.read-ahead-page-count: 16
transport.address-family: inet
nfs.disable: on
cluster.brick-multiplex: off

15:11:17 [root@f25client1 ~] # mount -t glusterfs f25node1:/dist-rep /mnt/dist-rep

performance.read-ahead-page-count 4
15:13:46 [root@f25client1 ~] # dd if=/dev/zero of=/mnt/dist-rep/1g bs=1M count=1K
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 263.88 s, 4.1 MB/s

performance.read-ahead-page-count 1
16:07:16 [root@f25client1 ~] # dd if=/mnt/dist-rep/1g bs=1M of=/dev/null
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 10.1683 s, 106 MB/s

performance.read-ahead-page-count 16
16:14:31 [root@f25client1 ~] # dd if=/mnt/dist-rep/1g bs=1M of=/dev/null
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.162729 s, 6.6 GB/s

performance.open-behind off
performance.read-ahead-page-count 16
16:18:38 [root@f25client1 ~] # dd if=/mnt/dist-rep/1g bs=1M of=/dev/null
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 6.43123 s, 167 MB/s
(In reply to Milind Changire from comment #5)
> here's output of a single run as per the test case in comment #0
> (without dropping kernel caches):

It's better to drop caches. If we are doing a fresh mount for each test, I guess it's ok.

> performance.open-behind: off

Was open-behind off in all runs of dd?
(In reply to Raghavendra G from comment #6)
> It's better to drop caches. If we are doing a fresh mount for each test, I
> guess it's ok.

I'll do another round of tests, doing a fresh mount before each run of "dd".

> Was open-behind off in all runs of dd?

open-behind was turned "off" only for the last run of "dd".
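A minimal sketch of such a fresh-mount re-read loop, reusing the hostnames, volume and file names from the runs above. The explicit drop_caches step is an addition for illustration and is not something the comments here mention; the volume-set command has to run on a node with glusterd (e.g. f25node1), the rest on the client.

# Sketch only: re-read the 1 GiB file with different read-ahead page counts,
# remounting and dropping kernel caches on the client before each run so the
# numbers are not served from the local page cache.
for pages in 1 16; do
    gluster volume set dist-rep performance.read-ahead-page-count $pages
    umount /mnt/dist-rep 2>/dev/null
    echo 3 > /proc/sys/vm/drop_caches
    mount -t glusterfs f25node1:/dist-rep /mnt/dist-rep
    dd if=/mnt/dist-rep/1g bs=1M of=/dev/null
done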
Test re-run by doing a fresh mount before every "dd" run:

09:05:47 [root@f25client1 ~] # mount -t glusterfs f25node1:/dist-rep /mnt/dist-rep
09:05:52 [root@f25client1 ~] # dd if=/dev/zero of=/mnt/dist-rep/1g bs=1M count=1K
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 77.2741 s, 13.9 MB/s
09:09:04 [root@f25client1 ~] # umount /mnt/dist-rep

09:06:12 [root@f25node1 ~] # gluster volume set dist-rep performance.read-ahead-page-count 1
volume set: success
09:06:39 [root@f25node1 ~] # gluster volume set dist-rep performance.open-behind on
volume set: success
09:11:12 [root@f25client1 ~] # mount -t glusterfs f25node1:/dist-rep /mnt/dist-rep
09:11:25 [root@f25client1 ~] # dd if=/mnt/dist-rep/1g bs=1M of=/dev/null
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 8.16453 s, 132 MB/s
09:12:07 [root@f25client1 ~] # umount /mnt/dist-rep

09:06:54 [root@f25node1 ~] # gluster volume set dist-rep performance.read-ahead-page-count 16
09:13:40 [root@f25client1 ~] # mount -t glusterfs f25node1:/dist-rep /mnt/dist-rep
09:13:44 [root@f25client1 ~] # dd if=/mnt/dist-rep/1g bs=1M of=/dev/null
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 20.1027 s, 53.4 MB/s

09:13:21 [root@f25node1 ~] # gluster volume set dist-rep performance.open-behind off
volume set: success
09:16:32 [root@f25client1 ~] # umount /mnt/dist-rep
09:40:07 [root@f25client1 ~] # mount -t glusterfs f25node1:/dist-rep /mnt/dist-rep
09:40:31 [root@f25client1 ~] # dd if=/mnt/dist-rep/1g bs=1M of=/dev/null
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 10.2474 s, 105 MB/s
@Milind,

I am curious to know whether you see the same behavior when changing the read-ahead page-count without applying the fix. Can you try that?

regards,
Raghavendra
(In reply to Raghavendra G from comment #9)
> @Milind,
>
> I am curious to know whether you see the same behavior when changing the
> read-ahead page-count without applying the fix. Can you try that?
>
> regards,
> Raghavendra

Following is the test run *without* the patch:

13:50:23 [root@f25client1 ~] # dd if=/dev/zero of=/mnt/dist-rep/1g bs=1M count=1K
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 101.623 s, 10.6 MB/s
13:53:11 [root@f25client1 ~] # umount /mnt/dist-rep
13:54:30 [root@f25client1 ~] # mount -t glusterfs f25node1:/dist-rep /mnt/dist-rep
13:54:36 [root@f25client1 ~] # dd if=/mnt/dist-rep/1g bs=1M of=/dev/null
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 8.44487 s, 127 MB/s

13:50:40 [root@f25node1 ~] # gluster volume set dist-rep performance.read-ahead-page-count 16
13:56:10 [root@f25client1 ~] # umount /mnt/dist-rep
13:56:35 [root@f25client1 ~] # mount -t glusterfs f25node1:/dist-rep /mnt/dist-rep
13:56:59 [root@f25client1 ~] # dd if=/mnt/dist-rep/1g bs=1M of=/dev/null
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 5.23783 s, 205 MB/s

13:56:52 [root@f25node1 ~] # gluster volume set dist-rep performance.open-behind off
13:57:34 [root@f25client1 ~] # umount /mnt/dist-rep
13:57:57 [root@f25client1 ~] # mount -t glusterfs f25node1:/dist-rep /mnt/dist-rep
13:58:00 [root@f25client1 ~] # dd if=/mnt/dist-rep/1g bs=1M of=/dev/null
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 5.52333 s, 194 MB/s
Created attachment 1304787 [details] perf numbers with patch
Created attachment 1304790 [details] perf numbers without patch
We can see that performance is affected by the read-ahead page-count in both scenarios - with and without the fix. Milind will re-run the same tests with the client log-level set to TRACE and attach the logs to the bz, hence setting needinfo on him.

Once the logs are attached, csaba has agreed to take ownership of the bz, hence assigning the bz to csaba.
(In reply to Raghavendra G from comment #13)
> We can see that performance is affected by the read-ahead page-count in both
> scenarios - with and without the fix.

1. If read-ahead is indeed turned off, tweaking the page-count should have no effect on performance (a quick way to sanity-check this is sketched below).
2. When read-ahead is enabled, performance with page-count 1 seems to be better than with a higher page-count.

Both of the above behaviors need to be explained.
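One hedged sketch for checking point 1 (not a test that was run in this bug): disable the read-ahead xlator altogether and compare against the page-count runs; if throughput barely changes, read-ahead was effectively not participating in the read path anyway.

# Illustrative only, reusing the volume/mount names from the runs above.
gluster volume set dist-rep performance.read-ahead off
umount /mnt/dist-rep && mount -t glusterfs f25node1:/dist-rep /mnt/dist-rep
dd if=/mnt/dist-rep/1g bs=1M of=/dev/null

# Re-enable read-ahead and retry with a large page-count for comparison.
gluster volume set dist-rep performance.read-ahead on
gluster volume set dist-rep performance.read-ahead-page-count 16
umount /mnt/dist-rep && mount -t glusterfs f25node1:/dist-rep /mnt/dist-rep
dd if=/mnt/dist-rep/1g bs=1M of=/dev/null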
(In reply to Raghavendra G from comment #13)
> We can see that performance is affected by the read-ahead page-count in both
> scenarios - with and without the fix. Milind will re-run the same tests with
> the client log-level set to TRACE and attach the logs to the bz, hence
> setting needinfo on him.
>
> Once the logs are attached, csaba has agreed to take ownership of the bz,
> hence assigning the bz to csaba.

Perf numbers for the TRACE-level logging test runs are available as attachments.

TRACE-level logs are available at:
1. https://nofile.io/f/is0XjWnaLFK/bz-1393419-test-run-without-patch-client-TRACE.log.gz
2. https://nofile.io/f/Oh3lkzaxMSe/bz-1393419-test-run-with-patch-client-TRACE.log.gz
REVIEW: https://review.gluster.org/15811 (perf/read-ahead: read-ahead before open-behind) posted (#4) for review on master by mohammed rafi kc (rkavunga)
REVIEW: https://review.gluster.org/15811 (perf/read-ahead: read-ahead before open-behind) posted (#5) for review on master by mohammed rafi kc (rkavunga)
(In reply to Milind Changire from comment #15)
> (In reply to Raghavendra G from comment #13)
> > We can see that performance is affected by the read-ahead page-count in
> > both scenarios - with and without the fix.

Without the fix, there is some percentage of opens that open-behind actually winds down the graph (i.e., sends a real open), and read-ahead will be working for those fds. I guess the variance caused by changing the page-count can be attributed to this.

Also, note that, as found in bz 1214489, reads are interspersed with getattr, so the read-ahead cache is thrown away. With a larger page-count more pages are wasted, and hence performance degrades for larger values of page-count.

I verified from the attached logs that patch https://review.gluster.org/15811 makes sure that all opens are witnessed by read-ahead, and hence it is a fix for the current bug. To summarize: patch #15811 is a necessary fix and solves the problem at hand. Problems with read-ahead performance are tracked by other bzs such as bz 1214489.

> > Milind will do the same tests by setting client log-level to TRACE and
> > attach logs to the bz. Hence setting needinfo on him.
> >
> > Once logs are attached, csaba has agreed to take the ownership of bz.
> > Hence assigning the bz to csaba.
>
> Perf numbers for the TRACE-level logging test runs are available as
> attachments.
>
> TRACE-level logs are available at:
> 1. https://nofile.io/f/is0XjWnaLFK/bz-1393419-test-run-without-patch-client-TRACE.log.gz
> 2. https://nofile.io/f/Oh3lkzaxMSe/bz-1393419-test-run-with-patch-client-TRACE.log.gz
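To actually observe the interspersed getattr traffic described above, one option (a hedged suggestion; profiling was not part of these runs) is to sample the volume's io-stats profile across a single dd pass. A STAT/FSTAT count that is large relative to the READ count would be consistent with the cache being repeatedly flushed, as tracked in bz 1214489.

# Sketch only: enable profiling, do one sequential read, dump per-fop counts.
gluster volume profile dist-rep start
dd if=/mnt/dist-rep/1g bs=1M of=/dev/null
gluster volume profile dist-rep info
gluster volume profile dist-rep stop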
REVIEW: https://review.gluster.org/15811 (perf/read-ahead: read-ahead before open-behind) posted (#6) for review on master by Milind Changire
Status?
Raghavendra G should be able to answer this aptly. Redirecting needinfo to him.
This is fixed by https://review.gluster.org/r/Ifa52d8ff017f115e83247f3396b9d27f0295ce3f