Description of problem:
The open-behind xlator is turned on by default when creating a new volume. This appears to prevent read-ahead from working.

Version-Release number of selected component (if applicable):
release-3.4 branch.

How reproducible:

Steps to Reproduce:
1. Create a volume called vol4:

[root@bd-vm ~]# mkdir /test/vol4
[root@bd-vm ~]# gluster volume create vol4 bd-vm:/test/vol4 force
volume create: vol4: success: please start the volume to access data
[root@bd-vm ~]# gluster volume start vol4
volume start: vol4: success
[root@bd-vm ~]# gluster volume info vol4

Volume Name: vol4
Type: Distribute
Volume ID: 85af878b-0119-4f99-b01f-caf4577cb4d4
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: bd-vm:/test/vol4

2. Mount the volume:

[root@bd-vm ~]# mkdir /mnt4
[root@bd-vm ~]# mount -t glusterfs localhost:/vol4 /mnt4

3. Write a 4GB file (= RAM size):

[root@bd-vm fio]# dd if=/dev/zero of=/mnt4/4g bs=1M count=4K
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 23.0355 s, 186 MB/s

4. First read, with read-ahead-page-count = 1: throughput 99 MB/s.

[root@bd-vm ~]# gluster volume set vol4 performance.read-ahead-page-count 1
volume set: success
[root@bd-vm ~]# dd if=/mnt4/4g bs=1M of=/dev/null
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 43.0906 s, 99.7 MB/s

5. Second read, read-ahead-page-count = 16: throughput 107 MB/s, not much difference.

[root@bd-vm ~]# gluster volume set vol4 performance.read-ahead-page-count 16
volume set: success
[root@bd-vm fio]# dd if=/mnt4/4g bs=1M of=/dev/null
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 40.1117 s, 107 MB/s

6. Third read, read-ahead-page-count = 16, open-behind = off: throughput 269 MB/s.

[root@bd-vm ~]# gluster volume set vol4 performance.open-behind off
volume set: success
[root@bd-vm fio]# dd if=/mnt4/4g bs=1M of=/dev/null
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 15.982 s, 269 MB/s

Actual results:
read-ahead has no impact on sequential read and re-read.

Expected results:
read-ahead should improve sequential re-read.

Additional info:
I built gluster from git source as of Mar 25, 2014, branch release-3.4.
To assess priority, how many folks are using the open-behind volume option? The open-behind translator is an optimization for small-file workloads, correct? Has anyone measured the performance of open-behind on vs. off? Does it help?
@Poornima/Anuradha, can you take a look at this bug?

regards,
Raghavendra
I think the issue is because of open-behind using anonymous fds. See the following option in open-behind:

    { .key  = {"read-after-open"},
      .type = GF_OPTION_TYPE_BOOL,
      .default_value = "no",
      .description = "read is sent only after actual open happens and real "
                     "fd is obtained, instead of doing on anonymous fd "
                     "(similar to write)",
    },

The read-ahead cache is per-fd and stored in the context of the fd. If open-behind is using anonymous fds for doing reads, the read is never sent on the fd which read-ahead has seen (during the open from the application). So, there is no read-ahead cache.

Can you retry the tests by setting the option "read-after-open" in open-behind to "yes"?

[root@unused glusterfs]# gluster volume set dist-rep performance.read-after-open on
volume set: success
[root@unused glusterfs]# gluster volume info

Volume Name: dist-rep
Type: Distributed-Replicate
Volume ID: 201492ff-9eb8-48f9-a647-59b89853e3d3
Status: Created
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: booradley:/home/export-2/dist-rep1
Brick2: booradley:/home/export-2/dist-rep2
Brick3: booradley:/home/export-2/dist-rep3
Brick4: booradley:/home/export-2/dist-rep4
Options Reconfigured:
performance.read-after-open: on
performance.readdir-ahead: on
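[Editorial note: for readers who haven't looked at the xlator internals, here is a minimal hedged sketch of the per-fd-context mechanism described above. ra_file_t, attach_cache() and lookup_cache() are illustrative stand-ins for read-ahead's real structures and call sites; fd_ctx_set() and fd_ctx_get() are the actual libglusterfs per-fd-context calls.]

    #include <stdlib.h>
    #include "fd.h"          /* fd_t, fd_ctx_set, fd_ctx_get */
    #include "xlator.h"      /* xlator_t */

    /* Illustrative stand-in for read-ahead's per-fd cache structure. */
    typedef struct {
            size_t page_count;
    } ra_file_t;

    /* Open path: create the cache and attach it to the fd's context,
     * keyed by this xlator. */
    static int
    attach_cache (xlator_t *this, fd_t *fd)
    {
            ra_file_t *file = calloc (1, sizeof (*file));

            if (!file)
                    return -1;

            return fd_ctx_set (fd, this, (uint64_t) (unsigned long) file);
    }

    /* Read path: recover the cache from the fd.  An anonymous fd never
     * went through the open path, so fd_ctx_get() fails and the read
     * proceeds without any read-ahead cache. */
    static ra_file_t *
    lookup_cache (xlator_t *this, fd_t *fd)
    {
            uint64_t value = 0;

            if (fd_ctx_get (fd, this, &value) != 0)
                    return NULL;

            return (ra_file_t *) (unsigned long) value;
    }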
(In reply to Raghavendra G from comment #3)
> I think the issue is because of open-behind using anonymous fds. See the
> following option in open-behind:
>
>     { .key  = {"read-after-open"},
>       .type = GF_OPTION_TYPE_BOOL,
>       .default_value = "no",
>       .description = "read is sent only after actual open happens and real "
>                      "fd is obtained, instead of doing on anonymous fd "
>                      "(similar to write)",
>     },
>
> The read-ahead cache is per-fd and stored in the context of the fd. If
> open-behind is using anonymous fds for doing reads, the read is never sent
> on the fd which read-ahead has seen (during the open from the application).
> So, there is no read-ahead cache.

This RCA is not valid. During the read request the fd is stored in the frame's local, and in the response the cache is stored on the fd held in local. So, even though open-behind sends the read on an anonymous fd, read-ahead stores the cache in the fd passed to the application/kernel.

> Can you retry the tests by setting the option "read-after-open" in
> open-behind to "yes"?
>
> [root@unused glusterfs]# gluster volume set dist-rep performance.read-after-open on
> volume set: success
> [root@unused glusterfs]# gluster volume info
>
> Volume Name: dist-rep
> Type: Distributed-Replicate
> Volume ID: 201492ff-9eb8-48f9-a647-59b89853e3d3
> Status: Created
> Number of Bricks: 2 x 2 = 4
> Transport-type: tcp
> Bricks:
> Brick1: booradley:/home/export-2/dist-rep1
> Brick2: booradley:/home/export-2/dist-rep2
> Brick3: booradley:/home/export-2/dist-rep3
> Brick4: booradley:/home/export-2/dist-rep4
> Options Reconfigured:
> performance.read-after-open: on
> performance.readdir-ahead: on
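[Editorial note: to spell out the wind/unwind pattern this comment relies on, here is a hedged sketch. ra_local_t and the function names are illustrative, not the actual read-ahead source; the fop and callback prototypes follow the release-3.4 readv signatures.]

    #include <errno.h>
    #include <stdlib.h>
    #include "glusterfs.h"   /* core types (iovec, iatt, iobref, dict_t) */
    #include "xlator.h"      /* xlator_t, call_frame_t, STACK_WIND */

    /* Illustrative stand-in for read-ahead's per-call local state. */
    typedef struct {
            fd_t *fd;
    } ra_local_t;

    int sketch_readv_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                          int32_t op_ret, int32_t op_errno,
                          struct iovec *vector, int32_t count,
                          struct iatt *stbuf, struct iobref *iobref,
                          dict_t *xdata);

    /* Request path: remember the application's fd in frame->local
     * before winding the read down the graph. */
    int
    sketch_readv (call_frame_t *frame, xlator_t *this, fd_t *fd,
                  size_t size, off_t offset, uint32_t flags, dict_t *xdata)
    {
            ra_local_t *local = calloc (1, sizeof (*local));

            if (!local) {
                    STACK_UNWIND_STRICT (readv, frame, -1, ENOMEM,
                                         NULL, 0, NULL, NULL, NULL);
                    return 0;
            }

            local->fd    = fd;      /* the fd the application opened */
            frame->local = local;

            STACK_WIND (frame, sketch_readv_cbk,
                        FIRST_CHILD (this), FIRST_CHILD (this)->fops->readv,
                        fd, size, offset, flags, xdata);
            return 0;
    }

    /* Response path: the cache is populated against local->fd, the
     * application's fd, no matter which fd (possibly anonymous) the
     * read actually travelled on below this xlator. */
    int
    sketch_readv_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                      int32_t op_ret, int32_t op_errno, struct iovec *vector,
                      int32_t count, struct iatt *stbuf, struct iobref *iobref,
                      dict_t *xdata)
    {
            ra_local_t *local = frame->local;

            /* ... store read-ahead pages in local->fd's context ... */
            (void) local;

            STACK_UNWIND_STRICT (readv, frame, op_ret, op_errno,
                                 vector, count, stbuf, iobref, xdata);
            return 0;
    }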
(In reply to Raghavendra G from comment #4)
> (In reply to Raghavendra G from comment #3)
> > I think the issue is because of open-behind using anonymous fds. See the
> > following option in open-behind:
> >
> >     { .key  = {"read-after-open"},
> >       .type = GF_OPTION_TYPE_BOOL,
> >       .default_value = "no",
> >       .description = "read is sent only after actual open happens and real "
> >                      "fd is obtained, instead of doing on anonymous fd "
> >                      "(similar to write)",
> >     },
> >
> > The read-ahead cache is per-fd and stored in the context of the fd. If
> > open-behind is using anonymous fds for doing reads, the read is never sent
> > on the fd which read-ahead has seen (during the open from the application).
> > So, there is no read-ahead cache.
>
> This RCA is not valid. During the read request the fd is stored in the
> frame's local, and in the response the cache is stored on the fd held in
> local. So, even though open-behind sends the read on an anonymous fd,
> read-ahead stores the cache in the fd passed to the application/kernel.

Well, the core of the RCA - read-ahead is disabled because of open-behind using anonymous fds - is still valid :). What was wrong was the mechanism through which read-ahead is turned off.

In our current configuration read-ahead is loaded below open-behind. So, with "read-after-open" turned off, read-ahead never receives an open. Without an open, read-ahead doesn't create a context in the fd, which is where all the cache is stored.

There are two solutions to this problem (see the illustrative graph fragment below):

1. Load read-ahead as an ancestor of open-behind. This way read-ahead sees the open sent by the application before open-behind absorbs it.
2. Turn the "read-after-open" option on, so that open-behind performs a real open before reads are sent.
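[Editorial note: to make the ordering concrete, here is an illustrative, hand-written client-graph fragment. It is not glusterd output, and the other performance xlators that sit between these in a real graph are elided. In a volfile, a volume that names another via "subvolumes" sits above it in the graph, closer to the application.]

    volume vol4-client-0              # stub for the protocol/client xlator
        type protocol/client
        option remote-host bd-vm
        option remote-subvolume /test/vol4
    end-volume

    volume vol4-read-ahead
        type performance/read-ahead
        option page-count 16          # performance.read-ahead-page-count
        subvolumes vol4-client-0
    end-volume

    volume vol4-open-behind           # above read-ahead: sees open() first
        type performance/open-behind
        option read-after-open no     # default; reads go over anonymous fds
        subvolumes vol4-read-ahead
    end-volume

With this stacking, open-behind absorbs the application's open and read-ahead below it never creates an fd context. Solution 1 would move performance/read-ahead above performance/open-behind in such a graph; solution 2 is the performance.read-after-open setting already shown in comment #3.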
https://review.gluster.org/#/c/20511/
The above patch has been merged in mainline and will be in release-5.0.