Bug 1380638 - Files not being opened with o_direct flag during random read operation (Glusterfs 3.8.2)
Summary: Files not being opened with o_direct flag during random read operation (Glusterfs 3.8.2)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: open-behind
Version: rhgs-3.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: RHGS 3.2.0
Assignee: Krutika Dhananjay
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On: 1375959 1377556 1378695 1378814
Blocks: 1351528
 
Reported: 2016-09-30 08:39 UTC by Krutika Dhananjay
Modified: 2017-03-23 06:06 UTC
CC List: 15 users

Fixed In Version: glusterfs-3.8.4-3
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1377556
Environment:
Last Closed: 2017-03-23 06:06:46 UTC




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:0486 normal SHIPPED_LIVE Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update 2017-03-23 09:18:45 UTC

Description Krutika Dhananjay 2016-09-30 08:39:37 UTC
+++ This bug was initially created as a clone of Bug #1377556 +++

+++ This bug was initially created as a clone of Bug #1375959 +++

Description of problem:

Hi,

I am running a random-workload test using pbench_fio, passing direct=1 to bypass the cache. On the server side I have performance.strict-o-direct: on.
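For reference, a minimal standalone fio invocation of this kind of random-read job might look like the following (a sketch; the mount point and job parameters are illustrative, not the exact pbench_fio configuration used here):

# fio --name=randread --directory=/mnt/rep2 --rw=randread --direct=1 --bs=4k --size=4g --numjobs=4 --time_based --runtime=300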

The expectation was that, for all workloads, no activity would be seen in the server cache. On the contrary, I am seeing the full 48G of server memory being utilized for the random-read workload.

I collected strace output on the brick process and saw no files opened with the O_DIRECT flag. The vmstat data collected on the server also shows full utilization of the 48G of server memory.

I have a 6-node server setup and am running the workload against it from 5 clients. The volume is replica 2. Here is the vol info:

Volume Name: rep2
Type: Distributed-Replicate
Volume ID: 08bc410b-fba4-4a81-b918-fe5239947eef
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: 172.17.40.13:/bricks/b01/g
Brick2: 172.17.40.14:/bricks/b01/g
Brick3: 172.17.40.15:/bricks/b01/g
Brick4: 172.17.40.16:/bricks/b01/g
Brick5: 172.17.40.22:/bricks/b01/g
Brick6: 172.17.40.24:/bricks/b01/g
Options Reconfigured:
performance.quick-read: off
performance.open-behind: off
performance.strict-o-direct: on
transport.address-family: inet
performance.readdir-ahead: on


Version-Release number of selected component (if applicable):

On server side:

rpm -qa | grep gluster
glusterfs-cli-3.8.2-1.el7.x86_64
glusterfs-3.8.2-1.el7.x86_64
glusterfs-api-3.8.2-1.el7.x86_64
glusterfs-libs-3.8.2-1.el7.x86_64
glusterfs-fuse-3.8.2-1.el7.x86_64
glusterfs-client-xlators-3.8.2-1.el7.x86_64
glusterfs-server-3.8.2-1.el7.x86_64

On client side:

rpm -qa | grep gluster
glusterfs-client-xlators-3.8.2-1.el7.x86_64
glusterfs-libs-3.8.2-1.el7.x86_64
glusterfs-3.8.2-1.el7.x86_64
glusterfs-fuse-3.8.2-1.el7.x86_64


How reproducible:

It's reproducible every time.

Steps to Reproduce:
1. Create a volume and set performance.strict-o-direct: on
2. Run random read workload using fio or any other equivalent workload generator
3. Observe vmstat, and collect strace output using strace -ff -T -p <pid> -o <path-where-you-want-the-logs-captured>. To get the PID: ps aux | grep glusterfsd | grep <volname>. (A consolidated sketch follows below.)
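A consolidated sketch of the above steps (volume name and output path are illustrative):

# gluster volume set <volname> performance.strict-o-direct on
# BRICK_PID=$(pgrep -f "glusterfsd.*<volname>" | head -1)
# strace -ff -T -p "$BRICK_PID" -o /var/tmp/brick-strace &
# vmstat 5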

Actual results:

Full 48G of cache memory is being used.

Expected results:

With performance.strict-o-direct: on set, the expectation was that I/O would bypass the cache completely.

Additional info:

Output of strace (note that no regular file is being opened with the O_DIRECT flag):

 grep -i "o_direct" strace.log.*
strace.log.13755:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 31 <0.000063>
strace.log.13755:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/dirty", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 31 <0.000013>
strace.log.13755:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/entry-changes", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 31 <0.000013>
strace.log.13764:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/landfill", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 31 <0.000118>
strace.log.13764:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/landfill", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 31 <0.000114>
strace.log.13764:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/landfill", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 31 <0.000142>
strace.log.13764:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/landfill", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 31 <0.000141>
strace.log.13764:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/landfill", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 31 <0.000141>
strace.log.13764:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/landfill", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 31 <0.000141>
strace.log.13764:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/landfill", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 28 <0.000023>
strace.log.13764:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/landfill", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 28 <0.000089>
strace.log.13764:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/landfill", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 28 <0.000058>
strace.log.13764:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/landfill", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 27 <0.000094>
strace.log.13764:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/landfill", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 26 <0.000104>
strace.log.13764:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/landfill", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 27 <0.000028>
strace.log.13764:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/landfill", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 27 <0.000099>
strace.log.13764:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/landfill", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 27 <0.000018>
strace.log.13764:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/landfill", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 26 <0.000106>
strace.log.13764:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/landfill", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 27 <0.000028>
strace.log.13764:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/landfill", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 29 <0.000077>
strace.log.13764:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/landfill", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 29 <0.000074>
strace.log.13764:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/landfill", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 29 <0.000119>
strace.log.13764:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/landfill", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 27 <0.000028>
strace.log.13764:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/landfill", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 28 <0.000036>
strace.log.13764:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/landfill", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 28 <0.000028>
strace.log.13764:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/landfill", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 27 <0.000109>
strace.log.13764:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/landfill", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 27 <0.000028>
strace.log.13789:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 27 <0.000029>
strace.log.13789:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 27 <0.000015>
strace.log.13789:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 26 <0.000015>
strace.log.13789:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 28 <0.000027>
strace.log.13789:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 29 <0.000050>
strace.log.13789:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 30 <0.000034>
strace.log.13789:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 27 <0.000018>
strace.log.13789:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 26 <0.000128>
strace.log.14342:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 27 <0.000014>
strace.log.14342:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 27 <0.000028>
strace.log.14342:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 27 <0.000014>
strace.log.14342:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 27 <0.000038>
strace.log.14342:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 26 <0.000111>
strace.log.14342:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 27 <0.000014>
strace.log.14342:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 27 <0.000017>
strace.log.14342:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 28 <0.000025>
strace.log.14342:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 29 <0.000028>
strace.log.14342:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 27 <0.000026>
strace.log.14342:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 28 <0.000075>
strace.log.14343:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 28 <0.000022>
strace.log.14343:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 29 <0.000013>
strace.log.14343:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 27 <0.000013>
strace.log.14343:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 29 <0.000025>
strace.log.14343:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 29 <0.000021>
strace.log.14343:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 29 <0.000022>
strace.log.14343:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 27 <0.000016>
strace.log.14343:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 26 <0.000080>
strace.log.14343:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 28 <0.000027>
strace.log.14343:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 28 <0.000018>
strace.log.14358:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 27 <0.000022>
strace.log.14358:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 28 <0.000026>
strace.log.14358:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 27 <0.000017>
strace.log.14359:openat(AT_FDCWD, "/bricks/b01/g/.glusterfs/indices/xattrop", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 27 <0.000014>

--- Additional comment from Red Hat Bugzilla Rules Engine on 2016-09-14 06:49:16 EDT ---

This bug is automatically being proposed for the current release of Red Hat Gluster Storage 3 under active development, by setting the release flag 'rhgs-3.2.0' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Atin Mukherjee on 2016-09-14 13:28:19 EDT ---

Shekhar - please change the product of this BZ to GlusterFS as the testing was done with glusterfs 3.8.2 bits.

--- Additional comment from Pranith Kumar K on 2016-09-14 13:39:06 EDT ---

Shekhar,
   You need the following two options for O_DIRECT to be handled properly:

1) performance.strict-o-direct on
2) network.remote-dio off
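For example, on the rep2 volume from this report, that would be (a sketch):

# gluster volume set rep2 performance.strict-o-direct on
# gluster volume set rep2 network.remote-dio off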

Pranith

--- Additional comment from Shekhar Berry on 2016-09-14 13:45:06 EDT ---

Pranith,

To enable o_direct I turned performance.strict-o-direct on and my network.remote-dio was disabled by default. 

performance.quick-read: off
performance.open-behind: off

are two options which I tried as part of troubleshooting exercise with Krutika and Raghavendra.

I am reopening this issue.

--Shekhar

--- Additional comment from Krutika Dhananjay on 2016-09-14 14:17:53 EDT ---

(In reply to Pranith Kumar K from comment #3)
> Shekhar,
>    You need the following two options for O_DIRECT to be handled properly:
> 
> 1) performance.strict-o-direct on
> 2) network.remote-dio off
> 
> Pranith

@Pranith - Part of the problem seems to be that the anonymous fds created by open-behind don't inherit the original open() flags. And when ob does wind reads and writes on these anon fds, posix winds up using the usual GF_ANON_FD_FLAGS (O_RDWR|O_LARGEFILE) value at the time of open() preceding the invoked fd operation (mostly read/write).

FWIW, I did try the same test Shekhar ran on my test cluster and disabling open-behind in addition to the two already known o-direct options that you mentioned above, seemed to fix the issue of growing cache size as the test progresses (although it didn't have any effect as per Shekhar when he tried the same).

I am yet to confirm the theory (and the presence/absence of O_DIRECT flag at the level of posix) through strace output of the bricks with and without open-behind. I will do the same tomorrow and update the bug.
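One way to check this from the brick strace output would be to grep for opens of gfid handles that carry O_RDWR but not O_DIRECT (a sketch; the exact flag rendering in strace may vary):

# grep 'open(' strace.log.* | grep '\.glusterfs/' | grep 'O_RDWR' | grep -v 'O_DIRECT'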

-Krutika

--- Additional comment from Pranith Kumar K on 2016-09-14 15:26:23 EDT ---

(In reply to Shekhar Berry from comment #4)
> Pranith,
> 
> To enable o_direct I turned performance.strict-o-direct on and my
> network.remote-dio was disabled by default. 
> 
> performance.quick-read: off
> performance.open-behind: off
> 
> are two options which I tried as part of troubleshooting exercise with
> Krutika and Raghavendra.
> 
> I am reopening this issue.
> 
> --Shekhar

I was so confident this is not a bug, sorry about closing the bug before coming to a conclusion :-(

--- Additional comment from Pranith Kumar K on 2016-09-14 15:30:03 EDT ---

(In reply to Krutika Dhananjay from comment #5)
> (In reply to Pranith Kumar K from comment #3)
> > Shekhar,
> >    You need the following two options for O_DIRECT to be handled properly:
> > 
> > 1) performance.strict-o-direct on
> > 2) network.remote-dio off
> > 
> > Pranith
> 
> @Pranith - Part of the problem seems to be that the anonymous fds created by
> open-behind don't inherit the original open() flags. And when ob does wind
> reads and writes on these anon fds, posix winds up using the usual
> GF_ANON_FD_FLAGS (O_RDWR|O_LARGEFILE) value at the time of open() preceding
> the invoked fd operation (mostly read/write).

As soon as the first write comes, an OPEN FOP with the same flags that came at the time of open() is sent out from open-behind; from then on, even the reads are served using this fd. Could you confirm whether the test includes only reads and no writes at all? In that case we have a bug we need to look into.

> 
> FWIW, I did try the same test Shekhar ran on my test cluster and disabling
> open-behind in addition to the two already known o-direct options that you
> mentioned above, seemed to fix the issue of growing cache size as the test
> progresses (although it didn't have any effect as per Shekhar when he tried
> the same).
> 
> I am yet to confirm the theory (and the presence/absence of O_DIRECT flag at
> the level of posix) through strace output of the bricks with and without
> open-behind. I will do the same tomorrow and update the bug.

Yes, this would be important to know.

> 
> -Krutika

--- Additional comment from Krutika Dhananjay on 2016-09-15 09:28:42 EDT ---

(In reply to Krutika Dhananjay from comment #5)
> (In reply to Pranith Kumar K from comment #3)
> > Shekhar,
> >    You need the following two options for O_DIRECT to be handled properly:
> > 
> > 1) performance.strict-o-direct on
> > 2) network.remote-dio off
> > 
> > Pranith
> 
> @Pranith - Part of the problem seems to be that the anonymous fds created by
> open-behind don't inherit the original open() flags. And when ob does wind
> reads and writes on these anon fds, posix winds up using the usual
> GF_ANON_FD_FLAGS (O_RDWR|O_LARGEFILE) value at the time of open() preceding
> the invoked fd operation (mostly read/write).
> 
> FWIW, I did try the same test Shekhar ran on my test cluster and disabling
> open-behind in addition to the two already known o-direct options that you
> mentioned above, seemed to fix the issue of growing cache size as the test
> progresses (although it didn't have any effect as per Shekhar when he tried
> the same).
> 
> I am yet to confirm the theory (and the presence/absence of O_DIRECT flag at
> the level of posix) through strace output of the bricks with and without
> open-behind. I will do the same tomorrow and update the bug.
> 
> -Krutika

I stand corrected. *Even* with open-behind disabled, like Shekhar rightly said, the issue still exists. I even disabled all perf xls and the problem still persists. I found that open()s are being issued on gfid handles with O_RDONLY|O_DIRECT flags in the bricks' strace output with all these xls disabled. I still need to check if pread()s are being issued on the same file descriptors though.

Still have some more strace analysis to do. Will update the bug again once the issue is RC'd.

-Krutika

--- Additional comment from Krutika Dhananjay on 2016-09-15 12:04:21 EDT ---

So I checked the strace output again. The moment I disable open-behind, I start to see O_DIRECT in the strace logs. I grepped for other invocations of open() and realised there are all these files opened with the O_WRONLY flag (without O_DIRECT, of course) with a lot of pwrites being invoked on their fds. It took me a while to realise that these are filenames with patterns fio_bw*.log, fio_clat*.log, fio_lat*.log, etc. - in other words, log files created by fio itself, which were ending up on the glusterfs mount point. (A sketch of the grep follows below.)
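A sketch of the kind of grep that surfaces those writes (pattern illustrative):

# grep 'open(' strace.log.* | grep 'O_WRONLY' | grep -v 'O_DIRECT'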

Possible dumb question inbound - Shekhar, did you invoke the fio command when your current working directory was the glusterfs mount point itself? I made that mistake today which is why I saw the same results as you did. :)

Could you please confirm this?

I ran the test again now, this time from $HOME. Disabling open-behind caused things to work for me.

The bug report, however, is still valid; I will send out a patch to make open-behind handle O_DIRECT. Meanwhile, you can work around the issue by disabling open-behind (see the command below).
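The workaround as a command, using the volume name from this report:

# gluster volume set rep2 performance.open-behind off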

-Krutika

--- Additional comment from Shekhar Berry on 2016-09-15 14:30:27 EDT ---

Krutika,

Thanks for looking into the bug.

Here are couple of my comments:

1) did you invoke the fio command when your current working directory was the glusterfs mount point itself?

No, I am using pbench_fio for my testing and it stores run logs under /var/lib/pbench-agent. Just FYI, I am not invoking pbench_fio from gluster mount point itself :)

2) I redid the test with the performance.open-behind: off and performance.strict-o-direct: on options. The results are no different in my setup; I am still seeing full cache utilization.

I have not captured strace under this setting. I can capture and share that with you if required.

--Shekhar

--- Additional comment from Krutika Dhananjay on 2016-09-16 01:12:36 EDT ---

Yes, strace output would be great.

1. Could you capture the strace logs of the bricks (the smaller the brick count, the easier it would be for me :) ) for the following cases:

i. strict-o-direct on, and everything else remains default (implying that open-behind will also be on)

ii. strict-o-direct on and open-behind turned off

iii. strict-o-direct on and all perf xls disabled (write-behind, open-behind, quick-read, read-ahead, io-cache, stat-prefetch).

For capturing the strace output, you can use this:

# strace -ff -f -p <pid-of-brick> -o <path-where-you-want-the-output-captured>

Note that you may have to hit Ctrl+C on the strace terminal after every run.



2. Along with 1, also start the test script (fio in your case) itself under strace. For this, you need to execute the following:

# strace -ff -f -o fio-strace.log <your-test-invocation-command-here>

-Krutika

--- Additional comment from Shekhar Berry on 2016-09-17 03:54:25 EDT ---

I stand corrected. With open-behind off, I am seeing files now being opened with the O_DIRECT flag for both the write and read tests.

I redid the test with following volume setting:

performance.open-behind: off
performance.strict-o-direct: on

Here's the strace output:

grep -i "o_direct" strace.log.* | egrep -v openat

strace.log.25363:open("/bricks/b01/g/.glusterfs/ab/2e/ab2ea2ca-1f6e-494a-b817-faec8627afe3", O_RDWR|O_DIRECT) = 33 <0.000014>
strace.log.25363:open("/bricks/b01/g/.glusterfs/10/f7/10f7cf7d-bd68-4414-a3c8-cb3b12092452", O_RDWR|O_DIRECT) = 35 <0.000016>
strace.log.25363:open("/bricks/b01/g/.glusterfs/51/ac/51acd630-6518-42f8-a9cf-c936dc5c3b9c", O_RDWR|O_DIRECT) = 37 <0.000011>
strace.log.25363:open("/bricks/b01/g/.glusterfs/a1/41/a141600d-8fbb-41f9-b0b8-34318125839e", O_RDWR|O_DIRECT) = 39 <0.000017>
strace.log.25363:open("/bricks/b01/g/.glusterfs/fa/cd/facda6a3-53c4-484e-a599-2488d0683c31", O_RDWR|O_DIRECT) = 43 <0.000039>
strace.log.25363:open("/bricks/b01/g/.glusterfs/52/02/5202698e-51d2-40d3-9a61-9b40ce2b427b", O_RDWR|O_DIRECT) = 44 <0.000014>
strace.log.25363:open("/bricks/b01/g/.glusterfs/ac/2c/ac2ceda0-d035-4b94-87bc-c12bced59c6c", O_RDONLY|O_DIRECT) = 45 <0.000040>
strace.log.25363:open("/bricks/b01/g/.glusterfs/51/ac/51acd630-6518-42f8-a9cf-c936dc5c3b9c", O_RDONLY|O_DIRECT) = 52 <0.000021>
strace.log.25454:open("/bricks/b01/g/.glusterfs/ca/47/ca475b51-9f61-4838-ba2c-8930b65a0c04", O_RDWR|O_DIRECT) = 34 <0.000019>
strace.log.25454:open("/bricks/b01/g/.glusterfs/11/d7/11d7083f-a4d0-41c2-9b48-efbd899b831a", O_RDWR|O_DIRECT) = 36 <0.000017>
strace.log.25454:open("/bricks/b01/g/.glusterfs/84/63/8463c0d7-3e68-44c7-b360-0a6622070c42", O_RDWR|O_DIRECT) = 38 <0.000017>
strace.log.25454:open("/bricks/b01/g/.glusterfs/9a/4c/9a4cb41f-487d-464f-9238-9db7385ecf53", O_RDWR|O_DIRECT) = 40 <0.000012>
strace.log.25454:open("/bricks/b01/g/.glusterfs/30/f9/30f929ce-9b66-4d3e-ab0f-a73c60d4d817", O_RDWR|O_DIRECT) = 41 <0.000013>
strace.log.25454:open("/bricks/b01/g/.glusterfs/85/50/8550a796-172b-4390-b976-b228eafd9ada", O_RDWR|O_DIRECT) = 42 <0.000016>
strace.log.25454:open("/bricks/b01/g/.glusterfs/9d/b2/9db21c69-39bc-4565-8793-0833cf21887d", O_RDWR|O_DIRECT) = 45 <0.000020>
strace.log.25454:open("/bricks/b01/g/.glusterfs/ac/2c/ac2ceda0-d035-4b94-87bc-c12bced59c6c", O_RDWR|O_DIRECT) = 48 <0.000026>
strace.log.25454:open("/bricks/b01/g/.glusterfs/a5/b6/a5b62db9-0d1a-49cb-b188-a1be216de017", O_RDWR|O_DIRECT) = 50 <0.000021>
strace.log.25454:open("/bricks/b01/g/.glusterfs/c1/fb/c1fb470d-63dc-4d1e-8079-93877bdcb2a5", O_RDONLY|O_DIRECT) = 36 <0.000026>
strace.log.25454:open("/bricks/b01/g/.glusterfs/11/d7/11d7083f-a4d0-41c2-9b48-efbd899b831a", O_RDONLY|O_DIRECT) = 48 <0.000030>
strace.log.25455:open("/bricks/b01/g/.glusterfs/81/04/810408fe-eb40-42e3-b53b-981dbf290af3", O_RDWR|O_DIRECT) = 49 <0.000037>
strace.log.25455:open("/bricks/b01/g/.glusterfs/c1/fb/c1fb470d-63dc-4d1e-8079-93877bdcb2a5", O_RDWR|O_DIRECT) = 52 <0.000013>
strace.log.25455:open("/bricks/b01/g/.glusterfs/30/f9/30f929ce-9b66-4d3e-ab0f-a73c60d4d817", O_RDONLY|O_DIRECT) = 33 <0.000022>
strace.log.25455:open("/bricks/b01/g/.glusterfs/9d/b2/9db21c69-39bc-4565-8793-0833cf21887d", O_RDONLY|O_DIRECT) = 35 <0.000032>
strace.log.25455:open("/bricks/b01/g/.glusterfs/84/63/8463c0d7-3e68-44c7-b360-0a6622070c42", O_RDONLY|O_DIRECT) = 49 <0.000027>
strace.log.25456:open("/bricks/b01/g/.glusterfs/2d/35/2d352297-ab44-4c92-8b1d-cb314f5f693b", O_RDWR|O_DIRECT) = 51 <0.000029>
strace.log.25456:open("/bricks/b01/g/.glusterfs/fa/cd/facda6a3-53c4-484e-a599-2488d0683c31", O_RDONLY|O_DIRECT) = 34 <0.000029>
strace.log.25457:open("/bricks/b01/g/.glusterfs/e5/13/e5134455-eb71-43b8-8662-355c79d5ef1a", O_RDWR|O_DIRECT) = 46 <0.000018>
strace.log.25457:open("/bricks/b01/g/.glusterfs/49/3b/493b7a5b-42f7-4c0a-9a4c-48956a05444f", O_RDWR|O_DIRECT) = 47 <0.000027>
strace.log.25457:open("/bricks/b01/g/.glusterfs/81/04/810408fe-eb40-42e3-b53b-981dbf290af3", O_RDONLY|O_DIRECT) = 47 <0.000031>
strace.log.25458:open("/bricks/b01/g/.glusterfs/52/02/5202698e-51d2-40d3-9a61-9b40ce2b427b", O_RDONLY|O_DIRECT) = 39 <0.000026>
strace.log.25458:open("/bricks/b01/g/.glusterfs/ab/2e/ab2ea2ca-1f6e-494a-b817-faec8627afe3", O_RDONLY|O_DIRECT) = 46 <0.000037>
strace.log.25458:open("/bricks/b01/g/.glusterfs/ca/47/ca475b51-9f61-4838-ba2c-8930b65a0c04", O_RDONLY|O_DIRECT) = 51 <0.000030>
strace.log.25459:open("/bricks/b01/g/.glusterfs/e5/13/e5134455-eb71-43b8-8662-355c79d5ef1a", O_RDONLY|O_DIRECT) = 40 <0.000017>
strace.log.25459:open("/bricks/b01/g/.glusterfs/10/f7/10f7cf7d-bd68-4414-a3c8-cb3b12092452", O_RDONLY|O_DIRECT) = 42 <0.000014>
strace.log.25459:open("/bricks/b01/g/.glusterfs/49/3b/493b7a5b-42f7-4c0a-9a4c-48956a05444f", O_RDONLY|O_DIRECT) = 43 <0.000044>
strace.log.25460:open("/bricks/b01/g/.glusterfs/85/50/8550a796-172b-4390-b976-b228eafd9ada", O_RDONLY|O_DIRECT) = 37 <0.000012>
strace.log.25461:open("/bricks/b01/g/.glusterfs/a5/b6/a5b62db9-0d1a-49cb-b188-a1be216de017", O_RDONLY|O_DIRECT) = 38 <0.000039>
strace.log.25461:open("/bricks/b01/g/.glusterfs/9a/4c/9a4cb41f-487d-464f-9238-9db7385ecf53", O_RDONLY|O_DIRECT) = 44 <0.000027>
strace.log.25461:open("/bricks/b01/g/.glusterfs/a1/41/a141600d-8fbb-41f9-b0b8-34318125839e", O_RDONLY|O_DIRECT) = 50 <0.000015>
strace.log.25462:open("/bricks/b01/g/.glusterfs/2d/35/2d352297-ab44-4c92-8b1d-cb314f5f693b", O_RDONLY|O_DIRECT) = 41 <0.000017>


--Shekhar

--- Additional comment from Worker Ant on 2016-09-21 06:26:19 EDT ---

REVIEW: http://review.gluster.org/15537 (performance/open-behind: Pass O_DIRECT flags for anon fd reads when required) posted (#1) for review on master by Krutika Dhananjay (kdhananj@redhat.com)

--- Additional comment from Worker Ant on 2016-09-21 06:29:49 EDT ---

REVIEW: http://review.gluster.org/15537 (performance/open-behind: Pass O_DIRECT flags for anon fd reads when required) posted (#2) for review on master by Krutika Dhananjay (kdhananj@redhat.com)

--- Additional comment from Worker Ant on 2016-09-23 01:32:17 EDT ---

COMMIT: http://review.gluster.org/15537 committed in master by Raghavendra G (rgowdapp@redhat.com) 
------
commit a412a4f50d8ca2ae68dbfa93b80757889150ce99
Author: Krutika Dhananjay <kdhananj@redhat.com>
Date:   Tue Sep 20 12:05:23 2016 +0530

    performance/open-behind: Pass O_DIRECT flags for anon fd reads when required
    
    Writes are already passing the correct flags at the time of open().
    
    Also, make io-cache honor direct-io for anon-fds with
    O_DIRECT flag during reads.
    
    Change-Id: I215cb09ef1b607b9f95cabf0ef3065c00edd9e78
    BUG: 1377556
    Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
    Reviewed-on: http://review.gluster.org/15537
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>

Comment 3 Krutika Dhananjay 2016-10-14 09:14:06 UTC
https://code.engineering.redhat.com/gerrit/#/c/87139/

Comment 5 Atin Mukherjee 2016-10-14 09:45:50 UTC
upstream mainline : http://review.gluster.org/#/c/15537/
upstream 3.8 : http://review.gluster.org/15552
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/87139/

Comment 9 Ben Turner 2017-02-17 01:08:58 UTC
The QE perf team will run through the regression suite with O_DIRECT enabled.  The expectation is that O_DIRECT runs should perform similarly to normal runs where the FSYNC time is included in the performance calculation.  I will ask Karan to kick off a Jenkins run tomorrow.  If it performs as expected we will mark this BZ VERIFIED.

-b

Comment 12 SATHEESARAN 2017-02-24 07:00:15 UTC
Tested with the RHGS 3.2.0 interim build (glusterfs-3.8.4-12.el7rhgs) with the following steps:

1.a. Created a replica 3 volume with strict-o-direct enabled.
1.b. Checked whether open-behind is enabled on the volume.
2. Started the volume and attached strace to the brick process.
3. FUSE-mounted the volume and created a file using 'dd' with oflag=direct.
4. After file creation, checked the strace logs for open calls.

All open calls had the O_DIRECT flag.

One part of the bug is that open-behind was not honoring O_DIRECT, and that is verified with the above steps (a sketch of the check follows below).
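A minimal sketch of that check (mount point, file name and strace output path are illustrative):

# dd if=/dev/zero of=/mnt/<volname>/ddfile bs=1M count=100 oflag=direct
# grep 'O_DIRECT' /var/tmp/brick-strace.*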

Comment 14 errata-xmlrpc 2017-03-23 06:06:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

