Bug 1360785 - Direct io to sharded files fails when on zfs backend
Summary: Direct io to sharded files fails when on zfs backend
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: posix
Version: 3.7.13
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Krutika Dhananjay
QA Contact: bugs@gluster.org
URL:
Whiteboard:
Depends On:
Blocks: 1361300 1361449
 
Reported: 2016-07-27 13:30 UTC by David
Modified: 2016-08-15 13:49 UTC
5 users

Fixed In Version: glusterfs-3.7.14
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1361300 (view as bug list)
Environment:
Last Closed: 2016-08-02 07:25:24 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:
dgossage: needinfo-


Attachments
logs from directio test (5.98 KB, application/zip)
2016-07-27 13:30 UTC, David
strace run during failed dd (392.46 KB, application/zip)
2016-07-29 16:29 UTC, David
logs from running dd command (3.89 KB, application/zip)
2016-07-29 16:29 UTC, David

Description David 2016-07-27 13:30:38 UTC
Created attachment 1184658 [details]
logs from directio test

Beginning with 3.7.12 (and also in 3.7.13), direct I/O to sharded files fails when using ZFS-backed bricks.

How reproducible: Always


Steps to Reproduce:
1. ZFS-backed bricks with default settings, except xattr=sa
2. GlusterFS 3.7.12+ with sharding enabled
3. dd if=/dev/zero of=/rhev/data-center/mnt/glusterSD/192.168.71.11\:_glustershard/81e19cd3-ae45-449c-b716-ec3e4ad4c2f0/images/test oflag=direct count=100 bs=1M

Actual results: dd: error writing ‘/rhev/data-center/mnt/glusterSD/192.168.71.11:_glustershard/81e19cd3-ae45-449c-b716-ec3e4ad4c2f0/images/test’: Operation not permitted

The file 'test' is created with its size set to the shard size; the shard files created under .shard are 0 bytes.


Expected results: 
100+0 records in
100+0 records out
104857600 bytes etc.....


Additional info:
Proxmox users have been able to work around this by changing disk caching from none to writethrough/writeback. I'm not sure this would help with oVirt, as the Python script that checks storage with dd and oflag=direct also fails.

Attaching the client and brick logs from the test.

Comment 1 David 2016-07-27 13:36:52 UTC
On the oVirt mailing list I was asked to test these settings:

i. Set network.remote-dio to off
        # gluster volume set <VOL> network.remote-dio off

ii. Set performance.strict-o-direct to on
        # gluster volume set <VOL> performance.strict-o-direct on

results:

dd if=/dev/zero of=/rhev/data-center/mnt/glusterSD/192.168.71.10\:_glustershard/5b8a4477-4d87-43a1-aa52-b664b1bd9e08/images/test oflag=direct count=100 bs=1M
dd: error writing ‘/rhev/data-center/mnt/glusterSD/192.168.71.10:_glustershard/5b8a4477-4d87-43a1-aa52-b664b1bd9e08/images/test’: Invalid argument
dd: closing output file ‘/rhev/data-center/mnt/glusterSD/192.168.71.10:_glustershard/5b8a4477-4d87-43a1-aa52-b664b1bd9e08/images/test’: Invalid argument


[2016-07-25 18:20:19.393121] E [MSGID: 113039] [posix.c:2939:posix_open] 0-glustershard-posix: open on /gluster2/brick1/1/.glusterfs/02/f4/02f4783b-2799-46d9-b787-53e4ccd9a052, flags: 16385 [Invalid argument]
[2016-07-25 18:20:19.393204] E [MSGID: 115070] [server-rpc-fops.c:1568:server_open_cbk] 0-glustershard-server: 120: OPEN /5b8a4477-4d87-43a1-aa52-b664b1bd9e08/images/test (02f4783b-2799-46d9-b787-53e4ccd9a052) ==> (Invalid argument) [Invalid argument]


and /var/log/glusterfs/rhev-data-center-mnt-glusterSD-192.168.71.10\:_glustershard.log
[2016-07-25 18:20:19.393275] E [MSGID: 114031] [client-rpc-fops.c:466:client3_3_open_cbk] 0-glustershard-client-0: remote operation failed. Path: /5b8a4477-4d87-43a1-aa52-b664b1bd9e08/images/test (02f4783b-2799-46d9-b787-53e4ccd9a052) [Invalid argument]
[2016-07-25 18:20:19.393270] E [MSGID: 114031] [client-rpc-fops.c:466:client3_3_open_cbk] 0-glustershard-client-1: remote operation failed. Path: /5b8a4477-4d87-43a1-aa52-b664b1bd9e08/images/test (02f4783b-2799-46d9-b787-53e4ccd9a052) [Invalid argument]
[2016-07-25 18:20:19.393317] E [MSGID: 114031] [client-rpc-fops.c:466:client3_3_open_cbk] 0-glustershard-client-2: remote operation failed. Path: /5b8a4477-4d87-43a1-aa52-b664b1bd9e08/images/test (02f4783b-2799-46d9-b787-53e4ccd9a052) [Invalid argument]
[2016-07-25 18:20:19.393357] W [fuse-bridge.c:2311:fuse_writev_cbk] 0-glusterfs-fuse: 117: WRITE => -1 gfid=02f4783b-2799-46d9-b787-53e4ccd9a052 fd=0x7f5fec0ba08c (Invalid argument)
[2016-07-25 18:20:19.393389] W [fuse-bridge.c:2311:fuse_writev_cbk] 0-glusterfs-fuse: 118: WRITE => -1 gfid=02f4783b-2799-46d9-b787-53e4ccd9a052 fd=0x7f5fec0ba08c (Invalid argument)
[2016-07-25 18:20:19.393611] W [fuse-bridge.c:2311:fuse_writev_cbk] 0-glusterfs-fuse: 119: WRITE => -1 gfid=02f4783b-2799-46d9-b787-53e4ccd9a052 fd=0x7f5fec0ba08c (Invalid argument)
[2016-07-25 18:20:19.393708] W [fuse-bridge.c:2311:fuse_writev_cbk] 0-glusterfs-fuse: 120: WRITE => -1 gfid=02f4783b-2799-46d9-b787-53e4ccd9a052 fd=0x7f5fec0ba08c (Invalid argument)
[2016-07-25 18:20:19.393771] W [fuse-bridge.c:2311:fuse_writev_cbk] 0-glusterfs-fuse: 121: WRITE => -1 gfid=02f4783b-2799-46d9-b787-53e4ccd9a052 fd=0x7f5fec0ba08c (Invalid argument)
[2016-07-25 18:20:19.393840] W [fuse-bridge.c:2311:fuse_writev_cbk] 0-glusterfs-fuse: 122: WRITE => -1 gfid=02f4783b-2799-46d9-b787-53e4ccd9a052 fd=0x7f5fec0ba08c (Invalid argument)
[2016-07-25 18:20:19.393914] W [fuse-bridge.c:2311:fuse_writev_cbk] 0-glusterfs-fuse: 123: WRITE => -1 gfid=02f4783b-2799-46d9-b787-53e4ccd9a052 fd=0x7f5fec0ba08c (Invalid argument)
[2016-07-25 18:20:19.393982] W [fuse-bridge.c:2311:fuse_writev_cbk] 0-glusterfs-fuse: 124: WRITE => -1 gfid=02f4783b-2799-46d9-b787-53e4ccd9a052 fd=0x7f5fec0ba08c (Invalid argument)
[2016-07-25 18:20:19.394045] W [fuse-bridge.c:709:fuse_truncate_cbk] 0-glusterfs-fuse: 125: FTRUNCATE() ERR => -1 (Invalid argument)
[2016-07-25 18:20:19.394338] W [fuse-bridge.c:1290:fuse_err_cbk] 0-glusterfs-fuse: 126: FLUSH() ERR => -1 (Invalid argument)
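
For reference, the "flags: 16385" value in the posix_open error above decodes, assuming the Linux x86-64 open(2) flag constants, to O_WRONLY|O_DIRECT. A minimal standalone sketch (not part of this report) that does the decoding:

/* decode_open_flags.c - decode the "flags: 16385" value from the posix_open
 * error above. Standalone sketch; assumes the Linux x86-64 open(2) flag
 * constants. Build: cc -o decode_open_flags decode_open_flags.c
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

int main (void)
{
        int flags = 16385; /* value copied from the brick log line above */

        printf ("flags %d = 0x%x ->%s%s%s\n", flags, flags,
                (flags & O_ACCMODE) == O_WRONLY ? " O_WRONLY" : "",
                (flags & O_DIRECT) ? " O_DIRECT" : "",
                (flags & O_CREAT) ? " O_CREAT" : "");
        return 0;
}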

Comment 2 David 2016-07-27 14:54:22 UTC
I have also heard from others with this issue that the problem exists in 3.8.x as well. I haven't tested that myself, as my environment is still on 3.7.x.

Comment 3 David 2016-07-27 15:44:09 UTC
These are the full settings I usually apply and run with:


features.shard-block-size: 64MB
features.shard: on
performance.readdir-ahead: on
storage.owner-uid: 36
storage.owner-gid: 36
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: on
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
server.allow-insecure: on
cluster.self-heal-window-size: 1024
cluster.background-self-heal-count: 16
performance.strict-write-ordering: off
nfs.disable: on
nfs.addr-namelookup: off
nfs.enable-ino32: off

Comment 4 Krutika Dhananjay 2016-07-28 17:33:25 UTC
Hi,

The open() on these affected files seems to be returning ENOENT; however, as per the find command output you gave on the ovirt-users ML, both the file and its gfid handle exist on the backend, so the failure was not actually due to ENOENT. I looked at the posix code again, and there is evidence to suggest that the actual error code (the real reason for open() failing) is getting masked by the second open() under the hidden "unlink" directory:

30         if (fd->inode->ia_type == IA_IFREG) {                                    
 29                 _fd = open (real_path, fd->flags);                               
 28                 if (_fd == -1) {                          
 27                         POSIX_GET_FILE_UNLINK_PATH (priv->base_path,             
 26                                                     fd->inode->gfid,             
 25                                                     unlink_path);                
 24                         _fd = open (unlink_path, fd->flags);                     
 23                 }                                                                
 22                 if (_fd == -1) {                                                 
 21                         op_errno = errno;                                        
 20                         gf_msg (this->name, GF_LOG_ERROR, op_errno,              
 19                                 P_MSG_READ_FAILED,                               
 18                                 "Failed to get anonymous "                       
 17                                 "real_path: %s _fd = %d", real_path, _fd);       
 16                         GF_FREE (pfd);                                           
 15                         pfd = NULL;                                              
 14                         goto out;                                                
 13                 }                                                                
 12         }                         

In your case, the open() on line 29, on .glusterfs/de/b6/deb61291-5176-4b81-8315-3f1cf8e3534d, failed for a reason other than ENOENT (it can't be ENOENT, because the find output already showed that the file exists). Line 27 is then executed. If the file exists at its real path, it must be absent from the "unlink" directory (the gfid handle can't be present in both places). So it is the open() on line 24 that is failing with ENOENT, and not the open() on line 29.

I'll be sending a patch to fix this problem.
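
To make the masking concrete, here is a self-contained sketch (plain C, not GlusterFS code; EACCES stands in for the real, non-ENOENT failure, and it assumes an unprivileged user) comparing the current unconditional fallback with an ENOENT-guarded one:

/* errno_masking_demo.c - standalone illustration of the pattern described
 * above: an unconditional fallback open() clobbers errno, so the logged
 * error becomes ENOENT even though the first open() failed for a different
 * reason. Build: cc -o errno_masking_demo errno_masking_demo.c
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* 'guarded' mirrors the proposed fix: retry under the second path only
 * when the first open() failed with ENOENT. */
static void open_with_fallback (const char *real_path,
                                const char *unlink_path, int guarded)
{
        int fd = open (real_path, O_RDONLY);

        if (fd == -1 && (!guarded || errno == ENOENT))
                fd = open (unlink_path, O_RDONLY); /* may overwrite errno */

        if (fd == -1)
                printf ("  logged error: %s\n", strerror (errno));
        else
                close (fd);
}

int main (void)
{
        /* The "real" path exists but is unreadable (EACCES for an
         * unprivileged user), standing in for a non-ENOENT failure;
         * the "unlink" path does not exist (ENOENT). */
        int t = open ("real_file", O_WRONLY | O_CREAT, 0000);
        if (t != -1)
                close (t);

        printf ("unconditional fallback (current code):\n");
        open_with_fallback ("real_file", "missing_unlink_file", 0); /* ENOENT */

        printf ("fallback only on ENOENT (proposed fix):\n");
        open_with_fallback ("real_file", "missing_unlink_file", 1); /* EACCES */

        unlink ("real_file");
        return 0;
}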

Meanwhile, in order to understand why the open on line 29 failed, could you attach all of your bricks to strace, run the test again, wait for it to fail, and then attach both the strace output files and the resultant glusterfs client and brick logs here?

# strace -ff -p <pid-of-the-brick> -o <path-where-you-want-to-capture-the-output>

Comment 5 Vijay Bellur 2016-07-29 06:18:18 UTC
REVIEW: http://review.gluster.org/15041 (storage/posix: Look for file in "unlink" dir IFF open on real-path fails with ENOENT) posted (#1) for review on release-3.7 by Krutika Dhananjay (kdhananj)

Comment 6 Vijay Bellur 2016-07-29 06:26:02 UTC
REVIEW: http://review.gluster.org/15041 (storage/posix: Look for file in "unlink" dir IFF open on real-path fails with ENOENT) posted (#2) for review on release-3.7 by Krutika Dhananjay (kdhananj)

Comment 7 David 2016-07-29 16:28:12 UTC
I can't shut down the cluster that some of the earlier logs provided on the mailing list come from to update Gluster until later this weekend.

I did run strace earlier while running the dd commands I mentioned in this report that fail when attempting to create sharded files.

Maybe they will be beneficial in some way until I can re-attempt the update on my running oVirt setup.

Comment 8 David 2016-07-29 16:29:02 UTC
Created attachment 1185606 [details]
strace run during failed dd

strace logs during failed file creation

Comment 9 David 2016-07-29 16:29:29 UTC
Created attachment 1185607 [details]
logs from running dd command

Comment 10 Krutika Dhananjay 2016-07-29 16:59:32 UTC
Thanks. That was very helpful.

<strace-output>
...
...
open("/gluster2/brick2/1/.glusterfs/13/fd/13fde185-8bcf-4747-bec9-a67f3495d65e", O_RDWR) = 17
...
...
open("/gluster2/brick2/1/.glusterfs/13/fd/13fde185-8bcf-4747-bec9-a67f3495d65e", O_RDWR|O_DIRECT) = -1 EINVAL (Invalid argument)
open("/gluster2/brick2/1/.glusterfs/unlink/13fde185-8bcf-4747-bec9-a67f3495d65e", O_RDWR|O_DIRECT) = -1 ENOENT (No such file or directory)
...
...
</strace-output>


From the above, it is clear that the open() is failing with EINVAL. Note that open() on the file with just O_RDWR succeeded, but when the same file was opened with O_DIRECT included in the flags, it failed with EINVAL.

I checked `man 2 open` to find out when the syscall returns EINVAL.

<man-page-excerpt>
...
...
       EINVAL The filesystem does not support the O_DIRECT flag.  See NOTES for more information.

       EINVAL Invalid value in flags.

       EINVAL O_TMPFILE was specified in flags, but neither O_WRONLY nor O_RDWR was specified.
...
...
</man-page-excerpt>

So it seems very likely that the EINVAL was due to O_DIRECT.

At this point I wanted to ask you this - does zfs (or the version of it you're using) support O_DIRECT?

-Krutika
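
A minimal standalone probe for this question (not from the report; it assumes a Linux host with a C compiler, and the path in the usage line is only an example based on the brick paths above):

/* odirect_probe.c - check whether the filesystem holding <path> accepts
 * O_DIRECT on open(). On filesystems without O_DIRECT support the open()
 * fails with EINVAL, matching the strace output above.
 * Build: cc -o odirect_probe odirect_probe.c
 * Usage example (path is illustrative only):
 *   ./odirect_probe /gluster2/brick2/1/odirect_probe_file
 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main (int argc, char **argv)
{
        if (argc != 2) {
                fprintf (stderr, "usage: %s <file-on-target-filesystem>\n",
                         argv[0]);
                return 2;
        }

        int fd = open (argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
        if (fd == -1) {
                printf ("open(O_RDWR|O_CREAT|O_DIRECT) failed: %s%s\n",
                        strerror (errno),
                        errno == EINVAL ?
                        " (O_DIRECT likely unsupported here)" : "");
                return 1;
        }

        printf ("open(O_RDWR|O_CREAT|O_DIRECT) succeeded; O_DIRECT accepted\n");
        close (fd);
        unlink (argv[1]);
        return 0;
}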

Comment 11 Pranith Kumar K 2016-07-29 17:27:50 UTC
(In reply to Krutika Dhananjay from comment #10)
> Thanks. That was very helpful.
> 
> <strace-output>
> ...
> ...
> open("/gluster2/brick2/1/.glusterfs/13/fd/13fde185-8bcf-4747-bec9-
> a67f3495d65e", O_RDWR) = 17
> ...
> ...
> open("/gluster2/brick2/1/.glusterfs/13/fd/13fde185-8bcf-4747-bec9-
> a67f3495d65e", O_RDWR|O_DIRECT) = -1 EINVAL (Invalid argument)
> open("/gluster2/brick2/1/.glusterfs/unlink/13fde185-8bcf-4747-bec9-
> a67f3495d65e", O_RDWR|O_DIRECT) = -1 ENOENT (No such file or directory)
> ...
> ...
> </strace-output>
> 
> 
> From the above, it is clear that the open() is failing with EINVAL. But if
> you notice, open() on the file with O_RDWR succeeded. But when the same file
> was open()'d with O_DIRECT flag included, it failed with EINVAL.
> 
> I checked `man 2 open` to find out when the syscall returns EINVAL.
> 
> <man-page-excerpt>
> ...
> ...
>        EINVAL The filesystem does not support the O_DIRECT flag.  See NOTES
> for more information.
> 
>        EINVAL Invalid value in flags.
> 
>        EINVAL O_TMPFILE was specified in flags, but neither O_WRONLY nor
> O_RDWR was specified.
> ...
> ...
> </man-page-excerpt>
> 
> So it seems very likely that the EINVAL was due to O_DIRECT.
> 
> At this point I wanted to ask you this - does zfs (or the version of it
> you're using) support O_DIRECT?

I think the mistake is mine: I didn't backport http://review.gluster.org/14215 to the 3.7 branch.

> 
> -Krutika

Comment 12 Vijay Bellur 2016-07-29 17:28:50 UTC
REVIEW: http://review.gluster.org/15050 (protocol/client: Filter o-direct in readv/writev) posted (#1) for review on release-3.7 by Pranith Kumar Karampuri (pkarampu)

Comment 13 Pranith Kumar K 2016-07-29 17:31:55 UTC
(In reply to Pranith Kumar K from comment #11)
> (In reply to Krutika Dhananjay from comment #10)
> > Thanks. That was very helpful.
> > 
> > <strace-output>
> > ...
> > ...
> > open("/gluster2/brick2/1/.glusterfs/13/fd/13fde185-8bcf-4747-bec9-
> > a67f3495d65e", O_RDWR) = 17
> > ...
> > ...
> > open("/gluster2/brick2/1/.glusterfs/13/fd/13fde185-8bcf-4747-bec9-
> > a67f3495d65e", O_RDWR|O_DIRECT) = -1 EINVAL (Invalid argument)
> > open("/gluster2/brick2/1/.glusterfs/unlink/13fde185-8bcf-4747-bec9-
> > a67f3495d65e", O_RDWR|O_DIRECT) = -1 ENOENT (No such file or directory)
> > ...
> > ...
> > </strace-output>
> > 
> > 
> > From the above, it is clear that the open() is failing with EINVAL. But if
> > you notice, open() on the file with O_RDWR succeeded. But when the same file
> > was open()'d with O_DIRECT flag included, it failed with EINVAL.
> > 
> > I checked `man 2 open` to find out when the syscall returns EINVAL.
> > 
> > <man-page-excerpt>
> > ...
> > ...
> >        EINVAL The filesystem does not support the O_DIRECT flag.  See NOTES
> > for more information.
> > 
> >        EINVAL Invalid value in flags.
> > 
> >        EINVAL O_TMPFILE was specified in flags, but neither O_WRONLY nor
> > O_RDWR was specified.
> > ...
> > ...
> > </man-page-excerpt>
> > 
> > So it seems very likely that the EINVAL was due to O_DIRECT.
> > 
> > At this point I wanted to ask you this - does zfs (or the version of it
> > you're using) support O_DIRECT?
> 
> I think the mistake is done by me. I didn't backport
> http://review.gluster.org/14215 to 3.7 branch.
> 
> > 
> > -Krutika
Oops, sorry. I think your question is still valid, i.e. the open with O_DIRECT shouldn't have failed!

Comment 14 Pranith Kumar K 2016-07-29 17:33:57 UTC
So basically, on ZFS we shouldn't have the remote-dio option set to off. For O_DIRECT to be filtered on reads/writes, we need http://review.gluster.org/14215 in the next 3.7.x release.

Comment 15 David 2016-07-29 17:38:05 UTC
With remote-dio on, the initial 64M file is written, but the files in .shard fail.

Comment 16 Pranith Kumar K 2016-07-29 17:53:23 UTC
(In reply to David from comment #15)
> With remote-dio on the intial 64M file is written, but the files in .shard
> fail.

Yes, that is because 3.7.13 is not filtering O_DIRECT in read/write of shards. Once the patch I mentioned above is merged, it will all work fine. But you must set remote-dio on.

Comment 17 David 2016-07-29 18:04:00 UTC
Ah, I see what you mean now, apologies. Usually I have it on. I may have had it off in the logs of one of the tests I submitted in response to a request on one of the mailing lists.

Comment 18 Vijay Bellur 2016-07-30 01:37:57 UTC
COMMIT: http://review.gluster.org/15050 committed in release-3.7 by Pranith Kumar Karampuri (pkarampu) 
------
commit 3492f539a21223798dcadbb92e24cb7eb6cbf154
Author: Pranith Kumar K <pkarampu>
Date:   Thu May 5 07:59:03 2016 +0530

    protocol/client: Filter o-direct in readv/writev
    
     >Change-Id: I519c666b3a7c0db46d47e08a6a7e2dbecc05edf2
     >BUG: 1322214
     >Signed-off-by: Pranith Kumar K <pkarampu>
     >Reviewed-on: http://review.gluster.org/14215
     >Smoke: Gluster Build System <jenkins.com>
     >NetBSD-regression: NetBSD Build System <jenkins.org>
     >CentOS-regression: Gluster Build System <jenkins.com>
     >Reviewed-by: Krutika Dhananjay <kdhananj>
     >(cherry picked from commit 74837896c38bafdd862f164d147b75fcbb619e8f)
    
    BUG: 1360785
    Pranith Kumar K <pkarampu>
    
    Change-Id: Ib4013b10598b0b988b9f9f163296b6afa425f8fd
    Reviewed-on: http://review.gluster.org/15050
    Tested-by: Pranith Kumar Karampuri <pkarampu>
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>

Comment 19 Krutika Dhananjay 2016-07-30 03:54:23 UTC
I guess one way to check whether ZFS supports O_DIRECT is to run the same test you ran on glusterfs again, only this time using ZFS directly to store the VM (keep the cache=none setting as it is). If open() fails with EINVAL, then the issue is very likely ZFS's support for O_DIRECT (or rather the lack of it).

Comment 20 Vijay Bellur 2016-07-30 12:29:43 UTC
COMMIT: http://review.gluster.org/15041 committed in release-3.7 by Atin Mukherjee (amukherj) 
------
commit 72db4ac5701185fc3115f115f18fb2250f3050f4
Author: Krutika Dhananjay <kdhananj>
Date:   Thu Jul 28 22:37:38 2016 +0530

    storage/posix: Look for file in "unlink" dir IFF open on real-path fails with ENOENT
    
            Backport of: http://review.gluster.org/#/c/15039/
    
    PROBLEM:
    In some of our users' setups, open() on the anon fd failed for
    a reason other than ENOENT. But this error code is getting masked
    by a subsequent open() under posix's hidden "unlink" directory, which
    will fail with ENOENT because the gfid handle still exists under .glusterfs.
    And the log message following the two open()s ends up logging ENOENT,
    causing much confusion.
    
    FIX:
    Look for the presence of the file under "unlink" ONLY if the open()
    on the real_path failed with ENOENT.
    
    Change-Id: Id68bbe98740eea9889b17f8ea3126ed45970d26f
    BUG: 1360785
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: http://review.gluster.org/15041
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>

Comment 21 Kaushal 2016-08-02 07:25:24 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.14, please open a new bug report.

glusterfs-3.7.14 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://www.gluster.org/pipermail/gluster-devel/2016-August/050319.html
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user

