Bug 1361300 - Direct io to sharded files fails when on zfs backend
Summary: Direct io to sharded files fails when on zfs backend
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: posix
Version: mainline
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Krutika Dhananjay
QA Contact:
URL:
Whiteboard:
Depends On: 1360785
Blocks: 1361449
 
Reported: 2016-07-28 17:35 UTC by Krutika Dhananjay
Modified: 2017-03-27 18:26 UTC
CC List: 5 users

Fixed In Version: glusterfs-3.9.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1360785
Clones: 1361449
Environment:
Last Closed: 2017-03-27 18:26:17 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments

Description Krutika Dhananjay 2016-07-28 17:35:08 UTC
+++ This bug was initially created as a clone of Bug #1360785 +++

Beginning with 3.7.12 and 3.7.13, direct I/O to sharded files fails when the bricks are backed by ZFS.

How reproducible: Always


Steps to Reproduce:
1. ZFS-backed bricks, default settings except xattr=sa
2. GlusterFS 3.7.12+ with sharding enabled
3. dd if=/dev/zero of=/rhev/data-center/mnt/glusterSD/192.168.71.11\:_glustershard/81e19cd3-ae45-449c-b716-ec3e4ad4c2f0/images/test oflag=direct count=100 bs=1M

Actual results: dd: error writing ‘/rhev/data-center/mnt/glusterSD/192.168.71.11:_glustershard/81e19cd3-ae45-449c-b716-ec3e4ad4c2f0/images/test’: Operation not permitted

The file 'test' is created with its size equal to the shard block size; the shard files created under .shard are 0 bytes.


Expected results: 
100+0 records in
100+0 records out
104857600 bytes etc.....


Additional info:
Using Proxmox, users have been able to work around the issue by changing disk caching from 'none' to writethrough/writeback. Not sure this would help with oVirt, as the Python script that checks storage with dd and oflag=direct also fails.

Attaching the client and brick logs from the test.

--- Additional comment from David on 2016-07-27 09:36:52 EDT ---

On the oVirt mailing list I was asked to test these settings:

i. Set network.remote-dio to off
        # gluster volume set <VOL> network.remote-dio off

ii. Set performance.strict-o-direct to on
        # gluster volume set <VOL> performance.strict-o-direct on

results:

dd if=/dev/zero of=/rhev/data-center/mnt/glusterSD/192.168.71.10\:_glustershard/5b8a4477-4d87-43a1-aa52-b664b1bd9e08/images/test oflag=direct count=100 bs=1M
dd: error writing ‘/rhev/data-center/mnt/glusterSD/192.168.71.10:_glustershard/5b8a4477-4d87-43a1-aa52-b664b1bd9e08/images/test’: Invalid argument
dd: closing output file ‘/rhev/data-center/mnt/glusterSD/192.168.71.10:_glustershard/5b8a4477-4d87-43a1-aa52-b664b1bd9e08/images/test’: Invalid argument


[2016-07-25 18:20:19.393121] E [MSGID: 113039] [posix.c:2939:posix_open] 0-glustershard-posix: open on /gluster2/brick1/1/.glusterfs/02/f4/02f4783b-2799-46d9-b787-53e4ccd9a052, flags: 16385 [Invalid argument]
[2016-07-25 18:20:19.393204] E [MSGID: 115070] [server-rpc-fops.c:1568:server_open_cbk] 0-glustershard-server: 120: OPEN /5b8a4477-4d87-43a1-aa52-b664b1bd9e08/images/test (02f4783b-2799-46d9-b787-53e4ccd9a052) ==> (Invalid argument) [Invalid argument]
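
For reference, the "flags: 16385" in the posix_open error above corresponds on Linux x86_64 to O_WRONLY | O_DIRECT (O_WRONLY = 01, O_DIRECT = 040000), i.e. the brick is rejecting the direct-I/O open with EINVAL. A minimal illustrative check, assuming a Linux/glibc build environment:

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>

        int main (void)
        {
                /* flags value taken from the posix_open log line above */
                int flags = 16385;

                printf ("write-only: %s, O_DIRECT: %s\n",
                        (flags & O_ACCMODE) == O_WRONLY ? "yes" : "no",
                        (flags & O_DIRECT) ? "yes" : "no");
                return 0;
        }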


and /var/log/glusterfs/rhev-data-center-mnt-glusterSD-192.168.71.10\:_glustershard.log
[2016-07-25 18:20:19.393275] E [MSGID: 114031] [client-rpc-fops.c:466:client3_3_open_cbk] 0-glustershard-client-0: remote operation failed. Path: /5b8a4477-4d87-43a1-aa52-b664b1bd9e08/images/test (02f4783b-2799-46d9-b787-53e4ccd9a052) [Invalid argument]
[2016-07-25 18:20:19.393270] E [MSGID: 114031] [client-rpc-fops.c:466:client3_3_open_cbk] 0-glustershard-client-1: remote operation failed. Path: /5b8a4477-4d87-43a1-aa52-b664b1bd9e08/images/test (02f4783b-2799-46d9-b787-53e4ccd9a052) [Invalid argument]
[2016-07-25 18:20:19.393317] E [MSGID: 114031] [client-rpc-fops.c:466:client3_3_open_cbk] 0-glustershard-client-2: remote operation failed. Path: /5b8a4477-4d87-43a1-aa52-b664b1bd9e08/images/test (02f4783b-2799-46d9-b787-53e4ccd9a052) [Invalid argument]
[2016-07-25 18:20:19.393357] W [fuse-bridge.c:2311:fuse_writev_cbk] 0-glusterfs-fuse: 117: WRITE => -1 gfid=02f4783b-2799-46d9-b787-53e4ccd9a052 fd=0x7f5fec0ba08c (Invalid argument)
[2016-07-25 18:20:19.393389] W [fuse-bridge.c:2311:fuse_writev_cbk] 0-glusterfs-fuse: 118: WRITE => -1 gfid=02f4783b-2799-46d9-b787-53e4ccd9a052 fd=0x7f5fec0ba08c (Invalid argument)
[2016-07-25 18:20:19.393611] W [fuse-bridge.c:2311:fuse_writev_cbk] 0-glusterfs-fuse: 119: WRITE => -1 gfid=02f4783b-2799-46d9-b787-53e4ccd9a052 fd=0x7f5fec0ba08c (Invalid argument)
[2016-07-25 18:20:19.393708] W [fuse-bridge.c:2311:fuse_writev_cbk] 0-glusterfs-fuse: 120: WRITE => -1 gfid=02f4783b-2799-46d9-b787-53e4ccd9a052 fd=0x7f5fec0ba08c (Invalid argument)
[2016-07-25 18:20:19.393771] W [fuse-bridge.c:2311:fuse_writev_cbk] 0-glusterfs-fuse: 121: WRITE => -1 gfid=02f4783b-2799-46d9-b787-53e4ccd9a052 fd=0x7f5fec0ba08c (Invalid argument)
[2016-07-25 18:20:19.393840] W [fuse-bridge.c:2311:fuse_writev_cbk] 0-glusterfs-fuse: 122: WRITE => -1 gfid=02f4783b-2799-46d9-b787-53e4ccd9a052 fd=0x7f5fec0ba08c (Invalid argument)
[2016-07-25 18:20:19.393914] W [fuse-bridge.c:2311:fuse_writev_cbk] 0-glusterfs-fuse: 123: WRITE => -1 gfid=02f4783b-2799-46d9-b787-53e4ccd9a052 fd=0x7f5fec0ba08c (Invalid argument)
[2016-07-25 18:20:19.393982] W [fuse-bridge.c:2311:fuse_writev_cbk] 0-glusterfs-fuse: 124: WRITE => -1 gfid=02f4783b-2799-46d9-b787-53e4ccd9a052 fd=0x7f5fec0ba08c (Invalid argument)
[2016-07-25 18:20:19.394045] W [fuse-bridge.c:709:fuse_truncate_cbk] 0-glusterfs-fuse: 125: FTRUNCATE() ERR => -1 (Invalid argument)
[2016-07-25 18:20:19.394338] W [fuse-bridge.c:1290:fuse_err_cbk] 0-glusterfs-fuse: 126: FLUSH() ERR => -1 (Invalid argument)

--- Additional comment from David on 2016-07-27 10:54:22 EDT ---

I have also heard from others hitting this issue that the problem exists in 3.8.x as well. I have not tested it myself, as my environment is still on 3.7.x.

--- Additional comment from David on 2016-07-27 11:44:09 EDT ---

These are the full settings I usually apply and run with:


features.shard-block-size: 64MB
features.shard: on
performance.readdir-ahead: on
storage.owner-uid: 36
storage.owner-gid: 36
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: on
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
server.allow-insecure: on
cluster.self-heal-window-size: 1024
cluster.background-self-heal-count: 16
performance.strict-write-ordering: off
nfs.disable: on
nfs.addr-namelookup: off
nfs.enable-ino32: off

--- Additional comment from Krutika Dhananjay on 2016-07-28 13:33:25 EDT ---

Hi,

Open() on these affected files seems to be returning ENOENT. However, as per the find command output you gave on the ovirt-users ML, both the file and its gfid handle exist on the backend, so the failure was not actually due to ENOENT. I looked at the code in posix again and there is evidence to suggest that the actual error code (the real reason for open() failing) is being masked by the subsequent open() under the hidden "unlink" directory:

30         if (fd->inode->ia_type == IA_IFREG) {                                    
 29                 _fd = open (real_path, fd->flags);                               
 28                 if (_fd == -1) {                          
 27                         POSIX_GET_FILE_UNLINK_PATH (priv->base_path,             
 26                                                     fd->inode->gfid,             
 25                                                     unlink_path);                
 24                         _fd = open (unlink_path, fd->flags);                     
 23                 }                                                                
 22                 if (_fd == -1) {                                                 
 21                         op_errno = errno;                                        
 20                         gf_msg (this->name, GF_LOG_ERROR, op_errno,              
 19                                 P_MSG_READ_FAILED,                               
 18                                 "Failed to get anonymous "                       
 17                                 "real_path: %s _fd = %d", real_path, _fd);       
 16                         GF_FREE (pfd);                                           
 15                         pfd = NULL;                                              
 14                         goto out;                                                
 13                 }                                                                
 12         }                         

In your case, on line 29, the open on .glusterfs/de/b6/deb61291-5176-4b81-8315-3f1cf8e3534d failed for a reason other than ENOENT (it can't be ENOENT because we already saw from the find output that the file exists). Then line 27 is executed. If the file exists at its real path, it must be absent from the "unlink" directory (the gfid handle can't be present in both places). So it is the open() on line 24 that is failing with ENOENT, and not the open on line 29.

I'll be sending a patch to fix this problem.

Meanwhile, in order to understand why the open on line 29 failed, could you attach strace to each of your brick processes, run the test again, wait for it to fail, and then attach both the strace output files and the resulting glusterfs client and brick logs here?

# strace -ff -p <pid-of-the-brick> -o <path-where-you-want-to-capture-the-output>

Comment 1 Vijay Bellur 2016-07-28 17:43:34 UTC
REVIEW: http://review.gluster.org/15039 (storage/posix: Look for file in .unlink IFF open on real-path fails with ENOENT) posted (#1) for review on master by Krutika Dhananjay (kdhananj)

Comment 2 Vijay Bellur 2016-07-29 03:16:48 UTC
REVIEW: http://review.gluster.org/15039 (storage/posix: Look for file in "unlink" dir IFF open on real-path fails with ENOENT) posted (#2) for review on master by Krutika Dhananjay (kdhananj)

Comment 3 Vijay Bellur 2016-07-29 07:05:04 UTC
COMMIT: http://review.gluster.org/15039 committed in master by Pranith Kumar Karampuri (pkarampu) 
------
commit b20b3c68d32f6dba3f7e003f3ad9b86220d9fa25
Author: Krutika Dhananjay <kdhananj>
Date:   Thu Jul 28 22:37:38 2016 +0530

    storage/posix: Look for file in "unlink" dir IFF open on real-path fails with ENOENT
    
    PROBLEM:
    In some of our users' setups, open() on the anon fd failed for
    a reason other than ENOENT. But this error code is getting masked
    by a subsequent open() under posix's hidden "unlink" directory, which
    will fail with ENOENT because the gfid handle still exists under .glusterfs.
    And the log message following the two open()s ends up logging ENOENT,
    causing much confusion.
    
    FIX:
    Look for the presence of the file under "unlink" ONLY if the open()
    on the real_path failed with ENOENT.
    
    Change-Id: Ifb674dc7123f79fec9a9ac1ec7bf832f4df17627
    BUG: 1361300
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: http://review.gluster.org/15039
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
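
For illustration, a minimal sketch of the behaviour the FIX paragraph above describes, reusing the names from the snippet quoted in the description (a sketch under those assumptions, not the literal patch; see the Gerrit review for the actual change): fall back to the hidden "unlink" directory only when the open on the real path fails with ENOENT, so the original errno is preserved otherwise.

        if (fd->inode->ia_type == IA_IFREG) {
                _fd = open (real_path, fd->flags);
                if (_fd == -1 && errno == ENOENT) {
                        /* The gfid handle is genuinely missing from the real
                         * path; only now look under the hidden "unlink"
                         * directory. A failure such as EINVAL from an
                         * O_DIRECT open is no longer masked by a second,
                         * unrelated ENOENT. */
                        POSIX_GET_FILE_UNLINK_PATH (priv->base_path,
                                                    fd->inode->gfid,
                                                    unlink_path);
                        _fd = open (unlink_path, fd->flags);
                }
                if (_fd == -1) {
                        op_errno = errno;
                        gf_msg (this->name, GF_LOG_ERROR, op_errno,
                                P_MSG_READ_FAILED,
                                "Failed to get anonymous "
                                "real_path: %s _fd = %d", real_path, _fd);
                        GF_FREE (pfd);
                        pfd = NULL;
                        goto out;
                }
        }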

Comment 4 Krutika Dhananjay 2016-11-17 05:18:16 UTC
The patch was merged a long time ago (29 July 2016). Moving the bug to MODIFIED state.

Comment 5 Shyamsundar 2017-03-27 18:26:17 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.9.0, please open a new bug report.

glusterfs-3.9.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2016-November/029281.html
[2] https://www.gluster.org/pipermail/gluster-users/

