Bug 1343906

Summary: [Stress/Scale] : I/O errors out from gNFS mount points during high load on an erasure coded volume,Logs flooded with Error messages.
Product: [Community] GlusterFS Reporter: Pranith Kumar K <pkarampu>
Component: posixAssignee: Pranith Kumar K <pkarampu>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: high Docs Contact:
Priority: unspecified    
Version: mainlineCC: asoman, bugs, jahernan, nbalacha, pkarampu, rcyriac, rhs-bugs
Target Milestone: ---Keywords: Triaged
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: glusterfs-3.9.0 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1343178
: 1360138 (view as bug list) Environment:
Last Closed: 2017-03-27 18:24:03 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1343178    
Bug Blocks: 1343695, 1360138, 1360140    

Comment 1 Xavi Hernandez 2016-06-08 10:30:19 UTC
Could you please add a description of the environment in which this happens, details of the procedure and the error messages that appear in the logs ?

Comment 2 Vijay Bellur 2016-06-08 10:53:26 UTC
REVIEW: http://review.gluster.org/14669 (storage/posix: Give correct errno for anon-fd operations) posted (#1) for review on master by Pranith Kumar Karampuri (pkarampu)

Comment 3 Pranith Kumar K 2016-06-08 10:56:19 UTC
Description of problem:
------------------------

Had a 1*(4+2) volume.Added bricks,scaled it till 3*(4+2).Ran rebalance each time.
In the meantime,I/O errored out on 2 of my clients :

dd: error writing ‘stress3’: Input/output error
8399+0 records in
8398+0 records out

Untaring the tarball failed as well.

Details about sos,the exact workload as well as error messages from logs in comments

How reproducible:
-----------------

Reporting the first occurrence.

Steps to Reproduce:
-------------------

1. Create an EC volume.Mount it on multiple clients via gNFS.Add bricks,rebalance.

2. Run all kinds of I/O from various mount points

3. Check for errors in logs/application side.

Actual results:
---------------

I/O errors out.
Logs flooded with error messages.

Expected results:
-----------------

I/Os on the application side should not be affected.

Additional info:
----------------

[root@gqas013 glusterfs]# gluster v info
 
Volume Name: drogon
Type: Distributed-Disperse
Volume ID: 6d49ee45-1048-4325-96fb-c14ac5e278e8
Status: Started
Number of Bricks: 3 x (4 + 2) = 18
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickA
Brick2: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickB
Brick3: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickC
Brick4: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickD
Brick5: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickE
Brick6: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickF
Brick7: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickG
Brick8: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickH
Brick9: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickI
Brick10: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickJ
Brick11: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickK
Brick12: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickL
Brick13: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickM
Brick14: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickN
Brick15: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickO
Brick16: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickP
Brick17: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickQ
Brick18: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickR
Options Reconfigured:
performance.readdir-ahead: on
[root@gqas013 glusterfs]# 
A ) From the rebalance log :

I see a lot of assertion failed messages here :

[2016-06-06 15:10:53.999653] E [ec-inode-write.c:395:ec_manager_setattr] (-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/disperse.so(ec_resume+0x91) [0x7fd263b5e621] -->/usr/lib64/glusterfs/3.7.9/xlator/cluster/disperse.so(__ec_manager+0x57) [0x7fd263b5e807] -->/usr/lib64/glusterfs/3.7.9/xlator/cluster/disperse.so(ec_manager_setattr+0x2c6) [0x7fd263b7be76] ) 0-: Assertion failed: ec_get_inode_size(fop, fop->locks[0].lock->loc.inode, &cbk->iatt[0].ia_size)
[2016-06-06 15:10:54.003509] E [ec-inode-write.c:395:ec_manager_setattr] (-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/disperse.so(ec_resume+0x91) [0x7fd263b5e621] -->/usr/lib64/glusterfs/3.7.9/xlator/cluster/disperse.so(__ec_manager+0x57) [0x7fd263b5e807] -->/usr/lib64/glusterfs/3.7.9/xlator/cluster/disperse.so(ec_manager_setattr+0x2c6) [0x7fd263b7be76] ) 0-: Assertion failed: ec_get_inode_size(fop, fop->locks[0].lock->loc.inode, &cbk->iatt[0].ia_size)
[2016-06-06 15:10:54.012540] E [ec-inode-write.c:395:ec_manager_setattr] (-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/disperse.so(ec_resume+0x91) [0x7fd263b5e621] -->/usr/lib64/glusterfs/3.7.9/xlator/cluster/disperse.so(__ec_manager+0x57) [0x7fd263b5e807] -->/usr/lib64/glusterfs/3.7.9/xlator/cluster/disperse.so(ec_manager_setattr+0x2c6) [0x7fd263b7be76] ) 0-: Assertion failed: ec_get_inode_size(fop, fop->locks[0].lock->loc.inode, &cbk->iatt[0].ia_size)
[2016-06-06 15:13:15.333800] E [MSGID: 109023] [dht-rebalance.c:2330:gf_defrag_get_entry] 0-drogon-dht: Migrate file failed:/d4/linux-4.6.1/Documentation/arm/Porting lookup failed


B ) The brick logs are literally flooded with bad file descriptor errors :

bricks/bricks-testvol_brickA.log:[2016-06-06 15:09:58.275139] E [MSGID: 115073] [server-rpc-fops.c:1907:server_fxattrop_cbk] 0-drogon-server: 16513: FXATTROP -2 (c812a0d6-a85d-41f6-a7e6-51630285ec60) ==> (File descriptor in bad state) [File descriptor in bad state]
bricks/bricks-testvol_brickA.log:[2016-06-06 15:09:58.275220] E [MSGID: 113040] [posix-helpers.c:1686:__posix_fd_ctx_get] 0-drogon-posix: Failed to get anonymous real_path: /bricks/testvol_brickA/.glusterfs/14/c7/14c711cb-11c7-4c8e-8893-de71849e9eb9 _fd = -1 [No such file or directory]
bricks/bricks-testvol_brickA.log:[2016-06-06 15:09:58.275290] E [MSGID: 115073] [server-rpc-fops.c:1907:server_fxattrop_cbk] 0-drogon-server: 16514: FXATTROP -2 (14c711cb-11c7-4c8e-8893-de71849e9eb9) ==> (File descriptor in bad state) [File descriptor in bad state]
bricks/bricks-testvol_brickA.log:[2016-06-06 15:09:58.275393] E [MSGID: 113040] [posix-helpers.c:1686:__posix_fd_ctx_get] 0-drogon-posix: Failed to get anonymous real_path: /bricks/testvol_brickA/.glusterfs/80/3b/803b05cb-4e6f-457c-88b1-2a608530757c _fd = -1 [No such file or directory]
bricks/bricks-testvol_brickA.log:[2016-06-06 15:09:58.275468] E [MSGID: 115073] [server-rpc-fops.c:1907:server_fxattrop_cbk] 0-drogon-server: 16515: FXATTROP -2 (803b05cb-4e6f-457c-88b1-2a608530757c) ==> (File descriptor in bad state) [File descriptor in bad state]
bricks/bricks-testvol_brickA.log:[2016-06-06 15:09:58.275652] E [MSGID: 113040] [posix-helpers.c:1686:__posix_fd_ctx_get] 0-drogon-posix: Failed to get anonymous real_path: /bricks/testvol_brickA/.glusterfs/8b/fe/8bfe4154-b137-4a84-a265-b541574adca9 _fd = -1 [No such file or directory]
bricks/bricks-testvol_brickA.log:[2016-06-06 15:09:58.275714] E [MSGID: 115073] [server-rpc-fops.c:1907:server_fxattrop_cbk] 0-drogon-server: 16516: FXATTROP -2 (8bfe4154-b137-4a84-a265-b541574adca9) ==> (File descriptor in bad state) [File descriptor in bad state]
bricks/bricks-testvol_brickA.log:[2016-06-06 15:09:58.275786] E [MSGID: 113040] [posix-helpers.c:1686:__posix_fd_ctx_get] 0-drogon-posix: Failed to get anonymous real_path: /bricks/testvol_brickA/.glusterfs/49/c3/49c3cbec-3b03-4ec2-968a-89e8bda401c6 _fd = -1 [No such file or directory]
bricks/bricks-testvol_brickA.log:[2016-06-06 15:09:58.275851] E [MSGID: 115073] [server-rpc-fops.c:1907:server_fxattrop_cbk] 0-drogon-server: 16518: FXATTROP -2 (49c3cbec-3b03-4ec2-968a-89e8bda401c6) ==> (File descriptor in bad state) [File descriptor in bad state]
bricks/bricks-testvol_brickA.log:[2016-06-06 15:09:58.275979] E [MSGID: 113040] [posix-helpers.c:1686:__posix_fd_ctx_get] 0-drogon-posix: Failed to get anonymous real_path: /bricks/testvol_brickA/.glusterfs/f8/6f/f86fe79c-7ad0-4d9c-8e5d-26ec63f14696 _fd = -1 [No such file or directory]
bricks/bricks-testvol_brickA.log:[2016-06-06 15:09:58.276039] E [MSGID: 115073] [server-rpc-fops.c:1907:server_fxattrop_cbk] 0-drogon-server: 16519: FXATTROP -2 (f86fe79c-7ad0-4d9c-8e5d-26ec63f14696) ==> (File descriptor in bad state) [File descriptor in bad state]
bricks/bricks-testvol_brickA.log:[2016-06-06 15:09:58.276223] E [MSGID: 113040] [posix-helpers.c:1686:__posix_fd_ctx_get] 0-drogon-posix: Failed to get anonymous real_path: /bricks/testvol_brickA/.glusterfs/c4/8a/c48aff49-de3a-46e8-9445-8064c4ca19c5 _fd = -1 [No such file or directory]

Comment 4 Pranith Kumar K 2016-06-08 11:00:21 UTC
Xavi,
        I only fixed the EBADFD part of the bug. We still need to find why Assertions are coming as well as the reason for I/O failure.

Pranith

Comment 5 Xavi Hernandez 2016-06-08 11:12:16 UTC
That assertion should be solved by http://review.gluster.org/14112. It was caused by the execution of a fop that required an inodelk while the inodelk was not really taken.

Comment 6 Pranith Kumar K 2016-06-08 14:12:41 UTC
That build had this patch :-/

Comment 7 Vijay Bellur 2016-06-24 11:42:31 UTC
REVIEW: http://review.gluster.org/14669 (storage/posix: Give correct errno for anon-fd operations) posted (#2) for review on master by Pranith Kumar Karampuri (pkarampu)

Comment 8 Vijay Bellur 2016-06-27 06:18:16 UTC
COMMIT: http://review.gluster.org/14669 committed in master by Pranith Kumar Karampuri (pkarampu) 
------
commit d5088c056d5aee1bda2997ad5835379465fed3a1
Author: Pranith Kumar K <pkarampu>
Date:   Tue Jun 7 21:27:10 2016 +0530

    storage/posix: Give correct errno for anon-fd operations
    
    Change-Id: Ia9e61d3baa6881eb7dc03dd8ddb6bfdde5a01958
    BUG: 1343906
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/14669
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Raghavendra G <rgowdapp>

Comment 9 Shyamsundar 2017-03-27 18:24:03 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.9.0, please open a new bug report.

glusterfs-3.9.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2016-November/029281.html
[2] https://www.gluster.org/pipermail/gluster-users/