Bug 1343178

Summary: [Stress/Scale]: I/O errors out from gNFS mount points during high load on an erasure-coded volume; logs flooded with error messages.
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Ambarish <asoman>
Component: disperse Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED ERRATA QA Contact: Ambarish <asoman>
Severity: high Docs Contact:
Priority: unspecified    
Version: rhgs-3.1 CC: amukherj, asoman, asrivast, nbalacha, rcyriac, rhinduja, rhs-bugs, sankarshan
Target Milestone: ---   
Target Release: RHGS 3.2.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: glusterfs-3.8.4-1 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1343906 (view as bug list) Environment:
Last Closed: 2017-03-23 05:34:42 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1343906, 1351522, 1360138, 1360140    

Description Ambarish 2016-06-06 17:24:16 UTC
Description of problem:
------------------------

Had a 1 x (4+2) volume. Added bricks and scaled it up to 3 x (4+2), running a rebalance after each add-brick.
In the meantime, I/O errored out on 2 of my clients:

dd: error writing ‘stress3’: Input/output error
8399+0 records in
8398+0 records out

Untarring the tarball failed as well.

Details about the sosreports, the exact workload, and the error messages from the logs are in the comments.


Version-Release number of selected component (if applicable):
-------------------------------------------------------------

3.7.9-8

How reproducible:
-----------------

Reporting the first occurrence.

Steps to Reproduce:
-------------------

1. Create an EC volume. Mount it on multiple clients via gNFS. Add bricks, then rebalance.

2. Run mixed I/O (e.g. dd, tar) from the various mount points.

3. Check for errors in the logs and on the application side.
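
The steps above can be sketched roughly as follows. Hostnames and brick paths here are placeholders (the actual cluster used the gqas* servers and bricks shown in the additional info below), and the command sequence would need repeating to reach 3 x (4+2):

```sh
# Create a 4+2 dispersed volume across six assumed servers.
gluster volume create drogon disperse 6 redundancy 2 \
    server{1..6}:/bricks/brick0
gluster volume start drogon

# On each client: mount over gNFS (NFSv3).
mount -t nfs -o vers=3 server1:/drogon /mnt/drogon

# Drive mixed I/O from several clients, e.g.:
dd if=/dev/urandom of=/mnt/drogon/stress3 bs=1M count=10240 &
tar -xf /mnt/drogon/linux.tar -C /mnt/drogon &

# While the I/O runs, scale out and rebalance:
gluster volume add-brick drogon server{1..6}:/bricks/brick1
gluster volume rebalance drogon start
```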

Actual results:
---------------

I/O errors out.
Logs flooded with error messages.

Expected results:
-----------------

I/Os on the application side should not be affected.

Additional info:
----------------

[root@gqas013 glusterfs]# gluster v info
 
Volume Name: drogon
Type: Distributed-Disperse
Volume ID: 6d49ee45-1048-4325-96fb-c14ac5e278e8
Status: Started
Number of Bricks: 3 x (4 + 2) = 18
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickA
Brick2: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickB
Brick3: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickC
Brick4: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickD
Brick5: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickE
Brick6: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickF
Brick7: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickG
Brick8: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickH
Brick9: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickI
Brick10: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickJ
Brick11: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickK
Brick12: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickL
Brick13: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickM
Brick14: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickN
Brick15: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickO
Brick16: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickP
Brick17: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickQ
Brick18: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickR
Options Reconfigured:
performance.readdir-ahead: on
[root@gqas013 glusterfs]# 


[root@gqas013 glusterfs]# gluster v status
Status of volume: drogon
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gqas013.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickA                        49158     0          Y       6404 
Brick gqas011.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickB                        49158     0          Y       5683 
Brick gqas005.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickC                        49158     0          Y       5662 
Brick gqas006.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickD                        49158     0          Y       5655 
Brick gqas013.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickE                        49159     0          Y       6423 
Brick gqas011.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickF                        49159     0          Y       5702 
Brick gqas013.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickG                        49160     0          Y       6683 
Brick gqas011.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickH                        49160     0          Y       5898 
Brick gqas005.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickI                        49159     0          Y       5862 
Brick gqas006.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickJ                        49159     0          Y       5846 
Brick gqas013.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickK                        49161     0          Y       6702 
Brick gqas011.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickL                        49161     0          Y       5917 
Brick gqas013.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickM                        49162     0          Y       6875 
Brick gqas011.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickN                        49162     0          Y       6033 
Brick gqas005.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickO                        49160     0          Y       5985 
Brick gqas006.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickP                        49160     0          Y       5960 
Brick gqas013.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickQ                        49163     0          Y       6894 
Brick gqas011.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickR                        49163     0          Y       6052 
NFS Server on localhost                     2049      0          Y       6914 
Self-heal Daemon on localhost               N/A       N/A        Y       6922 
NFS Server on gqas011.sbu.lab.eng.bos.redha
t.com                                       2049      0          Y       6072 
Self-heal Daemon on gqas011.sbu.lab.eng.bos
.redhat.com                                 N/A       N/A        Y       6080 
NFS Server on gqas006.sbu.lab.eng.bos.redha
t.com                                       2049      0          Y       5980 
Self-heal Daemon on gqas006.sbu.lab.eng.bos
.redhat.com                                 N/A       N/A        Y       5988 
NFS Server on gqas005.sbu.lab.eng.bos.redha
t.com                                       2049      0          Y       6005 
Self-heal Daemon on gqas005.sbu.lab.eng.bos
.redhat.com                                 N/A       N/A        Y       6013 
 
Task Status of Volume drogon
------------------------------------------------------------------------------
Task                 : Rebalance           
ID                   : 7b9744ff-5b16-4b38-8186-d44ecc07b0bf
Status               : completed           
 
[root@gqas013 glusterfs]#

Comment 6 Ambarish 2016-06-07 10:41:48 UTC
Tested with pure I/O without any add-brick.
All my I/O (tar/dd) completed with a zero exit status.

Comment 11 Ambarish 2016-06-07 17:34:24 UTC
"Assertion Failed"  failure is tracked via https://bugzilla.redhat.com/show_bug.cgi?id=1343695.

Comment 12 Atin Mukherjee 2016-06-08 10:57:06 UTC
Upstream patch http://review.gluster.org/14669 posted for review.

Comment 22 Pranith Kumar K 2016-10-25 10:46:23 UTC
This is already fixed as part of rebase:

commit c579303bfc4704187b1a41f658b8b3dc75b55c56
Author: Pranith Kumar K <pkarampu>
Date:   Tue Jun 7 21:27:10 2016 +0530

    storage/posix: Give correct errno for anon-fd operations
    
     >Change-Id: Ia9e61d3baa6881eb7dc03dd8ddb6bfdde5a01958
     >BUG: 1343906
     >Signed-off-by: Pranith Kumar K <pkarampu>
     >Reviewed-on: http://review.gluster.org/14669
     >Smoke: Gluster Build System <jenkins.org>
     >NetBSD-regression: NetBSD Build System <jenkins.org>
     >CentOS-regression: Gluster Build System <jenkins.org>
     >Reviewed-by: Raghavendra G <rgowdapp>
     >(cherry picked from commit d5088c056d5aee1bda2997ad5835379465fed3a1)
    
    Change-Id: I8f4c26a2314766579aa03873deb8033c75944c0d
    BUG: 1360138
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/15008
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Krutika Dhananjay <kdhananj>

Comment 24 Ambarish 2017-01-12 10:43:06 UTC
I did not hit an EIO on 3.8.4-11 in 2 tries, as long as I kept rm out of the picture, on the exact same workload I had tried with 3.1.3.

EIO on rm -rf on EC is tracked via https://bugzilla.redhat.com/show_bug.cgi?id=1395161

I'll be doing some scale tests on 3.2 and will reopen if this pops up again.

Moving this to Verified for now.

Comment 26 errata-xmlrpc 2017-03-23 05:34:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html