Bug 1343178 - [Stress/Scale] : I/O errors out from gNFS mount points during high load on an erasure coded volume,Logs flooded with Error messages.
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: disperse
Version: rhgs-3.1
Hardware: x86_64
OS: Linux
Target Milestone: ---
Target Release: RHGS 3.2.0
Assignee: Pranith Kumar K
QA Contact: Ambarish
Blocks: 1343906 1351522 1360138 1360140
Reported: 2016-06-06 17:24 UTC by Ambarish
Modified: 2017-03-28 06:57 UTC

Fixed In Version: glusterfs-3.8.4-1
1343906
Last Closed: 2017-03-23 05:34:42 UTC

System | ID | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2017:0486 | normal | SHIPPED_LIVE | Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update | 2017-03-23 09:18:45 UTC

Description Ambarish 2016-06-06 17:24:16 UTC
Description of problem:

Started with a 1 x (4+2) volume, added bricks to scale it up to 3 x (4+2), and ran a rebalance after each add-brick.
In the meantime, I/O errored out on 2 of my clients:

dd: error writing ‘stress3’: Input/output error
8399+0 records in
8398+0 records out

Untarring the tarball failed as well.

Details about the sos reports, the exact workload, and the error messages from the logs are in the comments.
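
The dd counters above show 8399 blocks read but only 8398 written, so the write of block 8399 is the one that failed with EIO. As a rough illustration of catching the same condition on the application side, here is a minimal Python sketch of a dd-like writer. The function names `write_blocks` and `describe` are made up for this example and are not part of the reported workload.

```python
import errno
import os


def write_blocks(path, block_size, count):
    """Write `count` zero-filled blocks to `path`.

    Returns (blocks_fully_written, error_code). error_code is None on
    success, or the errno of the first failed write or fsync.
    """
    block = bytes(block_size)
    written = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for _ in range(count):
            view = memoryview(block)
            while view:
                # os.write may write fewer bytes than requested, so loop.
                view = view[os.write(fd, view):]
            written += 1
        # On network file systems, deferred write errors can surface at fsync.
        os.fsync(fd)
    except OSError as exc:
        return written, exc.errno
    finally:
        os.close(fd)
    return written, None


def describe(result):
    """Render a (written, errno) pair as a one-line summary."""
    written, err = result
    if err is None:
        return f"{written} blocks written"
    return f"failed after {written} blocks: {errno.errorcode.get(err, err)}"
```

Calling `describe(write_blocks(path, 1024 * 1024, 10000))` against a file on the mount would report either the full count or the block at which the first error appeared; the path, block size, and count here are placeholders.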

Version-Release number of selected component (if applicable):


How reproducible:

Reporting the first occurrence.

Steps to Reproduce:

1. Create an EC volume. Mount it on multiple clients via gNFS. Add bricks and rebalance.

2. Run all kinds of I/O from various mount points.

3. Check for errors in logs/application side.
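
The volume side of step 1 can be sketched as a sequence of gluster CLI invocations. The Python helper below only builds the argument lists and does not run them. The volume name, server names, and brick paths are hypothetical placeholders, and the gNFS mounts and client I/O from steps 1 and 2 are left out.

```python
import shlex


def ec_scale_commands(volume, servers, brick_dir, data=4, redundancy=2, subvolumes=3):
    """Return gluster CLI argv lists that create a (data + redundancy)
    disperse volume and grow it one subvolume at a time, rebalancing
    after every add-brick."""
    width = data + redundancy
    # Hypothetical brick naming: <server>:<brick_dir>/brick<N>, servers used round-robin.
    bricks = [
        f"{servers[i % len(servers)]}:{brick_dir}/brick{i + 1}"
        for i in range(width * subvolumes)
    ]
    cmds = [
        ["gluster", "volume", "create", volume,
         "disperse", str(width), "redundancy", str(redundancy), *bricks[:width]],
        ["gluster", "volume", "start", volume],
    ]
    for n in range(1, subvolumes):
        new_bricks = bricks[n * width:(n + 1) * width]
        cmds.append(["gluster", "volume", "add-brick", volume, *new_bricks])
        cmds.append(["gluster", "volume", "rebalance", volume, "start"])
    return cmds


if __name__ == "__main__":
    # Placeholder names; substitute real hosts and brick directories.
    hosts = [f"server{i}" for i in range(1, 7)]
    for argv in ec_scale_commands("testvol", hosts, "/bricks"):
        print(shlex.join(argv))
```

Since `rebalance ... start` returns before the rebalance finishes, a real run would presumably wait for `gluster volume rebalance <volume> status` to report completion before the next add-brick.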

Actual results:

I/O errors out.
Logs flooded with error messages.
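
To put a number on "flooded", the error lines can be counted per source location. The sketch below assumes the usual glusterfs log line layout of `[timestamp] LEVEL [MSGID: n] [file:line:function] ...`, with the MSGID part optional. The sample lines are invented for illustration and are not taken from this report's logs.

```python
import re
from collections import Counter

# Assumed layout: [timestamp] LEVEL [MSGID: n] [file:line:function] rest...
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z])\s+"
    r"(?:\[MSGID:\s*\d+\]\s+)?\[(?P<src>[^\]]+)\]"
)


def count_by_source(lines, level="E"):
    """Count log lines of the given level, keyed by file:line:function."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("level") == level:
            counts[match.group("src")] += 1
    return counts


# Made-up sample lines, only to exercise the parser.
SAMPLE = [
    "[2016-01-01 00:00:00.000001] E [MSGID: 100001] [example.c:10:example_fn] 0-vol-disperse-0: made-up error",
    "[2016-01-01 00:00:00.000002] E [MSGID: 100001] [example.c:10:example_fn] 0-vol-disperse-0: made-up error",
    "[2016-01-01 00:00:00.000003] W [other.c:5:other_fn] 0-vol-dht: made-up warning",
    "[2016-01-01 00:00:00.000004] E [other.c:7:other_err] 0-vol-dht: made-up error",
    "not a log line",
]

if __name__ == "__main__":
    for src, n in count_by_source(SAMPLE).most_common():
        print(n, src)
```

On a server, the same function could be fed an open log file, for example the gNFS log under /var/log/glusterfs/ (location assumed to be the default).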

Expected results:

I/Os on the application side should not be affected.

Additional info:

[root@gqas013 glusterfs]# gluster v info
Volume Name: drogon
Type: Distributed-Disperse
Volume ID: 6d49ee45-1048-4325-96fb-c14ac5e278e8
Status: Started
Number of Bricks: 3 x (4 + 2) = 18
Transport-type: tcp
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickA
Brick2: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickB
Brick3: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickC
Brick4: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickD
Brick5: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickE
Brick6: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickF
Brick7: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickG
Brick8: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickH
Brick9: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickI
Brick10: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickJ
Brick11: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickK
Brick12: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickL
Brick13: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickM
Brick14: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickN
Brick15: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickO
Brick16: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickP
Brick17: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickQ
Brick18: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickR
Options Reconfigured:
performance.readdir-ahead: on
[root@gqas013 glusterfs]# 
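
For reference, the "3 x (4 + 2) = 18" layout can be worked through in a few lines of Python. This is a sketch that assumes consecutive bricks in the listing form one disperse subvolume. The host list is copied from the output above with the domain suffix dropped.

```python
from collections import Counter


def split_subvolumes(bricks, data=4, redundancy=2):
    """Group a brick list into disperse subvolumes of (data + redundancy) bricks,
    assuming consecutive bricks in listing order belong to the same subvolume."""
    width = data + redundancy
    if len(bricks) % width:
        raise ValueError("brick count is not a multiple of data + redundancy")
    return [bricks[i:i + width] for i in range(0, len(bricks), width)]


# Host of Brick1..Brick18 as listed by `gluster v info` (domain omitted).
HOSTS = ["gqas013", "gqas011", "gqas005", "gqas006", "gqas013", "gqas011"] * 3

if __name__ == "__main__":
    subvols = split_subvolumes(HOSTS)
    print("subvolumes:", len(subvols))
    print("usable fraction of raw capacity:", 4 / (4 + 2))
    for index, hosts in enumerate(subvols):
        # Bricks per host inside one subvolume.
        print(index, dict(Counter(hosts)))
```

With this grouping, gqas013 and gqas011 each hold two bricks of every subvolume, while gqas005 and gqas006 hold one each.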

[root@gqas013 glusterfs]# gluster v status
Status of volume: drogon
Gluster process                             TCP Port  RDMA Port  Online  Pid
Brick gqas013.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickA                        49158     0          Y       6404 
Brick gqas011.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickB                        49158     0          Y       5683 
Brick gqas005.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickC                        49158     0          Y       5662 
Brick gqas006.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickD                        49158     0          Y       5655 
Brick gqas013.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickE                        49159     0          Y       6423 
Brick gqas011.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickF                        49159     0          Y       5702 
Brick gqas013.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickG                        49160     0          Y       6683 
Brick gqas011.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickH                        49160     0          Y       5898 
Brick gqas005.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickI                        49159     0          Y       5862 
Brick gqas006.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickJ                        49159     0          Y       5846 
Brick gqas013.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickK                        49161     0          Y       6702 
Brick gqas011.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickL                        49161     0          Y       5917 
Brick gqas013.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickM                        49162     0          Y       6875 
Brick gqas011.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickN                        49162     0          Y       6033 
Brick gqas005.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickO                        49160     0          Y       5985 
Brick gqas006.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickP                        49160     0          Y       5960 
Brick gqas013.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickQ                        49163     0          Y       6894 
Brick gqas011.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brickR                        49163     0          Y       6052 
NFS Server on localhost                     2049      0          Y       6914 
Self-heal Daemon on localhost               N/A       N/A        Y       6922 
NFS Server on gqas011.sbu.lab.eng.bos.redha
t.com                                       2049      0          Y       6072 
Self-heal Daemon on gqas011.sbu.lab.eng.bos
.redhat.com                                 N/A       N/A        Y       6080 
NFS Server on gqas006.sbu.lab.eng.bos.redha
t.com                                       2049      0          Y       5980 
Self-heal Daemon on gqas006.sbu.lab.eng.bos
.redhat.com                                 N/A       N/A        Y       5988 
NFS Server on gqas005.sbu.lab.eng.bos.redha
t.com                                       2049      0          Y       6005 
Self-heal Daemon on gqas005.sbu.lab.eng.bos
.redhat.com                                 N/A       N/A        Y       6013 
Task Status of Volume drogon
Task                 : Rebalance           
ID                   : 7b9744ff-5b16-4b38-8186-d44ecc07b0bf
Status               : completed           
[root@gqas013 glusterfs]#

Comment 6 Ambarish 2016-06-07 10:41:48 UTC
Tested with pure I/O, without any add-brick.
All my I/O (tar/dd) completed without errors (exit status 0).

Comment 11 Ambarish 2016-06-07 17:34:24 UTC
"Assertion Failed"  failure is tracked via https://bugzilla.redhat.com/show_bug.cgi?id=1343695.

Comment 12 Atin Mukherjee 2016-06-08 10:57:06 UTC
Upstream patch http://review.gluster.org/14669 posted for review.

Comment 22 Pranith Kumar K 2016-10-25 10:46:23 UTC
This is already fixed as part of the rebase:

commit c579303bfc4704187b1a41f658b8b3dc75b55c56
Author: Pranith Kumar K <pkarampu@redhat.com>
Date:   Tue Jun 7 21:27:10 2016 +0530

    storage/posix: Give correct errno for anon-fd operations
     >Change-Id: Ia9e61d3baa6881eb7dc03dd8ddb6bfdde5a01958
     >BUG: 1343906
     >Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
     >Reviewed-on: http://review.gluster.org/14669
     >Smoke: Gluster Build System <jenkins@build.gluster.org>
     >NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
     >CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
     >Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
     >(cherry picked from commit d5088c056d5aee1bda2997ad5835379465fed3a1)
    Change-Id: I8f4c26a2314766579aa03873deb8033c75944c0d
    BUG: 1360138
    Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
    Reviewed-on: http://review.gluster.org/15008
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    Reviewed-by: Krutika Dhananjay <kdhananj@redhat.com>

Comment 24 Ambarish 2017-01-12 10:43:06 UTC
On 3.8.4-11, I did not hit an EIO in 2 tries with the exact same workload I ran on 3.1.3, as long as I kept rm out of the picture.

EIO on rm -rf on EC is tracked via https://bugzilla.redhat.com/show_bug.cgi?id=1395161

I'll be running some scale tests on 3.2 and will reopen this if it pops up again.

Moving this to Verified for now.

Comment 26 errata-xmlrpc 2017-03-23 05:34:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHSA-2017:0486), and where to find the updated
files, see the errata entry listed above.

If the solution does not work for you, open a new bug report.


