Description of problem: ------------------------ Had a 1*(4+2) volume.Added bricks,scaled it till 3*(4+2).Ran rebalance each time. In the meantime,I/O errored out on 2 of my clients : dd: error writing ‘stress3’: Input/output error 8399+0 records in 8398+0 records out Untaring the tarball failed as well. Details about sos,the exact workload as well as error messages from logs in comments Version-Release number of selected component (if applicable): ------------------------------------------------------------- 3.7.9-8 How reproducible: ----------------- Reporting the first occurrence. Steps to Reproduce: ------------------- 1. Create an EC volume.Mount it on multiple clients via gNFS.Add bricks,rebalance. 2. Run all kinds of I/O from various mount points 3. Check for errors in logs/application side. Actual results: --------------- I/O errors out. Logs flooded with error messages. Expected results: ----------------- I/Os on the application side should not be affected. Additional info: ---------------- [root@gqas013 glusterfs]# gluster v info Volume Name: drogon Type: Distributed-Disperse Volume ID: 6d49ee45-1048-4325-96fb-c14ac5e278e8 Status: Started Number of Bricks: 3 x (4 + 2) = 18 Transport-type: tcp Bricks: Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickA Brick2: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickB Brick3: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickC Brick4: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickD Brick5: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickE Brick6: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickF Brick7: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickG Brick8: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickH Brick9: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickI Brick10: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickJ Brick11: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickK Brick12: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickL Brick13: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickM Brick14: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickN Brick15: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickO Brick16: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickP Brick17: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickQ Brick18: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brickR Options Reconfigured: performance.readdir-ahead: on [root@gqas013 glusterfs]# [root@gqas013 glusterfs]# gluster v status Status of volume: drogon Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick gqas013.sbu.lab.eng.bos.redhat.com:/b ricks/testvol_brickA 49158 0 Y 6404 Brick gqas011.sbu.lab.eng.bos.redhat.com:/b ricks/testvol_brickB 49158 0 Y 5683 Brick gqas005.sbu.lab.eng.bos.redhat.com:/b ricks/testvol_brickC 49158 0 Y 5662 Brick gqas006.sbu.lab.eng.bos.redhat.com:/b ricks/testvol_brickD 49158 0 Y 5655 Brick gqas013.sbu.lab.eng.bos.redhat.com:/b ricks/testvol_brickE 49159 0 Y 6423 Brick gqas011.sbu.lab.eng.bos.redhat.com:/b ricks/testvol_brickF 49159 0 Y 5702 Brick gqas013.sbu.lab.eng.bos.redhat.com:/b ricks/testvol_brickG 49160 0 Y 6683 Brick gqas011.sbu.lab.eng.bos.redhat.com:/b ricks/testvol_brickH 49160 0 Y 5898 Brick gqas005.sbu.lab.eng.bos.redhat.com:/b ricks/testvol_brickI 49159 0 Y 5862 Brick gqas006.sbu.lab.eng.bos.redhat.com:/b ricks/testvol_brickJ 49159 0 Y 5846 Brick gqas013.sbu.lab.eng.bos.redhat.com:/b ricks/testvol_brickK 49161 0 Y 6702 Brick gqas011.sbu.lab.eng.bos.redhat.com:/b ricks/testvol_brickL 49161 0 Y 5917 Brick gqas013.sbu.lab.eng.bos.redhat.com:/b ricks/testvol_brickM 49162 0 Y 6875 Brick gqas011.sbu.lab.eng.bos.redhat.com:/b ricks/testvol_brickN 49162 0 Y 6033 Brick gqas005.sbu.lab.eng.bos.redhat.com:/b ricks/testvol_brickO 49160 0 Y 5985 Brick gqas006.sbu.lab.eng.bos.redhat.com:/b ricks/testvol_brickP 49160 0 Y 5960 Brick gqas013.sbu.lab.eng.bos.redhat.com:/b ricks/testvol_brickQ 49163 0 Y 6894 Brick gqas011.sbu.lab.eng.bos.redhat.com:/b ricks/testvol_brickR 49163 0 Y 6052 NFS Server on localhost 2049 0 Y 6914 Self-heal Daemon on localhost N/A N/A Y 6922 NFS Server on gqas011.sbu.lab.eng.bos.redha t.com 2049 0 Y 6072 Self-heal Daemon on gqas011.sbu.lab.eng.bos .redhat.com N/A N/A Y 6080 NFS Server on gqas006.sbu.lab.eng.bos.redha t.com 2049 0 Y 5980 Self-heal Daemon on gqas006.sbu.lab.eng.bos .redhat.com N/A N/A Y 5988 NFS Server on gqas005.sbu.lab.eng.bos.redha t.com 2049 0 Y 6005 Self-heal Daemon on gqas005.sbu.lab.eng.bos .redhat.com N/A N/A Y 6013 Task Status of Volume drogon ------------------------------------------------------------------------------ Task : Rebalance ID : 7b9744ff-5b16-4b38-8186-d44ecc07b0bf Status : completed [root@gqas013 glusterfs]#
Tested with pure I/O without any add-brick. All my I/O(tar/dd) had a 0 Error Status.
"Assertion Failed" failure is tracked via https://bugzilla.redhat.com/show_bug.cgi?id=1343695.
Upstream patch http://review.gluster.org/14669 posted for review.
This is already fixed as part of rebase: commit c579303bfc4704187b1a41f658b8b3dc75b55c56 Author: Pranith Kumar K <pkarampu> Date: Tue Jun 7 21:27:10 2016 +0530 storage/posix: Give correct errno for anon-fd operations >Change-Id: Ia9e61d3baa6881eb7dc03dd8ddb6bfdde5a01958 >BUG: 1343906 >Signed-off-by: Pranith Kumar K <pkarampu> >Reviewed-on: http://review.gluster.org/14669 >Smoke: Gluster Build System <jenkins.org> >NetBSD-regression: NetBSD Build System <jenkins.org> >CentOS-regression: Gluster Build System <jenkins.org> >Reviewed-by: Raghavendra G <rgowdapp> >(cherry picked from commit d5088c056d5aee1bda2997ad5835379465fed3a1) Change-Id: I8f4c26a2314766579aa03873deb8033c75944c0d BUG: 1360138 Signed-off-by: Pranith Kumar K <pkarampu> Reviewed-on: http://review.gluster.org/15008 Smoke: Gluster Build System <jenkins.org> CentOS-regression: Gluster Build System <jenkins.org> NetBSD-regression: NetBSD Build System <jenkins.org> Reviewed-by: Krutika Dhananjay <kdhananj>
I did not hit an EIO on 3.8.4-11 on 2 tries as long as I keep rm out of the picture,on the exact same workload I tried with 3.1.3. EIO on rm -rf on EC is tracked via https://bugzilla.redhat.com/show_bug.cgi?id=1395161 Ill be doing some Scale tests on 3.2,will reopen if it pops up again. Moving this to Verified for now.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html