Bug 1412566

Summary: [Scale] : I/O errors out with ENOTCONN during rebalance
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Ambarish <asoman>
Component: replicate
Assignee: Ravishankar N <ravishankar>
Status: CLOSED WORKSFORME
QA Contact: Ambarish <asoman>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.2
CC: amukherj, asoman, bturner, pkarampu, ravishankar, rcyriac, rhinduja, rhs-bugs, storage-qa-internal
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-01-31 06:25:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Ambarish 2017-01-12 10:15:23 UTC
Description of problem:
----------------------

The intent was to scale from 1*2 to 6*2 and then back to 1*2 amidst continuous I/O from FUSE mounts.
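
For reference, the scale-out/scale-in was driven with the standard gluster CLI, one replica pair at a time. A minimal sketch of the kind of commands involved (brick paths here are illustrative placeholders, not the exact ones from this setup):

# Grow the replica-2 volume by one replica pair (e.g. 3*2 -> 4*2), then rebalance:
gluster volume add-brick butcher replica 2 \
    gqas010.sbu.lab.eng.bos.redhat.com:/bricksN/A \
    gqas009.sbu.lab.eng.bos.redhat.com:/bricksN/A
gluster volume rebalance butcher start
gluster volume rebalance butcher status        # wait until "completed"

# Shrink back towards 1*2 by removing a replica pair with data migration:
gluster volume remove-brick butcher \
    gqas010.sbu.lab.eng.bos.redhat.com:/bricksN/A \
    gqas009.sbu.lab.eng.bos.redhat.com:/bricksN/A start
gluster volume remove-brick butcher \
    gqas010.sbu.lab.eng.bos.redhat.com:/bricksN/A \
    gqas009.sbu.lab.eng.bos.redhat.com:/bricksN/A status   # wait for migration to complete
gluster volume remove-brick butcher \
    gqas010.sbu.lab.eng.bos.redhat.com:/bricksN/A \
    gqas009.sbu.lab.eng.bos.redhat.com:/bricksN/A commit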

While adding bricks to go from 3*2 to 4*2, I saw that Bonnie++ errored out on one of my clients:

<snip>

Changing to the specified mountpoint
/gluster-mount/d2/run3638
executing bonnie
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...Bonnie: drastic I/O error (re-write read): Transport endpoint is not connected
Can't read a full block, only got 8550 bytes.

</snip>

I was running Bonnie++, finds, dds, and kernel untars.

The sosreports and statedump locations will be shared in the comments.


Version-Release number of selected component (if applicable):
--------------------------------------------------------------

glusterfs-3.8.4-11.el7rhgs.x86_64

How reproducible:
-----------------

Reporting the first occurrence.


Actual results:
---------------

Bonnie++ errors out on the application side with "Transport endpoint is not connected" (ENOTCONN).

Expected results:
-----------------

No I/O errors (ENOTCONN/EIO) on the application side.

Additional info:
----------------

Client and Server OS: RHEL 7.3

*Vol Config* :

[root@gqas009 ~]# gluster v status
Status of volume: butcher
Gluster process                                          TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------------------
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks1/A      49152     0          Y       23269
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks1/A      49152     0          Y       23170
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks2/A      49153     0          Y       23466
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks2/A      49153     0          Y       23380
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks3/A      49154     0          Y       24074
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks3/A      49154     0          Y       24472
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks4/A      49155     0          Y       24872
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks4/A      49155     0          Y       25346
Self-heal Daemon on localhost                            N/A       N/A        Y       27002
Quota Daemon on localhost                                N/A       N/A        Y       27010
Self-heal Daemon on gqas015.sbu.lab.eng.bos.redhat.com   N/A       N/A        Y       25917
Quota Daemon on gqas015.sbu.lab.eng.bos.redhat.com       N/A       N/A        Y       25925
Self-heal Daemon on gqas014.sbu.lab.eng.bos.redhat.com   N/A       N/A        Y       25484
Quota Daemon on gqas014.sbu.lab.eng.bos.redhat.com       N/A       N/A        Y       25492
Self-heal Daemon on gqas010.sbu.lab.eng.bos.redhat.com   N/A       N/A        Y       26554
Quota Daemon on gqas010.sbu.lab.eng.bos.redhat.com       N/A       N/A        Y       26562
 
Task Status of Volume butcher
------------------------------------------------------------------------------
Task                 : Rebalance           
ID                   : 86df50c3-00fc-409c-aac8-02c64dd5faa5
Status               : completed           
 
[root@gqas009 ~]#

Comment 2 Ambarish 2017-01-12 10:22:50 UTC
From the client mount logs:

[2017-01-12 06:30:50.721491] W [MSGID: 108035] [afr-transaction.c:2221:afr_changelog_fsync_cbk] 6-butcher-replicate-3: fsync(317da8ef-9dc3-41ea-824a-88f9af31066a) failed on subvolume butcher-client-7. Transaction was WRITE [Transport endpoint is not connected]
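
For anyone triaging similar reports, the application-level ENOTCONN can be correlated with client-side disconnect messages around the same timestamp in the FUSE mount log. A hedged sketch (the log file name is assumed from the default convention of the mount point with slashes replaced by dashes; adjust to the actual path):

# Assumed default client log for a mount at /gluster-mount
LOG=/var/log/glusterfs/gluster-mount.log
grep -E 'disconnected|Transport endpoint is not connected|MSGID: 108035' "$LOG" | grep '2017-01-12 06:3'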

Comment 5 Ambarish 2017-01-12 10:37:41 UTC
**************
EXACT WORKLOAD
**************

Client 1 : dd in loop 

Client 2 : Bonnie++

Client 3 : tarball untar

Client 4:  finds and fileop
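
For reproduction attempts, a minimal sketch of the client-side loops (paths, iteration counts, and tool flags below are placeholders, not the exact ones used in this run):

# Client 1: dd in a loop
for i in $(seq 1 1000); do
    dd if=/dev/zero of=/gluster-mount/ddfile.$i bs=1M count=100 conv=fsync
done

# Client 2: Bonnie++
bonnie++ -d /gluster-mount/d2 -u root:root

# Client 3: kernel tarball untar in a loop
for i in $(seq 1 100); do
    mkdir -p /gluster-mount/untar.$i
    tar -xf /root/linux.tar.xz -C /gluster-mount/untar.$i
done

# Client 4: finds and fileop
find /gluster-mount -type f > /dev/null
cd /gluster-mount/fileop && fileop -f 10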

Comment 15 Ambarish 2017-01-31 06:25:56 UTC
I scaled out from 1*2 to 6*2 and then back to 1*2 on 3.8.4-13 over FUSE.

It worked seamlessly.

Closing this as WORKSFORME after discussion with Atin/Ravi, for lack of a reproducer from QE.