1412566 – [Scale] : I/O errors out with ENOTCONN during rebalance

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	replicate
Sub Component:
Version:	rhgs-3.2
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Ravishankar N
QA Contact:	Ambarish
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-01-12 10:15 UTC by Ambarish
Modified:	2017-03-28 06:51 UTC
CC List:	9 users
Fixed In Version:
Doc Type:
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-01-31 06:25:56 UTC
Embargoed:
Dependent Products:

Description Ambarish 2017-01-12 10:15:23 UTC

Description of problem:
----------------------

The intent was to scale from 1*2 to 6*2 and then back to 1*2 amidst continuous I/O from FUSE mounts.

While adding bricks to go from 3*2 to 4*2, I saw Bonnie++ error out on one of my clients:

<snip>

Changing to the specified mountpoint
/gluster-mount/d2/run3638
executing bonnie
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...Bonnie: drastic I/O error (re-write read): Transport endpoint is not connected
Can't read a full block, only got 8550 bytes.

</snip>
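
The message Bonnie++ printed can be mapped back to a symbolic errno name. The snippet below is a minimal sketch, not part of the original report: `errno_names_for_message` is a hypothetical helper that looks up which errno names produce a given `strerror` text on the local platform. On a typical Linux/glibc client, "Transport endpoint is not connected" is the text for ENOTCONN, which is a different code from EIO ("Input/output error").

```python
import errno
import os


def errno_names_for_message(message):
    """Return the errno names whose strerror() text equals `message`.

    Hypothetical helper for illustration; the result depends on the
    platform's C library, so run it on the client that saw the error.
    """
    return sorted(
        name
        for code, name in errno.errorcode.items()
        if os.strerror(code) == message
    )


if __name__ == "__main__":
    # Text taken verbatim from the Bonnie++ output above.
    print(errno_names_for_message("Transport endpoint is not connected"))
    # For comparison: the text that EIO maps to on this platform.
    print(os.strerror(errno.EIO))
```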

I was running Bonnie++, finds, dds, and kernel untars.

The sosreports and statedump locations will be shared in the comments.


Version-Release number of selected component (if applicable):
--------------------------------------------------------------

glusterfs-3.8.4-11.el7rhgs.x86_64

How reproducible:
-----------------

Reporting the first occurrence.


Actual results:
---------------

Bonnie++ errors out on the application side.

Expected results:
-----------------

No I/O errors (ENOTCONN) on the application side.

Additional info:
----------------

Client and Server OS: RHEL 7.3

*Vol Config* :

[root@gqas009 ~]# gluster v status
Status of volume: butcher
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gqas010.sbu.lab.eng.bos.redhat.com:/b
ricks1/A                                    49152     0          Y       23269
Brick gqas009.sbu.lab.eng.bos.redhat.com:/b
ricks1/A                                    49152     0          Y       23170
Brick gqas010.sbu.lab.eng.bos.redhat.com:/b
ricks2/A                                    49153     0          Y       23466
Brick gqas009.sbu.lab.eng.bos.redhat.com:/b
ricks2/A                                    49153     0          Y       23380
Brick gqas010.sbu.lab.eng.bos.redhat.com:/b
ricks3/A                                    49154     0          Y       24074
Brick gqas009.sbu.lab.eng.bos.redhat.com:/b
ricks3/A                                    49154     0          Y       24472
Brick gqas010.sbu.lab.eng.bos.redhat.com:/b
ricks4/A                                    49155     0          Y       24872
Brick gqas009.sbu.lab.eng.bos.redhat.com:/b
ricks4/A                                    49155     0          Y       25346
Self-heal Daemon on localhost               N/A       N/A        Y       27002
Quota Daemon on localhost                   N/A       N/A        Y       27010
Self-heal Daemon on gqas015.sbu.lab.eng.bos
.redhat.com                                 N/A       N/A        Y       25917
Quota Daemon on gqas015.sbu.lab.eng.bos.red
hat.com                                     N/A       N/A        Y       25925
Self-heal Daemon on gqas014.sbu.lab.eng.bos
.redhat.com                                 N/A       N/A        Y       25484
Quota Daemon on gqas014.sbu.lab.eng.bos.red
hat.com                                     N/A       N/A        Y       25492
Self-heal Daemon on gqas010.sbu.lab.eng.bos
.redhat.com                                 N/A       N/A        Y       26554
Quota Daemon on gqas010.sbu.lab.eng.bos.red
hat.com                                     N/A       N/A        Y       26562
 
Task Status of Volume butcher
------------------------------------------------------------------------------
Task                 : Rebalance           
ID                   : 86df50c3-00fc-409c-aac8-02c64dd5faa5
Status               : completed           
 
[root@gqas009 ~]#
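
The scale-out described above (one replica pair added per step, with a rebalance after each) can be sketched as a generated command list. This is only an illustration: the report does not record the exact commands that were run, `scale_out_commands` is a hypothetical helper, and the brick paths beyond `/bricks4/A` are extrapolated from the layout in the status output.

```python
def scale_out_commands(volume, hosts, brick_template, start_pairs, end_pairs):
    """Build gluster CLI commands to grow a distribute-replicate volume.

    Each step adds one brick per host (one replica set) and then starts
    a rebalance. Hypothetical helper; it only builds strings and does
    not execute anything.
    """
    commands = []
    for n in range(start_pairs + 1, end_pairs + 1):
        bricks = " ".join(
            f"{host}:{brick_template.format(n=n)}" for host in hosts
        )
        commands.append(f"gluster volume add-brick {volume} {bricks}")
        commands.append(f"gluster volume rebalance {volume} start")
    return commands


if __name__ == "__main__":
    # Hosts and volume name are taken from the status output above;
    # the /bricks{n}/A pattern for n > 4 is an assumption.
    hosts = [
        "gqas010.sbu.lab.eng.bos.redhat.com",
        "gqas009.sbu.lab.eng.bos.redhat.com",
    ]
    for cmd in scale_out_commands("butcher", hosts, "/bricks{n}/A", 1, 6):
        print(cmd)
```

In practice, each rebalance would normally be allowed to finish (checked with `gluster volume rebalance butcher status`) before the next add-brick.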

Comment 2 Ambarish 2017-01-12 10:22:50 UTC

From client mount logs :

[2017-01-12 06:30:50.721491] W [MSGID: 108035] [afr-transaction.c:2221:afr_changelog_fsync_cbk] 6-butcher-replicate-3: fsync(317da8ef-9dc3-41ea-824a-88f9af31066a) failed on subvolume butcher-client-7. Transaction was WRITE [Transport endpoint is not connected]
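
When scanning a large mount log for similar failures, a line like the one above can be split into fields. The following is a rough sketch that assumes every line of interest has the same shape as this single sample; `parse_client_log_line` and the regular expressions are illustrative and not derived from the Gluster sources.

```python
import re

# Pattern inferred from the one log line quoted above (an assumption).
LOG_RE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\s+"
    r"(?P<level>[A-Z])\s+"
    r"\[MSGID:\s*(?P<msgid>\d+)\]\s+"
    r"\[(?P<source>[^:\]]+):(?P<line>\d+):(?P<function>[^\]]+)\]\s+"
    r"(?P<xlator>\S+):\s+"
    r"(?P<message>.*)$"
)


def parse_client_log_line(line):
    """Return a dict of fields for a matching log line, else None."""
    match = LOG_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Pull out the failing subvolume and the trailing bracketed error text.
    subvol = re.search(r"failed on subvolume (\S+?)\.", fields["message"])
    error = re.search(r"\[([^\]]+)\]\s*$", fields["message"])
    fields["subvolume"] = subvol.group(1) if subvol else None
    fields["error"] = error.group(1) if error else None
    return fields


if __name__ == "__main__":
    sample = (
        "[2017-01-12 06:30:50.721491] W [MSGID: 108035] "
        "[afr-transaction.c:2221:afr_changelog_fsync_cbk] "
        "6-butcher-replicate-3: fsync(317da8ef-9dc3-41ea-824a-88f9af31066a) "
        "failed on subvolume butcher-client-7. Transaction was WRITE "
        "[Transport endpoint is not connected]"
    )
    print(parse_client_log_line(sample))
```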

Comment 5 Ambarish 2017-01-12 10:37:41 UTC

**************
EXACT WORKLOAD
**************

Client 1: dd in a loop

Client 2: Bonnie++

Client 3: tarball untar

Client 4: finds and fileop

Comment 15 Ambarish 2017-01-31 06:25:56 UTC

I scaled out from 1*2 to 6*2 and then back to 1*2 on 3.8.4-13 on FUSE.

It worked seamlessly.

Closing it as WFM after discussion with Atin/Ravi, for lack of a reproducer from QE.
