Bug 1640573

Summary: [geo-rep]: Transport endpoint not connected with arbiter volumes
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: geo-replication
Version: rhgs-3.4
Status: CLOSED ERRATA
Severity: low
Priority: low
Reporter: Rochelle <rallan>
Assignee: Shwetha K Acharya <sacharya>
QA Contact: Leela Venkaiah Gangavarapu <lgangava>
CC: csaba, pprakash, puebele, rhs-bugs, rkothiya, sheggodu, storage-qa-internal, sunkumar
Keywords: EasyFix, ZStream
Target Milestone: ---
Target Release: RHGS 3.5.z Batch Update 3
Hardware: Unspecified
OS: Unspecified
Fixed In Version: glusterfs-6.0-38
Doc Type: No Doc Update
Last Closed: 2020-12-17 04:50:16 UTC
Type: Bug
Bug Blocks: 1664335 (view as bug list)

Description Rochelle 2018-10-18 10:44:52 UTC
Description of problem:
======================
A brick process goes down in the middle of the geo-replication test case described below, and the geo-rep worker on the master crashes with ENOTCONN (Transport endpoint is not connected).

On the Master:
--------------
[2018-10-18 09:30:58.93183] I [master(/rhs/brick2/b5):1450:crawl] _GMaster: slave's time        stime=(1539855011, 0)
[2018-10-18 09:31:00.624412] E [repce(/rhs/brick3/b8):209:__call__] RepceClient: call failed    call=17069:139754185205568:1539855057.49        method=entry_ops        error=OSError
[2018-10-18 09:31:00.625464] E [syncdutils(/rhs/brick3/b8):349:log_raise_exception] <top>: Gluster Mount process exited error=ENOTCONN
[2018-10-18 09:31:00.699959] I [syncdutils(/rhs/brick3/b8):295:finalize] <top>: exiting.


brick3/b8 logs report:
----------------------
[2018-10-18 09:31:00.725836] W [socket.c:593:__socket_rwv] 0-master-changelog: readv on /var/run/gluster/.f8271615d91fca5417068.sock failed (No data available)
[2018-10-18 09:31:00.739737] I [MSGID: 115036] [server.c:571:server_rpc_notify] 0-master-server: disconnecting connection from dhcp42-2.lab.eng.blr.redhat.com-17140-2018/10/18-09:30:01:598538-master-client-4-0-0
[2018-10-18 09:31:00.740127] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-master-server: Shutting down connection dhcp42-2.lab.eng.blr.redhat.com-17140-2018/10/18-09:30:01:598538-master-client-4-0-0


On the slave:
-------------
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 639, in entry_ops
    st = lstat(slink)
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 577, in lstat
    return errno_wrap(os.lstat, [e], [ENOENT], [ESTALE, EBUSY])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 559, in errno_wrap
    return call(*arg)
OSError: [Errno 107] Transport endpoint is not connected: '.gfid/4b67b1d8-b53b-4962-a4c4-294b3d5e750c'
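
The traceback shows the failure mechanism: errno_wrap() calls os.lstat() on the aux-mounted .gfid path, ignoring ENOENT and retrying on ESTALE/EBUSY, but ENOTCONN (errno 107) is in neither list, so the OSError propagates and kills the worker. The sketch below is illustrative only, not the shipped syncdutils.errno_wrap and not the actual fix; the helper name and retry policy are assumptions, showing how tolerating ENOTCONN alongside ESTALE/EBUSY would keep the call from taking the worker down:

# Illustrative sketch only -- names and retry policy are assumptions,
# not the geo-rep syncdutils API and not the shipped fix.
import os
import time
from errno import ENOENT, ESTALE, EBUSY, ENOTCONN

def errno_tolerant_call(call, args, ignore_errnos, retry_errnos, retries=5):
    """Run call(*args); ignore some errnos, retry on transient ones."""
    for attempt in range(retries):
        try:
            return call(*args)
        except OSError as exc:
            if exc.errno in ignore_errnos:
                return None                # e.g. ENOENT: entry already gone
            if exc.errno in retry_errnos and attempt < retries - 1:
                time.sleep(1)              # transient: stale handle, busy, lost mount
                continue
            raise                          # anything else still propagates

# With ENOTCONN treated as transient, the lstat above would be retried
# instead of crashing the worker:
st = errno_tolerant_call(os.lstat,
                         ['.gfid/4b67b1d8-b53b-4962-a4c4-294b3d5e750c'],
                         [ENOENT], [ESTALE, EBUSY, ENOTCONN])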



Version-Release number of selected component (if applicable):
=============================================================
[root@dhcp42-2 bricks]# rpm -qa | grep gluster
glusterfs-server-3.12.2-22.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-22.el7rhgs.x86_64
glusterfs-rdma-3.12.2-22.el7rhgs.x86_64
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-3.12.2-22.el7rhgs.x86_64
glusterfs-api-3.12.2-22.el7rhgs.x86_64
glusterfs-events-3.12.2-22.el7rhgs.x86_64
glusterfs-libs-3.12.2-22.el7rhgs.x86_64
glusterfs-fuse-3.12.2-22.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-22.el7rhgs.x86_64
python2-gluster-3.12.2-22.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
glusterfs-cli-3.12.2-22.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-4.5.0-10.el7.x86_64
[root@dhcp42-2 bricks]# 


How reproducible:
=================
2/2

Steps to Reproduce:
===================
1. Create and start a master and a slave arbiter volume
2. Set up a geo-rep session between the two
3. Mount the master and pump I/O (a brick-health polling sketch follows the workload command below):

for i in {create,chmod,symlink,create,chown,chmod,create,symlink,chgrp,symlink,truncate,symlink,chown,create,symlink}; do crefi --multi -n 5 -b 10 -d 10 --max=10K --min=500 --random -T 10 -t text --fop=$i /mnt/master/ ; sleep 10 ; done
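
While the crefi loop runs, the brick state can be polled so the moment the brick drops is caught. The helper below is only a sketch under assumptions (the gluster CLI is on PATH, the master volume is named "master", Python 3.7+ is available); it is not part of the test tooling:

# Hypothetical brick-health poller (not part of the test tooling); assumes the
# gluster CLI is on PATH and the master volume is named "master".
import subprocess
import time

VOLUME = "master"

def bricks_online(volume):
    """Return False if any brick line of 'gluster volume status' shows Online = N."""
    out = subprocess.run(["gluster", "volume", "status", volume],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        cols = line.split()
        # Brick lines look like:
        # Brick 10.70.42.2:/rhs/brick3/b8   49153   0   Y   17140
        if line.startswith("Brick ") and len(cols) >= 6 and cols[-2] == "N":
            return False
    return True

while bricks_online(VOLUME):
    time.sleep(5)
print("a brick of volume '%s' went offline at %s" % (VOLUME, time.ctime()))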


master vol info:
-----------------
Volume Name: master
Type: Distributed-Replicate
Volume ID: 5a97408a-8cc0-4f24-a306-7f9e143e6614
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.43.116:/rhs/brick2/b4
Brick2: 10.70.42.2:/rhs/brick2/b5
Brick3: 10.70.42.44:/rhs/brick2/b6 (arbiter)
Brick4: 10.70.43.116:/rhs/brick3/b7
Brick5: 10.70.42.2:/rhs/brick3/b8
Brick6: 10.70.42.44:/rhs/brick3/b9 (arbiter)
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
cluster.brick-multiplex: off
cluster.enable-shared-storage: enable


slave vol info:
---------------
Volume Name: slave
Type: Distributed-Replicate
Volume ID: 125aece0-1800-4065-a363-f36dc0efc6f5
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.42.226:/rhs/brick2/b4
Brick2: 10.70.43.81:/rhs/brick2/b5
Brick3: 10.70.41.204:/rhs/brick2/b6 (arbiter)
Brick4: 10.70.42.226:/rhs/brick3/b7
Brick5: 10.70.43.81:/rhs/brick3/b8
Brick6: 10.70.41.204:/rhs/brick3/b9 (arbiter)
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off



Actual results:
==============
The brick process went down during the sync, and the geo-rep worker crashed with ENOTCONN.

Expected results:
================
The brick should not go down, and geo-replication should continue without errors.

Comment 9 Sunny Kumar 2019-01-28 06:40:37 UTC
*** Bug 1669936 has been marked as a duplicate of this bug. ***

Comment 13 Kotresh HR 2019-11-19 05:25:03 UTC
*** Bug 1666974 has been marked as a duplicate of this bug. ***

Comment 16 Sunny Kumar 2020-02-24 14:33:13 UTC
This is targeted for 3.5.2; once branching is done, the fix will be back-ported.

Comment 25 errata-xmlrpc 2020-12-17 04:50:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (glusterfs bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5603