Description of problem:
When a user performs a remove-brick commit operation, the brick process for that brick is killed, all geo-replication worker instances receive 'ECONNABORTED', and all instances are restarted. Restarting the other instances on remove-brick commit can be avoided.

Version-Release number of selected component (if applicable):
3.4.0.33rhs-1.el6rhs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create and start a dist-rep volume and mount it. Start creating data on the master volume from the mount point.

mount point:
mount | grep remove_xsync
10.70.35.179:/remove_xsync on /mnt/remove_xsync type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072)
10.70.35.179:/remove_xsync on /mnt/remove_xsync_nfs type nfs (rw,addr=10.70.35.179)

2. Create and start a geo-rep session between the master and slave volumes.

[root@old5 ~]# gluster volume geo remove_xsync status
NODE                           MASTER          SLAVE                               HEALTH    UPTIME
-----------------------------------------------------------------------------------------------------------------
old5.lab.eng.blr.redhat.com    remove_xsync    ssh://10.70.37.195::remove_xsync    Stable    4 days 07:12:33
old6.lab.eng.blr.redhat.com    remove_xsync    ssh://10.70.37.195::remove_xsync    Stable    4 days 23:52:43

3. Remove brick(s) from the master volume with the start option:

gluster volume remove-brick remove_xsync 10.70.35.179:/rhs/brick3/x3 10.70.35.235:/rhs/brick3/x3 start

4. Once remove-brick is completed, perform the commit operation:

gluster volume remove-brick remove_xsync 10.70.35.179:/rhs/brick3/x3 10.70.35.235:/rhs/brick3/x3 status
gluster volume remove-brick remove_xsync 10.70.35.179:/rhs/brick3/x3 10.70.35.235:/rhs/brick3/x3 commit

[root@old5 ~]# gluster v info remove_change

Volume Name: remove_change
Type: Distributed-Replicate
Volume ID: eb500199-37d4-4cb9-96ed-ae5bc1bf2498
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.35.179:/rhs/brick3/c1
Brick2: 10.70.35.235:/rhs/brick3/c1
Brick3: 10.70.35.179:/rhs/brick3/c2
Brick4: 10.70.35.235:/rhs/brick3/c2
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on

5. On remove-brick commit, the brick process for that brick is killed, all instances receive 'ECONNABORTED', and all instances are restarted.

Log snippet:
less /var/log/glusterfs/geo-replication/remove_xsync/ssh%3A%2F%2Froot%4010.70.37.195%3Agluster%3A%2F%2F127.0.0.1%3Aremove_xsync.log

[2013-09-16 14:56:33.944725] I [master(/rhs/brick2/x3):587:fallback_xsync] _GMaster: falling back to xsync mode
[2013-09-16 14:56:48.72854] I [syncdutils(/rhs/brick2/x3):159:finalize] <top>: exiting.
[2013-09-16 14:56:50.587552] E [syncdutils(/rhs/brick2/x1):201:log_raise_exception] <top>: glusterfs session went down [ECONNABORTED]
[2013-09-16 14:56:52.982089] I [syncdutils(/rhs/brick2/x1):159:finalize] <top>: exiting.
[2013-09-16 14:56:51.429940] E [syncdutils(/rhs/brick2/x2):201:log_raise_exception] <top>: glusterfs session went down [ECONNABORTED]
[2013-09-16 14:56:53.641541] I [syncdutils(/rhs/brick2/x2):159:finalize] <top>: exiting.
[2013-09-16 14:56:56.116944] I [monitor(monitor):81:set_state] Monitor: new state: faulty
[2013-09-16 14:57:12.589235] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2013-09-16 14:57:12.786187] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker
[2013-09-16 14:57:12.730447] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2013-09-16 14:57:12.844243] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker
[2013-09-16 14:57:13.646564] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2013-09-16 14:57:13.647228] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker
[2013-09-16 14:57:14.677306] I [gsyncd(/rhs/brick2/x2):503:main_i] <top>: syncing: gluster://localhost:remove_xsync -> ssh://root@10.70.37.195:gluster://localhost:remove_xsync
[2013-09-16 14:57:14.682374] I [gsyncd(/rhs/brick2/x3):503:main_i] <top>: syncing: gluster://localhost:remove_xsync -> ssh://root@10.70.37.98:gluster://localhost:remove_xsync
[2013-09-16 14:57:14.684375] I [gsyncd(/rhs/brick2/x1):503:main_i] <top>: syncing: gluster://localhost:remove_xsync -> ssh://root@10.70.37.98:gluster://localhost:remove_xsync
[2013-09-16 14:57:21.670073] I [master(/rhs/brick2/x2):57:gmaster_builder] <top>: setting up xsync change detection mode
[2013-09-16 14:57:21.676136] I [master(/rhs/brick2/x2):57:gmaster_builder] <top>: setting up xsync change detection mode
[2013-09-16 14:57:21.688627] I [master(/rhs/brick2/x2):816:register] _GMaster: xsync temp directory: /var/run/gluster/remove_xsync/ssh%3A%2F%2Froot%4010.70.37.195%3Agluster%3A%2F%2F127.0.0.1%3Aremove_xsync/9b86668c9bd1c074e1e2720fc5005e44/xsync
[2013-09-16 14:57:21.688901] I [master(/rhs/brick2/x2):816:register] _GMaster: xsync temp directory: /var/run/gluster/remove_xsync/ssh%3A%2F%2Froot%4010.70.37.195%3Agluster%3A%2F%2F127.0.0.1%3Aremove_xsync/9b86668c9bd1c074e1e2720fc5005e44/xsync
[2013-09-16 14:57:22.300641] I [master(/rhs/brick2/x3):57:gmaster_builder] <top>: setting up xsync change detection mode
[2013-09-16 14:57:22.320192] I [master(/rhs/brick2/x1):57:gmaster_builder] <top>: setting up xsync change detection mode
[2013-09-16 14:57:22.320787] I [master(/rhs/brick2/x3):57:gmaster_builder] <top>: setting up xsync change detection mode
[2013-09-16 14:57:22.323508] I [master(/rhs/brick2/x1):57:gmaster_builder] <top>: setting up xsync change detection mode

Actual results:
All geo-replication worker instances are restarted.

Expected results:
Restarting of the other instances on remove-brick commit can be avoided; only the worker for the removed brick should be affected.

Additional info:
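To illustrate the expected behavior, here is a minimal, hypothetical Python sketch (gsyncd's monitor is written in Python, but this Monitor class, the current_bricks helper, and the spawn_worker stand-in are illustrative assumptions, not the actual gsyncd code). The idea is a monitor that reconciles its worker set against the volume's current brick list, so a remove-brick commit tears down only the worker for the removed brick instead of restarting every instance:

import subprocess

class Monitor:
    def __init__(self):
        # brick path -> worker process handle
        self.workers = {}

    def current_bricks(self, volume):
        # Placeholder for querying glusterd; assumes the gluster CLI is
        # on PATH. Real gsyncd obtains this via its own volinfo machinery.
        out = subprocess.check_output(
            ["gluster", "volume", "info", volume]).decode()
        return [line.split(": ", 1)[1] for line in out.splitlines()
                if line.startswith("Brick") and ": " in line]

    def reconcile(self, volume):
        bricks = set(self.current_bricks(volume))
        # Stop only the workers whose brick left the volume ...
        for brick in set(self.workers) - bricks:
            self.workers.pop(brick).terminate()
        # ... and start workers for bricks not yet covered. Workers for
        # surviving bricks keep running, avoiding the mass restart.
        for brick in bricks - set(self.workers):
            self.workers[brick] = self.spawn_worker(brick)

    def spawn_worker(self, brick):
        # Stand-in for launching a gsyncd worker for one brick.
        return subprocess.Popen(["sleep", "infinity"])

With this kind of reconciliation, the workers for /rhs/brick2/x1 and /rhs/brick2/x2 seen in the log above would keep running when only /rhs/brick3/x3 is removed.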
Verified with the build: glusterfs-3.7.1-10.el6rhs.x86_64

There is an additional step to stop the geo-rep session before doing the commit. Once the bricks are removed using commit, they are no longer listed in the volume info. Starting the geo-rep session does not pick up the removed bricks, and we do not see ECONNABORTED.

Moving this bug to verified state.
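For reference, assuming the volume, slave, and brick names from the reproduction steps above, the verified sequence would look like this sketch (standard gluster CLI syntax):

# Stop the geo-rep session before committing the brick removal
gluster volume geo-replication remove_xsync 10.70.37.195::remove_xsync stop

# Commit; the removed bricks drop out of 'gluster volume info'
gluster volume remove-brick remove_xsync 10.70.35.179:/rhs/brick3/x3 10.70.35.235:/rhs/brick3/x3 commit

# Restart the session; the removed bricks are not picked up,
# so no worker sees ECONNABORTED
gluster volume geo-replication remove_xsync 10.70.37.195::remove_xsync start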
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html