Bug 1009351 - [RFE] Dist-geo-rep : no need of restarting other geo replication instances when they receives 'ECONNABORTED' on remove-brick commit of some other brick
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: geo-replication
Version: 2.1
Hardware: x86_64 Linux
Priority: medium, Severity: medium
: ---
: RHGS 3.1.0
Assigned To: Kotresh HR
Rahul Hinduja
usability
: FutureFeature
Depends On:
Blocks: 1202842 1223636
Reported: 2013-09-18 05:18 EDT by Rachana Patel
Modified: 2015-07-29 00:29 EDT (History)
6 users

See Also:
Fixed In Version: glusterfs-3.7.0-2.el6rhs
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-07-29 00:29:04 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Rachana Patel 2013-09-18 05:18:46 EDT
Description of problem:
When the user performs a remove-brick commit operation, the brick process for that brick is killed, all geo-replication instances receive 'ECONNABORTED', and all instances are restarted. Restarting the other instances on remove-brick commit can be avoided.

Version-Release number of selected component (if applicable):
3.4.0.33rhs-1.el6rhs.x86_64

How reproducible:
always

Steps to Reproduce:
1. Create and start a dist-rep volume and mount it. Start creating data on the master volume from the mount point.

Mount point:
mount | grep remove_xsync
10.70.35.179:/remove_xsync on /mnt/remove_xsync type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072)
10.70.35.179:/remove_xsync on /mnt/remove_xsync_nfs type nfs (rw,addr=10.70.35.179)

2. Create and start a geo-rep session between the master and slave volumes.
[root@old5 ~]# gluster volume geo remove_xsync status
NODE                           MASTER           SLAVE                                HEALTH    UPTIME                
-----------------------------------------------------------------------------------------------------------------
old5.lab.eng.blr.redhat.com    remove_xsync    ssh://10.70.37.195::remove_xsync    Stable    4 days 07:12:33       
old6.lab.eng.blr.redhat.com    remove_xsync    ssh://10.70.37.195::remove_xsync    Stable    4 days 23:52:43 



3. Remove brick(s) from the master volume with the start option.

--> gluster volume remove-brick remove_xsync 10.70.35.179:/rhs/brick3/x3 10.70.35.235:/rhs/brick3/x3 start

4. Once remove-brick is completed, perform the commit operation:
 gluster volume remove-brick remove_xsync 10.70.35.179:/rhs/brick3/x3 10.70.35.235:/rhs/brick3/x3 status
 gluster volume remove-brick remove_xsync 10.70.35.179:/rhs/brick3/x3 10.70.35.235:/rhs/brick3/x3 commit

[root@old5 ~]# gluster v info remove_change
 
Volume Name: remove_change
Type: Distributed-Replicate
Volume ID: eb500199-37d4-4cb9-96ed-ae5bc1bf2498
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.35.179:/rhs/brick3/c1
Brick2: 10.70.35.235:/rhs/brick3/c1
Brick3: 10.70.35.179:/rhs/brick3/c2
Brick4: 10.70.35.235:/rhs/brick3/c2
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on

5. On the remove-brick commit operation, the brick process for that brick is killed; all instances receive 'ECONNABORTED' and all instances are restarted.

Log snippet:
 less /var/log/glusterfs/geo-replication/remove_xsync/ssh%3A%2F%2Froot%4010.70.37.195%3Agluster%3A%2F%2F127.0.0.1%3Aremove_xsync.log 

[2013-09-16 14:56:33.944725] I [master(/rhs/brick2/x3):587:fallback_xsync] _GMaster: falling back to xsync mode
[2013-09-16 14:56:48.72854] I [syncdutils(/rhs/brick2/x3):159:finalize] <top>: exiting.
[2013-09-16 14:56:50.587552] E [syncdutils(/rhs/brick2/x1):201:log_raise_exception] <top>: glusterfs session went down [ECONNABORTED]
[2013-09-16 14:56:52.982089] I [syncdutils(/rhs/brick2/x1):159:finalize] <top>: exiting.
[2013-09-16 14:56:51.429940] E [syncdutils(/rhs/brick2/x2):201:log_raise_exception] <top>: glusterfs session went down [ECONNABORTED]
[2013-09-16 14:56:53.641541] I [syncdutils(/rhs/brick2/x2):159:finalize] <top>: exiting.
[2013-09-16 14:56:56.116944] I [monitor(monitor):81:set_state] Monitor: new state: faulty
[2013-09-16 14:57:12.589235] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2013-09-16 14:57:12.786187] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker
[2013-09-16 14:57:12.730447] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2013-09-16 14:57:12.844243] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker
[2013-09-16 14:57:13.646564] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2013-09-16 14:57:13.647228] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker
[2013-09-16 14:57:14.677306] I [gsyncd(/rhs/brick2/x2):503:main_i] <top>: syncing: gluster://localhost:remove_xsync -> ssh://root@10.70.37.195:gluster://localhost:remove_xsync
[2013-09-16 14:57:14.682374] I [gsyncd(/rhs/brick2/x3):503:main_i] <top>: syncing: gluster://localhost:remove_xsync -> ssh://root@10.70.37.98:gluster://localhost:remove_xsync
[2013-09-16 14:57:14.684375] I [gsyncd(/rhs/brick2/x1):503:main_i] <top>: syncing: gluster://localhost:remove_xsync -> ssh://root@10.70.37.98:gluster://localhost:remove_xsync
[2013-09-16 14:57:21.670073] I [master(/rhs/brick2/x2):57:gmaster_builder] <top>: setting up xsync change detection mode
[2013-09-16 14:57:21.676136] I [master(/rhs/brick2/x2):57:gmaster_builder] <top>: setting up xsync change detection mode
[2013-09-16 14:57:21.688627] I [master(/rhs/brick2/x2):816:register] _GMaster: xsync temp directory: /var/run/gluster/remove_xsync/ssh%3A%2F%2Froot%4010.70.37.195%3Agluster%3A%2F%2F127.0.0.1%3Aremove_xsync/9b86668c9bd1c074e1e2720fc5005e44/xsync
[2013-09-16 14:57:21.688901] I [master(/rhs/brick2/x2):816:register] _GMaster: xsync temp directory: /var/run/gluster/remove_xsync/ssh%3A%2F%2Froot%4010.70.37.195%3Agluster%3A%2F%2F127.0.0.1%3Aremove_xsync/9b86668c9bd1c074e1e2720fc5005e44/xsync
[2013-09-16 14:57:22.300641] I [master(/rhs/brick2/x3):57:gmaster_builder] <top>: setting up xsync change detection mode
[2013-09-16 14:57:22.320192] I [master(/rhs/brick2/x1):57:gmaster_builder] <top>: setting up xsync change detection mode
[2013-09-16 14:57:22.320787] I [master(/rhs/brick2/x3):57:gmaster_builder] <top>: setting up xsync change detection mode
[2013-09-16 14:57:22.323508] I [master(/rhs/brick2/x1):57:gmaster_builder] <top>: setting up xsync change detection mode

Actual results:
All instances are restarted.

Expected results:
Restarting of the other instances on remove-brick commit can be avoided.
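
The expected behaviour can be illustrated with a small sketch (hypothetical helper, not gsyncd's actual code): when a worker sees ECONNABORTED, the monitor would restart it only if the worker's own brick is still part of the volume; a worker whose brick was removed by the commit is simply retired instead of restarted.

```python
# Hypothetical sketch of the requested behaviour -- not gsyncd's actual code.
# On ECONNABORTED, restart a worker only if its brick is still listed in
# `gluster volume info`; a worker whose brick was removed is retired.

def plan_worker_actions(aborted_bricks, volume_bricks):
    """Decide what to do with each worker whose connection was aborted.

    aborted_bricks -- brick paths whose glusterfs connection saw ECONNABORTED
    volume_bricks  -- set of bricks still present in the volume
    """
    actions = {}
    for brick in aborted_bricks:
        if brick in volume_bricks:
            actions[brick] = "restart"  # transient disconnect: bring worker back
        else:
            actions[brick] = "retire"   # brick removed by commit: stop worker
    return actions
```

With this logic, removing /rhs/brick2/x3 would retire only that worker while the workers for x1 and x2 keep running, instead of all three restarting.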

Additional info:
Comment 8 Rahul Hinduja 2015-07-16 08:38:03 EDT
Verified with the build: glusterfs-3.7.1-10.el6rhs.x86_64

There is now an additional step to stop the geo-rep session before doing the commit. Once the bricks are removed using commit, they are no longer listed in the volume info. Starting the geo-rep session does not pick up those bricks, and we do not see ECONNABORTED.
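
The workflow described above can be sketched as a CLI sequence (a sketch using the volume and brick names from the reproduction steps, not verbatim output from the verification run):

```shell
# Stop geo-replication before committing the brick removal.
gluster volume geo-replication remove_xsync 10.70.37.195::remove_xsync stop

# Commit the removal; the bricks disappear from `gluster volume info`.
gluster volume remove-brick remove_xsync \
    10.70.35.179:/rhs/brick3/x3 10.70.35.235:/rhs/brick3/x3 commit

# Restart geo-replication; workers are spawned only for the remaining
# bricks, so no instance sees ECONNABORTED for the removed ones.
gluster volume geo-replication remove_xsync 10.70.37.195::remove_xsync start
```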

Moving this bug to verified state.
Comment 11 errata-xmlrpc 2015-07-29 00:29:04 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html
