Description of problem:
When a user performs a remove-brick commit operation, the brick process for that brick is killed, all geo-replication worker instances receive 'ECONNABORTED', and all instances are restarted. Restarting the other instances on remove-brick commit can be avoided.

Version-Release number of selected component (if applicable):
3.4.0.33rhs-1.el6rhs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create and start a dist-rep volume and mount it. Start creating data on the master volume from the mount point.

mount point:
mount | grep remove_xsync
10.70.35.179:/remove_xsync on /mnt/remove_xsync type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072)
10.70.35.179:/remove_xsync on /mnt/remove_xsync_nfs type nfs (rw,addr=10.70.35.179)

2. Create and start a geo-rep session between the master and slave volumes.

[root@old5 ~]# gluster volume geo remove_xsync status
NODE                           MASTER          SLAVE                               HEALTH    UPTIME
-----------------------------------------------------------------------------------------------------------------
old5.lab.eng.blr.redhat.com    remove_xsync    ssh://10.70.37.195::remove_xsync    Stable    4 days 07:12:33
old6.lab.eng.blr.redhat.com    remove_xsync    ssh://10.70.37.195::remove_xsync    Stable    4 days 23:52:43

3. Remove brick(s) from the master volume with the start option:

gluster volume remove-brick remove_xsync 10.70.35.179:/rhs/brick3/x3 10.70.35.235:/rhs/brick3/x3 start

4. Once remove-brick is completed, perform the commit operation:

gluster volume remove-brick remove_xsync 10.70.35.179:/rhs/brick3/x3 10.70.35.235:/rhs/brick3/x3 status
gluster volume remove-brick remove_xsync 10.70.35.179:/rhs/brick3/x3 10.70.35.235:/rhs/brick3/x3 commit

[root@old5 ~]# gluster v info remove_change

Volume Name: remove_change
Type: Distributed-Replicate
Volume ID: eb500199-37d4-4cb9-96ed-ae5bc1bf2498
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.35.179:/rhs/brick3/c1
Brick2: 10.70.35.235:/rhs/brick3/c1
Brick3: 10.70.35.179:/rhs/brick3/c2
Brick4: 10.70.35.235:/rhs/brick3/c2
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on

5. On remove-brick commit, the brick process for that brick is killed, all instances receive 'ECONNABORTED', and all instances are restarted.

Log snippet:
less /var/log/glusterfs/geo-replication/remove_xsync/ssh%3A%2F%2Froot%4010.70.37.195%3Agluster%3A%2F%2F127.0.0.1%3Aremove_xsync.log

[2013-09-16 14:56:33.944725] I [master(/rhs/brick2/x3):587:fallback_xsync] _GMaster: falling back to xsync mode
[2013-09-16 14:56:48.72854] I [syncdutils(/rhs/brick2/x3):159:finalize] <top>: exiting.
[2013-09-16 14:56:50.587552] E [syncdutils(/rhs/brick2/x1):201:log_raise_exception] <top>: glusterfs session went down [ECONNABORTED]
[2013-09-16 14:56:52.982089] I [syncdutils(/rhs/brick2/x1):159:finalize] <top>: exiting.
[2013-09-16 14:56:51.429940] E [syncdutils(/rhs/brick2/x2):201:log_raise_exception] <top>: glusterfs session went down [ECONNABORTED]
[2013-09-16 14:56:53.641541] I [syncdutils(/rhs/brick2/x2):159:finalize] <top>: exiting.
[2013-09-16 14:56:56.116944] I [monitor(monitor):81:set_state] Monitor: new state: faulty
[2013-09-16 14:57:12.589235] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2013-09-16 14:57:12.786187] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker
[2013-09-16 14:57:12.730447] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2013-09-16 14:57:12.844243] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker
[2013-09-16 14:57:13.646564] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2013-09-16 14:57:13.647228] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker
[2013-09-16 14:57:14.677306] I [gsyncd(/rhs/brick2/x2):503:main_i] <top>: syncing: gluster://localhost:remove_xsync -> ssh://root@10.70.37.195:gluster://localhost:remove_xsync
[2013-09-16 14:57:14.682374] I [gsyncd(/rhs/brick2/x3):503:main_i] <top>: syncing: gluster://localhost:remove_xsync -> ssh://root@10.70.37.98:gluster://localhost:remove_xsync
[2013-09-16 14:57:14.684375] I [gsyncd(/rhs/brick2/x1):503:main_i] <top>: syncing: gluster://localhost:remove_xsync -> ssh://root@10.70.37.98:gluster://localhost:remove_xsync
[2013-09-16 14:57:21.670073] I [master(/rhs/brick2/x2):57:gmaster_builder] <top>: setting up xsync change detection mode
[2013-09-16 14:57:21.676136] I [master(/rhs/brick2/x2):57:gmaster_builder] <top>: setting up xsync change detection mode
[2013-09-16 14:57:21.688627] I [master(/rhs/brick2/x2):816:register] _GMaster: xsync temp directory: /var/run/gluster/remove_xsync/ssh%3A%2F%2Froot%4010.70.37.195%3Agluster%3A%2F%2F127.0.0.1%3Aremove_xsync/9b86668c9bd1c074e1e2720fc5005e44/xsync
[2013-09-16 14:57:21.688901] I [master(/rhs/brick2/x2):816:register] _GMaster: xsync temp directory: /var/run/gluster/remove_xsync/ssh%3A%2F%2Froot%4010.70.37.195%3Agluster%3A%2F%2F127.0.0.1%3Aremove_xsync/9b86668c9bd1c074e1e2720fc5005e44/xsync
[2013-09-16 14:57:22.300641] I [master(/rhs/brick2/x3):57:gmaster_builder] <top>: setting up xsync change detection mode
[2013-09-16 14:57:22.320192] I [master(/rhs/brick2/x1):57:gmaster_builder] <top>: setting up xsync change detection mode
[2013-09-16 14:57:22.320787] I [master(/rhs/brick2/x3):57:gmaster_builder] <top>: setting up xsync change detection mode
[2013-09-16 14:57:22.323508] I [master(/rhs/brick2/x1):57:gmaster_builder] <top>: setting up xsync change detection mode

Actual results:
All geo-replication worker instances are restarted.

Expected results:
Restarting of the other instances on remove-brick commit can be avoided; only the worker for the removed brick should be affected.

Additional info:
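To illustrate the expected behavior, here is a minimal, hypothetical Python sketch (gsyncd's monitor is written in Python, but this Monitor class, the current_bricks helper, and the spawn_worker stand-in are illustrative assumptions, not the actual gsyncd code). The idea is a monitor that reconciles its worker set against the volume's current brick list, so a remove-brick commit tears down only the worker for the removed brick instead of restarting every instance:

import subprocess

class Monitor:
    def __init__(self):
        # brick path -> worker process handle
        self.workers = {}

    def current_bricks(self, volume):
        # Placeholder for querying glusterd; assumes the gluster CLI is
        # on PATH. Real gsyncd obtains this via its own volinfo machinery.
        out = subprocess.check_output(
            ["gluster", "volume", "info", volume]).decode()
        return [line.split(": ", 1)[1] for line in out.splitlines()
                if line.startswith("Brick") and ": " in line]

    def reconcile(self, volume):
        bricks = set(self.current_bricks(volume))
        # Stop only the workers whose brick left the volume ...
        for brick in set(self.workers) - bricks:
            self.workers.pop(brick).terminate()
        # ... and start workers for bricks not yet covered. Workers for
        # surviving bricks keep running, avoiding the mass restart.
        for brick in bricks - set(self.workers):
            self.workers[brick] = self.spawn_worker(brick)

    def spawn_worker(self, brick):
        # Stand-in for launching a gsyncd worker for one brick.
        return subprocess.Popen(["sleep", "infinity"])

With this kind of reconciliation, the workers for /rhs/brick2/x1 and /rhs/brick2/x2 seen in the log above would keep running when only /rhs/brick3/x3 is removed.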
Verified with the build: glusterfs-3.7.1-10.el6rhs.x86_64

There is an additional step to stop the geo-rep session before doing the commit. Once the bricks are removed using commit, they are no longer listed in the volume info. Starting the geo-rep session does not pick up the removed bricks, and we do not see ECONNABORTED.

Moving this bug to verified state.
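For reference, assuming the volume, slave, and brick names from the reproduction steps above, the verified sequence would look like this sketch (standard gluster CLI syntax):

# Stop the geo-rep session before committing the brick removal
gluster volume geo-replication remove_xsync 10.70.37.195::remove_xsync stop

# Commit; the removed bricks drop out of 'gluster volume info'
gluster volume remove-brick remove_xsync 10.70.35.179:/rhs/brick3/x3 10.70.35.235:/rhs/brick3/x3 commit

# Restart the session; the removed bricks are not picked up,
# so no worker sees ECONNABORTED
gluster volume geo-replication remove_xsync 10.70.37.195::remove_xsync start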
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html