Description of problem:
=======================
After removing a brick first from the slave and then from the master, geo-replication goes to FAULTY state and fails with the following traceback:

Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 210, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 802, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1676, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 597, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1470, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1370, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1204, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1151, in process_change
    failures = self.slave.server.meta_ops(meta_entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 228, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 210, in __call__
    raise res
OSError: [Errno 22] Invalid argument: '.gfid/0817a213-4da2-480a-91db-524e18f414b4'

[root@dhcp41-226 master]# gluster volume geo-replication master 10.70.41.229::slave status

MASTER NODE     MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                  SLAVE NODE      STATUS     CRAWL STATUS     LAST_SYNCED
---------------------------------------------------------------------------------------------------------------------------------------------------
10.70.41.226    master        /rhs/brick1/b1    root          10.70.41.229::slave    10.70.41.230    Passive    N/A              N/A
10.70.41.226    master        /rhs/brick2/b4    root          10.70.41.229::slave    N/A             Faulty     N/A              N/A
10.70.41.226    master        /rhs/brick3/b7    root          10.70.41.229::slave    10.70.41.230    Passive    N/A              N/A
10.70.41.228    master        /rhs/brick1/b3    root          10.70.41.229::slave    N/A             Faulty     N/A              N/A
10.70.41.228    master        /rhs/brick2/b6    root          10.70.41.229::slave    10.70.41.219    Passive    N/A              N/A
10.70.41.228    master        /rhs/brick3/b9    root          10.70.41.229::slave    N/A             Faulty     N/A              N/A
10.70.41.227    master        /rhs/brick1/b2    root          10.70.41.229::slave    10.70.41.229    Active     History Crawl    2018-04-20 05:45:58
10.70.41.227    master        /rhs/brick2/b5    root          10.70.41.229::slave    10.70.41.229    Active     History Crawl    2018-04-20 05:45:58
10.70.41.227    master        /rhs/brick3/b8    root          10.70.41.229::slave    10.70.41.229    Active     History Crawl    2018-04-20 05:45:58

Version-Release number of selected component (if applicable):
=============================================================
[root@dhcp41-226 master]# rpm -qa | grep gluster
glusterfs-fuse-3.12.2-7.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-7.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-libs-3.12.2-7.el7rhgs.x86_64
glusterfs-cli-3.12.2-7.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.2.x86_64
glusterfs-rdma-3.12.2-7.el7rhgs.x86_64
glusterfs-events-3.12.2-7.el7rhgs.x86_64
glusterfs-3.12.2-7.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-7.el7rhgs.x86_64
glusterfs-server-3.12.2-7.el7rhgs.x86_64
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
python2-gluster-3.12.2-7.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
glusterfs-api-3.12.2-7.el7rhgs.x86_64

How reproducible:
=================
1/1

Steps to Reproduce:
===================
1. Create the master volume (3x3) and the slave volume (3x3).
2. Set up a geo-replication session between master and slave.
3. Mount the master volume and start the following I/O:
   for i in {create,chmod,hardlink,chgrp,symlink,hardlink,truncate,hardlink,rename,hardlink,symlink,hardlink,rename,chown,rename,create,hardlink,hardlink,symlink,rename}; do crefi --multi -n 5 -b 10 -d 10 --max=10K --min=500 --random -T 10 -t text --fop=$i /mnt/master ; sleep 10 ; done
4. Wait a couple of minutes for geo-replication to start syncing these files to the slave.
5. While syncing is in progress, remove one subvolume from the slave volume (3x2) via remove-brick start (this rebalances the data off the removed bricks).
6. After about 10 minutes, remove one subvolume from the master volume too (4x2).
7. Keep checking the rebalance status; once it completes on the slave, commit the remove-brick.
8. Keep checking the rebalance status; once it completes on the master, stop geo-replication, commit the remove-brick, and start geo-replication again.

Actual results:
===============
Geo-replication is in FAULTY state.

Expected results:
=================
Geo-replication should not be FAULTY.
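For reference, the remove-brick sequence in steps 5-8 looks roughly like the sketch below. The HOSTn:/BRICKn arguments are placeholders, not the bricks from this report; substitute the three bricks of the replica set being removed on each side:

```shell
# Sketch only -- HOSTn:/BRICKn are hypothetical placeholders.
# Remove one replica-3 subvolume from the slave; start triggers data migration:
gluster volume remove-brick slave HOST1:/BRICK1 HOST2:/BRICK2 HOST3:/BRICK3 start

# Poll until migration completes, then commit:
gluster volume remove-brick slave HOST1:/BRICK1 HOST2:/BRICK2 HOST3:/BRICK3 status
gluster volume remove-brick slave HOST1:/BRICK1 HOST2:/BRICK2 HOST3:/BRICK3 commit

# Same on the master, but stop geo-replication across the commit:
gluster volume geo-replication master 10.70.41.229::slave stop
gluster volume remove-brick master HOST4:/BRICK4 HOST5:/BRICK5 HOST6:/BRICK6 commit
gluster volume geo-replication master 10.70.41.229::slave start
```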
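Additional info on why a slave-side error appears in the master's worker traceback: geo-rep's repce layer executes operations on the slave and, if one fails, ships the exception object back to the master, where it is re-raised (the `raise res` frame in repce.py above). A minimal, hypothetical sketch of that pattern, not the actual repce code:

```python
import errno

def slave_meta_op(path):
    # Stand-in for the slave-side metadata op that failed here: an operation
    # on a .gfid/<uuid> aux-gfid path that the slave rejects with EINVAL.
    # Instead of raising locally, the error object is returned to the caller.
    return OSError(errno.EINVAL, "Invalid argument", path)

def master_call(op, path):
    # Stand-in for the master side: if the remote result is an exception,
    # re-raise it locally (mirrors "raise res" in repce.py), which crashes
    # the worker and flips the brick's session status to Faulty.
    res = op(path)
    if isinstance(res, Exception):
        raise res
    return res

try:
    master_call(slave_meta_op, ".gfid/0817a213-4da2-480a-91db-524e18f414b4")
except OSError as e:
    print(e.errno)  # 22 (EINVAL)
```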