Bug 1569934

Summary: [geo-rep]: Geo-rep FAULTY with remove-brick
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Rochelle <rallan>
Component: geo-replication
Assignee: Mohit Agrawal <moagrawa>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Rahul Hinduja <rhinduja>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.4
CC: amukherj, csaba, khiremat, moagrawa, rallan, rhinduja, rhs-bugs, sankarshan, storage-qa-internal
Target Milestone: ---
Keywords: Regression
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-05-11 05:46:35 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1484113

Description Rochelle 2018-04-20 10:32:20 UTC
Description of problem:
=======================
After removing a brick first from the slave volume and then from the master volume, geo-replication goes into the FAULTY state and the worker fails with the following traceback:

Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 210, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 802, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1676, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 597, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1470, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1370, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1204, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1151, in process_change
    failures = self.slave.server.meta_ops(meta_entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 228, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 210, in __call__
    raise res
OSError: [Errno 22] Invalid argument: '.gfid/0817a213-4da2-480a-91db-524e18f414b4'
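The traceback shows the failure originating on the slave side (meta_ops on a `.gfid/<uuid>` aux path returned EINVAL) and being re-raised on the master by repce (`raise res`), which is what marks the worker Faulty. A hedged dry-run sketch of where to confirm this on the master node follows; the commands are only printed, not executed, and the geo-rep log directory/file name under /var/log/glusterfs/geo-replication/ is an assumption (it varies by version and session name):

```shell
# Dry-run helper: prints each command instead of executing it, so this
# sketch is safe off-cluster. Swap 'echo' for "$@" on a real node.
run() { echo "+ $*"; }

# Per-brick worker state, as captured in the status output below
run gluster volume geo-replication master 10.70.41.229::slave status

# The master-side worker log carries the OSError traceback; the exact
# session directory and file name here are an assumption.
run grep -A 20 "OSError" \
    "/var/log/glusterfs/geo-replication/master/gsyncd.log"
```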


[root@dhcp41-226 master]# gluster volume geo-replication master 10.70.41.229::slave status
 
MASTER NODE     MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                  SLAVE NODE      STATUS     CRAWL STATUS     LAST_SYNCED                  
---------------------------------------------------------------------------------------------------------------------------------------------------
10.70.41.226    master        /rhs/brick1/b1    root          10.70.41.229::slave    10.70.41.230    Passive    N/A              N/A                          
10.70.41.226    master        /rhs/brick2/b4    root          10.70.41.229::slave    N/A             Faulty     N/A              N/A                          
10.70.41.226    master        /rhs/brick3/b7    root          10.70.41.229::slave    10.70.41.230    Passive    N/A              N/A                          
10.70.41.228    master        /rhs/brick1/b3    root          10.70.41.229::slave    N/A             Faulty     N/A              N/A                          
10.70.41.228    master        /rhs/brick2/b6    root          10.70.41.229::slave    10.70.41.219    Passive    N/A              N/A                          
10.70.41.228    master        /rhs/brick3/b9    root          10.70.41.229::slave    N/A             Faulty     N/A              N/A                          
10.70.41.227    master        /rhs/brick1/b2    root          10.70.41.229::slave    10.70.41.229    Active     History Crawl    2018-04-20 05:45:58          
10.70.41.227    master        /rhs/brick2/b5    root          10.70.41.229::slave    10.70.41.229    Active     History Crawl    2018-04-20 05:45:58          
10.70.41.227    master        /rhs/brick3/b8    root          10.70.41.229::slave    10.70.41.229    Active     History Crawl    2018-04-20 05:45:58  



Version-Release number of selected component (if applicable):
=============================================================
[root@dhcp41-226 master]# rpm -qa | grep gluster
glusterfs-fuse-3.12.2-7.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-7.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-libs-3.12.2-7.el7rhgs.x86_64
glusterfs-cli-3.12.2-7.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.2.x86_64
glusterfs-rdma-3.12.2-7.el7rhgs.x86_64
glusterfs-events-3.12.2-7.el7rhgs.x86_64
glusterfs-3.12.2-7.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-7.el7rhgs.x86_64
glusterfs-server-3.12.2-7.el7rhgs.x86_64
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
python2-gluster-3.12.2-7.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
glusterfs-api-3.12.2-7.el7rhgs.x86_64


How reproducible:
=================
1/1


Steps to Reproduce:
==================
1. Create a master volume (3x3) and a slave volume (3x3)
2. Set up a geo-rep session between the master and the slave
3. Mount the master volume and start the following I/O:

for i in {create,chmod,hardlink,chgrp,symlink,hardlink,truncate,hardlink,rename,hardlink,symlink,hardlink,rename,chown,rename,create,hardlink,hardlink,symlink,rename}; do crefi --multi -n 5 -b 10 -d 10 --max=10K --min=500 --random -T 10 -t text --fop=$i /mnt/master ; sleep 10 ; done

4. Wait a couple of minutes for geo-rep to start syncing these files to the slave.
5. While syncing is in progress, remove one subvolume from the slave volume (3x2) via `remove-brick start`.
6. After about 10 minutes, remove one subvolume from the master volume too (4x2).
7. Keep checking the remove-brick (rebalance) status; once it completes on the slave, commit the remove-brick.
8. Keep checking the remove-brick (rebalance) status; once it completes on the master, stop geo-replication, commit the remove-brick, and start geo-replication again.
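Steps 5-8 above can be sketched as gluster CLI calls. This is a hedged dry-run sketch, not a verified reproduction script: the master brick paths are taken from the status table above (assuming /rhs/brick3/b7-b9 form one replica set), while the slave-side brick paths and hostnames (slave1-3, s7-s9) are placeholders, since the report does not list them:

```shell
# Dry-run wrapper: prints commands instead of executing them.
run() { echo "+ $*"; }

SLAVE_HOST=10.70.41.229

# Step 5: remove one replica-3 subvolume from the slave (placeholder bricks)
run gluster volume remove-brick slave \
    slave1:/rhs/brick3/s7 slave2:/rhs/brick3/s8 slave3:/rhs/brick3/s9 start

# Step 7: poll until data migration completes on the slave, then commit
run gluster volume remove-brick slave \
    slave1:/rhs/brick3/s7 slave2:/rhs/brick3/s8 slave3:/rhs/brick3/s9 status
run gluster volume remove-brick slave \
    slave1:/rhs/brick3/s7 slave2:/rhs/brick3/s8 slave3:/rhs/brick3/s9 commit

# Steps 6 and 8: same on the master (brick paths from the status table),
# with geo-replication stopped around the commit
run gluster volume remove-brick master \
    10.70.41.226:/rhs/brick3/b7 10.70.41.227:/rhs/brick3/b8 \
    10.70.41.228:/rhs/brick3/b9 start
run gluster volume remove-brick master \
    10.70.41.226:/rhs/brick3/b7 10.70.41.227:/rhs/brick3/b8 \
    10.70.41.228:/rhs/brick3/b9 status
run gluster volume geo-replication master ${SLAVE_HOST}::slave stop
run gluster volume remove-brick master \
    10.70.41.226:/rhs/brick3/b7 10.70.41.227:/rhs/brick3/b8 \
    10.70.41.228:/rhs/brick3/b9 commit
run gluster volume geo-replication master ${SLAVE_HOST}::slave start
```

On a live cluster the `run` wrapper would be replaced with direct execution, and the `status` calls repeated until the migration shows completed before issuing `commit`.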

Actual results:
===============
Geo-replication is in the FAULTY state, failing with the traceback above.

Expected results:
================
Geo-replication should continue syncing and not go FAULTY after the remove-brick operations.