Bug 1569934 - [geo-rep]: Geo-rep FAULTY with remove-brick
Summary: [geo-rep]: Geo-rep FAULTY with remove-brick
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: geo-replication
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Mohit Agrawal
QA Contact: Rahul Hinduja
URL:
Whiteboard:
Depends On:
Blocks: 1484113
 
Reported: 2018-04-20 10:32 UTC by Rochelle
Modified: 2018-05-11 05:46 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-11 05:46:35 UTC
Embargoed:



Description Rochelle 2018-04-20 10:32:20 UTC
Description of problem:
=======================
After removing a brick from the slave and then from the master, geo-replication goes into a FAULTY state and fails with the following traceback:

Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 210, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 802, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1676, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 597, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1470, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1370, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1204, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1151, in process_change
    failures = self.slave.server.meta_ops(meta_entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 228, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 210, in __call__
    raise res
OSError: [Errno 22] Invalid argument: '.gfid/0817a213-4da2-480a-91db-524e18f414b4'
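
The EINVAL is raised on the slave while applying a metadata operation on a virtual GFID path, which suggests the GFID could not be resolved on the slave volume. One way to check whether that GFID still resolves on the slave (a quick sketch; /mnt/slave-gfid is an arbitrary mount point, and the aux-gfid-mount option exposes the .gfid/ namespace):

mount -t glusterfs -o aux-gfid-mount 10.70.41.229:/slave /mnt/slave-gfid
stat /mnt/slave-gfid/.gfid/0817a213-4da2-480a-91db-524e18f414b4

If the stat fails, the file backing that GFID was presumably affected by the remove-brick/rebalance on the slave.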


[root@dhcp41-226 master]# gluster volume geo-replication master 10.70.41.229::slave status
 
MASTER NODE     MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                  SLAVE NODE      STATUS     CRAWL STATUS     LAST_SYNCED                  
---------------------------------------------------------------------------------------------------------------------------------------------------
10.70.41.226    master        /rhs/brick1/b1    root          10.70.41.229::slave    10.70.41.230    Passive    N/A              N/A                          
10.70.41.226    master        /rhs/brick2/b4    root          10.70.41.229::slave    N/A             Faulty     N/A              N/A                          
10.70.41.226    master        /rhs/brick3/b7    root          10.70.41.229::slave    10.70.41.230    Passive    N/A              N/A                          
10.70.41.228    master        /rhs/brick1/b3    root          10.70.41.229::slave    N/A             Faulty     N/A              N/A                          
10.70.41.228    master        /rhs/brick2/b6    root          10.70.41.229::slave    10.70.41.219    Passive    N/A              N/A                          
10.70.41.228    master        /rhs/brick3/b9    root          10.70.41.229::slave    N/A             Faulty     N/A              N/A                          
10.70.41.227    master        /rhs/brick1/b2    root          10.70.41.229::slave    10.70.41.229    Active     History Crawl    2018-04-20 05:45:58          
10.70.41.227    master        /rhs/brick2/b5    root          10.70.41.229::slave    10.70.41.229    Active     History Crawl    2018-04-20 05:45:58          
10.70.41.227    master        /rhs/brick3/b8    root          10.70.41.229::slave    10.70.41.229    Active     History Crawl    2018-04-20 05:45:58  
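
For the Faulty workers, the detailed status and the per-session gsyncd logs usually carry the same traceback with more context (the log directory name below is a placeholder; the exact session directory varies by version):

gluster volume geo-replication master 10.70.41.229::slave status detail
less /var/log/glusterfs/geo-replication/<session-dir>/gsyncd.log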



Version-Release number of selected component (if applicable):
=============================================================
[root@dhcp41-226 master]# rpm -qa | grep gluster
glusterfs-fuse-3.12.2-7.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-7.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-libs-3.12.2-7.el7rhgs.x86_64
glusterfs-cli-3.12.2-7.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.2.x86_64
glusterfs-rdma-3.12.2-7.el7rhgs.x86_64
glusterfs-events-3.12.2-7.el7rhgs.x86_64
glusterfs-3.12.2-7.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-7.el7rhgs.x86_64
glusterfs-server-3.12.2-7.el7rhgs.x86_64
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
python2-gluster-3.12.2-7.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
glusterfs-api-3.12.2-7.el7rhgs.x86_64


How reproducible:
=================
1/1


Steps to Reproduce:
==================
1. Create Master volume (3x3) and Slave volume (3x3)
2. Setup geo-rep session between master and slave
3. Mount the Master volume and start the following IO:

for i in {create,chmod,hardlink,chgrp,symlink,hardlink,truncate,hardlink,rename,hardlink,symlink,hardlink,rename,chown,rename,create,hardlink,hardlink,symlink,rename}; do crefi --multi -n 5 -b 10 -d 10 --max=10K --min=500 --random -T 10 -t text --fop=$i /mnt/master ; sleep 10 ; done

4. Wait a couple of minutes for geo-rep to start syncing these files to the slave.
5. While syncing is in progress, start removing one subvolume from the slave volume (3x2) with remove-brick start (this kicks off data migration/rebalance).
6. After about 10 minutes, start removing one subvolume from the Master volume too (4x2).
7. Keep checking the remove-brick (rebalance) status; once data migration completes on the slave, commit the remove-brick.
8. Keep checking the remove-brick (rebalance) status; once data migration completes on the Master, stop geo-replication, commit the remove-brick, and start geo-replication again (a rough command outline follows this list).
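
Roughly, steps 5-8 correspond to the command sequence below. The brick endpoints are placeholders for the subvolume being removed (run the slave commands on a slave node and the master commands on a master node):

gluster volume remove-brick slave <node1>:<brick> <node2>:<brick> <node3>:<brick> start
gluster volume remove-brick slave <node1>:<brick> <node2>:<brick> <node3>:<brick> status
gluster volume remove-brick slave <node1>:<brick> <node2>:<brick> <node3>:<brick> commit
gluster volume remove-brick master <node1>:<brick> <node2>:<brick> <node3>:<brick> start
gluster volume remove-brick master <node1>:<brick> <node2>:<brick> <node3>:<brick> status
gluster volume geo-replication master 10.70.41.229::slave stop
gluster volume remove-brick master <node1>:<brick> <node2>:<brick> <node3>:<brick> commit
gluster volume geo-replication master 10.70.41.229::slave start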

Actual results:
===============
Geo-replication in FAULTY state

Expected results:
================
Geo-replication should not be FAULTY

