Bug 1365694

Summary: [GSS] Geo-Replication session faulty after running out of space on /var partition.
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Cal Calhoun <ccalhoun>
Component: geo-replication
Assignee: Bug Updates Notification Mailing List <rhs-bugs>
Status: CLOSED UPSTREAM
QA Contact: Rahul Hinduja <rhinduja>
Severity: high
Docs Contact:
Priority: medium
Version: rhgs-3.1
CC: atumball, csaba, rhs-bugs, storage-qa-internal, vnosov
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-10-11 10:30:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1472361

Description Cal Calhoun 2016-08-09 22:55:17 UTC
Description of problem:

After the /var partition filled up and was subsequently cleaned up, one geo-replication session is showing faulty for one pair of nodes.

Version-Release number of selected component (if applicable):
RHGS 3.1.3 on RHEL 6.8

How reproducible:
Ongoing

[root@rhssp1 ~]# gluster v geo-replication node8_dir status
 
MASTER NODE    MASTER VOL    MASTER BRICK     SLAVE USER    SLAVE                              SLAVE NODE    STATUS     CRAWL STATUS       LAST_SYNCED                  
-------------------------------------------------------------------------------------------------------------------------------------------------------------
node1         vol8_dir    /vol8b1/dir    root          ssh://node11::vol8_dir_slave    node11       Active     Changelog Crawl    2016-08-08 13:24:52          
node3         vol8_dir    /vol8b3/dir    root          ssh://node11::vol8_dir_slave    N/A           Faulty     N/A                N/A                          
node4         vol8_dir    /vol8b4/dir    root          ssh://node11::gvol8_dir_slave    N/A           Faulty     N/A                N/A                          
node2         vol8_dir    /vol8b2/dir    root          ssh://node11::vol8_dir_slave    node11       Passive    N/A                N/A

Comment 3 Aravinda VK 2016-08-10 06:48:28 UTC
Checked the log files. Observations:

gsyncd.conf file corruption: when the /var partition is full, a glusterd restart corrupts gsyncd.conf (the Geo-rep session conf). As a workaround, the conf file was copied from a good peer node to the other nodes. To fix this in the future, we should avoid regenerating the conf file every time glusterd starts and should handle write failures.
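
A common way to avoid this kind of corruption, assuming the conf is regenerated as a whole file, is to write the new contents to a temporary file and atomically rename it over the old one, so an ENOSPC failure on a full /var leaves the previous conf intact. The following is only an illustrative Python sketch (write_conf_atomically and the example path are hypothetical, not the actual glusterd change):

import os
import tempfile

def write_conf_atomically(conf_path, contents):
    # Write the Geo-rep session conf safely: a partial write (e.g. ENOSPC on a
    # full /var) never replaces the existing, still-valid conf file.
    conf_dir = os.path.dirname(conf_path)
    fd, tmp_path = tempfile.mkstemp(dir=conf_dir, prefix=".gsyncd.conf.")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(contents)
            f.flush()
            os.fsync(f.fileno())        # fails here if the partition is full
        os.rename(tmp_path, conf_path)  # atomic replace on the same filesystem
    except OSError:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)         # keep the old conf untouched
        raise

# Example (hypothetical path):
# write_conf_atomically("/var/lib/glusterd/geo-replication/.../gsyncd.conf", new_conf_text)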

Python traceback causing Geo-rep status Faulty: we found the following traceback in the Slave log file.

Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 772, in entry_ops
    os.unlink(entry)
OSError: [Errno 21] Is a directory: '.gfid/12711ebf-7fdc-4f4b-9850-2d75581eb452/New folder'

During a rename, if the source and target have the same inode, Geo-rep deletes the source since no rename is required. That code path does not handle directories: when it tries to delete the source directory with os.unlink() it fails with the error above. We will work on a fix.
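
A possible shape for that fix, as a Python sketch only (remove_source_entry is a hypothetical helper, not the upstream patch): check whether the entry is a directory and use os.rmdir() instead of os.unlink(), tolerating entries that are already gone or not yet empty so the worker does not go Faulty.

import errno
import os

def remove_source_entry(entry):
    # Delete the stale rename source on the slave. Unlike a bare os.unlink(),
    # this also handles directories and ignores entries that are already gone
    # or not yet empty, instead of failing the worker.
    try:
        if os.path.isdir(entry):
            os.rmdir(entry)        # the directory case that raised EISDIR above
        else:
            os.unlink(entry)
    except OSError as e:
        if e.errno in (errno.ENOENT, errno.ENOTEMPTY):
            return
        raise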

As a workaround, on every brick backend of the Slave, check whether the directory is empty:

ls ".glusterfs/12/71/12711ebf-7fdc-4f4b-9850-2d75581eb452/New folder"

If it is empty, delete "New folder" on the Slave so that Geo-rep can continue past this error.

If it is not empty, back up the files and then delete "New folder". We can trigger a sync for any files from this directory afterwards.
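
If there are many bricks to check, the same steps can be scripted. A minimal Python sketch, assuming hypothetical brick and backup paths (clear_stuck_dir and /tmp/georep-backup are illustrative only):

import os
import shutil

GFID = "12711ebf-7fdc-4f4b-9850-2d75581eb452"

def clear_stuck_dir(brick_root, gfid=GFID, name="New folder",
                    backup_dir="/tmp/georep-backup"):
    # Apply the workaround on one slave brick: remove the offending directory
    # if it is empty, otherwise move its contents aside for a later re-sync.
    path = os.path.join(brick_root, ".glusterfs", gfid[0:2], gfid[2:4], gfid, name)
    if not os.path.isdir(path):
        print("nothing to do on", brick_root)
        return
    if not os.listdir(path):
        os.rmdir(path)
        print("removed empty", path)
    else:
        os.makedirs(backup_dir, exist_ok=True)
        dest = os.path.join(backup_dir, os.path.basename(brick_root.rstrip("/")))
        shutil.move(path, dest)   # keep the files, then trigger a sync for them
        print("backed up", path, "to", dest)

# Example, with a hypothetical slave brick path:
# clear_stuck_dir("/vol8b1/dir_slave")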

Comment 6 Aravinda VK 2018-02-06 08:08:38 UTC
Geo-replication support has been added to the Glusterd2 project, which will be available with the Gluster upstream 4.0 and 4.1 releases.

Most of the issues are already fixed under https://github.com/gluster/glusterd2/issues/271, and the remaining fixes are tracked in https://github.com/gluster/glusterd2/issues/557

We can close these issues since we are not planning any fixes for the 3.x series.

Comment 8 Amar Tumballi 2018-10-11 10:30:31 UTC
I see the issues are fixed upstream now, and the customer case is closed too.

When GD2 comes to the product, this gets automatically resolved. Please re-open the issue if the ask is to get the fix into GD1 itself, in which case we have to rescope the effort and see what can be done.