Bug 1365694 - [GSS] Geo-Replication session faulty after running out of space on /var partition.
Summary: [GSS] Geo-Replication session faulty after running out of space on /var partition.
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: geo-replication
Version: rhgs-3.1
Hardware: Unspecified
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Bug Updates Notification Mailing List
QA Contact: Rahul Hinduja
URL:
Whiteboard:
Depends On:
Blocks: RHGS-3.4-GSS-proposed-tracker
 
Reported: 2016-08-09 22:55 UTC by Cal Calhoun
Modified: 2019-11-14 08:55 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-11 10:30:31 UTC
Embargoed:



Description Cal Calhoun 2016-08-09 22:55:17 UTC
Description of problem:

After the /var partition filled up and was subsequently cleaned up, one geo-replication session is showing faulty for one pair of nodes.

Version-Release number of selected component (if applicable):
RHGS 3.1.3 on RHEL 6.8

How reproducible:
Ongoing

[root@rhssp1 ~]# gluster v geo-replication node8_dir status
 
MASTER NODE    MASTER VOL    MASTER BRICK     SLAVE USER    SLAVE                              SLAVE NODE    STATUS     CRAWL STATUS       LAST_SYNCED                  
-------------------------------------------------------------------------------------------------------------------------------------------------------------
node1         vol8_dir    /vol8b1/dir    root          ssh://node11::vol8_dir_slave    node11       Active     Changelog Crawl    2016-08-08 13:24:52          
node3         vol8_dir    /vol8b3/dir    root          ssh://node11::vol8_dir_slave    N/A           Faulty     N/A                N/A                          
node4         vol8_dir    /vol8b4/dir    root          ssh://node11::gvol8_dir_slave    N/A           Faulty     N/A                N/A                          
node2         vol8_dir    /vol8b2/dir    root          ssh://node11::vol8_dir_slave    node11       Passive    N/A                N/A

Comment 3 Aravinda VK 2016-08-10 06:48:28 UTC
Checked the log files. Observations:

gsyncd.conf file corruption: when the /var partition is full, a glusterd restart corrupts gsyncd.conf (the geo-rep session conf). As a workaround, the conf file was copied from a good peer node to the affected nodes. To fix this for the future, we should avoid regenerating the conf file on every glusterd start and handle write failures properly.
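
A rough illustration of the idea, as a sketch only and not the actual glusterd code; the helper names write_conf_atomically and regenerate_conf_if_missing are made up for this example. The point is to regenerate the conf only when it is missing, and to write it via a temp file plus rename so an ENOSPC on /var cannot leave a truncated gsyncd.conf behind.

    import os
    import tempfile

    def write_conf_atomically(conf_path, contents):
        """Write contents to conf_path via a temp file + rename.

        If the disk is full, the write/fsync raises OSError and the
        existing conf file is left untouched.
        """
        conf_dir = os.path.dirname(conf_path)
        fd, tmp_path = tempfile.mkstemp(dir=conf_dir, prefix=".gsyncd.conf.")
        try:
            with os.fdopen(fd, "w") as f:
                f.write(contents)
                f.flush()
                os.fsync(f.fileno())       # fails with ENOSPC instead of truncating
            os.rename(tmp_path, conf_path)  # atomic replace on the same filesystem
        except OSError:
            os.unlink(tmp_path)
            raise

    def regenerate_conf_if_missing(conf_path, contents):
        # Skip regeneration when a conf already exists, instead of
        # rewriting it on every glusterd start.
        if not os.path.exists(conf_path):
            write_conf_atomically(conf_path, contents)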

Python traceback causing geo-rep status Faulty: we found the following traceback in the slave log file.

Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 772, in entry_ops
    os.unlink(entry)
OSError: [Errno 21] Is a directory: '.gfid/12711ebf-7fdc-4f4b-9850-2d75581eb452/New folder'

During a rename, if the source and the target have the same inode, geo-rep deletes the source since no rename is required. This path does not handle directories, so when it tries to delete a source directory with os.unlink() it fails with the error above. We will work on this fix.
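
For reference, a minimal sketch of what a directory-aware deletion in that rename path could look like; this is not the shipped resource.py code, and remove_source_entry is a hypothetical helper name used only for illustration.

    import errno
    import os

    def remove_source_entry(path):
        try:
            os.unlink(path)                  # regular files and symlinks
        except OSError as e:
            if e.errno in (errno.EISDIR, errno.EPERM):
                try:
                    os.rmdir(path)           # empty directory
                except OSError as e2:
                    if e2.errno != errno.ENOTEMPTY:
                        raise
                    # Non-empty directory: leave it for a later cleanup
                    # or resync rather than failing the worker.
            else:
                raise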

As a workaround, in all brick backends of the slave, check whether the directory is empty:

ls ".glusterfs/12/71/12711ebf-7fdc-4f4b-9850-2d75581eb452/New folder"

If it is empty, delete "New folder" on the slave so that geo-rep can continue past this error.

If it is not empty, back up the files and then delete "New folder". We can trigger a sync of those files afterwards if needed. A rough helper automating this check is sketched below.
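
The sketch below assumes it is run separately on each slave brick backend; clean_stale_dir and the backup location are hypothetical names, and the gfid path is the one from this case.

    import os
    import shutil

    GFID_PATH = ".glusterfs/12/71/12711ebf-7fdc-4f4b-9850-2d75581eb452/New folder"

    def clean_stale_dir(brick_root, backup_root=None):
        path = os.path.join(brick_root, GFID_PATH)
        if not os.path.isdir(path):
            return
        if not os.listdir(path):
            os.rmdir(path)                  # empty: safe to remove
        elif backup_root:
            shutil.move(path, backup_root)  # back up the contents first
            # The moved files can be re-synced to the slave afterwards.

    # Example (hypothetical paths):
    # clean_stale_dir("/vol8b1/dir", backup_root="/root/georep-backup")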

Comment 6 Aravinda VK 2018-02-06 08:08:38 UTC
Geo-replication support has been added to the Glusterd2 project, which will be available with the Gluster upstream 4.0 and 4.1 releases.

Most of the issues are already fixed under issue https://github.com/gluster/glusterd2/issues/271, and the remaining fixes are tracked in issue https://github.com/gluster/glusterd2/issues/557.

We can close this bug since we are not planning any fixes for the 3.x series.

Comment 8 Amar Tumballi 2018-10-11 10:30:31 UTC
I see the issues are fixed upstream right now. The customer case is closed too.

When GD2 comes to the product, this gets resolved automatically. Please re-open the issue if the ask is to get the fix into GD1 itself, which would mean rescoping the effort to see what can be done.

