Bug 1027727 - Dist-geo-rep: first xsync crawl failed to sync a few hardlinks to the slave when there were some 200K hardlinks
Status: CLOSED EOL
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: geo-replication
Version: 2.1
Hardware: x86_64 Linux
Priority: high    Severity: high
Target Milestone: ---
Target Release: ---
Assigned To: Bug Updates Notification Mailing List
QA Contact: Rahul Hinduja
Whiteboard: consistency
Keywords: ZStream
Depends On:
Blocks: 1035040
Reported: 2013-11-07 05:38 EST by Vijaykumar Koppad
Modified: 2015-11-25 03:50 EST
CC: 11 users

See Also:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
When there are hundreds of thousands of hardlinks on the master volume before the geo-replication session is started, some of the hardlinks are not synchronized to the slave volume.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-11-25 03:48:03 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Vijaykumar Koppad 2013-11-07 05:38:41 EST
Description of problem: The first xsync crawl failed to sync a few hardlinks to the slave when there were some 200K hardlinks. Investigating the files that were missing on the slave showed that they had entries in the xsync changelogs, and those changelogs had been processed.



Version-Release number of selected component (if applicable): glusterfs-3.4.0.39rhs-1


How reproducible: Didn't try to reproduce, but it appears to be reproducible.


Steps to Reproduce:
1. Create and start a geo-rep session between the master and slave volumes.
2. Create some 200K files on the master using the command "./crefi.py -n 2000 --multi -b 10 -d 10 --random --max=2K --min=1K /mnt/master/" and let them sync to the slave.
3. Stop the geo-rep session.
4. Create hardlinks to all those files using the command "./crefi.py -n 2000 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=hardlink /mnt/master/".
5. Start the geo-rep session again so the first xsync crawl processes the hardlinks (a consolidated sketch of these steps follows below).

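For reference, a minimal end-to-end sketch of the above (assuming the standard geo-replication CLI of this release, a working passwordless SSH setup, and crefi.py in the current directory; "slavehost" and the mount points are placeholders):

# 1. Create and start the geo-rep session.
gluster volume geo-replication master slavehost::slave create push-pem
gluster volume geo-replication master slavehost::slave start

# 2. Create ~200K files on the master mount and wait for them to reach the slave.
./crefi.py -n 2000 --multi -b 10 -d 10 --random --max=2K --min=1K /mnt/master/

# 3. Stop the session, then create hardlinks to all of those files.
gluster volume geo-replication master slavehost::slave stop
./crefi.py -n 2000 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=hardlink /mnt/master/

# 4. Start the session again; the first xsync crawl should pick up the hardlinks.
gluster volume geo-replication master slavehost::slave start
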
Actual results: It failed to sync a few of the hardlinks.


Expected results: It should not fail to sync any hardlinks.


Additional info:

To take an example:

The file ~/level07/level17/level27/hardlink_to_files/527a515c%%3QMD5P6YJB is missing from the slave.
It is a hardlink to the file ~/level07/level17/level27/5278c705%%RTP7V23802.

Both have the same inode number:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[root@redcell ~]# ls /mnt/master/level07/level17/level27/hardlink_to_files/527a515c%%3QMD5P6YJB -i
11831742919762312969 /mnt/master/level07/level17/level27/hardlink_to_files/527a515c%%3QMD5P6YJB
[root@redcell ~]# ls /mnt/master/level07/level17/level27/5278c705%%RTP7V23802 -i
11831742919762312969 /mnt/master/level07/level17/level27/5278c705%%RTP7V23802
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

GFID of the missing file:

[root@redcell hardlink_to_files]# getfattr -n glusterfs.gfid.string 527a515c%%3QMD5P6YJB
# file: 527a515c%%3QMD5P6YJB
glusterfs.gfid.string="f35e8545-0325-4e64-a432-cb7f8e2cff09"
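As a cross-check (a sketch only, assuming the usual GlusterFS brick layout in which every regular file is hard-linked under the brick's .glusterfs directory; the brick path is a placeholder, so run it on whichever brick actually holds the file):

# Backend path is <brick>/.glusterfs/<first 2 hex chars>/<next 2 hex chars>/<full gfid>.
GFID=f35e8545-0325-4e64-a432-cb7f8e2cff09
ls -li /bricks/brick1/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID
# The inode number and link count should match the two names seen on the client mount.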



This GFID has entries in the xsync changelog:

[root@redcell xsync]# grep f35e8545-0325-4e64-a432-cb7f8e2cff09 *
XSYNC-CHANGELOG.1383804613:E f35e8545-0325-4e64-a432-cb7f8e2cff09 LINK 0f35938d-d910-4c4a-9699-aa90b61c6f1f%2F5278c705%25%25RTP7V23802
XSYNC-CHANGELOG.1383804613:D f35e8545-0325-4e64-a432-cb7f8e2cff09
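For readability, the last field of the E (entry) record above is URL-encoded and appears to be <parent-gfid>%2F<basename>; a quick way to decode it (illustrative one-liner, python3 assumed to be available):

# Decode the URL-escaped entry field of the E record.
python3 -c "from urllib.parse import unquote; print(unquote('0f35938d-d910-4c4a-9699-aa90b61c6f1f%2F5278c705%25%25RTP7V23802'))"
# -> 0f35938d-d910-4c4a-9699-aa90b61c6f1f/5278c705%%RTP7V23802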


And this changelog has been processed by the worker:

root@redcell ~]# grep XSYNC-CHANGELOG.1383804613 /var/log/glusterfs/geo-replication/master/ssh%3A%2F%2Froot%4010.70.43.76%3Agluster%3A%2F%2F127.0.0.1%3Aslave.log
[2013-11-07 12:08:32.530020] I [master(/bricks/brick1):922:crawl] _GMaster: processing xsync changelog /var/run/gluster/master/ssh%3A%2F%2Froot%4010.70.43.76%3Agluster%3A%2F%2F127.0.0.1%3Aslave/bd42ad17ef8864d51407b1c6478f5dc6/xsync/XSYNC-CHANGELOG.1383804613


This means the entry operation failed on the slave side, and there is nothing about it in the corresponding slave client logs.
Comment 1 Vijaykumar Koppad 2013-11-07 05:46:33 EST
After these failures to sync files to the slave, deleting the files crashed gsyncd with the following backtrace.


>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
f-31b8-44a1-b626-c1c302dd218f ...
[2013-11-06 22:54:07.616204] I [master(/bricks/brick3):413:crawlwrap] _GMaster: crawl interval: 3 seconds
[2013-11-06 22:54:07.741698] E [repce(/bricks/brick3):188:__call__] RepceClient: call 10737:140265205274368:1383758647.66 (entry_ops) failed on peer with OSError
[2013-11-06 22:54:07.742059] E [syncdutils(/bricks/brick3):207:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 150, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 535, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1134, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 437, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 858, in crawl
    self.process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 815, in process
    if self.process_change(change, done, retry):
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 780, in process_change
    self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 204, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 189, in __call__
    raise res
OSError: [Errno 61] No data available

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
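For context, errno 61 on Linux is ENODATA ("No data available"), which is what getxattr(2) returns when a requested extended attribute is absent; a quick check (illustrative only):

# ENODATA is errno 61 on Linux.
python3 -c "import errno, os; print(errno.ENODATA, os.strerror(errno.ENODATA))"
# -> 61 No data available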


This looks like the backtrace observed in Bug 1027252.
Comment 3 Amar Tumballi 2013-11-11 04:53:12 EST
The backtrace looks the same as bug 1028343, which is fixed in .42rhs. Can this workload be tested on that build?
Comment 4 Nagaprasad Sathyanarayana 2013-11-11 06:40:25 EST
Looks like the fix for 1028343 should fix this issue as well. Requesting QE to verify on the .42rhs build.
Comment 5 Vijaykumar Koppad 2013-11-12 06:48:44 EST
I tried it on build glusterfs-3.4.0.43rhs-1, and I still see this issue.

On the master, the number of files is:

[root@shaktiman ~]# find /mnt/master/ | wc -l
220201

and on the slave it has synced only:

[root@spiderman ~]# find /mnt/slave/ | wc -l
218535

Some 1,666 files are missing,

and those are not shown in the FILES SKIPPED column of "status detail" either.

MASTER NODE                 MASTER VOL    MASTER BRICK      SLAVE                  STATUS     CHECKPOINT STATUS    CRAWL STATUS       FILES SYNCD    FILES PENDING    BYTES PENDING    DELETES PENDING    FILES SKIPPED   
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
shaktiman.blr.redhat.com    master        /bricks/brick1    10.70.42.171::slave    Active     N/A                  Changelog Crawl    192326         0                0                0                  0               
snow.blr.redhat.com         master        /bricks/brick4    10.70.42.229::slave    Passive    N/A                  N/A                5601           0                0                0                  0               
targarean.blr.redhat.com    master        /bricks/brick3    10.70.43.159::slave    Active     N/A                  Changelog Crawl    193030         0                0                0                  0               
riverrun.blr.redhat.com     master        /bricks/brick2    10.70.42.225::slave    Passive    N/A                  N/A                0              0                0                0                  0
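To enumerate exactly which paths did not sync (a sketch, assuming both volumes are still mounted at /mnt/master and /mnt/slave; relative paths are compared so the mount prefixes do not matter):

# List relative paths present on the master mount but absent on the slave mount.
comm -23 <(cd /mnt/master && find . | sort) <(cd /mnt/slave && find . | sort)
# Count them; this should match the ~1.6K difference seen above.
comm -23 <(cd /mnt/master && find . | sort) <(cd /mnt/slave && find . | sort) | wc -l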
Comment 6 Kotresh HR 2013-12-16 03:27:27 EST
Vijaykumar,

Could you test with entry-timeout set to 0? You'd need to configure "gluster_params" to include this option:

# gluster volume geo <master> <slave> config gluster_params "aux-gfid-mount entry-timeout=0"
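A sketch of applying that setting and restarting the session so the change is picked up (assuming the full "geo-replication" subcommand name and that config changes take effect on a session restart; "slavehost" is a placeholder):

gluster volume geo-replication master slavehost::slave config gluster_params "aux-gfid-mount entry-timeout=0"
gluster volume geo-replication master slavehost::slave stop
gluster volume geo-replication master slavehost::slave start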
Comment 7 Vivek Agarwal 2013-12-17 19:47:44 EST
This needs to go in as a known issue; removing the Corbett flag.
Comment 8 Shalaka 2014-01-02 06:14:37 EST
Modified the DocText for this Known Issue. Please review and confirm.
Comment 9 Kotresh HR 2014-01-20 01:03:40 EST
The Doc Text looks fine.
Comment 13 Aravinda VK 2015-11-25 03:48:03 EST
Closing this bug since the RHGS 2.1 release has reached EOL. Required bugs have been cloned to RHGS 3.1. Please reopen this issue if it is seen again.
Comment 14 Aravinda VK 2015-11-25 03:50:21 EST
Closing this bug since the RHGS 2.1 release has reached EOL. Required bugs have been cloned to RHGS 3.1. Please reopen this issue if it is seen again.
