Bug 1027727 - Dist-geo-rep: first xsync crawl failed to sync a few hardlinks to the slave when there were some 200K hardlinks
Status: CLOSED EOL
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: geo-replication
Version: 2.1
Hardware: x86_64 Linux
Priority: high    Severity: high
Target Milestone: ---
Target Release: ---
Assigned To: Bug Updates Notification Mailing List
QA Contact: Rahul Hinduja
Whiteboard: consistency
Keywords: ZStream
Depends On:
Blocks: 1035040
Reported: 2013-11-07 05:38 EST by Vijaykumar Koppad
Modified: 2015-11-25 03:50 EST
CC: 11 users

See Also:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
When there are hundreds of thousands of hardlinks on the master volume before the geo-replication session is started, some of the hardlinks are not synchronized to the slave volume.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-11-25 03:48:03 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Vijaykumar Koppad 2013-11-07 05:38:41 EST
Description of problem: The first xsync crawl failed to sync a few hardlinks to the slave when there were some 200K hardlinks. Investigating the files that were missing on the slave showed that they had entries in the xsync changelogs, and those changelogs had been processed.



Version-Release number of selected component (if applicable): glusterfs-3.4.0.39rhs-1


How reproducible: Didn't try to reproduce, but it appears to be reproducible.


Steps to Reproduce:
1. Create and start a geo-rep session between the master and slave volumes.
2. Create some 200K files on the master using the command "./crefi.py -n 2000 --multi -b 10 -d 10 --random --max=2K --min=1K /mnt/master/" and let them sync to the slave.
3. Stop the geo-rep session.
4. Create hardlinks to all those files using the command "./crefi.py -n 2000 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=hardlink /mnt/master/".
5. Start the geo-rep session again so the first xsync crawl processes the hardlinks (a consolidated sketch of these steps follows below).

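For reference, a minimal end-to-end sketch of the above (assuming the standard geo-replication CLI of this release, a working passwordless SSH setup, and crefi.py in the current directory; "slavehost" and the mount points are placeholders):

# 1. Create and start the geo-rep session.
gluster volume geo-replication master slavehost::slave create push-pem
gluster volume geo-replication master slavehost::slave start

# 2. Create ~200K files on the master mount and wait for them to reach the slave.
./crefi.py -n 2000 --multi -b 10 -d 10 --random --max=2K --min=1K /mnt/master/

# 3. Stop the session, then create hardlinks to all of those files.
gluster volume geo-replication master slavehost::slave stop
./crefi.py -n 2000 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=hardlink /mnt/master/

# 4. Start the session again; the first xsync crawl should pick up the hardlinks.
gluster volume geo-replication master slavehost::slave start
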
Actual results: It failed to sync a few of the hardlinks.


Expected results: It should not fail to sync any hardlinks.


Additional info:

To take an example:

The file ~/level07/level17/level27/hardlink_to_files/527a515c%%3QMD5P6YJB is missing from the slave.
It is a hardlink to the file ~/level07/level17/level27/5278c705%%RTP7V23802.

Both have the same inode number:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[root@redcell ~]# ls /mnt/master/level07/level17/level27/hardlink_to_files/527a515c%%3QMD5P6YJB -i
11831742919762312969 /mnt/master/level07/level17/level27/hardlink_to_files/527a515c%%3QMD5P6YJB
[root@redcell ~]# ls /mnt/master/level07/level17/level27/5278c705%%RTP7V23802 -i
11831742919762312969 /mnt/master/level07/level17/level27/5278c705%%RTP7V23802
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

GFID of the missing file:

[root@redcell hardlink_to_files]# getfattr -n glusterfs.gfid.string 527a515c%%3QMD5P6YJB
# file: 527a515c%%3QMD5P6YJB
glusterfs.gfid.string="f35e8545-0325-4e64-a432-cb7f8e2cff09"
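As a cross-check (a sketch only, assuming the usual GlusterFS brick layout in which every regular file is hard-linked under the brick's .glusterfs directory; the brick path is a placeholder, so run it on whichever brick actually holds the file):

# Backend path is <brick>/.glusterfs/<first 2 hex chars>/<next 2 hex chars>/<full gfid>.
GFID=f35e8545-0325-4e64-a432-cb7f8e2cff09
ls -li /bricks/brick1/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID
# The inode number and link count should match the two names seen on the client mount.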



This GFID has entries in the xsync changelog:

[root@redcell xsync]# grep f35e8545-0325-4e64-a432-cb7f8e2cff09 *
XSYNC-CHANGELOG.1383804613:E f35e8545-0325-4e64-a432-cb7f8e2cff09 LINK 0f35938d-d910-4c4a-9699-aa90b61c6f1f%2F5278c705%25%25RTP7V23802
XSYNC-CHANGELOG.1383804613:D f35e8545-0325-4e64-a432-cb7f8e2cff09
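For readability, the last field of the E (entry) record above is URL-encoded and appears to be <parent-gfid>%2F<basename>; a quick way to decode it (illustrative one-liner, python3 assumed to be available):

# Decode the URL-escaped entry field of the E record.
python3 -c "from urllib.parse import unquote; print(unquote('0f35938d-d910-4c4a-9699-aa90b61c6f1f%2F5278c705%25%25RTP7V23802'))"
# -> 0f35938d-d910-4c4a-9699-aa90b61c6f1f/5278c705%%RTP7V23802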


And this changelog has been processed by the worker:

root@redcell ~]# grep XSYNC-CHANGELOG.1383804613 /var/log/glusterfs/geo-replication/master/ssh%3A%2F%2Froot%4010.70.43.76%3Agluster%3A%2F%2F127.0.0.1%3Aslave.log
[2013-11-07 12:08:32.530020] I [master(/bricks/brick1):922:crawl] _GMaster: processing xsync changelog /var/run/gluster/master/ssh%3A%2F%2Froot%4010.70.43.76%3Agluster%3A%2F%2F127.0.0.1%3Aslave/bd42ad17ef8864d51407b1c6478f5dc6/xsync/XSYNC-CHANGELOG.1383804613


This means the entry operation failed on the slave side, and there is nothing about it in the corresponding slave client logs.
Comment 1 Vijaykumar Koppad 2013-11-07 05:46:33 EST
After these failures to sync files to the slave, deleting the files crashed gsyncd with the following backtrace.


>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
f-31b8-44a1-b626-c1c302dd218f ...
[2013-11-06 22:54:07.616204] I [master(/bricks/brick3):413:crawlwrap] _GMaster: crawl interval: 3 seconds
[2013-11-06 22:54:07.741698] E [repce(/bricks/brick3):188:__call__] RepceClient: call 10737:140265205274368:1383758647.66 (entry_ops) failed on peer with OSError
[2013-11-06 22:54:07.742059] E [syncdutils(/bricks/brick3):207:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 150, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 535, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1134, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 437, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 858, in crawl
    self.process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 815, in process
    if self.process_change(change, done, retry):
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 780, in process_change
    self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 204, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 189, in __call__
    raise res
OSError: [Errno 61] No data available

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
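For context, errno 61 on Linux is ENODATA ("No data available"), which is what getxattr(2) returns when a requested extended attribute is absent; a quick check (illustrative only):

# ENODATA is errno 61 on Linux.
python3 -c "import errno, os; print(errno.ENODATA, os.strerror(errno.ENODATA))"
# -> 61 No data available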


This looks like the backtrace observed in Bug 1027252.
Comment 3 Amar Tumballi 2013-11-11 04:53:12 EST
The backtrace looks the same as bug 1028343, which is fixed in .42rhs. Can this workload be tested on that build?
Comment 4 Nagaprasad Sathyanarayana 2013-11-11 06:40:25 EST
Looks like the fix for 1028343 should fix this issue as well. Requesting QE to verify on the .42rhs build.
Comment 5 Vijaykumar Koppad 2013-11-12 06:48:44 EST
I tried it on build glusterfs-3.4.0.43rhs-1, and I still see this issue.

On the master, the number of files is:

[root@shaktiman ~]# find /mnt/master/ | wc -l
220201

and on the slave it has synced only:

[root@spiderman ~]# find /mnt/slave/ | wc -l
218535

Some 1,666 files are missing,

and those are not shown in the FILES SKIPPED column of "status detail" either.

MASTER NODE                 MASTER VOL    MASTER BRICK      SLAVE                  STATUS     CHECKPOINT STATUS    CRAWL STATUS       FILES SYNCD    FILES PENDING    BYTES PENDING    DELETES PENDING    FILES SKIPPED   
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
shaktiman.blr.redhat.com    master        /bricks/brick1    10.70.42.171::slave    Active     N/A                  Changelog Crawl    192326         0                0                0                  0               
snow.blr.redhat.com         master        /bricks/brick4    10.70.42.229::slave    Passive    N/A                  N/A                5601           0                0                0                  0               
targarean.blr.redhat.com    master        /bricks/brick3    10.70.43.159::slave    Active     N/A                  Changelog Crawl    193030         0                0                0                  0               
riverrun.blr.redhat.com     master        /bricks/brick2    10.70.42.225::slave    Passive    N/A                  N/A                0              0                0                0                  0
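To enumerate exactly which paths did not sync (a sketch, assuming both volumes are still mounted at /mnt/master and /mnt/slave; relative paths are compared so the mount prefixes do not matter):

# List relative paths present on the master mount but absent on the slave mount.
comm -23 <(cd /mnt/master && find . | sort) <(cd /mnt/slave && find . | sort)
# Count them; this should match the ~1.6K difference seen above.
comm -23 <(cd /mnt/master && find . | sort) <(cd /mnt/slave && find . | sort) | wc -l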
Comment 6 Kotresh HR 2013-12-16 03:27:27 EST
Vijaykumar,

Could you test with entry-timeout set to 0? You'd need to configure "gluster_params" to include this option:

# gluster volume geo <master> <slave> config gluster_params "aux-gfid-mount entry-timeout=0"
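A sketch of applying that setting and restarting the session so the change is picked up (assuming the full "geo-replication" subcommand name and that config changes take effect on a session restart; "slavehost" is a placeholder):

gluster volume geo-replication master slavehost::slave config gluster_params "aux-gfid-mount entry-timeout=0"
gluster volume geo-replication master slavehost::slave stop
gluster volume geo-replication master slavehost::slave start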
Comment 7 Vivek Agarwal 2013-12-17 19:47:44 EST
This needs to go in as a known issue; removing the Corbett flag.
Comment 8 Shalaka 2014-01-02 06:14:37 EST
Modified the DocText for this Known Issue. Please review and confirm.
Comment 9 Kotresh HR 2014-01-20 01:03:40 EST
The Doc Text looks fine.
Comment 13 Aravinda VK 2015-11-25 03:48:03 EST
Closing this bug since the RHGS 2.1 release has reached EOL. Required bugs have been cloned to RHGS 3.1. Please reopen this issue if it is seen again.
Comment 14 Aravinda VK 2015-11-25 03:50:21 EST
Closing this bug since the RHGS 2.1 release has reached EOL. Required bugs have been cloned to RHGS 3.1. Please reopen this issue if it is seen again.
