Description of problem:
The first xsync crawl failed to sync a few hardlinks to the slave when there were some 200K hardlinks. Investigating the files that were missing on the slave showed that they had entries in xsync changelogs which were processed.

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.39rhs-1

How reproducible:
Didn't try to reproduce, but it looks reproducible.

Steps to Reproduce:
1. Create and start a geo-rep relationship between master and slave.
2. Create some 200K files on the master using the command "./crefi.py -n 2000 --multi -b 10 -d 10 --random --max=2K --min=1K /mnt/master/" and let them sync to the slave.
3. Stop the geo-rep session.
4. Create hardlinks to all those files using the command "./crefi.py -n 2000 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=hardlink /mnt/master/".

Actual results:
It failed to sync a few hardlinks.

Expected results:
It shouldn't fail to sync hardlinks.

Additional info:
To take an example, this file, ~/level07/level17/level27/hardlink_to_files/527a515c%%3QMD5P6YJB, is missing from the slave. It is a hardlink to the file ~/level07/level17/level27/5278c705%%RTP7V23802; both have the same inode:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[root@redcell ~]# ls /mnt/master/level07/level17/level27/hardlink_to_files/527a515c%%3QMD5P6YJB -i
11831742919762312969 /mnt/master/level07/level17/level27/hardlink_to_files/527a515c%%3QMD5P6YJB
[root@redcell ~]# ls /mnt/master/level07/level17/level27/5278c705%%RTP7V23802 -i
11831742919762312969 /mnt/master/level07/level17/level27/5278c705%%RTP7V23802
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

gfid of the missing file:

[root@redcell hardlink_to_files]# getfattr -n glusterfs.gfid.string 527a515c%%3QMD5P6YJB
# file: 527a515c%%3QMD5P6YJB
glusterfs.gfid.string="f35e8545-0325-4e64-a432-cb7f8e2cff09"

This gfid has an entry in the xsync changelog:

[root@redcell xsync]# grep f35e8545-0325-4e64-a432-cb7f8e2cff09 *
XSYNC-CHANGELOG.1383804613:E f35e8545-0325-4e64-a432-cb7f8e2cff09 LINK 0f35938d-d910-4c4a-9699-aa90b61c6f1f%2F5278c705%25%25RTP7V23802
XSYNC-CHANGELOG.1383804613:D f35e8545-0325-4e64-a432-cb7f8e2cff09

And this changelog has been processed by the worker:

[root@redcell ~]# grep XSYNC-CHANGELOG.1383804613 /var/log/glusterfs/geo-replication/master/ssh%3A%2F%2Froot%4010.70.43.76%3Agluster%3A%2F%2F127.0.0.1%3Aslave.log
[2013-11-07 12:08:32.530020] I [master(/bricks/brick1):922:crawl] _GMaster: processing xsync changelog /var/run/gluster/master/ssh%3A%2F%2Froot%4010.70.43.76%3Agluster%3A%2F%2F127.0.0.1%3Aslave/bd42ad17ef8864d51407b1c6478f5dc6/xsync/XSYNC-CHANGELOG.1383804613

Which means the entry operation failed on the slave side, and there are no corresponding entries in the slave client logs.
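For anyone re-checking this, a minimal sketch of the comparison done above, assuming fuse mounts of master at /mnt/master and slave at /mnt/slave (these paths are assumptions): walk the master mount, list paths missing on the slave, and print their gfids via the glusterfs.gfid.string virtual xattr so they can be grepped for in the xsync changelogs.

#!/usr/bin/env python3
# Sketch: find files present on the master mount but missing on the slave
# mount, and print their gfids so they can be grepped for in the
# XSYNC-CHANGELOG.* files. Mount points below are assumptions.
import os

MASTER_MNT = "/mnt/master"   # assumed master fuse mount
SLAVE_MNT = "/mnt/slave"     # assumed slave fuse mount

def gfid(path):
    # glusterfs exposes the gfid through this virtual xattr on a fuse mount
    try:
        return os.getxattr(path, "glusterfs.gfid.string").decode()
    except OSError:
        return "<unknown>"

missing = []
for root, _dirs, files in os.walk(MASTER_MNT):
    for name in files:
        mpath = os.path.join(root, name)
        rel = os.path.relpath(mpath, MASTER_MNT)
        if not os.path.lexists(os.path.join(SLAVE_MNT, rel)):
            missing.append((rel, gfid(mpath)))

for rel, g in missing:
    # grep this gfid in the xsync changelogs under the worker's working dir
    # to see whether an E/LINK record was generated for it
    print(g, rel)
print("missing on slave:", len(missing))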
After these failures to sync files to the slave, deleting the files crashed gsyncd with the following backtrace:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
f-31b8-44a1-b626-c1c302dd218f ...
[2013-11-06 22:54:07.616204] I [master(/bricks/brick3):413:crawlwrap] _GMaster: crawl interval: 3 seconds
[2013-11-06 22:54:07.741698] E [repce(/bricks/brick3):188:__call__] RepceClient: call 10737:140265205274368:1383758647.66 (entry_ops) failed on peer with OSError
[2013-11-06 22:54:07.742059] E [syncdutils(/bricks/brick3):207:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 150, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 535, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1134, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 437, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 858, in crawl
    self.process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 815, in process
    if self.process_change(change, done, retry):
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 780, in process_change
    self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 204, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 189, in __call__
    raise res
OSError: [Errno 61] No data available
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

This looks like the backtrace observed in Bug 1027252.
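The traceback shows repce re-raising ("raise res") an OSError that the slave-side entry_ops hit. A simplified illustration of that error-propagation pattern (not gsyncd's actual code, just a sketch of the shape visible in the traceback): the slave handler's exception travels back as the call result and is re-raised on the master, which would explain why the failure surfaces in the master's log rather than the slave's.

# Simplified illustration (not gsyncd code) of the pattern in the traceback:
# an exception raised by the slave-side entry_ops is returned as the call
# result and re-raised on the master side by the RPC client.
import errno

class FakeSlaveServer:
    def entry_ops(self, entries):
        # e.g. an operation on a missing gfid fails with ENODATA (errno 61)
        raise OSError(errno.ENODATA, "No data available")

class FakeRepceClient:
    def __init__(self, server):
        self.server = server
    def __call__(self, meth, *a):
        try:
            res = getattr(self.server, meth)(*a)
        except Exception as e:
            res = e              # exception travels back as the result
        if isinstance(res, Exception):
            raise res            # re-raised on the caller's side
        return res

client = FakeRepceClient(FakeSlaveServer())
try:
    client("entry_ops", [])
except OSError as e:
    print("FAIL:", e)            # FAIL: [Errno 61] No data available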
The backtrace looks the same as bug 1028343, which is fixed in .42rhs. Can this workload be tested on that build?
It looks like the fix for bug 1028343 should fix this issue as well. Requesting QE to verify on the .42rhs build.
I tried it on build glusterfs-3.4.0.43rhs-1 and still see this issue.

On the master, the number of files is:

[root@shaktiman ~]# find /mnt/master/ | wc -l
220201

On the slave, it has synced only:

[root@spiderman ~]# find /mnt/slave/ | wc -l
218535

So some 1,666 files are missing, and they do not show up under FILES SKIPPED in status detail either:

MASTER NODE                MASTER VOL    MASTER BRICK      SLAVE                  STATUS     CHECKPOINT STATUS    CRAWL STATUS       FILES SYNCD    FILES PENDING    BYTES PENDING    DELETES PENDING    FILES SKIPPED
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
shaktiman.blr.redhat.com   master        /bricks/brick1    10.70.42.171::slave    Active     N/A                  Changelog Crawl    192326         0                0                0                  0
snow.blr.redhat.com        master        /bricks/brick4    10.70.42.229::slave    Passive    N/A                  N/A                5601           0                0                0                  0
targarean.blr.redhat.com   master        /bricks/brick3    10.70.43.159::slave    Active     N/A                  Changelog Crawl    193030         0                0                0                  0
riverrun.blr.redhat.com    master        /bricks/brick2    10.70.42.225::slave    Passive    N/A                  N/A                0              0                0                0                  0
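A minimal sketch, again assuming master and slave fuse mounts at /mnt/master and /mnt/slave (assumed paths), to enumerate the missing paths and see which directories they cluster in, for instance whether they are all under hardlink_to_files:

# Sketch: set difference of relative paths between the two mounts, plus a
# count of missing entries per parent directory name to spot a pattern.
import os
from collections import Counter

def relpaths(mnt):
    out = set()
    for root, dirs, files in os.walk(mnt):
        for name in dirs + files:
            out.add(os.path.relpath(os.path.join(root, name), mnt))
    return out

master = relpaths("/mnt/master")   # assumed master mount
slave = relpaths("/mnt/slave")     # assumed slave mount
missing = sorted(master - slave)

print("missing on slave:", len(missing))
print(Counter(os.path.basename(os.path.dirname(p)) for p in missing).most_common(5))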
Vijaykumar, could you test with entry-timeout set to 0? You'd need to configure "gluster_params" to include this option:

# gluster volume geo <master> <slave> config gluster_params "aux-gfid-mount entry-timeout=0"
This needs to be documented as a known issue; removing the corbett flag.
Modified the DocText for this Known Issue. Please review and confirm.
The Doc Text looks fine.
Closing this bug since the RHGS 2.1 release has reached EOL. The required bugs have been cloned to RHGS 3.1. Please re-open this issue if it is seen again.