Description of problem:
The checkpoint for one of the nodes failed to complete even though all the files were synced to the slave.

# gluster --mode=script volume geo master 10.70.43.159::slave status

MASTER NODE                 MASTER VOL    MASTER BRICK      SLAVE                  STATUS     CHECKPOINT STATUS                                                          CRAWL STATUS
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
shaktiman.blr.redhat.com    master        /bricks/brick1    10.70.42.171::slave    Active     checkpoint as of 2013-12-10 19:18:18 is completed at 2013-12-10 19:18:27   Changelog Crawl
targarean.blr.redhat.com    master        /bricks/brick3    10.70.43.159::slave    Active     checkpoint as of 2013-12-10 19:18:18 is not reached yet                    Changelog Crawl
snow.blr.redhat.com         master        /bricks/brick4    10.70.42.229::slave    Passive    N/A                                                                        N/A
riverrun.blr.redhat.com     master        /bricks/brick2    10.70.42.225::slave    Passive    N/A                                                                        N/A

The arequal checksums on master and slave match:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
# ./arequal-checksum /mnt/master/

Entry counts
Regular files   : 1000
Directories     : 101
Symbolic links  : 0
Other           : 0
Total           : 1101

Metadata checksums
Regular files   : 3e9
Directories     : 24e15a
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 90b14246848a2a048f68bcac192bd82c
Directories     : 975727c7e0a5b13
Symbolic links  : 0
Other           : 0
Total           : 16ac8c96e3aba93b

# ./arequal-checksum /mnt/slave/

Entry counts
Regular files   : 1000
Directories     : 101
Symbolic links  : 0
Other           : 0
Total           : 1101

Metadata checksums
Regular files   : 3e9
Directories     : 24e15a
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 90b14246848a2a048f68bcac192bd82c
Directories     : 975727c7e0a5b13
Symbolic links  : 0
Other           : 0
Total           : 16ac8c96e3aba93b
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.48geo-1

How reproducible:
Didn't try to reproduce.
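As a quick cross-check, the two arequal outputs above can be compared programmatically. A minimal sketch (the parsing helper below is illustrative, not part of arequal itself):

```python
import re

def total_checksum(output):
    """Return the value of the last 'Total' line in arequal-checksum
    output -- the first 'Total' is the entry count, the last one is
    the overall checksum."""
    totals = re.findall(r"^Total\s*:\s*(\S+)", output, re.MULTILINE)
    return totals[-1]

# Trimmed-down samples of the master/slave output pasted above.
master = """Total           : 1101
Total           : 16ac8c96e3aba93b"""
slave = """Total           : 1101
Total           : 16ac8c96e3aba93b"""

print(total_checksum(master) == total_checksum(slave))  # True
```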
It does not look easy to hit.

Steps to Reproduce:
1. Create and start a geo-rep relationship between master and slave.
2. Create data on the master through an NFS mount as an unprivileged user, and let it sync:
   ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K /mnt/master_nfs/
3. Delete the data on the master and let it purge from the slave too.
4. Create some more data on the master and set the checkpoint:
   ./crefi.py -n 10 --multi -b 10 -d 10 --random --max=2K --min=1K /mnt/master_nfs/

Actual results:
The checkpoint for one of the nodes failed to complete even though all the files were synced to the slave.

Expected results:
When the checkpoint has completed, the status and status detail output should show it as completed.

Additional info:
Looking at the logs on both bricks, brick1 and brick3 (brick1 is the one whose checkpoint completed; brick3's did not):

brick1's geo-rep log file has the checkpoint-completed log entries:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-12-10 19:18:12.294694] I [master(/bricks/brick1):1065:crawl] _GMaster: slave's time: (1386683273, 0)
[2013-12-10 19:18:20.737725] I [gsyncd(conf):479:main_i] <top>: checkpoint now:1386683298.237099 set
[2013-12-10 19:18:20.745501] I [syncdutils(conf):159:finalize] <top>: exiting.
[2013-12-10 19:18:27.65421] I [master(/bricks/brick1):587:checkpt_service] _GMaster: checkpoint now:1386683298.237099 completed
[2013-12-10 19:19:08.419729] I [master(/bricks/brick1):451:crawlwrap] _GMaster: 20 crawls, 1 turns
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

brick3's geo-rep log does not have a checkpoint-completed entry:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-12-10 19:18:02.353600] I [master(/bricks/brick3):1065:crawl] _GMaster: slave's time: (1386683263, 0)
[2013-12-10 19:18:10.427390] I [master(/bricks/brick3):451:crawlwrap] _GMaster: 14 crawls, 3 turns
[2013-12-10 19:18:20.979933] I [gsyncd(conf):479:main_i] <top>: checkpoint now:1386683298.237099 set
[2013-12-10 19:18:21.71181] I [syncdutils(conf):159:finalize] <top>: exiting.
[2013-12-10 19:19:10.709388] I [master(/bricks/brick3):451:crawlwrap] _GMaster: 20 crawls, 0 turns
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

getfattr on brick1 and brick3:

# getfattr -d -m . -e hex /bricks/brick1/
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.99461ce2-e8f7-443f-8048-8938fcaf379b.0ae03a11-5791-4a2b-8b65-51d333ef6336.stime=0x52a71b9800000000
trusted.glusterfs.99461ce2-e8f7-443f-8048-8938fcaf379b.xtime=0x52a71b8e0005df50
trusted.glusterfs.dht=0x0000000100000000000000007ffffffe
trusted.glusterfs.volume-id=0x99461ce2e8f7443f80488938fcaf379b

# getfattr -d -m . -e hex /bricks/brick3/
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick3/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.99461ce2-e8f7-443f-8048-8938fcaf379b.0ae03a11-5791-4a2b-8b65-51d333ef6336.stime=0x52a71b8e00000000
trusted.glusterfs.99461ce2-e8f7-443f-8048-8938fcaf379b.xtime=0x52a71b8e0001e253
trusted.glusterfs.dht=0x00000001000000007fffffffffffffff
trusted.glusterfs.volume-id=0x99461ce2e8f7443f80488938fcaf379b
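The stime/xtime values above can be decoded for easier comparison. This is an illustrative helper, not gsyncd code, and it assumes the 8-byte marker xattr is laid out as two big-endian 32-bit words (epoch seconds plus a sub-second field):

```python
import struct

def decode_marker_time(hexval):
    """Decode an stime/xtime xattr value as shown by `getfattr -e hex`
    into a (seconds, subseconds) tuple (assumed layout: two big-endian
    32-bit words)."""
    raw = bytes.fromhex(hexval[2:] if hexval.startswith("0x") else hexval)
    return struct.unpack(">II", raw)

# Values taken from the brick1/brick3 getfattr output above.
brick1_stime = decode_marker_time("0x52a71b9800000000")  # (1386683288, 0)
brick3_stime = decode_marker_time("0x52a71b8e00000000")  # (1386683278, 0)
brick3_xtime = decode_marker_time("0x52a71b8e0001e253")  # (1386683278, 123475)

# brick3's stime still trails its xtime, but only in the sub-second field.
print(brick3_stime < brick3_xtime)  # True
```

Note that brick3's stime equals its xtime in whole seconds and differs only in the sub-second word, which may be relevant to why its checkpoint is reported as not reached.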
This has happened again in the build glusterfs-libs-3.4.0.57rhs-1.

# getfattr -d -m . -e hex /bricks/master_brick1/
getfattr: Removing leading '/' from absolute path names
# file: bricks/master_brick1/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.7fd3bc69-de10-4d29-9179-578b0c74e22a.ce29b5ec-8d1c-4060-a356-4420e21679a5.stime=0x52e6090800000000
trusted.glusterfs.7fd3bc69-de10-4d29-9179-578b0c74e22a.xtime=0x52e60907000eeeb1
trusted.glusterfs.dht=0x0000000100000000000000007ffffffe
trusted.glusterfs.volume-id=0x7fd3bc69de104d299179578b0c74e22a

# getfattr -d -m . -e hex /bricks/master_brick3/
getfattr: Removing leading '/' from absolute path names
# file: bricks/master_brick3/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.7fd3bc69-de10-4d29-9179-578b0c74e22a.ce29b5ec-8d1c-4060-a356-4420e21679a5.stime=0x52e6090800000000
trusted.glusterfs.7fd3bc69-de10-4d29-9179-578b0c74e22a.xtime=0x52e6090800022cf3
trusted.glusterfs.dht=0x00000001000000007fffffffffffffff
trusted.glusterfs.volume-id=0x7fd3bc69de104d299179578b0c74e22a

# getfattr -d -m . -e hex /mnt/master
getfattr: Removing leading '/' from absolute path names
# file: mnt/master
trusted.glusterfs.7fd3bc69-de10-4d29-9179-578b0c74e22a.ce29b5ec-8d1c-4060-a356-4420e21679a5.stime=0x52e6090800000000
trusted.glusterfs.7fd3bc69-de10-4d29-9179-578b0c74e22a.xtime=0x52e609080001c6da
trusted.glusterfs.volume-id=0x7fd3bc69de104d299179578b0c74e22a
Closing this bug since the RHGS 2.1 release has reached EOL. The required bugs have been cloned to RHGS 3.1. Please re-open this issue if it is seen again.