Bug 1040344

Summary: Dist-geo-rep: checkpoint for one of the nodes failed to complete even though all the files were synced to the slave
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Vijaykumar Koppad <vkoppad>
Component: geo-replication
Assignee: Bug Updates Notification Mailing List <rhs-bugs>
Status: CLOSED EOL
QA Contact: storage-qa-internal <storage-qa-internal>
Severity: medium
Docs Contact:
Priority: medium
Version: 2.1
CC: avishwan, chrisw, csaba, david.macdonald
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard: checkpoint
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-11-25 08:49:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Vijaykumar Koppad 2013-12-11 09:03:15 UTC
Description of problem: 

The checkpoint for one of the nodes failed to complete even though all the files were synced to the slave.

# gluster --mode=script volume geo-replication master 10.70.43.159::slave status
 
MASTER NODE                 MASTER VOL    MASTER BRICK      SLAVE                  STATUS     CHECKPOINT STATUS                                                           CRAWL STATUS           
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
shaktiman.blr.redhat.com    master        /bricks/brick1    10.70.42.171::slave    Active     checkpoint as of 2013-12-10 19:18:18 is completed at 2013-12-10 19:18:27    Changelog Crawl        
targarean.blr.redhat.com    master        /bricks/brick3    10.70.43.159::slave    Active     checkpoint as of 2013-12-10 19:18:18 is not reached yet                     Changelog Crawl        
snow.blr.redhat.com         master        /bricks/brick4    10.70.42.229::slave    Passive    N/A                                                                         N/A                    
riverrun.blr.redhat.com     master        /bricks/brick2    10.70.42.225::slave    Passive    N/A                                                                         N/A      

and the arequal checksums on master and slave match:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
# ./arequal-checksum  /mnt/master/
Entry counts
Regular files   : 1000
Directories     : 101
Symbolic links  : 0
Other           : 0
Total           : 1101

Metadata checksums
Regular files   : 3e9
Directories     : 24e15a
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 90b14246848a2a048f68bcac192bd82c
Directories     : 975727c7e0a5b13
Symbolic links  : 0
Other           : 0
Total           : 16ac8c96e3aba93b


# ./arequal-checksum /mnt/slave/

Entry counts
Regular files   : 1000
Directories     : 101
Symbolic links  : 0
Other           : 0
Total           : 1101

Metadata checksums
Regular files   : 3e9
Directories     : 24e15a
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 90b14246848a2a048f68bcac192bd82c
Directories     : 975727c7e0a5b13
Symbolic links  : 0
Other           : 0
Total           : 16ac8c96e3aba93b

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
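A quick way to confirm the two arequal reports are identical (a sketch using the mount points from this report, assuming arequal-checksum is in the current directory):

# No output from diff means every entry count and checksum matches between master and slave
diff <(./arequal-checksum /mnt/master/) <(./arequal-checksum /mnt/slave/)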


Version-Release number of selected component (if applicable): 
glusterfs-3.4.0.48geo-1


How reproducible: Didn't try to reproduce; it does not look easy to hit.


Steps to Reproduce:
1. Create and start a geo-rep session between the master and slave volumes.
2. Create data on the master through an NFS mount as an unprivileged user, and let it sync:
./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K    /mnt/master_nfs/
3. Delete the data on the master and let it be purged from the slave too.
4. Create some more data on the master and set the checkpoint (command sketched below):
./crefi.py -n 10 --multi -b 10 -d 10 --random --max=2K --min=1K    /mnt/master_nfs/
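
The checkpoint in step 4 is set through the geo-replication config interface; a minimal sketch, assuming the master volume is named master and the slave is 10.70.43.159::slave as in the status output above:

# Set a checkpoint at the current time, then poll status until it reports completed
gluster volume geo-replication master 10.70.43.159::slave config checkpoint now
gluster volume geo-replication master 10.70.43.159::slave status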


Actual results: The checkpoint for one of the nodes failed to complete even though all the files were synced to the slave.


Expected results: Once the checkpoint has completed, the status and status detail output should show it as completed.


Additional info:

Looking at the geo-rep logs for both bricks, brick1 and brick3 (brick1 is the one whose checkpoint completed; brick3's did not):

The brick1 geo-rep log has the checkpoint-completed message:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-12-10 19:18:12.294694] I [master(/bricks/brick1):1065:crawl] _GMaster: slave's time: (1386683273, 0)
[2013-12-10 19:18:20.737725] I [gsyncd(conf):479:main_i] <top>: checkpoint now:1386683298.237099 set
[2013-12-10 19:18:20.745501] I [syncdutils(conf):159:finalize] <top>: exiting.
[2013-12-10 19:18:27.65421] I [master(/bricks/brick1):587:checkpt_service] _GMaster: checkpoint now:1386683298.237099 completed
[2013-12-10 19:19:08.419729] I [master(/bricks/brick1):451:crawlwrap] _GMaster: 20 crawls, 1 turns
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

The brick3 geo-rep log has no checkpoint-completed message:


>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-12-10 19:18:02.353600] I [master(/bricks/brick3):1065:crawl] _GMaster: slave's time: (1386683263, 0)
[2013-12-10 19:18:10.427390] I [master(/bricks/brick3):451:crawlwrap] _GMaster: 14 crawls, 3 turns
[2013-12-10 19:18:20.979933] I [gsyncd(conf):479:main_i] <top>: checkpoint now:1386683298.237099 set
[2013-12-10 19:18:21.71181] I [syncdutils(conf):159:finalize] <top>: exiting.
[2013-12-10 19:19:10.709388] I [master(/bricks/brick3):451:crawlwrap] _GMaster: 20 crawls, 0 turns
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
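
To pull the checkpoint-related lines out of the session logs directly (a sketch; the log directory is named after the master volume and the exact file names depend on the slave URL, so the path below is an assumption):

# Run on each master node; geo-rep session logs live under /var/log/glusterfs/geo-replication/<mastervol>/
grep -i checkpoint /var/log/glusterfs/geo-replication/master/*.log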

getfattr output on brick1 and brick3:

# getfattr -d -m . -e hex /bricks/brick1/
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.99461ce2-e8f7-443f-8048-8938fcaf379b.0ae03a11-5791-4a2b-8b65-51d333ef6336.stime=0x52a71b9800000000
trusted.glusterfs.99461ce2-e8f7-443f-8048-8938fcaf379b.xtime=0x52a71b8e0005df50
trusted.glusterfs.dht=0x0000000100000000000000007ffffffe
trusted.glusterfs.volume-id=0x99461ce2e8f7443f80488938fcaf379b


# getfattr -d -m . -e hex /bricks/brick3/
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick3/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.99461ce2-e8f7-443f-8048-8938fcaf379b.0ae03a11-5791-4a2b-8b65-51d333ef6336.stime=0x52a71b8e00000000
trusted.glusterfs.99461ce2-e8f7-443f-8048-8938fcaf379b.xtime=0x52a71b8e0001e253
trusted.glusterfs.dht=0x00000001000000007fffffffffffffff
trusted.glusterfs.volume-id=0x99461ce2e8f7443f80488938fcaf379b
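
The stime/xtime values pack two big-endian 32-bit fields, with the high word being seconds since the epoch (treating the low word as the sub-second part is an assumption). A minimal decode of the brick3 stime above:

# High 32 bits of brick3's stime (0x52a71b8e) as a Unix timestamp
printf 'stime seconds: %d\n' 0x52a71b8e    # 1386683278
date -u -d @1386683278                     # Tue Dec 10 13:47:58 UTC 2013, i.e. 19:17:58 in the +05:30 zone the logs use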

Comment 2 Vijaykumar Koppad 2014-01-27 09:13:56 UTC
This has happened again in the build glusterfs-libs-3.4.0.57rhs-1.


# getfattr -d -m . -e hex /bricks/master_brick1/
getfattr: Removing leading '/' from absolute path names
# file: bricks/master_brick1/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.7fd3bc69-de10-4d29-9179-578b0c74e22a.ce29b5ec-8d1c-4060-a356-4420e21679a5.stime=0x52e6090800000000
trusted.glusterfs.7fd3bc69-de10-4d29-9179-578b0c74e22a.xtime=0x52e60907000eeeb1
trusted.glusterfs.dht=0x0000000100000000000000007ffffffe
trusted.glusterfs.volume-id=0x7fd3bc69de104d299179578b0c74e22a



# getfattr -d -m . -e hex /bricks/master_brick3/
getfattr: Removing leading '/' from absolute path names
# file: bricks/master_brick3/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.7fd3bc69-de10-4d29-9179-578b0c74e22a.ce29b5ec-8d1c-4060-a356-4420e21679a5.stime=0x52e6090800000000
trusted.glusterfs.7fd3bc69-de10-4d29-9179-578b0c74e22a.xtime=0x52e6090800022cf3
trusted.glusterfs.dht=0x00000001000000007fffffffffffffff
trusted.glusterfs.volume-id=0x7fd3bc69de104d299179578b0c74e22a


# getfattr -d -m . -e hex /mnt/master
getfattr: Removing leading '/' from absolute path names
# file: mnt/master
trusted.glusterfs.7fd3bc69-de10-4d29-9179-578b0c74e22a.ce29b5ec-8d1c-4060-a356-4420e21679a5.stime=0x52e6090800000000
trusted.glusterfs.7fd3bc69-de10-4d29-9179-578b0c74e22a.xtime=0x52e609080001c6da
trusted.glusterfs.volume-id=0x7fd3bc69de104d299179578b0c74e22a

Comment 4 Aravinda VK 2015-11-25 08:49:31 UTC
Closing this bug since the RHGS 2.1 release has reached EOL. Required bugs have been cloned to RHGS 3.1. Please re-open this issue if it is found again.
