Bug 1179701 - dist-geo-rep: Geo-rep skipped some files after replacing a node with the same hostname and IP
Summary: dist-geo-rep: Geo-rep skipped some files after replacing a node with the same hostname and IP
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: geo-replication
Version: rhgs-3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: ---
Target Release: RHGS 3.1.0
Assignee: Aravinda VK
QA Contact: Rahul Hinduja
URL:
Whiteboard: consistency
Depends On:
Blocks: 1179709 1202842 1223636
 
Reported: 2015-01-07 11:27 UTC by shilpa
Modified: 2015-07-29 04:37 UTC
CC List: 11 users

Fixed In Version: glusterfs-3.7.0-2.el6rhs
Doc Type: Bug Fix
Doc Text:
Previously, when a new node was added to a Red Hat Gluster Storage cluster, historical changelogs were not available on that node, so geo-replication fell back to the hybrid crawl. Due to an issue in the xtime comparison, the hybrid crawl missed a few files and did not sync them. With this fix, the xtime comparison logic used by the hybrid crawl is corrected and no files are missed when syncing to the slave.
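For context, the xtime compared by the hybrid crawl is maintained by the marker translator as an extended attribute on the brick backend. A minimal way to inspect it on a master brick, assuming a hypothetical brick path and a placeholder volume UUID (the real xattr name embeds the master volume's UUID), is:

# Read the marker xtime of a directory on a master brick; the brick path
# and <MASTER-VOL-UUID> are placeholders for illustration only.
getfattr -e hex -n trusted.glusterfs.<MASTER-VOL-UUID>.xtime /bricks/master_brick2/dir1

# The value is an 8-byte timestamp (seconds plus a sub-second part).
# The hybrid crawl descends into a directory and syncs its contents only
# when the master xtime is newer than the slave's; the skipped files in
# this bug came from that comparison going wrong.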
Clone Of:
Environment:
Last Closed: 2015-07-29 04:37:46 UTC
Embargoed:


Attachments


Links
System: Red Hat Product Errata | ID: RHSA-2015:1495 | Private: 0 | Priority: normal | Status: SHIPPED_LIVE | Summary: Important: Red Hat Gluster Storage 3.1 update | Last Updated: 2015-07-29 08:26:26 UTC

Description shilpa 2015-01-07 11:27:21 UTC
Description of problem:
Stopped geo-rep, reinstalled a node, and replaced it by following the Replacing_a_Host_Machine_with_the_Same_IP_Address procedure provided in the doc link below:
http://documentation-devel.engineering.redhat.com/site/documentation/en-US/Red_Hat_Storage/3/html-single/Administration_Guide/index.html#Replacing_a_Host_Machine_with_the_Same_IP_Address. 
After adding the node and starting the geo-rep, some files were not synced.


Version-Release number of selected component (if applicable):
glusterfs-3.6.0.41-1.el6rhs.x86_64

How reproducible:
Tried once

Steps to Reproduce:
1. Create 2x2 distribute-replicate master and slave volumes, with four nodes in each cluster, and start geo-replication.
2. Ensure that all the data, if any, is replicated.
3. Stop geo-replication.
4. Replace an active node of the master volume with a new node that has the same hostname, IP, and configuration, following the Replacing_a_Host_Machine_with_the_Same_IP_Address procedure in the RHS doc.
5. Run I/O on the master volume while the node being replaced is down.
6. After adding the new node (same hostname and IP) back to the master volume, wait for self-heal to copy the data to it (see the heal commands after this list).
7. Start geo-rep with the following process:
a. gluster system:: execute gsec_create
b. gluster volume geo-replication MASTER_VOL SLAVE_HOST::SLAVE_VOL create push-pem force
c. gluster volume geo-replication MASTER_VOL SLAVE_HOST::SLAVE_VOL start force
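To confirm step 6 before re-creating the session, pending heals can be checked from any master node; a minimal sketch with a hypothetical master volume name (mastervol):

# Trigger a full self-heal on the master volume, then watch the pending
# entries drain to zero (volume name is illustrative).
gluster volume heal mastervol full
gluster volume heal mastervol info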


Actual results:
There was a mismatch between the number of files on the master and the slave; not all files were synced.

Expected results:
After geo-rep is started with force, all the files written while the active node was down should start syncing, and no files should be missed once self-heal is complete.

Additional info:

Files missing:

On master mountpoint:

# find /mnt/master | wc -l
31011

On slave mountpoint:

# find /mnt/slave | wc -l
24975


# arequal-checksum -p /mnt/master

Entry counts
Regular files   : 30000
Directories     : 1011
Symbolic links  : 0
Other           : 0
Total           : 31011

Metadata checksums
Regular files   : 3e9
Directories     : 24e15a
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : ad313c59da41368f7bc1f35c1237a4e1
Directories     : 70560818100c3b63
Symbolic links  : 0
Other           : 0
Total           : a6a6c71dd87aa90d

[root@ccr changelogs]# arequal-checksum -p /mnt/slave

Entry counts
Regular files   : 23964
Directories     : 1011
Symbolic links  : 0
Other           : 0
Total           : 24975

Metadata checksums
Regular files   : 3e9
Directories     : 24e15a
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : c40857f4e79796a793377f8ddbbbeb73
Directories     : 5178010c7b174506
Symbolic links  : 0
Other           : 0
Total           : 6472975473b38d2


Self-heal was complete:

# find /bricks/master_brick2 -not -path '*/\.*' -type f | wc -l
15028

# find /bricks/master_brick3/ -not -path '*/\.*' -type f | wc -l
15028

In the geo-rep logs, I see 326 files in the SKIPPED section, but when compared there are more than 326 files missing.

[2015-01-07 13:02:37.657751] W [master(/bricks/master_brick2):996:process] _GMaster: changelogs CHANGELOG.1420615598 CHANGELOG.1420615613 CHANGELOG.1420615628 could not be processed - moving on...
[2015-01-07 13:02:37.661405] W [master(/bricks/master_brick2):1000:process] _GMaster: SKIPPED GFID = 4242dc96-1b85-48fa-b30a-394ccc5242cd,309c25fe-c8fe-4148-9cc0-25af2888853d,606e3871-318f-4c77-a9cf-f2f7efffc3e2,976cd720-50a3-4170-bf75-278472f44533,a5428043-8c45-4c70-baf5-df95ea962fe2,78b7871d-4ca8-4313-8ce9-34501010cd26,82536f00-21a6-484b-a14d-3f727062657c,09f03143-3b6b-42e8-bbd4-3350359984f0.....
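A rough way to count the skipped GFIDs straight from a geo-rep master log (the path below is illustrative; the actual log lives under /var/log/glusterfs/geo-replication/ and its name varies per session and brick):

# Split the comma-separated GFID lists in the SKIPPED lines and count them
# (log path is illustrative).
grep 'SKIPPED GFID' /var/log/glusterfs/geo-replication/<session>/<brick>.log \
    | sed 's/.*SKIPPED GFID = //' \
    | tr ',' '\n' \
    | grep -c .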

Comment 2 Pavithra 2015-01-12 09:31:13 UTC
Hi Aravinda, 

Can you please review the edited doc text and sign off?

Comment 3 Aravinda VK 2015-01-12 09:34:23 UTC
(In reply to Pavithra from comment #2)
> Hi Aravinda, 
> 
> Can you please review the edited doc text and sign off?

doc text looks good to me.

Comment 4 Pavithra 2015-01-12 11:53:39 UTC
Made a minor edit.

Comment 9 Rahul Hinduja 2015-07-20 12:18:27 UTC
Verified with build: glusterfs-3.7.1-10.el6rhs.x86_64

After the node was re-installed and got the same IP, I followed the steps mentioned in

http://documentation-devel.engineering.redhat.com/site/documentation/en-US/Red_Hat_Storage/3/html-single/Administration_Guide/index.html#Replacing_a_Host_Machine_with_the_Same_Hostname

Once geo-rep was started, it performed the hybrid crawl and synced the data to the slave.

Master:
=======
[root@wingo ~]# find /mnt/6m | wc -l 
7565
[root@wingo ~]# 

[root@wingo scripts]# arequal-checksum -p /mnt/6m

Entry counts
Regular files   : 4723
Directories     : 871
Symbolic links  : 1971
Other           : 0
Total           : 7565

Metadata checksums
Regular files   : 47a9e5
Directories     : 24d481
Symbolic links  : 5a815a
Other           : 3e9

Checksums
Regular files   : 2bc0ad4daf6f43f398647ccb254094b5
Directories     : 5f77734b7d784455
Symbolic links  : 7a33023b4e214744
Other           : 0
Total           : 96e0a0f6b976d457
[root@wingo scripts]# 


Slave:
======

[root@wingo ~]# find /mnt/6s | wc -l 
7565
[root@wingo ~]# 

[root@wingo scripts]# arequal-checksum -p /mnt/6s

Entry counts
Regular files   : 4723
Directories     : 871
Symbolic links  : 1971
Other           : 0
Total           : 7565

Metadata checksums
Regular files   : 47a9e5
Directories     : 24d481
Symbolic links  : 5a815a
Other           : 3e9

Checksums
Regular files   : 2bc0ad4daf6f43f398647ccb254094b5
Directories     : 5f77734b7d784455
Symbolic links  : 7a33023b4e214744
Other           : 0
Total           : 96e0a0f6b976d457
[root@wingo scripts]# 

The node that was re-installed was ACTIVE before re-installation, and after re-installation it becomes ACTIVE again (without use-meta-volume).
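The ACTIVE/PASSIVE state above comes from the geo-rep status output; for example (master volume, slave host, and slave volume names are illustrative):

# Show the worker state (ACTIVE/PASSIVE) of each master brick
# (names are illustrative).
gluster volume geo-replication mastervol slavehost::slavevol status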

Data was synced to the replica brick using "heal full" before the geo-rep session was created.

Comment 10 Bhavana 2015-07-27 10:15:36 UTC
Hi Aravinda,

Please review the doc-text and sign-off if this looks ok.

Comment 12 Bhavana 2015-07-28 05:22:51 UTC
Changing the doc text flag to +

Comment 14 errata-xmlrpc 2015-07-29 04:37:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html

