Description of problem:
After upgrading geo-replication from the Update 1 build to the Update 2 build (from 44rhs to 59rhs), a few of the regular files are not synced to the slave. One of the nodes has around 100 changelog files in the .processing directory and one xsync changelog in the xsync directory of the working dir, and these are still not synced even after keeping the setup idle for more than 2 days.

Version-Release number of selected component (if applicable):
Upgrade from glusterfs-3.4.0.44rhs to glusterfs-3.4.0.59rhs-1.el6rhs.x86_64

How reproducible:
Hit twice in two tries

Steps to Reproduce:
1. Install the 44rhs build and then, following the doc, upgrade it to 55rhs.

Actual results:
After two days of being idle, this was the checksum taken on the master mount point and the slave mount point:

[root@gauss ~]# ./arequal-checksum -p /mnt/master/

Entry counts
Regular files   : 101283
Directories     : 8357
Symbolic links  : 6934
Other           : 0
Total           : 116574

Metadata checksums
Regular files   : 480610
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 29d41d0b2d6a2e873ad2ac93e66af98d
Directories     : 231e331709486e7b
Symbolic links  : 6b180c193e515362
Other           : 0
Total           : 5b008e96fc19ea13

[root@gauss ~]# ./arequal-checksum -p /mnt/slave/

Entry counts
Regular files   : 99251
Directories     : 8357
Symbolic links  : 6934
Other           : 0
Total           : 114542

Metadata checksums
Regular files   : 47ff5c
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 52ecb8387db98b68dbfd64c594c782eb
Directories     : 315c723e20240a77
Symbolic links  : 6b180c193e515362
Other           : 0
Total           : d355a2daf70b5096

The slave thus has 2032 fewer regular files than the master. There are some files left in working-dir/.processing:

[root@pythagoras ~]# ssh root@archimedes
Last login: Fri Feb 7 00:41:14 2014 from pythagoras
[root@archimedes ~]# ls /var/run/gluster/master/ssh%3A%2F%2Froot%4010.70.37.188%3Agluster%3A%2F%2F127.0.0.1%3Aslave/32bb6e3a46ef511ac32bdc895ff0debf/.processing/
CHANGELOG.1391765691  CHANGELOG.1391765991  CHANGELOG.1391766293  CHANGELOG.1391768264  CHANGELOG.1391768568  CHANGELOG.1391768871  CHANGELOG.1391769173  CHANGELOG.1391769477
CHANGELOG.1391765706  CHANGELOG.1391766006  CHANGELOG.1391766308  CHANGELOG.1391768279  CHANGELOG.1391768583  CHANGELOG.1391768886  CHANGELOG.1391769188  CHANGELOG.1391769492
CHANGELOG.1391765721  CHANGELOG.1391766021  CHANGELOG.1391766323  CHANGELOG.1391768295  CHANGELOG.1391768598  CHANGELOG.1391768901  CHANGELOG.1391769203  CHANGELOG.1391769507
CHANGELOG.1391765736  CHANGELOG.1391766036  CHANGELOG.1391766338  CHANGELOG.1391768310  CHANGELOG.1391768613  CHANGELOG.1391768917  CHANGELOG.1391769218  CHANGELOG.1391769524
CHANGELOG.1391765751  CHANGELOG.1391766052  CHANGELOG.1391766353  CHANGELOG.1391768325  CHANGELOG.1391768628  CHANGELOG.1391768932  CHANGELOG.1391769233  CHANGELOG.1391769539
CHANGELOG.1391765766  CHANGELOG.1391766067  CHANGELOG.1391766369  CHANGELOG.1391768340  CHANGELOG.1391768643  CHANGELOG.1391768947  CHANGELOG.1391769249  CHANGELOG.1391769554
CHANGELOG.1391765781  CHANGELOG.1391766082  CHANGELOG.1391766384  CHANGELOG.1391768355  CHANGELOG.1391768658  CHANGELOG.1391768962  CHANGELOG.1391769264  CHANGELOG.1391769569
CHANGELOG.1391765796  CHANGELOG.1391766097  CHANGELOG.1391766399  CHANGELOG.1391768370  CHANGELOG.1391768673  CHANGELOG.1391768977  CHANGELOG.1391769279  CHANGELOG.1391769585
CHANGELOG.1391765811  CHANGELOG.1391766112  CHANGELOG.1391766414  CHANGELOG.1391768385  CHANGELOG.1391768688  CHANGELOG.1391768992  CHANGELOG.1391769294  CHANGELOG.1391769600
CHANGELOG.1391765826  CHANGELOG.1391766127  CHANGELOG.1391766429  CHANGELOG.1391768400  CHANGELOG.1391768703  CHANGELOG.1391769007  CHANGELOG.1391769309  CHANGELOG.1391769615
CHANGELOG.1391765841  CHANGELOG.1391766142  CHANGELOG.1391766444  CHANGELOG.1391768415  CHANGELOG.1391768719  CHANGELOG.1391769022  CHANGELOG.1391769324  CHANGELOG.1391769630
CHANGELOG.1391765856  CHANGELOG.1391766157  CHANGELOG.1391766459  CHANGELOG.1391768430  CHANGELOG.1391768734  CHANGELOG.1391769037  CHANGELOG.1391769339  CHANGELOG.1391769645
CHANGELOG.1391765871  CHANGELOG.1391766172  CHANGELOG.1391768141  CHANGELOG.1391768445  CHANGELOG.1391768749  CHANGELOG.1391769052  CHANGELOG.1391769354  CHANGELOG.1391769660
CHANGELOG.1391765886  CHANGELOG.1391766187  CHANGELOG.1391768156  CHANGELOG.1391768460  CHANGELOG.1391768764  CHANGELOG.1391769068  CHANGELOG.1391769372  CHANGELOG.1391769676
CHANGELOG.1391765901  CHANGELOG.1391766202  CHANGELOG.1391768171  CHANGELOG.1391768477  CHANGELOG.1391768779  CHANGELOG.1391769083  CHANGELOG.1391769387  CHANGELOG.1391769691
CHANGELOG.1391765916  CHANGELOG.1391766218  CHANGELOG.1391768186  CHANGELOG.1391768492  CHANGELOG.1391768795  CHANGELOG.1391769098  CHANGELOG.1391769402  CHANGELOG.1391769706
CHANGELOG.1391765931  CHANGELOG.1391766233  CHANGELOG.1391768201  CHANGELOG.1391768507  CHANGELOG.1391768810  CHANGELOG.1391769113  CHANGELOG.1391769417  CHANGELOG.1391769721
CHANGELOG.1391765946  CHANGELOG.1391766248  CHANGELOG.1391768216  CHANGELOG.1391768522  CHANGELOG.1391768826  CHANGELOG.1391769128  CHANGELOG.1391769432
CHANGELOG.1391765961  CHANGELOG.1391766263  CHANGELOG.1391768232  CHANGELOG.1391768538  CHANGELOG.1391768841  CHANGELOG.1391769143  CHANGELOG.1391769447
CHANGELOG.1391765976  CHANGELOG.1391766278  CHANGELOG.1391768247  CHANGELOG.1391768553  CHANGELOG.1391768856  CHANGELOG.1391769158  CHANGELOG.1391769462

[root@archimedes ~]# ls /var/run/gluster/master/ssh%3A%2F%2Froot%4010.70.37.188%3Agluster%3A%2F%2F127.0.0.1%3Aslave/32bb6e3a46ef511ac32bdc895ff0debf/xsync/
XSYNC-CHANGELOG.1391763066

The same was the case for the other node in the volume.

Expected results:
All the files should be synced to the slave.

Additional info:
I have archived all the logs and am keeping the environment as it is. Will update the bug with more information.
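Following up on the entry-count mismatch above, one quick way to list exactly which regular files are missing on the slave, assuming both volumes are mounted on the same client at /mnt/master and /mnt/slave as in the checksum output (the /tmp file names are just placeholders):

( cd /mnt/master && find . -type f | sort ) > /tmp/master.files
( cd /mnt/slave  && find . -type f | sort ) > /tmp/slave.files
# lines starting with '<' are paths present only on the master
diff /tmp/master.files /tmp/slave.files | grep '^<'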
For some reason there are two worker (and two monitor) processes running on the host 'pythagoras'. This is most probably the reason for the missing files: the two processes may use the same Xsync changelog filename, thereby truncating the file (and losing any changes) when the "losing" process initializes an Xsync changelog file (see the sketch after the process listings below).

Two worker processes
--------------------
[root@pythagoras 59ddf777397e52a13ba1333653d63854]# ps auxww |grep feedback
root 10311 0.0 0.0 103244 808 pts/14 S+ 06:09 0:00 grep feedback
root 21379 0.2 0.7 1121832 14348 ? Sl Feb10 4:49 python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py --path=/rhs/bricks/brick0 -c /var/lib/glusterd/geo-replication/master_euclid_slave/gsyncd.conf :master --glusterd-uuid=0d998b9d-0ad0-4f84-9b8f-02767aa6bd7a ssh://euclid::slave -N -p --slave-id a47ff8cc-beef-48ac-954b-c292cb044085 --feedback-fd 6 --local-path /rhs/bricks/brick0 --local-id .%2Frhs%2Fbricks%2Fbrick0 --resource-remote ssh://root@euclid:gluster://localhost:slave
root 21570 0.2 0.6 1120836 13180 ? Sl Feb10 4:46 python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py --path=/rhs/bricks/brick0 -c /var/lib/glusterd/geo-replication/master_euclid_slave/gsyncd.conf :master --glusterd-uuid=0d998b9d-0ad0-4f84-9b8f-02767aa6bd7a euclid::slave -N -p --slave-id a47ff8cc-beef-48ac-954b-c292cb044085 --feedback-fd 8 --local-path /rhs/bricks/brick0 --local-id .%2Frhs%2Fbricks%2Fbrick0 --resource-remote ssh://root@euclid:gluster://localhost:slave

Two monitor processes
---------------------
[root@pythagoras 59ddf777397e52a13ba1333653d63854]# ps auxww |grep monitor
root 2159 0.0 0.1 360460 3620 ? Ssl Feb07 0:36 /usr/bin/python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py --path=/rhs/bricks/brick0 --monitor -c /var/lib/glusterd/geo-replication/master_euclid_slave/gsyncd.conf :master --glusterd-uuid=0d998b9d-0ad0-4f84-9b8f-02767aa6bd7a ssh://euclid::slave
root 4631 0.0 0.5 360468 10908 ? Ssl Feb07 0:34 /usr/bin/python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py --path=/rhs/bricks/brick0 --monitor -c /var/lib/glusterd/geo-replication/master_euclid_slave/gsyncd.conf :master --glusterd-uuid=0d998b9d-0ad0-4f84-9b8f-02767aa6bd7a euclid::slave
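A minimal shell illustration (not gsyncd's actual code, and the path and entry text below are made up) of why two workers sharing one Xsync changelog filename is destructive: whichever process initializes the file last truncates it, so anything the other process had recorded but not yet processed is silently dropped.

# worker A initializes the changelog and records an entry
echo "entry recorded by worker A" > /tmp/XSYNC-CHANGELOG.example

# worker B initializes the *same* filename; the '>' redirection truncates the
# file, so worker A's entry is lost before it could ever be synced
echo "entry recorded by worker B" > /tmp/XSYNC-CHANGELOG.example

cat /tmp/XSYNC-CHANGELOG.example    # only worker B's entry remains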
Steps:
1. After the upgrade, delete the config file and kill all gluster and gsyncd processes.
2. Restart glusterd.

Behaviour:
1. On every node a half-baked gsyncd.conf file is created.
2. glusterd starts gsyncd on every node, even though the state-file entry is missing from the half-baked config file.
3. On every node the spawned gsyncd dies with a log message ("Glusterfs session went down"), except on the node that has a passwordless ssh connection to the slave. On that node, the spawned gsyncd process stays active.
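For reference, a rough transcript of the two steps above, assuming the session config path that appears in the ps output earlier in this bug; the pkill patterns are only one way of killing "all gluster and gsyncd processes":

# 1. delete the session config file and kill all gluster and gsyncd processes
rm -f /var/lib/glusterd/geo-replication/master_euclid_slave/gsyncd.conf
pkill -f gsyncd.py
pkill glusterfsd
pkill glusterfs
pkill glusterd

# 2. restart glusterd (it respawns gsyncd on every node)
service glusterd start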
Avra, can we avoid creating the "half baked" config file altogether? (We cannot control the spawning of gsyncd when glusterd starts.) That way, gsyncd will spawn and terminate. In any case, a "create force" is needed (it is the next step of the upgrade), so that should not be a problem, and it additionally cuts a step from the upgrade doc.
Half-baked config file creation is prevented by the patch http://review.gluster.org/#/c/6856/ and the upgrade steps have also changed:
Stop Geo-replication -> Upgrade All Master and Slave Nodes -> Start Geo-replication (an illustrative CLI transcript is at the end of this comment).
If the config file is not corrupted, two monitor processes will not start.

Status command:
Before: Uses the template conf if the session conf is not present. Status shows fine.
Now: Status shows config corrupted if the session conf is not present.

Start command:
Before: Starts geo-rep successfully even if gsyncd.conf does not exist, but creates a half-baked gsyncd.conf.
Now: Start and Start force fail if gsyncd.conf does not exist.

Stop command:
Before: Succeeds if gsyncd.conf does not exist; fails with a verification error if a half-baked gsyncd.conf exists.
Now: Fails if gsyncd.conf does not exist or if a half-baked gsyncd.conf exists. Start force will succeed.

Half-baked config prevention is verified in BZ 1162142 as part of RHS 2.1.6. Closing this bug as a duplicate of 1162142. Please reopen if the issue still exists.

*** This bug has been marked as a duplicate of bug 1162142 ***
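For illustration, the changed upgrade flow expressed as gluster CLI steps, using the master volume and slave (euclid::slave) from this report; the actual package upgrade command depends on the repos/ISO in use:

# 1. stop geo-replication before upgrading
gluster volume geo-replication master euclid::slave stop

# 2. upgrade all master and slave nodes (example package update)
yum update 'glusterfs*'

# 3. start geo-replication again once all nodes are upgraded
gluster volume geo-replication master euclid::slave start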