Description of problem:
After upgrading geo-replication from the Update 1 build to the Update 2 build (from 44rhs to 59rhs), a few of the regular files are not synced to the slave. One of the nodes has around 100 changelog files in the .processing directory and one xsync changelog in the xsync directory of the working dir, and these are still not synced even after keeping the setup idle for more than 2 days.

Version-Release number of selected component (if applicable):
Upgrade from glusterfs-3.4.0.44rhs to glusterfs-3.4.0.59rhs-1.el6rhs.x86_64

How reproducible:
Hit twice in two tries

Steps to Reproduce:
1. Install the 44rhs build and then, following the doc, upgrade it to 55rhs.

Actual results:
After two days of being idle, this was the checksum taken on the master mount point and the slave mount point:

[root@gauss ~]# ./arequal-checksum -p /mnt/master/

Entry counts
Regular files   : 101283
Directories     : 8357
Symbolic links  : 6934
Other           : 0
Total           : 116574

Metadata checksums
Regular files   : 480610
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 29d41d0b2d6a2e873ad2ac93e66af98d
Directories     : 231e331709486e7b
Symbolic links  : 6b180c193e515362
Other           : 0
Total           : 5b008e96fc19ea13

[root@gauss ~]# ./arequal-checksum -p /mnt/slave/

Entry counts
Regular files   : 99251
Directories     : 8357
Symbolic links  : 6934
Other           : 0
Total           : 114542

Metadata checksums
Regular files   : 47ff5c
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 52ecb8387db98b68dbfd64c594c782eb
Directories     : 315c723e20240a77
Symbolic links  : 6b180c193e515362
Other           : 0
Total           : d355a2daf70b5096

The slave thus has 2032 fewer regular files than the master. There are some files left in working-dir/.processing:

[root@pythagoras ~]# ssh root@archimedes
Last login: Fri Feb 7 00:41:14 2014 from pythagoras
[root@archimedes ~]# ls /var/run/gluster/master/ssh%3A%2F%2Froot%4010.70.37.188%3Agluster%3A%2F%2F127.0.0.1%3Aslave/32bb6e3a46ef511ac32bdc895ff0debf/.processing/
CHANGELOG.1391765691  CHANGELOG.1391765991  CHANGELOG.1391766293  CHANGELOG.1391768264  CHANGELOG.1391768568  CHANGELOG.1391768871  CHANGELOG.1391769173  CHANGELOG.1391769477
CHANGELOG.1391765706  CHANGELOG.1391766006  CHANGELOG.1391766308  CHANGELOG.1391768279  CHANGELOG.1391768583  CHANGELOG.1391768886  CHANGELOG.1391769188  CHANGELOG.1391769492
CHANGELOG.1391765721  CHANGELOG.1391766021  CHANGELOG.1391766323  CHANGELOG.1391768295  CHANGELOG.1391768598  CHANGELOG.1391768901  CHANGELOG.1391769203  CHANGELOG.1391769507
CHANGELOG.1391765736  CHANGELOG.1391766036  CHANGELOG.1391766338  CHANGELOG.1391768310  CHANGELOG.1391768613  CHANGELOG.1391768917  CHANGELOG.1391769218  CHANGELOG.1391769524
CHANGELOG.1391765751  CHANGELOG.1391766052  CHANGELOG.1391766353  CHANGELOG.1391768325  CHANGELOG.1391768628  CHANGELOG.1391768932  CHANGELOG.1391769233  CHANGELOG.1391769539
CHANGELOG.1391765766  CHANGELOG.1391766067  CHANGELOG.1391766369  CHANGELOG.1391768340  CHANGELOG.1391768643  CHANGELOG.1391768947  CHANGELOG.1391769249  CHANGELOG.1391769554
CHANGELOG.1391765781  CHANGELOG.1391766082  CHANGELOG.1391766384  CHANGELOG.1391768355  CHANGELOG.1391768658  CHANGELOG.1391768962  CHANGELOG.1391769264  CHANGELOG.1391769569
CHANGELOG.1391765796  CHANGELOG.1391766097  CHANGELOG.1391766399  CHANGELOG.1391768370  CHANGELOG.1391768673  CHANGELOG.1391768977  CHANGELOG.1391769279  CHANGELOG.1391769585
CHANGELOG.1391765811  CHANGELOG.1391766112  CHANGELOG.1391766414  CHANGELOG.1391768385  CHANGELOG.1391768688  CHANGELOG.1391768992  CHANGELOG.1391769294  CHANGELOG.1391769600
CHANGELOG.1391765826  CHANGELOG.1391766127  CHANGELOG.1391766429  CHANGELOG.1391768400  CHANGELOG.1391768703  CHANGELOG.1391769007  CHANGELOG.1391769309  CHANGELOG.1391769615
CHANGELOG.1391765841  CHANGELOG.1391766142  CHANGELOG.1391766444  CHANGELOG.1391768415  CHANGELOG.1391768719  CHANGELOG.1391769022  CHANGELOG.1391769324  CHANGELOG.1391769630
CHANGELOG.1391765856  CHANGELOG.1391766157  CHANGELOG.1391766459  CHANGELOG.1391768430  CHANGELOG.1391768734  CHANGELOG.1391769037  CHANGELOG.1391769339  CHANGELOG.1391769645
CHANGELOG.1391765871  CHANGELOG.1391766172  CHANGELOG.1391768141  CHANGELOG.1391768445  CHANGELOG.1391768749  CHANGELOG.1391769052  CHANGELOG.1391769354  CHANGELOG.1391769660
CHANGELOG.1391765886  CHANGELOG.1391766187  CHANGELOG.1391768156  CHANGELOG.1391768460  CHANGELOG.1391768764  CHANGELOG.1391769068  CHANGELOG.1391769372  CHANGELOG.1391769676
CHANGELOG.1391765901  CHANGELOG.1391766202  CHANGELOG.1391768171  CHANGELOG.1391768477  CHANGELOG.1391768779  CHANGELOG.1391769083  CHANGELOG.1391769387  CHANGELOG.1391769691
CHANGELOG.1391765916  CHANGELOG.1391766218  CHANGELOG.1391768186  CHANGELOG.1391768492  CHANGELOG.1391768795  CHANGELOG.1391769098  CHANGELOG.1391769402  CHANGELOG.1391769706
CHANGELOG.1391765931  CHANGELOG.1391766233  CHANGELOG.1391768201  CHANGELOG.1391768507  CHANGELOG.1391768810  CHANGELOG.1391769113  CHANGELOG.1391769417  CHANGELOG.1391769721
CHANGELOG.1391765946  CHANGELOG.1391766248  CHANGELOG.1391768216  CHANGELOG.1391768522  CHANGELOG.1391768826  CHANGELOG.1391769128  CHANGELOG.1391769432
CHANGELOG.1391765961  CHANGELOG.1391766263  CHANGELOG.1391768232  CHANGELOG.1391768538  CHANGELOG.1391768841  CHANGELOG.1391769143  CHANGELOG.1391769447
CHANGELOG.1391765976  CHANGELOG.1391766278  CHANGELOG.1391768247  CHANGELOG.1391768553  CHANGELOG.1391768856  CHANGELOG.1391769158  CHANGELOG.1391769462

[root@archimedes ~]# ls /var/run/gluster/master/ssh%3A%2F%2Froot%4010.70.37.188%3Agluster%3A%2F%2F127.0.0.1%3Aslave/32bb6e3a46ef511ac32bdc895ff0debf/xsync/
XSYNC-CHANGELOG.1391763066

The same was the case for the other node in the volume.

Expected results:
All the files should be synced to the slave.

Additional info:
I have archived all the logs and am keeping the environment as it is. Will update the bug with more information.
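Following up on the entry-count mismatch above, one quick way to list exactly which regular files are missing on the slave, assuming both volumes are mounted on the same client at /mnt/master and /mnt/slave as in the checksum output (the /tmp file names are just placeholders):

( cd /mnt/master && find . -type f | sort ) > /tmp/master.files
( cd /mnt/slave  && find . -type f | sort ) > /tmp/slave.files
# lines starting with '<' are paths present only on the master
diff /tmp/master.files /tmp/slave.files | grep '^<'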
For some reason there are two worker (and two monitor) processes running on the host 'pythagoras'. This is most probably the reason for the missing files: the two processes may use the same Xsync changelog filename, thereby truncating the file (and losing any changes) when the "losing" process initializes an Xsync changelog file (see the sketch after the process listings below).

Two worker processes
--------------------
[root@pythagoras 59ddf777397e52a13ba1333653d63854]# ps auxww |grep feedback
root 10311 0.0 0.0 103244 808 pts/14 S+ 06:09 0:00 grep feedback
root 21379 0.2 0.7 1121832 14348 ? Sl Feb10 4:49 python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py --path=/rhs/bricks/brick0 -c /var/lib/glusterd/geo-replication/master_euclid_slave/gsyncd.conf :master --glusterd-uuid=0d998b9d-0ad0-4f84-9b8f-02767aa6bd7a ssh://euclid::slave -N -p --slave-id a47ff8cc-beef-48ac-954b-c292cb044085 --feedback-fd 6 --local-path /rhs/bricks/brick0 --local-id .%2Frhs%2Fbricks%2Fbrick0 --resource-remote ssh://root@euclid:gluster://localhost:slave
root 21570 0.2 0.6 1120836 13180 ? Sl Feb10 4:46 python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py --path=/rhs/bricks/brick0 -c /var/lib/glusterd/geo-replication/master_euclid_slave/gsyncd.conf :master --glusterd-uuid=0d998b9d-0ad0-4f84-9b8f-02767aa6bd7a euclid::slave -N -p --slave-id a47ff8cc-beef-48ac-954b-c292cb044085 --feedback-fd 8 --local-path /rhs/bricks/brick0 --local-id .%2Frhs%2Fbricks%2Fbrick0 --resource-remote ssh://root@euclid:gluster://localhost:slave

Two monitor processes
---------------------
[root@pythagoras 59ddf777397e52a13ba1333653d63854]# ps auxww |grep monitor
root 2159 0.0 0.1 360460 3620 ? Ssl Feb07 0:36 /usr/bin/python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py --path=/rhs/bricks/brick0 --monitor -c /var/lib/glusterd/geo-replication/master_euclid_slave/gsyncd.conf :master --glusterd-uuid=0d998b9d-0ad0-4f84-9b8f-02767aa6bd7a ssh://euclid::slave
root 4631 0.0 0.5 360468 10908 ? Ssl Feb07 0:34 /usr/bin/python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py --path=/rhs/bricks/brick0 --monitor -c /var/lib/glusterd/geo-replication/master_euclid_slave/gsyncd.conf :master --glusterd-uuid=0d998b9d-0ad0-4f84-9b8f-02767aa6bd7a euclid::slave
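A minimal shell illustration (not gsyncd's actual code, and the path and entry text below are made up) of why two workers sharing one Xsync changelog filename is destructive: whichever process initializes the file last truncates it, so anything the other process had recorded but not yet processed is silently dropped.

# worker A initializes the changelog and records an entry
echo "entry recorded by worker A" > /tmp/XSYNC-CHANGELOG.example

# worker B initializes the *same* filename; the '>' redirection truncates the
# file, so worker A's entry is lost before it could ever be synced
echo "entry recorded by worker B" > /tmp/XSYNC-CHANGELOG.example

cat /tmp/XSYNC-CHANGELOG.example    # only worker B's entry remains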
Steps:
1. After the upgrade, delete the config file and kill all gluster and gsyncd processes.
2. Restart glusterd.

Behaviour:
1. On every node a half-baked gsyncd.conf file is created.
2. glusterd starts gsyncd on every node, even though the state-file entry is missing from the half-baked config file.
3. On every node the spawned gsyncd dies with a log message ("Glusterfs session went down"), except on the node that has a passwordless ssh connection to the slave. On that node, the spawned gsyncd process stays active.
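For reference, a rough transcript of the two steps above, assuming the session config path that appears in the ps output earlier in this bug; the pkill patterns are only one way of killing "all gluster and gsyncd processes":

# 1. delete the session config file and kill all gluster and gsyncd processes
rm -f /var/lib/glusterd/geo-replication/master_euclid_slave/gsyncd.conf
pkill -f gsyncd.py
pkill glusterfsd
pkill glusterfs
pkill glusterd

# 2. restart glusterd (it respawns gsyncd on every node)
service glusterd start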
Avra, can we avoid creating the "half baked" config file altogether? (We cannot control the spawning of gsyncd when glusterd starts.) That way, gsyncd will spawn and terminate. In any case, a "create force" is needed (it is the next step of the upgrade), so that should not be a problem, and it additionally cuts a step from the upgrade doc.
Half-baked config file creation is prevented by the patch http://review.gluster.org/#/c/6856/ and the upgrade steps have also changed:
Stop Geo-replication -> Upgrade All Master and Slave Nodes -> Start Geo-replication (an illustrative CLI transcript is at the end of this comment).
If the config file is not corrupted, two monitor processes will not start.

Status command:
Before: Uses the template conf if the session conf is not present. Status shows fine.
Now: Status shows config corrupted if the session conf is not present.

Start command:
Before: Starts geo-rep successfully even if gsyncd.conf does not exist, but creates a half-baked gsyncd.conf.
Now: Start and Start force fail if gsyncd.conf does not exist.

Stop command:
Before: Succeeds if gsyncd.conf does not exist; fails with a verification error if a half-baked gsyncd.conf exists.
Now: Fails if gsyncd.conf does not exist or if a half-baked gsyncd.conf exists. Start force will succeed.

Half-baked config prevention is verified in BZ 1162142 as part of RHS 2.1.6. Closing this bug as a duplicate of 1162142. Please reopen if the issue still exists.

*** This bug has been marked as a duplicate of bug 1162142 ***
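For illustration, the changed upgrade flow expressed as gluster CLI steps, using the master volume and slave (euclid::slave) from this report; the actual package upgrade command depends on the repos/ISO in use:

# 1. stop geo-replication before upgrading
gluster volume geo-replication master euclid::slave stop

# 2. upgrade all master and slave nodes (example package update)
yum update 'glusterfs*'

# 3. start geo-replication again once all nodes are upgraded
gluster volume geo-replication master euclid::slave start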