Bug 1063229
| Summary: | dist-geo-rep: Few regular files are not synced to slave when node is taken down updated and then brought back online | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | M S Vishwanath Bhat <vbhat> |
| Component: | geo-replication | Assignee: | Aravinda VK <avishwan> |
| Status: | CLOSED DUPLICATE | QA Contact: | storage-qa-internal <storage-qa-internal> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 2.1 | CC: | aavati, asengupt, csaba, mzywusko, nlevinki, sharne, vshankar |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Unspecified | | |
| Whiteboard: | consistency | | |
| Fixed In Version: | | Doc Type: | Known Issue |
| Doc Text: | After upgrade, two geo-rep monitor processes were running for the same session. Both processes were trying to use the same xsync changelog file to record the changes. Workaround: Before running the 'geo-rep create force' command, kill the geo-rep monitor process (see the sketch after this table). | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-03-27 17:07:42 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1035040 | | |
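As an illustration of the workaround in the Doc Text above (not part of the original report), a minimal sketch of killing a stray geo-rep monitor process before re-creating the session could look like the following. The volume name `master`, the slave `euclid::slave`, and the `push-pem` option are assumptions based on the session shown in the description below; adjust them to your own setup.

```sh
# List gsyncd monitor processes on this node
# (the [g] pattern keeps grep from matching itself).
ps auxww | grep '[g]syncd.py.*--monitor'

# Kill them; alternatively kill the PIDs listed above one by one.
pkill -f 'gsyncd.py.*--monitor'

# Re-create the geo-rep session with force.
gluster volume geo-replication master euclid::slave create push-pem force
```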
Description (M S Vishwanath Bhat, 2014-02-10 10:24:47 UTC)
For some reason there are two worker (and monitor) processes running on the host 'pythagoras'. This is most probably the reason for the missing files: the two processes may use the same Xsync changelog filename, thereby truncating the file (and losing any changes) when the "losing" process initializes an Xsync changelog file.

Two worker processes
--------------------

```
[root@pythagoras 59ddf777397e52a13ba1333653d63854]# ps auxww |grep feedback
root 10311 0.0 0.0 103244 808 pts/14 S+ 06:09 0:00 grep feedback
root 21379 0.2 0.7 1121832 14348 ? Sl Feb10 4:49 python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py --path=/rhs/bricks/brick0 -c /var/lib/glusterd/geo-replication/master_euclid_slave/gsyncd.conf :master --glusterd-uuid=0d998b9d-0ad0-4f84-9b8f-02767aa6bd7a ssh://euclid::slave -N -p --slave-id a47ff8cc-beef-48ac-954b-c292cb044085 --feedback-fd 6 --local-path /rhs/bricks/brick0 --local-id .%2Frhs%2Fbricks%2Fbrick0 --resource-remote ssh://root@euclid:gluster://localhost:slave
root 21570 0.2 0.6 1120836 13180 ? Sl Feb10 4:46 python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py --path=/rhs/bricks/brick0 -c /var/lib/glusterd/geo-replication/master_euclid_slave/gsyncd.conf :master --glusterd-uuid=0d998b9d-0ad0-4f84-9b8f-02767aa6bd7a euclid::slave -N -p --slave-id a47ff8cc-beef-48ac-954b-c292cb044085 --feedback-fd 8 --local-path /rhs/bricks/brick0 --local-id .%2Frhs%2Fbricks%2Fbrick0 --resource-remote ssh://root@euclid:gluster://localhost:slave
```

Two monitor processes
---------------------

```
[root@pythagoras 59ddf777397e52a13ba1333653d63854]# ps auxww |grep monitor
root 2159 0.0 0.1 360460 3620 ? Ssl Feb07 0:36 /usr/bin/python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py --path=/rhs/bricks/brick0 --monitor -c /var/lib/glusterd/geo-replication/master_euclid_slave/gsyncd.conf :master --glusterd-uuid=0d998b9d-0ad0-4f84-9b8f-02767aa6bd7a ssh://euclid::slave
root 4631 0.0 0.5 360468 10908 ? Ssl Feb07 0:34 /usr/bin/python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py --path=/rhs/bricks/brick0 --monitor -c /var/lib/glusterd/geo-replication/master_euclid_slave/gsyncd.conf :master --glusterd-uuid=0d998b9d-0ad0-4f84-9b8f-02767aa6bd7a euclid::slave
```

Steps:
1. After upgrade, delete the config file and kill all gluster and gsync processes.
2. Restart glusterd.

Behaviour:
1. On every node a half-baked gsyncd.conf file is created.
2. glusterd starts gsyncd on every node, even though the state-file entry is missing from the half-baked config file.
3. On every node the spawned gsyncd dies with a log message (Glusterfs session went down), except on the node that has a passwordless ssh connection to the slave. On that node, the spawned gsyncd process stays active.

Avra, can we avoid creating the "half baked" config file altogether (we cannot control the spawning of gsyncd when glusterd starts)? That way, gsyncd will spawn and terminate. A 'create force' is needed anyway (it is the next step of the upgrade), so that should not be a problem, and it additionally cuts down a step from the upgrade doc.

Creation of the half-baked config file is prevented by the patch http://review.gluster.org/#/c/6856/, and the upgrade steps have also changed: Stop Geo-replication -> Upgrade All Master and Slave Nodes -> Start Geo-replication. If the config file is not corrupted, two monitor processes will not start.
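The changed upgrade procedure maps onto the standard geo-replication CLI. A minimal sketch, assuming the master volume is named `master` and the slave is `euclid::slave` (names taken from the session in the ps output above); this is an illustration, not part of the original report:

```sh
# 1. Stop the geo-rep session on the master cluster before upgrading.
gluster volume geo-replication master euclid::slave stop

# 2. Upgrade all master and slave nodes (package upgrade not shown here).

# 3. Start the session again once every node is upgraded.
gluster volume geo-replication master euclid::slave start
```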
Command behaviour before and after the patch:

| Command | Before | Now |
|---|---|---|
| status | Uses the template conf if the session conf is not present; status shows fine. | Status shows config corrupted if the session conf is not present. |
| start | Starts geo-rep successfully even if gsyncd.conf does not exist, but creates a half-baked gsyncd.conf. | Start and start force fail if gsyncd.conf does not exist. |
| stop | Succeeds if gsyncd.conf does not exist; fails with a verification error if a half-baked gsyncd.conf exists. | Fails if gsyncd.conf does not exist or if a half-baked gsyncd.conf exists. Start force will succeed. |

Half-baked config prevention is verified in BZ 1162142 as part of RHS 2.1.6. Closing this bug as a duplicate of 1162142. Please reopen if the issue still exists.

*** This bug has been marked as a duplicate of bug 1162142 ***
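A closing illustration (not part of the original report): one way to check whether a session conf is the half-baked variant described above, and whether duplicate monitor processes are running. The conf path is taken from the ps output in this report, and the `state_file` key name is an assumption.

```sh
# Session conf path as seen in the ps output above; adjust for your session.
CONF=/var/lib/glusterd/geo-replication/master_euclid_slave/gsyncd.conf

# A half-baked conf exists but lacks the state-file entry that a full
# 'create force' writes (exact key name assumed to be state_file).
if [ ! -f "$CONF" ]; then
    echo "gsyncd.conf missing"
elif ! grep -q 'state_file' "$CONF"; then
    echo "gsyncd.conf looks half-baked (no state_file entry)"
fi

# More than one line here means duplicate monitor processes,
# i.e. the situation this bug describes.
ps auxww | grep '[g]syncd.py.*--monitor'
```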