Bug 1196632 - dist-geo-rep: Concurrent renames and node reboots results in slave having both source and destination of file with destination being 0 byte sticky file
Summary: dist-geo-rep: Concurrent renames and node reboots results in slave having both source and destination of file with destination being 0 byte sticky file
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: geo-replication
Version: mainline
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Kotresh HR
QA Contact:
URL:
Whiteboard:
Depends On: 1140183
Blocks:
 
Reported: 2015-02-26 12:23 UTC by Kotresh HR
Modified: 2015-05-14 17:35 UTC
CC List: 11 users

Fixed In Version: glusterfs-3.7.0beta1
Doc Type: Bug Fix
Doc Text:
Clone Of: 1140183
Environment:
Last Closed: 2015-05-14 17:26:30 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Kotresh HR 2015-02-26 12:23:39 UTC
+++ This bug was initially created as a clone of Bug #1140183 +++

Description of problem:
Renames were being done from the master mount when one of the nodes got rebooted. After the node came back up, the slave ended up with more files than the master. The slave actually had both the source and the destination name for a few files, and the destination files were 0-byte, sticky-bit-set linkto files.

Version-Release number of selected component (if applicable):
mainline

How reproducible:
Not sure. Seen once.

Steps to Reproduce:
1. Create and start a geo-rep session between a 2x2 master and a 2x2 slave volume.
2. Start renaming all the files from the master mount point:
  find /mnt/master -type f -exec mv {} {}_renamed \;
3. While the renames are in progress, reboot one of the "Active" nodes in the master cluster (a hedged command sketch of these steps is given below).
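
A rough command sequence for the steps above; mastervol, slavevol and slavenode are placeholder names, and the mount points are the ones used in this report:

  # Create and start the geo-rep session (names are placeholders)
  gluster volume geo-replication mastervol slavenode::slavevol create push-pem
  gluster volume geo-replication mastervol slavenode::slavevol start

  # Kick off the renames from the master mount
  find /mnt/master -type f -exec mv {} {}_renamed \;

  # While the renames are running, reboot one of the "Active" master nodes
  # (run this on that node)
  reboot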

Actual results:
The slave has more files than the master.

[root@rhsauto029 ~]# find /mnt/master/ | wc -l
33494

[root@rhsauto029 ~]# find /mnt/slave/ | wc -l
33561

Both the source and the target file were present on the slave:

[root@rhsauto029 ~]# ls -lh /mnt/slave/linux-3.0/drivers/media/dvb/mantis/Makefile
-rw-rw-r-- 1 root root 622 Jul 22  2011 /mnt/slave/linux-3.0/drivers/media/dvb/mantis/Makefile
[root@rhsauto029 ~]# ls -lh /mnt/slave/linux-3.0/drivers/media/dvb/mantis/Makefile_renamed 
---------T 1 root root 0 Sep  9 05:40 /mnt/slave/linux-3.0/drivers/media/dvb/mantis/Makefile_renamed


As you can see, the destination file (*_renamed) has the sticky bit set and is zero bytes in size.

The gfids of the two files were also the same:

[root@rhsauto029 ~]# getfattr -d -m . -n "glusterfs.gfid.string" /mnt/slave/linux-3.0/drivers/media/dvb/mantis/Makefile 2> /dev/null 
# file: mnt/slave/linux-3.0/drivers/media/dvb/mantis/Makefile
glusterfs.gfid.string="6d613003-a35a-489a-826f-14e4a964134f"

[root@rhsauto029 ~]# getfattr -d -m . -n "glusterfs.gfid.string" /mnt/slave/linux-3.0/drivers/media/dvb/mantis/Makefile_renamed 2> /dev/null 
# file: mnt/slave/linux-3.0/drivers/media/dvb/mantis/Makefile_renamed
glusterfs.gfid.string="6d613003-a35a-489a-826f-14e4a964134f"
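
To enumerate every such stale entry, one option (a sketch, assuming the same /mnt/slave mount point) is to look for zero-byte files with the sticky bit set:

  # List 0-byte sticky files visible on the slave mount
  # (candidate stale linkto files)
  find /mnt/slave -type f -perm -1000 -size 0 -ls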


Expected results:
All the renames should be synced to the slave.

Additional info:

The changelog entries from the working-dir of the node that was rebooted:

[root@rhsauto048 7d805e4489617ef3f01d944e965cb309]# find . -type f | xargs grep "fa9725cf-e888-4b95-a33b-aa6bc6f83c62"
./.processed/CHANGELOG.1410264742:E fa9725cf-e888-4b95-a33b-aa6bc6f83c62 MKNOD 33280 0 0 bafa54a2-d7b6-4124-a0c6-6e1e9bee8442%2Fmantis_dma.c_renamed
./.processed/CHANGELOG.1410264742:M fa9725cf-e888-4b95-a33b-aa6bc6f83c62 NULL
./.processed/CHANGELOG.1410264742:D fa9725cf-e888-4b95-a33b-aa6bc6f83c62


The changelog entries from the working-dir of the node that is the replica pair of the node that went down:

[root@rhsauto049 cfdffea3581f40685f18a34384edc263]# find . -type f | xargs grep "fa9725cf-e888-4b95-a33b-aa6bc6f83c62"
./.processing/CHANGELOG.1410264734:M fa9725cf-e888-4b95-a33b-aa6bc6f83c62 SETATTR
./.processing/CHANGELOG.1410264719:M fa9725cf-e888-4b95-a33b-aa6bc6f83c62 NULL
./.processing/CHANGELOG.1410264719:E fa9725cf-e888-4b95-a33b-aa6bc6f83c62 RENAME bafa54a2-d7b6-4124-a0c6-6e1e9bee8442%2Fmantis_dma.c bafa54a2-d7b6-4124-a0c6-6e1e9bee8442%2Fmantis_dma.c_renamed

Root cause of the issue:

Without a node reboot, the changelog entries are as follows.
touch f1
mv f1 f2 (Assuming f2 hashed subvolume is b2)

| log    | b1 | log    | b1 repl || log    | b2          | log    | b2 repl     |
| CREATE | f1 | CREATE | f1      || -      | -           | -      | -           |
| -      | f2 | -      | f2      || RENAME | f2 (sticky) | RENAME | f2 (sticky) |


When the b2 replica is down during the RENAME and later comes back:

mv f1 f2 (Assuming f2 hashed subvolume is b2)

| log    | b1 | log    | b1 repl || log    | b2          | log   | b2 repl     |
| CREATE | f1 | CREATE | f1      || -      | -           | -     | -           |
| -      | f2 | -      | f2      || RENAME | f2 (sticky) |       |             |
| -      | f2 | -      | f2      || -      | f2 (sticky) | MKNOD | f2 (sticky) | <-- self heal

Once the b2 replica comes back, if it becomes ACTIVE, the RENAME processing is missed; instead a sticky file is created on the slave, because only the MKNOD is recorded on that brick.

--- Additional comment from Aravinda VK on 2014-09-11 05:48:06 EDT ---

Reformatted.

Root cause of the issue:

Without a node reboot, the changelog entries are as follows.
touch f1
mv f1 f2 (Assuming f2 hashed subvolume is b2)

Brick1
======
| log    | b1 | log    | b1 repl |
| CREATE | f1 | CREATE | f1      |
| -      | f2 | -      | f2      |

Brick2
======
| log    | b2          | log    | b2 repl     |
| -      | -           | -      | -           |
| RENAME | f2 (sticky) | RENAME | f2 (sticky) |

When the b2 replica is down during the RENAME and later comes back:

mv f1 f2 (Assuming f2 hashed subvolume is b2)

Brick1
======
| log    | b1 | log    | b1 repl |
| CREATE | f1 | CREATE | f1      |
| -      | f2 | -      | f2      |
| -      | f2 | -      | f2      |

Brick2
======
| log    | b2          | log   | b2 repl     |
| -      | -           | -     | -           |
| RENAME | f2 (sticky) |       |             |
| -      | f2 (sticky) | MKNOD | f2 (sticky) | <-- self heal


Once the b2 replica comes back, if it becomes ACTIVE, the RENAME processing is missed; instead a sticky file is created on the slave, because only the MKNOD is recorded on that brick.
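
On the slave side, such an entry can be confirmed to be a DHT linkto file by reading the trusted.glusterfs.dht.linkto xattr directly on the brick backend (a sketch; the brick path is a placeholder):

  # Run as root on the slave node hosting the brick; the brick path is hypothetical
  getfattr -d -m . -e text \
    /bricks/slave_brick1/linux-3.0/drivers/media/dvb/mantis/Makefile_renamed
  # A linkto file carries trusted.glusterfs.dht.linkto pointing at the
  # subvolume where DHT expects the data file to live.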

Comment 1 Anand Avati 2015-02-26 12:34:51 UTC
REVIEW: http://review.gluster.org/9759 (feature/geo-rep: Active Passive Switching logic flock) posted (#1) for review on master by Kotresh HR (khiremat)

Comment 2 Anand Avati 2015-03-06 08:07:16 UTC
REVIEW: http://review.gluster.org/9759 (feature/geo-rep: Active Passive Switching logic flock) posted (#2) for review on master by Kotresh HR (khiremat)

Comment 3 Anand Avati 2015-03-06 09:48:00 UTC
REVIEW: http://review.gluster.org/9759 (feature/geo-rep: Active Passive Switching logic flock) posted (#3) for review on master by Kotresh HR (khiremat)

Comment 4 Anand Avati 2015-03-13 06:59:49 UTC
REVIEW: http://review.gluster.org/9759 (feature/geo-rep: Active Passive Switching logic flock) posted (#4) for review on master by Kotresh HR (khiremat)

Comment 5 Anand Avati 2015-03-16 04:54:03 UTC
COMMIT: http://review.gluster.org/9759 committed in master by Vijay Bellur (vbellur) 
------
commit f0224ce93ae9ad420e23612fe6e6707a821f9cab
Author: Kotresh HR <khiremat>
Date:   Mon Feb 23 14:46:48 2015 +0530

    feature/geo-rep: Active Passive Switching logic flock
    
    CURRENT DESIGN AND ITS LIMITATIONS:
    -----------------------------------
    Geo-replication syncs changes across geographies using changelogs captured
    by the changelog translator. The changelog translator sits on the server
    side, just above the posix translator. Hence, in a distributed replicated
    setup, both bricks of a replica pair collect changelogs w.r.t. their bricks.
    Geo-replication syncs the changes using only one brick of the replica pair
    at a time, calling it "ACTIVE" and the other, non-syncing brick "PASSIVE".
    
    Let's consider the below example of a distributed replicated setup, where
    NODE-1 has brick b1 and its replica brick b1r is on NODE-2.
    
            NODE-1                         NODE-2
              b1                            b1r
    
    At the beginning, geo-replication chooses to sync changes from NODE-1:b1,
    and NODE-2:b1r will be 'PASSIVE'. The logic depends on the virtual getxattr
    'trusted.glusterfs.node-uuid', which always returns the first up subvolume,
    i.e., NODE-1. When NODE-1 goes down, the above xattr returns NODE-2 and
    that is made 'ACTIVE'. But when NODE-1 comes back again, the above xattr
    returns NODE-1 and it is made 'ACTIVE' again. So for a brief interval of
    time, if NODE-2 has not finished processing the changelog, both NODE-2
    and NODE-1 will be ACTIVE, causing the rename race described in this bug.
    
    SOLUTION:
    ---------
    1. Have a shared replicated storage, a glusterfs management volume specific
       to geo-replication.
    
    2. Geo-rep creates a file per replica set on management volume.
    
    3. fcntl lock on the above said file is used for synchronization
       between geo-rep workers belonging to same replica set.
    
    4. If the management volume is not configured, geo-replication falls back
       to the previous logic of using the first up subvolume.
    
    Each worker tries to lock the file on the shared storage; whoever wins
    becomes ACTIVE. With this we are able to solve the problem, but there is
    an issue when the shared replicated storage goes down (i.e., when all of
    its replicas go down): the lock state is lost. So AFR needs to rebuild
    the lock state after the bricks come back up.
    
    NOTE:
    -----
    This patch brings in the pre-requisite step of setting up a management
    volume for geo-replication during session creation.
    
    1. Create a mgmt-vol for geo-replication and start it. The management volume
       should be part of the master cluster; a three-way replicated volume with
       each brick on a different node is recommended for availability.
    2. Create the geo-rep session.
    3. Configure the created mgmt-vol with the geo-replication session as follows:
       gluster vol geo-rep <mastervol> slavenode::<slavevol> config meta_volume \
       <meta-vol-name>
    4. Start the geo-rep session.
    
    Backward Compatibility:
    -----------------------
    If the management volume is not configured, it falls back to the previous
    logic of using the node-uuid virtual xattr, but this is not recommended.
    
    Change-Id: I7319d2289516f534b69edd00c9d0db5a3725661a
    BUG: 1196632
    Signed-off-by: Kotresh HR <khiremat>
    Reviewed-on: http://review.gluster.org/9759
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Vijay Bellur <vbellur>
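
The election idea in the commit can be illustrated with a small shell sketch. This is not the gsyncd code; flock(1) stands in for the fcntl lock each worker takes, and the lock-file path and volume names are placeholders:

  # Each worker of a replica set races for an exclusive, non-blocking lock
  # on its per-replica-set file kept on the shared meta volume.
  LOCK=/mnt/geo-rep-meta/mastervol_slavevol_subvol1.lock   # hypothetical path
  exec 9>"$LOCK"
  if flock -xn 9; then
      echo "lock acquired: this worker becomes ACTIVE and syncs changelogs"
  else
      echo "lock held by the replica peer: this worker stays PASSIVE"
  fi
  # The lock is dropped automatically if the holder (or its node) dies,
  # so the surviving replica can take over.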

Comment 6 Anand Avati 2015-03-30 09:00:11 UTC
REVIEW: http://review.gluster.org/10043 (doc/geo-rep: Documentation for management volume for geo-rep) posted (#1) for review on master by Kotresh HR (khiremat)

Comment 7 Anand Avati 2015-03-30 09:03:07 UTC
REVIEW: http://review.gluster.org/10043 (doc/geo-rep: Documentation for management volume in geo-rep) posted (#2) for review on master by Kotresh HR (khiremat)

Comment 8 Anand Avati 2015-03-31 09:42:13 UTC
REVIEW: http://review.gluster.org/10043 (doc/geo-rep: Documentation for management volume for geo-rep) posted (#3) for review on master by Kotresh HR (khiremat)

Comment 9 Anand Avati 2015-03-31 15:20:12 UTC
REVIEW: http://review.gluster.org/10043 (doc/geo-rep: Documentation for management volume for geo-rep) posted (#4) for review on master by Kotresh HR (khiremat)

Comment 10 Anand Avati 2015-03-31 20:42:48 UTC
COMMIT: http://review.gluster.org/10043 committed in master by Kaleb KEITHLEY (kkeithle) 
------
commit f5e4c943cf520c6ec2df3c231fef9ae4116097b8
Author: Kotresh HR <khiremat>
Date:   Mon Mar 30 13:10:00 2015 +0530

    doc/geo-rep: Documentation for management volume for geo-rep
    
    Documented new changes to admin guide for setting up
    geo-replication with the new active/passive switching
    logic that comes with http://review.gluster.org/#/c/9759/
    
    Change-Id: I47de9d2c1e678f7ad789f0ca2acf7ce67eb96c62
    BUG: 1196632
    Signed-off-by: Kotresh HR <khiremat>
    Reviewed-on: http://review.gluster.org/10043
    Reviewed-by: Aravinda VK <avishwan>
    Reviewed-by: Humble Devassy Chirammal <humble.devassy>
    Tested-by: Gluster Build System <jenkins.com>
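
A hedged sketch of the setup described in the commits above, following the syntax given in comment 5; volume, brick and host names are placeholders:

  # 1. Create a three-way replicated meta volume on the master cluster and start it
  gluster volume create geo-rep-meta replica 3 \
      node1:/bricks/meta node2:/bricks/meta node3:/bricks/meta
  gluster volume start geo-rep-meta

  # 2. Create the geo-rep session
  gluster volume geo-replication mastervol slavenode::slavevol create push-pem

  # 3. Point the session at the meta volume (syntax as given in comment 5)
  gluster volume geo-replication mastervol slavenode::slavevol config meta_volume geo-rep-meta

  # 4. Start the session
  gluster volume geo-replication mastervol slavenode::slavevol start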

Comment 11 Niels de Vos 2015-05-14 17:26:30 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.0, please open a new bug report.

glusterfs-3.7.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/10939
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user


