Description of problem:

I have a replica 2 setup as follows:
------------
root@gfs_serv0:~] gluster v info gfs_replicated_vol

Volume Name: gfs_replicated_vol
Type: Replicate
Volume ID: 4e72e0cc-318f-4706-92af-4a56fc793063
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gfs_serv0:/mnt/bricks/gfs_replicated_vol/brick
Brick2: gfs_serv1:/mnt/bricks/gfs_replicated_vol/brick
Options Reconfigured:
nfs.disable: on
cluster.self-heal-daemon: on
nfs.enable-ino32: on
network.ping-timeout: 10
root@gfs_serv0:~]

I check for heal info and I find that all files are in sync:
------------
root@gfs_serv0:~] gluster volume heal gfs_replicated_vol info
Gathering Heal info on volume gfs_replicated_vol has been successful

Brick gfs_serv0:/mnt/bricks/gfs_replicated_vol/brick
Number of entries: 0

Brick gfs_serv1:/mnt/bricks/gfs_replicated_vol/brick
Number of entries: 0
root@gfs_serv0:~]

However, for a few files, the md5sums of the two replicas (one on each brick) differ, even though their sizes are identical. To illustrate:
------------
root@gfs_serv0:~] ls -li /mnt/bricks/gfs_replicated_vol/brick/bin/bash
4187497 -rwxr-xr-x 2 root root 666648 Dec 6 22:48 /mnt/bricks/gfs_replicated_vol/brick/bin/bash
root@gfs_serv0:~] md5sum /mnt/bricks/gfs_replicated_vol/brick/bin/bash
fc61db7be6eeda79f0b0bff58e622ace  /mnt/bricks/gfs_replicated_vol/brick/bin/bash

root@gfs_serv1:~] ls -li /mnt/bricks/gfs_replicated_vol/brick/bin/bash
8389144 -rwxr-xr-x 2 root root 666648 Dec 6 22:48 /mnt/bricks/gfs_replicated_vol/brick/bin/bash
root@gfs_serv1:~] md5sum /mnt/bricks/gfs_replicated_vol/brick/bin/bash
154b9852621b1651aff4af0764897c9a  /mnt/bricks/gfs_replicated_vol/brick/bin/bash

Further information on the two replicas:
------------
root@gfs_serv0:~] getfattr -m . -d -e hex /mnt/bricks/gfs_replicated_vol/brick/bin/bash
getfattr: Removing leading '/' from absolute path names
# file: mnt/bricks/gfs_replicated_vol/brick/bin/bash
trusted.afr.gfs_replicated_vol-client-0=0x000000000000000000000000
trusted.afr.gfs_replicated_vol-client-1=0x000000000000000000000000
trusted.gfid=0xc577569a74cb4f23825daef95e9dcbb4

root@gfs_serv1:~] getfattr -m . -d -e hex /mnt/bricks/gfs_replicated_vol/brick/bin/bash
getfattr: Removing leading '/' from absolute path names
# file: mnt/bricks/gfs_replicated_vol/brick/bin/bash
trusted.afr.gfs_replicated_vol-client-0=0x000000000000000000000000
trusted.afr.gfs_replicated_vol-client-1=0x000000000000000000000000
trusted.gfid=0xc577569a74cb4f23825daef95e9dcbb4

I tried to force a heal by triggering a lookup on the file using

    ls -l /gfs_replicated_vol/brick/bin/bash

but that made no difference.

Version-Release number of selected component (if applicable):

How reproducible:
Intermittent.

Steps to Reproduce:
I do not have a simple way to reproduce this - it is just something we see on our servers on rare days.

Actual results:
Files are not in sync, yet "gluster volume heal info" does not report a single entry.

Expected results:
When files are not in sync, glusterfs should identify them and report them to the system administrator under the heal info command.

Additional info:
None
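For reference, the following is a minimal sketch of the ad-hoc check one can use to look for other affected files. It assumes passwordless ssh from gfs_serv0 to gfs_serv1, GNU find/sort/xargs on both hosts, and that reading file contents directly from the brick directories is acceptable; the hostnames and brick path are the ones from this report.

------------
#!/bin/sh
# Compare per-file md5sums of the two bricks and print files whose
# replicas differ (or that exist on only one brick).

BRICK=/mnt/bricks/gfs_replicated_vol/brick

# Checksum every regular file on the local brick, skipping gluster's
# internal .glusterfs directory. Paths are NUL-delimited and sorted so
# the two listings line up for diff.
( cd "$BRICK" && find . -path ./.glusterfs -prune -o -type f -print0 \
    | sort -z | xargs -0 md5sum ) > /tmp/sums.serv0

# Repeat on the remote brick.
ssh gfs_serv1 "cd $BRICK && find . -path ./.glusterfs -prune -o -type f -print0 | sort -z | xargs -0 md5sum" > /tmp/sums.serv1

# Any line present in only one listing is a mismatched replica.
diff /tmp/sums.serv0 /tmp/sums.serv1
------------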
The root cause behind this problem seems to be one I debugged a short while back. When we reboot gfs_serv1, we kill off the glusterfsd processes on it so that we can unmount the bricks. At this point, as the tcp connection between the mount client on gfs_serv0 and the glusterfsd on gfs_serv1 breaks, the former makes portmap queries to the glusterd on gfs_serv1 to get the port number for glusterfsd. If glusterd is not yet aware that glusterfsd is dead, it can return the port number of the dead glusterfsd process. The portmap query then succeeds, but no actual connection is established between the mount client and the brick, so updates made to files through gfs_serv0 are not immediately propagated to the replica on gfs_serv1. http://www.gluster.org/pipermail/gluster-users/2015-February/020770.html has all the technical details.
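As a sketch of how one might spot this window on an affected node: compare the brick port that glusterd advertises against what is actually listening. The commands below are standard, but the exact output format varies across GlusterFS versions, and port 49152 is only an example value, so treat this as an assumption rather than a verified recipe.

------------
# On gfs_serv1: ask glusterd which port it advertises for the brick.
gluster volume status gfs_replicated_vol

# Check whether anything is actually listening on the advertised port
# (use netstat -ltnp on older systems without ss), e.g. for port 49152:
ss -ltnp | grep 49152

# If glusterd advertises a port with no listening glusterfsd behind it,
# the mount client's portmap query "succeeds" against a dead brick,
# matching the scenario described above.
------------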
GlusterFS 3.7.0 has been released (http://www.gluster.org/pipermail/gluster-users/2015-May/021901.html), and the Gluster project maintains N-2 supported releases: the last two releases before 3.7 are still maintained, at the moment 3.6 and 3.5. This bug has been filed against the 3.4 release and will not get fixed in a 3.4 version any more. Please verify whether newer versions are affected by the reported problem. If that is the case, update the bug with a note, and update the version if you can. In case updating the version is not possible, leave a comment in this bug report with the version you tested, and set the "Need additional information the selected bugs from" field below the comment box to "bugs". If there is no response by the end of the month, this bug will get automatically closed.
GlusterFS 3.4.x has reached end-of-life. If this bug still exists in a later release, please reopen it and change the version, or open a new bug.