Description of problem:
- GlusterFS is not durable against power outages. A power loss always leads to split-brain.

How reproducible:
- Always

Steps to Reproduce:
1. Create a cluster: 3 nodes, 1 volume, 3 bricks, 3 replicas.
2. Mount the volume somewhere and start writing many files.
3. Switch the power off for one of the nodes (stop it forcefully in the case of a virtual machine).
4. Start the node again.
5. Run: gluster volume heal volumename full
6. Check: gluster volume heal volumename info split-brain

Actual results:
- cat: /mnt/gluster-volume/filename.log: Input/output error

Expected results:
- If we have 3 replicas and 2 of them are identical (quorum), the file should be healed automatically and should not lead to an Input/output error.

Additional info:
- Check the file size on all nodes:
  $ ls -ln /data/glusterfs/volume/filename.log
  gluster.1: -rw-r--r-- 2 106 114 9869 Dec 15 04:47 /data/glusterfs/volume/filename.log
  gluster.2: -rw-r--r-- 2 106 114 10008 Dec 15 04:49 /data/glusterfs/volume/filename.log
  gluster.3: -rw-r--r-- 2 106 114 10008 Dec 15 04:49 /data/glusterfs/volume/filename.log
- Get the extended attributes on all nodes:
  $ getfattr -m . -d -e hex /data/glusterfs/volume/filename.log
  gluster.1:
  # file: /data/glusterfs/volume/filename.log
  trusted.afr.rpaas-client-21=0x000000000000000000000000
  trusted.afr.rpaas-client-22=0x000000000000000000000000
  trusted.afr.rpaas-client-23=0x000000000000000000000000
  trusted.gfid=0x21b7709eca5e481ab2b9e5d73e219b03
  gluster.2:
  # file: /data/glusterfs/volume/filename.log
  trusted.afr.rpaas-client-21=0x000000000000000000000000
  trusted.afr.rpaas-client-22=0x000000000000000000000000
  trusted.afr.rpaas-client-23=0x000000000000000000000000
  trusted.gfid=0x21b7709eca5e481ab2b9e5d73e219b03
  gluster.3:
  # file: /data/glusterfs/volume/filename.log
  trusted.afr.rpaas-client-21=0x000000000000000000000000
  trusted.afr.rpaas-client-22=0x000000000000000000000000
  trusted.afr.rpaas-client-23=0x000000000000000000000000
  trusted.gfid=0x21b7709eca5e481ab2b9e5d73e219b03
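For anyone reading the getfattr output above, here is a minimal sketch of how a trusted.afr.* value could be decoded. It assumes the usual AFR changelog layout of three big-endian 32-bit counters (pending data, metadata, and entry operations). The function name decode_afr_changelog is hypothetical and is not part of any GlusterFS tool.

```python
import struct


def decode_afr_changelog(hex_value: str) -> dict:
    """Split a 12-byte trusted.afr.* value into its pending counters.

    Assumption: the value holds three big-endian 32-bit counters in the
    order data, metadata, entry.
    """
    # getfattr -e hex prints values with a leading "0x"; strip it if present.
    digits = hex_value[2:] if hex_value.lower().startswith("0x") else hex_value
    raw = bytes.fromhex(digits)
    if len(raw) != 12:
        raise ValueError(f"expected 12 bytes, got {len(raw)}")
    data, metadata, entry = struct.unpack(">III", raw)
    return {"data": data, "metadata": metadata, "entry": entry}


if __name__ == "__main__":
    # Value copied from the report: all counters are zero on every brick.
    print(decode_afr_changelog("0x000000000000000000000000"))
```

Under that assumption, every brick in this report shows zero pending operations, so the changelog xattrs blame no replica even though the file sizes differ.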
This never happens if a host is rebooted cleanly (sudo reboot); it only happens when a host stops suddenly due to a crash or power outage.
This bug is being closed because version 3.5 is marked End-Of-Life. There will be no further updates to this version. If you still face this issue in a more current release, please open a new bug against a version that still receives bugfixes.
We would like to confirm that this is fixed in afr-v2. It seems to be fixed, but we will close the bug only after confirmation.
In afr-v2 (which is available from glusterfs-3.6 onwards), if there are no pending afr xattrs on a file but there is a size mismatch, it will choose the bigger file as the source and trigger a heal instead of returning EIO. Hence closing this bug.
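The rule described in this comment can be illustrated with a minimal sketch. This is not the afr-v2 implementation; the Replica type and pick_heal_source function are hypothetical names, and the sketch covers only the case from this bug (no pending afr xattrs, differing sizes).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    name: str      # brick label, e.g. "gluster.1"
    size: int      # file size in bytes on that brick
    pending: bool  # True if any trusted.afr.* counter on the brick is non-zero


def pick_heal_source(replicas: List[Replica]) -> Optional[Replica]:
    """Return the replica to heal from, or None if all sizes already match.

    Illustrative only: handles just the "no pending xattrs" case.
    """
    if any(r.pending for r in replicas):
        # Pending changelogs need the normal blame-based logic,
        # which this sketch does not attempt to model.
        raise ValueError("pending afr xattrs present; out of scope here")
    if len({r.size for r in replicas}) == 1:
        return None  # sizes agree, nothing to heal
    # No xattr blames anyone, so fall back to the biggest copy.
    return max(replicas, key=lambda r: r.size)


if __name__ == "__main__":
    # Sizes taken from the ls -ln output in the report.
    bricks = [
        Replica("gluster.1", 9869, False),
        Replica("gluster.2", 10008, False),
        Replica("gluster.3", 10008, False),
    ]
    source = pick_heal_source(bricks)
    print(source.name if source else "no heal needed")
```

With the sizes from the report, the 10008-byte copy is chosen as the source and the 9869-byte copy on gluster.1 would be healed from it, instead of reads failing with an Input/output error.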