Description of problem:
I can consistently reproduce a split-brain state that is never logged and never triggers EIO, leaving the file available for error-free read/write access while split-brained.

Version-Release number of selected component (if applicable):

[root@n2 ~]# gluster --version
glusterfs 3.7.0 built on May 20 2015 13:30:05
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.

[root@n2 ~]# rpm -qa | grep gluster
glusterfs-libs-3.7.0-2.el7.x86_64
glusterfs-cli-3.7.0-2.el7.x86_64
glusterfs-3.7.0-2.el7.x86_64
glusterfs-fuse-3.7.0-2.el7.x86_64
glusterfs-client-xlators-3.7.0-2.el7.x86_64
glusterfs-server-3.7.0-2.el7.x86_64
glusterfs-api-3.7.0-2.el7.x86_64
glusterfs-geo-replication-3.7.0-2.el7.x86_64

How reproducible:
Consistently

Steps to Reproduce:

1. Create a test file:

[root@n1 ~]# dd if=/dev/urandom of=/rhgs/client/rep01/file002 bs=1k count=1k

2. Confirm that the file hashes to the bricks on n1 and n2:

[root@n1 ~]# ls -lh /rhgs/bricks/rep01/file002
-rw-r--r-- 2 root root 22 Jun 3 12:18 /rhgs/bricks/rep01/file002

[root@n2 ~]# ls -lh /rhgs/bricks/rep01/file002
-rw-r--r-- 2 root root 22 Jun 3 12:18 /rhgs/bricks/rep01/file002

3. Induce a network split by using iptables to drop all packets from n1 to n2, then append data to the test file from n1:

#!/bin/bash
exe() { echo "\$ $@" ; "$@" ; }
if [ $HOSTNAME == "n1" ]; then
    echo "Inducing network split with iptables..."
    exe iptables -F
    exe iptables -A OUTPUT -d n2 -j DROP
    echo "Adding 1MB of random data to file002..."
    exe dd if=/dev/urandom bs=1k count=1k >> /rhgs/client/rep01/file002
    echo "Generating md5sum for file002..."
    exe md5sum /rhgs/client/rep01/file002
else
    echo "Wrong host!"
fi

4. Append data to the test file from n2:

#!/bin/bash
exe() { echo "\$ $@" ; "$@" ; }
if [ $HOSTNAME == "n2" ]; then
    echo "Adding 2MB of random data to file002..."
    exe dd if=/dev/urandom bs=1k count=2k >> /rhgs/client/rep01/file002
    echo "Generating md5sum for file002..."
    exe md5sum /rhgs/client/rep01/file002
else
    echo "Wrong host!"
fi

5. Correct the network split and stat the file from the client:

#!/bin/bash
exe() { echo "\$ $@" ; "$@" ; }
if [ $HOSTNAME == "n1" ]; then
    echo "Correcting network split with iptables..."
    exe iptables -F OUTPUT
    echo "Statting file002 to induce heal..."
    exe stat /rhgs/client/rep01/file002
else
    echo "Wrong host!"
fi

6. Cat the file (this should result in EIO, but does not):

[root@n1 ~]# cat /rhgs/client/rep01/file002 > /dev/null

7. Add new data to the file from n1:

[root@n1 ~]# dd if=/dev/urandom bs=1k count=1k >> /rhgs/client/rep01/file002
1024+0 records in
1024+0 records out
1048576 bytes (1.0 MB) copied, 0.138334 s, 7.6 MB/s

8. Look for the expected split-brain errors in the gluster logs (nothing is returned):

[root@n1 ~]# grep -i split /var/log/glusterfs/{*,bricks/*} 2>/dev/null
[root@n2 ~]# grep -i split /var/log/glusterfs/{*,bricks/*} 2>/dev/null

9. Confirm that the brick copies differ and that each copy considers itself WISE:

[root@n1 ~]# md5sum /rhgs/bricks/rep01/file002
d70a816aab125567c185bc047f4358b0  /rhgs/bricks/rep01/file002

[root@n1 ~]# getfattr -d -m . -e hex /rhgs/bricks/rep01/file002
getfattr: Removing leading '/' from absolute path names
# file: rhgs/bricks/rep01/file002
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.rep01-client-0=0x000000000000000000000000
trusted.afr.rep01-client-1=0x000000910000000000000000
trusted.bit-rot.version=0x0200000000000000556dd9df000a770c
trusted.gfid=0x8740772d4f204ce183f010a80e76015c

[root@n2 ~]# md5sum /rhgs/bricks/rep01/file002
bcb17a86bf54db36fa874030fde8da4b  /rhgs/bricks/rep01/file002

[root@n2 ~]# getfattr -d -m . -e hex /rhgs/bricks/rep01/file002
getfattr: Removing leading '/' from absolute path names
# file: rhgs/bricks/rep01/file002
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.rep01-client-0=0x000000310000000000000000
trusted.afr.rep01-client-1=0x000000000000000000000000
trusted.bit-rot.version=0x0200000000000000556dd9de000db404
trusted.gfid=0x8740772d4f204ce183f010a80e76015c

Actual results:
The split file can be stat'd, ls'd, and cat'd from the client without error.

Expected results:
File operations on the split file should result in EIO.

Additional info:

Topology for volume rep01:

Distribute set
  |
  +-- Replica set 0
  |     |
  |     +-- Brick 0: n1:/rhgs/bricks/rep01
  |     |
  |     +-- Brick 1: n2:/rhgs/bricks/rep01
  |
  +-- Replica set 1
        |
        +-- Brick 0: n3:/rhgs/bricks/rep01
        |
        +-- Brick 1: n4:/rhgs/bricks/rep01

[root@n1 ~]# gluster volume info rep01

Volume Name: rep01
Type: Distributed-Replicate
Volume ID: 6ff17d21-035d-47e7-8bd1-d4a9e850be31
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: n1:/rhgs/bricks/rep01
Brick2: n2:/rhgs/bricks/rep01
Brick3: n3:/rhgs/bricks/rep01
Brick4: n4:/rhgs/bricks/rep01
Options Reconfigured:
performance.readdir-ahead: on

Client mounts on n1 and n2:

[root@n1 ~]# grep client /etc/fstab
n1:rep01    /rhgs/client/rep01    glusterfs    _netdev    0 0
[root@n1 ~]# mount | grep client
n1:rep01 on /rhgs/client/rep01 type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

[root@n2 ~]# grep client /etc/fstab
n1:rep01    /rhgs/client/rep01    glusterfs    _netdev    0 0
[root@n2 ~]# mount | grep client
n1:rep01 on /rhgs/client/rep01 type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
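For clarity, the xattrs in step 9 are what make this a split-brain: on n1, trusted.afr.rep01-client-1 carries a non-zero data-pending counter (0x91) blaming n2's brick, while on n2, trusted.afr.rep01-client-0 (0x31) blames n1's brick, and neither blames itself. Below is a minimal sketch for decoding those counters, assuming the usual AFR changelog layout of three 4-byte big-endian counters (data, metadata, entry); the variable names are only for illustration.

#!/bin/bash
# Sketch: decode the data-pending counter (first 4 bytes) of each AFR
# changelog xattr on this node's brick copy of the file. Run on n1 and n2.
BRICK_FILE=/rhgs/bricks/rep01/file002
for attr in trusted.afr.rep01-client-0 trusted.afr.rep01-client-1; do
    # Pull the hex value, e.g. 0x000000910000000000000000
    val=$(getfattr -n "$attr" -e hex "$BRICK_FILE" 2>/dev/null | awk -F= '/=0x/ {print $2}')
    # The first 8 hex digits after "0x" are the data-pending counter
    echo "$attr data-pending: $((16#${val:2:8}))"
done

A non-zero data-pending count for the *other* brick on both copies, with matching gfids and differing md5sums, is exactly the "each side considers itself WISE" condition described above.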
FWIW, I tried a bunch of iptables tricks and couldn't find a way to reproduce this on a single node. It does seem specific to a two-node (or at least two-glusterd) configuration.
OK, I lied. Previously, I had been cutting off access to each brick sequentially, with client unmounts and remounts in between. This time, I mounted twice simultaneously, and cut off each client's connection to one brick. Something like this (your port numbers may vary):

> iptables -t mangle -I OUTPUT -p tcp --sport 1020 --dport 49152 -j DROP
> iptables -t mangle -I OUTPUT -p tcp --sport 1002 --dport 49153 -j DROP

With this, I got into a state where *one* client could still read the file from the still-connected brick without error. Interestingly, it was not symmetric; the other client did report EIO, as it should. Xattrs do show pending operations for each other, and "heal info" shows split-brain from both sides.

As I wrote this, the state changed yet again. Now both clients correctly return EIO. This strongly suggests that some state is being cached improperly on the clients, but not infinitely. The plot thickens.
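For anyone trying to reproduce this, the port numbers for rules like the above can be looked up with standard tools (nothing here is specific to this bug): the --dport values are the brick listener ports reported by volume status, and the --sport values are the source ports of the fuse client's established connections to those bricks. A rough sketch:

# Brick listener ports (the --dport values in the rules above):
gluster volume status rep01 | grep /rhgs/bricks/rep01

# Source ports of this node's fuse-client connections to those bricks
# (the --sport values); the client process is named "glusterfs":
ss -tnp | grep glusterfs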
I can consistently reproduce this state now. Just as consistently, it persists until I utter this familiar incantation:

# echo 3 > /proc/sys/vm/drop_caches

As far as I can tell, we don't even *get* the read until we do this. Therefore we can't fail it. Instead, the kernel returns the version that we had written previously. We could prevent that by checking for split-brain on open, but we don't seem to do that. Perhaps this is related to the fact that NFS might not do an open before a read, so the emphasis has been on checking in the read path - which we don't get to in this case. Just a theory. In any case, maybe there are some clues that someone more familiar with AFR can pursue.
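One possible way to test that theory, not something run as part of this report: force a read that bypasses the page cache with O_DIRECT and see whether it fails with EIO while a plain cat still succeeds. Whether O_DIRECT is actually honored may depend on the fuse mount's direct-io settings, so treat this as a sketch only.

# Read through O_DIRECT so the kernel cannot serve it from the page cache;
# if the read-path split-brain check works, this should return EIO.
dd if=/rhgs/client/rep01/file002 iflag=direct of=/dev/null bs=128k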
*** Bug 1220347 has been marked as a duplicate of this bug. ***
(In reply to Jeff Darcy from comment #3)
> I can consistently reproduce this state now. Just as consistently, it
> persists until I utter this familiar incantation:
>
> # echo 3 > /proc/sys/vm/drop_caches

So interestingly, I tried the drop caches a few different ways previously (at different points in the reproducer process), and it didn't help. I'm going to try again and see if maybe I missed something before...
For my original reproducer, if I insert the cache drop where I logically think it should go in step 5:

5. Correct the network split and stat the file from the client:

#!/bin/bash
exe() { echo "\$ $@" ; "$@" ; }
if [ $HOSTNAME == "n1" ]; then
    echo "Correcting network split with iptables..."
    exe iptables -F OUTPUT
    echo "Dropping caches due to BZ 1229226..."
    echo 3 > /proc/sys/vm/drop_caches
    echo "Statting file002 to induce heal..."
    exe stat /rhgs/client/rep01/file002
else
    echo "Wrong host!"
fi

It does _not_ correct the problem. It also doesn't help if I put the cache drop in step 2 just after modifying the file.
(In reply to Dustin Black from comment #6)
> It does _not_ correct the problem.

Never mind; ignore me. Too little sleep... Dropping the caches before reading the file, after the split is resolved, does work. The 'ls' command still completes without error, but a 'cat' results in the expected EIO.
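To summarize the sequence that works for me (a sketch only; paths as in the original reproducer):

# After flushing the iptables rules to resolve the split:
echo 3 > /proc/sys/vm/drop_caches
ls -lh /rhgs/client/rep01/file002            # still completes without error
cat /rhgs/client/rep01/file002 > /dev/null   # now returns EIO, as expected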
This bug is being closed because GlusterFS 3.7 has reached its end of life.

Note: This bug is being closed using a script. No verification has been performed to check whether it still exists in newer releases of GlusterFS. If this bug still exists in a newer GlusterFS release, please reopen it against that release.