Description of problem:
I can consistently reproduce a split-brain state that is never logged and never triggers EIO, leaving the file available for error-free read/write access while split-brained.

Version-Release number of selected component (if applicable):

[root@n2 ~]# gluster --version
glusterfs 3.7.0 built on May 20 2015 13:30:05
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.

[root@n2 ~]# rpm -qa | grep gluster
glusterfs-libs-3.7.0-2.el7.x86_64
glusterfs-cli-3.7.0-2.el7.x86_64
glusterfs-3.7.0-2.el7.x86_64
glusterfs-fuse-3.7.0-2.el7.x86_64
glusterfs-client-xlators-3.7.0-2.el7.x86_64
glusterfs-server-3.7.0-2.el7.x86_64
glusterfs-api-3.7.0-2.el7.x86_64
glusterfs-geo-replication-3.7.0-2.el7.x86_64

How reproducible:
Consistently

Steps to Reproduce:

1. Create a test file:

[root@n1 ~]# dd if=/dev/urandom of=/rhgs/client/rep01/file002 bs=1k count=1k

2. Confirm that the file hashes to the bricks on n1 and n2:

[root@n1 ~]# ls -lh /rhgs/bricks/rep01/file002
-rw-r--r-- 2 root root 22 Jun 3 12:18 /rhgs/bricks/rep01/file002

[root@n2 ~]# ls -lh /rhgs/bricks/rep01/file002
-rw-r--r-- 2 root root 22 Jun 3 12:18 /rhgs/bricks/rep01/file002

3. Induce a network split by using iptables to drop all packets from n1 to n2, then append data to the test file from n1:

#!/bin/bash
exe() { echo "\$ $@" ; "$@" ; }
if [ $HOSTNAME == "n1" ]; then
    echo "Inducing network split with iptables..."
    exe iptables -F
    exe iptables -A OUTPUT -d n2 -j DROP
    echo "Adding 1MB of random data to file002..."
    exe dd if=/dev/urandom bs=1k count=1k >> /rhgs/client/rep01/file002
    echo "Generating md5sum for file002..."
    exe md5sum /rhgs/client/rep01/file002
else
    echo "Wrong host!"
fi

4. Append data to the test file from n2:

#!/bin/bash
exe() { echo "\$ $@" ; "$@" ; }
if [ $HOSTNAME == "n2" ]; then
    echo "Adding 2MB of random data to file002..."
    exe dd if=/dev/urandom bs=1k count=2k >> /rhgs/client/rep01/file002
    echo "Generating md5sum for file002..."
    exe md5sum /rhgs/client/rep01/file002
else
    echo "Wrong host!"
fi

5. Correct the network split and stat the file from the client:

#!/bin/bash
exe() { echo "\$ $@" ; "$@" ; }
if [ $HOSTNAME == "n1" ]; then
    echo "Correcting network split with iptables..."
    exe iptables -F OUTPUT
    echo "Statting file002 to induce heal..."
    exe stat /rhgs/client/rep01/file002
else
    echo "Wrong host!"
fi

6. Cat the file (this should result in EIO, but does not):

[root@n1 ~]# cat /rhgs/client/rep01/file002 > /dev/null

7. Add new data to the file from n1:

[root@n1 ~]# dd if=/dev/urandom bs=1k count=1k >> /rhgs/client/rep01/file002
1024+0 records in
1024+0 records out
1048576 bytes (1.0 MB) copied, 0.138334 s, 7.6 MB/s

8. Look for the expected split-brain errors in the gluster logs (nothing is returned):

[root@n1 ~]# grep -i split /var/log/glusterfs/{*,bricks/*} 2>/dev/null
[root@n2 ~]# grep -i split /var/log/glusterfs/{*,bricks/*} 2>/dev/null

9. Confirm that the brick copies differ and that each copy considers itself WISE:

[root@n1 ~]# md5sum /rhgs/bricks/rep01/file002
d70a816aab125567c185bc047f4358b0  /rhgs/bricks/rep01/file002

[root@n1 ~]# getfattr -d -m . -e hex /rhgs/bricks/rep01/file002
getfattr: Removing leading '/' from absolute path names
# file: rhgs/bricks/rep01/file002
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.rep01-client-0=0x000000000000000000000000
trusted.afr.rep01-client-1=0x000000910000000000000000
trusted.bit-rot.version=0x0200000000000000556dd9df000a770c
trusted.gfid=0x8740772d4f204ce183f010a80e76015c

[root@n2 ~]# md5sum /rhgs/bricks/rep01/file002
bcb17a86bf54db36fa874030fde8da4b  /rhgs/bricks/rep01/file002

[root@n2 ~]# getfattr -d -m . -e hex /rhgs/bricks/rep01/file002
getfattr: Removing leading '/' from absolute path names
# file: rhgs/bricks/rep01/file002
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.rep01-client-0=0x000000310000000000000000
trusted.afr.rep01-client-1=0x000000000000000000000000
trusted.bit-rot.version=0x0200000000000000556dd9de000db404
trusted.gfid=0x8740772d4f204ce183f010a80e76015c

Actual results:
The split file can be stat'd, ls'd, and cat'd from the client without error.

Expected results:
File operations on the split file should result in EIO.

Additional info:

Topology for volume rep01:

Distribute set
  |
  +-- Replica set 0
  |     |
  |     +-- Brick 0: n1:/rhgs/bricks/rep01
  |     |
  |     +-- Brick 1: n2:/rhgs/bricks/rep01
  |
  +-- Replica set 1
        |
        +-- Brick 0: n3:/rhgs/bricks/rep01
        |
        +-- Brick 1: n4:/rhgs/bricks/rep01

[root@n1 ~]# gluster volume info rep01

Volume Name: rep01
Type: Distributed-Replicate
Volume ID: 6ff17d21-035d-47e7-8bd1-d4a9e850be31
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: n1:/rhgs/bricks/rep01
Brick2: n2:/rhgs/bricks/rep01
Brick3: n3:/rhgs/bricks/rep01
Brick4: n4:/rhgs/bricks/rep01
Options Reconfigured:
performance.readdir-ahead: on

Client mounts on n1 and n2:

[root@n1 ~]# grep client /etc/fstab
n1:rep01    /rhgs/client/rep01    glusterfs    _netdev    0 0
[root@n1 ~]# mount | grep client
n1:rep01 on /rhgs/client/rep01 type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

[root@n2 ~]# grep client /etc/fstab
n1:rep01    /rhgs/client/rep01    glusterfs    _netdev    0 0
[root@n2 ~]# mount | grep client
n1:rep01 on /rhgs/client/rep01 type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
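For clarity, the xattrs in step 9 are what make this a split-brain: on n1, trusted.afr.rep01-client-1 carries a non-zero data-pending counter (0x91) blaming n2's brick, while on n2, trusted.afr.rep01-client-0 (0x31) blames n1's brick, and neither blames itself. Below is a minimal sketch for decoding those counters, assuming the usual AFR changelog layout of three 4-byte big-endian counters (data, metadata, entry); the variable names are only for illustration.

#!/bin/bash
# Sketch: decode the data-pending counter (first 4 bytes) of each AFR
# changelog xattr on this node's brick copy of the file. Run on n1 and n2.
BRICK_FILE=/rhgs/bricks/rep01/file002
for attr in trusted.afr.rep01-client-0 trusted.afr.rep01-client-1; do
    # Pull the hex value, e.g. 0x000000910000000000000000
    val=$(getfattr -n "$attr" -e hex "$BRICK_FILE" 2>/dev/null | awk -F= '/=0x/ {print $2}')
    # The first 8 hex digits after "0x" are the data-pending counter
    echo "$attr data-pending: $((16#${val:2:8}))"
done

A non-zero data-pending count for the *other* brick on both copies, with matching gfids and differing md5sums, is exactly the "each side considers itself WISE" condition described above.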
FWIW, I tried a bunch of iptables tricks and couldn't find a way to reproduce this on a single node. It does seem specific to a two-node (or at least two-glusterd) configuration.
OK, I lied. Previously, I had been cutting off access to each brick sequentially, with client unmounts and remounts in between. This time, I mounted twice simultaneously, and cut off each client's connection to one brick. Something like this (your port numbers may vary):

> iptables -t mangle -I OUTPUT -p tcp --sport 1020 --dport 49152 -j DROP
> iptables -t mangle -I OUTPUT -p tcp --sport 1002 --dport 49153 -j DROP

With this, I got into a state where *one* client could still read the file from the still-connected brick without error. Interestingly, it was not symmetric; the other client did report EIO, as it should. Xattrs do show pending operations for each other, and "heal info" shows split-brain from both sides.

As I wrote this, the state changed yet again. Now both clients correctly return EIO. This strongly suggests that some state is being cached improperly on the clients, but not infinitely. The plot thickens.
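For anyone trying to reproduce this, the port numbers for rules like the above can be looked up with standard tools (nothing here is specific to this bug): the --dport values are the brick listener ports reported by volume status, and the --sport values are the source ports of the fuse client's established connections to those bricks. A rough sketch:

# Brick listener ports (the --dport values in the rules above):
gluster volume status rep01 | grep /rhgs/bricks/rep01

# Source ports of this node's fuse-client connections to those bricks
# (the --sport values); the client process is named "glusterfs":
ss -tnp | grep glusterfs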
I can consistently reproduce this state now. Just as consistently, it persists until I utter this familiar incantation:

# echo 3 > /proc/sys/vm/drop_caches

As far as I can tell, we don't even *get* the read until we do this. Therefore we can't fail it. Instead, the kernel returns the version that we had written previously. We could prevent that by checking for split-brain on open, but we don't seem to do that. Perhaps this is related to the fact that NFS might not do an open before a read, so the emphasis has been on checking in the read path - which we don't get to in this case. Just a theory. In any case, maybe there are some clues that someone more familiar with AFR can pursue.
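One possible way to test that theory, not something run as part of this report: force a read that bypasses the page cache with O_DIRECT and see whether it fails with EIO while a plain cat still succeeds. Whether O_DIRECT is actually honored may depend on the fuse mount's direct-io settings, so treat this as a sketch only.

# Read through O_DIRECT so the kernel cannot serve it from the page cache;
# if the read-path split-brain check works, this should return EIO.
dd if=/rhgs/client/rep01/file002 iflag=direct of=/dev/null bs=128k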
*** Bug 1220347 has been marked as a duplicate of this bug. ***
(In reply to Jeff Darcy from comment #3)
> I can consistently reproduce this state now. Just as consistently, it
> persists until I utter this familiar incantation:
>
> # echo 3 > /proc/sys/vm/drop_caches

So interestingly, I tried the drop caches a few different ways previously (at different points in the reproducer process), and it didn't help. I'm going to try again and see if maybe I missed something before...
For my original reproducer, if I insert the cache drop where I logically think it should go in step 5:

5. Correct the network split and stat the file from the client:

#!/bin/bash
exe() { echo "\$ $@" ; "$@" ; }
if [ $HOSTNAME == "n1" ]; then
    echo "Correcting network split with iptables..."
    exe iptables -F OUTPUT
    echo "Dropping caches due to BZ 1229226..."
    echo 3 > /proc/sys/vm/drop_caches
    echo "Statting file002 to induce heal..."
    exe stat /rhgs/client/rep01/file002
else
    echo "Wrong host!"
fi

It does _not_ correct the problem. It also doesn't help if I put the cache drop in step 2 just after modifying the file.
(In reply to Dustin Black from comment #6)
> It does _not_ correct the problem.

Never mind; ignore me. Too little sleep... Dropping the caches before reading the file, after the split is resolved, does work. The 'ls' command still completes without error, but a 'cat' results in the expected EIO.
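To summarize the sequence that works for me (a sketch only; paths as in the original reproducer):

# After flushing the iptables rules to resolve the split:
echo 3 > /proc/sys/vm/drop_caches
ls -lh /rhgs/client/rep01/file002            # still completes without error
cat /rhgs/client/rep01/file002 > /dev/null   # now returns EIO, as expected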
This bug is being closed because GlusterFS 3.7 has reached its end of life.

Note: This bug is being closed using a script. No verification has been performed to check whether it still exists in newer releases of GlusterFS. If this bug still exists in a newer GlusterFS release, please reopen it against that release.