[Migrated from savannah BTS] - bug 26470 [https://savannah.nongnu.org/bugs/index.php?26470]

Wed 06 May 2009 04:22:42 PM GMT, original submission by Erick Tryzelaar <erickt>:

Good morning! I found another bug where stripe and AFR (replicate) do not work correctly together. We are testing a cluster configured as a mirror of a stripe set. One of the machines in a stripe was down, but because of the mirror we were still able to read and write a file. However, listing the directory did not show the file:

  > ls /mnt/glusterfs
  > echo > /mnt/glusterfs/foo
  > ls /mnt/glusterfs
  > ls /mnt/glusterfs/foo
  /mnt/glusterfs/foo

Here's our client.vol:

  volume machine1
    type protocol/client
    option transport-type tcp
    option remote-host machine1
    option remote-subvolume locks
  end-volume

  volume machine2
    type protocol/client
    option transport-type tcp
    option remote-host machine2
    option remote-subvolume locks
  end-volume

  volume stripe1
    type cluster/stripe
    subvolumes machine1 machine2
  end-volume

  volume machine3
    type protocol/client
    option transport-type tcp
    option remote-host machine3
    option remote-subvolume locks
  end-volume

  volume machine4
    type protocol/client
    option transport-type tcp
    option remote-host machine4
    option remote-subvolume locks
  end-volume

  volume stripe2
    type cluster/stripe
    subvolumes machine3 machine4
  end-volume

  volume replicate
    type cluster/replicate
    subvolumes stripe1 stripe2
  end-volume

In this case, machine2 is down but machine1 is up. What I suspect is happening: for reads and writes, gluster first goes to the replicate volume and tries to read/write through stripe1, but since machine2 is down, stripe1 is broken, so gluster falls back to stripe2. For directory entries, however, it goes to stripe1 because machine1 is still up. machine1 never received the file, because its stripe is broken, so it returns incorrect information.

To test this theory, I killed the glusterfsd running on machine1, and the listing worked:

  > ls /mnt/glusterfs
  foo
  > ls /mnt/glusterfs/foo
  /mnt/glusterfs/foo

Bringing glusterfsd back up on machine1 reintroduces the problem:

  > ls /mnt/glusterfs
  > ls /mnt/glusterfs/foo
  /mnt/glusterfs/foo

On a side note, since I'm not sure whether this is related: I can't touch the file:

  > touch /mnt/glusterfs/foo
  touch: setting times of `/mnt/glusterfs/foo': Transport endpoint is not connected

unless I kill the glusterfsd on machine1:

  > touch /mnt/glusterfs/foo
  >

I suspect this is because gluster writes the metadata only to the first subvolume listed in replicate, and it gets confused when the stripe is not completely up.

--------------------------------------------------------------------------------

Fri 08 May 2009 07:04:33 PM GMT, comment #1 by Amar Tumballi <amarts>:

Erick,

Your observations are correct. Some cleanup is happening in the stripe code, where these issues should be fixed. Keeping the bug open.
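[Editorial note] The submitter's theory can also be checked at the brick level by listing the backend export directories directly. The sketch below assumes each server exports a path such as /data/export and is reachable over ssh; the actual path depends on the server-side storage/posix volumes, which are not shown in this report.

  # Hypothetical check; /data/export stands in for whatever path the
  # servers' storage/posix volumes actually export.
  ssh machine1 ls /data/export   # expect no 'foo': stripe1 never received the create
  ssh machine2 ls /data/export   # machine2 is down, so this should fail to connect
  ssh machine3 ls /data/export   # expect 'foo': replicate fell back to stripe2
  ssh machine4 ls /data/export   # expect 'foo': stripe2 creates the file on both bricks

If foo appears only on machine3 and machine4, that matches the theory that the write landed on stripe2 while readdir was served from the partially up stripe1.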
--------------------------------------------------------------------------------

Tested with 2.0.6rc4: the "Transport endpoint is not connected" error no longer occurs.
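[Editorial note] The failure mode in this report is specific to striping first and mirroring second: losing one machine breaks an entire stripe subvolume. For comparison, a minimal sketch of the inverse layout, reusing the same four protocol/client volumes from the report, mirrors each brick first and then stripes across the mirrored pairs. This is an illustrative sketch, not a configuration from the report.

  volume replicate1
    type cluster/replicate
    subvolumes machine1 machine3
  end-volume

  volume replicate2
    type cluster/replicate
    subvolumes machine2 machine4
  end-volume

  volume stripe
    type cluster/stripe
    subvolumes replicate1 replicate2
  end-volume

With this arrangement, machine2 going down leaves replicate2 degraded but still readable and writable, so the stripe as a whole stays intact and directory listings see the file on every subvolume.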