I have a script running on one server that atomically updates a file (creating a tmp file and then moving it into place) every 2 seconds, and another server reading the file every 5 seconds. On the server that is reading, I sometimes get an input/output error. The log entries on the client that experiences the error:

[2011-11-19 18:05:23.619352] W [afr-common.c:1121:afr_conflicting_iattrs] 0-testvol-replicate-0: /testfile: gfid differs on subvolume 1 (3089007a-da1c-41ad-a111-d1a988de2420, 50eb7bf4-0516-4508-808c-909ac0f968f6)
[2011-11-19 18:05:23.619391] W [afr-common.c:1121:afr_conflicting_iattrs] 0-testvol-replicate-0: /testfile: gfid differs on subvolume 1 (3089007a-da1c-41ad-a111-d1a988de2420, 50eb7bf4-0516-4508-808c-909ac0f968f6)
[2011-11-19 18:05:23.619413] W [afr-common.c:882:afr_detect_self_heal_by_iatt] 0-testvol-replicate-0: /testfile: gfid different on subvolume
[2011-11-19 18:05:23.619452] I [afr-common.c:1038:afr_launch_self_heal] 0-testvol-replicate-0: background missing-entry self-heal triggered. path: /testfile
[2011-11-19 18:05:23.624027] I [afr-self-heal-common.c:1858:afr_sh_post_nb_entrylk_conflicting_sh_cbk] 0-testvol-replicate-0: Non blocking entrylks failed.
[2011-11-19 18:05:23.624062] I [afr-self-heal-common.c:963:afr_sh_missing_entries_done] 0-testvol-replicate-0: split brain found, aborting selfheal of /testfile
[2011-11-19 18:05:23.624084] E [afr-self-heal-common.c:2074:afr_self_heal_completion_cbk] 0-testvol-replicate-0: background missing-entry self-heal failed on /testfile
[2011-11-19 18:05:23.624108] W [afr-common.c:1121:afr_conflicting_iattrs] 0-testvol-replicate-0: /testfile: gfid differs on subvolume 1 (3089007a-da1c-41ad-a111-d1a988de2420, 50eb7bf4-0516-4508-808c-909ac0f968f6)
[2011-11-19 18:05:23.624133] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 9142: LOOKUP() /testfile => -1 (Input/output error)

To reproduce, use two glusterfs (v3.2.5) servers with the following volume definition:

Volume Name: testvol
Type: Replicate
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 10.104.123.145:/gluster/testvol
Brick2: 10.82.37.136:/gluster/testvol

Run this on one client:

# while true; do touch testfile.tmp; mv testfile.tmp testfile; done

And this script on another client:

# while true; do x=$(<testfile); done

I couldn't get the error to occur when both scripts were run on a single client, or when they were run on the glusterfs servers themselves instead of on separate clients. It also didn't matter whether both clients mounted from the same glusterfs server or one from each.
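In case it helps anyone reproducing this, here is a slightly expanded version of the two loops that timestamps each failed read so it can be lined up against the afr/fuse entries in the client log. This is just a sketch; the /mnt/testvol mount point and the error-log path are placeholders, not part of my actual setup.

Writer (client 1):

cd /mnt/testvol || exit 1
while true; do
    touch testfile.tmp            # create the new file
    mv testfile.tmp testfile      # rename(2) atomically replaces the target
done

Reader (client 2):

cd /mnt/testvol || exit 1
while true; do
    if ! x=$(<testfile); then
        # record when the read failed, to match against the client log
        echo "$(date '+%Y-%m-%d %H:%M:%S') read of testfile failed" >> /tmp/read-errors.log
    fi
done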
I can reproduce this on the 3.2.6p3 release when mounting with the glusterfs client, but mounting the volume with NFS does not trigger the error. This holds whether both clients mount from the same NFS host or from a different NFS host each.

sean
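For anyone wanting to try the NFS workaround, the mounts would look something like the following, using the server addresses from the original report. To the best of my knowledge, gluster's built-in NFS server speaks NFSv3 over TCP, and the 3.2 series does not support NLM, hence the vers=3,tcp,nolock options; treat these as a starting point rather than a definitive recipe.

# client 1, mounting from the first server
mount -t nfs -o vers=3,tcp,nolock 10.104.123.145:/testvol /mnt/testvol

# client 2, mounting from the second server (the same host also works)
mount -t nfs -o vers=3,tcp,nolock 10.82.37.136:/testvol /mnt/testvol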
CHANGE: http://review.gluster.com/3177 (features/locks: Find parent-entrylk presence in lookup) merged in master by Anand Avati (avati)
CHANGE: http://review.gluster.com/3178 (cluster/afr: Build parent loc for expunge) merged in master by Anand Avati (avati)
CHANGE: http://review.gluster.com/3179 (cluster/afr: Handle transient parent-entry xactions in lookup) merged in master by Anand Avati (avati)
CHANGE: http://review.gluster.com/3185 (cluster/afr: Set errno correctly in find_fresh_parents failures) merged in master by Anand Avati (avati)
I am not able to reproduce this issue on the following build:

[root@dhcp201-104 ~]# rpm -qa | grep gluster
glusterfs-devel-3.3.0qa43-1.el6.x86_64
glusterfs-3.3.0qa43-1.el6.x86_64
glusterfs-server-3.3.0qa43-1.el6.x86_64
glusterfs-geo-replication-3.3.0qa43-1.el6.x86_64
glusterfs-debuginfo-3.3.0qa43-1.el6.x86_64
glusterfs-fuse-3.3.0qa43-1.el6.x86_64
glusterfs-rdma-3.3.0qa43-1.el6.x86_64

My setup details are as follows:

dhcp201-104.englab.pnq.redhat.com (gluster server)
dhcp201-175.englab.pnq.redhat.com (host for a brick and gluster client for the volume on the gluster server)
dhcp201-207.englab.pnq.redhat.com (host for a brick and gluster client for the volume on the gluster server)

Commands:

@dhcp201-104.englab.pnq.redhat.com
gluster volume create test3 dhcp201-175.englab.pnq.redhat.com:/vol/test3 dhcp201-207.englab.pnq.redhat.com:/vol/test1
gluster volume start test3

@dhcp201-175.englab.pnq.redhat.com
mount -t glusterfs dhcp201-104.englab.pnq.redhat.com:/test3 /mnt/test1
cd /mnt/test1
while true; do date >> testfile.tmp; sleep 2; mv -f testfile.tmp testfile; done

@dhcp201-207.englab.pnq.redhat.com
mount -t glusterfs dhcp201-104.englab.pnq.redhat.com:/test3 /mnt/test1
cd /mnt/test1
while true; do sleep 5; tail -n 1 testfile; done

Please correct me if I am missing something; otherwise, can I mark this as verified?
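One extra check that may be worth adding to the verification: compare the file's gfid directly on the two bricks, since the original failure was a gfid mismatch between replicas. The two hex values should always be identical. This is only a suggestion; the brick paths below are taken from the volume-create command above.

# on dhcp201-175.englab.pnq.redhat.com
getfattr -n trusted.gfid -e hex /vol/test3/testfile

# on dhcp201-207.englab.pnq.redhat.com
getfattr -n trusted.gfid -e hex /vol/test1/testfile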
Hi,

I stopped testing glusterfs because of this and a few other issues. However, I was only getting the error once every day or two with the sleeps. For the test case, I removed the "sleep" commands and was able to reproduce in a matter of minutes - actually, I think it was seconds. If it still works without the sleeps, then I guess it's fixed.

Regards,
Jason Stubbs
Jason,

I tried it without the sleep commands and I am not able to see the log entries you described. As basic I/O theory goes, when the same file is being written and read at the same time, the writing process takes an exclusive lock over the file and the reading process gets an error. In this case that cannot be avoided, so the reading process's output looks like this:

Thu May 31 05:03:54 EDT 2012
Thu May 31 05:03:54 EDT 2012
tail: cannot open `testfile' for reading: Structure needs cleaning
tail: cannot open `testfile' for reading: Structure needs cleaning
tail: cannot open `testfile' for reading: Structure needs cleaning
tail: cannot open `testfile' for reading: Structure needs cleaning
tail: cannot open `testfile' for reading: Structure needs cleaning

This is expected behaviour. I am marking this bug as verified. If you still have doubts about my steps to reproduce, please correct me.
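If the transient "Structure needs cleaning" failures are too noisy in a test harness, a retry loop separates them from the persistent input/output error this bug was originally about. A minimal sketch, with an arbitrary retry count and the mount path from the steps above:

cd /mnt/test1 || exit 1
while true; do
    ok=0
    for attempt in 1 2 3; do
        if line=$(tail -n 1 testfile 2>/dev/null); then
            echo "$line"
            ok=1
            break
        fi
        sleep 0.1   # give an in-flight rename time to complete
    done
    # only a read that keeps failing across retries is reported,
    # which is what the original split-brain EIO looked like
    [ "$ok" -eq 1 ] || echo "persistent read failure at $(date)" >&2
    sleep 5
done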