+++ This bug was initially created as a clone of Bug #1179169 +++ +++ This bug was initially created as a clone of Bug #1104618 +++ Description of problem: A script that includes a tar command on a gluster volume directory is failing with the error "file changed as we read it". There is no active I/O to the gluster volume when the tar command is issued. All previous writes were sync writes. Seeing this fairly consistently on a replica 2 volume; not yet seen on a distribute volume. Version-Release number of selected component (if applicable): Client and Servers are running glusterfs*-3.6.0.11-1.el6rhs.x86_64 How reproducible: Quite consistently. Some runs with only a few small files did not show the error message. But have been seeing it every time I have runs with the steps below. Steps to Reproduce: 1. create a replicate or distribute-replicate volume. Native mount on a single client. 2. mkdir /mnt/glustervol/dir_1 (where /mnt/glustervol is the mount point for gluster volume) 3. for ii in `seq 5`; do dd bs=1024k count=1024 if=/dev/zero of=/mnt/glustervol/dir_1/f_${ii} conv=fsync done 4. tar -cf /home/tar_compare.tar /mnt/glustervol/dir_1 (/home is a separate local file system) Actual results: tar: Removing leading `/' from member names tar: /mnt/glustervol/dir_1/f_1: file changed as we read it Expected results: In this test scenario, tar should not give the "file changed as we read it" message. Additional info: --- Additional comment from Krutika Dhananjay on 2014-11-18 02:00:22 EST --- Was able to recreate the bug and debug it. I used the same steps as the one described above on a plain replicate except that i created 3 files (as opposed to 5) and the following are my observations: I attached the tar program to strace and the output tells me that tar does an fstat() before and after archiving a given file: fstat(5</mnt/dir_1/f_1>, {st_dev=makedev(0, 35), st_ino=12126758327912018215, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=131072, st_blocks=524288, st_size=268435456, st_atime=2014/11/18-11:54:23, st_mtime=2014/11/18-11:54:46, st_ctime=2014/11/18-11:54:49}) = 0 <============= fstat #1 read(5</mnt/dir_1/f_1>, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 9216) = 9216 write(3</home/compare.tar>, "mnt/dir_1/\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 10240) = 10240 read(5</mnt/dir_1/f_1>, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 10240) = 10240 write(3</home/compare.tar>, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 10240) = 10240 read(5</mnt/dir_1/f_1>, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 10240) = 10240 ... ... ... write(3</home/compare.tar>, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 10240) = 10240 read(5</mnt/dir_1/f_1>, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 10240) = 10240 write(3</home/compare.tar>, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 10240) = 10240 fstat(5</mnt/dir_1/f_1>, {st_dev=makedev(0, 35), st_ino=12126758327912018215, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=131072, st_blocks=524288, st_size=268435456, st_atime=2014/11/18-11:56:39, st_mtime=2014/11/18-11:54:47, st_ctime=2014/11/18-11:54:50}) = 0 <======= fstat #2 write(2</dev/pts/1>, "tar: ", 5) = 5 write(2</dev/pts/1>, "/mnt/dir_1/f_1: file changed as "..., 42) = 42 write(2</dev/pts/1>, "\n", 1) = 1 In case of /dir_1/f_1, the ctime of the file before and after the archival was found by tar to be different, causing it to think f_1's attributes changed while it was being processed. Initial suspicion was on client-side caching xlators like io-cache, and stat-prefetch. Disabling the two translators still caused the bug to be hit. Hence the possibility of client side caching leading up to this bug was eliminated. Initial suspicion was on afr, that the read child of "/dir_1/f_1" in AFR could have possibly changed between the first and the second fstat (every file has a designated read_child in AFR on which AFR performs read-only FOPS like getxattr(), readv(), stat(), fstat(), readdirp() etc). Adding more logs and trying again revealed that the read child of "/dir_1/f_1" never changed between the two fstat()s. The possibility of a bug in AFRs's read_child selection was eliminated. Of the 3 files under /dir_1, the file-changed-as-we-read-it error occurred only with f_1. From the added logs, the read children of /dir_1, /dir_1/f_1, /dir_1/f_2 and /dir_1/f_3 were found to be the following: /dir_1 : 0 /dir_1/f_1 : 1 /dir_1/f_2 : 0 /dir_1/f-3 : 0 Therefore, on /dir_1, readdirp() was wound to child-0 of AFR, and the directory entries and their attributes from brick-0 were cached by the fuse kernel. Also, the first of the two fstat()s was never called on user space (glusterfs fuse mount), and was possibly served up by fuse kernel using the cached values from readdirp() on brick-0. Then at the end, when the second fstat() was called (which made it into user space this time) on /dir_1/f_1, the attributes were being served up from the actual read child of f_1, which is child-1. When the tar program compared the output of the two fstat()s, it found the ctime to have changed and threw this error. I tried the same experiment, by mounting the volume with --user-readdirp=no option, and this warning was not seen on the mount. Also the logs indicated that an explicit fstat() was passed down to the user space before tar-ing f_1. Will try the same experiment with plain distribute, 2x2 dis-rep volume, and on nfs mount and update the bug if there is anything more to it. --- Additional comment from Manoj Pillai on 2014-11-18 14:04:26 EST --- (In reply to Krutika Dhananjay from comment #2) Nice Analysis! > > I tried the same experiment, by mounting the volume with --user-readdirp=no > option, and this warning was not seen on the mount. Also the logs indicated > that an explicit fstat() was passed down to the user space before tar-ing > f_1. > > Will try the same experiment with plain distribute, 2x2 dis-rep volume, and > on nfs mount and update the bug if there is anything more to it. I think this is worse than a warning because tar is returning an exit code that is interpreted as an error. Is setting cluster.read-subvolume-index also expected to make the problem go away? --- Additional comment from Krutika Dhananjay on 2014-11-19 02:02:41 EST --- (In reply to Manoj Pillai from comment #3) > (In reply to Krutika Dhananjay from comment #2) > > Nice Analysis! > > > > > I tried the same experiment, by mounting the volume with --user-readdirp=no > > option, and this warning was not seen on the mount. Also the logs indicated > > that an explicit fstat() was passed down to the user space before tar-ing > > f_1. > > > > Will try the same experiment with plain distribute, 2x2 dis-rep volume, and > > on nfs mount and update the bug if there is anything more to it. > > I think this is worse than a warning because tar is returning an exit code > that is interpreted as an error. > > Is setting cluster.read-subvolume-index also expected to make the problem go > away? Interesting. That never crossed my mind! I am guessing it will work fine ONLY for this case, where we don't bring down any child of AFR during the course of this test. However, let me dig into the code a bit more and I'll post whatever I find here. I am actually now able to recreate this bug even with --use-readdirp=no option. It happened 2/3 times. There is definitely more to this bug than whatever I said in comment #2. --- Additional comment from Krutika Dhananjay on 2014-11-20 02:13:56 EST --- > > I am actually now able to recreate this bug even with --use-readdirp=no > option. It happened 2/3 times. There is definitely more to this bug than > whatever I said in comment #2. So I hand-wrote the volfile for the fuse client to load trace xlator between md-cache and quick-read and tried this again on a plain replicate volume mounted with use-readdirp=no <volfile-snippet> ... ... volume rep-quick-read type performance/quick-read subvolumes rep-io-cache end-volume volume rep-trace type debug/trace subvolumes rep-quick-read option log-file on end-volume volume rep-md-cache type performance/md-cache subvolumes rep-trace end-volume ... ... </volfile-snippet> What I found was that even with readdirp disabled, trace xlator logged occurrences of readdirp(): [2014-11-20 07:00:00.402564] I [trace.c:2530:trace_opendir] 0-rep-trace: 6940: gfid=552c56e2-87e4-4334-bf50-f9df4e473631 path=/dir_1 fd=0x9cdb5c [2014-11-20 07:00:00.462940] I [trace.c:848:trace_opendir_cbk] 0-rep-trace: 6940: gfid=552c56e2-87e4-4334-bf50-f9df4e473631 op_ret=0, op_errno=117, fd=0x9cdb5c [2014-11-20 07:00:00.463454] I [trace.c:2562:trace_readdirp] 0-rep-trace: 6941: gfid=552c56e2-87e4-4334-bf50-f9df4e473631 fd=0x9cdb5c, size=4096, offset=0 dict=0x7fd164e6edf8 [2014-11-20 07:00:00.464761] I [trace.c:319:trace_readdirp_cbk] 0-rep-trace: 6941 : gfid=552c56e2-87e4-4334-bf50-f9df4e473631 op_ret=5, op_errno=0 [2014-11-20 07:00:00.468683] I [trace.c:2562:trace_readdirp] 0-rep-trace: 6942: gfid=552c56e2-87e4-4334-bf50-f9df4e473631 fd=0x9cdb5c, size=4096, offset=512 dict=0x7fd164e6e998 [2014-11-20 07:00:00.474459] I [trace.c:319:trace_readdirp_cbk] 0-rep-trace: 6942 : gfid=552c56e2-87e4-4334-bf50-f9df4e473631 op_ret=0, op_errno=0 [2014-11-20 07:00:00.477221] I [md-cache.c:907:mdc_stat] (-->/usr/local/lib/glusterfs/3.6.0.32/xlator/meta.so(meta_stat+0x55) [0x7fd156603569] (-->/usr/local/lib/libglusterfs.so.0(default_stat+0x97) [0x7fd166678e28] (-->/usr/local/lib/glusterfs/3.6.0.32/xlator/debug/io-stats.so(io_stats_stat+0x2a1) [0x7fd156824d1c]))) 0-MDC_STAT: 552c56e2-87e4-4334-bf50-f9df4e473631: serving up cached data md-cache converts readdir()s to readdirp()s internally to cache the entries' metadata unless the option performance.force-readdirp is set to "off" (thanks to Raghavendra G for this piece of information). And what came after the readdir() from fuse kernel was a stat() (see logs above) which was served up by md-cache based on the attributes it had cached from previous readdir(), in which case the overall RCA posted in comment #2 is still valid. I tried this again, with --use-readdirp=no and force-readdirp set to off. This time the bug was not hit. --- Additional comment from Krutika Dhananjay on 2014-11-20 07:41:19 EST --- (In reply to Manoj Pillai from comment #3) > (In reply to Krutika Dhananjay from comment #2) > > Nice Analysis! > > > > > I tried the same experiment, by mounting the volume with --user-readdirp=no > > option, and this warning was not seen on the mount. Also the logs indicated > > that an explicit fstat() was passed down to the user space before tar-ing > > f_1. > > > > Will try the same experiment with plain distribute, 2x2 dis-rep volume, and > > on nfs mount and update the bug if there is anything more to it. > > I think this is worse than a warning because tar is returning an exit code > that is interpreted as an error. > > Is setting cluster.read-subvolume-index also expected to make the problem go > away? The problem with read-subvolume-index configured to one single sub-volume is that the chosen subvolume will be the read child only as long as it has good copies of the files/dirs. And in the event that it is down, AFR will necessarily have to pick its other subvol(s). Also, I got to know from Pranith that in the event of a split brain, the read child will unconditionally be chosen as the source during self-heal (when ideally it should be failing with EIO). And that can lead to loss of data. --- Additional comment from Anand Avati on 2015-01-08 07:26:56 EST --- REVIEW: http://review.gluster.org/9418 (cluster/afr: Fix perceived change in file due to difference in [mc]time across replicas) posted (#1) for review on master by Krutika Dhananjay (kdhananj) --- Additional comment from Anand Avati on 2015-01-16 04:23:27 EST --- REVIEW: http://review.gluster.org/9457 (cluster/dht: Fix incorrect updates to parent times) posted (#1) for review on master by Krutika Dhananjay (kdhananj) --- Additional comment from Anand Avati on 2015-01-19 06:12:33 EST --- COMMIT: http://review.gluster.org/9457 committed in master by Raghavendra G (rgowdapp) ------ commit b45f623a7a8e14ca09a10c6a04c4c5f81e3a62e2 Author: Krutika Dhananjay <kdhananj> Date: Fri Jan 16 14:26:45 2015 +0530 cluster/dht: Fix incorrect updates to parent times In directory write FOPs, as far as updates to timestamps associated with parent by DHT is concerned, there are three possibilities: a) time (in sec) gotten from child of DHT < time (in sec) in inode ctx b) time (in sec) gotten from child of DHT = time (in sec) in inode ctx c) time (in sec) gotten from child of DHT > time (in sec) in inode ctx In case (c), for time in nsecs, it is the value returned by DHT's child that must be selected. But what DHT_UPDATE_TIME ends up doing is to choose the maximum of (time in nsec gotten from DHT's child, time in nsec in inode ctx). Change-Id: I535a600b9f89b8d9b6714a73476e63ce60e169a8 BUG: 1179169 Signed-off-by: Krutika Dhananjay <kdhananj> Reviewed-on: http://review.gluster.org/9457 Reviewed-by: Pranith Kumar Karampuri <pkarampu> Tested-by: Gluster Build System <jenkins.com> Reviewed-by: Raghavendra G <rgowdapp> Tested-by: Raghavendra G <rgowdapp> --- Additional comment from Anand Avati on 2015-01-20 05:13:27 EST --- REVIEW: http://review.gluster.org/9465 (cluster/dht: In MKDIR(), aggregate iatts from non-hashed subvols too) posted (#1) for review on master by Krutika Dhananjay (kdhananj) --- Additional comment from Anand Avati on 2015-01-22 05:06:19 EST --- REVIEW: http://review.gluster.org/9476 (performance/md-cache: Initialise local->loc before winding nameless lookup) posted (#1) for review on master by Krutika Dhananjay (kdhananj) --- Additional comment from Anand Avati on 2015-01-22 05:06:38 EST --- REVIEW: http://review.gluster.org/9477 (cluster/afr: When parent and entry read subvols are different, set entry->inode to NULL) posted (#1) for review on master by Krutika Dhananjay (kdhananj) --- Additional comment from Anand Avati on 2015-01-23 01:55:13 EST --- REVIEW: http://review.gluster.org/9476 (performance/md-cache: Initialise local->loc before winding nameless lookup) posted (#2) for review on master by Krutika Dhananjay (kdhananj) --- Additional comment from Anand Avati on 2015-01-24 00:37:48 EST --- COMMIT: http://review.gluster.org/9476 committed in master by Raghavendra G (rgowdapp) ------ commit 304cfeea0106ee7a9e1786637bab403f72c96155 Author: Krutika Dhananjay <kdhananj> Date: Thu Jan 22 12:57:10 2015 +0530 performance/md-cache: Initialise local->loc before winding nameless lookup That way, in the cbk, the latest values are updated in the cache. Change-Id: Ia149e352e4763e8f5b910a3f4cb64d2dda4534b1 BUG: 1179169 Signed-off-by: Krutika Dhananjay <kdhananj> Reviewed-on: http://review.gluster.org/9476 Tested-by: Gluster Build System <jenkins.com> Reviewed-by: Raghavendra G <rgowdapp> Tested-by: Raghavendra G <rgowdapp> --- Additional comment from Anand Avati on 2015-01-24 00:43:18 EST --- COMMIT: http://review.gluster.org/9465 committed in master by Raghavendra G (rgowdapp) ------ commit 6c10f6db7c53e96881b29ca70dfc08b1ffb231b3 Author: Krutika Dhananjay <kdhananj> Date: Tue Jan 20 14:58:34 2015 +0530 cluster/dht: In MKDIR(), aggregate iatts from non-hashed subvols too PROBLEM: Gathering iatts from ONLY the hashed subvol during MKDIR and unwinding them can cause md-cache to cache and serve these values for a while to the application. And then, at a later point of time, when a LOOKUP on either the dir or its parent gathers attributes from all subvolumes of dht and things are evened out as part of DHT_UPDATE_TIME, the application could be getting a different set of [cm]times (i.e., one of the non-hashed subvolumes' times could be selected by virtue of having the highest values), causing it to think the directory underwent modification even when it might not have. The effect of this bug becomes apparent in programs like tar, which rely on the ctime of the files before and after archiving a file to ascertain that the file remained unchanged during this time. FIX: Aggregate iatts from ALL sub-volumes of DHT during MKDIR. Change-Id: I04c4ca3e3b9552772e2b089be680f8afeb72089e BUG: 1179169 Signed-off-by: Krutika Dhananjay <kdhananj> Reviewed-on: http://review.gluster.org/9465 Tested-by: Gluster Build System <jenkins.com> Reviewed-by: Raghavendra G <rgowdapp> Tested-by: Raghavendra G <rgowdapp> --- Additional comment from Anand Avati on 2015-01-24 03:16:12 EST --- REVIEW: http://review.gluster.org/9477 (cluster/afr: When parent and entry read subvols are different, set entry->inode to NULL) posted (#2) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: http://review.gluster.org/9496 (cluster/dht: Fix incorrect updates to parent times) posted (#1) for review on release-3.5 by Krutika Dhananjay (kdhananj)
REVIEW: http://review.gluster.org/9497 (performance/md-cache: Initialise local->loc before winding nameless lookup) posted (#1) for review on release-3.5 by Krutika Dhananjay (kdhananj)
COMMIT: http://review.gluster.org/9497 committed in release-3.5 by Niels de Vos (ndevos) ------ commit c0419befa9d0d470a921863cd700d778ce5da194 Author: Krutika Dhananjay <kdhananj> Date: Thu Jan 22 12:57:10 2015 +0530 performance/md-cache: Initialise local->loc before winding nameless lookup Backport of: http://review.gluster.org/9476 That way, in the cbk, the latest values are updated in the cache. Change-Id: I3f8701d7043f06f512364e20c4456f254776cba7 BUG: 1186121 Signed-off-by: Krutika Dhananjay <kdhananj> Reviewed-on: http://review.gluster.org/9497 Tested-by: Gluster Build System <jenkins.com> Reviewed-by: Niels de Vos <ndevos>
REVIEW: http://review.gluster.org/9504 (cluster/afr: When parent and entry read subvols are different, set entry->inode to NULL) posted (#1) for review on release-3.5 by Krutika Dhananjay (kdhananj)
REVIEW: http://review.gluster.org/9504 (cluster/afr: When parent and entry read subvols are different, set entry->inode to NULL) posted (#2) for review on release-3.5 by Krutika Dhananjay (kdhananj)
REVIEW: http://review.gluster.org/9504 (cluster/afr: When parent and entry read subvols are different, set entry->inode to NULL) posted (#3) for review on release-3.5 by Krutika Dhananjay (kdhananj)
COMMIT: http://review.gluster.org/9504 committed in release-3.5 by Niels de Vos (ndevos) ------ commit bb2df4e63fa8a5d65f18b4a5efc757e8d475fbff Author: Krutika Dhananjay <kdhananj> Date: Thu Jan 22 17:02:20 2015 +0530 cluster/afr: When parent and entry read subvols are different, set entry->inode to NULL Backport of: http://review.gluster.org/#/c/9477 That way a lookup would be forced on the entry, and its attributes will always be selected from its read subvol. Additionally, directory write fops as well as LOOKUP have been made to unwind parent attributes from parent's read child in AFR. Change-Id: I9fca49fa91cc3a65f53db855fedb90b08f1ca7f4 BUG: 1186121 Signed-off-by: Krutika Dhananjay <kdhananj> Reviewed-on: http://review.gluster.org/9504 Tested-by: Gluster Build System <jenkins.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu> Tested-by: Pranith Kumar Karampuri <pkarampu>
COMMIT: http://review.gluster.org/9496 committed in release-3.5 by Niels de Vos (ndevos) ------ commit 4b3981d2c9b4cafa24398765b3d0c8a4d45d43b1 Author: Krutika Dhananjay <kdhananj> Date: Fri Jan 16 14:26:45 2015 +0530 cluster/dht: Fix incorrect updates to parent times Backport of: http://review.gluster.org/9457 In directory write FOPs, as far as updates to timestamps associated with parent by DHT is concerned, there are three possibilities: a) time (in sec) gotten from child of DHT < time (in sec) in inode ctx b) time (in sec) gotten from child of DHT = time (in sec) in inode ctx c) time (in sec) gotten from child of DHT > time (in sec) in inode ctx In case (c), for time in nsecs, it is the value returned by DHT's child that must be selected. But what DHT_UPDATE_TIME ends up doing is to choose the maximum of (time in nsec gotten from DHT's child, time in nsec in inode ctx). Change-Id: I1388e374c8a2029f3b8919380e68620e7591bde6 BUG: 1186121 Signed-off-by: Krutika Dhananjay <kdhananj> Reviewed-on: http://review.gluster.org/9496 Tested-by: Gluster Build System <jenkins.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu> Reviewed-by: Shyamsundar Ranganathan <srangana> Reviewed-by: Niels de Vos <ndevos>
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.5.4, please reopen this bug report. glusterfs-3.5.4 has been announced on the Gluster Packaging mailinglist [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution. [1] http://thread.gmane.org/gmane.comp.file-systems.gluster.packaging/2 [2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user