Description of problem:
While tarring up a large directory (274 GB) that resides on a Gluster 3.4.2 volume (native FUSE mount), I kept seeing sporadic "tar: ... file changed as we read it" messages scroll by. This happened maybe once every 50 files. There are no errors in the gluster mount log. I've verified via lsof that nothing else is accessing these files, so I'm a bit puzzled. The tar process completed with an exit (return) value of 1.

Version-Release number of selected component (if applicable):
Client:
glusterfs-libs-3.4.2-1.el6.x86_64
glusterfs-3.4.2-1.el6.x86_64
glusterfs-fuse-3.4.2-1.el6.x86_64

Server(s):
glusterfs-3.4.2-1.el6.x86_64
glusterfs-libs-3.4.2-1.el6.x86_64
glusterfs-cli-3.4.2-1.el6.x86_64
glusterfs-server-3.4.2-1.el6.x86_64
glusterfs-fuse-3.4.2-1.el6.x86_64

How reproducible:
Regularly, at least on the directory I observed it on: I killed and repeated the tar process after confirming that nothing else was accessing the files. It doesn't necessarily produce "file changed as we read it" on the same file each time, either.

Additional info:
I'm running a distributed-replicate 4x2 configuration. This bug seems nearly identical to bug #762345, but that one was closed, so I'm not sure whether this is a regression or something different.
I've done further testing on this, and it is easily reproducible. Download the latest Linux kernel from kernel.org:

wget https://www.kernel.org/pub/linux/kernel/v3.x/linux-3.13.1.tar.xz

Untar it. After the archive is completely extracted, attempt to create a new tar archive from it. Upon a "tar cvf" operation (you can even leave compression off), you will get a ton of those "tar: ... file changed as we read it" messages:

linux-3.13.1/tools/perf/bench/numa.c
tar: linux-3.13.1/tools/perf/bench/numa.c: file changed as we read it
linux-3.13.1/tools/perf/bench/sched-messaging.c
linux-3.13.1/tools/perf/builtin-buildid-list.c
linux-3.13.1/tools/perf/builtin-diff.c
tar: linux-3.13.1/tools/perf/builtin-diff.c: file changed as we read it
linux-3.13.1/tools/perf/builtin-kmem.c
linux-3.13.1/tools/perf/builtin-list.c
tar: linux-3.13.1/tools/perf/builtin-list.c: file changed as we read it
linux-3.13.1/tools/perf/builtin-record.c
linux-3.13.1/tools/perf/builtin-report.c
tar: linux-3.13.1/tools/perf/builtin-report.c: file changed as we read it
linux-3.13.1/tools/perf/config/

and then the tar operation will exit with a status of 1.
The problem is with tar 1.23 and is not seen with tar 1.26. Can you verify that?
The box is the latest EL6, and yes:

# rpm -q tar
tar-1.23-11.el6.x86_64

Are there no plans to backport the fix to EL6 (or update the version in EL6)? Also, I'm guessing the error is a false positive? Or is there actually a data integrity issue?
False positive; it's a tar bug, fixed in 1.26 - not a GlusterFS bug.
I grabbed the latest tar src rpm from Fedora 20, built it on RHEL6, and installed it:

# rpm -q tar
tar-1.26-30.el6.x86_64

but I'm still seeing the same thing when running "tar cvf":

linux-3.13.1/arch/ia64/include/asm/xen/events.h
tar: linux-3.13.1/arch/ia64/include/asm/xen/events.h: file changed as we read it
linux-3.13.1/arch/ia64/include/asm/xen/hypercall.h
linux-3.13.1/arch/ia64/include/asm/xen/inst.h
tar: linux-3.13.1/arch/ia64/include/asm/xen/inst.h: file changed as we read it
linux-3.13.1/arch/ia64/include/asm/xen/interface.h
linux-3.13.1/arch/ia64/include/asm/xen/irq.h
tar: linux-3.13.1/arch/ia64/include/asm/xen/irq.h: file changed as we read it
linux-3.13.1/arch/ia64/include/asm/xen/minstate.h
linux-3.13.1/arch/ia64/include/asm/xen/page.h

I rebooted and tried again just to make sure nothing was being cached; same thing. "tar --version" confirms that I'm running the correct version too:

# tar --version
tar (GNU tar) 1.26
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by John Gilmore and Jay Fenlason.

Thoughts?
I just took this a step further, remembering that I do have a Fedora 20 Gluster client. Tar version check:

# rpm -q tar
tar-1.26-30.fc20.x86_64

I tried to reproduce everything from comment #1 on the Fedora 20 machine and am getting the exact same results.
I found the root cause of the issue.

Before opening the file it needs to archive, tar does an lstat of the file and stores the header info (file size, mtime, etc.). After reading from the file and writing to the archive, it checks whether the file info is still the same by comparing the file size/mtime obtained via fstat. According to the strace output of tar, the mtimes are different. I checked the backend bricks for this information, and the timestamps differ between the bricks. So basically lstat and fstat are giving output from different bricks. I need to find the reason why there is such a change in afr. I will update the bug once I find the reason for this problem in afr.

Strace output (notice that the mtimes are different):

lstat("/mnt/usr/libexec/postfix/error", {st_dev=makedev(0, 28), st_ino=11897595819233709911, st_mode=S_IFREG|0755, st_nlink=1, st_uid=0, st_gid=0, st_blksize=131072, st_blocks=482, st_size=246696, st_atime=2014/07/30-12:45:29, st_mtime=2014/07/30-12:45:27, st_ctime=2014/07/30-12:45:28}) = 0
fstat(3, {st_dev=makedev(0, 28), st_ino=11897595819233709911, st_mode=S_IFREG|0755, st_nlink=1, st_uid=0, st_gid=0, st_blksize=131072, st_blocks=482, st_size=246696, st_atime=2014/07/30-12:45:29, st_mtime=2014/07/30-12:45:29, st_ctime=2014/07/30-12:45:30}) = 0

stat info from the bricks (mtimes are different):

# stat /brick2/data/usr/libexec/postfix/error
  File: `/brick2/data/usr/libexec/postfix/error'
  Size: 246696    Blocks: 488    IO Block: 4096    regular file
Device: fd03h/64771d    Inode: 734803    Links: 2
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2014-07-30 12:45:29.000000000 +0530
Modify: 2014-07-30 12:45:29.262000775 +0530
Change: 2014-07-30 12:45:30.517000778 +0530

# stat /brick2/data/usr/libexec/postfix/error
  File: `/brick2/data/usr/libexec/postfix/error'
  Size: 246696    Blocks: 488    IO Block: 4096    regular file
Device: fd03h/64771d    Inode: 653909    Links: 2
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2014-07-30 12:45:29.000000000 +0530
Modify: 2014-07-30 12:45:27.223001128 +0530
Change: 2014-07-30 12:45:28.477001126 +0530
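The sequence described above can be sketched as follows - a simplified Python illustration of the check tar actually performs in C (the function name is invented for this example):

```python
import os

def changed_while_reading(path):
    """Roughly what tar does: lstat() before archiving, read the file,
    then fstat() the still-open descriptor and compare size/mtime.
    On a consistent filesystem both calls report the same values; on
    the affected Gluster setups, lstat and fstat can be served by
    different bricks and so may report different mtimes."""
    before = os.lstat(path)           # header info tar records up front
    fd = os.open(path, os.O_RDONLY)
    try:
        while os.read(fd, 65536):     # stand-in for copying data to the archive
            pass
        after = os.fstat(fd)          # re-check through the open descriptor
    finally:
        os.close(fd)
    return (before.st_size != after.st_size
            or before.st_mtime_ns != after.st_mtime_ns)
```

On a local filesystem this should always return False for an idle file; on the affected volumes it would intermittently return True, which is exactly when tar prints "file changed as we read it".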
I have the same error with tar, and I tried using stat like Pranith Kumar K:

brick-1:
  File: `/data/examplefile.tgz'
  Size: 40614639    Blocks: 79336    IO Block: 4096    regular file
Device: fd00h/64768d    Inode: 34653799    Links: 2
Access: (0644/-rw-r--r--)  Uid: (  500/  admin)   Gid: (  500/  admin)
Access: 2014-08-13 10:55:57.632918644 +0800
Modify: 2014-08-13 10:55:45.000000000 +0800
Change: 2014-08-13 10:55:49.903108670 +0800

brick-2:
  File: `/data/examplefile.tgz'
  Size: 40614639    Blocks: 79336    IO Block: 4096    regular file
Device: fd00h/64768d    Inode: 34662215    Links: 2
Access: (0644/-rw-r--r--)  Uid: (  500/  admin)   Gid: (  500/  admin)
Access: 2014-08-13 11:38:34.755595732 +0800
Modify: 2014-08-13 10:55:45.000000000 +0800
Change: 2014-08-13 10:55:49.819567151 +0800

brick-3:
  File: `/data/examplefile.tgz'
  Size: 40614639    Blocks: 79336    IO Block: 4096    regular file
Device: fd00h/64768d    Inode: 34664711    Links: 2
Access: (0644/-rw-r--r--)  Uid: (  500/  admin)   Gid: (  500/  admin)
Access: 2014-08-13 11:23:34.817132315 +0800
Modify: 2014-08-13 10:55:45.000000000 +0800
Change: 2014-08-13 10:55:49.899321741 +0800

When I use stat on the glusterfs filesystem, the output matches one of the bricks:

# stat /home/admin/glusterfsdata/examplefile.tgz
  File: `/home/admin/glusterfsdata/examplefile.tgz'
  Size: 40614639    Blocks: 79326    IO Block: 131072    regular file
Device: 14h/20d    Inode: 11028341610854568098    Links: 1
Access: (0644/-rw-r--r--)  Uid: (  500/  admin)   Gid: (  500/  admin)
Access: 2014-08-13 11:38:34.755595732 +0800
Modify: 2014-08-13 10:55:45.000000000 +0800
Change: 2014-08-13 10:55:49.819567151 +0800

I think the error may occur when tar reads the file from another brick.

glusterfs version:
glusterfs-3.5.2-1.el6.x86_64
glusterfs-fuse-3.5.2-1.el6.x86_64
glusterfs-libs-3.5.2-1.el6.x86_64
I think I may have bumped into the same problem today - not when creating tar files, but when extracting symlinks from tar files. A good explanation can be found here:
http://gluster.org/pipermail/gluster-users.old/2013-July/013524.html

The comment ends with: "In addition to this, using an option like hashed read-child (available with GlusterFS 3.4) can ensure that read calls for an inode/file land on the same node always."

I tried that by setting "gluster volume set <vol> cluster.read-hash-mode 1"; unfortunately that does not fix the problem.

The access pattern tar uses is: open; fstat; close; lstat. If the mtime from fstat and the mtime from lstat differ (because they were obtained from different bricks), tar complains.

I wrote a small test program to trigger this problem. In our case (4 bricks in a Distributed-Replicate 2x2 setup) approx. 1 in 4 stat calls shows different mtimes. There's no difference with cluster.read-hash-mode = 0, 1 or 2.
Created attachment 929364 [details]
test case for triggering the tar problem

Compile this program with

gcc -Wall -o openfstatclose openfstatclose.c

and run it with

./openfstatclose 1000

This will open+fstat 1000 files and remember their mtimes. It then runs lstat on those 1000 files and compares the mtime reported by lstat to the mtime reported by fstat. Typical output:

testfile.001: 1408657007.333940988 != 1408657007.332241660
testfile.003: 1408657007.343941013 != 1408657007.342241915
testfile.005: 1408657007.353941036 != 1408657007.352242171
testfile.007: 1408657007.353941036 != 1408657007.362242426
testfile.010: 1408657007.373941080 != 1408657007.372242681
...
testfile.865: 1408657010.203947694 != 1408657010.202314931
testfile.867: 1408657010.203947694 != 1408657010.202314931
testfile.872: 1408657010.223947743 != 1408657010.222315442
testfile.874: 1408657010.223947743 != 1408657010.222315442
221/1000 ( 22.10%) mismatches
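The attachment itself is not reproduced in this comment, but a rough Python equivalent of what it does (a hypothetical port, not the attached C source) would be:

```python
import os

def mtime_mismatches(paths):
    """open+fstat each file and remember its mtime, then lstat the same
    files afterwards and count files whose lstat mtime differs from the
    recorded fstat mtime -- the same open; fstat; close; lstat pattern
    tar uses when archiving."""
    fstat_mtimes = {}
    for p in paths:
        fd = os.open(p, os.O_RDONLY)
        try:
            fstat_mtimes[p] = os.fstat(fd).st_mtime_ns
        finally:
            os.close(fd)
    mismatches = 0
    for p in paths:
        lstat_mtime = os.lstat(p).st_mtime_ns
        if lstat_mtime != fstat_mtimes[p]:
            mismatches += 1
            print(f"{p}: {lstat_mtime} != {fstat_mtimes[p]}")
    return mismatches
```

Run against files on the Gluster mount; on a consistent filesystem it should report 0 mismatches.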
GlusterFS 3.7.0 has been released (http://www.gluster.org/pipermail/gluster-users/2015-May/021901.html), and the Gluster project maintains N-2 supported releases. The last two releases before 3.7 are still maintained; at the moment these are 3.6 and 3.5. This bug has been filed against the 3.4 release and will not get fixed in a 3.4 version any more.

Please verify whether newer versions are affected by the reported problem. If that is the case, update the bug with a note, and update the version if you can. In case updating the version is not possible, leave a comment in this bug report with the version you tested, and set the "Need additional information the selected bugs from" field below the comment box to "bugs".

If there is no response by the end of the month, this bug will get closed automatically.
I'm still seeing this on GlusterFS 3.6.3
I'm also seeing this same behaviour running a Cygwin environment on a Windows 8.1 server. My suspicion is that it has to do with virus/on-access scanners.

The "Change" attribute changes by a little less than 3 seconds between creation and the subsequent tar operation. The creation is via a copy from one location to another.

Is it possible to just ignore this attribute and not fail? This is killing build pipelines unreasonably :-(

Looking forward to a fix, as there is no real workaround for this.
Having looked at the code in src/create.c that generates this error, I'm wondering whether the test being done is the correct one. It tests for ctime changing - and ctime also changes on permission (metadata) changes. I would have expected it to test mtime (content modification). By changing this code to compare mtime instead of ctime, my problem goes away; the warning would then fire if the file's contents were modified rather than if its permissions changed. I'm not clear on whether that is what this warning is intended to catch. Personally I'm not particularly worried about permissions changing while executing a tar, but I would be worried about a file actually changing. Hopefully I haven't completely misunderstood this issue and the intended behaviour. Interested in feedback on whether this is the intended behaviour or not - and if not, what exactly are we looking for, and why?
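The ctime/mtime distinction discussed here is easy to demonstrate (a Python illustration, not tar's actual C code; the helper name is invented): a permissions-only change bumps ctime but leaves mtime untouched, so a ctime-based comparison fires on events an mtime-based one would ignore.

```python
import os
import stat
import time

def timestamps_changed_by_chmod(path):
    """Apply a permissions-only change and report which of mtime/ctime
    moved.  chmod updates the inode's ctime but not its mtime, which is
    why a ctime comparison also trips on pure metadata changes."""
    before = os.stat(path)
    time.sleep(0.05)  # make any ctime change visible even at coarse resolution
    os.chmod(path, stat.S_IMODE(before.st_mode) ^ 0o100)  # toggle owner-exec bit
    after = os.stat(path)
    return {
        "mtime_changed": before.st_mtime_ns != after.st_mtime_ns,
        "ctime_changed": before.st_ctime_ns != after.st_ctime_ns,
    }
```

So a ctime-based check in tar will warn after a chmod (or anything else that touches inode metadata), even though the archived data is identical.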
(In reply to Mark O'Keefe from comment #15) > I'm also seeing this same behaviour running a Cygwin environment on a > Windows 8.1 server. > My suspicion is that it has to do with virus and on-access scanners. > > The "Change" attribute is changing by a little less than 3 seconds between > creation and subsequent tar operations. The creation is via a copy from one > location to another. > > Is it possible to just ignore this attribute and not fail? This is killing > build pipelines unreasonably :-( > > Looking forward to a fix as there is no real workaround for this. Mark, As a workaround, have you tried invoking tar with the "--warning=no-file-changed" option? According to the man page of tar, using this option should suppress warnings of this nature.
Hi. Thanks for that response. That option only suppresses the warning message, not the exit code, so it would be necessary to check the exit code everywhere tar is used in order to detect the issue and work around it. I've since realised that this bug isn't filed against tar - and tar is where I believe the bug is. My current workaround has been to build my own version of tar that fixes the issue. I'll look at raising a bug against tar for a longer-term fix. Again, thanks.
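Until there is a proper fix, callers can at least distinguish tar's warning exit from its fatal one: GNU tar uses exit status 1 for "some files differ" (which covers "file changed as we read it") and 2 for fatal errors. A wrapper sketch under that assumption (function name and paths are invented for the example):

```python
import subprocess

def tar_create_tolerant(archive, src_dir):
    """Create a gzipped archive but treat GNU tar's exit status 1
    (files changed/differ while archiving) as a warning rather than a
    failure.  Status 2 and above -- actual fatal errors -- still raise."""
    proc = subprocess.run(
        ["tar", "--warning=no-file-changed",
         "-czf", archive, "-C", src_dir, "."]
    )
    if proc.returncode > 1:
        raise RuntimeError(f"tar failed fatally (exit {proc.returncode})")
    return proc.returncode  # 0 = clean, 1 = something changed mid-read
```

Note this only makes the warning non-fatal for the caller; it does not restore the guarantee that the archived data is point-in-time consistent.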
RH customer seeing this issue.
Since people seem to think this is a trivial bug, and no one seems to care about it, I'll just point out here that this bug renders the exit status of tar utterly useless. Anyone who uses tar to archive files on a glusterfs filesystem faces the problem that tar returns exit status 1, and they then have to resort to other means to check whether the tarball is valid. VERY unsatisfactory. Since tar is an absolutely vital tool for dealing with files, THIS IS A PROBLEM WHICH SHOULD NOT BE IGNORED.
@all,

There are a couple more fixes for this in the AFR, DHT and gNFS components of GlusterFS that should fix this issue in the upcoming .x releases. Most of them are out for review at this point.

Thanks,
Krutika
REVIEW: http://review.gluster.org/13826 (debug/trace: Print {acm}times as integers) posted (#1) for review on master by Krutika Dhananjay (kdhananj)
(In reply to BugMasta from comment #20)
> Since people seem to think this is a trivial bug, and noone seems to care
> about it, I'll just point out here that this bug renders the exit status of
> tar utterly useless.

One engineer in NFS, three engineers in replication and three engineers in DHT have been working on this issue for the past 5 weeks.

> Anyone who uses tar to zip up files on a glusterfs filesystem faces the
> problem that tar returns error status 1, and they then have to resort to
> other means to check whether the tarball is valid. VERY unsatisfactory.
>
> Since tar is an absolutely vital tool for dealing with files, THIS IS A
> PROBLEM WHICH SHOULD NOT BE IGNORED.

This bug is difficult to fix. All the solutions we have now trade performance for this consistency. We will probably provide an option under which things will be fine. We are also looking at a design change as the correct long-term fix.
REVIEW: http://review.gluster.org/13826 (debug/trace: Print {acm}times as integers) posted (#3) for review on master by Krutika Dhananjay (kdhananj)
This error also shows up when I tar from a CIFS filesystem. It seems to happen with nearly every file:

in @ 206 MiB/s, out @ 206 MiB/s, 35.0 GiB total, buffer 0% fulltar: Data/Textures/IE/nm/more2: file changed as we read it
in @ 218 MiB/s, out @ 218 MiB/s, 35.4 GiB total, buffer 0% fulltar: Data/Textures/IET/m: file changed as we read it
in @ 216 MiB/s, out @ 218 MiB/s, 35.9 GiB total, buffer 0% fulltar: Data/Textures/Imp/m: file changed as we read it
in @ 176 MiB/s, out @ 176 MiB/s, 37.3 GiB total, buffer 0% fulltar: Data/Textures/LLOD/Gen: file changed as we read it
in @ 216 MiB/s, out @ 218 MiB/s, 37.5 GiB total, buffer 0% full
....

The WIN7 workstation I am copying from (x64 SP1, NTFS) is about as idle as it gets as I watch the "backup" work, and access times are turned off on the NTFS side. The output from tar (GNU tar 1.28) is being redirected/piped into "mbuffer", which uses direct I/O to save it to disk. It seems like the bug is still present in tar 1.28 (kernel 4.4.3).
Hi, We still have the issue with glusterfs 3.7.11 and tar 1.26. Is there any workaround? Thanks.
(In reply to Lionel from comment #26)
> Hi,
>
> We still have the issue with glusterfs 3.7.11 and tar 1.26.
> Is there any workaround?
>
> Thanks.

We do have one workaround, but it decreases performance. I think all the patches are merged for 3.7.12. You need the following configuration:

1) gluster volume set <volname> cluster.consistent-metadata on
2) gluster volume set <volname> cluster.post-op-delay-secs 0

Pranith
GlusterFS 3.6 is nearing its End-Of-Life; only important security bugs still have a chance of getting fixed. Moving this to the mainline 'version'. If this needs to get fixed in 3.7 or 3.8, this bug should get cloned.
While the test case from comment #12 doesn't work for me, I do still see the problem with:

glusterfs-3.8.8-1.el7.x86_64
tar-1.26-29.el7.x86_64

It doesn't take a large number of files, and with my use case (Jenkins build workspaces that are tarred for various reasons) it's 100% reproducible. The problem appeared immediately after switching from single-node single-brick shared storage to 'replica 3 arbiter 1', which makes sense in this context.
Previously my solution was to use gNFS, which doesn't produce this issue, but since gNFS is broken in the latest release (3.9.1-1) I was forced to go back to the native FUSE client, and now I'm facing the same issue as well. It would be nice if you could provide a solution for this.
(In reply to Jules from comment #30) > Previously my solution was to use gNFS which doesn't produce this issues but > since gnfs is broken on latest release: 3.9.1-1, i was forced to downgrade > to Fuse Native client and facing into same issue also now. > > Would be nice if you provide a solution for that. Did you enable cluster.consistent-metadata? -Krutika
You write that this workaround hurts performance. I don't think this is a solution for me, then.
(In reply to Krutika Dhananjay from comment #31)
> Did you enable cluster.consistent-metadata?
>
> -Krutika

Hi - while consistent-metadata is a valid workaround, it's not fixing the underlying issue: files on various bricks will still have different a/m/c-times. There's also a significant performance impact; it would be very nice to avoid that.
I think a better workaround for this would be to tell the connected client to pass the noatime, nodiratime mount options, wouldn't it?
We have been using the cluster.consistent-metadata workaround, and suffering with the impaired performance. This bug is now 4.5 years old. Any chance it will ever get fixed properly?
Hi Tom,

There is a new feature in glusterfs, developed by Rafi, which makes sure ctime is consistent across the replicas and prevents this issue from happening. I am assigning the bug to Rafi to further comment on what more needs to be done so that we can revert the consistent-metadata option.

Pranith
Hi Tom,

A new feature called consistent time (ctime) has been introduced in the latest 4.1 release. You can see the release note here: https://docs.gluster.org/en/latest/release-notes/4.1.0/#standalone (mentioned as point no. 4). You can expect more enhancements to this feature in coming releases.

Rafi KC
Thanks Rafi,

Just to clarify that I understand correctly: starting with 4.1, we should be able to remove the current workaround:

1) gluster volume set <volname> cluster.consistent-metadata on
2) gluster volume set <volname> cluster.post-op-delay-secs 0

and instead use the new feature:

1) gluster volume set <volname> features.utime

and this will give us the same timestamp consistency, but with better performance than the current workaround?

Thanks,
Tom
Yes. You can do a reset on the options cluster.consistent-metadata and cluster.post-op-delay-secs, and enable the ctime xlator. Please note that your clients and servers should be in time sync.
In addition to features.utime, you also need to set features.ctime.
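Putting the last few comments together, the switch-over might look like the following sketch. This is an assumption-laden outline, not verified commands: <volname> is a placeholder, and the option names should be checked against your release's documentation.

```shell
# Revert the old consistency workaround (placeholder volume name):
gluster volume reset <volname> cluster.consistent-metadata
gluster volume reset <volname> cluster.post-op-delay-secs

# Enable the consistent-time feature -- both options, per the comment
# above -- and make sure clients and servers have synchronized clocks:
gluster volume set <volname> features.utime on
gluster volume set <volname> features.ctime on
```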
The 'ctime' feature is now available in the glusterfs-5.0 (upstream) release. Please use it, and reopen the bug if the issue persists.
2024 - just in case someone's looking for the same issue: this also happens without the command-line arguments/flags discussed above, and if you're using a wrapper or an alias it might increase the chance of tar emitting the warning.

```
# tar -czf asd.tgz "fun/abcd (copy 5)"
# rtz ./fun/abcd\ \(copy\ 5\)/
tar: ./fun/abcd (copy 5)/.tar.gz: file changed as we read it
```

This happens even if you're not inside the folder while performing the operation.