Bug 888072 - Some files are getting deleted from distributed gluster volume
Status: CLOSED CURRENTRELEASE
Product: GlusterFS
Classification: Community
Component: distribute
Version: 3.3.0
Hardware: x86_64 Linux
Priority: unspecified   Severity: urgent
Assigned To: Jeff Darcy
Depends On:
Blocks: 902209
Reported: 2012-12-17 18:41 EST by Glenn
Modified: 2014-04-17 07:39 EDT (History)
CC List: 4 users

See Also:
Fixed In Version: glusterfs-3.5.0
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Cloned To: 902209
Environment:
Last Closed: 2014-04-17 07:39:53 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments
Client log file (10.80 KB, text/plain)
2012-12-17 21:59 EST, Glenn

Description Glenn 2012-12-17 18:41:00 EST
Description of problem:
We have a distributed gluster volume in service as a scratch file system on an HPC cluster. The transport is IPoIB. There are 4 gluster servers, each serving 4 RAID-6 arrays from an HP MSA2312 storage unit. One of the cluster users runs a set of jobs that puts a fairly high load (~40) on the servers. His computations are also subject to being restarted. It seems that jobs that are restarted under heavy load cause a problem with certain files. In particular, two files are written out during each job, one of about 30GB and one of about 3MB. After job completion and some undetermined amount of time (on the order of hours), the smaller files are deleted from the gluster volume. This seems to affect only files from jobs that were restarted (overwriting existing files), although more investigation is needed to confirm that.

Version-Release number of selected component (if applicable):
3.3.0

How reproducible:
There appear to be several factors involved. It has not occurred every time, but frequently enough to say there is a problem. My guess is that load is part of the equation, and perhaps the overwriting of existing files.

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:
Comment 1 Jeff Darcy 2012-12-17 19:59:29 EST
I've had some email exchanges with the reporter (hi Glenn).  Part of the problem seems to be that some requests are timing out and connections are being lost, perhaps related to load.  Most of the time this doesn't seem particularly problematic and connection recovery works fine, but sometimes there seems to be a substantial amount of "noise" - EBADFD errors, DHT anomalies found, etc. - between the first timed-out request (usually an fsync in the logs I've seen) and the disconnect/reconnect.  In those same cases it's possible that we're hitting some race condition that leaves an fd in an as-yet-unknown state.  Subsequently, the job is restarted and tries to open a file of the same name as the one in the unknown state.  It's surprising that this leads to an existing file being deleted, but there's plenty that we still don't understand here (including application behavior) so it would be premature to conclude that the failed requests and disconnections are mere coincidence.

Shishir, feel free to assign to me (and set NEW->ASSIGNED) if you want me to continue pursuing this.
Comment 2 Jeff Darcy 2012-12-17 20:31:49 EST
Trying to narrow the field a bit..

(1) Do you use striping?

(2) Is there any chance that the application tries to rename anything as part of the restart process (e.g. the common "write tmp then rename over target" idiom)?

Most of the code paths that seem likely here involve one or the other.
Comment 3 Glenn 2012-12-17 21:51:26 EST
Hi Jeff,

(1) No, there is no striping, just distribute.
(2) No, he was just overwriting the existing file.
Comment 4 Glenn 2012-12-17 21:59:55 EST
Created attachment 665243 [details]
Client log file

Attached a client log that is hopefully representative of what is happening.
Comment 5 Jeff Darcy 2012-12-19 10:07:17 EST
I'm running out of ideas for how to diagnose this further.  I've combed the codebase for cases where we call unlink directly, or call another translator's fops->unlink entry point.  Most of them, predictably, are in response to a user's unlink request.  The remainder fall into several categories.

(a) They're in translators that you're not using, including some that are built but not configurable via the CLI (and that quite likely wouldn't even work correctly if you wrote your own volfiles).

(b) They're for internal files, such as those used by index/marker.  I've gone through all of those and not found any cases where it seems even remotely likely that they'd unlink the user-visible file instead.

(c) They're related to other operations, e.g. rename as mentioned above (BTW that's only in a cleanup path).

The one really oddball case is in DHT, when it's removing a stale linkfile.  I've gone through that carefully, and it seems highly unlikely that we would mistake a regular file for a linkfile unless it has some very odd permissions (S_ISVTX, i.e. 01000, set) *and* a trusted.dht.linkto extended attribute.  I guess we can check that, but it's a long shot.
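For example, a quick check along these lines (a hedged sketch; brick paths are illustrative, and the exact on-disk attribute name is assumed to be trusted.glusterfs.dht.linkto):

    # List sticky-bit files on each brick and dump their trusted.* xattrs, so the
    # linkto attribute (if present) shows up next to the odd mode.
    for b in /mnt/gluster-[0-7]; do
        find "$b" -type f -perm -01000 \
            -exec ls -l {} \; \
            -exec getfattr -d -m 'trusted.*' -e hex {} \;
    done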

The other suggestion I have, besides being *absolutely* sure that nothing in the job restart invokes unlink/rename, would be to change permissions (or ACLs?) on the parent directory to force an error when removal is attempted.  That might generate some clues, in our logs or others', about where that's happening.
Comment 6 Glenn 2012-12-19 10:51:03 EST
What is curious as well is that for each job two files are written out simultaneously during the job. One of those is ~30GB and the other is ~3MB. It is always the 3MB file that is deleted, never the 30GB file. Could caching play a role? I am thinking of the caches on the 8 different RAID controllers.

What do you want me to do to gather more information?
Comment 7 Glenn 2012-12-19 10:54:36 EST
Here is some information from the person running the jobs. The bai files are the ones getting deleted when this happens.

"bai files are binary sam(bam) index files, the spec of the format can be found at http://samtools.sourceforge.net/.
my jobs use picard (http://picard.sourceforge.net/) to create the bam and bai files. both files are created simultaneously. When I ran the jobs I would run many at a time and gluster would have been under heavy read and write load. Each job would have 4-8 files open for reading, and writing to 2, I think for the batch of jobs on the 10th >20 jobs were running concurrently"
Comment 8 Jeff Darcy 2012-12-19 11:03:25 EST
A file's name and linked/unlinked status are metadata.  We do cache metadata, but only for read.  We never delay its progress toward storage like we do for data when write-behind is enabled.  Since this is metadata, size shouldn't make a difference.  It's vastly more likely that the difference lies in those files' state (not contents) during the preempt/restart sequence.  The actions I can suggest are:

(1) Double check the modes/xattrs/etc. between the time that the job is preempted and the time that the file disappears, to make sure that we're not hitting something weird like the linkfile case mentioned above.

(2) Consider adding some kind of monitoring or interception of unlink/rename calls to see where they're coming from.  Changing permissions to force an error is a very crude form of this.  Other possibilities might involve systemtap, LD_PRELOAD, or even a new situation-specific translator.
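For (2), a minimal SystemTap sketch might look like the following (probe names and availability depend on the kernel and tapset versions, so treat the details as assumptions to verify rather than a known-good recipe):

    # Log every unlink/rename syscall with the calling process, so a deletion can
    # be traced back to whoever issued it (run on a client or a brick server).
    stap -e 'probe syscall.unlink, syscall.unlinkat, syscall.rename {
        printf("%s %s(%d): %s %s\n", ctime(gettimeofday_s()),
               execname(), pid(), name, argstr)
    }'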
Comment 9 Glenn 2012-12-19 11:20:59 EST
With regard to stale link files, if I run the following command on the servers, I get quite a few hits.

 find /mnt/gluster-[0-7] -type f -perm +01000

This includes files in the directory that was used to hold the output on December 10. Those files correspond to the jobs that had the bai files deleted. Also, the picard process is a java program. I am not sure if that might be important or not.
Comment 10 Glenn 2012-12-19 11:33:26 EST
The stale link files that I am seeing are files that were touched after the other two files were processed. So, after the picard process finished, there was a "complete" file touched. For jobs where the bai file disappeared, those "complete" files show up in the stale link output.
Comment 11 Jeff Darcy 2012-12-19 11:34:46 EST
Having linkfiles is fairly normal, as a consequence of how GlusterFS handles brick addition/deletion.  Having too many could indicate a problem, though rarely one with any effects other than a minor impact on performance.  Also, they should get cleaned up as part of rebalancing, and that does bring to mind the observation that the problem "went into hiding" for a while after a rebalance was done.  Linkfiles should always be zero length.  It might be worth checking whether you have any files with that mode and non-zero length, which might indicate some sort of a race between linkfile and regular-file creation.
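A hedged one-liner for that check (brick paths are illustrative):

    # Linkfiles should be zero length; any sticky-bit file with data is suspect.
    find /mnt/gluster-[0-7] -type f -perm -01000 -size +0c -ls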

As for Java, I doubt that it makes a difference but at this point I wouldn't rule it out.  It is *possible* that the JVM has some sort of odd closing/cleanup behavior that plays into this somehow.
Comment 12 Glenn 2012-12-19 11:42:57 EST
There is this on a client:

ls -l Users/xxxxx/XXXXXX/Timing.mvapich   
-rw-r--r-- 1 xxxxxxx xxxx 936 Dec 11 12:45 Users/xxxxxxx/XXXXXX/Timing.mvapich

and on a gluster server

ls -l /mnt/gluster-3/Users/xxxxxx/XXXXXX/Timing.mvapich
---------T 2 xxxxxxx xxxx 0 Dec 11 12:45 /mnt/gluster-3/Users/xxxxxxx/XXXXXX/Timing.mvapich

The above is actually a file for a different user.
Comment 13 Glenn 2012-12-19 11:46:08 EST
Should link files like above be present when no bricks were added or deleted? Or is that a consequence of the disconnects?
Comment 14 Jeff Darcy 2012-12-19 12:55:38 EST
The above looks like an example of a normal linkfile.  If you were to look at the trusted.dht.linkto attribute on the zero-length/weird-mode file, you'd see that its value identifies the brick (actually the client translator corresponding to the brick) where the non-zero-length copy lives.
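For example, using the file shown above (the full on-disk attribute name is assumed to be trusted.glusterfs.dht.linkto):

    # Print the linkto value as text; it names the client/subvolume that holds
    # the real, non-zero-length copy.
    getfattr -n trusted.glusterfs.dht.linkto -e text \
        /mnt/gluster-3/Users/xxxxxx/XXXXXX/Timing.mvapich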

Linkfiles can be created when there are no brick additions or deletions, but I would expect such cases to be rare.  It requires that a file be placed somewhere other than where hashing would put it, because that brick was down at some time.  For example, it should go on A but A's down so we put it on B.  Later we look for it on A and it's not there but we eventually find it on B so we add a linkfile on A pointing to B's copy.  When a system has many transient failures, more perverse combinations can occur, in the worst case leading to files that are created in two different places with different GFIDs and contents.  Untangling some of those messes can be difficult, which is why I still think some of that cleanup is involved here.  BTW, this is also why I think this problem has shown up in a pure-distribute environment.  If AFR is protecting DHT from seeing bricks as unavailable, even when they've become so due to high load, then we can't get into these kinds of complex situations.

Thus, the combination of high load (leading to unreliable connections) and the absence of AFR underneath causes us to reach a state that would be unachievable otherwise.  We just don't know what that state is, or what other factors (e.g. idiosyncratic behavior in the application's or library's I/O layer) led to it.
Comment 15 Glenn 2012-12-19 13:59:07 EST
(In reply to comment #14)
> The above looks like an example of a normal linkfile.  If you were to look
> at the trusted.dht.linkto attribute on the zero-length/weird-mode file,
> you'd see that its value identifies the brick (actually the client
> translator corresponding to the brick) where the non-zero-length copy lives.
> 
> Linkfiles can be created when there are no brick additions or deletions, but
> I would expect such cases to be rare.  It requires that a file be placed
> somewhere other than where hashing would put it, because that brick was down
> at some time.  For example, it should go on A but A's down so we put it on
> B.  Later we look for it on A and it's not there but we eventually find it
> on B so we add a linkfile on A pointing to B's copy.  When a system has many
> transient failures, more perverse combinations can occur, in the worst case
> leading to files that are created in two different places with different
> GFIDs and contents.  Untangling some of those messes can be difficult, which
> is why I still think some of that cleanup is involved here.  BTW, this is
> also why I think this problem has shown up in a pure-distribute environment.
> If AFR is protecting DHT from seeing bricks as unavailable, even when
> they've become so due to high load, then we can't get into these kinds of
> complex situations.
> 
> Thus, the combination of high load (leading to unreliable connections) and
> the absence of AFR underneath cause us to reach a state that would be
> unachievable otherwise.  We just don't know what that state is, or what
> other factors (e.g. idiosyncratic behavior in the application's or library's
> I/O layer) led to it.

The link files were all created at the same time, Dec 11 08:44, which seems like an indication of a problem.
Comment 16 Glenn 2012-12-26 11:36:02 EST
I came across the following while looking through mailing lists. This seems like it might be similar. Unfortunately, there do not seem to be any follow-ups to that query.

http://gluster.org/pipermail/gluster-users/2012-June/033534.html
Comment 17 Jeff Darcy 2013-01-03 09:10:32 EST
I don't think we'll be able to make any progress on this without knowing where the unlink calls are coming from.  Since we do know (thanks to your good detective work) which files are likely to be vulnerable, could we try doing something with permissions/ACLs/SELinux to force an error and get that information?
Comment 18 Glenn 2013-01-03 16:53:59 EST
Here is some more information as I have been trying to define the parameters.

I ran several sets of jobs to get a feel for what has been happening. I can
definitely reproduce this so that is good. I was suspicious of the server
load as being a factor so I ran several runs with different job quota levels to
limit the number of simultaneous jobs. It is easy to drive the server load
up pretty high with these jobs and I was seeing a lot of job restarts.
There were more restarts than could be explained by job eviction and it
turns out that a high load causes disconnects and IO timeouts and
subsequent job failures. The job failures then lead to job restarts.

As I scaled the load there was a correlation with the number of restarts
and with the number of missing files. In order to test this I set a low
quota (16) to keep the server load to a max of about 10. In this scenario,
the only restarts were from job evictions and there were no problems with
disconnects and IO timeouts. From that, of the 353 jobs, 348 of them
completed. I am not sure of the reasons for the 5 failures but those 5 did
not restart, they just stopped. It looks like they may have gotten evicted
but just never restarted for some reason. Interestingly, the bam and bai
files are present for those 5.

In the end there were 352 bam files and 344 bai files left on the file
system. The following jobs completed after restarts, meaning the files were
in place and then got deleted if deletion happened:
NDD-011: no bam or bai files
NDD-{1015,1019,10-227,1022,1024,10-270,1044,1368}: bam file but no bai file
NDD-1372: both bam and bai files present

Some observations:
- this is not dependent on load but only on restarts of jobs and rewriting of
files, although a high server load will exacerbate the problem by causing
more job restarts
- the bam files can also get deleted.
- the 8 missing bai files account for the difference in the number of
bam and bai files, showing that this only happens with restarts
- NDD-1372 would seem to be an exception. Digging deeper, that job was
the only one that restarted before the output files were written. Once
it restarted, it then started writing output and did not restart again. All
of the other restarted jobs restarted after output file writing had
begun.

I repeated the test using an NFS file system for the output. Again, the load was kept pretty low and the only restarts were from job evictions. With this setup there were no file deletions. This provides evidence that the problem does not stem from any mishandling of SIGTERM during job eviction but is unique to GlusterFS.
Comment 19 Glenn 2013-01-03 16:54:50 EST
(In reply to comment #17)
> I don't think we'll be able to make any progress on this without knowing
> where the unlink calls are coming from.  Since we do know (thanks to your
> good detective work) which files are likely to be vulnerable, could we try
> doing something with permissions/ACLs/SELinux to force an error and get that
> information?

Jeff,

Can you tell me what it is you want me to do to try to capture the information that you need?
Comment 20 Jeff Darcy 2013-01-04 13:34:59 EST
(In reply to comment #19)
> (In reply to comment #17)
> > I don't think we'll be able to make any progress on this without knowing
> > where the unlink calls are coming from.  Since we do know (thanks to your
> > good detective work) which files are likely to be vulnerable, could we try
> > doing something with permissions/ACLs/SELinux to force an error and get that
> > information?
> 
> Jeff,
> 
> Can you tell me what it is you want me to do to try to capture the
> information that you need?


The easiest thing to do would be to take one of the directories for a just-restarted job, while the files are still there, then change its ownership to some other user and its permissions to read-only.  Hopefully, when we subsequently attempt to delete it, we'll get EPERM and log that fact so we'll know which code path is involved.  If that still doesn't work, we might need to figure out a different way to intercept those deletions.
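A hedged sketch of that (the path is a placeholder for a just-restarted job's output directory, and nobody:nobody stands in for "some other user"):

    # Hand the directory to another owner and strip write permission so a later
    # unlink should fail with EPERM/EACCES and leave a trace in the logs.
    chown nobody:nobody /glusterscratch/path/to/job-output-dir
    chmod a-w /glusterscratch/path/to/job-output-dir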
Comment 21 Glenn 2013-01-04 14:50:01 EST
Changing the permissions would cause the jobs to fail, and I do not know if the subsequent rewriting is important. Since the files are always deleted after job completion, though, I can change permissions at the end of the job. I think that should fully mimic the case. I will get those started today.
Comment 22 Glenn 2013-01-04 15:14:25 EST
Jeff,

Would gluster be doing the deletion as root? How could I effectively block deletion if that is the case?
Comment 23 Jeff Darcy 2013-01-21 12:31:59 EST
Sadly, for reasons I don't feel like repeating since Chrome threw away my text when I tried to search the rest of the page, it looks like the accesses will be done as root.  That leaves SELinux as an option, if you have it enabled.  By setting the directory to a different security context than that used by glusterfsd, even root can be prevented from accessing those files.  Failing that, it would be pretty easy to write a translator which would selectively intercept unlink calls (bonus is that we could get stack traces).  Would you be able to deploy such a translator, if I were to create one?
Comment 24 Glenn 2013-01-21 12:36:31 EST
Jeff,

SELinux is not enabled on any of these systems. I could certainly deploy a translator. I assume that would require a restart of the volume.
Comment 25 Jeff Darcy 2013-01-21 12:55:01 EST
Since that would require actual development, no matter how little, I'll consult the Powers That Be and see what we can do.  Thank you for your (extreme) patience.
Comment 26 Glenn 2013-01-30 21:42:12 EST
Jeff,

I ran a set of computations on Gluster, with restarts, etc., and we do have deleted files. What I did was, after a job finished, execute the following command on the subdirectory containing the output files:

chmod -rwx

Here is what I was able to cull from the log files.

grep -r '\] [WE] \[' /var/log/glusterfs/bricks/mnt-gluster-*.log

OSS-0-3
/var/log/glusterfs/bricks/mnt-gluster-5.log:[2013-01-29 21:09:21.793417] W [posix-handle.c:529:posix_handle_soft] 0-krypton-posix: symlink ../../b7/bb/b7bbbde8-f564-468a-aed9-687e855afa7d/NDD-1388 -> /mnt/gluster-5/.glusterfs/18/65/18658117-6f6a-4da0-996e-c8da830d9738 failed (File exists)
/var/log/glusterfs/bricks/mnt-gluster-5.log:[2013-01-29 21:09:21.803874] E [posix.c:968:posix_mkdir] 0-krypton-posix: setting gfid on /mnt/gluster-5/Users/adeluca/cilia_glenn_redo/NDD-1388 failed

OSS-0-2
/var/log/glusterfs/bricks/mnt-gluster-4.log:[2013-01-29 21:09:22.196585] W [posix-handle.c:529:posix_handle_soft] 0-krypton-posix: symlink ../../b7/bb/b7bbbde8-f564-468a-aed9-687e855afa7d/NDD-1388 -> /mnt/gluster-4/.glusterfs/18/65/18658117-6f6a-4da0-996e-c8da830d9738 failed (File exists)
/var/log/glusterfs/bricks/mnt-gluster-4.log:[2013-01-29 21:09:22.219090] E [posix.c:968:posix_mkdir] 0-krypton-posix: setting gfid on /mnt/gluster-4/Users/adeluca/cilia_glenn_redo/NDD-1388 failed
/var/log/glusterfs/bricks/mnt-gluster-6.log:[2013-01-29 21:09:21.989340] W [posix-handle.c:529:posix_handle_soft] 0-krypton-posix: symlink ../../b7/bb/b7bbbde8-f564-468a-aed9-687e855afa7d/NDD-1388 -> /mnt/gluster-6/.glusterfs/18/65/18658117-6f6a-4da0-996e-c8da830d9738 failed (File exists)

OSS-0-1
/var/log/glusterfs/bricks/mnt-gluster-1.log:[2013-01-29 21:09:21.659640] W [posix-handle.c:529:posix_handle_soft] 0-krypton-posix: symlink ../../b7/bb/b7bbbde8-f564-468a-aed9-687e855afa7d/NDD-1392 -> /mnt/gluster-1/.glusterfs/63/43/63435f28-2ab0-41e3-bf1c-321c7e34cffd failed (File exists)
/var/log/glusterfs/bricks/mnt-gluster-1.log:[2013-01-29 21:09:22.136479] W [posix-handle.c:529:posix_handle_soft] 0-krypton-posix: symlink ../../b7/bb/b7bbbde8-f564-468a-aed9-687e855afa7d/NDD-1388 -> /mnt/gluster-1/.glusterfs/18/65/18658117-6f6a-4da0-996e-c8da830d9738 failed (File exists)
/var/log/glusterfs/bricks/mnt-gluster-3.log:[2013-01-29 21:09:21.953582] W [posix-handle.c:529:posix_handle_soft] 0-krypton-posix: symlink ../../b7/bb/b7bbbde8-f564-468a-aed9-687e855afa7d/NDD-1388 -> /mnt/gluster-3/.glusterfs/18/65/18658117-6f6a-4da0-996e-c8da830d9738 failed (File exists)

OSS-0-0
/var/log/glusterfs/bricks/mnt-gluster-0.log:[2013-01-29 21:09:22.017791] W [posix-handle.c:529:posix_handle_soft] 0-krypton-posix: symlink ../../b7/bb/b7bbbde8-f564-468a-aed9-687e855afa7d/NDD-1388 -> /mnt/gluster-0/.glusterfs/18/65/18658117-6f6a-4da0-996e-c8da830d9738 failed (File exists)
/var/log/glusterfs/bricks/mnt-gluster-2.log:[2013-01-29 21:09:22.399179] W [posix-handle.c:529:posix_handle_soft] 0-krypton-posix: symlink ../../b7/bb/b7bbbde8-f564-468a-aed9-687e855afa7d/NDD-1388 -> /mnt/gluster-2/.glusterfs/18/65/18658117-6f6a-4da0-996e-c8da830d9738 failed (File exists)

It is interesting that NDD-1388 shows up frequently but that is not one that was deleted.
Comment 27 Glenn 2013-01-30 21:50:29 EST
A new development on this problem may indicate that it is more severe than previously thought. Another person reported missing files. His steps were:

1. mkdir /glusterscratch/bwa-temp
2. cp /files/from/somewhere /glusterscratch/bwa-temp
3. run computations only reading from /glusterscratch/bwa-temp
4. observe missing files in the newly created directory
Comment 28 Jeff Darcy 2013-02-01 14:09:08 EST
I don't know about the "new development" but the other result sheds some interesting light on what might be happening.  This is clearly a mkdir that's failing.  We're trying to create a symlink inside .glusterfs because we can't create hard links to directories (which is what we would do for regular files).  Symlinks don't care about the destination, so "File exists" refers to the symlink itself - i.e. we already have a symlink from .../b7bbbde8-f564-468a-aed9-687e855afa7d/NDD-1388 to somewhere.  Why?  Because the permissions change caused a previous rmdir/rename to fail - silently.  That's kind of bad because it doesn't immediately tell us where the call came from, but we might still be able to figure that out by eliminating the paths that *would* have generated messages.  Setting the GFID subsequently might have failed because the mkdir had (at least partially) failed.

I'll look through some of the mkdir/rename paths to see if this tells us anything useful.  Unfortunately, the fact that NDD-1388 wasn't being deleted previously suggests that this might be a red herring - a side effect of having changed the permissions, and quite possibly another bug, but not the same problem we had seen previously.  It would help to understand how files are being created and moved around especially during job restarts (e.g. is there a move-out/do-stuff/move-back kind of pattern at play?) but I realize that might not be possible.
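For reference, the directory's .glusterfs entry can be inspected directly on a brick; using the gfid from the log lines above:

    # The .glusterfs entry for a directory is a symlink named after its gfid,
    # pointing back at <parent-gfid>/<basename>; readlink shows where it points.
    readlink /mnt/gluster-5/.glusterfs/18/65/18658117-6f6a-4da0-996e-c8da830d9738
    ls -ld /mnt/gluster-5/Users/adeluca/cilia_glenn_redo/NDD-1388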
Comment 29 Glenn 2013-02-01 14:30:34 EST
(In reply to comment #28)
> It would help to understand how files
> are being created and moved around especially during job restarts (e.g. is
> there a move-out/do-stuff/move-back kind of pattern at play?) but I realize
> that might not be possible.

There is no movement of the files. It goes like this

- job starts via SGE
- program writes to FILE
- job is killed with SIGTERM
- SGE finds another host for job
- job restarts (no checkpoint) on new host
- program writes to FILE
- program finishes, with complete output in FILE
- sometime later, FILE is gone
Comment 30 Jeff Darcy 2013-02-01 14:38:54 EST
(In reply to comment #29)
> (In reply to comment #28)
> > It would help to understand how files
> > are being created and moved around especially during job restarts (e.g. is
> > there a move-out/do-stuff/move-back kind of pattern at play?) but I realize
> > that might not be possible.
> 
> There is no movement of the files. It goes like this
> 
> - job starts via SGE
> - program writes to FILE
> - job is killed with SIGTERM
> - SGE finds another host for job
> - job restarts (no checkpoint) on new host
> - program writes to FILE
> - program finishes, with complete output in FILE
> - sometime later, FILE is gone

Thanks, Glenn.  So where is this mkdir coming from?  Is it in any way related to this job, or entirely separate?  If it's the latter, then it probably is a red herring.  :(
Comment 31 Glenn 2013-02-01 15:36:14 EST
There is a mkdir command that is part of the job, run before the actual computation. It creates the directory where the output files for the job go. So, to modify the previous order:

- job starts via SGE
- output directory is created if it does not exist (shell scripting here)
- program writes to FILE in output directory
- job is killed with SIGTERM
- SGE finds another host for job
- job restarts (no checkpoint) on new host
- program writes to FILE in output directory
- program finishes, with complete output in FILE in output directory
- sometime later, FILE is gone

The NDD-1388 directory was created and has its output files.
Comment 32 Jeff Darcy 2013-02-04 09:50:33 EST
I think what we're seeing with NDD-1388 is a red herring.  Ironically, I can reproduce those symptoms with AFR, but not with your configuration.  I'm going to put together that interception translator that was mentioned earlier, so we can get a full stack trace for the deletions as they happen.  Are they just regular files within a single directory, or will we need to watch subdirectories etc. as well?
Comment 33 Glenn 2013-02-04 11:52:46 EST
(In reply to comment #32)
> I think what we're seeing with NDD-1388 is a red herring.  Ironically, I can
> reproduce those symptoms with AFR, but not with your configuration.  I'm
> going to put together that interception translator that was mentioned
> earlier, so we can get a full stack trace for the deletions as they happen. 
> Are they just regular files within a single directory, or will we need to
> watch subdirectories etc. as well?

They are just single files. I have run the jobs in two ways. One with all output files from the jobs in one subdirectory and one with the output files from each job in a separate subdirectory, with 2 files per directory. Both cases result in deleted files.
Comment 34 Anand Avati 2013-02-08 16:11:25 EST
Glenn:
 Can you confirm that the files are indeed missing from the bricks/backend? Or just failing to list/access on the mount point? You will have to look into every brick backend for the missing filename in the corresponding directory.

Also can you get the output of 'getfattr -d -m . -e hex /backend/dir/NDD-1388' from all bricks, for all directories which were supposed to hold the missing files?
Comment 35 Glenn 2013-02-12 20:18:28 EST
The missing files are not on the backend file systems.

Here is the output for NDD-1388

[root@OSS-0-0:~]# find /mnt/gluster-*/Users/adeluca/cilia_glenn_redo/ -type d -name NDD-1388 -exec getfattr -d -m . -e hex {} \;
getfattr: Removing leading '/' from absolute path names
# file: mnt/gluster-0/Users/adeluca/cilia_glenn_redo/NDD-1388
trusted.gfid=0x186581176f6a4da0996ec8da830d9738
trusted.glusterfs.dht=0x00000001000000007ffffff88ffffff6

getfattr: Removing leading '/' from absolute path names
# file: mnt/gluster-2/Users/adeluca/cilia_glenn_redo/NDD-1388
trusted.gfid=0x186581176f6a4da0996ec8da830d9738
trusted.glusterfs.dht=0x00000001000000008ffffff79ffffff5

getfattr: Removing leading '/' from absolute path names
# file: mnt/gluster-4/Users/adeluca/cilia_glenn_redo/NDD-1388
trusted.gfid=0x186581176f6a4da0996ec8da830d9738
trusted.glusterfs.dht=0x0000000100000000000000000ffffffe

getfattr: Removing leading '/' from absolute path names
# file: mnt/gluster-6/Users/adeluca/cilia_glenn_redo/NDD-1388
trusted.gfid=0x186581176f6a4da0996ec8da830d9738
trusted.glusterfs.dht=0x00000001000000000fffffff1ffffffd


[root@OSS-0-1:~]# find /mnt/gluster-*/Users/adeluca/cilia_glenn_redo/ -type d -name NDD-1388 -exec getfattr -d -m . -e hex {} \;
getfattr: Removing leading '/' from absolute path names
# file: mnt/gluster-1/Users/adeluca/cilia_glenn_redo/NDD-1388
trusted.gfid=0x186581176f6a4da0996ec8da830d9738
trusted.glusterfs.dht=0x00000001000000001ffffffe2ffffffc

getfattr: Removing leading '/' from absolute path names
# file: mnt/gluster-3/Users/adeluca/cilia_glenn_redo/NDD-1388
trusted.gfid=0x186581176f6a4da0996ec8da830d9738
trusted.glusterfs.dht=0x00000001000000002ffffffd3ffffffb

getfattr: Removing leading '/' from absolute path names
# file: mnt/gluster-5/Users/adeluca/cilia_glenn_redo/NDD-1388
trusted.gfid=0x186581176f6a4da0996ec8da830d9738
trusted.glusterfs.dht=0x00000001000000003ffffffc4ffffffa

getfattr: Removing leading '/' from absolute path names
# file: mnt/gluster-7/Users/adeluca/cilia_glenn_redo/NDD-1388
trusted.gfid=0x186581176f6a4da0996ec8da830d9738
trusted.glusterfs.dht=0x00000001000000004ffffffb5ffffff9


[root@OSS-0-2:~]# find /mnt/gluster-*/Users/adeluca/cilia_glenn_redo/ -type d -name NDD-1388 -exec getfattr -d -m . -e hex {} \;
getfattr: Removing leading '/' from absolute path names
# file: mnt/gluster-0/Users/adeluca/cilia_glenn_redo/NDD-1388
trusted.gfid=0x186581176f6a4da0996ec8da830d9738
trusted.glusterfs.dht=0x00000001000000005ffffffa6ffffff8

getfattr: Removing leading '/' from absolute path names
# file: mnt/gluster-2/Users/adeluca/cilia_glenn_redo/NDD-1388
trusted.gfid=0x186581176f6a4da0996ec8da830d9738
trusted.glusterfs.dht=0x00000001000000006ffffff97ffffff7

getfattr: Removing leading '/' from absolute path names
# file: mnt/gluster-4/Users/adeluca/cilia_glenn_redo/NDD-1388
trusted.gfid=0x186581176f6a4da0996ec8da830d9738
trusted.glusterfs.dht=0x00000001000000009ffffff6affffff4

getfattr: Removing leading '/' from absolute path names
# file: mnt/gluster-6/Users/adeluca/cilia_glenn_redo/NDD-1388
trusted.gfid=0x186581176f6a4da0996ec8da830d9738
trusted.glusterfs.dht=0x0000000100000000affffff5bffffff3


[root@OSS-0-3:~]# find /mnt/gluster-*/Users/adeluca/cilia_glenn_redo/ -type d -name NDD-1388 -exec getfattr -d -m . -e hex {} \;
getfattr: Removing leading '/' from absolute path names
# file: mnt/gluster-1/Users/adeluca/cilia_glenn_redo/NDD-1388
trusted.gfid=0x186581176f6a4da0996ec8da830d9738
trusted.glusterfs.dht=0x0000000100000000bffffff4cffffff2

getfattr: Removing leading '/' from absolute path names
# file: mnt/gluster-3/Users/adeluca/cilia_glenn_redo/NDD-1388
trusted.gfid=0x186581176f6a4da0996ec8da830d9738
trusted.glusterfs.dht=0x0000000100000000cffffff3dffffff1

getfattr: Removing leading '/' from absolute path names
# file: mnt/gluster-5/Users/adeluca/cilia_glenn_redo/NDD-1388
trusted.gfid=0x186581176f6a4da0996ec8da830d9738
trusted.glusterfs.dht=0x0000000100000000dffffff2effffff0

getfattr: Removing leading '/' from absolute path names
# file: mnt/gluster-7/Users/adeluca/cilia_glenn_redo/NDD-1388
trusted.gfid=0x186581176f6a4da0996ec8da830d9738
trusted.glusterfs.dht=0x0000000100000000effffff1ffffffff
Comment 36 Jeff Darcy 2013-02-14 15:09:59 EST
Thanks, Glenn.  This looks like a perfectly valid and normal 16-node distribution, which I'll just repeat in a more compact form just in case we have a need for it later.

00000000  0ffffffe      OSS-0-0/gluster-4
0fffffff  1ffffffd      OSS-0-0/gluster-6
1ffffffe  2ffffffc      OSS-0-1/gluster-1
2ffffffd  3ffffffb      OSS-0-1/gluster-3
3ffffffc  4ffffffa      OSS-0-1/gluster-5
4ffffffb  5ffffff9      OSS-0-1/gluster-7
5ffffffa  6ffffff8      OSS-0-2/gluster-0
6ffffff9  7ffffff7      OSS-0-2/gluster-2
7ffffff8  8ffffff6      OSS-0-0/gluster-0
8ffffff7  9ffffff5      OSS-0-0/gluster-2
9ffffff6  affffff4      OSS-0-2/gluster-4
affffff5  bffffff3      OSS-0-2/gluster-6
bffffff4  cffffff2      OSS-0-3/gluster-1
cffffff3  dffffff1      OSS-0-3/gluster-3
dffffff2  effffff0      OSS-0-3/gluster-5
effffff1  ffffffff      OSS-0-3/gluster-7

No holes, no overlaps, everything exactly as it should be.

I have implemented a delete-logging translator, which you can see being reviewed here:

  http://review.gluster.org/#change,4496

It's very minimal.  The delete-logging state is specific to a single directory, not inherited by all of its children, and it's only in memory so it doesn't survive reboots/remounts and has to be applied separately on each client (but of course you have tools to do exactly that sort of thing).  The steps to deploy it would be:

(1) Build from whichever version you're using plus the patch.

(2) Install the volfile filter on the servers.

(3) Install the new prot_*.so translators on the clients (existing components can be left alone) and remount to pick up the volfiles that refer to them.

(4) Use setfattr on all clients to set the (virtual) trusted.glusterfs.protect xattr on the directories where these deletions occur, whenever the necessary preconditions are met.

(5) When the deletions do occur, look in the client logs for the messages identifying the code paths involved.
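For step (4), a minimal invocation might look like this (the attribute value here is an assumption; the review linked above documents what the translator actually accepts):

    # Mark a job's output directory so deletions inside it are intercepted/logged.
    setfattr -n trusted.glusterfs.protect -v 1 /glusterscratch/path/to/job-output-dir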

With a little more time, I could wrap up the new bits into a separate RPM, but that would still have to be built against the proper version of glusterfs-devel*.rpm to assure compatibility.  Also, I have to reiterate that this is still highly experimental code.  For safety's sake, I'd recommend only running in this mode long enough to collect the needed information.  You can then disable it by removing (at least) the volfile filter on the servers and remounting on clients.
Comment 37 Glenn 2013-02-14 16:11:53 EST
Thanks Jeff. I think I will need a little more help with getting this set up. How do I get the patch? I get the following message:

fatal: Not a git repository (or any of the parent directories): .git

I am wondering if I should set this up on different hardware and a different file system. I would first have to verify that the problem occurs on other hardware.
Comment 38 Vijay Bellur 2013-02-17 15:05:02 EST
CHANGE: http://review.gluster.org/4496 (features: add a directory-protection translator) merged in master by Anand Avati (avati@redhat.com)
Comment 39 Jeff Darcy 2013-02-18 10:04:09 EST
Glenn, setting this up on a separate system first is probably an excellent idea in any case.  I was able to build on 3.3 (which seems to be what you're using) with the following commands.

(1) git clone git://git.gluster.com/glusterfs.git

(2) cd glusterfs

(3) git branch -t r3.3.0 origin/release-3.3

(4) git checkout r3.3.0

(5) git reset --hard v3.3.1

(6) git fetch http://review.gluster.org/p/glusterfs refs/changes/96/4496/2 && git format-patch -1 FETCH_HEAD

(7) patch -p1 < 0001-features-add-a-directory-protection-translator.patch 

(8) ./autogen.sh 

(9) ./configure --enable-fusermount

(10) make -j16

I think the key to the first step is using git: and .com instead of .org for the clone URI.  The patch command will also give you a warning about trace.c but you can ignore that because AFAICT the problem fixed by that chunk didn't exist in the 3.3 branch.  Lastly, I should note that I do have a regular development environment on this machine (autoconf/automake/libtool etc.) but the above steps don't rely on any special credentials etc.

The files you really need from the build are as follows (the rest should be the same as what you already have):

    export DESTDIR=/usr/lib64/glusterfs/$VERSION
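    # $VERSION above is the installed glusterfs version string, i.e. the existing
    # directory name under /usr/lib64/glusterfs/ (e.g. 3.3.1).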
    export XLATORDIR=$DESTDIR/xlator/features
    cp xlators/features/protect/src/.libs/prot_client.so $XLATORDIR/
    cp xlators/features/protect/src/.libs/prot_dht.so $XLATORDIR/
    # Next two only needed on servers.
    mkdir $DESTDIR/filter
    cp extras/prot_filter.py $DESTDIR/filter/
Comment 40 Glenn 2013-03-05 15:22:00 EST
Jeff,

Sorry for the delay but I have hit a stumbling block. I have set up another gluster volume on different hardware in preparation for testing your translator. The problem that I am having now is that I cannot replicate the file deletion problem on the test hardware. I am still using the same gluster binaries, I just created a new volume. I wanted to be sure that I could replicate the problem on the test hardware before setting up the version with the translator.

The server is the same type as the production servers and the MSA is the same, but there are still a couple of differences. First, the test system uses SAS drives and the production system uses SATA drives. The SAS drives are much smaller but the RAID setup is the same. Second, I have 2 bricks on the test server but the production system has 4 bricks on a server. I will next try reconfiguring the arrays as a set of 4 rather than a set of 2 to more closely match the production system. Given all of this, does anything jump out at you that should be looked at? The MSAs are configured identically, and the gluster volumes are configured the same way. The job process is the same on both. The production system exhibits the problem but the test system does not.
Comment 41 Glenn 2013-03-14 16:20:33 EDT
I did one more verification run on the production glusterscratch.

[root@helium-login-0-1:~]# ls /glusterscratch/Users/adeluca/cilia_glenn_redo/complete|wc -l
80
[root@helium-login-0-1:~]# find /glusterscratch/Users/adeluca/cilia_glenn_redo -name \*.bam|wc -l
76
[root@helium-login-0-1:~]# find /glusterscratch/Users/adeluca/cilia_glenn_redo -name \*.bai|wc -l
67

Those should all be 80.

A file listing is done at the end of the job and put into the stdout file. Picking one of the missing ones as an example.

tail -n 2 /Users/adeluca/cilia/merge/glenn_redo_logs_glusterscratch/gr_NDD-10-102.o143664 
-rw-r--r-- 1 adeluca clcg 3.2M Mar  8 22:52 /glusterscratch/Users/adeluca//cilia_glenn_redo//NDD-10-102/NDD-10-102.bai
-rw-r--r-- 1 adeluca clcg  27G Mar  8 22:52 /glusterscratch/Users/adeluca//cilia_glenn_redo//NDD-10-102/NDD-10-102.bam

Clearly, both files were present. But then the bai file was deleted.

ls -hl /glusterscratch/Users/adeluca/cilia_glenn_redo/NDD-10-102
total 27G
-rw-r--r-- 1 adeluca clcg 27G Mar  8 22:52 NDD-10-102.bam

And here is when the file was deleted.

ls -dhl /glusterscratch/Users/adeluca/cilia_glenn_redo/NDD-10-102
drwxr-xr-x 2 adeluca clcg 117 Mar  9 07:18 /glusterscratch/Users/adeluca/cilia_glenn_redo/NDD-10-102

There were 13 other files deleted from this run. They were all deleted at the same time, March 9 07:18.
Comment 42 Glenn 2013-03-26 16:11:32 EDT
(In reply to comment #36)
> (3) Install the new prot_*.so translators on the clients (existing
> components can be left alone) and remount to pick up the volfiles that refer
> to them.

Jeff,

How do I get the volfiles to refer to the translators? Can that be done through the gluster cli once they are put into place?
Comment 43 Glenn 2013-03-26 16:15:06 EDT
(In reply to comment #42)
> (In reply to comment #36)
> > (3) Install the new prot_*.so translators on the clients (existing
> > components can be left alone) and remount to pick up the volfiles that refer
> > to them.
> 
> Jeff,
> 
> How do I get the volfiles to refer to the translators? Can that be done
> through the gluster cli once they are put into place?

Or does the filter on the server force those to be loaded?
Comment 44 Glenn 2013-03-27 23:06:43 EDT
So this file deletion problem became significantly worse recently. Someone started up an rsync of a directory and within 24 hours, 7 TB had been deleted by gluster. We were actually not aware of this at the time. We then had a maintenance window during which the gluster volume was stopped, clients were unmounted, and everything was eventually restarted. The rsync process was fired up again, and this is when the deletions really started happening. I observed the following behavior when looking at some files:

- find /some/dir | wc -l
- run the above again and the file count is significantly less.

Files were being deleted just by accessing them. This would seem to point to something going wrong with the metadata hash calculations but that is just speculation on my part.
Comment 45 Glenn 2013-04-01 22:26:47 EDT
I have the translator in place and am running the jobs that have exhibited the problem. It is not clear to me how the translator gets picked up in the configuration. Is there a way that I can tell if it is active? I set the volume name in prot_filter.py to match what the volume name is here. Was there anything else that I need to do?
Comment 46 Glenn 2013-04-05 13:49:38 EDT
A couple of notes. I have not been able to replicate the problem again with the translator in place. It could be partly because we have kicked everyone off of the system.

I finally did see file deletions on the test volume I set up. Unfortunately, it happened before I got the translator in place. There was no one else on that volume, but since it was small I was able to ramp up the load. However, the deletions occurred following a shutdown of the file system and then bringing it back up. The odd thing is that the deletions occurred about 22 hours after I brought the volume back up.

Finally, I am still not clear as to whether anything more needs to be done for the translator than what is in your instructions. I am not seeing how the translator gets picked up in the volumes. Could you comment on that?

Thanks.
Comment 47 Niels de Vos 2014-04-17 07:39:53 EDT
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.5.0, please reopen this bug report.

glusterfs-3.5.0 has been announced on the Gluster Developers mailing list [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/6137
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user
