Description of problem: The DistributedFileSystem.listStatus() method first determines the size of the directory and then retrieves the directory entries incrementally, in batches sized according to the number of entries in the directory. Conversely, the same method in GlusterFileSystem retrieves all of the directory entries in a single operation. This might not be a bug, but we should understand why the approach differs in HDFS and whether we might want to implement the same behavior in GlusterFileSystem as well.
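To make the contrast concrete, here is a minimal sketch of the two strategies. This is illustrative only: Page and listChunk are hypothetical stand-ins for the NameNode's paged listing RPC, not real Hadoop or Gluster APIs.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListingStrategies {

    // Hypothetical shape of one page of a paged listing; not a real Hadoop type.
    interface Page {
        FileStatus[] entries();
        String lastName();   // cursor: name of the last entry in this page
        boolean hasMore();
    }

    // Hypothetical paged call standing in for the NameNode RPC.
    static Page listChunk(FileSystem fs, Path dir, String resumeAfter) {
        throw new UnsupportedOperationException("illustrative only");
    }

    // GlusterFileSystem style: one call returns every entry at once.
    static FileStatus[] listAllAtOnce(FileSystem fs, Path dir) throws java.io.IOException {
        return fs.listStatus(dir);
    }

    // DistributedFileSystem style (Hadoop 1.x): fetch the listing in
    // chunks, resuming after the last entry of each chunk, and
    // accumulate the pieces client-side.
    static List<FileStatus> listInChunks(FileSystem fs, Path dir) {
        List<FileStatus> listing = new ArrayList<FileStatus>();
        String resumeAfter = "";               // empty = start of directory
        Page page;
        do {
            page = listChunk(fs, dir, resumeAfter);
            listing.addAll(Arrays.asList(page.entries()));
            resumeAfter = page.lastName();     // cursor for the next round trip
        } while (page.hasMore());
        return listing;
    }
}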
Could you provide a link to the DistributedFileSystem you're referencing? Also, what's the bug or alternative behavior you're proposing?
I am referring to org.apache.hadoop.hdfs.DistributedFileSystem.listStatus() in Apache Hadoop 1.0.4. I am recommending that we understand why this method is implemented differently in DistributedFileSystem so that we can decide whether GlusterFileSystem needs the same approach. I am not explicitly suggesting an alternative, but we should at least understand why there is a difference. HDFS has been used at extreme scale, and there may be a non-obvious but important reason for this difference that we should carefully consider.
I did some research and asked on the gluster mailing list whether a partial directory listing through the FUSE layer is possible. Anand Avati responded that gluster does support partial directory listing, but it is only available through the C layer. Java's File mechanism doesn't tie into the right gluster calls for a partial C listing. If this were essential, we could write a separate C command to do the partial listing for now, and then, if we move to a pure native client, track this as a feature.
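For reference, here is a minimal illustration of why the Java layer can't express a partial listing: java.io.File materializes the entire directory in one call, with no resume cursor or batch size to pass down through FUSE to gluster. (The mount path below is just an example.)

import java.io.File;

public class FullListing {
    public static void main(String[] args) {
        // Example FUSE mount point; adjust to the actual gluster mount.
        File dir = new File("/mnt/glusterfs/some/dir");

        // File.list() returns the complete set of entry names in one shot.
        // There is no argument for "start after entry X" or "return at most
        // N entries", so a partial/paged listing cannot be expressed here;
        // that capability only exists down in gluster's C interface.
        String[] names = dir.list();
        System.out.println(names == null ? "not a directory"
                                         : names.length + " entries");
    }
}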
I queried the o.a.h hdfs mailing list about this, and I'm posting my question and the list's responses below. TL;DR: the partial directory listing exists so as not to overwhelm HDFS' NameNode or introduce long NameNode locks and delays for other clients.

My post: Could someone explain why the DistributedFileSystem's listStatus() method does a piecemeal assembly of a directory listing within the method? Is there a locking issue? What if an element is added to the directory during the operation? What if elements are removed? It would make sense to me for the FileSystem class's listStatus() method to return an Iterator, allowing only partial fetching/chatter as needed. But I don't understand why you'd want to assemble a giant array of the listing chunk by chunk.

Response (Todd Lipcon): The reasoning is that the NameNode locking is somewhat coarse grained. In older versions of Hadoop, before it worked this way, we found that listing large directories (e.g. with 100k+ files) could end up holding the NameNode's lock for quite a long period of time and starve other clients. Additionally, I believe there is a second API that does the "on-demand" fetching of the next set of files from the listing as well, no? As for the consistency argument, you're correct that you may have a non-atomic view of the directory contents, but I can't think of any applications where this would be problematic.

(Suresh Srinivas): An additional reason: HDFS does not have a limit on the number of files in a directory. Some clusters had millions of files in a single directory. Listing such a directory resulted in very large responses, requiring large contiguous memory allocation in the JVM (for the array) and unpredictable GC failures.
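The "on-demand" API Todd mentions is, I believe, the iterator-style listing that later Hadoop lines (0.23/2.x) expose on FileSystem; it is not in 1.0.4. A sketch of how a client would consume it, assuming one of those versions:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class OnDemandListing {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path(args[0]);

        // Entries are fetched from the NameNode in batches behind the
        // iterator, so the client never builds one giant array and the
        // NameNode lock is only held per batch, not per directory.
        RemoteIterator<LocatedFileStatus> it = fs.listLocatedStatus(dir);
        while (it.hasNext()) {
            System.out.println(it.next().getPath());
        }
    }
}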
Marking as 'fixed' since this bug is investigative rather than a fixable issue. If we can demonstrate performance problems around large directories, we should re-evaluate the performance problem itself at that point; the solution may then be to use partial directory listings or something similar.