Bug 981456

Summary: RFE: Please create an "initial offline bulk load" tool for data, for GlusterFS
Product: [Community] GlusterFS
Reporter: Justin Clift <jclift>
Component: core
Assignee: bugs <bugs>
Status: CLOSED EOL
QA Contact:
Severity: medium
Docs Contact:
Priority: unspecified
Version: mainline
CC: bugs, gluster-bugs, jdarcy, kwade, vbhat
Target Milestone: ---
Keywords: FutureFeature
Target Release: ---
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-10-22 15:46:38 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Justin Clift 2013-07-04 18:53:34 UTC
Description of problem:

  For new adopters of GlusterFS with a large existing data set, the
  initial load of their data into Gluster can take days.

  We should be able to improve this significantly by creating a
  specialised "bulk data load" tool for Gluster.

  So far, people have been able to use rsync to copy data directly to the
  individual bricks in order to achieve something similar.  But that
  approach doesn't work for striped or distributed volumes, where each host
  holds only part of the total data.

  This tool should support all Gluster volume types, including both
  striped and distributed volumes, and set the extended attributes
  correctly as it goes.

  To support striped and distributed volumes, it should send the
  appropriate file data to each host, as the gluster* processes
  would expect to find it.

  The tool may need to run while glusterd and glusterfs* are offline, so
  that no conflicts occur while it operates.
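
  As a rough illustration of the "set the extended attributes as it goes"
  part, here is a minimal Python sketch (an assumption about the on-brick
  layout, not a confirmed design): it tags a file that has already been
  copied onto a brick with a freshly generated GFID via a "trusted.gfid"
  xattr.  A real tool would need to match exactly what the running Gluster
  version writes.

    # Hypothetical sketch: attach a GFID-style xattr to a brick-local file.
    # The xattr name "trusted.gfid" and the raw 16-byte UUID value are
    # assumptions about the on-brick format; verify against the target
    # Gluster version before relying on this.  Requires root (trusted.*
    # xattrs) and a Linux filesystem with xattr support.
    import os
    import uuid

    def tag_with_gfid(brick_file_path):
        gfid = uuid.uuid4()
        os.setxattr(brick_file_path, "trusted.gfid", gfid.bytes)
        return gfid

    if __name__ == "__main__":
        # "/bricks/brick1/some/file" is a made-up example path.
        print(tag_with_gfid("/bricks/brick1/some/file"))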

  The thinking behind this RFE comes from similar tools for SQL databases.
  With a SQL database, loading a large data set through normal transaction
  processing (one transaction / commit per insert statement, all triggers
  fired each time) can also take days.  So most SQL databases offer a bulk
  loading mode that relaxes the transaction features (eg. a single commit
  at the start and end, triggers deferred until the bulk load finishes).
  Each SQL database project / vendor has its own way of doing it, but the
  high level principle is the same.
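
  To make the analogy concrete, here is a small Python/sqlite3 example
  (plain SQL, nothing Gluster-specific) contrasting a commit per insert
  with a single transaction around the whole load; the Gluster equivalent
  would be skipping the per-file network round trips during the initial
  bulk load.

    # Analogy only: per-insert commits vs. one bulk transaction.
    import sqlite3

    rows = [(i, "name-%d" % i) for i in range(100000)]
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")

    # Slow path: one commit per insert statement (only a slice of the
    # rows, to keep the example quick).
    for row in rows[:1000]:
        conn.execute("INSERT INTO t VALUES (?, ?)", row)
        conn.commit()

    conn.execute("DELETE FROM t")
    conn.commit()

    # Bulk path: everything inside a single implicit BEGIN/COMMIT.
    with conn:
        conn.executemany("INSERT INTO t VALUES (?, ?)", rows)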


Version-Release number of selected component (if applicable):

  Upstream git master, as of Thursday 4th July 2013.


Actual results:

  Initial loading of data can take days.


Expected results:

  Initial loading of data should not take significantly longer than what
  an equivalent rsync would achieve.


Additional info:

  We should save a significant amount of time this way, by cutting out the
  stat() calls (and similar) that would otherwise occur between hosts during
  normal Gluster operation.

Comment 1 Jeff Darcy 2013-07-05 13:23:35 UTC
The only way I could see this being done safely is offline, and the only way I could see it being done efficiently is by maximizing local I/O on the source host(s).  Message traffic is the thing that makes normal loading slow - not just stat etc. but extra operations around writes, lots of small messages to deal with lots of small files, and so on.  Here's a very brief sketch of what such a tool, running on a source host, would have to do for each file.

* Parse the volfile.

* For each file, do the basic DHT elastic-hashing calculation (*not* an actual DHT lookup which might generate multiple messages) to figure out where the file "should" go (a rough sketch of this grouping step follows the list).

* Do the same to calculate locations through AFR and stripe.

* Add the relevant file contents, plus newly generated GFIDs, to a series of archive files (e.g. tar/cpio), one per brick.

* Ship the archive files in bulk to the bricks.

* On each brick, "execute" the archive by creating and writing the individual files, including creation of necessary xattrs other than GFID.
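
A very rough Python sketch of the grouping step only, to make the shape of
the idea concrete.  The hash function and brick list here are stand-ins, not
Gluster's actual DHT placement (which hashes names onto per-directory layout
ranges derived from the volfile and layout xattrs), so treat this purely as
an illustration of "one archive per brick":

  # Stand-in placement plus per-brick archive building.  pick_brick() is
  # NOT Gluster's DHT hash; a real tool would have to reproduce the exact
  # placement logic from the parsed volfile.
  import hashlib
  import os
  import tarfile
  from collections import defaultdict

  def pick_brick(relative_path, bricks):
      # Hash the file name onto one of the bricks (placeholder logic).
      digest = hashlib.sha1(os.path.basename(relative_path).encode()).digest()
      return bricks[int.from_bytes(digest[:4], "big") % len(bricks)]

  def build_per_brick_archives(source_dir, bricks, out_dir):
      # Group source files by target brick, then write one tarball per
      # brick.  out_dir is assumed to already exist.
      groups = defaultdict(list)
      for root, _dirs, files in os.walk(source_dir):
          for name in files:
              path = os.path.join(root, name)
              rel = os.path.relpath(path, source_dir)
              groups[pick_brick(rel, bricks)].append((path, rel))
      for brick, members in groups.items():
          archive = os.path.join(out_dir, brick.replace("/", "_") + ".tar")
          with tarfile.open(archive, "w") as tar:
              for path, rel in members:
                  tar.add(path, arcname=rel)  # GFIDs/xattrs would also go here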

This is only the tip of the iceberg.  Many other issues need to be considered and dealt with, such as the need to ensure that other copies of a file do *not* exist on DHT subvolumes other than the one we're populating as part of the bulk load.  The stub/white-out files needed for this, plus the possibility of sparse files on striped volumes, probably means that none of the existing archive-file formats are sufficient and we'll need to create our own.  :(

My biggest concern is verifying the correctness of the result.  The bulk-load tool would have to be very closely locked to a specific version of the regular I/O code, because any tiny change to that I/O code could make the bulk-load result "incorrect" in subtle ways that could potentially lead to data loss.  The QA and support risks need to be very carefully considered, and more than ordinary efforts made to mitigate them, before we could even consider supporting such a bulk-load tool itself.

Comment 2 Justin Clift 2013-07-05 13:30:59 UTC
Good thoughts. :)

I wonder if adjusting the Gluster I/O code, so it can also be called by other utils, would help there.  (eg making that code into a shared library?)

Then, if we ship such a bulk load tool as part of every Gluster release, it would be automatically using the correct (matching) I/O code.

Comment 3 Niels de Vos 2014-11-27 14:45:16 UTC
Feature requests make the most sense against the 'mainline' release; there is no ETA for an implementation, and requests might get forgotten when filed against a particular version.

Comment 5 Kaleb KEITHLEY 2015-10-22 15:46:38 UTC
The 'mainline' version is ambiguous and, because of the large number of bugs filed against it, is about to be removed as a choice.

If you believe this is still a bug, please change the status back to NEW and choose the appropriate, applicable version for it.