Bug 981456
Summary: | RFE: Please create an "initial offline bulk load" tool for data, for GlusterFS | |
---|---|---|---
Product: | [Community] GlusterFS | Reporter: | Justin Clift <jclift>
Component: | core | Assignee: | bugs <bugs>
Status: | CLOSED EOL | QA Contact: |
Severity: | medium | Docs Contact: |
Priority: | unspecified | |
Version: | mainline | CC: | bugs, gluster-bugs, jdarcy, kwade, vbhat
Target Milestone: | --- | Keywords: | FutureFeature
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Enhancement
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2015-10-22 15:46:38 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Justin Clift
2013-07-04 18:53:34 UTC
The only way I could see this being done safely is offline, and the only way I could see it being done efficiently is by maximizing local I/O on the source host(s). Message traffic is what makes normal loading slow: not just stat and similar calls, but extra operations around writes, and lots of small messages to deal with lots of small files. Here's a very brief sketch of what such a tool, running on a source host, would have to do for each file:

* Parse the volfile.
* For each file, do the basic DHT elastic-hashing calculation (*not* an actual DHT lookup, which might generate multiple messages) to figure out where the file "should" go.
* Do the same to calculate locations through AFR and stripe.
* Add the relevant file contents, plus newly generated GFIDs, to a series of archive files (e.g. tar/cpio), one per brick.
* Ship the archive files in bulk to the bricks.
* On each brick, "execute" the archive by creating and writing the individual files, including creation of the necessary xattrs other than GFID.

This is only the tip of the iceberg. Many other issues need to be considered and dealt with, such as the need to ensure that other copies of a file do *not* exist on DHT subvolumes other than the one we're populating as part of the bulk load. The stub/white-out files needed for this, plus the possibility of sparse files on striped volumes, probably mean that none of the existing archive-file formats is sufficient and we'll need to create our own. :(

My biggest concern is verifying the correctness of the result. The bulk-load tool would have to be locked very closely to a specific version of the regular I/O code, because any tiny change to that I/O code could make the bulk-load result "incorrect" in subtle ways that could potentially lead to data loss. The QA and support risks need to be considered very carefully, and more than ordinary efforts made to mitigate them, before we could even consider supporting such a bulk-load tool.

Good thoughts. :) I wonder if adjusting the Gluster I/O code so it can also be called by other utilities would help there (e.g. making that code into a shared library?). Then, if we ship such a bulk-load tool as part of every Gluster release, it would automatically use the correct (matching) I/O code.

Feature requests make most sense against the 'mainline' release; there is no ETA for an implementation, and requests might get forgotten when filed against a particular version.

Because of the large number of bugs filed against it, the 'mainline' version is ambiguous and about to be removed as a choice. If you believe this is still a bug, please change the status back to NEW and choose the appropriate, applicable version for it.
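The DHT elastic-hashing step described in the comments above can be illustrated with a short sketch. This is a simplified assumption-laden model, not Gluster's actual implementation: the hash function below (truncated MD5) stands in for Gluster's real Davies-Meyer-style hash, and the layout is an idealized equal split of the 32-bit hash space rather than the per-directory layout ranges Gluster stores in xattrs.

```python
import hashlib
from bisect import bisect_right

def hash32(name: str) -> int:
    """Map a file basename to a 32-bit value.
    Stand-in hash, NOT Gluster's actual Davies-Meyer hash."""
    return int.from_bytes(hashlib.md5(name.encode()).digest()[:4], "big")

def build_layout(bricks):
    """Assign each brick an equal, contiguous slice of the 32-bit hash space
    (simplified: real DHT layouts are stored per-directory in xattrs)."""
    span = (1 << 32) // len(bricks)
    return [(i * span, brick) for i, brick in enumerate(bricks)]

def place(name, layout):
    """Return the brick whose hash range contains hash32(name).
    Locating a file is pure computation -- no network round trips, which is
    what makes offline placement feasible in the proposed tool."""
    starts = [start for start, _ in layout]
    return layout[bisect_right(starts, hash32(name)) - 1][1]

bricks = ["server1:/brick", "server2:/brick", "server3:/brick"]
layout = build_layout(bricks)
print(place("example.txt", layout))
```

The point of the sketch is the one the comment makes: placement is a deterministic local calculation over the volume layout, so a bulk loader can route files without issuing any DHT lookups over the wire.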
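The "one archive per brick" packing step could look roughly like the sketch below, under heavy assumptions: the routing function is taken as given (e.g. the placement calculation above), GFIDs are modeled as freshly generated UUIDs, and plain tar is used even though the comments note that xattrs, stub/white-out files, and sparse files probably rule out existing archive formats in practice.

```python
import io
import tarfile
import uuid

def pack_per_brick(files, route):
    """Group files into one in-memory tar archive per destination brick.

    files: dict mapping path -> file contents (bytes)
    route: function mapping path -> brick name (assumed, e.g. a DHT
           placement calculation)
    Returns (archives, gfids): tar bytes per brick, and the newly
    generated GFID (modeled here as a UUID string) assigned to each path.
    """
    buffers, writers, gfids = {}, {}, {}
    for path, data in files.items():
        brick = route(path)
        if brick not in writers:
            buffers[brick] = io.BytesIO()
            writers[brick] = tarfile.open(fileobj=buffers[brick], mode="w")
        gfids[path] = str(uuid.uuid4())  # newly generated GFID for this file
        info = tarfile.TarInfo(name=path)
        info.size = len(data)
        writers[brick].addfile(info, io.BytesIO(data))
    for tf in writers.values():
        tf.close()
    return {b: buf.getvalue() for b, buf in buffers.items()}, gfids
```

Shipping each archive to its brick and "executing" it there (creating the files plus the required xattrs) would be separate steps, and as the comments stress, keeping the result byte-for-byte consistent with what the regular I/O path would have produced is the hard part.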