The current FS connector has 2 problems with regard to binary files: a) extra properties are stored incorrectly b) computing the hash code for large files takes an lot of time.
Randall Hauch <rhauch> made a comment on jira MODE-2060 Corrected use of extra properties for "jcr:content" nodes in the FileSystemConnector, and added a test to verify the behavior.
Randall Hauch <rhauch> made a comment on jira MODE-2061 The file system connector now automatically tries to use 'openssl' to compute the SHA-1 hash of files. If it is available and it returns the same SHA-1 as streaming for a small test, then it is used for all files larger than the 'largeFileSize' property. Note that forking off a new 'openssl' process for small files is actually more expensive; thus the new 'largeFileSize' property that defaults to 50kB. The connector will upon initialization attempt to use 'openssl', and if it is not available it will log warning messages that streaming will be used for all SHA-1 computations. Note that 'openssl' must be on the path for the environment in which the ModeShape process is run.
Randall Hauch <rhauch> made a comment on jira MODE-2061 Merged into the 'master' branch. Leaving open until it can be cherry-picked into the '3.3.x-prod-ip6.0' branch.
Randall Hauch <rhauch> made a comment on jira MODE-2060 Merged into the 'master' branch. Leaving open until it can be cherry-picked into the '3.3.x-prod-ip6.0' branch.
Randall Hauch <rhauch> made a comment on jira MODE-2061 The file system connector now automatically tries to use 'openssl' to compute the SHA-1 hash of files. If it is available and it returns the same SHA-1 as streaming for a small test, then it is used for all files larger than the 'largeFileSize' property. Note that forking off a new 'openssl' process for small files is actually more expensive; thus the new 'largeFileSize' property that defaults to 50kB. The connector will upon initialization attempt to use 'openssl', and if it is not available it will log warning messages that streaming will be used for all SHA-1 computations. Note that 'openssl' must be on the path for the environment in which the ModeShape process is run. (This PR does not use any of [~yulgit1]'s code. The PR is simpler because it automatically tries to use 'openssl' for all "large" files and assumes that 'openssl' is on the path; the connector falls back to streaming if 'openssl' is not available. Therefore, the only configuration property exposed is the optional one that says how large the files should be before 'openssl' is used, and it defaults to 50kB.)
https://github.com/jboss-integration/modeshape/commit/960b21d79e68654496e315f122cfe676adac4443 and https://github.com/jboss-integration/modeshape/commit/9f812e808c4cf1d8b56fc4d26bc60571fdaa92f7
Eric James <eric.james> made a comment on jira MODE-2061 In the testing I did it was actually the smaller files that showed increased performance using openssl not the larger. For the 64GB size file streaming vs openssl were basically the same. Did you find differently? My findings: stats for retrieving jcr:content node of 1.3GB and 64GB file: default modeshape secureHash - 12 sec/16.6 minutes cached openssl - 3.5 sec/8.1 minutes cached modeshape secureHash - 6.2 sec/8.1 minutes mocked - .002 sec But even with the openssl alternative, whether faster or slower, this issue - that touching jcr:content nodes recalculates the hash every time (that takes on the order of minutes to an hour for large file) still remains. Ideally, being naive to modeshape, it seems hashes of a full file should NOT be the BinaryKey identifier but I don't know the fallout of changing this. In the Fedora Commons Akubra module for example https://wiki.duraspace.org/display/FEDORA34/Configuring+Low+Level+Storage, the key is an MD5 of an ID rather than the full file, an approach I was trying to implement similarly with the "mocked" implementation of creating the SHA-1 hash from the file path rather than full file, but Randall expressed reservations about the fallout from this when files move around. One idea I just realized is perhaps for large files (above a threshold) to allow the option of a "disinterested hash", which rather than a full file hash (which takes to long), or a path hash (that may change), to calculate the hash on the first arbitrary KBs of a file, or the first and last KBs of a file (is that possible?) with the assumption that the probability of this being the same in more than one file is little to none. I'd be interested in your opinion about this approach.
Randall Hauch <rhauch> made a comment on jira MODE-2061 {quote} In the testing I did it was actually the smaller files that showed increased performance using openssl not the larger. For the 64GB size file streaming vs openssl were basically the same. Did you find differently? {quote} I didn't do a lot of in-depth testing, but I noticed during some of our unit tests that computed the SHA-1 for lots of very small to small files (e.g., mostly .class files in the '{{target}}' directory) that the results were noticeably slower when we always used {{openssl}}. For a few larger files, {{openssl}} was actually faster. So the results are more inline with my expectation that streaming small files and computing the SHA-1 in our code is actually faster that forking off a separate process, whereas streaming (very) large files in Java should be more expensive than a native {{openssl}}, which has access to more operating system optimizations. (That doesn't mean that {{openssl}} _uses_ all the available optimizations on all platforms.) You'll notice that my solution was to always use {{openssl}} (if available) for all files larger than {{n}}, where {{n}} defaults to 50kB but is entirely configurable. This approach kept the configuration as simple as possible while still allowing someone control over {{n}}, including setting {{n}} to a very large value if {{openssl}} is never to be used or a very small value if {{openssl}} should always be used. *Can you test the code in 'master' in similar ways to what you did before, and see if this change fixes your problem? If you can do this and let us know ASAP, because we plan to release 3.6 in the middle of next week (roughly October 16). Currently, there is nothing holding up the release.* {quote} Ideally, being naive to modeshape, it seems hashes of a full file should NOT be the BinaryKey identifier but I don't know the fallout of changing this. {quote} The whole point of the binary key is to have a unique identifier that is representative of the content but independent of location (and thus moves). Note that most of the requirements come from how ModeShape manages/stores binary values in its binary store, where a content-based key allows us to store only one copy of any given binary value. Now, the FileSystemConnector (or rather a subclass of it) may actually be able to use a different binary key, since the connector actually [stores the URI of the file inside the BinaryValue|https://github.com/ModeShape/modeshape/blob/master/modeshape-jcr/src/main/java/org/modeshape/connector/filesystem/FileSystemConnector.java#L385]. We've not really investigated that, though it could very well be that this causes problems when copying or moving external content into internal content. Feel free to subclass the connector and try different options.
Randall Hauch <rhauch> made a comment on jira MODE-2061 {quote} In the testing I did it was actually the smaller files that showed increased performance using openssl not the larger. For the 64GB size file streaming vs openssl were basically the same. Did you find differently? {quote} I didn't do a lot of in-depth testing, but I noticed during some of our unit tests that computed the SHA-1 for lots of very small to small files (e.g., mostly .class files in the '{{target}}' directory) that the results were noticeably slower when we always used {{openssl}}. For a few larger files, {{openssl}} was actually faster. So the results are more inline with my expectation that streaming small files and computing the SHA-1 in our code is actually faster that forking off a separate process, whereas streaming (very) large files in Java should be more expensive than a native {{openssl}}, which has access to more operating system optimizations. (That doesn't mean that {{openssl}} _uses_ all the available optimizations on all platforms.) You'll notice that my solution was to always use {{openssl}} (if available) for all files larger than {{n}}, where {{n}} defaults to 50kB but is entirely configurable. This approach kept the configuration as simple as possible while still allowing someone control over {{n}}, including setting {{n}} to a very large value if {{openssl}} is never to be used or a very small value if {{openssl}} should always be used. *Can you test the code in 'master' in similar ways to what you did before, and see if this change fixes your problem? If you can do this and let us know ASAP, because we plan to release 3.6 in the middle of next week (roughly October 16). Currently, there is nothing holding up the release.* {quote} Ideally, being naive to modeshape, it seems hashes of a full file should NOT be the BinaryKey identifier but I don't know the fallout of changing this. {quote} The whole point of the binary key is to have a unique identifier that is representative of the content but independent of location (and thus moves). Note that most of the requirements come from how ModeShape manages/stores binary values in its binary store, where a content-based key allows us to store only one copy of any given binary value. Now, the FileSystemConnector (or rather a subclass of it) may actually be able to use a different binary key, since the connector actually [stores the URI of the file inside the BinaryValue|https://github.com/ModeShape/modeshape/blob/master/modeshape-jcr/src/main/java/org/modeshape/connector/filesystem/FileSystemConnector.java#L385]. We've not really investigated that, though it could very well be that this causes problems when copying or moving external content into internal content. Feel free to subclass the connector and try different options. Subclassing would also be a way to provide caching of binary keys. IMO, caching is not worth the complexity, especially when we can't use an _unbounded_ cache.
Eric James <eric.james> made a comment on jira MODE-2061 Reconfirmed that openssl is counterintuitivily faster for smaller rather than large files: openssl 1.3GB = (6.982+14.658+6.466)/3 = 9.369 sec 64GB = (15.52+15.40+15.28)/3 = 15.4 min securehash 1.3GB = (11.812+14.698+15.684)/3 = 14.065 sec 64GB = (15.30+15.23+15.22)/3 = 15.25 min Working on an option to use a hashing mechanism that reads only a part of large files. Probably won't be ready for midweek release, but will submit when I can.
Eric James <eric.james> made a comment on jira MODE-2061 ... but small here is 1.3GB, maybe it's a bell curve, in anycase doesn't improve very large files.
Randall Hauch <rhauch> made a comment on jira MODE-2061 {quote} openssl 1.3GB = (6.982+14.658+6.466)/3 = 9.369 sec 64GB = (15.52+15.40+15.28)/3 = 15.4 min securehash 1.3GB = (11.812+14.698+15.684)/3 = 14.065 sec 64GB = (15.30+15.23+15.22)/3 = 15.25 min {quote} Note that for the 64GB case, the securehash (internal) approach is within 1% of the openssl, plus I'm sure there's some variability in those numbers. I'd really be surprised if openssl turns out to be statistically slower than computing the SHA-1 in the Java process. There might also be a difference in openssl performance on different file systems. Plus, at least on OS X, openssl appears to be caching results: if I use it on some of the largest files I have (500+MB), the first time is about 10 sec while subsequent times is repeatedly <2 seconds. Also, please note that the [default value of the new "`largeFileSize`" property|https://github.com/ModeShape/modeshape/blob/e7ea3fe62b70cf60b73d7ef4ffdd79c52cdb4e0c/modeshape-jcr/src/main/java/org/modeshape/connector/filesystem/FileSystemConnector.java#L179] is just _**50kB**_. Computing the SHA-1 of files this small using openssl is slower than doing it in-process by streaming the file through our internal algorithm because forking the new openssl process has some overhead.
Randall Hauch <rhauch> updated the status of jira MODE-2061 to Reopened
Randall Hauch <rhauch> made a comment on jira MODE-2061 I'm reopening this. After quite a bit of discussion with [~hchiorean], we concluded that because the {{FileSystemConnector}} uses a {{UrlBinaryValue}} with the URL of the underlying file, the {{UrlBinaryValue}} actually does not need the SHA-1-based binary key at all. This means that the {{FileSystemConnector}} doesn't actually have to compute the SHA-1 at all, that the recent commit to use openssl can be reverted, and that the {{UrlBinaryValue}} constructor can be changed to not even take a binary key. Thus, we can improve the performance dramatically without any complication of using openssl or [other proposed mechanisms for computing the SHA-1|https://github.com/ModeShape/modeshape/pull/975].
Moving back to ASSIGNED, as this still needs a commit.
Randall Hauch <rhauch> made a comment on jira MODE-2061 I'm reopening this. After quite a bit of discussion with [~hchiorean], we concluded that because the {{FileSystemConnector}} uses a {{UrlBinaryValue}} with the URL of the underlying file, the {{UrlBinaryValue}} actually does not need the SHA-1-based binary key at all. This means that the {{FileSystemConnector}} doesn't actually have to compute the SHA-1 at all, that the recent commit to use openssl can be reverted, and that the {{UrlBinaryValue}} constructor can be changed to not even take a binary key. Thus, we can improve the performance dramatically without any complication of using openssl or [other proposed mechanisms for computing the SHA-1|https://github.com/ModeShape/modeshape/pull/975]. I will create a new pull-request that reverts the [first pull-request|https://github.com/ModeShape/modeshape/pull/967] and that changes the {{UrlBinaryValue}} to no longer need the SHA-1.
Randall Hauch <rhauch> made a comment on jira MODE-2061 See MODE-2073 for a bug in the logic of copying external nodes (with {{ExternalBinaryValue}} property values) into internal nodes. This was discovered when [~hchiorean] and I were trying to determine if the {{FileSystemConnector}} really needed to compute binary keys based on the SHA-1 (we concluded it does not need to; see previous comment).
Randall Hauch <rhauch> made a comment on jira MODE-2061 Reverted the previous change to use openssl when available, and instead changed the FileSystemConnector to generate `UrlBinaryValue` instances that don't require computing a SHA-1 based binary key. This will dramatically improve the response time for obtaining nodes with binary values.
Randall Hauch <rhauch> made a comment on jira MODE-2061 Merged into the 'master' branch. Leaving this issue open until it is cherry-picked into the '3.3.x-prod-ip6.0' branch.
https://github.com/jboss-integration/modeshape/commit/823898041cb7b26f4e7405e725d018eb6ee6ffd3 https://github.com/jboss-integration/modeshape/commit/53ae071a3c75d1221b2022f52d5ca6d5f14042fd
Eric James <eric.james> made a comment on jira MODE-2061 With these changes the parameter "key" of org.modeshape.jcr.value.binary.AbstractBinary is now the node URI rather than the hash. This breaks the functionality of the getHash() and getHexHash() methods which are now returning this URI. There maybe several options, perhaps one may be to create a "hash" variable in URLBinaryValue that is looked up, and if not found generated when calling the getHash() and getHexHash()?
Horia Chiorean <hchiorea> made a comment on jira MODE-2061 {quote} This breaks the functionality of the getHash() and getHexHash() methods which are now returning this URI {quote} What exactly does "break" mean here ? In the case of {{UrlBinaryValue}} it may be true that "a true SHA1" isn't returned, but from looking at the code the main purpose of the {{getHash}} methods is to aid in binary values comparison (i.e. value1.equals(value2)). This should still hold when the string version of the URL is used. Yes, the Javadocs for the above methods aren't really correct in this case, but I'm not sure I see the functional problem.
Randall Hauch <rhauch> made a comment on jira MODE-2061 The {{UrlBinaryKey}}'s {{getHash()}} and {{getHexHash()}} methods were returning values that were the not same size and non characteristic of SHA-1 values. This modification changes these methods to return the SHA-1 of the string form of the URL. This is consistent with the requirements of the methods signatures, and with the equals/compare behavior of how these methods are used within ModeShape: if the hash is the same then it is expected that the binary value is the same.
Randall Hauch <rhauch> made a comment on jira MODE-2061 The values returned from these methods should be consistent with the expectations/requirements in the JavaDoc, and currently they are not. (The string form of the URL is returned as the {{getHextHash()}}, and that's definitely not correct.) However, I think [~hchiorean] is correct in that they SHA-1 does not really need to match the SHA-1 of the content, but rather simply needs to be a valid SHA-1 that has the same equals/compare characteristics of the SHA-1 of the content. The following changes would have the correct behavior, although two different URLs with the same content would be have different SHA-1s. That's not ideal and isn't optimal, but it is very fast (regardless of binary size) and doesn't violate any of the expected behavior. The {{UrlBinaryKey}}'s {{getHash()}} and {{getHexHash()}} methods were returning values that were the not same size and non characteristic of SHA-1 values. This modification changes these methods to return the SHA-1 of the string form of the URL. This is consistent with the requirements of the methods signatures, and with the equals/compare behavior of how these methods are used within ModeShape: if the hash is the same then it is expected that the binary value is the same.
Randall Hauch <rhauch> made a comment on jira MODE-2061 Okay, we have another proposal instead of the option in my previous comment: We would change the {{FileSystemConnector}} to have a specific method for computing the SHA-1 of a file. By default, this method would compute the SHA-1 of the URL (so in that sense it would be similar to the previous proposal, and lightning fast). However, if that is not sufficient or desirable, you can always create a subclass of {{FileSystemConnector}} and compute/determine the SHA-1 of a given file. The subclass can always use caching, OpenSSL or any other technology.
Horia Chiorean <hchiorea> made a comment on jira MODE-2061 Added an additional pull request based on [~rhauch]'s last comment.
Eric James <eric.james> made a comment on jira MODE-2061 Horia was asking about what I meant by the hash "breaking". In fedora commons there is a fixity service . The fixity functionality is using the getHash() method of the BinaryValue class instance as a SHA1 check, and was finding a file path URI instead. You could do things differently in fedora to accommodate this, but it seems having a checksum as a property in the BinaryValue is worth having which is why I asked about it here on modeshape. But I'm still uncertain about the purpose of all this. I thought the rationale for using the SHA-1 as an identifier in the first place was so 2 or more equivalent files living in different places could be identified singularly. Would making the hash based on path still allow for this? It seems to me a URLBinaryValue class should have both a unique identifier AND a unique representation of its content (or at least the option of having a unique representation of its content). I think it pretty safe to say the unique representation should be a SHA-1 of its contents, and there's not really a way around this even for large files (although that bookend end method I proposing may be a shortcut). As far as a unique identifier I guess what is needed is something consistent that is cheap to compute. With small files it didn't matter, but with large files its probably better to treat the full SHA1 separately from identification.
Horia Chiorean <hchiorea> made a comment on jira MODE-2061 Thanks for the update Eric. How is the SHA1 used by the fixity service ? Is it used to make assumptions about the content (bytes) of the file itself ?
Eric James <eric.james> made a comment on jira MODE-2061 Binary binary = (Binary) contentNode.getProperty(JCR_DATA).getBinary(); String dsChecksum = binary.getHexHash(); return ContentDigest.asURI("SHA-1",dsChecksum); The above code is used in the fixity fixitiy service to get the sha-1 from a specific fedora datastream. What I'm going to try to do over the next days is create a FileSystemConnector option, The default would be as before (i.e. a SHA1 of the file = key) good for small files. The option for large files would be to create a SHA1 key off of the URL or portion of the file for speed, but still have the URLBinaryValue keep the file path so a full hash value separate from the key can be lazily generated and kept upon a call to getHash and getHexHash.
Horia Chiorean <hchiorea> made a comment on jira MODE-2061 [~yulgit1] if you're dealing with binary consistency wouldn't it be easier to compute & compare CRC values ? (e.g. CRC-32). The CRC algorithms are a lot faster than the crypto (SHA1, MD5 etc) ones. They are also a lot more collision-prone, but as long as you're interested in consistency (as opposed uniqueness) they should do the trick.
Randall Hauch <rhauch> made a comment on jira MODE-2061 Merged in Horia's changes into the 'master' branch. Leaving open until it can be cherry-picked into the '3.3.x-prod-ip6.0' branch. Also updated the [documentation|https://docs.jboss.org/author/display/MODE/File+system+connector] to describe the new "{{contentBasedSha1}}" field of the {{FileSystemConnector}}: {quote} *_(Added in 3.6.0.Final)_* Optional _advanced_ boolean property that controls whether the binary value's hash values are SHA-1s based upon the file contents. This property is "{{true}}" by default, and therefore has exactly the same behavior as all other binary values within the repository. The connector has to compute the SHA-1 every time a binary value is returned (including every "{{jcr:data}}" property on the "{{jcr:content}}" children of "{{nt:file}}" nodes. If the underlying files are changed by processes other than ModeShape, the computed SHA-1 may not accurately represent the changed file contents, though the time ModeShape caches the SHA-1 in the binary value is controlled as part of the connector's {{cacheTtlSeconds}} property. Also, computing the SHA-1 can be quite expensive and time consuming for very large files and may thus introduce a lengthy and noticeable lag when returning a "{{jcr:content}}" node until the SHA-1 is computed. If you are using very large files, consider setting this "{{contentBasedSha1}}" property to "{{false}}" so that the connector computes the SHA-1 based upon the URL to the file on the file system. Such a SHA-1 can be computed very quickly, eliminating the lag for very large files mentioned above. ModeShape still uses these SHA-1s internally in a consistent fashion (two SHA-1s will be the same only when they are for the same file), but the {{BinaryValue.getHash()}} and {{BinaryValue.getHexHash()}} methods will return this non-content-based SHA-1 value. _(If you are dealing with very large binary values and are not satisfied with the speed of ModeShape dynamically computing the SHA-1, you can subclass the {{FileSystemConnector}} and override the {{sha1(File)}} method to compute/cache/lookup the SHA-1 for a given file using your preferred mechanism.)_ {quote}
Randall Hauch <rhauch> made a comment on jira MODE-2061 [~yulgit1], as noted above the fix defaults the {{contentBasedSha1}} property to "{{true}}", meaning that the {{FileSystemConnector}} by default will indeed compute the correct SHA-1 based upon the content of the file. This is backward compatible with prior versions, is the most consistent with the rest of the way {{BinaryValue.getHash()}} and {{BinaryValue.getHexHash())}} is used throughout ModeShape, and will perform adequately for small to medium sized files. For situations like yours where you are dealing with very large files, you should at a minimum set the {{contentBasedSha1}} option to "{{false}}" so that the binary value's hash will be a SHA-1 of the URL. This will perform very nicely, but has the tradeoff that the SHA-1 is no longer based upon the content. If you require content-based SHA-1s, you have several options: # Subclass the {{FileSystemConnector}} and override the {{sha1(...)}} method to use {{openssl}} and/or caching the SHA-1, or if possible use another faster technique to lookup/compute the SHA-1. # Subclass the {{FileSystemConnector}} and override the {{binaryFor(File file)}} method to use a completely different {{ExternalBinaryValue}} subclass, such as one that will compute the correct SHA-1 lazily only when {{getHash()}} or {{getHexHash()}} are called.
https://github.com/ModeShape/modeshape/commit/53cffe6832093dcee005e6031a42d17f8ebbe23e
The previous link points to the wrong commit. The correct one is: https://github.com/jboss-integration/modeshape/commit/5ff5077d9df8418da78a2bd418863a4f1ed6a9c2
Eric James <eric.james> made a comment on jira MODE-2061 https://github.com/ModeShape/modeshape/pull/988 https://github.com/yulgit1/modeshape/commit/129f5213d1223beb5c8988476ab2b3e25c1b0dbb https://github.com/yulgit1/modeshape/commit/80732f26b030de174e787adea3820d3d74a994f5 https://github.com/yulgit1/modeshape/commit/61e557787cf176cc593460b98c45d871b8f26c2b https://github.com/yulgit1/modeshape/commit/de6afb7e30d7449746f297282cb68a78c824b71e https://github.com/yulgit1/modeshape/commit/56673459da3b0eec45b511561ce3a5883d488e14 https://github.com/yulgit1/modeshape/commit/468cbea59a3f3fa5d250d9e8af867690e3df9b28 https://github.com/yulgit1/modeshape/commit/56092bddd99bef16f6c817de0992b078a195b04d https://github.com/yulgit1/modeshape/commit/7f1309c573b22c425031dcdab3c1b5cab0625ee1 https://github.com/yulgit1/modeshape/commit/36534d0648451ae12c1c1a9189b8dfcad8fec182 Pardon clumsy commit attempts in this PR. The first 5 commits were reverted. The 468 one is the major change. The 560 involved conflict resolution, the last 2 were small typo fixes. If I should be cleaning or squashing please let me know.
Randall Hauch <rhauch> made a comment on jira MODE-2061 [~yulgit1], this issue has already been closed as part of the 3.6.0.Final release, since we think the specific problem described by this issue has been fixed. Please log a new issue for the addition of the {{FileSystemConnector}} subclass you are proposing in the [pull-request|https://github.com/ModeShape/modeshape/pull/988], and we'll be happy to merge it. Also, both [~hchiorean] and I have already commented on the PR, including several requests for changes to the code (including squashing the commits). Thanks!
Randall Hauch <rhauch> updated the status of jira MODE-2061 to Closed
Randall Hauch <rhauch> updated the status of jira MODE-2060 to Closed
Assuming this does not need to be reported in Release Notes - setting to requires_doc_text-
No, this is relatively minor and does not need to be reported in the release notes.