Bug 1025675
Summary: | RFE: 'multisum' - read file once, compute multiple checksums | | |
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Daniel Mach <dmach> |
Component: | coreutils | Assignee: | Ondrej Vasik <ovasik> |
Status: | CLOSED WONTFIX | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | unspecified | Priority: | unspecified |
Version: | rawhide | CC: | admiller, kdudka, kzak, ooprala, ovasik, pbrady, p, twaugh |
Hardware: | Unspecified | OS: | Unspecified |
Doc Type: | Bug Fix | Type: | Bug |
Last Closed: | 2013-11-01 16:30:26 UTC | | |
Description
Daniel Mach
2013-11-01 09:02:58 UTC
Note newer coreutils checksum utils already support the --tag option to identify the checksum used, and we reused the BSD format here to gain extra compat.

$ tee >(md5sum --tag) >(sha1sum --tag) < /etc/passwd >/dev/null
SHA1 (-) = 30529b9c1622452b4488f229e7f8d36cc49579ba
MD5 (-) = 6d8d8033d929f93998c08a30c92a5b8d

Using tee like above does have disadvantages.
1. Redundant write to /dev/null. (Note you can't really pipe to another checksum util instead, as the output from the previous utils would then go to it as well, because the pipe is set up before the coprocesses.)
2. This doesn't support multiple files well, since the file name isn't output, and it would also need one process per file, which wastes CPU.

p.s. I just fixed the "s/three/four/" issue you mentioned in:
http://lists.gnu.org/archive/html/coreutils/2013-11/msg00000.html

(In reply to Pádraig Brady from comment #1)
> 2. This doesn't support multiple files well, since the file name isn't output

Exactly. If I want to use this approach to process more files, I need to compute checksums per file, call sed to inject the correct file names and merge results into one file.

Note if you were going to have a multisum util, then you would really need to be doing the reading in one process and the checksumming in other processes to take advantage of multicore. Now you could get much of that implicitly with separate checksum utilities (processes) and file caching to avoid the multiple IO overhead. But you would also have to be careful that you wouldn't have multiple processes fighting over a disk head, for example.

Now this sort of processing is not specific to the coreutils checksumming utilities, and is therefore probably best handled outside if possible.

To illustrate how you might split up file processing across CPUs while taking advantage of cache, consider the following xargs command, which would be tuned for 2 CPUs (md5sum & sha1sum). The runs are batched in groups of 10 so that later files don't evict yet-to-be-processed files from the cache.
If you have more CPUs then you could add a -P2 or whatever to the xargs command. Note also you may want to change the '&' to a ';' in the command below if you had a mechanical disk rather than an SSD, to avoid multiple processes fighting over a disk head.

$ seq 20 | xargs -n10 sh -c 'echo md5sum "$0" "$@" & echo sha1sum "$0" "$@"'
sha1sum 1 2 3 4 5 6 7 8 9 10
md5sum 1 2 3 4 5 6 7 8 9 10
sha1sum 11 12 13 14 15 16 17 18 19 20
md5sum 11 12 13 14 15 16 17 18 19 20

Note also the GNU parallel command for dividing up such workloads.

So currently I don't think a separate utility is required for this.

p.s. It's best to first broach questions like this upstream at coreutils, and I'll copy some of this discussion there for posterity.
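
For illustration, the per-file workaround discussed in this thread (read each file once with tee, then label the checksums with the real file name) might be sketched in plain sh roughly as follows. This is only a sketch under assumptions: the function name multisum and the temporary FIFO layout are made up for this example and are not part of coreutils, and mktemp -d, mkfifo, md5sum and sha1sum are assumed to be available. Named FIFOs are used here instead of >(...) so that no write to /dev/null is needed and the output order is fixed. It still starts separate processes for every file, so it does not address that drawback.

```sh
# multisum FILE...: hypothetical helper, not a coreutils command.
# Reads each file once and prints MD5 and SHA1 lines in the BSD "--tag" style.
# Note: plain sh has no local variables, so the names below are global.
multisum() {
    tmp=$(mktemp -d) || return 1
    mkfifo "$tmp/md5" "$tmp/sha1" || { rm -rf "$tmp"; return 1; }
    for f in "$@"; do
        # Skip unreadable inputs up front, otherwise the readers below
        # would block forever waiting for a writer on their FIFO.
        [ -f "$f" ] && [ -r "$f" ] || { echo "multisum: cannot read $f" >&2; continue; }
        # Start both checksum readers; each blocks until tee opens its FIFO.
        md5sum  < "$tmp/md5"  > "$tmp/md5.out"  & p1=$!
        sha1sum < "$tmp/sha1" > "$tmp/sha1.out" & p2=$!
        # Single read of the file: tee copies it to one FIFO,
        # and its stdout feeds the other.
        tee "$tmp/md5" < "$f" > "$tmp/sha1"
        wait "$p1" "$p2"
        # Each output line is "<hash>  -"; keep only the hash.
        read -r md5 rest < "$tmp/md5.out"
        read -r sha1 rest < "$tmp/sha1.out"
        printf 'MD5 (%s) = %s\n' "$f" "$md5"
        printf 'SHA1 (%s) = %s\n' "$f" "$sha1"
    done
    rm -rf "$tmp"
}
```

With that function defined, something like `multisum /etc/passwd /etc/group` would print one MD5 line and one SHA1 line per file, each carrying the file name rather than "-". Printing the name with printf rather than patching it in with sed avoids trouble with file names that contain characters special to sed.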