Red Hat Bugzilla – Bug 1025675
RFE: 'multisum' - read file once, compute multiple checksums
Last modified: 2013-11-01 12:30:26 EDT
In some cases we want to compute multiple checksums at once.
CPU is frequently cheaper than IO and this would allow us to read each file just once.
We currently use cat | tee >(md5sum) >(sha256sum) etc., but that's not exactly user-friendly, and I'd like to replace it with a better tool.
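For illustration, a minimal runnable version of that workaround (the file name sample.txt is just an example); both sums print '-' as the file name, since each coprocess reads from a pipe:

```shell
# Current workaround: read the file once and fan its contents out to
# two checksum coprocesses via bash process substitution.
printf 'hello\n' > sample.txt
tee >(md5sum) >(sha256sum) < sample.txt > /dev/null
```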
Here's my proposal for 'multisum' behaviour:
-b, --binary read in binary mode
-c, --check read checksums from the FILEs and check them
-t, --text read in text mode (default)
Note: There is no difference between the binary and text mode options on GNU systems.
-s, --checksum checksum type
Note: The checksum type can be specified multiple times, but must be given
at least once. When verifying checksums, only records for the given
checksum types will be verified.
# display list of supported checksum types?
The following three options are useful only when verifying checksums:
# *three* options? I see 4 :)
--quiet don't print OK for each successfully verified file
--status don't output anything, status code shows success
-w, --warn warn about improperly formatted checksum lines
--strict with --check, exit non-zero for any invalid input
--help display this help and exit
--version output version information and exit
Each record would have the format:
\?<checksum_type>:<checksum> [* ]<path>
Records should be:
* grouped by path
* ordered by checksum type within one group
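With two files, the grouping and ordering rules above would give output shaped like this (digests shown as placeholders; this only illustrates the layout):

```
md5:<md5 digest> *test-data/a b
sha256:<sha256 digest> *test-data/a b
md5:<md5 digest> *test-data/ab
sha256:<sha256 digest> *test-data/ab
```

i.e. all records for one path appear together, sorted by checksum type within the group.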
mkdir -p test-data; for i in "ab" "a b" 'a\b' "a\nb"; do echo "$i" > "test-data/$i"; done; md5sum -b test-data/*
7557d2f3a6ad1a3a8ebd23a94ab0c642 *test-data/a b
Multisum output would look like:
mkdir -p test-data; for i in "ab" "a b" 'a\b' "a\nb"; do echo "$i" > "test-data/$i"; done; multisum -b -s sha256 -s md5 test-data/*
md5:7557d2f3a6ad1a3a8ebd23a94ab0c642 *test-data/a b
sha256:01186fcf04b4b447f393e552964c08c7b419c1ad7a25c342a0b631b1967d3a27 *test-data/a b
You may also consider adding 'multisum' file format support to existing tools.
For instance, md5sum -c being able to verify all records starting with 'md5:'.
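Until such support exists, verifying just the 'md5:' records with today's tools could be approximated like this (a sketch; SUMS and the test file are arbitrary names):

```shell
# Build a small multisum-style file with standard tools, then verify
# only its md5: records by stripping the prefix for md5sum -c.
mkdir -p test-data
printf 'a b\n' > 'test-data/a b'
md5sum -b 'test-data/a b'    | sed 's/^/md5:/'    >  SUMS
sha256sum -b 'test-data/a b' | sed 's/^/sha256:/' >> SUMS
grep '^md5:' SUMS | sed 's/^md5://' | md5sum -c -
```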
Note newer coreutils checksum utils already support the --tag option to identify
the checksum used, and we reused the BSD format here to gain extra compat.
$ tee >(md5sum --tag) >(sha1sum --tag) < /etc/passwd >/dev/null
SHA1 (-) = 30529b9c1622452b4488f229e7f8d36cc49579ba
MD5 (-) = 6d8d8033d929f93998c08a30c92a5b8d
Using tee like the above does have disadvantages:
1. A redundant write to /dev/null. (Note you can't really pipe to another checksum util, as then the output from the previous utils would go to it too; the pipe is set up before the coprocesses.)
2. This doesn't support multiple files well, since the file name isn't output,
and it would also be one process per file, which wastes CPU.
p.s. I just fixed the "s/three/four/" issue you mentioned.
(In reply to Pádraig Brady from comment #1)
> 2. This doesn't support multiple files well, since the file name isn't output
If I wanted to use this approach for more files, I would need to compute the checksums per file, call sed to inject the correct file names, and merge the results into one file.
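That per-file bookkeeping might look like the following sketch (multisum.txt is an assumed output name; note this sequential version reads each file twice, where the tee fan-out would read it once):

```shell
# Checksum each file from stdin (so the tools print "-" as the name),
# then use sed to substitute the real path and merge into one list.
mkdir -p test-data
printf 'ab\n'  > test-data/ab
printf 'a b\n' > 'test-data/a b'
: > multisum.txt
for f in test-data/*; do
    md5sum  --tag < "$f" | sed "s|(-)|($f)|" >> multisum.txt
    sha1sum --tag < "$f" | sed "s|(-)|($f)|" >> multisum.txt
done
cat multisum.txt
```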
Note if you were going to have a multisum util, then you would really need to do the reading in one process and the checksumming in other processes to take advantage of multiple cores.
Now you could get much of that implicitly with separate checksum utilities (processes) and file caching to avoid the multiple IO overhead.
But you would also have to be careful that you wouldn't have multiple processes
fighting over a disk head for example.
Now this sort of processing is not specific to the coreutils checksumming
utilities, and is therefore probably best handled outside of them if possible.
To illustrate how you might split up file processing across CPUs
while taking advantage of the cache, consider the following xargs command,
which is tuned for 2 CPUs (md5sum & sha1sum). The runs are batched in
groups of 10 so that later files don't evict yet-to-be-processed data from the cache.
If you have more CPUs then you could add a -P2 or whatever to the xargs command.
Note also you may want to change the '&' to a ';' in the command below
if you had a mechanical disk rather than an SSD, to avoid multiple
processes fighting over a disk head.
$ seq 20 | xargs -n10 sh -c 'echo md5sum "$0" "$@" & echo sha1sum "$0" "$@"'
sha1sum 1 2 3 4 5 6 7 8 9 10
md5sum 1 2 3 4 5 6 7 8 9 10
sha1sum 11 12 13 14 15 16 17 18 19 20
md5sum 11 12 13 14 15 16 17 18 19 20
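With the echoes dropped, the same batching runs the real checksums (a sketch; it assumes the test-data files from earlier, and uses find -print0 with xargs -0 so names with spaces survive):

```shell
# Batch the file list in groups of 10; md5sum is backgrounded with '&'
# so both checksummers work on the same (cached) batch concurrently.
mkdir -p test-data
printf 'ab\n'  > test-data/ab
printf 'a b\n' > 'test-data/a b'
find test-data -type f -print0 |
    xargs -0 -n10 sh -c 'md5sum "$0" "$@" & sha1sum "$0" "$@"; wait'
```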
Note also the GNU parallel command for dividing up such workloads.
So currently I don't think a separate utility is required for this.
p.s. It's best to first broach questions like this upstream at
email@example.com; I'll copy some of this discussion there for posterity.