Red Hat Bugzilla – Bug 1025675
RFE: 'multisum' - read file once, compute multiple checksums
Last modified: 2013-11-01 12:30:26 EDT
In some cases we want to compute multiple checksums at once.
CPU is frequently cheaper than IO and this would allow us to read each file just once.
We currently use cat | tee >(md5sum) >(sha256sum) etc., but that's not exactly user-friendly, and I'd like to replace it with a better tool.
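For illustration, a minimal runnable version of that workaround (the file name sample.txt is just an example); both sums print '-' as the file name, since each coprocess reads from a pipe:

```shell
# Current workaround: read the file once and fan its contents out to
# two checksum coprocesses via bash process substitution.
printf 'hello\n' > sample.txt
tee >(md5sum) >(sha256sum) < sample.txt > /dev/null
```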
Here's my proposal for 'multisum' behaviour:
-b, --binary read in binary mode
-c, --check read checksums from the FILEs and check them
-t, --text read in text mode (default)
Note: There is no difference between the binary and text mode options on GNU systems.
-s, --checksum checksum type
Note: The checksum type can be specified multiple times, but must be given
at least once. When verifying checksums, only records for the given
checksum types will be verified.
# display list of supported checksum types?
The following three options are useful only when verifying checksums:
# *three* options? I see 4 :)
--quiet don't print OK for each successfully verified file
--status don't output anything, status code shows success
-w, --warn warn about improperly formatted checksum lines
--strict with --check, exit non-zero for any invalid input
--help display this help and exit
--version output version information and exit
Each record would have the format:
\?<checksum_type>:<checksum> [* ]<path>
Records should be:
* grouped by path
* ordered by checksum type within one group
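With two files, the grouping and ordering rules above would give output shaped like this (digests shown as placeholders; this only illustrates the layout):

```
md5:<md5 digest> *test-data/a b
sha256:<sha256 digest> *test-data/a b
md5:<md5 digest> *test-data/ab
sha256:<sha256 digest> *test-data/ab
```

i.e. all records for one path appear together, sorted by checksum type within the group.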
mkdir -p test-data; for i in "ab" "a b" 'a\b' "a\nb"; do echo "$i" > "test-data/$i"; done; md5sum -b test-data/*
7557d2f3a6ad1a3a8ebd23a94ab0c642 *test-data/a b
Multisum output would look like:
mkdir -p test-data; for i in "ab" "a b" 'a\b' "a\nb"; do echo "$i" > "test-data/$i"; done; multisum -b -s sha256 -s md5 test-data/*
md5:7557d2f3a6ad1a3a8ebd23a94ab0c642 *test-data/a b
sha256:01186fcf04b4b447f393e552964c08c7b419c1ad7a25c342a0b631b1967d3a27 *test-data/a b
You may also consider adding 'multisum' file format support to existing tools.
For instance, md5sum -c being able to verify all records starting with 'md5:'.
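Until such support exists, verifying just the 'md5:' records with today's tools could be approximated like this (a sketch; SUMS and the test file are arbitrary names):

```shell
# Build a small multisum-style file with standard tools, then verify
# only its md5: records by stripping the prefix for md5sum -c.
mkdir -p test-data
printf 'a b\n' > 'test-data/a b'
md5sum -b 'test-data/a b'    | sed 's/^/md5:/'    >  SUMS
sha256sum -b 'test-data/a b' | sed 's/^/sha256:/' >> SUMS
grep '^md5:' SUMS | sed 's/^md5://' | md5sum -c -
```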
Note newer coreutils checksum utils already support the --tag option to identify
the checksum used, and we reused the BSD format here to gain extra compat.
$ tee >(md5sum --tag) >(sha1sum --tag) < /etc/passwd >/dev/null
SHA1 (-) = 30529b9c1622452b4488f229e7f8d36cc49579ba
MD5 (-) = 6d8d8033d929f93998c08a30c92a5b8d
Using tee like the above does have disadvantages:
1. A redundant write to /dev/null. (Note you can't really pipe to another checksum util, as then the output from the previous utils would go to it too; the pipe is set up before the coprocesses.)
2. This doesn't support multiple files well, since the file name isn't output,
and it would also be one process per file, which wastes CPU.
p.s. I just fixed the "s/three/four/" issue you mentioned.
(In reply to Pádraig Brady from comment #1)
> 2. This doesn't support multiple files well, since the file name isn't output
If I wanted to use this approach for more files, I would need to compute the checksums per file, call sed to inject the correct file names, and merge the results into one file.
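That per-file bookkeeping might look like the following sketch (multisum.txt is an assumed output name; note this sequential version reads each file twice, where the tee fan-out would read it once):

```shell
# Checksum each file from stdin (so the tools print "-" as the name),
# then use sed to substitute the real path and merge into one list.
mkdir -p test-data
printf 'ab\n'  > test-data/ab
printf 'a b\n' > 'test-data/a b'
: > multisum.txt
for f in test-data/*; do
    md5sum  --tag < "$f" | sed "s|(-)|($f)|" >> multisum.txt
    sha1sum --tag < "$f" | sed "s|(-)|($f)|" >> multisum.txt
done
cat multisum.txt
```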
Note if you were going to have a multisum util, then you would really need to do the reading in one process and the checksumming in other processes to take advantage of multiple cores.
Now you could get much of that implicitly with separate checksum utilities (processes) and file caching to avoid the multiple IO overhead.
But you would also have to be careful that you wouldn't have multiple processes
fighting over a disk head for example.
Now this sort of processing is not specific to the coreutils checksumming
utilities, and is therefore probably best handled outside of them if possible.
To illustrate how you might split up file processing across CPUs
while taking advantage of the cache, consider the following xargs command,
which is tuned for 2 CPUs (md5sum & sha1sum). The runs are batched in
groups of 10 so that later files don't evict yet-to-be-processed data from the cache.
If you have more CPUs then you could add a -P2 or whatever to the xargs command.
Note also you may want to change the '&' to a ';' in the command below
if you had a mechanical disk rather than an SSD, to avoid multiple
processes fighting over a disk head.
$ seq 20 | xargs -n10 sh -c 'echo md5sum "$0" "$@" & echo sha1sum "$0" "$@"'
sha1sum 1 2 3 4 5 6 7 8 9 10
md5sum 1 2 3 4 5 6 7 8 9 10
sha1sum 11 12 13 14 15 16 17 18 19 20
md5sum 11 12 13 14 15 16 17 18 19 20
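With the echoes dropped, the same batching runs the real checksums (a sketch; it assumes the test-data files from earlier, and uses find -print0 with xargs -0 so names with spaces survive):

```shell
# Batch the file list in groups of 10; md5sum is backgrounded with '&'
# so both checksummers work on the same (cached) batch concurrently.
mkdir -p test-data
printf 'ab\n'  > test-data/ab
printf 'a b\n' > 'test-data/a b'
find test-data -type f -print0 |
    xargs -0 -n10 sh -c 'md5sum "$0" "$@" & sha1sum "$0" "$@"; wait'
```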
Note also the GNU parallel command for dividing up such workloads.
So currently I don't think a separate utility is required for this.
p.s. It's best to first broach questions like this upstream at
email@example.com; I'll copy some of this discussion there for posterity.