Bug 1025675

Summary: RFE: 'multisum' - read file once, compute multiple checksums
Product: [Fedora] Fedora
Reporter: Daniel Mach <dmach>
Component: coreutils
Assignee: Ondrej Vasik <ovasik>
Status: CLOSED WONTFIX
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: rawhide
CC: admiller, kdudka, kzak, ooprala, ovasik, pbrady, p, twaugh
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-11-01 16:30:26 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Daniel Mach 2013-11-01 09:02:58 UTC
In some cases we want to compute multiple checksums at once.
CPU is frequently cheaper than IO and this would allow us to read each file just once.

We currently use cat | tee >(md5sum) >(sha256sum) etc., but that's not exactly user-friendly and I'd like to replace it with a better tool.
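
To make the "read once, hash many times" idea concrete, here is a minimal Python sketch. It is not part of coreutils and not the proposed implementation; the function name multi_digest and its parameters are made up for illustration. Each chunk read from disk is fed to every hash object before the next chunk is read.

```python
import hashlib


def multi_digest(path, algorithms=("md5", "sha256"), chunk_size=1 << 20):
    """Read `path` once and return {algorithm: hex digest}.

    Hypothetical helper: the name and signature are assumptions.
    """
    # One hash object per requested checksum type.
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as f:
        # Each chunk is read from disk once and fed to every hasher.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            for h in hashers.values():
                h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}
```

Calling multi_digest("test-data/ab") would return a dict with one entry per requested checksum type, at the cost of a single pass over the file.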


Here's my proposal for 'multisum' behaviour:

Usage
-----
  -b, --binary         read in binary mode
  -c, --check          read checksums from the FILEs and check them
  -t, --text           read in text mode (default)
  Note: There is no difference between binary and text mode option on GNU system.
  -s, --checksum       checksum type
  Note: Checksum type can be specified multiple times, but at least once.
        If specified when verifying checksums, only checksums for given
        checksum types will be verified.
# display list of supported checksum types?

The following three options are useful only when verifying checksums:
# *three* options? I see 4 :)
      --quiet          don't print OK for each successfully verified file
      --status         don't output anything, status code shows success
  -w, --warn           warn about improperly formatted checksum lines
      --strict         with --check, exit non-zero for any invalid input
      --help     display this help and exit
      --version  output version information and exit
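
For illustration only, the option list above could be prototyped with Python's argparse as below. This is a sketch of the proposed interface, not an existing tool: the program name, help strings and the choice to validate checksum types against hashlib.algorithms_available are assumptions. Printing that set would also be one way to answer the "display list of supported checksum types?" question.

```python
import argparse
import hashlib


def build_parser():
    """Build a parser mirroring the proposed 'multisum' options (sketch)."""
    p = argparse.ArgumentParser(prog="multisum")
    p.add_argument("-b", "--binary", action="store_true",
                   help="read in binary mode")
    p.add_argument("-c", "--check", action="store_true",
                   help="read checksums from the FILEs and check them")
    p.add_argument("-t", "--text", action="store_true",
                   help="read in text mode (default)")
    # May be given several times, but at least once.
    p.add_argument("-s", "--checksum", action="append", required=True,
                   choices=sorted(hashlib.algorithms_available),
                   metavar="TYPE", help="checksum type")
    p.add_argument("--quiet", action="store_true",
                   help="don't print OK for each successfully verified file")
    p.add_argument("--status", action="store_true",
                   help="don't output anything, status code shows success")
    p.add_argument("-w", "--warn", action="store_true",
                   help="warn about improperly formatted checksum lines")
    p.add_argument("--strict", action="store_true",
                   help="with --check, exit non-zero for any invalid input")
    p.add_argument("--version", action="version", version="multisum (sketch)")
    p.add_argument("files", nargs="*", metavar="FILE")
    return p
```

With this parser, "multisum -b -s sha256 -s md5 a b" yields checksum == ["sha256", "md5"], and omitting -s is rejected.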


File format
-----------
\?<checksum_type>:<checksum> [* ]<path>

records should be:
* grouped by path
* ordered by checksum type within one group
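
A parser for this line format might look like the following Python sketch. The regular expression and the names Record, unescape and parse_line are assumptions; the escape handling (a leading backslash marks a name in which "\\" stands for a backslash and "\n" for a newline) follows the md5sum convention shown in the examples below.

```python
import re
from typing import NamedTuple, Optional

# \?<checksum_type>:<checksum> [* ]<path>
LINE_RE = re.compile(r"^(\\?)([A-Za-z0-9]+):([0-9A-Fa-f]+) ([* ])(.+)$")


class Record(NamedTuple):
    algorithm: str
    digest: str
    binary: bool
    path: str


def unescape(name: str) -> str:
    """Undo md5sum-style escaping: '\\\\' -> backslash, '\\n' -> newline."""
    out, i = [], 0
    while i < len(name):
        if name[i] == "\\":
            nxt = name[i + 1:i + 2]
            if nxt == "\\":
                out.append("\\")
            elif nxt == "n":
                out.append("\n")
            else:
                raise ValueError(f"bad escape in {name!r}")
            i += 2
        else:
            out.append(name[i])
            i += 1
    return "".join(out)


def parse_line(line: str) -> Optional[Record]:
    """Return a Record, or None if the line does not match the format."""
    m = LINE_RE.match(line.rstrip("\n"))
    if m is None:
        return None
    escaped, algorithm, digest, mode, path = m.groups()
    if escaped:
        path = unescape(path)
    return Record(algorithm, digest.lower(), mode == "*", path)
```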

Example md5sum:
mkdir -p test-data; for i in "ab" "a b" 'a\b' "a\nb"; do echo "$i" > "test-data/$i"; done; md5sum -b test-data/*
7557d2f3a6ad1a3a8ebd23a94ab0c642 *test-data/a b
daa8075d6ac5ff8d0c6d4650adb4ef29 *test-data/ab
\537c3478b5faa724bc71ed7fc1ac0f60 *test-data/a\\b
\ae2af30d93d6dbd7f6f62c775c038c60 *test-data/a\\nb


Multisum output would look like:
mkdir -p test-data; for i in "ab" "a b" 'a\b' "a\nb"; do echo "$i" > "test-data/$i"; done; multisum -b -s sha256 -s md5 test-data/*
md5:7557d2f3a6ad1a3a8ebd23a94ab0c642 *test-data/a b
sha256:01186fcf04b4b447f393e552964c08c7b419c1ad7a25c342a0b631b1967d3a27 *test-data/a b
md5:daa8075d6ac5ff8d0c6d4650adb4ef29 *test-data/ab
sha256:a63d8014dba891345b30174df2b2a57efbb65b4f9f09b98f245d1b3192277ece *test-data/ab
\md5:537c3478b5faa724bc71ed7fc1ac0f60 *test-data/a\\b
\sha256:eaba35b63f3a21c43bc4d579fa4ae0cd388ec8633c08e0a54859d07d33a0c487 *test-data/a\\b
\md5:ae2af30d93d6dbd7f6f62c775c038c60 *test-data/a\\nb
\sha256:d32170bf6e447af933ecabf6607a7e94b1fc35e01f71b618141aab849e00f45b *test-data/a\\nb
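
The output rules above (records grouped by path, ordered by checksum type within a group, names escaped md5sum-style) could be produced roughly as follows. This is a Python sketch with hypothetical names (escape_name, format_records); the input is assumed to be a mapping from path to {type: hex digest}.

```python
def escape_name(path):
    """Escape backslash and newline; report whether anything changed."""
    escaped = path.replace("\\", "\\\\").replace("\n", "\\n")
    return escaped, escaped != path


def format_records(digests_by_path, binary=True):
    """Return output lines: grouped by path, sorted by checksum type."""
    lines = []
    mode = "*" if binary else " "
    for path, digests in digests_by_path.items():
        name, was_escaped = escape_name(path)
        # A leading backslash flags a line whose file name was escaped.
        prefix = "\\" if was_escaped else ""
        for algorithm in sorted(digests):
            lines.append(f"{prefix}{algorithm}:{digests[algorithm]} {mode}{name}")
    return lines
```

Sorting the checksum types means the output order does not depend on the order of the -s options, which matches the example above where -s sha256 -s md5 still prints md5 first.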


You may also consider adding 'multisum' file format support to existing tools.
For instance md5sum -c being able to verify all records starting with 'md5:'
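
As a rough illustration of verifying only the records of a given type, consider the Python sketch below. It is an assumption-laden simplification, not how md5sum -c works: the name check is hypothetical, lines with escaped file names (leading backslash) are skipped, and the file is reread for every record.

```python
import hashlib
import re

# Unescaped records only: <checksum_type>:<checksum> [* ]<path>
RECORD = re.compile(r"^([A-Za-z0-9]+):([0-9A-Fa-f]+) [* ](.+)$")


def check(manifest_lines, only=None):
    """Yield (path, algorithm, ok) for records whose type is in `only`.

    `only=None` means every checksum type found in the manifest.
    """
    for line in manifest_lines:
        m = RECORD.match(line.rstrip("\n"))
        if m is None:
            continue  # not a record this sketch understands
        algorithm, expected, path = m.groups()
        if only is not None and algorithm not in only:
            continue  # e.g. an md5-only checker ignores 'sha256:' lines
        h = hashlib.new(algorithm)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        yield path, algorithm, h.hexdigest() == expected.lower()
```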

Comment 1 Pádraig Brady 2013-11-01 12:01:19 UTC
Note newer coreutils checksum utils already support the --tag option to identify
the checksum used, and we reused the BSD format here to gain extra compat.

$ tee >(md5sum --tag) >(sha1sum --tag) < /etc/passwd >/dev/null
SHA1 (-) = 30529b9c1622452b4488f229e7f8d36cc49579ba
MD5 (-) = 6d8d8033d929f93998c08a30c92a5b8d
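
The BSD-style lines shown above have the shape "ALGORITHM (NAME) = DIGEST". A small Python sketch for splitting such a line follows; the regular expression and the function name parse_tag_line are assumptions, and file names that were escaped by the tool are not handled.

```python
import re

# ALGORITHM (NAME) = HEXDIGEST, e.g. "MD5 (-) = 6d8d..."
TAG_RE = re.compile(r"^([A-Za-z0-9-]+) \((.*)\) = ([0-9A-Fa-f]+)$")


def parse_tag_line(line):
    """Return (algorithm, name, digest), or None if the line doesn't match."""
    m = TAG_RE.match(line.rstrip("\n"))
    return m.groups() if m else None
```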

Using tee like above does have disadvantages.
1. Redundant write to /dev/null. (Note you can't really pipe to another checksum util, as the output from the previous utils would then go to it: the pipe is set up before the coprocesses.)
2. This doesn't support multiple files well, since the file name isn't output,
and it would also be one process per file, which wastes CPU.

p.s. I just fixed the "s/three/four/" issue you mentioned in:
http://lists.gnu.org/archive/html/coreutils/2013-11/msg00000.html

Comment 2 Daniel Mach 2013-11-01 12:07:56 UTC
(In reply to Pádraig Brady from comment #1)
> 2. This doesn't support multiple files well, since the file name isn't output
Exactly.
If I want to use this approach to process more files, I need to compute checksums per file, call sed to inject the correct file names and merge results into one file.
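
The post-processing described here (inject the real file name in place of "-", then merge the per-file results) can be sketched in Python as below. The names inject_name and merge are hypothetical, and the input is assumed to be the --tag output that each checksum tool printed for data read from stdin.

```python
import re

# A --tag line computed from stdin: "ALGORITHM (-) = HEXDIGEST"
TAG_STDIN = re.compile(r"^([A-Za-z0-9-]+) \(-\) = ([0-9A-Fa-f]+)$")


def inject_name(tag_line, path):
    """Replace the '-' placeholder with the real file name."""
    m = TAG_STDIN.match(tag_line.strip())
    if m is None:
        raise ValueError(f"not a stdin --tag line: {tag_line!r}")
    return f"{m.group(1)} ({path}) = {m.group(2)}"


def merge(per_file_output):
    """Merge {path: [tag lines]} into one list, grouped and sorted by path."""
    merged = []
    for path in sorted(per_file_output):
        for line in sorted(per_file_output[path]):
            merged.append(inject_name(line, path))
    return merged
```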

Comment 3 Pádraig Brady 2013-11-01 13:05:32 UTC
Note if you were going to have a multisum util, then you would really need to be doing the reading in one process and the checksumming in other processes to take advantage of multicore.
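
A minimal sketch of that split, using Python threads instead of separate processes: one reader pushes each chunk onto a bounded queue per checksum type, and one worker per type consumes its queue. The names are hypothetical. Threads are used here on the assumption that hashlib releases the GIL while hashing larger buffers, so the workers can run on separate cores; a real utility would more likely use processes or native threads.

```python
import hashlib
import queue
import threading


def _worker(hasher, q, results, name):
    # Consume chunks until the None sentinel arrives.
    while True:
        chunk = q.get()
        if chunk is None:
            break
        hasher.update(chunk)
    results[name] = hasher.hexdigest()


def parallel_digest(path, algorithms=("md5", "sha1"),
                    chunk_size=1 << 20, depth=4):
    """Read `path` once in this thread; hash in one thread per algorithm."""
    # Create the hashers up front so an unknown algorithm fails early.
    hashers = {name: hashlib.new(name) for name in algorithms}
    queues = {name: queue.Queue(maxsize=depth) for name in algorithms}
    results = {}
    threads = [
        threading.Thread(target=_worker,
                         args=(hashers[n], queues[n], results, n))
        for n in algorithms
    ]
    with open(path, "rb") as f:
        for t in threads:
            t.start()
        try:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                for q in queues.values():
                    q.put(chunk)  # blocks if a worker falls behind
        finally:
            for q in queues.values():
                q.put(None)  # tell every worker to finish
            for t in threads:
                t.join()
    return results
```

The bounded queues keep the reader from running arbitrarily far ahead of the slowest checksum, which limits memory use.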

Now you could get much of that implicitly with separate checksum utilities (processes) and file caching to avoid the multiple IO overhead.

But you would also have to be careful that you wouldn't have multiple processes
fighting over a disk head for example.

Now this sort of processing is not specific to the coreutils checksumming
utilities, and therefore probably best handled outside if possible.

To illustrate how you might split up file processing across CPUs
while taking advantage of cache, consider the following xargs command,
which is tuned for 2 CPUs (md5sum & sha1sum). The runs are batched in
groups of 10 so that later files don't evict yet-to-be-processed files from cache.
If you have more CPUs then you could add a -P2 or whatever to the xargs command.
Note also you may want to change the '&' to a ';' in the command below
if you have a mechanical disk rather than an SSD, to avoid multiple
processes fighting over a disk head.

$ seq 20 | xargs -n10 sh -c 'echo md5sum "$0" "$@" & echo sha1sum "$0" "$@"'
sha1sum 1 2 3 4 5 6 7 8 9 10
md5sum 1 2 3 4 5 6 7 8 9 10
sha1sum 11 12 13 14 15 16 17 18 19 20
md5sum 11 12 13 14 15 16 17 18 19 20
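
The same batching idea, expressed as a Python sketch rather than xargs: files are handled in groups of 10, and within each group one task per checksum type hashes the whole group, so the second pass over a file is likely served from the page cache. All names are hypothetical; parallel=False plays the role of replacing '&' with ';'.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def hash_files(algorithm, paths, chunk_size=1 << 20):
    """Hash every file in `paths` with a single algorithm."""
    out = []
    for path in paths:
        h = hashlib.new(algorithm)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        out.append((algorithm, path, h.hexdigest()))
    return out


def checksum_in_batches(paths, algorithms=("md5", "sha1"),
                        batch_size=10, parallel=True):
    """Return (algorithm, path, digest) tuples, one batch at a time."""
    results = []
    workers = len(algorithms) if parallel else 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in batches(list(paths), batch_size):
            # One task per checksum type; wait for the whole batch before
            # moving on, so later files don't evict this batch from cache.
            futures = [pool.submit(hash_files, a, batch) for a in algorithms]
            for fut in futures:
                results.extend(fut.result())
    return results
```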

Note also the GNU parallel command for dividing up such workloads.

So currently I don't think a separate utility is required for this.

p.s. It's best to first broach questions like this upstream at
coreutils, and I'll copy some of this discussion there for posterity.