Bug 1025675 - RFE: 'multisum' - read file once, compute multiple checksums
Summary: RFE: 'multisum' - read file once, compute multiple checksums
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: coreutils
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Ondrej Vasik
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-11-01 09:02 UTC by Daniel Mach
Modified: 2013-11-01 16:30 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-11-01 16:30:26 UTC
Type: Bug
Embargoed:



Description Daniel Mach 2013-11-01 09:02:58 UTC
In some cases we want to compute multiple checksums at once.
CPU is frequently cheaper than I/O, and this would allow us to read each file just once.

We currently use cat | tee >(md5sum) >(sha256sum) etc., but that's not exactly user-friendly, and I'd like to replace it with a better tool.
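
Spelled out for a single file, the pattern is roughly the following (FILE is
just a placeholder, and the order of the two output lines is not deterministic):

$ cat FILE | tee >(md5sum) >(sha256sum) > /dev/null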


Here's my proposal for 'multisum' behaviour:

Usage
-----
  -b, --binary         read in binary mode
  -c, --check          read checksums from the FILEs and check them
  -t, --text           read in text mode (default)
  Note: There is no difference between the binary and text mode options on GNU systems.
  -s, --checksum       checksum type
  Note: Checksum type can be specified multiple times, but at least once.
        If specified when verifying checksums, only checksums for given
        checksum types will be verified.
# display list of supported checksum types?

The following three options are useful only when verifying checksums:
# *three* options? I see 4 :)
      --quiet          don't print OK for each successfully verified file
      --status         don't output anything, status code shows success
  -w, --warn           warn about improperly formatted checksum lines
      --strict         with --check, exit non-zero for any invalid input
      --help     display this help and exit
      --version  output version information and exit
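
A verification run with the proposed tool might then look something like the
following (purely illustrative: 'multisum' does not exist, CHECKSUMS is a
hypothetical checksum file in the format below, and the output simply follows
the md5sum -c convention of one "OK" line per verified file):

$ multisum -c -s md5 -s sha256 CHECKSUMS
test-data/a b: OK
test-data/ab: OK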


File format
-----------
\?<checksum_type>:<checksum> [* ]<path>

records should be:
* grouped by path
* ordered by checksum type within one group

Example md5sum:
mkdir -p test-data; for i in "ab" "a b" 'a\b' "a\nb"; do echo "$i" > "test-data/$i"; done; md5sum -b test-data/*
7557d2f3a6ad1a3a8ebd23a94ab0c642 *test-data/a b
daa8075d6ac5ff8d0c6d4650adb4ef29 *test-data/ab
\537c3478b5faa724bc71ed7fc1ac0f60 *test-data/a\\b
\ae2af30d93d6dbd7f6f62c775c038c60 *test-data/a\\nb


Multisum output would look like:
mkdir -p test-data; for i in "ab" "a b" 'a\b' "a\nb"; do echo "$i" > "test-data/$i"; done; multisum -b -s sha256 -s md5 test-data/*
md5:7557d2f3a6ad1a3a8ebd23a94ab0c642 *test-data/a b
sha256:01186fcf04b4b447f393e552964c08c7b419c1ad7a25c342a0b631b1967d3a27 *test-data/a b
md5:daa8075d6ac5ff8d0c6d4650adb4ef29 *test-data/ab
sha256:a63d8014dba891345b30174df2b2a57efbb65b4f9f09b98f245d1b3192277ece *test-data/ab
\md5:537c3478b5faa724bc71ed7fc1ac0f60 *test-data/a\\b
\sha256:eaba35b63f3a21c43bc4d579fa4ae0cd388ec8633c08e0a54859d07d33a0c487 *test-data/a\\b
\md5:ae2af30d93d6dbd7f6f62c775c038c60 *test-data/a\\nb
\sha256:d32170bf6e447af933ecabf6607a7e94b1fc35e01f71b618141aab849e00f45b *test-data/a\\nb


You may also consider adding 'multisum' file format support to existing tools.
For instance, md5sum -c could verify all records starting with 'md5:'.
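
Until then, something along these lines could extract the md5 records from a
file in the proposed format and feed them to md5sum -c (a rough sketch only;
CHECKSUMS is a hypothetical file, and the leading-backslash escaping of unusual
file names is ignored):

$ sed -n 's/^md5://p' CHECKSUMS | md5sum -c -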

Comment 1 Pádraig Brady 2013-11-01 12:01:19 UTC
Note that newer coreutils checksum utils already support the --tag option to identify
the checksum used, and we reused the BSD format here to gain extra compatibility.

$ tee >(md5sum --tag) >(sha1sum --tag) < /etc/passwd >/dev/null
SHA1 (-) = 30529b9c1622452b4488f229e7f8d36cc49579ba
MD5 (-) = 6d8d8033d929f93998c08a30c92a5b8d

Using tee like above does have disadvantages:
1. Redundant write to /dev/null. (Note you can't really pipe to another checksum
util, as the output from the previous utils would then go into that pipe; the
pipe is set up before the coprocesses.)
2. This doesn't support multiple files well, since the file name isn't output,
and also it would be one process per file, which wastes CPU.

p.s. I just fixed the "s/three/four/" issue you mentioned in:
http://lists.gnu.org/archive/html/coreutils/2013-11/msg00000.html

Comment 2 Daniel Mach 2013-11-01 12:07:56 UTC
(In reply to Pádraig Brady from comment #1)
> 2. This doesn't support multiple files well, since the file name isn't output
Exactly.
If I want to use this approach to process multiple files, I need to compute the checksums per file, call sed to inject the correct file names, and merge the results into one file.
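
For the record, the per-file workaround looks roughly like this (a sketch only,
using the --tag output from comment 1; it assumes file names free of characters
special to sed such as '|' or '&'):

$ for f in test-data/*; do
    tee >(md5sum --tag) >(sha256sum --tag) < "$f" > /dev/null |
      sed "s|(-)|($f)|"        # replace the "-" placeholder with the file name
  done > CHECKSUMS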

Comment 3 Pádraig Brady 2013-11-01 13:05:32 UTC
Note if you were going to have a multisum util, then you would really need to be doing the reading in one process and the checksumming in other processes to take advantage of multicore.

Now you could get much of that implicitly with separate checksum utilities (processes) and file caching to avoid the multiple IO overhead.

But you would also have to be careful that you don't end up with multiple processes
fighting over a disk head, for example.

Now this sort of processing is not specific to the coreutils checksumming
utilities, and is therefore probably best handled outside them if possible.

To illustrate how you might split up file processing across CPUs
while taking advantage of the cache, consider the following xargs command,
tuned here for 2 CPUs (md5sum & sha1sum). The runs are batched in
groups of 10 so that later files don't evict yet-to-be-processed files from the cache.
If you have more CPUs then you could add a -P2 or whatever to the xargs command.
Note also you may want to change the '&' to a ';' in the command below
if you have a mechanical disk rather than an SSD, to avoid multiple
processes fighting over a disk head.

$ seq 20 | xargs -n10 sh -c 'echo md5sum "$0" "$@" & echo sha1sum "$0" "$@"'
sha1sum 1 2 3 4 5 6 7 8 9 10
md5sum 1 2 3 4 5 6 7 8 9 10
sha1sum 11 12 13 14 15 16 17 18 19 20
md5sum 11 12 13 14 15 16 17 18 19 20

Note also the GNU parallel command for dividing up such workloads.
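
For example (an untested sketch), something like the following computes both
sums immediately after each file is read, so the data is likely still in the
page cache, while parallel spreads the per-file jobs across cores:

$ parallel 'md5sum {}; sha1sum {}' ::: test-data/*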

So currently I don't think a separate utility is required for this.

p.s. It's best to first broach questions like this upstream with
coreutils, and I'll copy some of this discussion there for posterity.

