Bug 504579

Summary: [RFE]: re-use hashes for files with the same device and inode numbers
Product: [Fedora] Fedora
Reporter: Daniel Mach <dmach>
Component: coreutils
Assignee: Ondrej Vasik <ovasik>
Status: CLOSED WONTFIX
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: low
Priority: low
Version: rawhide
CC: kdudka, ovasik, twaugh
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2012-12-11 07:24:39 EST

Description Daniel Mach 2009-06-08 06:50:40 EDT
When md5sum is run on a bunch of files, some of them might be hardlinked, and I don't see any reason to compute the hash of the same content again.

It should be possible to implement hash caching without significant code change:
- create a new function which will handle (dev, ino) -> hash cache
- call it instead of original function
- on cache hit, return cached hash
- otherwise call the original function to compute the hash, store it in the cache, and return it
Comment 1 Ondrej Vasik 2009-06-08 10:31:21 EDT
It's quite easy to handle this with a short wrapper shell script: get each file's device and inode (ls/find/stat/whatever), sort by device and inode, and call md5/shaxxxsum only when the device and inode differ from the previous file. Because the list is sorted, hardlinks with the same sum are adjacent, so you only need to remember the last file. At the moment the md5/shaxxxsum utilities do not call stat(2), just fopen(3), and AFAIK the file descriptor structure holds no information about the dev/inode of the file. So IMHO adding a cache, stat calls and dynamic memory allocation would make the compact code of md5sum.c much more difficult to read, for quite a small benefit to common users. I'll check the upstream opinion before working on it...
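
For illustration, the sort-and-compare idea from this comment might look like the sketch below. It is written in Python rather than as the shell script the comment has in mind, and the function names `md5_of_file` and `md5_skip_hardlinks` are hypothetical.

```python
import hashlib
import os

def md5_of_file(path, chunk_size=65536):
    """Hash the file contents, reading in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def md5_skip_hardlinks(paths):
    """Return {path: digest}, hashing each (device, inode) pair only once."""
    keyed = []
    for p in paths:
        st = os.stat(p)
        keyed.append((st.st_dev, st.st_ino, p))
    # After sorting, hardlinks to the same file are adjacent.
    keyed.sort()

    results = {}
    last_key = None
    last_digest = None
    for dev, ino, path in keyed:
        if (dev, ino) != last_key:
            # New file: compute the hash and remember only this one entry.
            last_digest = md5_of_file(path)
            last_key = (dev, ino)
        results[path] = last_digest
    return results
```

Only the previous (device, inode) pair and its digest are kept, so no cache structure is needed, at the cost of producing results in sorted rather than command-line order.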