Bug 1017046

Summary: grep -f uses a lot of memory
Product: Fedora
Component: grep
Version: rawhide
Status: CLOSED NOTABUG
Reporter: Jens Petersen <petersen>
Assignee: Jaroslav Škarvada <jskarvad>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: jskarvad, lkundrak
Severity: unspecified
Priority: unspecified
Type: Bug
Last Closed: 2013-10-09 09:38:09 UTC

Description Jens Petersen 2013-10-09 08:12:04 UTC
Description of problem:
grep -f seems to use many gigabytes of memory for a large input file.
The resources used seem far greater than the size of the inputs.

Steps to Reproduce:
$ koji list-tagged f19 | tail -n +3 | awk '{print $1}' > f19
$ koji list-tagged f19-updates | tail -n +3 | awk '{print $1}' > f19-updates
$ wc -lc f19 f19-updates
  13606  371784 f19
   4176  110596 f19-updates
  17782  482380 total
$ grep -f f19-updates f19

Actual results:
Uses many gigabytes of RAM; I killed the process when it reached 10 GB.

Expected results:
Memory usage roughly proportional to the size of the inputs, and stable over time.

Comment 1 Jaroslav Škarvada 2013-10-09 08:39:31 UTC
For your use case, you should instead use fixed-string matching:
$ grep -Ff f19-updates f19

Compiling and running an approximately 112 kB regex that contains wildcards can be really expensive.
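The difference is visible even on tiny inputs: without -F the unescaped dots in the pattern file are compiled as regex wildcards, while -F treats every pattern as a literal string. A minimal sketch (the file names and package strings here are made up for illustration):

```shell
# One NVR-like pattern containing an unescaped dot
printf 'foo-1.0-1\n' > patterns.txt
# Two candidate lines; the first differs only where the dot is
printf 'foo-1x0-1\nfoo-1.0-1\n' > input.txt

# As a regex, '.' matches any character, so BOTH lines match
grep -f patterns.txt input.txt

# With -F the pattern is a fixed string, so only the literal line matches
grep -Ff patterns.txt input.txt
```

Since the patterns from koji are plain package names, not regexes, -F gives the same answer while letting grep use a much cheaper fixed-string matcher.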

Comment 2 Jens Petersen 2013-10-09 09:16:11 UTC
Ah, yes, good point. Okay, I don't know if there is anything more that can be done to improve the efficiency. I suppose I was wondering why grep keeps it all in memory, but that is surely something for upstream. This can likely be closed.

Comment 3 Jaroslav Škarvada 2013-10-09 09:37:14 UTC
(In reply to Jens Petersen from comment #2)

I guess it is not worth reading the data from the disk partially and/or making multiple passes; there could be a big performance penalty.

Also, if you need regex matching, you can escape the dots in your pattern file to significantly reduce the size of the problem:

$ sed -i 's/\./\\\./g' f19-updates

Then it required less than 2 GB of resident memory on my box.
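The sed call above rewrites every `.` in the pattern file as `\.`, so each dot matches only a literal dot rather than any character. A small sketch of the effect, using a throwaway pattern file and a made-up package string:

```shell
# Write one package-style pattern, then escape its dots in place
printf 'bash-4.2.45-1.fc19\n' > pat.txt
sed -i 's/\./\\\./g' pat.txt
cat pat.txt   # bash-4\.2\.45-1\.fc19

# The escaped pattern still matches the literal string...
printf 'bash-4.2.45-1.fc19\n' | grep -f pat.txt

# ...but no longer matches a line where a dot position holds
# some other character, so the regex engine has far less to explore
printf 'bash-4x2.45-1.fc19\n' | grep -f pat.txt || echo 'no match'
```

Each unescaped dot is a wildcard the matcher must consider at every position; escaping them turns the patterns into near-literal strings, which is why the memory footprint drops so sharply.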

Closing as NOTABUG per the previous comments.