Bug 1017046 - grep -f uses lot of memory
Summary: grep -f uses lot of memory
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: grep
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Jaroslav Škarvada
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2013-10-09 08:12 UTC by Jens Petersen
Modified: 2013-10-09 09:38 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-10-09 09:38:09 UTC
Type: Bug



Description Jens Petersen 2013-10-09 08:12:04 UTC
Description of problem:
grep -f seems to use many gigabytes of memory for a large pattern file.
The resources used seem far greater than the size of the inputs.

Steps to Reproduce:
$ koji list-tagged f19 | tail -n +3 | awk '{print $1}' > f19
$ koji list-tagged f19-updates | tail -n +3 | awk '{print $1}' > f19-updates
$ wc -lc f19 f19-updates
 13606 371784 f19
  4176 110596 f19-updates
$ grep -f f19-updates f19

Actual results:
Uses many gigabytes of RAM: I killed the process when it reached 10 GB.

Expected results:
Memory usage and run time that stay in proportion to the size of the inputs.

Comment 1 Jaroslav Škarvada 2013-10-09 08:39:31 UTC
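For your use case, you should use fixed-string matching (-F) instead:
$ grep -Ff f19-updates f19

Without -F, every line of the pattern file is treated as a regular expression. Compiling and running a regex of about 112 kB that includes wildcards (each '.' matches any character) can be really expensive.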
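
To illustrate the difference between the two modes, here is a minimal sketch on made-up data. The three package-like names and the temporary directory are assumptions for illustration only; they are not taken from the real koji output.

```shell
# Hypothetical sample data standing in for the koji package lists.
tmp=$(mktemp -d)
printf '%s\n' 'foo-1.0-1.fc19' 'foo-1x0-1.fc19' 'bar-2.1-3.fc19' > "$tmp/f19"
printf '%s\n' 'foo-1.0-1.fc19' > "$tmp/f19-updates"

# Default (basic regex) mode: each '.' in the pattern matches any character,
# so the look-alike line 'foo-1x0-1.fc19' is counted as well.
regex_hits=$(grep -c -f "$tmp/f19-updates" "$tmp/f19")

# -F treats every pattern line as a fixed string, so only the literal name matches.
fixed_hits=$(grep -c -F -f "$tmp/f19-updates" "$tmp/f19")

echo "regex: $regex_hits, fixed: $fixed_hits"
rm -rf "$tmp"
```

On this sample the regex run counts two lines and the fixed-string run counts one. Note that -F still matches substrings; if only whole-line matches are wanted, adding -x may be worth considering.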

Comment 2 Jens Petersen 2013-10-09 09:16:11 UTC
Ah yes good point.  Okay - dunno if there is anything more that can be done to improve the efficiency - I suppose I was wondering why grep keeps it all
in memory but that is surely something for upstream.  Likely this can be
closed..

Comment 3 Jaroslav Škarvada 2013-10-09 09:37:14 UTC
(In reply to Jens Petersen from comment #2)
> Ah yes good point.  Okay - dunno if there is anything more that can be done
> to improve the efficiency - I suppose I was wondering why grep keeps it all
> in memory but that is surely something for upstream.  Likely this can be
> closed..

I guess it is not worth reading the data from disk in parts and/or making multiple passes over it - there could be a big performance penalty.

Also, if you need regex matching, you can escape the dots in your pattern file so that they no longer act as wildcards, which significantly reduces the size of the problem:

$ sed -i 's/\./\\\./g' f19-updates

Then it required less than 2 GB of resident memory on my box.
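
As a rough sketch of what the substitution does, the following applies the same sed expression to a made-up one-line pattern file, writing to a new file rather than editing in place. The sample names and the .escaped file name are assumptions for illustration, and only dots are handled; other regex metacharacters in a pattern file would need their own treatment.

```shell
# Hypothetical sample data; not the real koji lists.
tmp=$(mktemp -d)
printf '%s\n' 'foo-1.0-1.fc19' 'foo-1x0-1.fc19' > "$tmp/f19"
printf '%s\n' 'foo-1.0-1.fc19' > "$tmp/f19-updates"

# Same substitution as above, but written to a new file so the original is kept.
sed 's/\./\\\./g' "$tmp/f19-updates" > "$tmp/f19-updates.escaped"
escaped=$(cat "$tmp/f19-updates.escaped")

# With the dots escaped, regex mode no longer matches the look-alike line.
matches=$(grep -f "$tmp/f19-updates.escaped" "$tmp/f19")

printf 'pattern: %s\nmatches: %s\n' "$escaped" "$matches"
rm -rf "$tmp"
```

Checking the escaped file on a small sample like this before running sed -i on the real pattern file is a cheap way to confirm the quoting behaves as intended in your shell.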

Closing as notabug according to previous comments.

