Bug 538423 - grep performance is terrible with UTF-8 locales (compared to C locale)
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: grep
Version: 12
Hardware: i386
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Assignee: Jaroslav Škarvada
QA Contact: Fedora Extras Quality Assurance
Reported: 2009-11-18 14:55 UTC by Maurizio Paolini
Modified: 2010-02-23 16:40 UTC

Fixed In Version: grep-2.5.4-1
Clone Of: 499220
Environment:
Last Closed: 2010-02-15 13:50:33 UTC
Type: ---

Description Maurizio Paolini 2009-11-18 14:55:55 UTC
+++ This bug was initially created as a clone of Bug #499220 +++

The problem initially reported as bug 499220 is still present in
Fedora 11, grep version "grep-2.5.3-4.fc11.i586"

The problem can be reproduced as follows:

--------------------------------------
$ for n in `seq 10000`
> do
>  echo "0" >>test.txt
> done
$ export LANG=en_US.UTF-8
$ time grep [01] test.txt >/dev/null

real    0m9.102s
user    0m8.419s
sys     0m0.021s
--------------------------------------

Without UTF-8, the result is fine:

$ export LANG=en_US
$ time grep [01] test.txt >/dev/null

real    0m0.018s
user    0m0.004s
sys     0m0.001s
--------------------------------------
The results are the same with "[0]" in place of
"[01]" as the regular expression.
This is a nasty bug because it could affect a lot of system scripts.

One note: the same grep command but without the '[' and ']'
does not have the problem:

$ export LANG=en_US.utf8
$ time grep 0 test.txt >/dev/null

real    0m0.009s
user    0m0.004s
sys     0m0.002s
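
The comparison above can be scripted for easier re-testing. The following
is only a sketch and not part of the original report: the function names
(make_test_file, count_in_locale, compare_locales) are made up for
illustration, it assumes the locales passed in are installed (check with
`locale -a`), and it assumes `date +%s` is available, which gives only
whole-second resolution (enough to tell a 9-second run from a sub-second
one). The pattern is quoted so the shell cannot expand [01] as a glob.

```shell
# make_test_file PATH [LINES]
# Write LINES lines containing "0" (10000 by default, as in the report).
make_test_file() {
    yes 0 | head -n "${2:-10000}" > "$1"
}

# count_in_locale LOCALE PATTERN FILE
# Run grep -c with LC_ALL set to LOCALE for this one command only.
count_in_locale() {
    LC_ALL="$1" grep -c -- "$2" "$3"
}

# compare_locales FILE LOCALE...
# Search FILE for the bracket expression [01] under each LOCALE and
# print the match count and the elapsed wall-clock seconds.
compare_locales() {
    file=$1
    shift
    for loc in "$@"; do
        start=$(date +%s)
        n=$(count_in_locale "$loc" '[01]' "$file")
        end=$(date +%s)
        echo "$loc: $n matching lines in $((end - start))s"
    done
}
```

For example: `make_test_file test.txt && compare_locales test.txt C en_US en_US.UTF-8`.
In bash, `time count_in_locale en_US.UTF-8 '[01]' test.txt` gives finer
timing for a single locale.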



--------------------

The RPM package appears to include, among others, the
patch "grep-2.5.3-egf-speedup.patch", which is the most relevant
one here.  It fixes some Unicode problems, but *not* the
one that I am reporting.

Here is an extract from that patch:
--- extract from grep-2.5.3-egf-speedup.patch ---
From aac37e1939632dbc7d2ade6f991af7ce103b0cba Mon Sep 17 00:00:00 2001
From: Tim Waugh <twaugh>
Date: Sun, 23 Nov 2008 17:30:59 +0100
Subject: [PATCH] EGF Speedup

The full story behind this patch is that grep-2.5.1a does not handle UTF-8 gracefully at all. The basic plan with handling UTF-8 in 2.5.1a is:

    * whenever a buffer is parsed, go through the entire buffer deciding how many bytes make up each character
    * use this information when necessary

This patch changes that to:

    * when information about how many bytes make up a character is needed, work it out on demand
[...]

Comment 1 Maurizio Paolini 2009-11-26 16:47:40 UTC
The problem is still present in Fedora 12; the description above applies
without any change.

Comment 2 Jaroslav Škarvada 2010-02-15 13:50:33 UTC
Fixed in grep-2.5.4-1 in rawhide.

Comment 3 Maurizio Paolini 2010-02-23 16:40:54 UTC
(In reply to comment #2)
> Fixed in grep-2.5.4-1 in rawhide.    

grep-2.5.4-1 solves the problem for me!  What I did:
- download the .src.rpm from Fedora 13
- rpmbuild -ba ...
- rpm -U ...
- test with a UTF-8 locale --> OK

Thank you very much!

