Bug 538423 - grep performance is terrible with UTF-8 locales (compared to C locale)
Status: CLOSED RAWHIDE
Product: Fedora
Classification: Fedora
Component: grep
Version: 12
Hardware: i386 Linux
Priority: low
Severity: high
Assigned To: Jaroslav Škarvada
QA Contact: Fedora Extras Quality Assurance
Reported: 2009-11-18 09:55 EST by Maurizio Paolini
Modified: 2010-02-23 11:40 EST
Fixed In Version: grep-2.5.4-1
Doc Type: Bug Fix
Clone Of: 499220
Last Closed: 2010-02-15 08:50:33 EST


Description Maurizio Paolini 2009-11-18 09:55:55 EST
+++ This bug was initially created as a clone of Bug #499220 +++

The problem initially reported as bug 499220 is still present in
Fedora 11, grep version "grep-2.5.3-4.fc11.i586"

The problem can be reproduced as follows:

--------------------------------------
$ for n in `seq 10000`
> do
>  echo "0" >>test.txt
>done
$ export LANG=en_US.UTF-8
$ time grep [01] test.txt >/dev/null

real    0m9.102s
user    0m8.419s
sys     0m0.021s
--------------------------------------

whereas without a UTF-8 locale the result is fine:

--------------------------------------
$ export LANG=en_US
$ time grep [01] test.txt >/dev/null

real    0m0.018s
user    0m0.004s
sys     0m0.001s
--------------------------------------
The timings are the same with "[0]" in place of
"[01]" as the regular expression.
This is a nasty bug because it could affect a lot of system scripts.

One note: the same grep command but without the '[' and ']'
does not have the problem:

$ export LANG=en_US.utf8
$ time grep 0 test.txt >/dev/null

real    0m0.009s
user    0m0.004s
sys     0m0.002s
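
The comparisons above can be collected into a single script. This is only a minimal sketch: the temporary file, the helper names `count_matches` and `timed_run`, and the choice of locales are assumptions for illustration, and it presumes the `en_US.UTF-8` locale is installed. The pattern is quoted so that the shell cannot expand `[01]` as a glob, and `date +%s` gives only whole seconds, which is still enough to show a gap as large as the one reported.

```shell
#!/bin/sh
# Sketch only: file name, helper names and locale list are illustrative.

testfile=$(mktemp)
trap 'rm -f "$testfile"' EXIT

# 10000 lines containing "0", equivalent to the loop in the report.
awk 'BEGIN { for (i = 0; i < 10000; i++) print "0" }' > "$testfile"

# count_matches LOCALE PATTERN: number of matching lines under LOCALE.
count_matches() {
    env LC_ALL="$1" grep -c -e "$2" "$testfile"
}

# timed_run LOCALE PATTERN: print the match count and elapsed whole seconds.
timed_run() {
    start=$(date +%s)
    n=$(count_matches "$1" "$2")
    end=$(date +%s)
    echo "locale=$1 pattern=$2 matches=$n seconds=$((end - start))"
}

for loc in C en_US.UTF-8; do
    for pat in '[01]' '[0]' '0'; do
        timed_run "$loc" "$pat"
    done
done
```

On an affected grep, only the bracket expressions under the UTF-8 locale should stand out.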



--------------------

The rpm package includes, among others, the patch
"grep-2.5.3-egf-speedup.patch", which appears to be the one most
relevant here.  It fixes some Unicode problems, but *not* the
one that I am reporting.

Here is an extract from that patch:
--- extract from grep-2.5.3-egf-speedup.patch ---
From aac37e1939632dbc7d2ade6f991af7ce103b0cba Mon Sep 17 00:00:00 2001
From: Tim Waugh <twaugh@redhat.com>
Date: Sun, 23 Nov 2008 17:30:59 +0100
Subject: [PATCH] EGF Speedup

The full story behind this patch is that grep-2.5.1a does not handle UTF-8 gracefully at all. The basic plan with handling UTF-8 in 2.5.1a is:

    * whenever a buffer is parsed, go through the entire buffer deciding how many bytes make up each character
    * use this information when necessary

This patch changes that to:

    * when information about how many bytes make up a character is needed, work it out on demand
[...]
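
The two strategies described in the extract can be illustrated with a toy shell sketch. This is not code from the patch (grep itself is written in C); the function names are hypothetical, the input is assumed to be valid UTF-8, and it is far too slow for real use. `eager_lengths` works out the byte length of every character in the buffer up front, while `is_boundary` walks only as far as the offset it is asked about.

```shell
#!/bin/sh
# Illustration only, not code from the patch. Assumes valid UTF-8 input.

# lead_len BYTE: sequence length implied by a UTF-8 lead byte (decimal).
# Stray continuation bytes count as 1 so that a scan always advances.
lead_len() {
    if   [ "$1" -lt 192 ]; then echo 1
    elif [ "$1" -lt 224 ]; then echo 2
    elif [ "$1" -lt 240 ]; then echo 3
    else echo 4
    fi
}

# byte_values STRING: decimal value of each byte, whitespace separated.
byte_values() {
    printf '%s' "$1" | od -An -v -tu1
}

# eager_lengths STRING: byte length of every character, computed up front.
eager_lengths() {
    set -- $(byte_values "$1")
    out=
    while [ $# -gt 0 ]; do
        n=$(lead_len "$1")
        if [ "$n" -gt $# ]; then n=$#; fi   # truncated final sequence
        out="$out$n "
        shift "$n"
    done
    echo "${out% }"
}

# is_boundary STRING OFFSET: on demand; succeeds if byte OFFSET starts a
# character (or is the end of the string), scanning only up to OFFSET.
is_boundary() {
    s=$1
    target=$2
    set -- $(byte_values "$s")
    pos=0
    while [ "$pos" -lt "$target" ] && [ $# -gt 0 ]; do
        n=$(lead_len "$1")
        if [ "$n" -gt $# ]; then n=$#; fi
        shift "$n"
        pos=$((pos + n))
    done
    [ "$pos" -eq "$target" ]
}

sample=$(printf 'a\303\251\342\202\254')   # "a", U+00E9, U+20AC
eager_lengths "$sample"
is_boundary "$sample" 2 || echo "offset 2 is inside a character"
```

The eager version pays for the whole buffer even when no boundary is ever queried; the on-demand version pays only when a query is made.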
Comment 1 Maurizio Paolini 2009-11-26 11:47:40 EST
The problem is still present in Fedora 12; the description above applies
without any change.
Comment 2 Jaroslav Škarvada 2010-02-15 08:50:33 EST
Fixed in grep-2.5.4-1 in rawhide.
Comment 3 Maurizio Paolini 2010-02-23 11:40:54 EST
(In reply to comment #2)
> Fixed in grep-2.5.4-1 in rawhide.    

grep-2.5.4-1 solves the problem for me!  What I did is
- download the .src.rpm of Fedora 13
- rpmbuild -ba ...
- rpm -U ...
- test with utf8 locale --> OK

Thank you very much!
