Bug 538423

Summary: grep performance is terrible with UTF-8 locales (compared to C locale)
Product: [Fedora] Fedora
Reporter: Maurizio Paolini <paolini>
Component: grep
Assignee: Jaroslav Škarvada <jskarvad>
Status: CLOSED RAWHIDE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: low
Version: 12
CC: jbgallagher2000, jskarvad, kasal, kdudka, lkundrak, ovasik, paolini, twaugh
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Fixed In Version: grep-2.5.4-1
Doc Type: Bug Fix
Clone Of: 499220
Last Closed: 2010-02-15 13:50:33 UTC

Description Maurizio Paolini 2009-11-18 14:55:55 UTC
+++ This bug was initially created as a clone of Bug #499220 +++

The problem initially reported as bug 499220 is still present in
Fedora 11, grep version "grep-2.5.3-4.fc11.i586"

The problem can be reproduced as follows:

--------------------------------------
$ for n in `seq 10000`
> do
>  echo "0" >>test.txt
> done
$ export LANG=en_US.UTF-8
$ time grep '[01]' test.txt >/dev/null

real    0m9.102s
user    0m8.419s
sys     0m0.021s
--------------------------------------

Without a UTF-8 locale, the result is OK:

$ export LANG=en_US
$ time grep '[01]' test.txt >/dev/null

real    0m0.018s
user    0m0.004s
sys     0m0.001s
--------------------------------------
We have the same results with "[0]" in place of
"[01]" as regular expression.
This is a nasty bug because it could impact a lot of system scripts.

One note: the same grep command but without the '[' and ']'
does not have the problem:

$ export LANG=en_US.utf8
$ time grep 0 test.txt >/dev/null

real    0m0.009s
user    0m0.004s
sys     0m0.002s



--------------------

It seems that the rpm package includes, among others, the
patch "grep-2.5.3-egf-speedup.patch", which is the most relevant
in this respect.  It fixes some unicode problems, but *not* the
one that I am reporting.

Here is an extract from that patch:
--- extract from grep-2.5.3-egf-speedup.patch ---
From aac37e1939632dbc7d2ade6f991af7ce103b0cba Mon Sep 17 00:00:00 2001
From: Tim Waugh <twaugh>
Date: Sun, 23 Nov 2008 17:30:59 +0100
Subject: [PATCH] EGF Speedup

The full story behind this patch is that grep-2.5.1a does not handle UTF-8 gracefully at all. The basic plan with handling UTF-8 in 2.5.1a is:

    * whenever a buffer is parsed, go through the entire buffer deciding how many bytes make up each character
    * use this information when necessary

This patch changes that to:

    * when information about how many bytes make up a character is needed, work it out on demand
[...]
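
The difference between the two plans in the extract can be sketched roughly as follows. This is only an illustration in Python, not grep's actual C code: the function names (utf8_char_len, eager_char_lengths, char_len_at) are hypothetical, and the sketch assumes well-formed UTF-8 and that the on-demand lookup is given an offset at the start of a character.

```python
# Illustrative sketch only; grep's real implementation is in C and differs.

def utf8_char_len(first_byte):
    """Length in bytes of a UTF-8 sequence, judged from its lead byte."""
    if first_byte < 0x80:
        return 1                      # ASCII
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    return 1                          # invalid or continuation byte: count it alone


def eager_char_lengths(buf):
    """Old plan: walk the entire buffer up front, recording every character's length."""
    lengths = []
    i = 0
    while i < len(buf):
        n = utf8_char_len(buf[i])
        lengths.append(n)
        i += n
    return lengths


def char_len_at(buf, offset):
    """New plan: work out the length only for the character that is asked about.

    Assumes `offset` points at the first byte of a character.
    """
    return utf8_char_len(buf[offset])


if __name__ == "__main__":
    buf = "0\n\u00e9\n\u20ac".encode("utf-8")
    print(eager_char_lengths(buf))    # one entry per character in the buffer
    print(char_len_at(buf, 2))        # length of the character starting at byte 2
```

The eager version does work proportional to the buffer size on every parse, whether or not the lengths are ever used; the on-demand version only pays when a caller needs the answer.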

Comment 1 Maurizio Paolini 2009-11-26 16:47:40 UTC
The problem is still present in Fedora 12; the description above applies
without any change.

Comment 2 Jaroslav Škarvada 2010-02-15 13:50:33 UTC
Fixed in grep-2.5.4-1 in rawhide.

Comment 3 Maurizio Paolini 2010-02-23 16:40:54 UTC
(In reply to comment #2)
> Fixed in grep-2.5.4-1 in rawhide.    

grep-2.5.4-1 solves the problem for me!  What I did is
- download the .src.rpm of fedora 13
- rpmbuild -ba ...
- rpm -U ...
- test with utf8 locale --> OK

Thank you very much!