538423 – grep performance is terrible with UTF-8 locales (compared to C locale)

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	grep
Sub Component:
Version:	12
Hardware:	i386
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	---
Assignee:	Jaroslav Škarvada
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2009-11-18 14:55 UTC by Maurizio Paolini
Modified:	2010-02-23 16:40 UTC
CC List:	8 users
Fixed In Version:	grep-2.5.4-1
Clone Of:	499220
Environment:
Last Closed:	2010-02-15 13:50:33 UTC
Type:	---
Embargoed:
Dependent Products:

Description Maurizio Paolini 2009-11-18 14:55:55 UTC

+++ This bug was initially created as a clone of Bug #499220 +++

The problem initially reported as bug 499220 is still present in
Fedora 11, grep version "grep-2.5.3-4.fc11.i586"

The problem can be reproduced as follows:

--------------------------------------
$ for n in `seq 10000`
> do
>   echo "0" >>test.txt
> done
$ export LANG=en_US.UTF-8
$ time grep '[01]' test.txt >/dev/null

real    0m9.102s
user    0m8.419s
sys     0m0.021s
--------------------------------------

Without UTF-8 the result is OK:

--------------------------------------
$ export LANG=en_US
$ time grep '[01]' test.txt >/dev/null

real    0m0.018s
user    0m0.004s
sys     0m0.001s
--------------------------------------
We get the same results with "[0]" in place of
"[01]" as the regular expression.
This is a nasty bug because it could impact a lot of system scripts.

One note: the same grep command but without the '[' and ']'
does not have the problem:

$ export LANG=en_US.utf8
$ time grep 0 test.txt >/dev/null

real    0m0.009s
user    0m0.004s
sys     0m0.002s
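
The comparison above can be scripted so that all cases run in one go.
What follows is only a minimal sketch, not part of the original test: it
assumes GNU date (for the %N nanosecond format), mktemp, and an installed
en_US.UTF-8 locale; it sets LC_ALL rather than LANG so that no other
locale variable can override the choice; and the helper name time_grep is
made up for illustration.

```shell
#!/bin/sh
# Sketch: time the bracket expressions and the plain pattern under both locales.
# Assumptions (not from the report): GNU date, mktemp, en_US.UTF-8 installed.

testfile=$(mktemp)
trap 'rm -f "$testfile"' EXIT

# 10000 lines, each containing a single "0", as in the report.
for n in $(seq 10000); do
    echo "0"
done > "$testfile"

# time_grep LOCALE PATTERN: hypothetical helper, prints elapsed wall time.
time_grep() {
    start=$(date +%s%N)
    LC_ALL="$1" grep "$2" "$testfile" > /dev/null
    end=$(date +%s%N)
    echo "LC_ALL=$1 pattern=$2: $(( (end - start) / 1000000 )) ms"
}

for loc in C en_US.UTF-8; do
    for pat in '[01]' '[0]' '0'; do
        time_grep "$loc" "$pat"
    done
done
```

On an affected grep the two bracket-expression runs under en_US.UTF-8
should stand out from the other four; on a fixed grep all six timings
should be of the same order.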



--------------------

It seems that the RPM package includes, among others, the
patch "grep-2.5.3-egf-speedup.patch", which is the most relevant
in this respect.  It fixes some Unicode problems, but *not* the
one that I am reporting.

Here is an extract from that patch:
--- extract from grep-2.5.3-egf-speedup.patch ---
From aac37e1939632dbc7d2ade6f991af7ce103b0cba Mon Sep 17 00:00:00 2001
From: Tim Waugh <twaugh>
Date: Sun, 23 Nov 2008 17:30:59 +0100
Subject: [PATCH] EGF Speedup

The full story behind this patch is that grep-2.5.1a does not handle UTF-8 gracefully at all. The basic plan with handling UTF-8 in 2.5.1a is:

    * whenever a buffer is parsed, go through the entire buffer deciding how many bytes make up each character
    * use this information when necessary

This patch changes that to:

    * when information about how many bytes make up a character is needed, work it out on demand
[...]

Comment 1 Maurizio Paolini 2009-11-26 16:47:40 UTC

The problem is still present in Fedora 12; the comment above applies without
any change.

Comment 2 Jaroslav Škarvada 2010-02-15 13:50:33 UTC

Fixed in grep-2.5.4-1 in rawhide.

Comment 3 Maurizio Paolini 2010-02-23 16:40:54 UTC

(In reply to comment #2)
> Fixed in grep-2.5.4-1 in rawhide.    

grep-2.5.4-1 solves the problem for me!  What I did:
- download the .src.rpm of Fedora 13
- rpmbuild -ba ...
- rpm -U ...
- test with UTF-8 locale --> OK

Thank you very much!