Bug 538423

Summary:	grep performance is terrible with UTF-8 locales (compared to C locale)
Product:	[Fedora] Fedora	Reporter:	Maurizio Paolini <paolini>
Component:	grep	Assignee:	Jaroslav Škarvada <jskarvad>
Status:	CLOSED RAWHIDE	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	high	Docs Contact:
Priority:	low
Version:	12	CC:	jbgallagher2000, jskarvad, kasal, kdudka, lkundrak, ovasik, paolini, twaugh
Target Milestone:	---
Target Release:	---
Hardware:	i386
OS:	Linux
Whiteboard:
Fixed In Version:	grep-2.5.4-1	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:	499220	Environment:
Last Closed:	2010-02-15 13:50:33 UTC	Type:	---

Description Maurizio Paolini 2009-11-18 14:55:55 UTC

+++ This bug was initially created as a clone of Bug #499220 +++

The problem initially reported as bug 499220 is still present in
Fedora 11, grep version "grep-2.5.3-4.fc11.i586".

The problem can be reproduced as follows:

--------------------------------------
$ for n in `seq 10000`
> do
>  echo "0" >>test.txt
> done
$ export LANG=en_US.UTF-8
$ time grep [01] test.txt >/dev/null

real    0m9.102s
user    0m8.419s
sys     0m0.021s
--------------------------------------

Without UTF-8, the result is fine:

$ export LANG=en_US
$ time grep [01] test.txt >/dev/null

real    0m0.018s
user    0m0.004s
sys     0m0.001s
--------------------------------------
We get the same results with "[0]" in place of
"[01]" as the regular expression.
This is a nasty bug because it could impact many system scripts.

One note: the same grep command but without the '[' and ']'
does not have the problem:

$ export LANG=en_US.utf8
$ time grep 0 test.txt >/dev/null

real    0m0.009s
user    0m0.004s
sys     0m0.002s
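
For anyone who wants to re-run the three cases above in one go, here is a minimal sketch. It makes a few assumptions that are not in the original report: the helper name time_grep is made up, the test file is created with mktemp instead of in the current directory, the pattern is quoted so the shell cannot expand it as a glob, and the en_US.UTF-8 locale is assumed to be installed. Timing uses `date +%s`, so it only has whole-second resolution, which should still be enough to tell a multi-second run from a near-instant one.

```shell
#!/bin/sh
# Build the 10000-line test file from the report in a temporary file.
testfile=$(mktemp)
i=0
while [ "$i" -lt 10000 ]; do
    echo "0"
    i=$((i + 1))
done > "$testfile"

# time_grep LOCALE PATTERN (hypothetical helper, not part of grep):
# runs grep under LOCALE, then prints the match count and the elapsed
# whole seconds. LC_ALL is set for the single grep invocation only.
time_grep() {
    start=$(date +%s)
    matches=$(LC_ALL="$1" grep -c "$2" "$testfile")
    end=$(date +%s)
    echo "locale=$1 pattern=$2 matches=$matches elapsed=$((end - start))s"
}

time_grep C '[01]'
c_count=$matches
time_grep en_US.UTF-8 '[01]'   # assumes this locale is installed
utf8_count=$matches
time_grep en_US.UTF-8 '0'      # plain pattern, reported as fast even in UTF-8

rm -f "$testfile"
```

Setting LC_ALL per command keeps the session's LANG untouched, so the runs can be repeated in any order.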



--------------------

It seems that the rpm package includes, among others, the
patch "grep-2.5.3-egf-speedup.patch", which is the most relevant
one here.  It fixes some Unicode problems, but *not* the
one that I am reporting.

Here is an extract from that patch:
--- extract from grep-2.5.3-egf-speedup.patch ---
From aac37e1939632dbc7d2ade6f991af7ce103b0cba Mon Sep 17 00:00:00 2001
From: Tim Waugh <twaugh>
Date: Sun, 23 Nov 2008 17:30:59 +0100
Subject: [PATCH] EGF Speedup

The full story behind this patch is that grep-2.5.1a does not handle UTF-8 gracefully at all. The basic plan with handling UTF-8 in 2.5.1a is:

    * whenever a buffer is parsed, go through the entire buffer deciding how many bytes make up each character
    * use this information when necessary

This patch changes that to:

    * when information about how many bytes make up a character is needed, work it out on demand
[...]

Comment 1 Maurizio Paolini 2009-11-26 16:47:40 UTC

The problem is still present in Fedora 12; the description above applies
without any change.

Comment 2 Jaroslav Škarvada 2010-02-15 13:50:33 UTC

Fixed in grep-2.5.4-1 in rawhide.

Comment 3 Maurizio Paolini 2010-02-23 16:40:54 UTC

(In reply to comment #2)
> Fixed in grep-2.5.4-1 in rawhide.    

grep-2.5.4-1 solves the problem for me!  What I did:
- download the .src.rpm of Fedora 13
- rpmbuild -ba ...
- rpm -U ...
- test with utf8 locale --> OK

Thank you very much!