Bug 194471

Summary:	grep --ignore-case is very slow in UTF-8
Product:	[Fedora] Fedora	Reporter:	Egmont Koblinger <egmont>
Component:	grep	Assignee:	Jaroslav Škarvada <jskarvad>
Status:	CLOSED ERRATA	QA Contact:
Severity:	medium	Docs Contact:
Priority:	medium
Version:	rawhide	CC:	drepper, henry.hu.sh, jakub, jr-redhatbugs2, jskarvad, mcepl, triage, twaugh
Target Milestone:	---	Keywords:	FutureFeature
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:	bzcl34nup
Fixed In Version:	grep-2.6.3-1.fc11	Doc Type:	Enhancement
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2010-03-30 14:06:05 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	235705

Description Egmont Koblinger 2006-06-08 12:51:27 UTC

Grep contains a lot of patches so that it is very fast even in UTF-8,
but only in the default case sensitive mode. If you type "grep -i
some_very_large_data" then you'll find that it is approx. a hundred times
slower than if you omit "-i". Example:

$ time grep foobar /usr/lib/locale/locale-archive

real    0m0.089s
user    0m0.044s
sys     0m0.048s
$ time grep -i foobar /usr/lib/locale/locale-archive

real    0m17.023s
user    0m16.977s
sys     0m0.044s

Note #1: I'm not using Fedora, but applied its patches (grep-2.5.1-52.2) to
grep 2.5.1a and recompiled it on my system and found this behavior there.

Note #2: If I apply the patches from SUSE's grep-2.5.1a-20 instead then it's
fast even in the case insensitive mode.

Note #3: "fgrep -i" is fast on Fedora, too.

Comment 1 Egmont Koblinger 2006-06-08 12:55:57 UTC

*** Bug 194472 has been marked as a duplicate of this bug. ***

Comment 2 Tim Waugh 2006-10-13 15:55:04 UTC

Bet you SuSE's grep won't properly handle UTF-8 in case-insensitive mode though. :-)

As you note, 'fgrep -i' *is* fast -- but there is special magic going on there
to make it so.  With fgrep there are more assumptions that can be made safely,
which make it possible to optimize this operation quite a bit.

For the 'egrep -i' case (which is slower) it looks to me like the time is spent
in glibc's re_search() function.

Comment 3 Egmont Koblinger 2006-10-13 16:21:36 UTC

What do you exactly mean by suse's version not doing it perfectly? I use it 
very often with LANG=hu_HU.UTF-8 and it works correctly, even for accented 
letters. However, I just tried it with LANG=tr_TR.UTF-8, and -- you're right -- 
it doesn't perfectly catch the i-İ and ı-I pairs (does the fedora version do?)

Comment 4 Jakub Jelinek 2006-10-13 16:32:26 UTC

Well, if fgrep -i foobar is fast in UTF-8 and grep -i foobar is not (with literal
foobar), then that is a missed optimization in grep, there is no reason
why grep shouldn't behave exactly as fgrep when there are no special regex
characters in the search string.
glibc re_search intentionally doesn't special case searches with no regex special
chars in it, glibc would need to basically carry all the cruft that grep
currently has and that's not useful for all the programs out there, only to some.
To my knowledge SUSE doesn't use any special glibc regex patches, so if there is
any difference, it must be either in not honoring the UTF-8 or in 

for i in 'grep -i' 'fgrep -i' grep fgrep; do
  echo -n $i foobar" "
  LC_ALL=en_US.UTF-8 ltrace $i foobar /usr/lib/locale/locale-archive 2>&1 | grep
re_search | wc -l
done
grep -i foobar 93543
fgrep -i foobar 0
grep foobar 0
fgrep foobar 0

BTW, I looked at grep -i foobar under debugger and it doesn't set struct
re_pattern_buffer's fastmap, which would substantially speed it up in this case
IMHO.
So, I'd say grep should be changed, so that
a) if the string in grep -i (and egrep -i) contains no regex special chars at
all (and other grep options don't preclude it), use fgrep -i searching methods
b) otherwise, at least set fastmap, so that glibc regex can at least do a better
job

Comment 5 Jakub Jelinek 2006-10-13 16:41:52 UTC

Regarding tr_TR, glibc regex handles dotless vs. with dot I's correctly.

Comment 6 Tim Waugh 2006-10-13 16:46:21 UTC

..and so does grep, with our patches.  Jakub, thanks very much for the analysis.

Comment 7 Ulrich Drepper 2006-12-18 02:38:33 UTC

This bug should be closed.  There is nothing to do.  It is the user's
responsibility to use the best command for the job.  grep without special
purpose optimizations performs as good as it can.  There are reasons why there
are fgrep and grep as separate programs.

Comment 8 Tim Waugh 2006-12-18 09:58:22 UTC

The changes mentioned in comment #4 seem worth doing, especially (b) which seems
an easy change (set fastmap).

Comment 9 Bug Zapper 2008-04-04 03:03:55 UTC

Fedora apologizes that these issues have not been resolved yet. We're
sorry it's taken so long for your bug to be properly triaged and acted
on. We appreciate the time you took to report this issue and want to
make sure no important bugs slip through the cracks.

If you're currently running a version of Fedora Core between 1 and 6,
please note that Fedora no longer maintains these releases. We strongly
encourage you to upgrade to a current Fedora release. In order to
refocus our efforts as a project we are flagging all of the open bugs
for releases which are no longer maintained and closing them.
http://fedoraproject.org/wiki/LifeCycle/EOL

If this bug is still open against Fedora Core 1 through 6, thirty days
from now, it will be closed 'WONTFIX'. If you can reporduce this bug in
the latest Fedora version, please change to the respective version. If
you are unable to do this, please add a comment to this bug requesting
the change.

Thanks for your help, and we apologize again that we haven't handled
these issues to this point.

The process we are following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this
doesn't happen again.

And if you'd like to join the bug triage team to help make things
better, check out http://fedoraproject.org/wiki/BugZappers

Comment 11 Henry Hu 2009-09-05 12:38:11 UTC

I've recently upgraded one server from Fedora Core 2 to Fedora 11, and this problem appeared, since now the server is using UTF-8 locale.
Please fix this problem, or we can only run egrep under LC_CTYPE=C.

Comment 12 Jaroslav Škarvada 2010-03-30 14:06:05 UTC

Looks as fixed in grep-2.6.1 in rawhide. I am going to push it as regular update to all suported versions of Fedora.

$ time grep foobar /usr/lib/locale/locale-archive

real	0m1.410s
user	0m0.199s
sys	0m0.178s

$ time grep -i foobar /usr/lib/locale/locale-archive

real	0m4.504s
user	0m4.374s
sys	0m0.043s

$ time fgrep -i foobar /usr/lib/locale/locale-archive

real	0m4.422s
user	0m4.280s
sys	0m0.036s

Comment 13 Egmont Koblinger 2010-03-30 16:18:18 UTC

I haven't tried grep-2.6.1 but read its changelog and seems they did major UTF-8 improvements mainstream. With the new version, grep and fgrep are equally fast if the pattern is actually a simple string without special characters.

I'm not fully satisfied with the result, though. With the old version "fgrep -i" was extremely fast (around 0.1-0.2 seconds), with the new version "fgrep -i" takes around 4-5 seconds. This should rather be addressed mainstream, though...

Comment 14 Jaroslav Škarvada 2010-03-30 16:31:06 UTC

On my system the old grep:

$ time ./fgrep -i foobar /usr/lib/locale/locale-archive

real	0m1.256s
user	0m0.071s
sys	0m0.091s

I think approx. 4 time slowdown is acceptable in comparison to huge changes, bugfixes and speedups (in other cases), presented in the new grep.

Comment 15 Egmont Koblinger 2010-03-30 16:56:04 UTC

Just realized that my old grep is actually an Ubuntu Hardy grep-2.5.3 with I don't know what kinds of patches and what compiler flags, and how that does with UTF-8 weirdnesses and corner cases, such as Turkish i's and such..... So probably your measurement of a 4x slowdown is more accurate than my measurement of 30x slowdown as I'm comparing apples to pears.

Comment 16 Jaroslav Škarvada 2010-03-30 18:21:30 UTC

I didn't complex analysis. When running both cases from cache it seems there is much more slowdown. The new grep is still faster then the old unpatched and less buggy on UTF-8 than patched/unpatched old version. Also the grep/fgrep behave the same, that's why I recognized this as fixed. If you are not satisfied with this solution feel free to reopen.

Comment 17 Egmont Koblinger 2010-03-31 09:14:56 UTC

Continuing the story upstream: https://savannah.gnu.org/patch/index.php?7147
I think it's fine to leave this one closed. Thanks! :)

Comment 18 Egmont Koblinger 2010-03-31 09:15:57 UTC

Argh, sorry, copy-pasted from the wrong tab...
This is the correct URL:
https://savannah.gnu.org/bugs/index.php?29391

Comment 19 Fedora Update System 2010-03-31 14:08:30 UTC

grep-2.6.1-1.fc11 has been submitted as an update for Fedora 11.
http://admin.fedoraproject.org/updates/grep-2.6.1-1.fc11

Comment 20 Fedora Update System 2010-04-07 22:17:32 UTC

grep-2.6.3-1.fc11 has been submitted as an update for Fedora 11.
http://admin.fedoraproject.org/updates/grep-2.6.3-1.fc11

Comment 21 Fedora Update System 2010-06-15 16:06:36 UTC

grep-2.6.3-1.fc11 has been pushed to the Fedora 11 stable repository.  If problems still persist, please make note of it in this bug report.