Red Hat Bugzilla – Full Text Bug Listing
|Summary:||grep --ignore-case is very slow in UTF-8|
|Product:||[Fedora] Fedora||Reporter:||Egmont Koblinger <egmont>|
|Component:||grep||Assignee:||Jaroslav Škarvada <jskarvad>|
|Status:||CLOSED ERRATA||QA Contact:|
|Version:||rawhide||CC:||drepper, henry.hu.sh, jakub, jr-redhatbugs2, jskarvad, triage, twaugh|
|Fixed In Version:||grep-2.6.3-1.fc11||Doc Type:||Enhancement|
|Doc Text:||Story Points:||---|
|Last Closed:||2010-03-30 10:06:05 EDT||Type:||---|
|oVirt Team:||---||RHEL 7.3 requirements from Atomic Host:|
|Bug Depends On:|
Description Egmont Koblinger 2006-06-08 08:51:27 EDT
Grep contains a lot of patches so that it is very fast even in UTF-8, but only in the default case sensitive mode. If you type "grep -i some_very_large_data" then you'll find that it is approx. a hundred times slower than if you omit "-i". Example: $ time grep foobar /usr/lib/locale/locale-archive real 0m0.089s user 0m0.044s sys 0m0.048s $ time grep -i foobar /usr/lib/locale/locale-archive real 0m17.023s user 0m16.977s sys 0m0.044s Note #1: I'm not using Fedora, but applied its patches (grep-2.5.1-52.2) to grep 2.5.1a and recompiled it on my system and found this behavior there. Note #2: If I apply the patches from SUSE's grep-2.5.1a-20 instead then it's fast even in the case insensitive mode. Note #3: "fgrep -i" is fast on Fedora, too.
Comment 1 Egmont Koblinger 2006-06-08 08:55:57 EDT
*** Bug 194472 has been marked as a duplicate of this bug. ***
Comment 2 Tim Waugh 2006-10-13 11:55:04 EDT
Bet you SuSE's grep won't properly handle UTF-8 in case-insensitive mode though. :-) As you note, 'fgrep -i' *is* fast -- but there is special magic going on there to make it so. With fgrep there are more assumptions that can be made safely, which make it possible to optimize this operation quite a bit. For the 'egrep -i' case (which is slower) it looks to me like the time is spent in glibc's re_search() function.
Comment 3 Egmont Koblinger 2006-10-13 12:21:36 EDT
What do you exactly mean by suse's version not doing it perfectly? I use it very often with LANG=hu_HU.UTF-8 and it works correctly, even for accented letters. However, I just tried it with LANG=tr_TR.UTF-8, and -- you're right -- it doesn't perfectly catch the i-İ and ı-I pairs (does the fedora version do?)
Comment 4 Jakub Jelinek 2006-10-13 12:32:26 EDT
Well, if fgrep -i foobar is fast in UTF-8 and grep -i foobar is not (with literal foobar), then that is a missed optimization in grep, there is no reason why grep shouldn't behave exactly as fgrep when there are no special regex characters in the search string. glibc re_search intentionally doesn't special case searches with no regex special chars in it, glibc would need to basically carry all the cruft that grep currently has and that's not useful for all the programs out there, only to some. To my knowledge SUSE doesn't use any special glibc regex patches, so if there is any difference, it must be either in not honoring the UTF-8 or in for i in 'grep -i' 'fgrep -i' grep fgrep; do echo -n $i foobar" " LC_ALL=en_US.UTF-8 ltrace $i foobar /usr/lib/locale/locale-archive 2>&1 | grep re_search | wc -l done grep -i foobar 93543 fgrep -i foobar 0 grep foobar 0 fgrep foobar 0 BTW, I looked at grep -i foobar under debugger and it doesn't set struct re_pattern_buffer's fastmap, which would substantially speed it up in this case IMHO. So, I'd say grep should be changed, so that a) if the string in grep -i (and egrep -i) contains no regex special chars at all (and other grep options don't preclude it), use fgrep -i searching methods b) otherwise, at least set fastmap, so that glibc regex can at least do a better job
Comment 5 Jakub Jelinek 2006-10-13 12:41:52 EDT
Regarding tr_TR, glibc regex handles dotless vs. with dot I's correctly.
Comment 6 Tim Waugh 2006-10-13 12:46:21 EDT
..and so does grep, with our patches. Jakub, thanks very much for the analysis.
Comment 7 Ulrich Drepper 2006-12-17 21:38:33 EST
This bug should be closed. There is nothing to do. It is the user's responsibility to use the best command for the job. grep without special purpose optimizations performs as good as it can. There are reasons why there are fgrep and grep as separate programs.
Comment 8 Tim Waugh 2006-12-18 04:58:22 EST
The changes mentioned in comment #4 seem worth doing, especially (b) which seems an easy change (set fastmap).
Comment 9 Bug Zapper 2008-04-03 23:03:55 EDT
Fedora apologizes that these issues have not been resolved yet. We're sorry it's taken so long for your bug to be properly triaged and acted on. We appreciate the time you took to report this issue and want to make sure no important bugs slip through the cracks. If you're currently running a version of Fedora Core between 1 and 6, please note that Fedora no longer maintains these releases. We strongly encourage you to upgrade to a current Fedora release. In order to refocus our efforts as a project we are flagging all of the open bugs for releases which are no longer maintained and closing them. http://fedoraproject.org/wiki/LifeCycle/EOL If this bug is still open against Fedora Core 1 through 6, thirty days from now, it will be closed 'WONTFIX'. If you can reporduce this bug in the latest Fedora version, please change to the respective version. If you are unable to do this, please add a comment to this bug requesting the change. Thanks for your help, and we apologize again that we haven't handled these issues to this point. The process we are following is outlined here: http://fedoraproject.org/wiki/BugZappers/F9CleanUp We will be following the process here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this doesn't happen again. And if you'd like to join the bug triage team to help make things better, check out http://fedoraproject.org/wiki/BugZappers
Comment 11 Henry Hu 2009-09-05 08:38:11 EDT
I've recently upgraded one server from Fedora Core 2 to Fedora 11, and this problem appeared, since now the server is using UTF-8 locale. Please fix this problem, or we can only run egrep under LC_CTYPE=C.
Comment 12 Jaroslav Škarvada 2010-03-30 10:06:05 EDT
Looks as fixed in grep-2.6.1 in rawhide. I am going to push it as regular update to all suported versions of Fedora. $ time grep foobar /usr/lib/locale/locale-archive real 0m1.410s user 0m0.199s sys 0m0.178s $ time grep -i foobar /usr/lib/locale/locale-archive real 0m4.504s user 0m4.374s sys 0m0.043s $ time fgrep -i foobar /usr/lib/locale/locale-archive real 0m4.422s user 0m4.280s sys 0m0.036s
Comment 13 Egmont Koblinger 2010-03-30 12:18:18 EDT
I haven't tried grep-2.6.1 but read its changelog and seems they did major UTF-8 improvements mainstream. With the new version, grep and fgrep are equally fast if the pattern is actually a simple string without special characters. I'm not fully satisfied with the result, though. With the old version "fgrep -i" was extremely fast (around 0.1-0.2 seconds), with the new version "fgrep -i" takes around 4-5 seconds. This should rather be addressed mainstream, though...
Comment 14 Jaroslav Škarvada 2010-03-30 12:31:06 EDT
On my system the old grep: $ time ./fgrep -i foobar /usr/lib/locale/locale-archive real 0m1.256s user 0m0.071s sys 0m0.091s I think approx. 4 time slowdown is acceptable in comparison to huge changes, bugfixes and speedups (in other cases), presented in the new grep.
Comment 15 Egmont Koblinger 2010-03-30 12:56:04 EDT
Just realized that my old grep is actually an Ubuntu Hardy grep-2.5.3 with I don't know what kinds of patches and what compiler flags, and how that does with UTF-8 weirdnesses and corner cases, such as Turkish i's and such..... So probably your measurement of a 4x slowdown is more accurate than my measurement of 30x slowdown as I'm comparing apples to pears.
Comment 16 Jaroslav Škarvada 2010-03-30 14:21:30 EDT
I didn't complex analysis. When running both cases from cache it seems there is much more slowdown. The new grep is still faster then the old unpatched and less buggy on UTF-8 than patched/unpatched old version. Also the grep/fgrep behave the same, that's why I recognized this as fixed. If you are not satisfied with this solution feel free to reopen.
Comment 17 Egmont Koblinger 2010-03-31 05:14:56 EDT
Continuing the story upstream: https://savannah.gnu.org/patch/index.php?7147 I think it's fine to leave this one closed. Thanks! :)
Comment 18 Egmont Koblinger 2010-03-31 05:15:57 EDT
Argh, sorry, copy-pasted from the wrong tab... This is the correct URL: https://savannah.gnu.org/bugs/index.php?29391
Comment 19 Fedora Update System 2010-03-31 10:08:30 EDT
grep-2.6.1-1.fc11 has been submitted as an update for Fedora 11. http://admin.fedoraproject.org/updates/grep-2.6.1-1.fc11
Comment 20 Fedora Update System 2010-04-07 18:17:32 EDT
grep-2.6.3-1.fc11 has been submitted as an update for Fedora 11. http://admin.fedoraproject.org/updates/grep-2.6.3-1.fc11
Comment 21 Fedora Update System 2010-06-15 12:06:36 EDT
grep-2.6.3-1.fc11 has been pushed to the Fedora 11 stable repository. If problems still persist, please make note of it in this bug report.