Bug 194471
Summary: | grep --ignore-case is very slow in UTF-8 | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Egmont Koblinger <egmont> |
Component: | grep | Assignee: | Jaroslav Škarvada <jskarvad> |
Status: | CLOSED ERRATA | QA Contact: | |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | rawhide | CC: | drepper, henry.hu.sh, jakub, jr-redhatbugs2, jskarvad, mcepl, triage, twaugh |
Target Milestone: | --- | Keywords: | FutureFeature |
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | bzcl34nup | ||
Fixed In Version: | grep-2.6.3-1.fc11 | Doc Type: | Enhancement |
Last Closed: | 2010-03-30 14:06:05 UTC | Type: | --- |
Bug Blocks: | 235705 |
Description
Egmont Koblinger
2006-06-08 12:51:27 UTC
*** Bug 194472 has been marked as a duplicate of this bug. ***

Bet you SuSE's grep won't properly handle UTF-8 in case-insensitive mode though. :-)

As you note, 'fgrep -i' *is* fast -- but there is special magic going on there to make it so. With fgrep there are more assumptions that can be made safely, which make it possible to optimize this operation quite a bit.

For the 'egrep -i' case (which is slower) it looks to me like the time is spent in glibc's re_search() function.

What do you exactly mean by suse's version not doing it perfectly? I use it very often with LANG=hu_HU.UTF-8 and it works correctly, even for accented letters. However, I just tried it with LANG=tr_TR.UTF-8, and -- you're right -- it doesn't perfectly catch the i-İ and ı-I pairs (does the fedora version do?)

Well, if fgrep -i foobar is fast in UTF-8 and grep -i foobar is not (with literal foobar), then that is a missed optimization in grep, there is no reason why grep shouldn't behave exactly as fgrep when there are no special regex characters in the search string. glibc re_search intentionally doesn't special case searches with no regex special chars in it, glibc would need to basically carry all the cruft that grep currently has and that's not useful for all the programs out there, only to some.

To my knowledge SUSE doesn't use any special glibc regex patches, so if there is any difference, it must be either in not honoring the UTF-8 or in

    for i in 'grep -i' 'fgrep -i' grep fgrep; do
      echo -n $i foobar" "
      LC_ALL=en_US.UTF-8 ltrace $i foobar /usr/lib/locale/locale-archive 2>&1 | grep re_search | wc -l
    done

    grep -i foobar 93543
    fgrep -i foobar 0
    grep foobar 0
    fgrep foobar 0

BTW, I looked at grep -i foobar under debugger and it doesn't set struct re_pattern_buffer's fastmap, which would substantially speed it up in this case IMHO.
So, I'd say grep should be changed, so that

a) if the string in grep -i (and egrep -i) contains no regex special chars at all (and other grep options don't preclude it), use fgrep -i searching methods
b) otherwise, at least set fastmap, so that glibc regex can at least do a better job

Regarding tr_TR, glibc regex handles dotless vs. with dot I's correctly.

..and so does grep, with our patches. Jakub, thanks very much for the analysis.

This bug should be closed. There is nothing to do. It is the user's responsibility to use the best command for the job. grep without special purpose optimizations performs as well as it can. There are reasons why there are fgrep and grep as separate programs.

The changes mentioned in comment #4 seem worth doing, especially (b) which seems an easy change (set fastmap).

Fedora apologizes that these issues have not been resolved yet. We're sorry it's taken so long for your bug to be properly triaged and acted on. We appreciate the time you took to report this issue and want to make sure no important bugs slip through the cracks.

If you're currently running a version of Fedora Core between 1 and 6, please note that Fedora no longer maintains these releases. We strongly encourage you to upgrade to a current Fedora release. In order to refocus our efforts as a project we are flagging all of the open bugs for releases which are no longer maintained and closing them. http://fedoraproject.org/wiki/LifeCycle/EOL

If this bug is still open against Fedora Core 1 through 6, thirty days from now, it will be closed 'WONTFIX'. If you can reproduce this bug in the latest Fedora version, please change to the respective version. If you are unable to do this, please add a comment to this bug requesting the change.

Thanks for your help, and we apologize again that we haven't handled these issues to this point.
The process we are following is outlined here: http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this doesn't happen again. And if you'd like to join the bug triage team to help make things better, check out http://fedoraproject.org/wiki/BugZappers

I've recently upgraded one server from Fedora Core 2 to Fedora 11, and this problem appeared, since now the server is using UTF-8 locale. Please fix this problem, or we can only run egrep under LC_CTYPE=C.

Looks as fixed in grep-2.6.1 in rawhide. I am going to push it as regular update to all supported versions of Fedora.

    $ time grep foobar /usr/lib/locale/locale-archive
    real 0m1.410s
    user 0m0.199s
    sys 0m0.178s

    $ time grep -i foobar /usr/lib/locale/locale-archive
    real 0m4.504s
    user 0m4.374s
    sys 0m0.043s

    $ time fgrep -i foobar /usr/lib/locale/locale-archive
    real 0m4.422s
    user 0m4.280s
    sys 0m0.036s

I haven't tried grep-2.6.1 but read its changelog and seems they did major UTF-8 improvements mainstream. With the new version, grep and fgrep are equally fast if the pattern is actually a simple string without special characters. I'm not fully satisfied with the result, though. With the old version "fgrep -i" was extremely fast (around 0.1-0.2 seconds), with the new version "fgrep -i" takes around 4-5 seconds. This should rather be addressed mainstream, though...

On my system the old grep:

    $ time ./fgrep -i foobar /usr/lib/locale/locale-archive
    real 0m1.256s
    user 0m0.071s
    sys 0m0.091s

I think an approx. 4x slowdown is acceptable in comparison to the huge changes, bugfixes and speedups (in other cases) present in the new grep.

Just realized that my old grep is actually an Ubuntu Hardy grep-2.5.3 with I don't know what kinds of patches and what compiler flags, and how that does with UTF-8 weirdnesses and corner cases, such as Turkish i's and such.....
So probably your measurement of a 4x slowdown is more accurate than my measurement of 30x slowdown as I'm comparing apples to pears.

I didn't do a complex analysis. When running both cases from cache it seems there is much more slowdown. The new grep is still faster than the old unpatched one and less buggy on UTF-8 than the patched/unpatched old version. Also, grep and fgrep now behave the same, which is why I recognized this as fixed. If you are not satisfied with this solution feel free to reopen.

Continuing the story upstream: https://savannah.gnu.org/patch/index.php?7147 I think it's fine to leave this one closed. Thanks! :)

Argh, sorry, copy-pasted from the wrong tab... This is the correct URL: https://savannah.gnu.org/bugs/index.php?29391

grep-2.6.1-1.fc11 has been submitted as an update for Fedora 11. http://admin.fedoraproject.org/updates/grep-2.6.1-1.fc11

grep-2.6.3-1.fc11 has been submitted as an update for Fedora 11. http://admin.fedoraproject.org/updates/grep-2.6.3-1.fc11

grep-2.6.3-1.fc11 has been pushed to the Fedora 11 stable repository. If problems still persist, please make note of it in this bug report.
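
For anyone stuck on an older grep, suggestion (a) from the analysis in this thread can be roughly approximated at the user level with a small wrapper. This is only a sketch under stated assumptions: `is_literal_pattern` and `igrep` are hypothetical names (not part of grep), the list of characters treated as special is an assumption meant to cover extended-regex syntax, and it says nothing about how grep 2.6.x implements the optimization internally.

```shell
#!/bin/sh
# Hypothetical helper (not part of grep): succeeds when the pattern contains
# none of the characters assumed here to be special in extended regexes.
is_literal_pattern() {
    case $1 in
        *'\'* | *.* | *'*'* | *'['* | *']'* | *'^'* | *'$'* | \
        *'+'* | *'?'* | *'('* | *')'* | *'{'* | *'}'* | *'|'*)
            return 1 ;;
    esac
    return 0
}

# Hypothetical wrapper: case-insensitive search that uses fixed-string
# matching (grep -F) when the pattern is a plain literal, and falls back
# to extended-regex matching (grep -E) otherwise.
igrep() {
    pattern=$1
    shift
    if is_literal_pattern "$pattern"; then
        grep -F -i -- "$pattern" "$@"
    else
        grep -E -i -- "$pattern" "$@"
    fi
}
```

With these functions sourced, `igrep foobar /usr/lib/locale/locale-archive` would take the fixed-string path, while `igrep 'foo.bar' FILE` would still go through the regex matcher. Whether this helps at all depends on the grep version in use.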