Bug 69900
Summary: | grep writing output very slow | ||||||
---|---|---|---|---|---|---|---|
Product: | [Retired] Red Hat Linux | Reporter: | keith adamson <keith.adamson> | ||||
Component: | grep | Assignee: | Tim Waugh <twaugh> | ||||
Status: | CLOSED ERRATA | QA Contact: | Mike McLean <mikem> | ||||
Severity: | medium | Docs Contact: | |||||
Priority: | medium | ||||||
Version: | 8.0 | CC: | bmisener, david, heny, jr-redhatbugs2, justin, mark_young, menthos, pbmoses, p.van.egdom, rocky, sanjeev | ||||
Target Milestone: | --- | ||||||
Target Release: | --- | ||||||
Hardware: | i386 | ||||||
OS: | Linux | ||||||
Whiteboard: | |||||||
Fixed In Version: | 2.5.1-8 | Doc Type: | Bug Fix | ||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | Environment: | ||||||
Last Closed: | 2004-03-18 16:13:03 UTC | Type: | --- | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Attachments: |
|
Description
keith adamson
2002-07-26 02:10:42 UTC
$ time src_extern/grep/grep-2.5/src/grep '^' /var/log/XFree86.0.log >foo real 0m3.070s user 0m2.950s sys 0m0.054s $ time src_extern/grep/grep-2.4/src/grep '^' /var/log/XFree86.0.log >foo real 0m0.019s user 0m0.001s sys 0m0.005s $ Something very wrong with grep-2.5 Agreed something seriously around with grep 2.5. It works fine for me with my XFree86 log, but grepping a 1mb html file is horrible. time /bin/grep '^' foo.htm > foo 40.020 total grep 2.5.1-4 Limbo beta2 rpm compiled by redhat 3.2 39.819 total grep 2.5.1-4 Target i686 compiled by me 3.2 39.308 total grep 2.5.1-4 Compiled by me 2.96 38.564 total grep 2.5.1-4 Target athlon compiled by me 3.2 0.019 total grep 2.4.2-7 7.2 rpm compiled by redhat 2.96 0.018 total grep 2.4.2-7 Compiled by me 3.2 0.018 total grep 2.4.2-7 Target athlon compiled by me 3.2 Interesting thing I noticed is that the user times from 2.5.1 are total 10x, but user times from 2.4.2 are less than total(aka as they should be). Check for simple spec file differences like configure or CFLAGS. Both are equivelant. 38.001 total grep CVS Compiled by me 3.2 Hmmm, 2.5.1 especially sucks compared to 2.4.2. I just did time grep-2.4.2 '^' foo.htm > foo-2.4.2 time grep-2.5.1 '^' foo.htm > foo-2.5.1 diff -u foo-2.4.2 foo-2.5.1 There is no difference in output. This is UTF-8 processing. If you use a locale without UTF-8 support, grep goes as fast as normal. Karsten Hopp pointed out: >grep spends more than 80% of the time in the check_multibyte_string function. >It is as fast as the old version when used without UTF-8: > >grep-2.5: >LANG=de_DE@euro > time grep -q Attack /var/log/messages > real 0m0.078s > user 0m0.035s > sys 0m0.043s > >LANG=de_DE.UTF-8 > time grep -q Attack /var/log/messages > real 0m2.516s > user 0m2.240s > sys 0m0.059s > >grep-2.4: >LANG=de_DE@euro > time grep -q Attack /var/log/messages > real 0m0.080s > user 0m0.021s > sys 0m0.055s > >LANG=de_DE.UTF-8 > time grep -q Attack /var/log/messages > real 0m0.079s > user 0m0.037s > sys 0m0.037s So there is at least the workaround of setting LANG= While that is certainly true... it sucks because we default to using UTF-8 now. That means out of the box, Red Hat Linux has dog slow grep for users, and likely other applications which are dog slow too. We can likely profile+optimize but I doubt we'll ever take up the slack. Agreed, basically out of the box, grep 2.5 is unusable for many shell applications. I ran into this immediately and downgrade to grep 2.4. And no, I do not want to go into every shell app that uses grep and set the locale. *** Bug 90109 has been marked as a duplicate of this bug. *** *** Bug 90109 has been marked as a duplicate of this bug. *** *** Bug 90956 has been marked as a duplicate of this bug. *** *** Bug 91819 has been marked as a duplicate of this bug. *** this is a rh_8 bug without solution at rh_9 :-? Automatic configuration of rh_9 : querida:/etc # cat /etc/sysconfig/i18n LANG="en_US.UTF-8" SUPPORTED="en_US.UTF-8:en_US:en:gl_ES.UTF-8:gl_ES:gl:es_ES.UTF-8:es_ES:es" SYSFONT="latarcyrheb-sun16" are we going to wait to rh_10 to solve this _general_ distribution, not grep, problem? -thanks- *** Bug 92018 has been marked as a duplicate of this bug. *** *** Bug 103618 has been marked as a duplicate of this bug. *** Reported upstream, with example patch (with bugs) that shows where the problem is. *** Bug 106417 has been marked as a duplicate of this bug. *** This is still a problem in Red Hat 9 as of 10/31/03. Here are sample runs of grep: ########################## Info on my setup [root@test log]# date Fri Oct 31 11:05:57 EST 2003 [root@test log]# up2date -du (snip) All packages are currently up to date. [root@test log]# rpm -q glibc pcre grep glibc-2.3.2-27.9 pcre-3.9-10 grep-2.5.1-7 ############### grep test results [root@test log]# export LANG='en_US.UTF-8' [root@test log]# time grep '^' messages > /dev/null real 0m3.689s user 0m3.600s sys 0m0.080s [root@test log]# export LANG='en_US' [root@test log]# time grep '^' messages > /dev/null real 0m0.004s user 0m0.000s sys 0m0.000s With LANG=en_US.UTF-8, grep is almost a thousand times slower. To rule out caching problems, I ran each test three times before changing LANG. Results with en_US.UTF-8: Results with en_US: Run 1 2 3 1 2 3 REAL 3.689s 3.394s 3.815s 0.005s 0.004s 0.004s USER 3.600s 3.360s 3.370s 0.000s 0.000s 0.000s SYS 0.080s 0.040s 0.050s 0.000s 0.000s 0.000s That's about one thousand times slower with UTF-8. Something's wrong with the way grep, pcre, or libc handle UTF-8. Can RedHat send out an advisory to RHN subscribers suggesting they change LANG until this is fixed? I've been chasing down speed problems off and on for months, now, and would have been grateful for a suggestion like this. An experimental grep package is available here: http://cyberelk.net/tim/data/tmp/grep-2.5.1-19.i386.rpm (srpm in the same place) I would be very grateful if people here could test it to see if there are any new bugs, in particular with multibyte (UTF-8) handling. Seems to fix. Typical output below: [root@w84040 adamso]# rpm -Uvh --force grep-2.5.1-7.i386.rpm Preparing... ########################################### [100%] 1:grep ########################################### [100%] [root@w84040 adamso]# time grep '^' /var/log/messages > /dev/null real 0m5.522s user 0m5.120s sys 0m0.070s [root@w84040 adamso]# rpm -Uvh grep-2.5.1-19.i386.rpm Preparing... ########################################### [100%] 1:grep ########################################### [100%] [root@w84040 adamso]# time grep '^' /var/log/messages > /dev/null real 0m0.143s user 0m0.100s sys 0m0.020s Although grep is a big offender, I don't think it's the only program with this problem; changing LANG to en_US speeds up almost everything. If I've read Tim's gofast patch right (no promises on that), he's caching some data to work around a bigger problem in something else. I'm caching data to work around the fact that grep was not written with multibyte characters in mind. Really it needs a rewrite. There's no problem outside grep that I need to work around, although several other utilities might have inefficient add-on support for multibyte characters. One previous offender has been sort(1), for example (but it's fixed now). Please test the grep package at: http://download.fedora.redhat.com/pub/fedora/linux/core/updates/testing/1/i386/ and verify that (a) it fixes the problem and makes UTF-8 speed comparable to ASCII speed, and (b) you don't see any regressions. grep-2.5.1-7 in /bin grep-2.5.1-17.1.i386.rpm in /test/bin LANG=en_US.UTF-8 $ time /bin/grep ^$ /etc/rc.sysinit Average time was 0.440 0.443 0.010 $ time /test/bin/grep ^$ /etc/rc.sysinit Average time was 0.021 0.10 0.10 LANG=en_US $ time /bin/grep ^$ /etc/rc.sysinit Average time was 0.0065* 0.000 0.010 $ time /test/bin/grep ^$ /etc/rc.sysinit Average time was 0.0065* 0.000 0.010 * time only gives me thousandths; runs alternated between 0.006 and 0.007. This is a major improvement in speed with UTF-8. However, en_US.UTF-8 still takes over three times as long as en_US I haven't done any regression testing yet. In response to 'however, UTF-8 still takes longer than ASCII' -- yes, this is for two reasons: a) there will *always* be more processing to do for UTF-8 than for ASCII in the general case. b) there are specific optimizations which have not been done for grep. However, they are being looked at for glibc's regex implementation, and when grep uses that rather than its own DFA then we will get those speed-ups too. Regressions will more than likely be in the cases where multi-byte characters are involved, especially in the input (rather than the expression); I really don't expect any entirely-ASCII test cases to fail where they didn't before. Some more bugs were fixed in the patch. The latest test package is grep-2.5.1-17.2, still at: http://download.fedora.redhat.com/pub/fedora/linux/core/updates/testing/1/i386/ Created attachment 96134 [details]
output
grep-2.5.1-17.2 is ok. An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2004-083.html The above-mentioned errata package does improve the performance of "grep '^'" but seems to do next to nothing for the more common case (?) of grepping for a string of characters. The difference between en_US and en_US.UTF8 is over fifty-fold with both versions: # ls -al access_log -rw-r--r-- 1 root root 126679906 Mar 20 02:24 access_log # rpm -q grep grep-2.5.1-7 # time LANG=en_US grep blahblah access_log real 0m0.559s user 0m0.290s sys 0m0.270s # time LANG=en_US grep blahblah access_log real 0m0.561s user 0m0.230s sys 0m0.330s # time LANG=en_US.UTF8 grep blahblah access_log real 0m33.016s user 0m32.640s sys 0m0.370s # time LANG=en_US.UTF8 grep blahblah access_log real 0m33.018s user 0m32.720s sys 0m0.300s # rpm -Fvh grep-2.5.1-7.8.i386.rpm Preparing... ########################################### [100%] 1:grep ########################################### [100%] # rpm -q grep grep-2.5.1-7.8 # time LANG=en_US grep blahblah access_log real 0m0.559s user 0m0.190s sys 0m0.360s # time LANG=en_US grep blahblah access_log real 0m0.558s user 0m0.110s sys 0m0.440s # time LANG=en_US.UTF8 grep blahblah access_log real 0m32.359s user 0m32.070s sys 0m0.290s # time LANG=en_US.UTF8 grep blahblah access_log real 0m32.402s user 0m32.060s sys 0m0.330s For remaining performance problems, please see bug 121313 (which has a pointer to an experimental package to try). Thanks. |