Bug 69900 - grep writing output very slow
Summary: grep writing output very slow
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: grep
Version: 8.0
Hardware: i386
OS: Linux
medium
medium
Target Milestone: ---
Assignee: Tim Waugh
QA Contact: Mike McLean
URL:
Whiteboard:
: 90109 90956 91819 92018 103618 106417 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2002-07-26 02:10 UTC by keith adamson
Modified: 2007-04-18 16:44 UTC (History)
11 users (show)

Fixed In Version: 2.5.1-8
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2004-03-18 16:13:03 UTC
Embargoed:


Attachments (Terms of Use)
output (4.47 KB, text/plain)
2003-11-22 17:03 UTC, acount closed by user
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2004:083 0 normal SHIPPED_LIVE Updated grep package speeds UTF-8 searching 2004-03-18 05:00:00 UTC

Description keith adamson 2002-07-26 02:10:42 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 Galeon/1.2.5 (X11; Linux i686; U;) Gecko/20020712

Description of problem:
try "grep '^' large_file > foo"
very very slow

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1.run grep on a large file and gernerate a lot of output e.g. 
"grep '^' large_file > foo"
2.
3.
	

Actual Results:  very slow output

Expected Results:  much faster output (like 7.3 and before)

Additional info:

grep-2.5.1-3

Comment 1 keith adamson 2002-07-26 02:29:52 UTC
$ time src_extern/grep/grep-2.5/src/grep '^' /var/log/XFree86.0.log >foo

real    0m3.070s
user    0m2.950s
sys     0m0.054s

$ time src_extern/grep/grep-2.4/src/grep '^' /var/log/XFree86.0.log >foo

real    0m0.019s
user    0m0.001s
sys     0m0.005s
$

Something very wrong with grep-2.5

Comment 2 Nathan G. Grennan 2002-08-05 05:31:36 UTC
Agreed something seriously around with grep 2.5. It works fine for me with my
XFree86 log, but grepping a 1mb html file is horrible.


time /bin/grep '^' foo.htm > foo

40.020 total   grep 2.5.1-4 Limbo beta2 rpm compiled by redhat  3.2
39.819 total   grep 2.5.1-4 Target i686 compiled by me          3.2
39.308 total   grep 2.5.1-4 Compiled by me                      2.96 
38.564 total   grep 2.5.1-4 Target athlon compiled by me        3.2
0.019 total    grep 2.4.2-7 7.2 rpm compiled by redhat          2.96
0.018 total    grep 2.4.2-7 Compiled by me                      3.2
0.018 total    grep 2.4.2-7 Target athlon compiled by me        3.2

Interesting thing I noticed is that the user times from 2.5.1 are total 10x, but
user times from 2.4.2 are less than total(aka as they should be).

Check for simple spec file differences like configure or CFLAGS. Both are
equivelant.

Comment 3 Nathan G. Grennan 2002-08-05 06:00:08 UTC
38.001 total  grep CVS  Compiled by me     3.2


Comment 4 Nathan G. Grennan 2002-08-05 06:16:10 UTC
Hmmm, 2.5.1 especially sucks compared to 2.4.2.


I just did

time grep-2.4.2 '^' foo.htm > foo-2.4.2
time grep-2.5.1 '^' foo.htm > foo-2.5.1
diff -u foo-2.4.2 foo-2.5.1

There is no difference in output.



Comment 5 Michael K. Johnson 2002-08-13 21:34:56 UTC
This is UTF-8 processing.  If you use a locale without UTF-8 support,
grep goes as fast as normal.  Karsten Hopp pointed out:

>grep spends more than 80% of the time in the check_multibyte_string function.
>It is as fast as the old version when used without UTF-8:
>
>grep-2.5:
>LANG=de_DE@euro
>   time grep -q Attack /var/log/messages
>    real    0m0.078s
>    user    0m0.035s
>    sys     0m0.043s
>
>LANG=de_DE.UTF-8
>   time grep -q Attack /var/log/messages
>    real    0m2.516s
>    user    0m2.240s
>    sys     0m0.059s
>
>grep-2.4:
>LANG=de_DE@euro
>   time grep -q Attack /var/log/messages
>    real    0m0.080s
>    user    0m0.021s
>    sys     0m0.055s
>
>LANG=de_DE.UTF-8
>   time grep -q Attack /var/log/messages
>    real    0m0.079s
>    user    0m0.037s
>    sys     0m0.037s

So there is at least the workaround of setting LANG=

Comment 6 Mike A. Harris 2002-08-13 22:38:34 UTC
While that is certainly true... it sucks because we default to using
UTF-8 now.  That means out of the box, Red Hat Linux has dog slow grep
for users, and likely other applications which are dog slow too.

We can likely profile+optimize but I doubt we'll ever take up the slack.



Comment 7 keith adamson 2002-08-14 16:14:28 UTC
Agreed, basically out of the box, grep 2.5 is unusable for many shell 
applications.

I ran into this immediately and downgrade to grep 2.4.  And no, I do not want 
to go into every shell app that uses grep and set the locale.

Comment 8 Tim Waugh 2003-05-06 14:59:50 UTC
*** Bug 90109 has been marked as a duplicate of this bug. ***

Comment 9 Phil Moses 2003-05-06 18:59:07 UTC
*** Bug 90109 has been marked as a duplicate of this bug. ***

Comment 10 Tim Waugh 2003-05-15 21:55:56 UTC
*** Bug 90956 has been marked as a duplicate of this bug. ***

Comment 11 Tim Waugh 2003-05-28 17:00:10 UTC
*** Bug 91819 has been marked as a duplicate of this bug. ***

Comment 12 acount closed by user 2003-05-28 17:20:37 UTC
this is a rh_8 bug without solution at rh_9 :-?

Automatic configuration of rh_9 :
querida:/etc # cat /etc/sysconfig/i18n
LANG="en_US.UTF-8"
SUPPORTED="en_US.UTF-8:en_US:en:gl_ES.UTF-8:gl_ES:gl:es_ES.UTF-8:es_ES:es"
SYSFONT="latarcyrheb-sun16"

are we going to wait to rh_10 to solve this _general_ distribution, not grep,
problem?

-thanks-


Comment 13 Tim Waugh 2003-05-31 23:01:15 UTC
*** Bug 92018 has been marked as a duplicate of this bug. ***

Comment 14 Tim Waugh 2003-09-03 09:07:36 UTC
*** Bug 103618 has been marked as a duplicate of this bug. ***

Comment 15 Tim Waugh 2003-09-03 09:13:53 UTC
Reported upstream, with example patch (with bugs) that shows where the problem is.

Comment 16 Tim Waugh 2003-10-07 08:32:10 UTC
*** Bug 106417 has been marked as a duplicate of this bug. ***

Comment 17 David Johnston 2003-10-31 16:13:30 UTC
This is still a problem in Red Hat 9 as of 10/31/03.  Here are sample runs of grep:
########################## Info on my setup
[root@test log]# date
Fri Oct 31 11:05:57 EST 2003
[root@test log]# up2date -du
(snip)
All packages are currently up to date.
[root@test log]#  rpm -q glibc pcre grep
glibc-2.3.2-27.9
pcre-3.9-10
grep-2.5.1-7

############### grep test results
[root@test log]# export LANG='en_US.UTF-8'
[root@test log]# time grep '^' messages > /dev/null
real    0m3.689s
user    0m3.600s
sys     0m0.080s

[root@test log]# export LANG='en_US'
[root@test log]# time grep '^' messages > /dev/null
real    0m0.004s
user    0m0.000s
sys     0m0.000s

With LANG=en_US.UTF-8, grep is almost a thousand times slower.

To rule out caching problems, I ran each test three times before changing LANG.

Results with en_US.UTF-8:                  Results with en_US:
Run      1      2      3                     1      2      3
REAL  3.689s 3.394s 3.815s                 0.005s 0.004s 0.004s
USER  3.600s 3.360s 3.370s                 0.000s 0.000s 0.000s
SYS   0.080s 0.040s 0.050s                 0.000s 0.000s 0.000s

That's about one thousand times slower with UTF-8.  Something's wrong with the
way grep, pcre, or libc handle UTF-8. 
Can RedHat send out an advisory to RHN subscribers suggesting they change LANG
until this is fixed?  I've been chasing down speed problems off and on for
months, now, and would have been grateful for a suggestion like this.

Comment 18 Tim Waugh 2003-11-06 23:22:02 UTC
An experimental grep package is available here:
http://cyberelk.net/tim/data/tmp/grep-2.5.1-19.i386.rpm
(srpm in the same place)

I would be very grateful if people here could test it to see if there
are any new bugs, in particular with multibyte (UTF-8) handling.

Comment 19 keith adamson 2003-11-07 01:11:38 UTC
Seems to fix.  Typical output below:

[root@w84040 adamso]# rpm -Uvh --force grep-2.5.1-7.i386.rpm
Preparing...               
########################################### [100%]
   1:grep                  
########################################### [100%]
[root@w84040 adamso]# time grep '^' /var/log/messages > /dev/null
 
real    0m5.522s
user    0m5.120s
sys     0m0.070s
[root@w84040 adamso]# rpm -Uvh grep-2.5.1-19.i386.rpm
Preparing...               
########################################### [100%]
   1:grep                  
########################################### [100%]
[root@w84040 adamso]# time grep '^' /var/log/messages > /dev/null
 
real    0m0.143s
user    0m0.100s
sys     0m0.020s


Comment 20 David Johnston 2003-11-07 03:55:36 UTC
Although grep is a big offender, I don't think it's the only program
with this problem; changing LANG to en_US speeds up almost everything.
   If I've read Tim's gofast patch right (no promises on that), he's
caching some data to work around a bigger problem in something else.

Comment 21 Tim Waugh 2003-11-07 09:10:20 UTC
I'm caching data to work around the fact that grep was not written
with multibyte characters in mind.  Really it needs a rewrite. 
There's no problem outside grep that I need to work around, although
several other utilities might have inefficient add-on support for
multibyte characters.  One previous offender has been sort(1), for
example (but it's fixed now).

Comment 22 Tim Waugh 2003-11-14 13:25:29 UTC
Please test the grep package at:

http://download.fedora.redhat.com/pub/fedora/linux/core/updates/testing/1/i386/

and verify that (a) it fixes the problem and makes UTF-8 speed
comparable to ASCII speed, and (b) you don't see any regressions.

Comment 23 David Johnston 2003-11-18 17:49:55 UTC
grep-2.5.1-7 in /bin
grep-2.5.1-17.1.i386.rpm in /test/bin

LANG=en_US.UTF-8
$ time /bin/grep ^$ /etc/rc.sysinit
Average time was 0.440 0.443 0.010

$ time /test/bin/grep ^$  /etc/rc.sysinit
Average time was 0.021 0.10 0.10

LANG=en_US
$ time /bin/grep ^$ /etc/rc.sysinit
Average time was 0.0065* 0.000 0.010

$ time /test/bin/grep ^$  /etc/rc.sysinit
Average time was 0.0065* 0.000 0.010

* time only gives me thousandths; runs alternated between 0.006 and 0.007.

This is a major improvement in speed with UTF-8.  However, en_US.UTF-8
still takes over three times as long as en_US

I haven't done any regression testing yet.

Comment 24 Tim Waugh 2003-11-18 18:02:19 UTC
In response to 'however, UTF-8 still takes longer than ASCII' -- yes,
this is for two reasons:

a) there will *always* be more processing to do for UTF-8 than for
ASCII in the general case.

b) there are specific optimizations which have not been done for grep.
 However, they are being looked at for glibc's regex implementation,
and when grep uses that rather than its own DFA then we will get those
speed-ups too.

Regressions will more than likely be in the cases where multi-byte
characters are involved, especially in the input (rather than the
expression); I really don't expect any entirely-ASCII test cases to
fail where they didn't before.

Comment 25 Tim Waugh 2003-11-21 15:52:52 UTC
Some more bugs were fixed in the patch.  The latest test package is
grep-2.5.1-17.2, still at:

http://download.fedora.redhat.com/pub/fedora/linux/core/updates/testing/1/i386/


Comment 26 acount closed by user 2003-11-22 17:03:34 UTC
Created attachment 96134 [details]
output

Comment 27 acount closed by user 2003-11-22 17:05:05 UTC
grep-2.5.1-17.2 is ok.

Comment 28 Tim Powers 2004-03-18 16:13:03 UTC
An errata has been issued which should help the problem described in this bug report. 
This report is therefore being closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, please follow the link below. You may reopen 
this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-083.html


Comment 29 Jordan Russell 2004-03-20 18:27:02 UTC
The above-mentioned errata package does improve the performance 
of "grep '^'" but seems to do next to nothing for the more common case
(?) of grepping for a string of characters. The difference between 
en_US and en_US.UTF8 is over fifty-fold with both versions:

# ls -al access_log
-rw-r--r--    1 root     root     126679906 Mar 20 02:24 access_log
# rpm -q grep
grep-2.5.1-7
# time LANG=en_US grep blahblah access_log

real    0m0.559s
user    0m0.290s
sys     0m0.270s
# time LANG=en_US grep blahblah access_log

real    0m0.561s
user    0m0.230s
sys     0m0.330s
# time LANG=en_US.UTF8 grep blahblah access_log

real    0m33.016s
user    0m32.640s
sys     0m0.370s
# time LANG=en_US.UTF8 grep blahblah access_log

real    0m33.018s
user    0m32.720s
sys     0m0.300s
# rpm -Fvh grep-2.5.1-7.8.i386.rpm
Preparing...                
########################################### [100%]
   1:grep                   
########################################### [100%]
# rpm -q grep
grep-2.5.1-7.8
# time LANG=en_US grep blahblah access_log

real    0m0.559s
user    0m0.190s
sys     0m0.360s
# time LANG=en_US grep blahblah access_log

real    0m0.558s
user    0m0.110s
sys     0m0.440s
# time LANG=en_US.UTF8 grep blahblah access_log

real    0m32.359s
user    0m32.070s
sys     0m0.290s
# time LANG=en_US.UTF8 grep blahblah access_log

real    0m32.402s
user    0m32.060s
sys     0m0.330s

Comment 30 Tim Waugh 2004-11-09 14:21:26 UTC
For remaining performance problems, please see bug 121313 (which has a
pointer to an experimental package to try).  Thanks.


Note You need to log in before you can comment on or make changes to this bug.