Bug 69900 - grep writing output very slow
grep writing output very slow
Status: CLOSED ERRATA
Product: Red Hat Linux
Classification: Retired
Component: grep (Show other bugs)
8.0
i386 Linux
medium Severity medium
: ---
: ---
Assigned To: Tim Waugh
Mike McLean
:
: 90109 90956 91819 92018 103618 106417 (view as bug list)
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2002-07-25 22:10 EDT by keith adamson
Modified: 2007-04-18 12:44 EDT (History)
11 users (show)

See Also:
Fixed In Version: 2.5.1-8
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2004-03-18 11:13:03 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
output (4.47 KB, text/plain)
2003-11-22 12:03 EST, acount closed by user
no flags Details

  None (edit)
Description keith adamson 2002-07-25 22:10:42 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 Galeon/1.2.5 (X11; Linux i686; U;) Gecko/20020712

Description of problem:
try "grep '^' large_file > foo"
very very slow

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1.run grep on a large file and gernerate a lot of output e.g. 
"grep '^' large_file > foo"
2.
3.
	

Actual Results:  very slow output

Expected Results:  much faster output (like 7.3 and before)

Additional info:

grep-2.5.1-3
Comment 1 keith adamson 2002-07-25 22:29:52 EDT
$ time src_extern/grep/grep-2.5/src/grep '^' /var/log/XFree86.0.log >foo

real    0m3.070s
user    0m2.950s
sys     0m0.054s

$ time src_extern/grep/grep-2.4/src/grep '^' /var/log/XFree86.0.log >foo

real    0m0.019s
user    0m0.001s
sys     0m0.005s
$

Something very wrong with grep-2.5
Comment 2 Nathan G. Grennan 2002-08-05 01:31:36 EDT
Agreed something seriously around with grep 2.5. It works fine for me with my
XFree86 log, but grepping a 1mb html file is horrible.


time /bin/grep '^' foo.htm > foo

40.020 total   grep 2.5.1-4 Limbo beta2 rpm compiled by redhat  3.2
39.819 total   grep 2.5.1-4 Target i686 compiled by me          3.2
39.308 total   grep 2.5.1-4 Compiled by me                      2.96 
38.564 total   grep 2.5.1-4 Target athlon compiled by me        3.2
0.019 total    grep 2.4.2-7 7.2 rpm compiled by redhat          2.96
0.018 total    grep 2.4.2-7 Compiled by me                      3.2
0.018 total    grep 2.4.2-7 Target athlon compiled by me        3.2

Interesting thing I noticed is that the user times from 2.5.1 are total 10x, but
user times from 2.4.2 are less than total(aka as they should be).

Check for simple spec file differences like configure or CFLAGS. Both are
equivelant.
Comment 3 Nathan G. Grennan 2002-08-05 02:00:08 EDT
38.001 total  grep CVS  Compiled by me     3.2
Comment 4 Nathan G. Grennan 2002-08-05 02:16:10 EDT
Hmmm, 2.5.1 especially sucks compared to 2.4.2.


I just did

time grep-2.4.2 '^' foo.htm > foo-2.4.2
time grep-2.5.1 '^' foo.htm > foo-2.5.1
diff -u foo-2.4.2 foo-2.5.1

There is no difference in output.

Comment 5 Michael K. Johnson 2002-08-13 17:34:56 EDT
This is UTF-8 processing.  If you use a locale without UTF-8 support,
grep goes as fast as normal.  Karsten Hopp pointed out:

>grep spends more than 80% of the time in the check_multibyte_string function.
>It is as fast as the old version when used without UTF-8:
>
>grep-2.5:
>LANG=de_DE@euro
>   time grep -q Attack /var/log/messages
>    real    0m0.078s
>    user    0m0.035s
>    sys     0m0.043s
>
>LANG=de_DE.UTF-8
>   time grep -q Attack /var/log/messages
>    real    0m2.516s
>    user    0m2.240s
>    sys     0m0.059s
>
>grep-2.4:
>LANG=de_DE@euro
>   time grep -q Attack /var/log/messages
>    real    0m0.080s
>    user    0m0.021s
>    sys     0m0.055s
>
>LANG=de_DE.UTF-8
>   time grep -q Attack /var/log/messages
>    real    0m0.079s
>    user    0m0.037s
>    sys     0m0.037s

So there is at least the workaround of setting LANG=
Comment 6 Mike A. Harris 2002-08-13 18:38:34 EDT
While that is certainly true... it sucks because we default to using
UTF-8 now.  That means out of the box, Red Hat Linux has dog slow grep
for users, and likely other applications which are dog slow too.

We can likely profile+optimize but I doubt we'll ever take up the slack.

Comment 7 keith adamson 2002-08-14 12:14:28 EDT
Agreed, basically out of the box, grep 2.5 is unusable for many shell 
applications.

I ran into this immediately and downgrade to grep 2.4.  And no, I do not want 
to go into every shell app that uses grep and set the locale.
Comment 8 Tim Waugh 2003-05-06 10:59:50 EDT
*** Bug 90109 has been marked as a duplicate of this bug. ***
Comment 9 Phil Moses 2003-05-06 14:59:07 EDT
*** Bug 90109 has been marked as a duplicate of this bug. ***
Comment 10 Tim Waugh 2003-05-15 17:55:56 EDT
*** Bug 90956 has been marked as a duplicate of this bug. ***
Comment 11 Tim Waugh 2003-05-28 13:00:10 EDT
*** Bug 91819 has been marked as a duplicate of this bug. ***
Comment 12 acount closed by user 2003-05-28 13:20:37 EDT
this is a rh_8 bug without solution at rh_9 :-?

Automatic configuration of rh_9 :
querida:/etc # cat /etc/sysconfig/i18n
LANG="en_US.UTF-8"
SUPPORTED="en_US.UTF-8:en_US:en:gl_ES.UTF-8:gl_ES:gl:es_ES.UTF-8:es_ES:es"
SYSFONT="latarcyrheb-sun16"

are we going to wait to rh_10 to solve this _general_ distribution, not grep,
problem?

-thanks-
Comment 13 Tim Waugh 2003-05-31 19:01:15 EDT
*** Bug 92018 has been marked as a duplicate of this bug. ***
Comment 14 Tim Waugh 2003-09-03 05:07:36 EDT
*** Bug 103618 has been marked as a duplicate of this bug. ***
Comment 15 Tim Waugh 2003-09-03 05:13:53 EDT
Reported upstream, with example patch (with bugs) that shows where the problem is.
Comment 16 Tim Waugh 2003-10-07 04:32:10 EDT
*** Bug 106417 has been marked as a duplicate of this bug. ***
Comment 17 David Johnston 2003-10-31 11:13:30 EST
This is still a problem in Red Hat 9 as of 10/31/03.  Here are sample runs of grep:
########################## Info on my setup
[root@test log]# date
Fri Oct 31 11:05:57 EST 2003
[root@test log]# up2date -du
(snip)
All packages are currently up to date.
[root@test log]#  rpm -q glibc pcre grep
glibc-2.3.2-27.9
pcre-3.9-10
grep-2.5.1-7

############### grep test results
[root@test log]# export LANG='en_US.UTF-8'
[root@test log]# time grep '^' messages > /dev/null
real    0m3.689s
user    0m3.600s
sys     0m0.080s

[root@test log]# export LANG='en_US'
[root@test log]# time grep '^' messages > /dev/null
real    0m0.004s
user    0m0.000s
sys     0m0.000s

With LANG=en_US.UTF-8, grep is almost a thousand times slower.

To rule out caching problems, I ran each test three times before changing LANG.

Results with en_US.UTF-8:                  Results with en_US:
Run      1      2      3                     1      2      3
REAL  3.689s 3.394s 3.815s                 0.005s 0.004s 0.004s
USER  3.600s 3.360s 3.370s                 0.000s 0.000s 0.000s
SYS   0.080s 0.040s 0.050s                 0.000s 0.000s 0.000s

That's about one thousand times slower with UTF-8.  Something's wrong with the
way grep, pcre, or libc handle UTF-8. 
Can RedHat send out an advisory to RHN subscribers suggesting they change LANG
until this is fixed?  I've been chasing down speed problems off and on for
months, now, and would have been grateful for a suggestion like this.
Comment 18 Tim Waugh 2003-11-06 18:22:02 EST
An experimental grep package is available here:
http://cyberelk.net/tim/data/tmp/grep-2.5.1-19.i386.rpm
(srpm in the same place)

I would be very grateful if people here could test it to see if there
are any new bugs, in particular with multibyte (UTF-8) handling.
Comment 19 keith adamson 2003-11-06 20:11:38 EST
Seems to fix.  Typical output below:

[root@w84040 adamso]# rpm -Uvh --force grep-2.5.1-7.i386.rpm
Preparing...               
########################################### [100%]
   1:grep                  
########################################### [100%]
[root@w84040 adamso]# time grep '^' /var/log/messages > /dev/null
 
real    0m5.522s
user    0m5.120s
sys     0m0.070s
[root@w84040 adamso]# rpm -Uvh grep-2.5.1-19.i386.rpm
Preparing...               
########################################### [100%]
   1:grep                  
########################################### [100%]
[root@w84040 adamso]# time grep '^' /var/log/messages > /dev/null
 
real    0m0.143s
user    0m0.100s
sys     0m0.020s
Comment 20 David Johnston 2003-11-06 22:55:36 EST
Although grep is a big offender, I don't think it's the only program
with this problem; changing LANG to en_US speeds up almost everything.
   If I've read Tim's gofast patch right (no promises on that), he's
caching some data to work around a bigger problem in something else.
Comment 21 Tim Waugh 2003-11-07 04:10:20 EST
I'm caching data to work around the fact that grep was not written
with multibyte characters in mind.  Really it needs a rewrite. 
There's no problem outside grep that I need to work around, although
several other utilities might have inefficient add-on support for
multibyte characters.  One previous offender has been sort(1), for
example (but it's fixed now).
Comment 22 Tim Waugh 2003-11-14 08:25:29 EST
Please test the grep package at:

http://download.fedora.redhat.com/pub/fedora/linux/core/updates/testing/1/i386/

and verify that (a) it fixes the problem and makes UTF-8 speed
comparable to ASCII speed, and (b) you don't see any regressions.
Comment 23 David Johnston 2003-11-18 12:49:55 EST
grep-2.5.1-7 in /bin
grep-2.5.1-17.1.i386.rpm in /test/bin

LANG=en_US.UTF-8
$ time /bin/grep ^$ /etc/rc.sysinit
Average time was 0.440 0.443 0.010

$ time /test/bin/grep ^$  /etc/rc.sysinit
Average time was 0.021 0.10 0.10

LANG=en_US
$ time /bin/grep ^$ /etc/rc.sysinit
Average time was 0.0065* 0.000 0.010

$ time /test/bin/grep ^$  /etc/rc.sysinit
Average time was 0.0065* 0.000 0.010

* time only gives me thousandths; runs alternated between 0.006 and 0.007.

This is a major improvement in speed with UTF-8.  However, en_US.UTF-8
still takes over three times as long as en_US

I haven't done any regression testing yet.
Comment 24 Tim Waugh 2003-11-18 13:02:19 EST
In response to 'however, UTF-8 still takes longer than ASCII' -- yes,
this is for two reasons:

a) there will *always* be more processing to do for UTF-8 than for
ASCII in the general case.

b) there are specific optimizations which have not been done for grep.
 However, they are being looked at for glibc's regex implementation,
and when grep uses that rather than its own DFA then we will get those
speed-ups too.

Regressions will more than likely be in the cases where multi-byte
characters are involved, especially in the input (rather than the
expression); I really don't expect any entirely-ASCII test cases to
fail where they didn't before.
Comment 25 Tim Waugh 2003-11-21 10:52:52 EST
Some more bugs were fixed in the patch.  The latest test package is
grep-2.5.1-17.2, still at:

http://download.fedora.redhat.com/pub/fedora/linux/core/updates/testing/1/i386/
Comment 26 acount closed by user 2003-11-22 12:03:34 EST
Created attachment 96134 [details]
output
Comment 27 acount closed by user 2003-11-22 12:05:05 EST
grep-2.5.1-17.2 is ok.
Comment 28 Tim Powers 2004-03-18 11:13:03 EST
An errata has been issued which should help the problem described in this bug report. 
This report is therefore being closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, please follow the link below. You may reopen 
this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-083.html
Comment 29 Jordan Russell 2004-03-20 13:27:02 EST
The above-mentioned errata package does improve the performance 
of "grep '^'" but seems to do next to nothing for the more common case
(?) of grepping for a string of characters. The difference between 
en_US and en_US.UTF8 is over fifty-fold with both versions:

# ls -al access_log
-rw-r--r--    1 root     root     126679906 Mar 20 02:24 access_log
# rpm -q grep
grep-2.5.1-7
# time LANG=en_US grep blahblah access_log

real    0m0.559s
user    0m0.290s
sys     0m0.270s
# time LANG=en_US grep blahblah access_log

real    0m0.561s
user    0m0.230s
sys     0m0.330s
# time LANG=en_US.UTF8 grep blahblah access_log

real    0m33.016s
user    0m32.640s
sys     0m0.370s
# time LANG=en_US.UTF8 grep blahblah access_log

real    0m33.018s
user    0m32.720s
sys     0m0.300s
# rpm -Fvh grep-2.5.1-7.8.i386.rpm
Preparing...                
########################################### [100%]
   1:grep                   
########################################### [100%]
# rpm -q grep
grep-2.5.1-7.8
# time LANG=en_US grep blahblah access_log

real    0m0.559s
user    0m0.190s
sys     0m0.360s
# time LANG=en_US grep blahblah access_log

real    0m0.558s
user    0m0.110s
sys     0m0.440s
# time LANG=en_US.UTF8 grep blahblah access_log

real    0m32.359s
user    0m32.070s
sys     0m0.290s
# time LANG=en_US.UTF8 grep blahblah access_log

real    0m32.402s
user    0m32.060s
sys     0m0.330s
Comment 30 Tim Waugh 2004-11-09 09:21:26 EST
For remaining performance problems, please see bug 121313 (which has a
pointer to an experimental package to try).  Thanks.

Note You need to log in before you can comment on or make changes to this bug.