Description of problem:
grep is painfully slow in multibyte locales. A slowdown factor of more than 30x has been observed.

Version-Release number of selected component (if applicable):
grep-2.5.1-26

How reproducible:

~ $ time LC_CTYPE=en_US.UTF-8 grep '^//PS ' /tmp/r3.log | wc -l
90304
grep : 97.31s user 0.17s system 87% cpu 1:51.15 total
~ $ time LC_CTYPE=C grep '^//PS ' /tmp/r3.log | wc -l
90304
grep : 0.22s user 0.04s system 83% cpu 0.312 total

The test file is attached below, and is also downloadable from:
http://www.loria.fr/~thome/vrac/r3.log.gz
It is 40KB gzipped, 2.6MB gunzipped.
Created attachment 99558 [details] Test file I used
(2.5.1-26 is a devel package; changing version.)
The longer-term solution is to make grep use the system regex implementation for multibyte encodings; GNU libc now has quite an efficient one.
Please try grep-2.5.1-36, available at: http://download.fedora.redhat.com/pub/fedora/linux/core/development/i386/Fedora/RPMS/
I'm happy with it. With GREP_USE_DFA set, I observe only a 2x slowdown. E.
grep-2.5.1-37 fixes a problem that can cause false matches. It will be available in the Fedora development tree tomorrow, or at: ftp://people.redhat.com/twaugh/tmp/grep/fedora-core-3/
Nice to know the problem was fixed in Fedora Core. However, it seems that grep-2.5.1-31 (RHEL4) still suffers from this problem. Any chance of fixing that one too? Looking at the dates in the comments, I kind of expected a new version of grep to be released as part of U1, or U2 at the latest.

One additional thing: I found that grep is slow only if there are many matches. If there are no matches (or just a few), it is fast. For example:

LANG=en_US.UTF-8 # Should be the default
export LANG
a=0
while [ $a -lt 30000 ]; do
    printf "%.9d0\n" $a
    a=$(( $a + 1 ))
done > testfile.txt
echo "Going to be sloooow... Get yourself some coffee"
time grep -c '0$' testfile.txt
echo "However, this one is fast. Sorry, no time for coffee"
time grep -c '1$' testfile.txt

The first grep takes about 25 seconds on a 2.8GHz Pentium D (jeeez). The second grep (which doesn't match any lines in the file) is fast. Of course, setting LANG to C or en_US avoids the problem.
I think this is a problem again on Fedora Core 7 (grep-2.5.1-57.fc7). I measured grep to be 540% slower in a UTF-8 locale than grep-2.5.1-17 (FC4) or 2.5.1.ds1-5ubuntu2 (breezy).

On both breezy and fc7 I did the following:

find /usr/share/doc > docs.txt
export LANG=en_IE.utf8
unalias grep

[breezy]$ time grep -E "(.*/[^/]*[^][:alnum:]_./,~+@#!=[{}:;'<>%& -]+[^/]*$)" <docs.txt >/dev/null
real    0m0.940s
user    0m0.891s
sys     0m0.005s

[fc7]$ time grep -E "(.*/[^/]*[^][:alnum:]_./,~+@#!=[{}:;'<>%& -]+[^/]*$)" <docs.txt >/dev/null
real    0m5.089s
user    0m4.936s
sys     0m0.007s

Note this regular expression is used by the findnl script in fslint:

yum install fslint
/usr/share/fslint/fslint/findnl /usr/share/doc

Apart from fixing this regression: since the data being searched here is almost entirely ASCII, couldn't grep apply the optimization of scanning each line for multibyte characters, and treating it as LC_CTYPE=C if none are found?

cheers, Pádraig.
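The scan-for-multibyte optimization suggested above can be sketched as a wrapper script. This is purely illustrative (the `fastgrep` function and the pre-scan heuristic are hypothetical, not grep's internal code): pre-scan the input for non-ASCII bytes and, when none are found, run the actual search in the fast C locale.

```shell
#!/bin/sh
# Hypothetical wrapper sketching the proposed optimization; not grep's
# actual implementation.
fastgrep() {
    pattern=$1
    file=$2
    # In the C locale, [:print:] and [:space:] together cover only the
    # ASCII range, so a match here means the file contains a non-ASCII
    # (or control) byte and the multibyte-aware path is needed.
    if LC_ALL=C grep -q '[^[:print:][:space:]]' "$file"; then
        grep -E "$pattern" "$file"              # multibyte-aware slow path
    else
        LC_ALL=C grep -E "$pattern" "$file"     # pure-ASCII fast path
    fi
}
```

A real implementation inside grep would do this per buffer rather than with a separate pass over the file, avoiding the double read that this wrapper incurs.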
$ time GREP_USE_DFA=1 ./findnl /usr/share/doc/ > /dev/null
real    0m5.970s
$ time GREP_USE_DFA=0 ./findnl /usr/share/doc/ > /dev/null
real    0m24.107s

I googled a little and noticed that grep disables the DFA matcher by default when a multibyte locale is in use. But the vast majority of the input lines above are ASCII?

cheers, Pádraig.
In fact, as well as being faster, it looks like the DFA matcher is more correct for multibyte character checking?

$ echo $LANG
en_IE.UTF-8
$ echo -e "t\xa9st" | GREP_USE_DFA=0 grep -qE "[^[:alnum:]]" && echo "bad char"
$ echo -e "t\xa9st" | GREP_USE_DFA=1 grep -qE "[^[:alnum:]]" && echo "bad char"
bad char
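A likely explanation for the discrepancy above (my reading, not confirmed against the grep source): the byte \xa9 is a UTF-8 continuation byte with no lead byte, so on its own it is not a valid UTF-8 sequence. The multibyte regex path cannot classify an invalid sequence as any character class and so never matches it, while the DFA path effectively falls back to byte-wise matching. iconv can be used to confirm the byte is invalid UTF-8 (\251 is the octal form of \xa9):

```shell
# \xa9 (octal \251) alone is not valid UTF-8; iconv rejects it with an
# "illegal input sequence" error and a non-zero exit status.
printf 't\251st\n' | iconv -f UTF-8 -t UTF-8
```

So whether "bad char" should be reported here depends on how invalid byte sequences ought to be treated, but silently never matching them seems like the less useful behaviour.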