Bug 121313 - grep SLOW on multibyte LC_CTYPE
Status: CLOSED RAWHIDE
Product: Fedora
Classification: Fedora
Component: grep
Version: rawhide
Hardware: All Linux
Priority: medium
Severity: medium
Assigned To: Tim Waugh
QA Contact: Mike McLean
Blocks: 176488
Reported: 2004-04-20 08:30 EDT by Emmanuel Thomé
Modified: 2007-11-30 17:10 EST (History)
Doc Type: Bug Fix
Last Closed: 2004-11-18 08:28:10 EST

Attachments:
Test file I used (39.11 KB, text/plain), 2004-04-20 08:31 EDT, Emmanuel Thomé

Description Emmanuel Thomé 2004-04-20 08:30:40 EDT
Description of problem:

grep is painfully slow on multibyte locales. Slowdown factor >30 observed.

Version-Release number of selected component (if applicable):

grep-2.5.1-26

How reproducible:

~ $ time LC_CTYPE=en_US.UTF-8 grep  '^//PS ' /tmp/r3.log  | wc -l
90304
grep : 97.31s user 0.17s system 87% cpu 1:51.15 total

~ $ time LC_CTYPE=C grep  '^//PS ' /tmp/r3.log  | wc -l
90304
grep : 0.22s user 0.04s system 83% cpu 0.312 total


Test file attached later on, and also downloadable from:

http://www.loria.fr/~thome/vrac/r3.log.gz

It's 40KB gzipped, 2.6MB gunzipped.
Comment 1 Emmanuel Thomé 2004-04-20 08:31:37 EDT
Created attachment 99558
Test file I used
Comment 2 Tim Waugh 2004-04-20 12:38:41 EDT
(2.5.1-26 is a devel package; changing version.)
Comment 3 Tim Waugh 2004-04-20 12:48:27 EDT
The longer-term solution is to make grep use the system regex for
multibyte encodings.  GNU libc now has quite an efficient
implementation.
Comment 4 Tim Waugh 2004-11-08 09:33:22 EST
Please try grep-2.5.1-36, available at:

http://download.fedora.redhat.com/pub/fedora/linux/core/development/i386/Fedora/RPMS/
Comment 5 Emmanuel Thomé 2004-11-08 09:58:51 EST
I'm happy with it.

With GREP_USE_DFA set, I observe a 2x slowdown.

E.
Comment 6 Tim Waugh 2004-11-10 06:39:07 EST
grep-2.5.1-37 fixes a problem that can cause false matches.  It will
be available in the Fedora development tree tomorrow, or at:

  ftp://people.redhat.com/twaugh/tmp/grep/fedora-core-3/
Comment 7 Aleksandar Milivojevic 2005-12-22 12:35:58 EST
Nice to know the problem was fixed in Fedora Core.  However, it seems that
grep-2.5.1-31 (RHEL4) still suffers from this problem.  Any chance of fixing
that one too?  Looking at the dates in the comments, I kind of expected that
a new version of grep would have been released as part of U1, or U2 at the
latest.

One additional thing.  I found that grep is slow if there are many matches.  If
there are no matches (or just a few matches), it is fast.

For example:

LANG=en_US.UTF-8   # Should be default
export LANG
a=0
while [ $a -lt 30000 ]; do
  printf "%.9d0\n" $a; a=$(( $a + 1 ))
done > testfile.txt
echo "Going to be sloooow...  Get yourself some coffee"
time grep -c '0$' testfile.txt
echo "However, this one is fast.  Sorry, no time for coffee"
time grep -c '1$' testfile.txt

It takes about 25 seconds on 2.8GHz Pentium D to run the first grep (jeeez). 
The second grep (that doesn't match any lines from the file) is fast.  Of
course, setting LANG to C or en_US solves the problem.
Comment 8 Pádraig Brady 2007-06-22 09:35:51 EDT
I think this is a problem again on Fedora Core 7 (grep-2.5.1-57.fc7).

I measured grep to be 540% slower in a UTF-8 locale than grep-2.5.1-17 (FC4)
or 2.5.1.ds1-5ubuntu2 (breezy).

On both breezy and fc7 I did the following:
  find /usr/share/doc > docs.txt
  export LANG=en_IE.utf8
  unalias grep

[breezy]$ time grep -E "(.*/[^/]*[^][:alnum:]_./,~+@#!=[{}:;'<>%& -]+[^/]*$)" <docs.txt >/dev/null
real    0m0.940s
user    0m0.891s
sys     0m0.005s

[fc7]$ time grep -E "(.*/[^/]*[^][:alnum:]_./,~+@#!=[{}:;'<>%& -]+[^/]*$)" <docs.txt >/dev/null
real    0m5.089s
user    0m4.936s
sys     0m0.007s

Note this regular expression is used by the findnl script in fslint:
  yum install fslint
  /usr/share/fslint/fslint/findnl /usr/share/doc

Apart from fixing this regression: since the data being searched is
almost entirely ASCII, couldn't one apply the optimization of scanning
each line for multibyte characters and treating it as LC_CTYPE=C if
none are found?

cheers,
Pádraig.
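
For illustration only, the idea above can be approximated from the outside with a shell wrapper. This is a rough sketch, not how grep itself would implement it: it checks the whole file rather than each line, the names is_ascii_only and fast_grep are hypothetical, and it assumes the pattern is plain ASCII and does not rely on locale-dependent bracket ranges or case folding.

```shell
# Hypothetical helper: is_ascii_only FILE
# Succeeds when FILE contains no bytes outside the 7-bit ASCII range.
is_ascii_only() {
    # Delete every byte in 0..127; anything left over is non-ASCII.
    # The substitution is left unquoted so padding from wc is dropped.
    [ $(LC_ALL=C tr -d '\000-\177' < "$1" | wc -c) -eq 0 ]
}

# Hypothetical wrapper: fast_grep PATTERN FILE
# Runs grep in the C locale when FILE is pure ASCII, and in the
# caller's locale otherwise.
fast_grep() {
    if is_ascii_only "$2"; then
        LC_ALL=C grep -E -- "$1" "$2"
    else
        grep -E -- "$1" "$2"
    fi
}
```

Usage would look like `fast_grep '^//PS ' /tmp/r3.log`. The extra pass over the file costs one read, which is small next to the slowdowns reported in this bug.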
Comment 9 Pádraig Brady 2007-06-27 09:57:03 EDT
$ time GREP_USE_DFA=1 ./findnl /usr/share/doc/ > /dev/null
real    0m5.970s
$ time GREP_USE_DFA=0 ./findnl /usr/share/doc/ > /dev/null
real    0m24.107s

I googled a little and noticed that grep disables DFA by default
when multibyte input is used. But the vast majority of
the above input lines are ASCII?

cheers,
Pádraig.
Comment 10 Pádraig Brady 2007-06-27 10:35:09 EDT
In fact, as well as DFA being faster, it looks like
it's more correct for multibyte character checking?

$ echo $LANG
en_IE.UTF-8

$ echo -e "t\xa9st" | GREP_USE_DFA=0 grep -qE "[^[:alnum:]]" && echo "bad char"

$ echo -e "t\xa9st" | GREP_USE_DFA=1 grep -qE "[^[:alnum:]]" && echo "bad char"
bad char
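
One detail that may matter when comparing the two results: a lone 0xA9 byte is a UTF-8 continuation byte, so "t\xa9st" is not well-formed UTF-8. A minimal sketch for checking that, assuming an iconv that exits non-zero on invalid input (the helper name is_valid_utf8 is hypothetical):

```shell
# Hypothetical helper: succeeds when standard input is well-formed UTF-8.
# Assumes iconv reports invalid sequences through a non-zero exit status.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# \251 is octal for 0xA9, the same byte used in the test above.
if printf 't\251st\n' | is_valid_utf8; then
    echo "valid UTF-8"
else
    echo "invalid UTF-8"
fi
```

Whether an invalid byte should count as a match for [^[:alnum:]] is a separate question; the sketch only shows that the input is outside the encoding.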
