Bug 121313 - grep SLOW on multibyte LC_CTYPE
Product: Fedora
Classification: Fedora
Component: grep
Hardware: All, OS: Linux
Priority: medium, Severity: medium
Assigned To: Tim Waugh
Mike McLean
Depends On:
Blocks: 176488
Reported: 2004-04-20 08:30 EDT by Emmanuel Thomé
Modified: 2007-11-30 17:10 EST (History)
Last Closed: 2004-11-18 08:28:10 EST

Attachments
Test file I used (39.11 KB, text/plain)
2004-04-20 08:31 EDT, Emmanuel Thomé

Description Emmanuel Thomé 2004-04-20 08:30:40 EDT
Description of problem:

grep is painfully slow on multibyte locales. Slowdown factor >30 observed.

Version-Release number of selected component (if applicable):


How reproducible:

~ $ time LC_CTYPE=en_US.UTF-8 grep  '^//PS ' /tmp/r3.log  | wc -l
grep : 97.31s user 0.17s system 87% cpu 1:51.15 total

~ $ time LC_CTYPE=C grep  '^//PS ' /tmp/r3.log  | wc -l
grep : 0.22s user 0.04s system 83% cpu 0.312 total

Test file attached later on, and also downloadable from:

It's 40KB gzipped, 2.6MB gunzipped.
Comment 1 Emmanuel Thomé 2004-04-20 08:31:37 EDT
Created attachment 99558
Test file I used
Comment 2 Tim Waugh 2004-04-20 12:38:41 EDT
(2.5.1-26 is a devel package; changing version.)
Comment 3 Tim Waugh 2004-04-20 12:48:27 EDT
The longer-term solution is to make grep use the system regex for
multibyte encodings.  GNU libc's regex implementation is now quite
efficient.
Comment 4 Tim Waugh 2004-11-08 09:33:22 EST
Please try grep-2.5.1-36, available at:
Comment 5 Emmanuel Thomé 2004-11-08 09:58:51 EST
I'm happy with it.

With GREP_USE_DFA set, I observe a 2x slowdown.

Comment 6 Tim Waugh 2004-11-10 06:39:07 EST
grep-2.5.1-37 fixes a problem that can cause false matches.  It will
be available in the Fedora development tree tomorrow, or at:
Comment 7 Aleksandar Milivojevic 2005-12-22 12:35:58 EST
Nice to know the problem was fixed in Fedora Core.  However, it seems that
grep-2.5.1-31 (RHEL4) still suffers from this problem.  Any chance of fixing
that one too?  Looking at the dates in the comments, I kind of expected that a
new version of grep would be released as part of U1, or U2 at the latest.

One additional thing.  I found that grep is slow if there are many matches.  If
there are no matches (or just a few matches), it is fast.

For example:

LANG=en_US.UTF-8   # Should be default
export LANG
a=0                # Initialize the counter before the loop
while [ $a -lt 30000 ]; do
  printf "%.9d0\n" $a; a=$(( $a + 1 ))
done > testfile.txt
echo "Going to be sloooow...  Get yourself some coffee"
time grep -c '0$' testfile.txt
echo "However, this one is fast.  Sorry, no time for coffee"
time grep -c '1$' testfile.txt

It takes about 25 seconds on a 2.8GHz Pentium D to run the first grep (jeeez).
The second grep (which doesn't match any lines from the file) is fast.  Of
course, setting LANG to C or en_US solves the problem.
Comment 8 Pádraig Brady 2007-06-22 09:35:51 EDT
I think this is a problem again on Fedora Core 7 (grep-2.5.1-57.fc7).

I measured grep to be 540% slower in a UTF-8 locale than grep-2.5.1-17 (FC4)
or 2.5.1.ds1-5ubuntu2 (breezy).

On both breezy and fc7 I did the following:
  find /usr/share/doc > docs.txt
  export LANG=en_IE.utf8
  unalias grep

[breezy]$ time grep -E "(.*/[^/]*[^][:alnum:]_./,~+@#!=[{}:;'<>%& -]+[^/]*$)" <docs.txt >/dev/null
real    0m0.940s
user    0m0.891s
sys     0m0.005s

[fc7]$ time grep -E "(.*/[^/]*[^][:alnum:]_./,~+@#!=[{}:;'<>%& -]+[^/]*$)" <docs.txt >/dev/null
real    0m5.089s
user    0m4.936s
sys     0m0.007s

Note this regular expression is used by the findnl script in fslint:
  yum install fslint
  /usr/share/fslint/fslint/findnl /usr/share/doc

Apart from fixing this regression: since the data being searched
is almost entirely ASCII, couldn't one apply the optimization of
scanning each line for multibyte chars, and treating it as if
LC_CTYPE=C were set when none are found?
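The idea above can be approximated from the outside with a wrapper, as a rough sketch only. The function name count_matches_ascii_fast is hypothetical, the -a option assumes GNU (or BSD) grep, and the sketch assumes an ASCII-only pattern whose meaning is the same in the C locale. It only counts matching lines, since splitting the input does not preserve line order. This is not how grep itself implements or would implement the optimization.

```shell
#!/bin/sh
# Hypothetical helper: count lines of FILE matching the ERE PATTERN.
# Lines made only of printable ASCII and whitespace are matched in the
# C locale (fast path); all other lines are matched under the caller's
# locale (slow path).
count_matches_ascii_fast() {
  pattern=$1
  file=$2

  # Fast path: in the C locale, bytes >= 0x80 (and control bytes) are
  # neither [:print:] nor [:space:], so -v keeps plain ASCII lines only.
  fast=$(LC_ALL=C grep -a -v '[^[:print:][:space:]]' "$file" |
         LC_ALL=C grep -a -c -E -- "$pattern") || true

  # Slow path: the remaining lines, matched with the caller's locale.
  slow=$(LC_ALL=C grep -a '[^[:print:][:space:]]' "$file" |
         grep -a -c -E -- "$pattern") || true

  echo $((fast + slow))
}
```

For example, `count_matches_ascii_fast '0$' testfile.txt` would send every line of a purely ASCII file through the C-locale path.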

Comment 9 Pádraig Brady 2007-06-27 09:57:03 EDT
$ time GREP_USE_DFA=1 ./findnl /usr/share/doc/ > /dev/null
real    0m5.970s
$ time GREP_USE_DFA=0 ./findnl /usr/share/doc/ > /dev/null
real    0m24.107s

I googled a little and noticed that grep disables DFA by default
when multibyte input is used.  But aren't the vast majority of
the above input lines ASCII?

Comment 10 Pádraig Brady 2007-06-27 10:35:09 EDT
In fact, as well as DFA being faster, it looks like
it's more correct for multibyte character checking?

$ echo $LANG

$ echo -e "t\xa9st" | GREP_USE_DFA=0 grep -qE "[^[:alnum:]]" && echo "bad char"

$ echo -e "t\xa9st" | GREP_USE_DFA=1 grep -qE "[^[:alnum:]]" && echo "bad char"
bad char
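
For context on the test string: the byte 0xA9 on its own is a UTF-8 continuation byte with no lead byte, so "t\xa9st" is not well-formed UTF-8. Assuming the test above ran under a UTF-8 locale (the echoed LANG value is not shown), this can be checked independently of grep. The helper name is_valid_utf8 is hypothetical, and the sketch assumes an iconv command is installed.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if stdin is well-formed UTF-8.
# iconv exits non-zero when it meets an illegal input sequence.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}
```

Usage: `printf 't\251st\n' | is_valid_utf8 || echo "invalid UTF-8"` (octal 251 is the byte 0xA9).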
