Description of problem:
grep is painfully slow in multibyte locales. A slowdown factor of more than 30x has been observed.

Version-Release number of selected component (if applicable):
grep-2.5.1-26

How reproducible:

~ $ time LC_CTYPE=en_US.UTF-8 grep '^//PS ' /tmp/r3.log | wc -l
90304
grep : 97.31s user 0.17s system 87% cpu 1:51.15 total
~ $ time LC_CTYPE=C grep '^//PS ' /tmp/r3.log | wc -l
90304
grep : 0.22s user 0.04s system 83% cpu 0.312 total

The test file is attached below, and is also downloadable from:
http://www.loria.fr/~thome/vrac/r3.log.gz
It is 40KB gzipped, 2.6MB gunzipped.
Created attachment 99558 [details] Test file I used
(2.5.1-26 is a devel package; changing version.)
The longer-term solution is to make grep use the system regex implementation for multibyte encodings; GNU libc now has quite an efficient one.
Please try grep-2.5.1-36, available at: http://download.fedora.redhat.com/pub/fedora/linux/core/development/i386/Fedora/RPMS/
I'm happy with it. With GREP_USE_DFA set, I observe only a 2x slowdown. E.
grep-2.5.1-37 fixes a problem that can cause false matches. It will be available in the Fedora development tree tomorrow, or at: ftp://people.redhat.com/twaugh/tmp/grep/fedora-core-3/
Nice to know the problem was fixed in Fedora Core. However, it seems that grep-2.5.1-31 (RHEL4) still suffers from this problem. Any chance of fixing that one too? Looking at the dates in the comments, I kind of expected a new version of grep to be released as part of U1, or U2 at the latest.

One additional thing: I found that grep is slow only if there are many matches. If there are no matches (or just a few), it is fast. For example:

LANG=en_US.UTF-8 # Should be the default
export LANG
a=0
while [ $a -lt 30000 ]; do
    printf "%.9d0\n" $a
    a=$(( $a + 1 ))
done > testfile.txt
echo "Going to be sloooow... Get yourself some coffee"
time grep -c '0$' testfile.txt
echo "However, this one is fast. Sorry, no time for coffee"
time grep -c '1$' testfile.txt

The first grep takes about 25 seconds on a 2.8GHz Pentium D (jeeez). The second grep (which doesn't match any lines in the file) is fast. Of course, setting LANG to C or en_US avoids the problem.
I think this is a problem again on Fedora Core 7 (grep-2.5.1-57.fc7). I measured grep to be 540% slower in a UTF-8 locale than grep-2.5.1-17 (FC4) or 2.5.1.ds1-5ubuntu2 (breezy).

On both breezy and fc7 I did the following:

find /usr/share/doc > docs.txt
export LANG=en_IE.utf8
unalias grep

[breezy]$ time grep -E "(.*/[^/]*[^][:alnum:]_./,~+@#!=[{}:;'<>%& -]+[^/]*$)" <docs.txt >/dev/null
real    0m0.940s
user    0m0.891s
sys     0m0.005s

[fc7]$ time grep -E "(.*/[^/]*[^][:alnum:]_./,~+@#!=[{}:;'<>%& -]+[^/]*$)" <docs.txt >/dev/null
real    0m5.089s
user    0m4.936s
sys     0m0.007s

Note this regular expression is used by the findnl script in fslint:

yum install fslint
/usr/share/fslint/fslint/findnl /usr/share/doc

Apart from fixing this regression: since the data being searched here is almost entirely ASCII, couldn't grep apply the optimization of scanning each line for multibyte characters, and treating it as LC_CTYPE=C if none are found?

cheers, Pádraig.
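The scan-for-multibyte optimization suggested above can be sketched as a wrapper script. This is purely illustrative (the `fastgrep` function and the pre-scan heuristic are hypothetical, not grep's internal code): pre-scan the input for non-ASCII bytes and, when none are found, run the actual search in the fast C locale.

```shell
#!/bin/sh
# Hypothetical wrapper sketching the proposed optimization; not grep's
# actual implementation.
fastgrep() {
    pattern=$1
    file=$2
    # In the C locale, [:print:] and [:space:] together cover only the
    # ASCII range, so a match here means the file contains a non-ASCII
    # (or control) byte and the multibyte-aware path is needed.
    if LC_ALL=C grep -q '[^[:print:][:space:]]' "$file"; then
        grep -E "$pattern" "$file"              # multibyte-aware slow path
    else
        LC_ALL=C grep -E "$pattern" "$file"     # pure-ASCII fast path
    fi
}
```

A real implementation inside grep would do this per buffer rather than with a separate pass over the file, avoiding the double read that this wrapper incurs.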
$ time GREP_USE_DFA=1 ./findnl /usr/share/doc/ > /dev/null
real    0m5.970s
$ time GREP_USE_DFA=0 ./findnl /usr/share/doc/ > /dev/null
real    0m24.107s

I googled a little and noticed that grep disables the DFA matcher by default when a multibyte locale is in use. But the vast majority of the input lines above are ASCII?

cheers, Pádraig.
In fact, as well as being faster, it looks like the DFA matcher is more correct for multibyte character checking?

$ echo $LANG
en_IE.UTF-8
$ echo -e "t\xa9st" | GREP_USE_DFA=0 grep -qE "[^[:alnum:]]" && echo "bad char"
$ echo -e "t\xa9st" | GREP_USE_DFA=1 grep -qE "[^[:alnum:]]" && echo "bad char"
bad char
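A likely explanation for the discrepancy above (my reading, not confirmed against the grep source): the byte \xa9 is a UTF-8 continuation byte with no lead byte, so on its own it is not a valid UTF-8 sequence. The multibyte regex path cannot classify an invalid sequence as any character class and so never matches it, while the DFA path effectively falls back to byte-wise matching. iconv can be used to confirm the byte is invalid UTF-8 (\251 is the octal form of \xa9):

```shell
# \xa9 (octal \251) alone is not valid UTF-8; iconv rejects it with an
# "illegal input sequence" error and a non-zero exit status.
printf 't\251st\n' | iconv -f UTF-8 -t UTF-8
```

So whether "bad char" should be reported here depends on how invalid byte sequences ought to be treated, but silently never matching them seems like the less useful behaviour.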