Bug 499220 - Coreutils i18n patch terribly affects performance with UTF-8 locales for sort, cut and others
Product: Fedora
Classification: Fedora
Component: coreutils (Show other bugs)
Hardware: All
OS: Linux
Priority: low
Severity: medium
Assigned To: Ondrej Vasik
QA Contact: Fedora Extras Quality Assurance
Keywords: FutureFeature
Depends On:
Blocks: 553570 1021403 1063212
Reported: 2009-05-05 12:16 EDT by J Gallagher
Modified: 2014-02-10 04:53 EST (History)
Fixed In Version: coreutils-8.22-8.fc21
Doc Type: Enhancement
: 538423 553570 1021403 (view as bug list)
Last Closed: 2014-01-08 09:39:18 EST

Attachments:
creates ~220mb text file and parses with common tools (2.05 KB, application/x-sh)
2009-05-05 12:16 EDT, J Gallagher

Description J Gallagher 2009-05-05 12:16:40 EDT
Created attachment 342488 [details]
creates ~220mb text file and parses with common tools

Description of problem:
Parsing large files with cut and grep is much slower in a UTF-8 locale than a C locale

Version-Release number of selected component (if applicable):
All recent releases

How reproducible:
Parse a large file with cut or grep in a UTF-8 locale, then set LANG=C and repeat: the speed improves dramatically.

Steps to Reproduce:
1. run the attached script textparse.sh in a UTF-8 locale
2. execute LANG=C and run the script again
3. notice that only the cut and grep timings are significantly faster (and sed a little)
Actual results:

cut and grep are dramatically faster in the C locale; the other tools perform the same (sed is a little faster)

Expected results:

Expect cut and grep to behave like other tools in UTF-8 locales
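
A scaled-down reproduction can be sketched as follows. This is not the attached textparse.sh (its contents are not shown in this report). The helper names make_input, time_cut and same_output and the generated words are made up for illustration, and the UTF-8 locale passed in is assumed to be installed (check with locale -a). The timing helper relies on the bash time keyword.

```shell
#!/bin/bash
# Sketch only: NOT the attached textparse.sh; all names and data are hypothetical.

# make_input FILE LINES: write LINES lines of space-separated ASCII words.
make_input() {
    awk -v n="$2" 'BEGIN { for (i = 1; i <= n; i++) print "w" i, "alpha", "beta", "gamma" }' > "$1"
}

# time_cut LOCALE FILE: time the cut command from this report under one locale.
# LC_ALL is set instead of LANG because it overrides all other locale variables.
time_cut() {
    echo "locale: $1" >&2
    time LC_ALL="$1" cut -f1 -d' ' "$2" > /dev/null
}

# same_output FILE LOCALE_A LOCALE_B: succeed if cut prints identical bytes
# under both locales, i.e. only the run time is expected to differ.
same_output() {
    a=$(LC_ALL="$2" cut -f1 -d' ' "$1" | cksum)
    b=$(LC_ALL="$3" cut -f1 -d' ' "$1" | cksum)
    [ "$a" = "$b" ]
}
```

For example, running make_input file1 10000000, then time_cut C file1 and time_cut en_GB.UTF-8 file1, mirrors the comparison described here; same_output file1 C en_GB.UTF-8 checks that the output is byte-identical in both locales.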

Additional info:
Comment 1 J Gallagher 2009-05-05 12:39:18 EDT
cutting to the chase (no pun intended), here's the cut performance in both locales:

$ time cut -f1 -d' ' file1 > file2
real	0m4.861s
user	0m1.767s
sys	0m0.615s

$ LANG=en_GB.UTF-8
$ time cut -f1 -d' ' file1 > file2
real	0m25.465s
user	0m22.607s
sys	0m0.640s

file1 is a 10,000,000 line text file generated by the script

can't post grep results because the particular grep word-regex is broken in F11, which I'm currently on (although results are reproducible across all other Fedora releases) (separate bug submitted)
Comment 2 Kamil Dudka 2009-05-06 14:18:28 EDT
Thank you for the report and the script. I can confirm your results with cut. It is caused by the i18n patch. I've spent some time profiling it.

Looking into cut_fields_mb(), there is a bit of waste in calling print_kth in the loop with invariant parameters, but nothing major. Most of the time is spent calling the mbrtowc() function, and there is not much we can do about that within coreutils.

Any idea?
Comment 3 Kamil Dudka 2009-05-06 14:56:31 EDT
Maybe converting more characters to wide characters at once (mbsrtowcs) is the way to go, but I am just guessing now. It's a significant change and it would need to be tested thoroughly first...
Comment 4 J Gallagher 2009-05-07 09:32:01 EDT
You might want to check how they improved UTF-8 performance in the recent 4.2 release of GNU sed

Comment 5 Kamil Dudka 2009-05-07 14:19:07 EDT
Are you talking about this improvement?

Comment 6 J Gallagher 2009-05-07 20:29:45 EDT
That looks like the relevant code change; not much, is it? I wonder how much of an improvement it gives. It seems that in execute.c a test 'if (mb_cur_max > 1 && !is_utf8)' is made, so in the UTF-8 case a loop is skipped.

I'm not an expert on this multibyte character processing, just raised the issue as the performance difference is staggering.
Comment 7 Ondrej Vasik 2009-05-20 10:02:19 EDT
Thanks for the report. As the issue is confirmed, I'm changing the status to assigned, although I don't want to invest time in the i18n patch. It was never accepted by upstream (not a very good design, portability questions, hidden issues such as the recent segfault in join...), and upstream plans its own multibyte character support, not based on our patch. So making performance improvements to the i18n patch might be a waste of time... Maybe I'll take a look at it, but I'm not sure there is an easy way to improve its performance...
Comment 8 Maurizio Paolini 2009-07-01 09:09:11 EDT
I can confirm the problem on Fedora 11, grep version 2.5.3.  The problem
can be reproduced as follows:

$ for n in `seq 10000`
> do
>  echo "0" >>test.txt
> done
$ export LANG=en_US.UTF-8
$ time grep [0] pippo >/dev/null

real    0m9.102s
user    0m8.419s
sys     0m0.021s

while without utf8 the result is OK:

$ export LANG=en_US
$ time grep [0] pippo >/dev/null

real    0m0.018s
user    0m0.004s
sys     0m0.001s

This is a nasty bug because it impacts a lot of system scripts!

One note: the same grep command but without the '[' and ']'
does not have the problem:

$ export LANG=en_US.utf8
$ time grep 0 pippo >/dev/null

real    0m0.009s
user    0m0.004s
sys     0m0.002s
Comment 9 Maurizio Paolini 2009-07-17 12:40:27 EDT
(In reply to comment #8)
There was a mistake (the file name in the grep command)
in the bash commands to reproduce the bug in my previous post.
Here is the corrected version

for n in $( seq 10000 )
do
  echo "0" >>test.txt
done

export LANG=en_US.UTF-8
time grep [0] test.txt >/dev/null
export LANG=en_US
time grep [0] test.txt >/dev/null
Comment 10 Bug Zapper 2009-11-18 06:54:10 EST
This message is a reminder that Fedora 10 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 10.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '10'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 10's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 10 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
Comment 11 Kamil Dudka 2009-11-18 12:23:45 EST
I've just changed the version to rawhide. There is no known solution and nobody is working on the fix right now. It will definitely not get into Fedora 10.
Comment 12 Ondrej Vasik 2009-11-18 12:36:44 EST
And as Maurizio Paolini filed rhbz #538423 against grep, we could modify the summary to be more accurate and cover only coreutils.
Comment 13 Bug Zapper 2010-03-15 08:35:26 EDT
This bug appears to have been reported against 'rawhide' during the Fedora 13 development cycle.
Changing version to '13'.

More information and reason for this action is here:
Comment 14 Akira TAGOH 2011-05-24 08:34:56 EDT
Following comment #11, I've added a FutureFeature tag and am moving this back to rawhide, because auto-closing by Bug Zapper shouldn't be the expected result.
Comment 15 Ondrej Vasik 2011-05-24 09:15:48 EDT
OK, FutureFeature tracking seems reasonable. Anyway, there is not much to do at the moment: we are aware of the performance issue caused by the i18n patch. Anyone is more than welcome to propose ways to reduce the impact of the patch for common use cases.
Comment 16 István Tóth 2012-01-30 05:16:33 EST
I've run Maurizio's test case on Fedora 16, as well as my own scripts that were affected by this problem, and found that the performance regression is fixed now.

export LANG=en_US
time grep [0] test.txt >/dev/null

real	0m0.004s
user	0m0.002s
sys	0m0.002s
export LANG=en_US.UTF-8
time grep [0] test.txt >/dev/null

real	0m0.004s
user	0m0.003s
sys	0m0.001s

I have also run the attached textparse.sh, and the only significant differences between en_US and en_US.UTF-8 were in 
sed (5 sec vs 9 sec), cut (13 sec vs 1.5sec), and perl (108 sec vs 44 sec)

It looks like cut is the only program in coreutils that still has a serious performance problem with UTF-8 (at least among those tested in textparse.sh).
Comment 17 Kamil Dudka 2012-01-30 05:41:44 EST
(In reply to comment #16)
> export LANG=en_US
> time grep [0] test.txt >/dev/null
> real 0m0.004s
> user 0m0.002s
> sys 0m0.002s
> export LANG=en_US.UTF-8
> time grep [0] test.txt >/dev/null
> real 0m0.004s
> user 0m0.003s
> sys 0m0.001s

You tested grep whereas this bug is about coreutils.
Comment 18 Ondrej Vasik 2014-01-08 09:39:18 EST
It turned out that cut in field mode is actually the only scenario that could be improved reasonably easily.

Fixed by https://lists.fedoraproject.org/pipermail/scm-commits/Week-of-Mon-20140106/1168699.html - closing RAWHIDE, as we don't plan any other multibyte handling performance improvements in coreutils in the near future.

Here are the results of my testing:
(input file 3M4, 100k lines with 6 columns, each column holding a word beginning with F)
Old package: cut -f3 mytestfile >/dev/null  -> time : 0.223s
New package: cut -f3 mytestfile >/dev/null  -> time : 0.029s

Old package: cut -d'F' -f3- mytestfile >/dev/null -> time : 0.264s 
New package: cut -d'F' -f3- mytestfile >/dev/null -> time : 0.032s
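
A test input of the shape described above could be generated along these lines. Only the shape (six columns per line, each word beginning with F) follows the description; the TAB separator is assumed because cut -f3 is run without -d and TAB is cut's default delimiter. The words themselves and the helper names make_table and bench_cut are made up, and the resulting file size will not match the original. The timing helper relies on the bash time keyword.

```shell
#!/bin/bash
# Sketch only: hypothetical helpers, not the maintainer's actual test setup.

# make_table FILE LINES: write LINES lines of six TAB-separated columns,
# each holding a word that begins with "F".
make_table() {
    awk -v n="$2" 'BEGIN {
        OFS = "\t"
        for (i = 1; i <= n; i++) print "Foo" i, "Far", "Fig", "Fun", "Fox", "Fir"
    }' > "$1"
}

# bench_cut FILE: time the two cut invocations quoted in the comment above.
bench_cut() {
    time cut -f3 "$1" > /dev/null
    time cut -d'F' -f3- "$1" > /dev/null
}
```

For example, make_table mytestfile 100000 followed by bench_cut mytestfile runs both measurements on whichever coreutils package is installed.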
