Red Hat Bugzilla – Full Text Bug Listing
|Summary:||Coreutils i18n patch terribly affects performance with UTF-8 locales for sort, cut and others|
|Product:||[Fedora] Fedora||Reporter:||J Gallagher <jbgallagher2000>|
|Component:||coreutils||Assignee:||Ondrej Vasik <ovasik>|
|Status:||CLOSED RAWHIDE||QA Contact:||Fedora Extras Quality Assurance <extras-qa>|
|Version:||rawhide||CC:||i18n-bugs, kdudka, ovasik, paolini, petersen, stoty, tagoh, twaugh|
|Fixed In Version:||coreutils-8.22-8.fc21||Doc Type:||Enhancement|
|Doc Text:||Story Points:||---|
|:||538423 553570 1021403||Environment:|
|Last Closed:||2014-01-08 09:39:18 EST||Type:||---|
|Bug Depends On:|
|Bug Blocks:||553570, 1021403, 1063212|
Description J Gallagher 2009-05-05 12:16:40 EDT
Created attachment 342488
creates ~220 MB text file and parses with common tools

Description of problem:
Parsing large files with cut and grep is much slower in a UTF-8 locale than in the C locale.

Version-Release number of selected component (if applicable):
All recent releases

How reproducible:
Parse a large file with cut or grep in a UTF-8 locale, then specify LANG=C and parse it again; the speed improves dramatically.

Steps to Reproduce:
1. Run the attached script textparse.sh in a UTF-8 locale.
2. Execute LANG=C and run the script again.
3. Notice that only the cut and grep timings change significantly between the two runs (and sed a little).

Actual results:
cut and grep are dramatically faster in the C locale; other tools have the same performance (sed is a little faster).

Expected results:
Expect cut and grep to behave like other tools in UTF-8 locales.

Additional info:
Comment 1 J Gallagher 2009-05-05 12:39:18 EDT
Cutting to the chase (no pun intended), here's the cut performance in both locales:

$ LANG=C
$ time cut -f1 -d' ' file1 > file2
real 0m4.861s
user 0m1.767s
sys 0m0.615s

$ LANG=en_GB.UTF-8
$ time cut -f1 -d' ' file1 > file2
real 0m25.465s
user 0m22.607s
sys 0m0.640s

file1 is a 10,000,000 line text file generated by the script.

I can't post grep results because the particular grep word-regex is broken in F11, which I'm currently on (separate bug submitted), although the results are reproducible across all other Fedora releases.
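The comparison above can be scripted so that the locale is set for a single command rather than for the whole shell. The following is a minimal sketch, not the attached textparse.sh: `make_sample` and `cut_first_field` are hypothetical helper names, the sample data is invented, and en_GB.UTF-8 is assumed to be installed (if it is not, cut stays in the C locale and no difference will show).

```shell
#!/bin/sh
# Hypothetical helpers for comparing cut under two locales.

# make_sample FILE N: write N lines of space-separated ASCII words to FILE.
make_sample() {
    awk -v n="$2" 'BEGIN {
        for (i = 1; i <= n; i++) print "field" i, "second", "third"
    }' > "$1"
}

# cut_first_field LOCALE FILE: print the first space-separated field of each
# line, running cut with LC_ALL set to LOCALE for this one command only.
# LC_ALL takes precedence over LANG and the other LC_* variables.
cut_first_field() {
    env LC_ALL="$1" cut -f1 -d' ' "$2"
}
```

In bash, after `make_sample file1 10000000`, the two runs would be `time cut_first_field C file1 > /dev/null` and `time cut_first_field en_GB.UTF-8 file1 > /dev/null`. For pure ASCII input the output is identical in both locales, so only the timing should differ.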
Comment 2 Kamil Dudka 2009-05-06 14:18:28 EDT
Thank you for the report and the script. I can confirm your results with cut. It is caused by the i18n patch. I've spent some time profiling it. Looking into cut_fields_mb(), there is a bit of waste in calling print_kth in the loop with invariant parameters, but nothing major. It spends most of the time calling the mbrtowc() function. There is not much we can do about it within coreutils. Any idea?
Comment 3 Kamil Dudka 2009-05-06 14:56:31 EDT
Maybe converting more wide characters at once (mbsrtowcs) is the way to go, but I am just guessing now. It's a significant change and it needs to be precisely tested first...
Comment 4 J Gallagher 2009-05-07 09:32:01 EDT
You might want to check how they improved UTF-8 performance in the recent 4.2 release of GNU sed: http://freshmeat.net/projects/sed/releases/298668
Comment 5 Kamil Dudka 2009-05-07 14:19:07 EDT
Are you talking about this improvement? http://git.savannah.gnu.org/gitweb/?p=sed.git;a=commitdiff;h=3ca529fbe25706387200425b2a99012d6008f26c
Comment 6 J Gallagher 2009-05-07 20:29:45 EDT
That looks like the relevant code change, not much, is it? I wonder how much of an improvement it gives. It seems that in execute.c a test 'if (mb_cur_max > 1 && !is_utf8)' is made, so in the case of UTF-8 a loop is skipped. I'm not an expert on multibyte character processing; I just raised the issue as the performance difference is staggering.
Comment 7 Ondrej Vasik 2009-05-20 10:02:19 EDT
Thanks for the report. As the issue is confirmed, I'm changing it to assigned, although I don't want to invest time in the i18n patch. It was never accepted by upstream (not very good design, portability questions, hidden issues such as the recent segfault in join...), and upstream plans its own multibyte character support, not based on our patch. So making performance improvements to the i18n patch might be a waste of time... Maybe I'll take a look into it a bit, but I'm not sure there is an easy way to improve its performance...
Comment 8 Maurizio Paolini 2009-07-01 09:09:11 EDT
I can confirm the problem on Fedora 11, grep version 2.5.3. The problem can be reproduced as follows:

--------------------------------------
$ for n in `seq 10000`
> do
> echo "0" >>test.txt
> done
$ export LANG=en_US.UTF-8
$ time grep  pippo >/dev/null
real 0m9.102s
user 0m8.419s
sys 0m0.021s
--------------------------------------
while without utf8 the result is OK:

$ export LANG=en_US
$ time grep  pippo >/dev/null
real 0m0.018s
user 0m0.004s
sys 0m0.001s
--------------------------------------
This is a nasty bug because it impacts a lot of system scripts!

One note: the same grep command but without the '[' and ']' does not have the problem:

$ export LANG=en_US.utf8
$ time grep 0 pippo >/dev/null
real 0m0.009s
user 0m0.004s
sys 0m0.002s
Comment 9 Maurizio Paolini 2009-07-17 12:40:27 EDT
(In reply to comment #8)
There was a mistake (the file name in the grep command) in the bash commands to reproduce the bug in my previous post. Here is the corrected version:

for n in $( seq 10000 )
do
echo "0" >>test.txt
done
export LANG=en_US.UTF-8
time grep  test.txt >/dev/null
export LANG=en_US
time grep  test.txt >/dev/null
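For convenience, the corrected steps can be wrapped in two small functions. This is a sketch under assumptions: the helper names `make_zero_file` and `grep_count` are hypothetical, and the pattern is passed as a parameter because the bracket expression did not survive in the commands above. Using `'[0]'` is only a guess, based on the remark in comment #8 that the variant without '[' and ']' is fast.

```shell
#!/bin/sh
# Hypothetical helpers for the grep reproducer.

# make_zero_file FILE N: create FILE with N lines, each containing "0".
make_zero_file() {
    awk -v n="$2" 'BEGIN { for (i = 0; i < n; i++) print "0" }' > "$1"
}

# grep_count LOCALE PATTERN FILE: count matching lines with grep running
# under LOCALE. The caller should quote PATTERN so that the shell does not
# treat brackets as a glob.
grep_count() {
    env LC_ALL="$1" grep -c -e "$2" "$3"
}
```

In bash, after `make_zero_file test.txt 10000`, compare `time grep_count en_US.UTF-8 '[0]' test.txt` with `time grep_count en_US '[0]' test.txt`. Both should report the same count; only the elapsed time is of interest.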
Comment 10 Bug Zapper 2009-11-18 06:54:10 EST
This message is a reminder that Fedora 10 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 10. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '10'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 10's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 10 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Comment 11 Kamil Dudka 2009-11-18 12:23:45 EST
I've just changed the version to rawhide. There is no known solution and nobody is working on the fix right now. It will definitely not get into Fedora 10.
Comment 12 Ondrej Vasik 2009-11-18 12:36:44 EST
And as Maurizio Paolini filed rhbz #538423 against grep, we could modify the summary to be more accurate and cover only coreutils.
Comment 13 Bug Zapper 2010-03-15 08:35:26 EDT
This bug appears to have been reported against 'rawhide' during the Fedora 13 development cycle. Changing version to '13'. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Comment 14 Akira TAGOH 2011-05-24 08:34:56 EDT
Per comment #11, I've added a FutureFeature tag and moved this back to rawhide again, because auto-closing by Bug Zapper shouldn't be the expected result.
Comment 15 Ondrej Vasik 2011-05-24 09:15:48 EDT
OK, FutureFeature tracking seems reasonable. Anyway, there is not much to do at the moment; we are aware of the performance issue caused by the i18n patch. Anyone is more than welcome to propose ways to reduce the impact of the patch for common use cases.
Comment 16 István Tóth 2012-01-30 05:16:33 EST
I've run Maurizio's test case on Fedora 16, as well as my own scripts that were affected by this problem, and found that the performance regression is fixed now.

export LANG=en_US
time grep  test.txt >/dev/null
real 0m0.004s
user 0m0.002s
sys 0m0.002s

export LANG=en_US.UTF-8
time grep  test.txt >/dev/null
real 0m0.004s
user 0m0.003s
sys 0m0.001s

I have also run the attached textparse.sh, and the only significant differences between en_US and en_US.UTF-8 were in sed (5 sec vs 9 sec), cut (13 sec vs 1.5 sec), and perl (108 sec vs 44 sec).

It looks like cut is the only program in coreutils that still has a serious performance problem with UTF-8 (at least among those tested in textparse.sh).
Comment 17 Kamil Dudka 2012-01-30 05:41:44 EST
(In reply to comment #16)
> export LANG=en_US
> time grep  test.txt >/dev/null
>
> real 0m0.004s
> user 0m0.002s
> sys 0m0.002s
> export LANG=en_US.UTF-8
> time grep  test.txt >/dev/null
>
> real 0m0.004s
> user 0m0.003s
> sys 0m0.001s

You tested grep whereas this bug is about coreutils.
Comment 18 Ondrej Vasik 2014-01-08 09:39:18 EST
It turned out that cut by field is actually the only scenario that could be reasonably easily improved. Fixed by https://lists.fedoraproject.org/pipermail/scm-commits/Week-of-Mon-20140106/1168699.html - closing RAWHIDE, as we don't plan any other multibyte handling performance improvements in coreutils in the near future.

Here are the results of my testing (input file 3M4, 100k lines with 6 columns, each column with a word beginning with F):

Old package: cut -f3 mytestfile >/dev/null -> time: 0.223s
New package: cut -f3 mytestfile >/dev/null -> time: 0.029s
Old package: cut -d'F' -f3- mytestfile >/dev/null -> time: 0.264s
New package: cut -d'F' -f3- mytestfile >/dev/null -> time: 0.032s
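To make the two benchmarked commands concrete, here is a small sketch that builds a file of the same general shape and shows what each command selects. The contents of mytestfile were not posted, so the words, the tab separator between columns, and the helper name `make_f_file` are all assumptions.

```shell
#!/bin/sh
# make_f_file FILE N (hypothetical helper): write N lines of six
# tab-separated words, each word beginning with "F".
make_f_file() {
    awk -v n="$2" 'BEGIN {
        OFS = "\t"
        for (i = 1; i <= n; i++)
            print "Foo" i, "Far" i, "Fiz" i, "Fum" i, "Fee" i, "Fie" i
    }' > "$1"
}
```

With `make_f_file mytestfile 100000`, the first line of `cut -f3 mytestfile` is `Fiz1` (third tab-separated column). With `cut -d'F' -f3- mytestfile`, the letter F is the delimiter, so the field before the leading F is empty and the first output line starts at `ar1`, followed by the rest of the line unchanged.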