Bug 499220
Summary: | Coreutils i18n patch terribly affects performance with UTF-8 locales for sort, cut and others
---|---
Product: | [Fedora] Fedora
Reporter: | J Gallagher <jbgallagher2000>
Component: | coreutils
Assignee: | Ondrej Vasik <ovasik>
Status: | CLOSED RAWHIDE
QA Contact: | Fedora Extras Quality Assurance <extras-qa>
Severity: | medium
Priority: | low
Version: | rawhide
CC: | i18n-bugs, kdudka, ovasik, paolini, petersen, stoty, tagoh, twaugh
Keywords: | FutureFeature
Hardware: | All
OS: | Linux
Fixed In Version: | coreutils-8.22-8.fc21
Doc Type: | Enhancement
: | 538423 553570 1021403
Last Closed: | 2014-01-08 14:39:18 UTC
Bug Blocks: | 553570, 1021403, 1063212

Cutting to the chase (no pun intended), here's the cut performance in both locales:

$ LANG=C
$ time cut -f1 -d' ' file1 > file2
real 0m4.861s
user 0m1.767s
sys 0m0.615s

$ LANG=en_GB.UTF-8
$ time cut -f1 -d' ' file1 > file2
real 0m25.465s
user 0m22.607s
sys 0m0.640s

file1 is a 10,000,000-line text file generated by the script. I can't post grep results because the particular grep word-regex is broken in F11, which I'm currently on (although the results are reproducible across all other Fedora releases); a separate bug has been submitted.

Thank you for the report and the script. I can confirm your results with cut. It is caused by the i18n patch.

I've spent some time profiling it. Looking into cut_fields_mb(), there is a bit of waste in calling print_kth in the loop with invariant parameters, but nothing else stands out. It spends most of the time calling the mbrtowc() function. There is not much we can do about that within coreutils. Any ideas?

Maybe converting more wide characters at once (mbsrtowcs) is the way to go, but I am just guessing now. It's a significant change and it needs to be precisely tested first...

You might want to check how they improved UTF-8 performance in the recent 4.2 release of GNU sed: http://freshmeat.net/projects/sed/releases/298668

Are you talking about this improvement? http://git.savannah.gnu.org/gitweb/?p=sed.git;a=commitdiff;h=3ca529fbe25706387200425b2a99012d6008f26c

That looks like the relevant code change. Not much, is it? I wonder how much of an improvement it gives. It seems that in execute.c a test 'if (mb_cur_max > 1 && !is_utf8)' is made, so in the case of UTF-8 a loop is skipped. I'm not an expert on multibyte character processing; I just raised the issue because the performance difference is staggering.

Thanks for the report. As the issue is confirmed, I'm changing the status to assigned, although I don't want to invest time into the i18n patch. It was never accepted by upstream (not very good design, portability questions, hidden issues such as the recent segfault in join), and upstream plans its own multibyte character support, not based on our patch. So making performance improvements to the i18n patch might be a waste of time. Maybe I'll take a look into it a bit, but I'm not sure if there is an easy way to improve its performance...

I can confirm the problem on Fedora 11, grep version 2.5.3. The problem
can be reproduced as follows:
--------------------------------------
$ for n in `seq 10000`
> do
> echo "0" >>test.txt
> done
$ export LANG=en_US.UTF-8
$ time grep [0] pippo >/dev/null
real 0m9.102s
user 0m8.419s
sys 0m0.021s
--------------------------------------
while without utf8 the result is OK:
$ export LANG=en_US
$ time grep [0] pippo >/dev/null
real 0m0.018s
user 0m0.004s
sys 0m0.001s
--------------------------------------
This is a nasty bug because it impacts a lot of system scripts!
One note: the same grep command but without the '[' and ']'
does not have the problem:
$ export LANG=en_US.utf8
$ time grep 0 pippo >/dev/null
real 0m0.009s
user 0m0.004s
sys 0m0.002s
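
The comparison between the bracketed and the plain pattern can be checked for correctness as well as speed. The following is a minimal sketch, not part of the original report: the helper name `count_matches` and the `UTF8_LOCALE` variable are assumptions made for illustration, and the input file is named test.txt to match the file the loop above writes. The pattern is quoted here because an unquoted `[0]` is a glob that the shell would expand if a file named `0` existed in the current directory.

```shell
#!/bin/sh
# Build the 10000-line input used in the reproduction above.
: > test.txt
for n in $(seq 10000); do
    echo "0" >> test.txt
done

# Assumption: any installed UTF-8 locale will do; see `locale -a`.
UTF8_LOCALE=${UTF8_LOCALE:-en_US.UTF-8}

# count_matches LOCALE PATTERN (hypothetical helper):
# print the number of lines in test.txt matching PATTERN under LOCALE.
count_matches() {
    env LC_ALL="$1" grep -c -- "$2" test.txt
}

# Both patterns match the same lines; only the run time should differ.
bracket_count=$(count_matches "$UTF8_LOCALE" '[0]')
plain_count=$(count_matches "$UTF8_LOCALE" '0')
echo "bracket: $bracket_count, plain: $plain_count"
```

Prefixing the two `count_matches` calls with `time` should show whether the bracket expression is still the slow case on a given system.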
(In reply to comment #8)
There was a mistake (the file name in the grep command) in the bash commands to reproduce the bug in my previous post. Here is the corrected version:

for n in $( seq 10000 )
do
echo "0" >>test.txt
done
export LANG=en_US.UTF-8
time grep [0] test.txt >/dev/null
export LANG=en_US
time grep [0] test.txt >/dev/null

This message is a reminder that Fedora 10 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 10. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '10'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 10's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 10 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

I've just changed the version to rawhide. There is no known solution and nobody is working on the fix right now. It will definitely not get into Fedora 10. And as Maurizio Paolini filed rhbz #538423 against grep, we could modify the summary to be more accurate and cover only coreutils.

This bug appears to have been reported against 'rawhide' during the Fedora 13 development cycle. Changing version to '13'. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

According to comment #11, I've added a FutureFeature tag and am moving the bug back to rawhide again, because auto-closing by Bug Zapper shouldn't be the expected result.

OK, FutureFeature tracking seems reasonable. Anyway, there is not much to do at the moment; we are aware of the performance issue caused by the i18n patch. Anyone is more than welcome to propose ways to reduce the impact of the patch for common use cases.

I've run Maurizio's test case on Fedora 16, as well as my own scripts that were affected by this problem, and found that the performance regression is fixed now.

export LANG=en_US
time grep [0] test.txt >/dev/null
real 0m0.004s
user 0m0.002s
sys 0m0.002s

export LANG=en_US.UTF-8
time grep [0] test.txt >/dev/null
real 0m0.004s
user 0m0.003s
sys 0m0.001s

I have also run the attached textparse.sh, and the only significant differences between en_US and en_US.UTF-8 were in sed (5 sec vs 9 sec), cut (13 sec vs 1.5sec), and perl (108 sec vs 44 sec).

It looks like cut is the only program in coreutils that still has a serious performance problem with UTF-8 (at least among those tested in textparse.sh).

(In reply to comment #16)
> export LANG=en_US
> time grep [0] test.txt >/dev/null
>
> real 0m0.004s
> user 0m0.002s
> sys 0m0.002s
> export LANG=en_US.UTF-8
> time grep [0] test.txt >/dev/null
>
> real 0m0.004s
> user 0m0.003s
> sys 0m0.001s

You tested grep, whereas this bug is about coreutils.

It turned out that cutting by field is actually the only scenario that could be improved reasonably easily. Fixed by https://lists.fedoraproject.org/pipermail/scm-commits/Week-of-Mon-20140106/1168699.html

Closing as RAWHIDE, as we don't plan any other multibyte handling performance improvements in coreutils in the near future.

Here are the results of my testing (input file of 3M4, 100k lines with 6 columns, each column containing a word beginning with F):

Old package: cut -f3 mytestfile >/dev/null -> time: 0.223s
New package: cut -f3 mytestfile >/dev/null -> time: 0.029s
Old package: cut -d'F' -f3- mytestfile >/dev/null -> time: 0.264s
New package: cut -d'F' -f3- mytestfile >/dev/null -> time: 0.032s

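
An input file of the shape described in this comment could be generated roughly as follows. This is only a sketch under stated assumptions, not the file used for the numbers above: the name mytestfile is taken from the comment, while the helper name `make_testfile`, the made-up word contents, and the choice of tab as the column separator (implied by `cut -f3` being run without `-d`) are assumptions.

```shell
#!/bin/sh
# make_testfile LINES OUTFILE (hypothetical helper):
# write LINES lines of 6 tab-separated columns, each column holding
# a word that begins with "F". The word contents are invented.
make_testfile() {
    awk -v n="$1" 'BEGIN {
        for (i = 1; i <= n; i++)
            printf "Fone%d\tFtwo%d\tFthree%d\tFfour%d\tFfive%d\tFsix%d\n", i, i, i, i, i, i
    }' > "$2"
}

make_testfile 100000 mytestfile

# The two invocations measured in the comment, applied to the first line
# so the selected fields are visible.
third_column=$(head -n 1 mytestfile | cut -f3)
from_third_f=$(head -n 1 mytestfile | cut -d'F' -f3-)
echo "cut -f3        -> $third_column"
echo "cut -d'F' -f3- -> $from_third_f"
```

To compare packages, each of `cut -f3 mytestfile >/dev/null` and `cut -d'F' -f3- mytestfile >/dev/null` would then be run under `time` before and after the update. The absolute figures depend on the machine and will not match the ones quoted here.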
Created attachment 342488
creates ~220mb text file and parses with common tools

Description of problem:
Parsing large files with cut and grep is much slower in a UTF-8 locale than in the C locale.

Version-Release number of selected component (if applicable):
All recent releases

How reproducible:
Parse a large file with cut or grep in a UTF-8 locale, then set LANG=C and parse it again; the speed improves dramatically.

Steps to Reproduce:
1. Run the attached script textparse.sh in a UTF-8 locale.
2. Execute LANG=C and run the script again.
3. Notice that only the cut and grep timings differ significantly between the two runs (and sed a little).

Actual results:
cut and grep are dramatically faster in the C locale; other tools have the same performance (sed is a little faster).

Expected results:
Expect cut and grep to behave like other tools in UTF-8 locales.

Additional info:
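
The attached textparse.sh is not reproduced in this report. As a rough stand-in for the cut part of it, the sketch below runs the same field extraction once per locale and checks that the output is identical, so that any difference between the runs is purely one of speed. The file names sample.txt, out.C and out.utf8, the helper `run_in_locale`, the `UTF8_LOCALE` variable, and the sample line contents are all assumptions made for illustration.

```shell
#!/bin/sh
# Stand-in for the cut step of textparse.sh (NOT the attached script).
# Assumption: UTF8_LOCALE names an installed locale; see `locale -a`.
UTF8_LOCALE=${UTF8_LOCALE:-en_US.UTF-8}

# Generate a space-separated ASCII sample (contents are invented).
awk 'BEGIN { for (i = 1; i <= 200000; i++) print "word" i " second third" }' > sample.txt

# run_in_locale LOCALE CMD... (hypothetical helper):
# run CMD with LC_ALL set for that one command only.
run_in_locale() {
    loc=$1
    shift
    env LC_ALL="$loc" "$@"
}

# Same field extraction as in the report, once per locale.
run_in_locale C cut -f1 -d' ' sample.txt > out.C
run_in_locale "$UTF8_LOCALE" cut -f1 -d' ' sample.txt > out.utf8

# For pure ASCII input the results must not depend on the locale.
if cmp -s out.C out.utf8; then
    echo "outputs identical"
else
    echo "outputs differ"
fi
```

Placing `time` in front of each `run_in_locale` call gives the per-locale timings. Setting `LC_ALL=C` for a single command in this way is also a common workaround for scripts that only handle ASCII data, since it avoids changing the locale of the whole session.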