From Bugzilla Helper:
User-Agent: Mozilla/5.0 (compatible; MSIE 6.0; Windows; U; AIIEEEE!; Win98; Windows 98; en-US; Gecko masquerading as IE; should it matter?; rv:1.8b) Gecko/20050217

Description of problem:
When sed runs under a UTF-8 (Unicode) locale, it is more than five times slower than under a single-byte locale.

Version-Release number of selected component (if applicable):
sed-4.1.4-1

How reproducible:
Always

Steps to Reproduce:
1. LANG=en_US.ISO-8859-1 time sed -n '/XXXXX/p' alargetextfile >/dev/null
2. LANG=en_US.UTF-8 time sed -n '/XXXXX/p' alargetextfile >/dev/null

Actual Results:
1. 0.48 secs
2. 2.72 secs

Expected Results:
1. 0.48 secs
2. 0.50 secs

Additional info:
You must ensure that the kernel has cached the text file before timing. Run each command above twice to be sure.
Created attachment 133474
fix for this problem

In str_append, sed does some extra processing of UTF-8 strings. It looks like a) there is a bug in that the first character is processed again and again, and b) even if there weren't, the processing has no actual outcome. Taking this code out restores the lost performance. The testsuite passes.
The result of the code is to update to->mbstate. This is indeed a no-op for UTF-8, but the code needs to be there for other encodings. It's OK with me to fix the bug that Petr reported and to make the code conditional on a multibyte, non-UTF-8 encoding (rather than on a multibyte encoding only).
Created attachment 133551
correct patch

This patch only disables the code in question for UTF-8, and fixes the bug for other multi-byte locales.
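For readers without the attachment, here is a minimal sketch of the approach discussed above. It is not the patch itself. The names line_state, codeset_needs_tracking, update_mbstate and note_appended are hypothetical, and the sketch assumes the "processed again and again" bug came from a scan loop that did not advance its pointer, which is only one plausible reading of the description.

```c
#include <assert.h>
#include <langinfo.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-in for sed's buffer structure; only the
   conversion state matters for this sketch. */
struct line_state {
  mbstate_t mbstate;
};

/* Shift-state tracking is only useful in a multibyte locale whose
   encoding is not UTF-8 (UTF-8 has no shift states). */
static bool codeset_needs_tracking(size_t mb_cur_max, const char *codeset)
{
  if (mb_cur_max <= 1)
    return false;
  return strcmp(codeset, "UTF-8") != 0;
}

/* Walk the n bytes at s, updating st->mbstate as we go.
   Returns the number of characters stepped over. */
static size_t update_mbstate(struct line_state *st, const char *s, size_t n)
{
  size_t chars = 0;

  assert(st != NULL);
  while (n > 0) {
    size_t len = mbrtowc(NULL, s, n, &st->mbstate);

    if (len == (size_t)-2)
      break;                      /* incomplete character: state keeps it */
    if (len == (size_t)-1) {      /* invalid sequence: reset, skip a byte */
      memset(&st->mbstate, 0, sizeof st->mbstate);
      len = 1;
    } else if (len == 0) {
      len = 1;                    /* embedded NUL occupies one byte */
    }
    s += len;                     /* advance, so each character is seen once */
    n -= len;
    chars++;
  }
  return chars;
}

/* Called after bytes are appended: skip the scan where it is a no-op. */
static void note_appended(struct line_state *st, const char *s, size_t n)
{
  if (codeset_needs_tracking(MB_CUR_MAX, nl_langinfo(CODESET)))
    update_mbstate(st, s, n);
}
```

With this shape, a UTF-8 or single-byte locale pays only for the cheap check in note_appended, while encodings with shift states still get their conversion state updated once per appended character.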
Thanks. I applied the patch verbatim and did a build; it should be available shortly.