177246 – Appalling Performance with UTF-8

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	sed
Sub Component:
Version:	4
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Petr Machata
QA Contact:	Ben Levenson
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2006-01-08 07:16 UTC by JW
Modified:	2015-05-05 01:32 UTC
CC List:	2 users
Fixed In Version:	FC-4
Clone Of:
Environment:
Last Closed:	2006-08-03 14:00:26 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
fix for this problem (817 bytes, patch) 2006-08-02 11:13 UTC, Petr Machata
correct patch (2.39 KB, patch) 2006-08-03 12:58 UTC, Paolo Bonzini

Description JW 2006-01-08 07:16:33 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (compatible; MSIE 6.0; Windows; U; AIIEEEE!; Win98; Windows 98; en-US; Gecko masquerading as IE; should it matter?; rv:1.8b) Gecko/20050217

Description of problem:
When sed runs in a UTF-8 locale, it is more than five times slower than in a single-byte locale (2.72 secs versus 0.48 secs in the timings below).


Version-Release number of selected component (if applicable):
sed-4.1.4-1

How reproducible:
Always

Steps to Reproduce:
1. LANG=en_US.ISO-8859-1 time sed -n '/XXXXX/p' alargetextfile >/dev/null
2. LANG=en_US.UTF-8 time sed -n '/XXXXX/p' alargetextfile >/dev/null

Actual Results:
1. 0.48 secs
2. 2.72 secs


Expected Results:
1. 0.48 secs
2. 0.50 secs


Additional info:

You must ensure that the kernel has cached the text file before timing. Run each command above twice to be sure.

Comment 1 Petr Machata 2006-08-02 11:13:45 UTC

Created attachment 133474: fix for this problem

In str_append, sed does some extra processing of UTF-8 strings. It looks like
(a) there is a bug in that code, in that the first character is processed again
and again, and (b) even if there weren't, the processing has no actual outcome.

Taking this code out restores the lost performance. The test suite passes.
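
For illustration only, here is a minimal C sketch of the kind of loop being described: walking a buffer with mbrlen so that a conversion state is kept up to date. It is not sed's actual str_append code; the function name scan_update_mbstate and its parameters are hypothetical. The comment in the loop marks one way a "first character processed again and again" bug can arise, namely by always passing the start of the buffer instead of the current position.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not taken from sed: walk `len` bytes of `buf`,
   updating the conversion state `*st`, and return the number of
   characters (or skipped bytes) seen. */
size_t scan_update_mbstate(const char *buf, size_t len, mbstate_t *st)
{
    size_t off = 0, nchars = 0;

    assert(buf != NULL && st != NULL);
    while (off < len) {
        /* Pass buf + off, not buf: always passing buf would re-examine
           the first character on every iteration. */
        size_t n = mbrlen(buf + off, len - off, st);

        if (n == (size_t) -2)       /* incomplete character at the end; */
            break;                  /* the partial state stays in *st   */
        if (n == (size_t) -1) {     /* invalid sequence: reset, skip a byte */
            memset(st, 0, sizeof *st);
            n = 1;
        } else if (n == 0) {        /* embedded NUL occupies one byte */
            n = 1;
        }
        off += n;
        nchars++;
    }
    return nchars;
}
```

A caller would zero-initialize an mbstate_t once and pass the same object for each appended chunk, so that a character split across two chunks is still recognized.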

Comment 2 Paolo Bonzini 2006-08-03 06:40:34 UTC

The result of the code is to update to->mbstate.  This is indeed a no-op for
UTF-8, but the code needs to be there for other encodings.  It's OK with me to
fix the bug that Petr reported and to make the code conditional on a multibyte,
non-UTF-8 encoding (rather than on a multibyte encoding only).
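
As a rough sketch of what such a condition could look like (this is not the attached patch), the check below uses MB_CUR_MAX to detect a multibyte locale and nl_langinfo(CODESET) to rule out UTF-8. The function names and the accepted spellings of the codeset name are assumptions made for this example.

```c
#include <assert.h>
#include <langinfo.h>
#include <stdlib.h>
#include <strings.h>

/* Hypothetical predicate: should appended text be scanned to keep the
   conversion state current? */
int needs_mbstate_scan(size_t mb_cur_max, const char *codeset)
{
    assert(codeset != NULL);
    if (mb_cur_max <= 1)
        return 0;   /* single-byte locale: nothing to track */
    if (strcasecmp(codeset, "UTF-8") == 0 || strcasecmp(codeset, "UTF8") == 0)
        return 0;   /* UTF-8: the scan is a no-op, so skip it */
    return 1;       /* other multibyte encoding: keep the scan */
}

/* Apply the predicate to the current locale, which the program is
   assumed to have selected earlier with setlocale(LC_CTYPE, ""). */
int current_locale_needs_mbstate_scan(void)
{
    return needs_mbstate_scan(MB_CUR_MAX, nl_langinfo(CODESET));
}
```

Evaluating the predicate once at startup, rather than on every append, keeps the common UTF-8 and single-byte paths free of per-character work.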

Comment 3 Paolo Bonzini 2006-08-03 12:58:41 UTC

Created attachment 133551: correct patch

This patch disables the code in question only for UTF-8, and fixes the bug for
other multibyte locales.

Comment 4 Petr Machata 2006-08-03 14:00:26 UTC

Thanks. I applied the patch verbatim and did a build; it should be available
shortly.
