90279 – sort is very inefficient with UTF-8

Bug 90279 - sort is very inefficient with UTF-8

Summary: sort is very inefficient with UTF-8

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	coreutils
Sub Component:
Version:	9
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Tim Waugh
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2003-05-06 12:52 UTC by Andrew E. Mileski
Modified:	2007-04-18 16:53 UTC (History)
CC List:	1 user (show)
Fixed In Version:	5.0-4
Clone Of:
Environment:
Last Closed:	2004-10-11 09:09:51 UTC
Embargoed:

Attachments	(Terms of Use)

Description Andrew E. Mileski 2003-05-06 12:52:12 UTC

Sorting a file of the following field separted by 2 spaces:
  RMD160 hash
  file size (bytes)
  signed integer value
  filename
  full file path
The file is 1236636 bytes long, 14391 lines.

Sorting on the 1st and 3rd fields with:
  sort -k1,1n k3,3n
results in 100% CPU with no result.

Sorting same file with:
  sort
results in sorted file.

Specifing the separator (this should be necessary):
    sort -k1,1n k3,3n -t ' '
works correctly.

Specifying the buffer size with -S or setting temporary file area with  -T does
not solve the problem.

Environment has:
  LANG=en_US.UTF-8
unsetting this for the default and
    sort -k1,1n k3,3n
works as expected.

System is up-to-date with latest RPMS.

Version-Release number of selected component (if applicable):
  coreutils-4.5.3-19

How reproducible:
Always

Steps to Reproduce:
1. see above

Comment 1 Tim Waugh 2003-05-06 13:17:15 UTC

Please attach a script to generate input of the type you are seeing the problem
with.

I suspect that it will actually finish if you leave it long enough; in UTF-8 it
has a lot more to do than in C locale.  The rawhide coreutils package has
several efficiency fixes.

Comment 2 Andrew E. Mileski 2003-05-06 14:46:58 UTC

"I suspect that it will actually finish if you leave it long enough; in UTF-8 it
has a lot more to do than in C locale."

That sounds a lot like a GCC worst-case optimizer bug I used to run into a lot.
 Accordin to the maintainer the compiler would eventually spit out correct code,
but not in our lifetime :P

FWIW, running on a dual CPU P3-450.  Pegs one CPU.  Started investigating after
waiting an inoridnate amount of time (several minutes - should only take a few
seconds).

"Please attach a script to generate input of the type you are seeing the problem
with."

I'm weeding out duplicate mp3 files and choosing the best file based on a score.
 The evaluation process is (modified slightly - I use rmd160 hash):
  find path_to_files -type f | xargs md5sum | score | sort -k1,1n -k3,3nr > $$.1
  uniq -w 32 $$.1 > $$.2
  diff -u $$.1 $$.2 | sed -e '/^-/!d;1d;' | awk '{print $5;}' | xargs rm -vf

Someday the score will include a test for mp3 tags, artist names, etc., but for
now it is only based on the filename (more "descriptive" = higher score:
#! /usr/bin/perl
while (<>) {
        chomp $_;
        $score = 0;
        @line = split /[        ]+/;
        $name = $line[2];
        $name =~ m#[^/]*$#;
        $name = $&;
        for $char ($name =~ m/./g) {
                $score += 5 if ($char =~ m/[a-zA-Z]/);
                $score -= 5 if ($char =~ m/[0-9]/);
                $score -= 1 if ($char =~ m/\-/);
        }
        print "$line[0]  $line[1]  $score $name $line[2]\n";
}

I recently did a fresh install of Red Hat 9.  I never had a problem with this
before.

Comment 3 Tim Waugh 2003-05-06 14:52:16 UTC

Really I was asking for a script that would generate a file I could test with,
so I could see the problem for myself.

Please try the coreutils package from rawhide and see if it's any better. (That
was fixed as bug #82032.)

Comment 4 Andrew E. Mileski 2003-05-06 15:12:15 UTC

This will generate a data semi-random file you can sort:

#! /bin/bash
i=1
md5=x
while [ $i -lt 15000 ] ; do
        echo $md5 > $$
        sum=`md5sum $$ | sed -e 's/ .*//;'`
        md5=$sum
        echo "$sum $i $i x /somepath/x"
        i=$[$i+1]
done
rm -vf $$

Comment 5 Tim Waugh 2003-05-28 09:07:38 UTC

Have you tried the rawhide package yet?

Comment 6 Andrew E. Mileski 2003-05-28 13:01:45 UTC

Tested coreutils-5.0-4.i386.rpm, and it appears to work now.  Thanks.  Marking
as RAWHIDE.

FWIW, Adobe Acrobat Reader linux-ar-506.tar.gz (yes, I know, a third party
product) doesn't like LANG="en_US.UTF-8" either; it segfaults on startup.  I'm
told that LANG="en_US" does work though (I can't test this at moment).

Comment 7 Ian Collier 2003-10-24 13:39:37 UTC

Could you backport that fix to Red Hat 9 and put it on the updates?
I've just spent a whole hour waiting for Matlab to finish installing, when it
ought to have been done in 10 seconds.  It was sorting the list of installed
files (109,289 of them).

FYI here's a simple test.

$ find /usr/share/doc | sed 's/^/x /' > /tmp/files
$ wc -l /tmp/files
  19618 /tmp/files
$ LANG=en_GB.UTF-8 time sort +1 -2 /tmp/files >/dev/null
286.88user 0.31system 4:47.37elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (165major+465minor)pagefaults 0swaps

That was on a 3GHz (dual) Pentium 4 running RedHat 9 with all updates.  Now,
copying the exact same test file to an 866MHz Pentium 3 running Red Hat 8.0
gave this:

$ LANG=en_GB.UTF-8 time sort +1 -2 /tmp/files >/dev/null
1.30user 0.00system 0:01.31elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (140major+464minor)pagefaults 0swaps

What happened here?!

Note You need to log in before you can comment on or make changes to this bug.