Sorting a file of the following field separted by 2 spaces: RMD160 hash file size (bytes) signed integer value filename full file path The file is 1236636 bytes long, 14391 lines. Sorting on the 1st and 3rd fields with: sort -k1,1n k3,3n results in 100% CPU with no result. Sorting same file with: sort results in sorted file. Specifing the separator (this should be necessary): sort -k1,1n k3,3n -t ' ' works correctly. Specifying the buffer size with -S or setting temporary file area with -T does not solve the problem. Environment has: LANG=en_US.UTF-8 unsetting this for the default and sort -k1,1n k3,3n works as expected. System is up-to-date with latest RPMS. Version-Release number of selected component (if applicable): coreutils-4.5.3-19 How reproducible: Always Steps to Reproduce: 1. see above
Please attach a script to generate input of the type you are seeing the problem with. I suspect that it will actually finish if you leave it long enough; in UTF-8 it has a lot more to do than in C locale. The rawhide coreutils package has several efficiency fixes.
"I suspect that it will actually finish if you leave it long enough; in UTF-8 it has a lot more to do than in C locale." That sounds a lot like a GCC worst-case optimizer bug I used to run into a lot. Accordin to the maintainer the compiler would eventually spit out correct code, but not in our lifetime :P FWIW, running on a dual CPU P3-450. Pegs one CPU. Started investigating after waiting an inoridnate amount of time (several minutes - should only take a few seconds). "Please attach a script to generate input of the type you are seeing the problem with." I'm weeding out duplicate mp3 files and choosing the best file based on a score. The evaluation process is (modified slightly - I use rmd160 hash): find path_to_files -type f | xargs md5sum | score | sort -k1,1n -k3,3nr > $$.1 uniq -w 32 $$.1 > $$.2 diff -u $$.1 $$.2 | sed -e '/^-/!d;1d;' | awk '{print $5;}' | xargs rm -vf Someday the score will include a test for mp3 tags, artist names, etc., but for now it is only based on the filename (more "descriptive" = higher score: #! /usr/bin/perl while (<>) { chomp $_; $score = 0; @line = split /[ ]+/; $name = $line[2]; $name =~ m#[^/]*$#; $name = $&; for $char ($name =~ m/./g) { $score += 5 if ($char =~ m/[a-zA-Z]/); $score -= 5 if ($char =~ m/[0-9]/); $score -= 1 if ($char =~ m/\-/); } print "$line[0] $line[1] $score $name $line[2]\n"; } I recently did a fresh install of Red Hat 9. I never had a problem with this before.
Really I was asking for a script that would generate a file I could test with, so I could see the problem for myself. Please try the coreutils package from rawhide and see if it's any better. (That was fixed as bug #82032.)
This will generate a data semi-random file you can sort: #! /bin/bash i=1 md5=x while [ $i -lt 15000 ] ; do echo $md5 > $$ sum=`md5sum $$ | sed -e 's/ .*//;'` md5=$sum echo "$sum $i $i x /somepath/x" i=$[$i+1] done rm -vf $$
Have you tried the rawhide package yet?
Tested coreutils-5.0-4.i386.rpm, and it appears to work now. Thanks. Marking as RAWHIDE. FWIW, Adobe Acrobat Reader linux-ar-506.tar.gz (yes, I know, a third party product) doesn't like LANG="en_US.UTF-8" either; it segfaults on startup. I'm told that LANG="en_US" does work though (I can't test this at moment).
Could you backport that fix to Red Hat 9 and put it on the updates? I've just spent a whole hour waiting for Matlab to finish installing, when it ought to have been done in 10 seconds. It was sorting the list of installed files (109,289 of them). FYI here's a simple test. $ find /usr/share/doc | sed 's/^/x /' > /tmp/files $ wc -l /tmp/files 19618 /tmp/files $ LANG=en_GB.UTF-8 time sort +1 -2 /tmp/files >/dev/null 286.88user 0.31system 4:47.37elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (165major+465minor)pagefaults 0swaps That was on a 3GHz (dual) Pentium 4 running RedHat 9 with all updates. Now, copying the exact same test file to an 866MHz Pentium 3 running Red Hat 8.0 gave this: $ LANG=en_GB.UTF-8 time sort +1 -2 /tmp/files >/dev/null 1.30user 0.00system 0:01.31elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (140major+464minor)pagefaults 0swaps What happened here?!