Bug 98119 - cat <file> | sort -u > <file2>, without some words with accent
Summary: cat <file> | sort -u > <file2>, without some words with accent
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: coreutils
Version: 9
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Tim Waugh
QA Contact: Mike McLean
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2003-06-26 20:26 UTC by hotmail
Modified: 2007-04-18 16:55 UTC (History)
0 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-09-08 15:29:04 UTC
Embargoed:


Description hotmail 2003-06-26 20:26:57 UTC
Description of problem:
There is a problem with sort in Red Hat 9.0 that doesn't happen with Red Hat
7.2. With Brazilian Portuguese text (the only language in which I have seen the
problem), sort removes some accented words.
I used the following command:
cat <file> | sort -u

Some accented words disappear from the output. I tried the same command
with the same <file> on a machine running Red Hat 7.2 and the problem doesn't
occur. The <file> is 4 Mbytes and has around 53 thousand unique words. I
cannot send the original file or the results because it is an internal document
from my company.


Version-Release number of selected component (if applicable):
Red Hat 9.0

How reproducible:
Always.

Steps to Reproduce:
1. cat <file> | sort -u
Actual results:
<without> j´unio (I cannot type the accent properly here)

Expected results:
<with> j´unio

Additional info:

Comment 1 Tim Waugh 2003-06-27 08:30:32 UTC
Could you send me a minimal test case (or provide a pointer to one) that
demonstrates the problem?  Perhaps obscuring the words with "tr '[a-z]' x" would
help?

Also what locale are you using?  What does 'locale' say?
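
One possible way to act on both requests is sketched below. The helper names obscure and count_unique are hypothetical, and sed is used here as a stand-in for the tr suggestion so that only ASCII letters are replaced and accented bytes survive. Comparing the unique-line count under the C locale (plain byte comparison) with the count under the reporter's locale may show whether collation is treating distinct lines as equal; this is an assumption about where to look, not a diagnosis.

```shell
#!/bin/sh
# Hypothetical helper: replace ASCII letters with "x" on stdin, leaving every
# other byte (including the bytes of accented characters) untouched.
# LC_ALL=C makes sed work byte by byte, so [a-zA-Z] matches ASCII only.
obscure() {
    LC_ALL=C sed 's/[a-zA-Z]/x/g'
}

# Hypothetical helper: count the lines "sort -u" keeps under a given locale.
#   $1 = locale name (for example C), $2 = input file
count_unique() {
    LC_ALL="$1" sort -u "$2" | wc -l
}

# Example use (file and locale names are placeholders, not from the report):
#   locale                               # show the active locale settings
#   obscure < words.txt > obscured.txt   # shareable, obscured copy
#   count_unique C words.txt             # byte-wise reference count
#   count_unique pt_BR words.txt         # count under the suspect locale
```

If the two counts differ for the same file, the lines that vanish under the non-C locale would be the natural candidates for a minimal test case.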

Comment 2 hotmail 2003-06-30 21:43:40 UTC
I am trying to find a minimal file that shows this error. I really cannot send
you the original file.
The problem seems to be related to very large files. The original file is
8 Mbytes, with 1.3M words and 65K unique words. I couldn't reproduce the problem
with a smaller version of the file.

I notice that both Red Hat 9.0 and Red Hat 7.2 have bugs in this case, but they
are different bugs. In Red Hat 7.2, a couple of unaccented words are missing,
but in Red Hat 9.0, accented words are missing. I cannot reproduce this
error with a small file.

I don't know if you can arrange a very big text file to test this.
Unfortunately, I really cannot send you the file.


Luis
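
Since the original file cannot be shared, one way to arrange a very big test file might be to repeat a small, shareable word list many times and then check whether every seed word survives sort -u. The sketch below assumes this approach; make_big and missing_after_sort are hypothetical helper names, and nothing here has been confirmed to reproduce the bug.

```shell
#!/bin/sh
# Hypothetical helper: write the seed file's lines to stdout N times over,
# producing a large input with a known set of unique lines.
#   $1 = seed file, $2 = number of repetitions
make_big() {
    awk -v n="$2" '
        { lines[NR] = $0 }
        END {
            for (i = 0; i < n; i++)
                for (j = 1; j <= NR; j++)
                    print lines[j]
        }' "$1"
}

# Hypothetical check: run "sort -u" on the big file (result is kept in
# <big file>.sorted) and print every seed line that is absent from it.
#   $1 = seed file, $2 = big file
missing_after_sort() {
    sort -u "$2" > "$2.sorted"
    # -F fixed strings, -x whole-line match, -v keep seed lines with no match
    grep -v -x -F -f "$2.sorted" "$1"
}

# Example use (file names are placeholders):
#   make_big seed.txt 100000 > big.txt
#   missing_after_sort seed.txt big.txt   # no output means nothing was lost
```

Any word printed by the second helper was present in the input but dropped from the sorted output, which would make the seed list plus the repetition count a shareable test case.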

Comment 3 Tim Waugh 2003-07-07 11:58:42 UTC
Need a test case before I can analyse the problem. :-/

