Bug 98119

Summary: cat <file> | sort -u > <file2> drops some accented words
Product: [Retired] Red Hat Linux
Reporter: hotmail <luisuebel>
Component: coreutils
Assignee: Tim Waugh <twaugh>
Status: CLOSED WORKSFORME
QA Contact: Mike McLean <mikem>
Severity: medium
Priority: medium    
Version: 9   
Hardware: i686   
OS: Linux   
Last Closed: 2005-09-08 15:29:04 UTC

Description hotmail 2003-06-26 20:26:57 UTC
Description of problem:
There is a problem with sort in Red Hat 9.0 that doesn't happen with Red Hat
7.2. In Brazilian Portuguese (I have only seen the problem with this language),
sort removes some accented words.
I used the following command:
cat <file> | sort -u

Some accented words disappear from the output. I tried the same command with
the same <file> on a machine running Red Hat 7.2 and the problem does not
occur. The <file> is 4 MB and has around 53 thousand unique words. I cannot
send the original file or the results because it is an internal document from
my company.
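
One way to see exactly which words go missing is to compare the output under the session locale with a byte-wise reference run. This is only a sketch, assuming GNU coreutils: `words.txt` and its contents are hypothetical stand-ins for <file>, and it does not diagnose the cause of this bug. It only lists lines that `sort -u` treats as duplicates under the current locale but that differ byte for byte.

```shell
# words.txt is a hypothetical stand-in for the reporter's <file>.
printf '%s\n' junio júnio abril abril março > words.txt

# Reference run: byte-wise comparison, so only identical lines are duplicates.
LC_ALL=C sort -u words.txt > ref.txt

# Run under whatever locale the session is currently using.
sort -u words.txt > cur.txt

# comm expects input sorted in the collation it runs under,
# so re-sort the locale output byte-wise before comparing.
LC_ALL=C sort cur.txt > cur_c.txt

# Lines present in the reference but absent from the locale-dependent output.
LC_ALL=C comm -23 ref.txt cur_c.txt > missing.txt

echo "unique lines (byte-wise): $(wc -l < ref.txt)"
echo "lines dropped under the current locale:"
cat missing.txt
```

An empty `missing.txt` means both runs agree for that input; any line listed there is a candidate for a small test case.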


Version-Release number of selected component (if applicable):
Red Hat 9.0

How reproducible:
Always.

Steps to Reproduce:
1. cat <file> | sort -u > <file2>

Actual results:
<without> j´unio (I cannot type the accent properly here)

Expected results:
<with> j´unio

Additional info:

Comment 1 Tim Waugh 2003-06-27 08:30:32 UTC
Could you send me a minimal test case (or provide a pointer to one) that
demonstrates the problem?  Perhaps obscuring the words with "tr '[a-z]' x" would
help?

Also what locale are you using?  What does 'locale' say?
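
The two requests above could be carried out roughly as follows. This is a sketch, assuming GNU `tr`; `sample.txt` and its contents are hypothetical, and the range is written as plain `a-z` rather than the bracketed form quoted above.

```shell
# sample.txt is a hypothetical stand-in for the confidential file.
printf '%s\n' 'júnio relatório' 'junio' > sample.txt

# Replace every ASCII lowercase letter with "x".
# Accented (non-ASCII) bytes pass through unchanged under LC_ALL=C.
LC_ALL=C tr 'a-z' 'x' < sample.txt > obscured.txt
cat obscured.txt

# Record the locale settings in effect for the sort run.
locale 2>/dev/null || echo "locale command not available"
```

Mapping every letter to the same character makes many distinct words identical, so the obscured file has to be re-checked to confirm that it still shows the problem before it is attached.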

Comment 2 hotmail 2003-06-30 21:43:40 UTC
I am trying to find a minimal file that shows this error. I really cannot send
you the original file.
The problem seems related to very large files. The original file is 8 MB, with
1.3 million words and 65K unique words. I couldn't reproduce the problem with a
smaller version of the file.

I noticed that Red Hat 9.0 and Red Hat 7.2 both have bugs in this case, but they
are different bugs. In Red Hat 7.2, a couple of unaccented words are missing,
but in Red Hat 9.0, accented words are missing. I cannot reproduce this error
with a small file.

I don't know if you can arrange a very big text file to test this.
Unfortunately, I really cannot send you the file.


Luis
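
A large synthetic file of roughly the size described in this comment could be built and checked along these lines. This is a sketch only: the generated words are invented, not taken from the reporter's data, and nothing here shows that such a file triggers the bug. The distinct-line count from `sort -u` is compared with a count from an awk hash, which matches exact strings and does not depend on collation.

```shell
# Build a hypothetical word list: 65,000 distinct words, half of them
# containing an accented character, each repeated 20 times (1.3 million lines).
awk 'BEGIN {
    for (r = 0; r < 20; r++)
        for (i = 0; i < 65000; i++)
            printf "%s%d\n", (i % 2 ? "júnio" : "junio"), i
}' > big.txt

# Distinct lines according to sort -u (depends on the session locale).
sorted_count=$(sort -u big.txt | wc -l)

# Distinct lines according to exact string matching in awk.
exact_count=$(awk '!seen[$0]++' big.txt | wc -l)

echo "sort -u: $sorted_count  exact: $exact_count"
```

If the two counts differ, the file can be attached as a test case in place of the confidential one.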

Comment 3 Tim Waugh 2003-07-07 11:58:42 UTC
Need a test case before I can analyse the problem. :-/