Bug 49449 - The textutils utility comm is broken
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: glibc
Version: 7.1
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Jakub Jelinek
QA Contact: Ben Levenson
Reported: 2001-07-19 16:37 UTC by simon
Modified: 2016-11-24 14:58 UTC
Doc Type: Bug Fix
Last Closed: 2001-07-24 19:38:21 UTC

Attachments
tmp.1 (36 bytes, text/plain)
2001-07-24 18:24 UTC, simon

Description simon 2001-07-19 16:37:47 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.77 [en] (X11; U; Linux 2.4.5 i686)

Description of problem:
The comm utility, which is part of the RH 7.1 textutils-2.0.11-7 package, is
broken. Comm is failing to correctly identify unique or common entries in
the files given.

Here is a quick example:
The output of the comm command below should be all the common entries 
contained in both tmp.1 and tmp.2.
 
comm -12 tmp.1 tmp.2

Where tmp.1 contains:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 13

and tmp.2 contains:
  1
  2
  3
  4
  5
  6
  7
  8
 10
 11
 13

The output should be (and is under Solaris and RH 6.2):
1
2
3
4
5
6
7
8
10
11
13

But RH 7.1 comm is returning:
1
2
3
4
5
6
7
8

I.e. nothing is output after the first unique entry in tmp.1 is found. This
is only one example; comm is also failing in many other modes.

How reproducible:
Always

Steps to Reproduce:
1. Use comm with the two tmp files included above

Additional info:

Comment 1 Bernhard Rosenkraenzer 2001-07-24 16:55:39 UTC
This seems to be a glibc issue - still happens after recompiling textutils 
packages from 6.x on 7.x systems...
Jakub, any idea? (Might be one of the locale changes)


Comment 2 Ben Levenson 2001-07-24 18:02:43 UTC
AFAICT, this is the correct behavior.  According to the comm man page, it
does a *line-by-line* comparison.  The two files provided in your bug report
are not common beyond the eighth line -- so 10, 11, and 13 should not be
returned by "$ comm -12 tmp.1 tmp.2".

FWIW, I just verified the same behaviour on a 6.2 system:

$ rpm -q textutils
textutils-2.0e-6
$ cat /etc/redhat-release
Red Hat Linux release 6.2 (Zoot)
$ comm -12 tmp.1 tmp.2
1
2
3
4
5
6
7
8

I even downgraded the version on the 6.2 system to 2.0a-2 (the version
originally shipped w/ the distribution) and the results were the same.
Does this make sense?


Comment 3 simon 2001-07-24 18:24:13 UTC
Created attachment 24767 [details]
tmp.1

Comment 4 simon 2001-07-24 18:55:44 UTC
OK - I could attach tmp.1 but not tmp.2; there is some weirdness in Bugzilla. To
reproduce my problem you need to use the files as given, i.e. keep the right
justification of the integers. Then you will see that comm in RH7.1 gives
different results from RH6.2, Solaris2.6 and HP-UX11. My reading of the comm
documentation is that comm -12 should return the lines common to both files (I
am not a comm expert :-)). Anyway, my concern is that Linux utilities should not
change their behaviour from one release to the next unless bugs are fixed. If
this is not the case, I will need to start testing the >1000 scripts I use on a
regular basis each time I upgrade any of our RH Linux packages. If this is a
bug fix then so be it; I will be surprised and disappointed, but the question
remains: why is the behaviour different from every other *NIX I could test?

Comment 5 Ben Levenson 2001-07-24 19:38:15 UTC
You are exactly right. When I used right-justification, version 2.0a-2 did 
return different results than the newer versions....
I noticed that when I sorted the files prior to running comm, it correctly
identified the common entries in the files regardless of the justification
of the contents.
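
A minimal sketch of the sort-then-compare approach described above. It assumes the inputs are right-justified to a width of 2 (consistent with the 36-byte size of the tmp.1 attachment); the scratch directory and the `.sorted` file names are illustrative and not from the report.

```shell
# Work in a scratch directory (hypothetical location, chosen for this sketch).
dir=$(mktemp -d)
cd "$dir" || exit 1

# Recreate the right-justified inputs from the description (assumed width 2).
printf '%2d\n' 1 2 3 4 5 6 7 8 9 10 11 13 > tmp.1
printf '%2d\n' 1 2 3 4 5 6 7 8 10 11 13 > tmp.2

# Sort both files in whatever locale is active, then run comm in that
# same locale, so the collating sequence used by sort and comm matches.
sort tmp.1 > tmp.1.sorted
sort tmp.2 > tmp.2.sorted
common=$(comm -12 tmp.1.sorted tmp.2.sorted)

# Print the lines common to both files.
printf '%s\n' "$common"
```

The order of the output lines depends on the active locale, but the set of common lines should be the same either way, because sort and comm agree on the collating sequence.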


Comment 6 Jakub Jelinek 2001-07-24 20:35:42 UTC
comm behaviour in 7.1 is correct.
The info page for comm says:
   Before `comm' can be used, the input files must be sorted using the
collating sequence specified by the `LC_COLLATE' locale.  If an input
file ends in a non-newline character, a newline is silently appended.
The `sort' command with no options always outputs a file that is
suitable input to `comm'.

Your example files are sorted using the "C" collating sequence;
for most other collating sequences they are unsorted.
Just check what sort will do with your input files to see...
Running LC_COLLATE=C comm -12 tmp.1 tmp.2
will give you the results you expect.
E.g. with LC_COLLATE=en_US (or LC_ALL=en_US or LANG=en_US if neither
is set), sorting the files will give 10 after 1, followed by 11, 13,
2, etc. So, if comm is run with such an LC_COLLATE, the input should be
sorted that way.
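
The check suggested above can be sketched as follows. This is an illustration only: it assumes the inputs are right-justified to a width of 2 (consistent with the 36-byte tmp.1 attachment), and it sets LC_ALL rather than LC_COLLATE because LC_ALL takes precedence when both are present in the environment. The variable names are hypothetical.

```shell
# Scratch directory and inputs (assumed width 2, see the note above).
dir=$(mktemp -d)
cd "$dir" || exit 1
printf '%2d\n' 1 2 3 4 5 6 7 8 9 10 11 13 > tmp.1
printf '%2d\n' 1 2 3 4 5 6 7 8 10 11 13 > tmp.2

# sort -c exits 0 only if the file is already sorted in the given locale.
if LC_ALL=C sort -c tmp.1 && LC_ALL=C sort -c tmp.2; then
    sorted_in_c=yes
else
    sorted_in_c=no
fi
echo "sorted in C locale: $sorted_in_c"

# Run comm with the "C" collating sequence forced for this one command.
common_c=$(LC_ALL=C comm -12 tmp.1 tmp.2)
printf '%s\n' "$common_c"
```

Running the same `sort -c` check with another locale in place of `C` shows whether the files count as sorted there; which locales are installed varies by system.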

