Bug 1462372 - basic functions skip over diacritical characters
Status: CLOSED CANTFIX
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: gawk
Version: 7.3
Hardware: x86_64 Linux
Priority: low
Severity: high
Assigned To: David Kaspar [Dee'Kej]
QA Contact: BaseOS QE - Apps
Reported: 2017-06-16 17:58 EDT by Marc Beaudoin
Modified: 2017-07-10 10:11 EDT
Last Closed: 2017-07-10 10:11:34 EDT
Type: Bug
Attachments:
A one-line text file with 5 characters. The last character is "é", an accented character that is ignored by length and substr. (6 bytes, application/octet-stream)
2017-06-16 17:58 EDT, Marc Beaudoin
Stripped-down version of the text file I am trying to process. It contains only 5 characters (unité), and awk does not process them properly. (6 bytes, text/plain)
2017-06-16 19:06 EDT, Marc Beaudoin

Description Marc Beaudoin 2017-06-16 17:58:27 EDT
Created attachment 1288477
A one-line text file with 5 characters.  The last character is "é", an accented character that is ignored by length and substr.

Description of problem:
Functions like length, match and substr seem to skip over accented characters.  On a line with 41 characters, length returns 40.

sed -n -e '61p'  /RMan_Listings/sxdgbd0123.cogcontd.20170616133225.20170616133254.log
canal Tape : SID=13 type d'unit▒=SBT_TAPE

sed -n -e '61p'  /RMan_Listings/sxdgbd0123.cogcontd.20170616133225.20170616133254.log | od -b
0000000 143 141 156 141 154 040 124 141 160 145 040 072 040 123 111 104
0000020 075 061 063 040 164 171 160 145 040 144 047 165 156 151 164 351
0000040 075 123 102 124 137 124 101 120 105 012
0000052

sed -n -e '61p'  /RMan_Listings/sxdgbd0123.cogcontd.20170616133225.20170616133254.log | wc -c
42

sed -n -e '61p'  /RMan_Listings/sxdgbd0123.cogcontd.20170616133225.20170616133254.log | awk '{ print length }'
40

sed -n -e '61p'  /RMan_Listings/sxdgbd0123.cogcontd.20170616133225.20170616133254.log | awk '{ print substr($0,31,1) }'
t

sed -n -e '61p'  /RMan_Listings/sxdgbd0123.cogcontd.20170616133225.20170616133254.log | awk '{ print substr($0,32,1) }'
=

substr skips the diacritical (accented) character é, which is stored as the single byte 351 octal (233 decimal).

Version-Release number of selected component (if applicable):
4.0.2

How reproducible:
All the time

Steps to Reproduce:
1. Use "vi" to create a one-line text file (vi Test.lst).
2. Enter "unité" as the only line of the file and save it.
3. Run: awk '{ print length }' Test.lst

Actual results:
4

Expected results:
5

Additional info:
Comment 2 Marc Beaudoin 2017-06-16 18:06:21 EDT
If a line contains only diacritical characters (àéù, for instance), the function "length" returns the right number of characters.
Comment 3 Marc Beaudoin 2017-06-16 18:12:56 EDT
I have two RHEL 7.3 (Maipo) servers with the same version of awk (GNU Awk 4.0.2).  I have the problem on one but not on the other!  Chances are that the problem lies with another part of the server or with its configuration.
Comment 4 Marc Beaudoin 2017-06-16 18:17:51 EDT
Server with the problem:

oracle@slpgbd0221 :/oracle/tmp$ which awk
/bin/awk
PID:8203 OH:/oracle/product/11.2.0.4 -  SID:oemp(v.) - 18:15:20 - Nb err:3379
oracle@slpgbd0221 :/oracle/tmp$ ls -l /bin/awk
lrwxrwxrwx. 1 root root 4 Jun 15 12:02 /bin/awk -> gawk
PID:8203 OH:/oracle/product/11.2.0.4 -  SID:oemp(v.) - 18:15:27 - Nb err:3379
oracle@slpgbd0221 :/oracle/tmp$ sum /bin/awk
62099   419


Server without the problem:
$ which awk
/usr/bin/awk
$ ls -l  /usr/bin/awk
lrwxrwxrwx. 1 root root 4 Jun 15  2016 /usr/bin/awk -> gawk
$ sum /usr/bin/awk
62099   419
Comment 5 Marc Beaudoin 2017-06-16 18:43:02 EDT
I can't reproduce the problem with vi anymore!  When I enter "unité" with vi, the octal dump is: "165 156 151 164 303 251 012".
When I isolate "unité" from the file I was trying to process with awk, the octal dump is "0000000 165 156 151 164 351 012".

I guess it has to do with UTF.  It is as if awk expects UTF and when it gets characters above 255, it skips them.
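
The two byte sequences quoted above can be regenerated without vi or the original log file. This is a minimal sketch, assuming a POSIX shell with printf, od, tr and sed; the helper name octal_bytes is hypothetical and used only for illustration.

```shell
# Hypothetical helper: print stdin as space-separated octal bytes on one line.
octal_bytes() {
    od -An -b | tr -s ' \n' '  ' | sed 's/^ //; s/ $//'
}

# ISO-8859-1 form of "unité": é is the single byte 351 (octal).
latin1=$(printf 'unit\351\n' | octal_bytes)

# UTF-8 form of "unité": é is the two bytes 303 251 (octal).
utf8=$(printf 'unit\303\251\n' | octal_bytes)

echo "ISO-8859-1: $latin1"   # 165 156 151 164 351 012
echo "UTF-8:      $utf8"     # 165 156 151 164 303 251 012
```

The lone byte 351 is not a valid UTF-8 sequence, which is consistent with awk miscounting it when the locale says the input is UTF-8.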
Comment 6 Marc Beaudoin 2017-06-16 19:02:24 EDT
I realize now that both servers are plagued with the same bug.
When "unité" comes from vi, both servers compute "length" correctly.  When "unité" comes from the file I was trying to process with awk, both servers return a length of 4 instead of 5.

I wish I could upload the file.  I don't see where I can upload it to this bug.
Comment 7 Marc Beaudoin 2017-06-16 19:06 EDT
Created attachment 1288481
Stripped-down version of the text file I am trying to process.  It contains only 5 characters (unité), and awk does not process them properly.
Comment 8 Marc Beaudoin 2017-06-16 19:11:16 EDT
When I wrote "I guess it has to do with UTF.  It is as if awk expects UTF and when it gets characters above 255, it skips them" above, I meant "I guess it has to do with UTF.  It is as if awk expects UTF and when it gets characters above 127, it skips them."
Comment 9 Marc Beaudoin 2017-06-16 19:41:42 EDT
I tried to reproduce the file I am trying to process with awk, but I could not.
awk 'BEGIN { X = sprintf("%c",233); print "unit" X }' | od -b
0000000 165 156 151 164 303 251 012

Character 233 (decimal) is converted into two bytes: 195 and 169.  In short, "awk" generates UTF-8; I guess it expects UTF-8 as input.  How can I make it process ISO-8859-1 input streams?
Comment 10 Marc Beaudoin 2017-06-16 21:11:04 EDT
I found it: one has to set the environment variable LANG to C!
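
A minimal sketch of that workaround, applied to a single command instead of the whole session. It uses LC_ALL rather than LANG as a precaution, because LC_ALL and LC_CTYPE take precedence over LANG when they are set; the input is generated with printf so the example does not depend on the original log file.

```shell
# In the C locale every byte counts as one character, so the
# ISO-8859-1 byte 351 (octal) is no longer skipped.
len_c=$(printf 'unit\351\n' | LC_ALL=C awk '{ print length($0) }')
echo "$len_c"   # prints 5
```

Setting the variable on the command line leaves the locale of the rest of the script untouched.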
Comment 11 Kamil Dudka 2017-06-21 13:45:02 EDT
If the input data is not utf-8 encoded, you need to set locale accordingly.  You can try something like this:

$ export LANG=en_US.iso88591

It of course depends on which languages/encodings you are processing...
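
An alternative to changing the locale, not suggested in this thread and therefore only a sketch, is to convert the input to UTF-8 before awk reads it. This assumes the iconv utility is installed and that the input really is ISO-8859-1; the file name in the final comment is hypothetical.

```shell
# Convert the ISO-8859-1 byte 351 (octal) into the UTF-8 pair 303 251.
converted=$(printf 'unit\351\n' \
    | iconv -f ISO-8859-1 -t UTF-8 \
    | od -An -b | tr -s ' \n' '  ' | sed 's/^ //; s/ $//')
echo "$converted"   # 165 156 151 164 303 251 012

# Typical use with a UTF-8 locale (input.log is a placeholder name):
#   iconv -f ISO-8859-1 -t UTF-8 input.log | awk '{ print length($0) }'
```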
Comment 12 David Kaspar [Dee'Kej] 2017-06-26 08:35:22 EDT
Hello Marc,

(In reply to Kamil Dudka from comment #11)
> If the input data is not utf-8 encoded, you need to set locale accordingly. 
> You can try something like this:
> 
> $ export LANG=en_US.iso88591
> 
> It of course depends on which languages/encodings you are processing...

Did Kamil's suggestion work for you?
Comment 13 David Kaspar [Dee'Kej] 2017-06-28 07:09:02 EDT
Marc,

any update on your problem? :)
Comment 14 Marc Beaudoin 2017-07-10 09:45:09 EDT
Hi David,

Unfortunately, I replied to your first email and, unlike on the IBM support site, the reply did not get posted to this request.

Thanks for pointing out that "export LANG=en_US.iso88591" solves the problem too.  I already knew that "export LANG=C" also solves the problem.

On AIX, my list of available locales contains "en_US.ISO8859-1" and "en_US.8859-15", but not "en_US.iso88591".  On RHEL, I have "en_US.iso88591".

My goal is portability.  It seems that the "C" locale exists on both platforms by default.  Maybe you have more portable solutions to this problem.
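
One way to cope with the differing locale names is to probe `locale -a` and fall back to "C". This is a sketch under assumptions: the function name pick_latin1_locale is hypothetical, and the candidate list contains only the two ISO-8859-1 names mentioned in this thread.

```shell
# Hypothetical helper: print the first ISO-8859-1 locale name that this
# system reports as installed, or "C" if none of the candidates exists.
pick_latin1_locale() {
    available=$(locale -a 2>/dev/null)
    for candidate in en_US.iso88591 en_US.ISO8859-1; do
        if printf '%s\n' "$available" | grep -qxF "$candidate"; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    printf 'C\n'
}

chosen=$(pick_latin1_locale)
echo "using locale: $chosen"

# Count the characters of an ISO-8859-1 line under the chosen locale.
len=$(printf 'unit\351\n' | LC_ALL="$chosen" awk '{ print length($0) }')
echo "$len"
```

Both a single-byte ISO-8859-1 locale and the "C" locale treat byte 351 as one character, so the count should be the same whichever name is picked.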

Otherwise, you can close this request.  However, I guess that I am not the only one to run into this problem.  Maybe there is a way to keep track of this problem and its solution for future reference.

Regards,
Marc.
Comment 15 Kamil Dudka 2017-07-10 10:11:34 EDT
The problem with plain text files is that they do not carry any (machine-readable) information about the encoding they use.  So it is nearly impossible to fix this problem in general.  Using LANG=C is the most portable setting, but it hardly qualifies as a (generic) solution to your problem.  While it may solve the problem with counting characters, as described in comment #0, it will not work well if you, for example, try to sort the data alphabetically.  The locale needs to be set properly.
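
The sorting caveat can be illustrated with a small sketch, assuming a POSIX sort and UTF-8 encoded sample words written with octal escapes. In the C locale, lines are compared byte by byte, so a word starting with "é" (first byte 303 octal) lands after every ASCII word.

```shell
# Sort three words in the C locale: "zone", "été" (UTF-8 bytes), "able".
sorted_c=$(printf 'zone\n\303\251t\303\251\nable\n' | LC_ALL=C sort)
printf '%s\n' "$sorted_c"
# Byte order puts "été" last, after "zone":
#   able
#   zone
#   été
```

A properly configured language locale would normally be expected to collate "été" near the other words starting with "e", which is why the C locale is not a general answer.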
