1462372 – basic functions skip over diacritical characters

Bug 1462372 - basic functions skip over diacritical characters

Keywords:
Status:	CLOSED CANTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	gawk
Sub Component:
Version:	7.3
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	David Kaspar // Dee'Kej
QA Contact:	BaseOS QE - Apps
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-06-16 21:58 UTC by Marc Beaudoin
Modified:	2017-07-10 14:11 UTC
CC List:	2 users
Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-07-10 14:11:34 UTC
Target Upstream Version:
Embargoed:

Attachments
A one line text file with 5 characters. Last character is "é", an accented character which is ignored by length and substr. (6 bytes, application/octet-stream) 2017-06-16 21:58 UTC, Marc Beaudoin
Strip down version of the text file I am trying to process. It contains only 5 characters (unité) and awk does not process them properly. (6 bytes, text/plain) 2017-06-16 23:06 UTC, Marc Beaudoin

Description Marc Beaudoin 2017-06-16 21:58:27 UTC

Created attachment 1288477
A one line text file with 5 characters.  Last character is "é", an accented character which is ignored by length and substr.

Description of problem:
Functions like length, match and substr seem to skip over accented characters.  For a line with 41 characters, length returns 40.

sed -n -e '61p'  /RMan_Listings/sxdgbd0123.cogcontd.20170616133225.20170616133254.log
canal Tape : SID=13 type d'unit▒=SBT_TAPE

sed -n -e '61p'  /RMan_Listings/sxdgbd0123.cogcontd.20170616133225.20170616133254.log | od -b
0000000 143 141 156 141 154 040 124 141 160 145 040 072 040 123 111 104
0000020 075 061 063 040 164 171 160 145 040 144 047 165 156 151 164 351
0000040 075 123 102 124 137 124 101 120 105 012
0000052

sed -n -e '61p'  /RMan_Listings/sxdgbd0123.cogcontd.20170616133225.20170616133254.log | wc -c
42

sed -n -e '61p'  /RMan_Listings/sxdgbd0123.cogcontd.20170616133225.20170616133254.log | awk '{ print length }'
40

sed -n -e '61p'  /RMan_Listings/sxdgbd0123.cogcontd.20170616133225.20170616133254.log | awk '{ print substr($0,31,1) }'
t

sed -n -e '61p'  /RMan_Listings/sxdgbd0123.cogcontd.20170616133225.20170616133254.log | awk '{ print substr($0,32,1) }'
=

substr skips the diacritical (accented) character é (octal 351, decimal 233).

Version-Release number of selected component (if applicable):
4.0.2

How reproducible:
All the time

Steps to Reproduce:
1. use "vi" to create a text file of one line (vi Test.lst)
2. enter "unité" in the only line of the file and save it.
3. awk '{ print length }' Test.lst

Actual results:
4

Expected results:
5

Additional info:

Comment 2 Marc Beaudoin 2017-06-16 22:06:21 UTC

If a line contains only diacritical characters (àéù for instance), the "length" function returns the right number of characters.

Comment 3 Marc Beaudoin 2017-06-16 22:12:56 UTC

I have two RHEL servers, version 7.3 (Maipo), with the same version of awk (GNU Awk 4.0.2).  I have the problem on one but not the other!  Chances are that the problem lies with another part of the server or with its configuration.

Comment 4 Marc Beaudoin 2017-06-16 22:17:51 UTC

Server with the problem:

oracle@slpgbd0221 :/oracle/tmp$ which awk
/bin/awk
PID:8203 OH:/oracle/product/11.2.0.4 -  SID:oemp(v.) - 18:15:20 - Nb err:3379
oracle@slpgbd0221 :/oracle/tmp$ ls -l /bin/awk
lrwxrwxrwx. 1 root root 4 Jun 15 12:02 /bin/awk -> gawk
PID:8203 OH:/oracle/product/11.2.0.4 -  SID:oemp(v.) - 18:15:27 - Nb err:3379
oracle@slpgbd0221 :/oracle/tmp$ sum /bin/awk
62099   419


Server without the problem:
$ which awk
/usr/bin/awk
$ ls -l  /usr/bin/awk
lrwxrwxrwx. 1 root root 4 Jun 15  2016 /usr/bin/awk -> gawk
$ sum /usr/bin/awk
62099   419

Comment 5 Marc Beaudoin 2017-06-16 22:43:02 UTC

I can't reproduce the problem with vi anymore!  When I enter "unité" with vi, the octal dump is: "165 156 151 164 303 251 012".
When "unité" is isolated from the file I was trying to process with awk, the octal dump is "0000000 165 156 151 164 351 012".

I guess it has to do with UTF.  It is as if awk expects UTF and when it gets characters above 255, it skips them.
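
The two dumps above differ only in how "é" is encoded. A minimal sketch that reproduces both byte sequences without an editor, assuming a POSIX printf and od; the variable names are illustrative only and do not come from the report:

```shell
# ISO 8859-1 encodes é as the single byte 0xE9 (octal 351).
latin1_dump=$(printf 'unit\351\n' | od -An -b)

# UTF-8 encodes é as the two bytes 0xC3 0xA9 (octal 303 251).
utf8_dump=$(printf 'unit\303\251\n' | od -An -b)

echo "ISO 8859-1:$latin1_dump"
echo "UTF-8:     $utf8_dump"
```

The first sequence matches the attached file and the second matches what vi writes in a UTF-8 locale.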

Comment 6 Marc Beaudoin 2017-06-16 23:02:24 UTC

I realize now that both servers are plagued with the same bug.
When "unité" comes from vi, both servers handle "length" properly.  When "unité" comes from the file I was trying to process with awk, both servers return a length of 4 instead of 5.

I wish I could upload the file.  I don't see where I can upload it to this bug.

Comment 7 Marc Beaudoin 2017-06-16 23:06:21 UTC

Created attachment 1288481
Strip down version of the text file I am trying to process.  It contains only 5 characters (unité) and awk does not process them properly.

Comment 8 Marc Beaudoin 2017-06-16 23:11:16 UTC

When I wrote "I guess it has to do with UTF.  It is as if awk expects UTF and when it gets characters above 255, it skips them" above, I meant "I guess it has to do with UTF.  It is as if awk expects UTF and when it gets characters above 127, it skips them."

Comment 9 Marc Beaudoin 2017-06-16 23:41:42 UTC

I tried to recreate the file I am processing by generating it with awk itself, but I could not.
awk 'BEGIN { X = sprintf("%c",233); print "unit" X }' | od -b
0000000 165 156 151 164 303 251 012

Character 233 (decimal) is converted into two bytes: 195 and 169.  In short, "awk" generates UTF8; I guess it expects UTF8 as input.  How can I make it process ISO8859-1 input streams?

Comment 10 Marc Beaudoin 2017-06-17 01:11:04 UTC

I found it: one has to set environment variable LANG to C!
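
A minimal sketch of that workaround, assuming only a POSIX shell and awk. The input is generated with printf to stand in for the attached file, and the variable names are illustrative. The assignment is scoped to the single awk command rather than exported, and LC_ALL is used because it takes precedence over LANG if both happen to be set:

```shell
# "unit" followed by the lone ISO 8859-1 byte 0xE9, as in the attachment.
# In the C locale every byte counts as one character.
len_c=$(printf 'unit\351\n' | LC_ALL=C awk '{ print length($0) }')
echo "$len_c"

# substr now returns the accented byte instead of skipping it.
fifth_char_dump=$(printf 'unit\351\n' | LC_ALL=C awk '{ print substr($0, 5, 1) }' | od -An -b)
echo "$fifth_char_dump"
```

With the locale forced to C, length reports 5 and the fifth character is the byte 351 (octal).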

Comment 11 Kamil Dudka 2017-06-21 17:45:02 UTC

If the input data is not utf-8 encoded, you need to set locale accordingly.  You can try something like this:

$ export LANG=en_US.iso88591

It of course depends on which languages/encodings you are processing...
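
A locale can only be used if it is installed on the machine. The following is a rough sketch, not a tested recipe: pick_locale is a hypothetical helper that prints the first candidate reported by "locale -a" and falls back to C when none is available. The chosen locale is applied to one awk command only.

```shell
# Hypothetical helper: print the first candidate locale that `locale -a`
# lists, or "C" if none of the candidates is installed.
pick_locale() {
    available=$(locale -a 2>/dev/null || true)
    for candidate in "$@"; do
        if printf '%s\n' "$available" | grep -qxF -- "$candidate"; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    printf 'C\n'
}

chosen=$(pick_locale en_US.iso88591)
echo "using locale: $chosen"

# The test input is "unit" plus the single ISO 8859-1 byte 0xE9.
len=$(printf 'unit\351\n' | LC_ALL="$chosen" awk '{ print length($0) }')
echo "$len"
```

Several candidates can be passed in order of preference, for example pick_locale en_US.iso88591 C.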

Comment 12 David Kaspar // Dee'Kej 2017-06-26 12:35:22 UTC

Hello Marc,

(In reply to Kamil Dudka from comment #11)
> If the input data is not utf-8 encoded, you need to set locale accordingly. 
> You can try something like this:
> 
> $ export LANG=en_US.iso88591
> 
> It of course depends on which languages/encodings you are processing...

Did Kamil's suggestion work for you?

Comment 13 David Kaspar // Dee'Kej 2017-06-28 11:09:02 UTC

Marc,

any update on your problem? :)

Comment 14 Marc Beaudoin 2017-07-10 13:45:09 UTC

Hi David,

Unfortunately, I replied to your first email and, unlike on the IBM support site, the reply did not get posted to this request.

Thanks for pointing out that "export LANG=en_US.iso88591" solves the problem too.  I already knew that "export LANG=C" also solves the problem.

On AIX, my locale contains "en_US.ISO8859-1" and "en_US.8859-15", but not "en_US.iso88591".  On RHEL, I have "en_US.iso88591".

My goal is portability.  It seems that "C" locale exists on both platforms by default.  Maybe you have more portable solutions regarding this problem.

Otherwise, you can close this request.  However, I guess that I am not the only one to run into this problem.  Maybe there is a way to keep track of this problem and its solution for future reference.

Regards,
Marc.

Comment 15 Kamil Dudka 2017-07-10 14:11:34 UTC

The problem with plain text files is that they do not carry any (machine-readable) info about the encoding they use.  So it is nearly impossible to fix this problem in general.  Using LANG=C is the most portable setting but it hardly qualifies as a (generic) solution to your problem.  While it may solve the problem with counting characters, as described in comment #0, it will not work well if you e.g. try to alphabetically sort the data.  The locale needs to be set properly.
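
Because the encoding cannot be read from the file itself, any automatic handling is a guess. One possible heuristic, sketched here under stated assumptions, is to test whether the bytes are valid UTF-8 and, if they are not, to assume ISO 8859-1 (the encoding seen in this report) and convert to UTF-8, so that later tools can run under a single UTF-8 locale. The helper name to_utf8 is hypothetical, mktemp is assumed to be available, and the encoding names accepted by iconv vary between platforms (check "iconv -l").

```shell
# Hypothetical helper: write the file named by $1 to stdout as UTF-8.
# Input that is not valid UTF-8 is ASSUMED to be ISO 8859-1; this is a
# guess, not real encoding detection.
to_utf8() {
    if iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1; then
        cat "$1"
    else
        iconv -f ISO-8859-1 -t UTF-8 "$1"
    fi
}

tmp=$(mktemp)
printf 'unit\351\n' > "$tmp"          # same bytes as the attached file
converted_dump=$(to_utf8 "$tmp" | od -An -b)
rm -f "$tmp"
echo "$converted_dump"
```

After conversion the dump shows "é" as the two bytes 303 251, the same form that vi produced in comment 5.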
