Bug 103132 - migrate_passwd.pl UTF-8 problems
Status: CLOSED CANTFIX
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openldap
Version: 5.0
Hardware: All  OS: Linux
Priority: medium  Severity: medium
Assigned To: Jan Vcelak
Keywords: Reopened
Depends On:
Blocks:
Reported: 2003-08-26 16:38 EDT by Steve Bonneville
Modified: 2013-03-03 20:27 EST (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-08-17 12:05:51 EDT


Description Steve Bonneville 2003-08-26 16:38:02 EDT
Description of problem:

/usr/share/openldap/migration/migrate_passwd.pl is a Perl script intended to
convert a passwd(5) file into LDIF format.  The while (<INFILE>) loop has a
misguided section where it attempts to do a global search and replace on
non-ASCII patterns (vowels with an umlaut and six other Germanic-looking code
points) to represent them in ASCII instead.  [Conveniently ignoring all the
other non-ASCII possibilities, but that's a separate issue.]

Unfortunately, it does this by encoding raw ISO 8859-15 characters into the
Perl script.  Since Perl 5.8 thinks scripts are encoded in UTF-8 by default,
this breaks and misconverts the code points.
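The mismatch described above can be sketched in a few lines. This is an illustration in Python rather than Perl, and the sample passwd line and helper names are hypothetical; it only shows that the single byte a Latin-9 script stores for "ä" is neither valid UTF-8 on its own nor present in UTF-8 encoded input.

```python
# Hypothetical illustration of the byte-level mismatch; not the script's code.

# "ä" as a script saved in ISO 8859-15 stores it, and as UTF-8 input carries it.
LATIN9_A_UMLAUT = "ä".encode("iso8859-15")  # single byte 0xE4
UTF8_A_UMLAUT = "ä".encode("utf-8")         # two bytes 0xC3 0xA4


def decodes_as_utf8(raw: bytes) -> bool:
    """Return True if the byte string is well-formed UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def pattern_found(pattern: bytes, data: bytes) -> bool:
    """Byte-wise substring search, standing in for a raw-byte regexp match."""
    return pattern in data


# Made-up passwd(5) entry with umlauts in the GECOS field.
SAMPLE_LINE = "jd:x:500:500:Jörg Däumler:/home/jd:/bin/bash"

# The Latin-9 pattern matches Latin-9 input but never UTF-8 input.
found_in_latin9_input = pattern_found(LATIN9_A_UMLAUT, SAMPLE_LINE.encode("iso8859-15"))
found_in_utf8_input = pattern_found(LATIN9_A_UMLAUT, SAMPLE_LINE.encode("utf-8"))
```

Under these assumptions the substitution silently does nothing on UTF-8 input, and the pattern byte itself is malformed if the script source is read as UTF-8.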

I'm not sure this part of the code makes sense anyway.  Are non-ASCII characters
permitted in a GECOS field?  If they are, then what text encoding is used? 
Finally, what encodings are permitted by the LDAP schema for these attributes? 
I have a feeling this script doesn't bother to worry itself about any of that at
all.

Version-Release number of selected component (if applicable):
openldap-servers-2.0.27-9

Steps to Reproduce:
This can be verified by simply 'LANG=C less migrate_passwd.pl' to see the actual
bytes stored.  You can also create a fake passwd(5) file containing appropriate
characters with vi, and process it with the script to see what happens.

Possible Fixes:

OPTION 1) Recode migrate_passwd.pl from ISO 8859-15 to UTF-8:

   mv migrate_passwd.pl migrate_passwd.pl.orig
   cat migrate_passwd.pl.orig | \
     perl -e 'use encoding "iso 8859-15", STDOUT=>"utf8"; while (<>) {print};' \
     > migrate_passwd.pl

OPTION 2) Add the following line to the top of migrate_passwd.pl, specifying
that the script is in ISO 8859-15 format and converting on the fly. This would
probably cause a performance hit, though:

   use encoding "iso 8859-15", Filter=>1;

OPTION 3) Junk these scripts and get some LDAP migration scripts that aren't so
scary.  :)
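For comparison, the recoding step in OPTION 1 can be sketched without the Perl encoding pragma. This is a minimal Python sketch, assuming the script really is stored as ISO 8859-15; the function names and the one-line sample content are hypothetical.

```python
import tempfile
from pathlib import Path


def recode_file(src: Path, dst: Path, src_encoding: str = "iso8859-15") -> None:
    """Read src as src_encoding and write the same text to dst as UTF-8."""
    text = src.read_bytes().decode(src_encoding)
    dst.write_bytes(text.encode("utf-8"))


def demo() -> bytes:
    """Recode a one-line stand-in for the script and return the new bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "migrate_passwd.pl.orig"
        dst = Path(tmp) / "migrate_passwd.pl"
        # A made-up substitution line containing a raw Latin-9 "ä" (0xE4).
        src.write_bytes(b"s/\xe4/ae/g;\n")
        recode_file(src, dst)
        return dst.read_bytes()
```

Recoding only changes how the patterns are stored. Whether they then match still depends on the encoding of the passwd data and on how Perl is told to read the script source.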
Comment 1 Nalin Dahyabhai 2003-09-15 12:16:56 EDT
The 'gecos' field used in the 'posixAccount' objectclass has a syntax of 'IA5
String' (1.3.6.1.4.1.1466.115.121.1.26).  (The alphabet for an IA5 String is
actually more restrictive than ASCII, consisting of alphanumerics and only a few
punctuation marks.)

You're right, the script is not very smart about handling the default codeset,
but we're looking at input data which doesn't have a defined encoding, so I'm
not sure we can do more than guess anyway.

Does Perl 5.8 always assume that scripts are encoded in UTF-8, or just when
executed in a UTF-8 locale?
Comment 2 Steve Bonneville 2003-09-15 16:33:18 EDT
Good point.  I'll admit that I haven't worked with Perl UTF-8 support much, so I
just now went back to the docs to double check a few things.

I'll back off from my assertion that Perl expects scripts to be encoded in
UTF-8, although that does seem to be the eventual plan.  In the meantime, they
have a transitional mechanism that "transparently" decides what to do.

It looks like what's happening is that Perl checks whether LANG has
'UTF-8' in the string.  If it does, Perl assumes that the default encoding of
STDIN / STDOUT / STDERR and of all subsequent file opens is UTF-8.  So yes,
locale affects how the script is read.

In the migrate_passwd.pl script, the regexp engine detects STDIN is UTF-8, and
operates in UTF-8 character context instead of raw byte context (the old
default).  For more info, see perluniintro(1), perlunicode(1), utf8(3pm),
bytes(3pm), and encoding(3pm).

Given that gecos is IA5String, I'm not sure how the script should sanitize any
non-ASCII data it gets fed.  Nevertheless, it shouldn't let any non-ASCII data
through into the output LDIF for the attribute.  :(
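One way such sanitizing could look, as a rough sketch only: decode the input first, transliterate a few known characters, and drop whatever still falls outside 7-bit ASCII. The Python below is an illustration, not the script's logic; the transliteration table is an assumption, since the report does not list the characters the script actually handles.

```python
import unicodedata

# Assumed transliterations for illustration; not taken from migrate_passwd.pl.
TRANSLIT = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}


def sanitize_gecos(value: str) -> str:
    """Return value reduced to 7-bit ASCII characters only."""
    out = []
    for ch in value:
        if ch in TRANSLIT:
            out.append(TRANSLIT[ch])
            continue
        # Split accented letters into base letter plus combining marks,
        # then keep only the code points below 128 (for example é -> e).
        decomposed = unicodedata.normalize("NFKD", ch)
        out.append("".join(c for c in decomposed if ord(c) < 128))
    return "".join(out)
```

For example, `sanitize_gecos("Jörg Däumler")` yields `Joerg Daeumler`, and characters with no ASCII base, such as the euro sign, are removed. This presumes the input encoding is already known, which is the part the comments above identify as the real problem.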
Comment 3 RHEL Product and Program Management 2007-10-19 15:35:02 EDT
This bug is filed against RHEL 3, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet those criteria, it is now being closed.
 
For more information on the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/
 
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.
Comment 4 Steve Bonneville 2007-10-19 15:54:37 EDT
Problem still exists in RHEL 5, openldap-servers-2.3.27-5.  Updated version
affected to RHEL 5.
Comment 5 Jan Safranek 2007-10-22 09:20:08 EDT
reported upstream: http://bugzilla.padl.com/show_bug.cgi?id=346
Comment 6 Jan Zeleny 2009-11-16 05:20:14 EST
Any update here? Since there appears not to be any universal solution, I'm considering closing this bug as CANTFIX.
Comment 7 Steve Bonneville 2009-11-17 16:27:11 EST
It looks to me that the code should be attempting to sanitize the input data so that the contents of the string used for the value of the 'gecos' attribute have only valid IA5 characters (effectively the 7-bit ASCII set) in them.

If this is harder than it's worth, then perhaps the code shouldn't even be attempting to make a conversion in the first place.  (Someone was trying to be clever when they put this hack in the code, I suppose.)  Chances are low that there'd be non-ASCII data in a password GECOS field, I suppose, but this hack must have been dropped in for a reason.

I don't recall at this point in time whether the Perl script fails to match the input characters because it reads them in the wrong charset, or if it produces output which is outside IA5.

As I recall from looking at this way back in, yikes, 2003, part of the issue is the code was mostly written as Perl 4 but running in the Perl 5 environment, which led to some of the other bugs I filed at the same time.
Comment 9 RHEL Product and Program Management 2010-08-09 14:44:42 EDT
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated in the
current release, Red Hat is unfortunately unable to address this
request at this time. Red Hat invites you to ask your support
representative to propose this request, if appropriate and relevant,
in the next release of Red Hat Enterprise Linux.
Comment 10 RHEL Product and Program Management 2011-05-31 09:50:43 EDT
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated in the
current release, Red Hat is unfortunately unable to address this
request at this time. Red Hat invites you to ask your support
representative to propose this request, if appropriate and relevant,
in the next release of Red Hat Enterprise Linux.
Comment 11 Jan Vcelak 2011-08-17 12:05:51 EDT
We haven't found any general solution and there is absolutely no response from upstream for more than three years.

I'm closing this report as CANTFIX.
