190363 – RFE: Reject non-UTF-8 input.

Keywords:
Status:	CLOSED UPSTREAM
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	rpm
Sub Component:
Version:	rawhide
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Panu Matilainen

Reported:	2006-05-01 16:38 UTC by David Woodhouse
Modified:	2009-01-30 09:12 UTC
CC List:	2 users
Last Closed:	2009-01-30 09:12:03 UTC
Type:	---

Description David Woodhouse 2006-05-01 16:38:39 UTC

rpmbuild currently allows legacy character sets in changelogs, summary, etc. 

It would be useful if it would refuse to build if the specfile contains such
errors. Valid UTF-8 should be required.
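
The requested check can be sketched in a few lines. This is only an illustration of "refuse to build if the specfile is not valid UTF-8", not rpmbuild code; the function name `find_invalid_utf8` and the sample specfile contents are assumptions made for the example.

```python
def find_invalid_utf8(data: bytes):
    """Return (line_number, byte_offset) of the first invalid UTF-8
    sequence in data, or None if the whole input is valid UTF-8."""
    try:
        data.decode("utf-8")  # strict decoding: any bad sequence raises
        return None
    except UnicodeDecodeError as err:
        # err.start is the byte offset of the offending sequence;
        # count the newlines before it to get a 1-based line number.
        line = data.count(b"\n", 0, err.start) + 1
        return line, err.start


if __name__ == "__main__":
    # Hypothetical specfile fragment whose Summary line was saved as Latin-1.
    spec = b"Name: demo\n" + "Summary: Café tools\n".encode("latin-1")
    problem = find_invalid_utf8(spec)
    if problem is not None:
        print("invalid UTF-8 at line %d, byte offset %d" % problem)
```

A build tool following this RFE would abort at that point instead of printing a message.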

Comment 1 Michael Jennings (KainX) 2006-05-01 19:09:29 UTC

No, it shouldn't.  Character set encoding should be a matter of policy, not
programming.  RPM should support any character set encoding which can correctly
be processed by the shell and RPM's own parser.

Comment 2 David Woodhouse 2006-05-01 19:23:30 UTC

I believe RPM lacks any method of _marking_ text which is stored in obsolete
character sets. Therefore it _can't_ correctly be processed. You end up with a
mix of UTF-8 and obsolete undecipherable data in the RPM database.

Comment 3 Michael Jennings (KainX) 2006-05-01 22:14:19 UTC

So now you think you can declare character sets obsolete?  Who gave you that
right?  What RPM needs is a method for specifying character encodings for data,
not some wholesale declaration from Almighty RedHat that their solution is the
One True Way.  You can keep your Kool-Aid, thanks.

Comment 4 David Woodhouse 2006-05-01 22:20:56 UTC

This is the Fedora Core bugzilla, and a Fedora Core RFE.

Comment 5 Michael Jennings (KainX) 2006-05-01 23:15:47 UTC

It's also the RPM bugzilla and an RPM RFE.

Comment 6 David Woodhouse 2006-05-01 23:35:28 UTC

In the context of Fedora, having specfile content in non-UTF8 character sets is
a bug. I'm not interested in what RPM does outside Fedora, although the same
argument probably applies -- unmarked obsolete character sets are just random
undecipherable data. This isn't the place to discuss that though -- this is
purely a Fedora RFE.

Comment 7 Michael Jennings (KainX) 2006-05-02 00:02:42 UTC

And I couldn't care less what RPM does *in* Fedora, except that cooperation with
other RPM-based distributions helps everyone, and introducing incompatibilities
between RPM in different distributions is self-defeating.

This is an issue which has potentially-far-reaching implications, and blindly
implementing it is extremely unwise.  The issue requires discussion.  Knee-jerk
solutions are rarely the appropriate response to complex problems.

Comment 8 Paul Nasrat 2006-05-02 01:06:02 UTC

FWIW this has been brought up upstream:

https://lists.dulug.duke.edu/pipermail/rpm-devel/2006-February/000779.html

https://lists.dulug.duke.edu/pipermail/rpm-devel/2006-March/000931.html

Michael - Jeff is already talking about doing this upstream, I'd raise any
concerns about UTF-8 everywhere on list.

Comment 9 David Woodhouse 2006-05-02 11:14:32 UTC

Yeah, I was talking to Jeff about this a while ago. The point is that if it's
random unmarked character sets, you _can't_ sanely guess what it is. Not without
being horribly US-centric, anyway. It's better just to require correct input,
rather than hacking up crap and known-broken heuristics.

We should fix rpmlint to check for it if it doesn't already, too.

Comment 10 Jeff Johnson 2006-05-02 22:50:00 UTC

<sarcasm>
Well there's always g_locale_to_utf8(), which applies iconv(3) until joy ...
</sarcasm>

It *IS* a matter of policy, not implementation, to set encodings. However, that basically
means that encodings need to be hinted from spec files, and verified accurate
when building, and carried within packages, and converted to local locale when
displayed, etc, etc, etc.

Perfectly doable, but it's gonna be a tedious nightmare ...

Comment 11 David Woodhouse 2006-05-02 23:23:26 UTC

Your sarcastic suggestion is likely to be taken seriously by the less sober
among us... you know perfectly well that you can't just apply iconv(3) until you
get something which works -- there are a huge number of 8-bit encodings which
can be converted to UTF-8 but don't actually _mean_ anything. Untagged non-UTF-8
data is just line noise.
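
A small sketch of why "convert until something works" proves nothing: the same untagged bytes decode without error under several 8-bit encodings, and each gives different text. The helper name `plausible_decodings` and the sample bytes are assumptions chosen for illustration.

```python
def plausible_decodings(raw: bytes, candidates):
    """Return {encoding: decoded_text} for every candidate encoding
    that accepts the bytes without raising an error."""
    results = {}
    for encoding in candidates:
        try:
            results[encoding] = raw.decode(encoding)
        except UnicodeDecodeError:
            pass  # this encoding rejects the bytes outright
    return results


if __name__ == "__main__":
    raw = b"\xb1\xe6\xea"  # three bytes with no encoding tag attached
    found = plausible_decodings(
        raw, ["utf-8", "iso-8859-1", "iso-8859-2", "koi8-r"]
    )
    for encoding, text in found.items():
        print(encoding, repr(text))
```

UTF-8 rejects these bytes, but every 8-bit candidate accepts them, so a successful conversion says nothing about which reading was intended.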

Tagging data is pointless -- just store it in a portable format (i.e. UTF-8) in
all RPM metadata, and convert it only for display if you have to.

You might be able to implement some broken hack in other situations to appease
some outspoken Luddites, but we certainly wouldn't want anything like that in
Fedora -- just require UTF-8 and abort the build if it's not valid.

Comment 12 Michael Jennings (KainX) 2006-05-03 01:55:13 UTC

(In reply to comment #11)
> Untagged non-UTF-8 data is just line noise.

Untagged data in ANY unknown encoding is line noise.

> Tagging data is pointless

That may be your opinion, but it is beyond your reach to make that assumption
for everyone else on the planet.

> You might be able to implement some broken hack in other situations to appease
> some outspoken Luddites,

So now we're resorting to name-calling?  How unfortunate.  Are they handing out
@redhat.com e-mail addresses to just anyone these days?

If all you have to offer is dictatorial decrees and orthogonal vitriolic spew,
kindly step aside so that those of us who wish to discuss the technical merits
of this RFE may do so without all the interfering spam.

Comment 13 David Woodhouse 2006-05-03 08:09:26 UTC

You are correct. It is beyond my reach to make that assumption for everyone else
on the planet. That's why I restricted myself to doing so in a Fedora Core RFE,
in Fedora bugzilla.

In the context of _Fedora_ it's perfectly reasonable to label those who refuse
to use UTF-8 as Luddites. You just have to look at the quality of the
alternative 'solution' which was proposed -- hacking all the RPM formats from
specfile through to the database to tag data in random formats instead of just
storing it in a consistent encoding in the first place.

Since you persist in trolling the Fedora bugzilla and talking about non-Fedora
issues, I suppose I might as well capitulate and discuss it...

There's no excuse for avoiding UTF-8 in RPM internals, even outside the context
of Fedora. That would really be pointless -- there's certainly no need to
'extend' its file formats when we can just store data in UTF-8, which can
represent the older encodings.

We can quite happily fix rpmq to convert from UTF-8 to the current locale in its
output, and fix rpmbuild to convert _to_ UTF-8 from the current locale when
reading the specfile. Although we certainly wouldn't want the latter in Fedora
-- if I check out the current libxml2/devel branch from CVS and attempt to build
it, for example, it should _fail_. It certainly shouldn't use _my_ locale (and
it'd fail anyway because of course my locale is UTF-8).
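
The display-side half of that idea (store UTF-8, convert only on output) might look roughly like this. It is a sketch and not rpmq code; `for_display` is a hypothetical helper, and replacing unrepresentable characters with "?" is an assumed policy.

```python
import locale


def for_display(stored: bytes, target_encoding=None) -> bytes:
    """Convert UTF-8 metadata to the encoding of the user's locale.

    Characters the target encoding cannot represent are replaced with
    '?' rather than aborting, since the output is only for display.
    """
    text = stored.decode("utf-8")  # stored metadata is assumed valid UTF-8
    encoding = target_encoding or locale.getpreferredencoding(False)
    return text.encode(encoding, errors="replace")


if __name__ == "__main__":
    summary = "Zażółć gęślą jaźń".encode("utf-8")
    print(for_display(summary, "iso-8859-2"))  # representable in Latin-2
    print(for_display(summary, "ascii"))       # lossy: accents become '?'
```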

You'd need a way to handle existing RPM databases, which may contain random data
in unknown encodings. Probably an 'rpm --rebuilddb --oldcharset=FOO' on RPM
upgrade? This isn't a new problem _anyway_ since an existing RPM database
without either a consistent charset or charset tagging is just line noise.

And of course you might have to call it 'RPM-CHARSET' instead of 'UTF-8' to
appease those who have religious objections to UTF-8.

Comment 14 Jeff Johnson 2006-05-03 13:06:12 UTC

FWIW, Fedora could easily apply iconv(1) to all spec files no matter what rpm does.
All the horses have to want to do is drink ...

Comment 15 David Woodhouse 2006-05-03 13:17:38 UTC

Apply iconv to convert from what character set? If I check libxml2 out of CVS
and attempt to build it, how is my system supposed to guess which random
obsolete character set it's using? It just doesn't work -- it's better if the
broken specfile doesn't compile on the system of whoever committed it.

Comment 16 Michael Jennings (KainX) 2006-05-03 21:58:23 UTC

(In reply to comment #13)
> You are correct. It is beyond my reach to make that assumption for everyone else
> on the planet. That's why I restricted myself to doing so in a Fedora Core RFE,
> in Fedora bugzilla.

At the risk of beating a dead horse, until either jbj or RedHat decide to part
Bugzillas, this is also RPM's bugzilla.

> In the context of _Fedora_ it's perfectly reasonable to label those who refuse
> to use UTF-8 as Luddites.

http://www.nizkor.org/features/fallacies/ad-hominem.html

> You just have to look at the quality of the alternative 'solution' which was
> proposed -- hacking all the RPM formats from specfile through to the database
> to tag data in random formats instead of just storing it in a consistent
> encoding in the first place.

You are using the word "random" in a manner with which I am unfamiliar. 
Specified and defined character encodings are not "random."

> Since you persist in trolling the Fedora bugzilla and talking about non-Fedora
> issues, I suppose I might as well capitulate and discuss it...

Get this through your head:  This is not Fedora bugzilla.  This is RedHat
bugzilla, which is currently shared between RHEL, RPM, Fedora, RHAS, and RHN,
among others.  There is no "RPM" product, so the "rpm" component is used.  You
used it.  So here we are.

Furthermore, this is a Bazaar, not a Cathedral.  RPM is used by AIX, Solaris,
Darwin, and numerous flavors of Linux, not just Fedora.  If you have a problem
with that, convince the Fedora Deities to use a different package format.  Until
then, suck it up and deal.

Those who develop RPM and related tools concern themselves with numerous
operating systems, the majority of which do NOT use UTF-8 by default.

> There's no excuse for avoiding UTF-8 in RPM internals, even outside the context
> of Fedora. That would really be pointless -- there's certainly no need to
> 'extend' its file formats when we can just store data in UTF-8, which can
> represent the older encodings.

Those who fail to learn from the mistakes of history are doomed to repeat them.
You have apparently failed to learn from the mistake of assuming that the de
facto standard encoding cannot change over time and does not differ between
platforms.  Right now, UTF-8 is a compelling replacement for Latin encodings
(which are NOT obsolete, so stop erroneously using that term).  In the future,
UTF-8 may be found to be insufficient to the cause.   The correct long-term
solution is to allow spec files to specify an arbitrary encoding and to use an
internal encoding which can store all data any other encoding could contain.

> fix rpmbuild to convert _to_ UTF-8 from the current locale when
> reading the specfile.

There is no relationship between current locale and the encoding of a particular
spec file.

> if I check out the current libxml2/devel branch from CVS and attempt to build
> it, for example, it should _fail_. It certainly shouldn't use _my_ locale (and
> it'd fail anyway because of course my locale is UTF-8).

There is nothing whatsoever inherently wrong with any particular encoding.  I
should be able to create a spec file in UCS-4 or UTF-32 if I so choose.  The
problem is telling RPM what encoding was used, and the proper solution does not
involve ASSuming UTF-8 and failing on an invalid character.

> You'd need a way to handle existing RPM databases, which may contain random data
> in unknown encodings.

"random"  You keep using that word.  I do not think it means what you think it
means.

> And of course you might have to call it 'RPM-CHARSET' instead of 'UTF-8' to
> appease those who have religious objections to UTF-8.

The encoding should be called exactly what it is, be it UTF-8, UCS-4, or any
other.  My objections are not to UTF-8 itself, and they're technical, not religious.

Now, let's talk technical details to try and save the usefulness of this whole
conversation.

Jeff and Paul (and other RPM developers), please comment on the following two ideas:

1.  Spec files are encoded as US-ASCII/UTF-8 by default.  Any containing
characters which cannot be encoded thusly must specify their encoding via either
a header value ("Encoding: ISO-8859-2") or a macro value ("%define
__spec_encoding ISO-8859-2"), whichever you think is better.

2.  Values which contain non-ASCII characters should specify encoding similar to
the way languages are currently specified.  For example, PLD uses Summary(pl):
and description -l pl to denote Polish content.  This could be expanded to allow
Summary(pl.utf8) and description -l pl.utf8.
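
Idea 1 could be prototyped along these lines. The `Encoding:` header is the proposal from this comment, not an existing rpm feature, and `decode_spec` is a hypothetical name. The sketch also assumes the declared encoding is ASCII-compatible, so the tag line itself can be found before decoding; that would not hold for UCS-4 or UTF-32.

```python
import re

# The proposed tag line is pure ASCII, so it can be located in the raw bytes.
ENCODING_TAG = re.compile(
    rb"^Encoding:[ \t]*([A-Za-z0-9._-]+)[ \t]*$", re.MULTILINE
)


def decode_spec(data: bytes) -> str:
    """Decode a spec file using its declared encoding, defaulting to UTF-8.

    Raises UnicodeDecodeError if the bytes do not match the declared (or
    default) encoding, and LookupError if the encoding name is unknown.
    """
    match = ENCODING_TAG.search(data)
    encoding = match.group(1).decode("ascii") if match else "utf-8"
    return data.decode(encoding)


if __name__ == "__main__":
    tagged = (b"Name: demo\nEncoding: ISO-8859-2\n"
              + "Summary(pl): Zażółć\n".encode("iso-8859-2"))
    print(decode_spec(tagged))
```

An untagged file containing the same Latin-2 bytes would raise `UnicodeDecodeError`, since the US-ASCII/UTF-8 default applies.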

Comment 17 David Woodhouse 2006-05-04 00:50:06 UTC

(In reply to comment #16)
> At the risk of beating a dead horse, until either jbj or RedHat decide to part
> Bugzillas, this is also RPM's bugzilla.

jbj isn't even Cc'd on this Fedora bug, although as an outsider who happens to
have an account he was of course able to make a comment.

> http://www.nizkor.org/features/fallacies/ad-hominem.html

Read it again. You evidently misunderstood it the first time.

Note the difference between the following:

A. "You smoke crack. Therefore your opinion is irrelevant".
B. "You have very strange opinions. Therefore I suspect you smoke crack."

The former is the classic ad-hominem fallacy. The latter isn't.

> You are using the word "random" in a manner with which I am unfamiliar. 
> Specified and defined character encodings are not "random."

My point is that they _aren't_ specified and defined. In the absence of such
tagging, it's line noise. It's essentially random.

> Get this through your head:  This is not Fedora bugzilla. 

Heh. Mind if I quote you on that? This bug is filed against a specific version
of a specific product. 

> There is no relationship between current locale and the encoding of a 
> particular spec file.

Yes, that's the point I made in the immediately subsequent sentence.

> 1.  Spec files are encoded as US-ASCII/UTF-8 by default.  Any containing
> characters which cannot be encoded thusly must specify their encoding via either
> a header value ("Encoding: ISO-8859-2") or a macro value ("%define
> __spec_encoding ISO-8859-2"), whichever you think is better.
> 
> 2.  Values which contain non-ASCII characters should specify encoding similar to
> the way languages are currently specified.  For example, PLD uses Summary(pl):
> and description -l pl to denote Polish content.  This could be expanded to allow
> Summary(pl.utf8) and description -l pl.utf8.

That's only a partial solution, and it's the uninteresting part of the solution.
The more interesting part is what you do with an existing RPM database if it
contains random data. And I do mean 'random' -- if it's in untagged character
sets it might as well be line noise.

Comment 18 Kevin Kofler 2006-05-04 12:12:14 UTC

The difference between UTF-8 and the obsolete ISO 8859 encodings is that UTF-8 
can represent all languages of the world, so there is no need for supporting 
anything else.

Comment 19 David Woodhouse 2006-05-04 12:27:31 UTC

That viewpoint is a little excessive. It's definitely sane for 'rpmq' to be able
to convert to the user's locale when displaying text.

It might also make sense to take specfiles in obsolete charsets, if they are
clearly marked as such. If that's done, the original Fedora RFE stands, in a
slightly modified form -- if it _isn't_ tagged, and if it isn't valid UTF-8, we
should reject it.

Repeat after me: Untagged data are no better than line noise.

Comment 20 Michael Jennings (KainX) 2006-05-04 17:22:28 UTC

(In reply to comment #17)
> jbj isn't even Cc'd on this Fedora bug, although as an outsider who happens to
> have an account he was of course able to make a comment.

Apparently you aren't familiar with Bugzilla's watch feature or weren't aware
that jbj was watching.  Regardless, your statement does nothing to contradict or
invalidate my point:  RPM has used this Bugzilla since its inception and will
continue to do so unless and until either jbj or RedHat decide to alter the
arrangement.

> Read it again. You evidently misunderstood it the first time.

Funny, I'd say the same about you.

> Note the difference between the following:
> 
> A. "You smoke crack. Therefore your opinion is irrelevant".
> B. "You have very strange opinions. Therefore I suspect you smoke crack."

Neither of which is representative of what actually occurred:

C.  "You disagree with me.  Therefore I will imply that you are a Luddite."

> My point is that they _aren't_ specified and defined. In the absence of such
> tagging, it's line noise. It's essentially random.

It's not even close to random by any sane definition of the word.  The contents
of a spec file in an unspecified encoding would have at most 1 or 2 bits of
entropy per byte of content.  A sufficiently determined individual could almost
certainly use a known-plaintext attack on certain spec file parts to produce the
encoding.  If this is true, applying the label "random" is clearly an overstatement.

What you mean to say is that the encoding of untagged data cannot be
computationally deduced with sufficient certainty to be used with
system-critical information such as that stored in an RPM database.

And with that, I agree.  :-)

My primary goal in all this is two-fold:  One, point out that the assumption
that UTF-8 is the end-all and be-all encoding for the entire lifespan of the RPM
product is potentially as erroneous as the assumption that the C/POSIX locale
makes for a sufficient default.  Two, open up discussion on mechanisms for
tagging to allow for the future.

> Heh. Mind if I quote you on that?

Go ahead, so long as you include the entire context:
  1.  Fedora is one of many products sharing this Bugzilla.
  2.  RPM is also sharing this Bugzilla.
  3.  RPM does not have its own "product" selection.
  4.  As a result of 2 and 3, any bug filed against the "rpm" component under
      *any* product may be an upstream issue.

> That's only a partial solution, and it's the uninteresting part of the solution.

"Uninteresting" is often a synonym for "important."  In light of goal #1 stated
above, I consider it an important point.

> The more interesting part is what you do with an existing RPM database if it
> contains random data.

If you have random data in your RPM database, you have bigger issues than
whether or not the random data is UTF-8 encoded randomness.

> And I do mean 'random' -- if it's in untagged character sets it might as well
> be line noise.

Nonsense.  The vast majority of textual data in an RPM database is plain old
ordinary ASCII, which means it's valid ISO-8859-?? as well as UTF-8. 
Furthermore, I am having a hard time coming up with examples of RPMDB text data
which (1) would contain high-ASCII/multibyte data sequences AND (2) the
interpretation of which would have significant material impact on a system.

Most situations where encoding counts are things like descriptions and
summaries...things that are merely cosmetic.

It seems to me that the following would suffice when upgrading a non-tagged RPMDB:

1.  All data which can be interpreted as ASCII is ASCII.
2.  If any bytes 0x80 and above are encountered, attempt to process as UTF-8.
3.  If invalid UTF-8 is encountered, look for language tags (like fr or de) to
deduce encoding (Latin-N, SJIS, BIG5).
4.  If no deduction can be made with reasonable certainty, or if a deduction
could cause system problems, replace invalid UTF-8 character sequences with some
other character and move on.
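
The four steps above can be sketched as follows. The language-to-encoding table is illustrative only, `upgrade_string` is a hypothetical helper, and the "could cause system problems" judgment in step 4 is not modeled.

```python
# Illustrative guesses only; a real table would need far more care.
LANGUAGE_GUESSES = {
    "pl": "iso-8859-2",
    "de": "iso-8859-1",
    "fr": "iso-8859-1",
    "ja": "shift_jis",
    "zh_TW": "big5",
}


def upgrade_string(raw: bytes, lang=None) -> str:
    """Best-effort conversion of one untagged database string to text."""
    # Steps 1 and 2: ASCII is a subset of UTF-8, so a single strict
    # UTF-8 decode handles both cases.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # Step 3: use the language tag, if any, to guess a legacy encoding.
    guess = LANGUAGE_GUESSES.get(lang)
    if guess is not None:
        try:
            return raw.decode(guess)
        except UnicodeDecodeError:
            pass
    # Step 4: no usable guess; substitute U+FFFD for invalid sequences.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(upgrade_string("ąć".encode("iso-8859-2"), lang="pl"))
    print(upgrade_string(b"\xff broken"))
```

Note that single-byte encodings such as ISO-8859-1 accept any byte sequence, so a successful decode in step 3 does not confirm the guess was right.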

(In reply to comment #18)
> The difference between UTF-8 and the obsolete ISO 8859 encodings is that UTF-8 
> can represent all languages of the world, so there is no need for supporting 
> anything else.

First off, ISO-8859 encodings are not obsolete.  The vast majority of UNIX-like
systems in the world still use Latin-N encodings, and that's not going to change
any time soon.  Fedora developers declaring something obsolete does not make it
obsolete; rather, it makes said developers pretentious.

Second, as I've said before, UTF-8 being the "answer to all our encoding
problems" now does not mean it will continue to be so in the future.  UTF-8 owes
its popularity to two compelling but potentially limiting facts:  ASCII
encodings don't change, and C-style NUL termination doesn't have to change.  For
legacy code, those are a huge win.  But as more and more code becomes
multilingual and encoding-agnostic, those factors reduce significantly in
importance, lending additional potential to more consistent encodings such as
UCS2 and UCS4.
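
The two properties mentioned here are easy to check. In this sketch Python's `utf-16-le` and `utf-32-le` codecs stand in for UCS2 and UCS4 (an approximation that holds for characters in the Basic Multilingual Plane), and `encoding_profile` is a hypothetical helper.

```python
def encoding_profile(text: str):
    """Report encoded size and presence of NUL bytes for a few encodings."""
    profile = {}
    for name in ("utf-8", "utf-16-le", "utf-32-le"):
        encoded = text.encode(name)
        profile[name] = {
            "bytes": len(encoded),
            "has_nul": 0 in encoded,  # would break C-style NUL termination
        }
    return profile


if __name__ == "__main__":
    # ASCII text is byte-for-byte identical in UTF-8.
    assert "rpm".encode("utf-8") == "rpm".encode("ascii")
    for name, info in encoding_profile("rpm café").items():
        print(name, info)
```

The fixed-width encodings embed zero bytes in ordinary text, which is why code relying on NUL-terminated strings can adopt UTF-8 but not UCS2 or UCS4 without changes.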

If you really want something to become obsolete, continue thinking with blinders
on.  Before you know it, your thinking will be obsolete.

(In reply to comment #19)
> That viewpoint is a little excessive. It's definitely sane for 'rpmq' to be able
> to convert to the user's locale when displaying text.

Definitely.

> It might also make sense to take specfiles in obsolete charsets, if they are
> clearly marked as such. If that's done, the original Fedora RFE stands, in a
> slightly modified form -- if it _isn't_ tagged, and if it isn't valid UTF-8, we
> should reject it.

I would have no problem with that, with two provisos:
1.  Encoding *should* always be tagged.  Relying on a particular default should
be discouraged.
2.  The RPMDB encoding should be opaque as far as packagers are concerned. 
While UTF-8 may make sense for now, an alternate format may be preferable in the
future.

> Repeat after me: Untagged data are no better than line noise.

s/no/only marginally/

Comment 21 Jeff Johnson 2006-05-04 22:22:39 UTC

To answer the 2 questions from comment #16:

1) The "default" encoding for non-i18n tags (summary/description/group) is
essentially 8bit octets since there is no enforcement of any encoding.

2) For i18n tags (summary/description/group), one can specify any locale one
wishes, e.g.
    Summary(pl.utf8):
    Summary(pl.wtfencoding)
are perfectly permissible, and the associative array used for retrieval of the
tags will attempt to key from "pl.utf8" and "pl.wtfencoding" for appropriate
strings to be returned.
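
As a rough model of that keyed retrieval, the i18n strings can be treated as a dictionary indexed by locale name. The fallback order below (full name, then without the encoding suffix, then bare language, then "C") is an assumption made for the sketch, not a description of rpm's actual lookup, and both function names are hypothetical.

```python
def candidate_keys(locale_name: str):
    """Yield progressively less specific keys for a locale name."""
    keys = [locale_name]
    base = locale_name.split(".", 1)[0]  # drop an encoding suffix like ".utf8"
    if base not in keys:
        keys.append(base)
    lang = base.split("_", 1)[0]         # drop a territory like "_PL"
    if lang not in keys:
        keys.append(lang)
    return keys


def lookup_summary(summaries: dict, locale_name: str, default="C") -> str:
    """Return the most specific summary available for locale_name."""
    for key in candidate_keys(locale_name):
        if key in summaries:
            return summaries[key]
    return summaries[default]


if __name__ == "__main__":
    summaries = {
        "C": "Package manager",
        "pl": "Menedzer pakietow",
        "pl.utf8": "Menedżer pakietów",
    }
    print(lookup_summary(summaries, "pl.utf8"))
    print(lookup_summary(summaries, "de"))
```

The key is matched as an opaque string, which is consistent with the point that nothing verifies or converts the encoding named in it.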

That being said, the real world is far more complicated, and there is no current attempt to
translate strings into a different encoding, the current packaging pragma is to assume a 2 letter
country code implies an encoding, and the whole mess was obsoleted in RHL 6.2, at least for
RH packaging, in favor of specspo (the implementation has not been tested since RHL 6.2, at least
by me) which at least *has* an explicit and well specified encoding.

Nor are summary/description/group the only data that should have i18n representations. Unfortunately,
rpm has no "array of i18n strings" data type, only "string array" and "i18n string" (i.e. an associative
array). Adding a new data type to rpm is a very very painful experience; using a specspo-like mechanism to
associate strings *WITH* a package rather than *IN* a package is far easier to achieve than
actually adding a new data type.

I seriously question whether i18n encodings for strings has anything whatsoever to do with package 
management. Use a key, like Pkgid/Hdrid/NEVRA or whatever to attach glop to the package for 
display purposes instead.
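
A toy version of "strings *WITH* a package": display text lives in a separate catalog keyed by a package identifier, and the package itself carries none of it. The catalog layout, the NEVRA string format used as the key, and `display_summary` are all assumptions for illustration, not how specspo is implemented.

```python
# External catalog: (package key, language) -> display string.
# The catalog has one known encoding because it is maintained separately.
CATALOG = {
    ("demo-0:1.0-1.noarch", "C"): "Example package",
    ("demo-0:1.0-1.noarch", "pl"): "Przykładowy pakiet",
}


def display_summary(nevra: str, lang: str, catalog=CATALOG) -> str:
    """Look up display text for a package, falling back to the "C" entry."""
    return catalog.get((nevra, lang)) or catalog.get((nevra, "C"), "")


if __name__ == "__main__":
    print(display_summary("demo-0:1.0-1.noarch", "pl"))
    print(display_summary("demo-0:1.0-1.noarch", "de"))
```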

Comment 22 Michael Jennings (KainX) 2006-05-14 16:33:35 UTC

IIUC, you are saying that you wish for RPM to essentially remain ignorant of
encodings and leave that up to the individual packages.  Is that correct?

Do RPM scriptlets always execute under the current user's locale?

Comment 23 Jeff Johnson 2006-05-15 17:31:02 UTC

Yes, mapping i18n encodings out of rpm to the greatest extent possible.

RPM scriptlets execute without attempting to control for locale. Whatever
is in the environment is in the environment.

Comment 24 Michael Jennings (KainX) 2006-05-16 16:53:21 UTC

WONTFIX, then?

Comment 25 David Woodhouse 2006-05-16 16:58:07 UTC

At the very least, RPM should refuse to eat untagged non-ASCII data.

Comment 26 Jeff Johnson 2007-06-06 05:49:33 UTC

Untagged non-ASCII data is syntactically permitted in spec files that are currently widely deployed.

So rpmbuild should just FULL STOP because you want utf-8 everywhere? Shall I add your email
(and preferred encoding) to the error message before rpmbuild comes FULL STOP?

Comment 27 Red Hat Bugzilla 2007-08-21 05:23:48 UTC

User pnasrat's account has been closed

Comment 28 Panu Matilainen 2007-08-22 06:35:04 UTC

Reassigning to owner after bugzilla made a mess, sorry about the noise...

Comment 29 Jon Stanley 2008-04-23 20:30:33 UTC

Adding FutureFeature keyword to RFEs.

Comment 31 Panu Matilainen 2009-01-30 09:12:03 UTC

Moved to upstream tracking: http://rpm.org/ticket/30
