rpmbuild currently allows legacy character sets in changelogs, summary, etc. It would be useful if it would refuse to build if the specfile contains such errors. Valid UTF-8 should be required.
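For illustration only, the check being requested could look roughly like the sketch below. It is not rpmbuild code; the function name `first_invalid_utf8_offset` and the sample specfile bytes are made up for the example. It relies only on Python's strict UTF-8 decoder.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")  # strict decoding: raises on any malformed sequence
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# Hypothetical specfile fragments: the same text saved as UTF-8 and as ISO-8859-1.
spec_ok = "Summary: Caf\u00e9 tools\n".encode("utf-8")
spec_bad = "Summary: Caf\u00e9 tools\n".encode("iso-8859-1")

for label, data in (("utf-8", spec_ok), ("iso-8859-1", spec_bad)):
    offset = first_invalid_utf8_offset(data)
    verdict = "ok" if offset is None else f"invalid byte at offset {offset}"
    print(f"{label}: {verdict}")
```

A build tool applying this policy would abort whenever the function returns an offset, and could report that offset to the packager.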
No, it shouldn't. Character set encoding should be a matter of policy, not programming. RPM should support any character set encoding which can correctly be processed by the shell and RPM's own parser.
I believe RPM lacks any method of _marking_ text which is stored in obsolete character sets. Therefore it _can't_ correctly be processed. You end up with a mix of UTF-8 and obsolete undecipherable data in the RPM database.
So now you think you can declare character sets obsolete? Who gave you that right? What RPM needs is a method for specifying character encodings for data, not some wholesale declaration from Almighty RedHat that their solution is the One True Way. You can keep your Kool-Aid, thanks.
This is the Fedora Core bugzilla, and a Fedora Core RFE.
It's also the RPM bugzilla and an RPM RFE.
In the context of Fedora, having specfile content in non-UTF8 character sets is a bug. I'm not interested in what RPM does outside Fedora, although the same argument probably applies -- unmarked obsolete character sets are just random undecipherable data. This isn't the place to discuss that though -- this is purely a Fedora RFE.
And I couldn't care less what RPM does *in* Fedora, except that cooperation with other RPM-based distributions helps everyone, and introducing incompatibilities between RPM in different distributions is self-defeating. This is an issue which has potentially-far-reaching implications, and blindly implementing it is extremely unwise. The issue requires discussion. Knee-jerk solutions are rarely the appropriate response to complex problems.
FWIW this has been brought up upstream: https://lists.dulug.duke.edu/pipermail/rpm-devel/2006-February/000779.html https://lists.dulug.duke.edu/pipermail/rpm-devel/2006-March/000931.html Michael - Jeff is already talking about doing this upstream, I'd raise any concerns about UTF-8 everywhere on list.
Yeah, I was talking to Jeff about this a while ago. The point is that if it's random unmarked character sets, you _can't_ sanely guess what it is. Not without being horribly US-centric, anyway. It's better just to require correct input, rather than hacking up crap and known-broken heuristics. We should fix rpmlint to check for it if it doesn't already, too.
<sarcasm> Well there's always g_locale_to_utf8(), which applies iconv(3) until joy ... </sarcasm> It *IS* a matter of policy, not implementation, to set encodings. However, that basically means that encodings need to be hinted from spec files, and verified accurate when building, and carried within packages, and converted to local locale when displayed, etc, etc, etc. Perfectly doable, but it's gonna be a tedious nightmare ...
Your sarcastic suggestion is likely to be taken seriously by the less sober among us... you know perfectly well that you can't just apply iconv(3) until you get something which works -- there are a huge number of 8-bit encodings which can be converted to UTF-8 but don't actually _mean_ anything. Untagged non-UTF-8 data is just line noise. Tagging data is pointless -- just store it in a portable format (i.e. UTF-8) in all RPM metadata, and convert it only for display if you have to. You might be able to implement some broken hack in other situations to appease some outspoken Luddites, but we certainly wouldn't want anything like that in Fedora -- just require UTF-8 and abort the build if it's not valid.
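The point about 8-bit encodings can be shown with a small sketch. The byte string and the list of candidate encodings below are arbitrary choices for the example, not anything taken from a real package. Every candidate decodes the same bytes without error, and each one yields different text, so a successful conversion says nothing about which encoding was intended.

```python
def decode_all(raw: bytes, encodings):
    """Decode the same bytes under each candidate encoding (strict mode)."""
    return {name: raw.decode(name) for name in encodings}


# Three arbitrary high bytes, as might appear in an untagged Summary line.
raw = b"\xb1\xe6\xea"
candidates = ["iso-8859-1", "iso-8859-2", "iso-8859-5", "koi8-r"]

results = decode_all(raw, candidates)
for name, text in results.items():
    print(f"{name:12} {text!r}")
```

None of the decodes fails, which is why "try encodings until one works" cannot identify the original character set.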
(In reply to comment #11)
> Untagged non-UTF-8 data is just line noise.
Untagged data in ANY unknown encoding is line noise.
> Tagging data is pointless
That may be your opinion, but it is beyond your reach to make that assumption for everyone else on the planet.
> You might be able to implement some broken hack in other situations to appease
> some outspoken Luddites,
So now we're resorting to name-calling? How unfortunate. Are they handing out @redhat.com e-mail addresses to just anyone these days? If all you have to offer is dictatorial decrees and orthogonal vitriolic spew, kindly step aside so that those of us who wish to discuss the technical merits of this RFE may do so without all the interfering spam.
You are correct. It is beyond my reach to make that assumption for everyone else on the planet. That's why I restricted myself to doing so in a Fedora Core RFE, in Fedora bugzilla. In the context of _Fedora_ it's perfectly reasonable to label those who refuse to use UTF-8 as Luddites. You just have to look at the quality of the alternative 'solution' which was proposed -- hacking all the RPM formats from specfile through to the database to tag data in random formats instead of just storing it in a consistent encoding in the first place. Since you persist in trolling the Fedora bugzilla and talking about non-Fedora issues, I suppose I might as well capitulate and discuss it... There's no excuse for avoiding UTF-8 in RPM internals, even outside the context of Fedora. That would really be pointless -- there's certainly no need to 'extend' its file formats when we can just store data in UTF-8, which can represent the older encodings. We can quite happily fix rpmq to convert from UTF-8 to the current locale in its output, and fix rpmbuild to convert _to_ UTF-8 from the current locale when reading the specfile. Although we certainly wouldn't want the latter in Fedora -- if I check out the current libxml2/devel branch from CVS and attempt to build it, for example, it should _fail_. It certainly shouldn't use _my_ locale (and it'd fail anyway because of course my locale is UTF-8). You'd need a way to handle existing RPM databases, which may contain random data in unknown encodings. Probably an 'rpm --rebuilddb --oldcharset=FOO' on RPM upgrade? This isn't a new problem _anyway_ since an existing RPM database without either a consistent charset or charset tagging is just line noise. And of course you might have to call it 'RPM-CHARSET' instead of 'UTF-8' to appease those who have religious objections to UTF-8.
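As a rough sketch of the "store UTF-8, convert only for display" idea from the comment above: the helper name `to_display` is hypothetical and this is not how rpmq is implemented. It assumes the stored metadata is valid UTF-8 and substitutes "?" for characters the target encoding cannot represent.

```python
import locale


def to_display(stored: bytes, target=None) -> bytes:
    """Convert UTF-8 metadata to the display encoding, replacing what won't fit."""
    if target is None:
        # Fall back to the encoding of the user's current locale.
        target = locale.getpreferredencoding(False)
    text = stored.decode("utf-8")  # stored data is assumed to be valid UTF-8
    return text.encode(target, errors="replace")


summary = "Narz\u0119dzia do kawy: Caf\u00e9".encode("utf-8")
print(to_display(summary, "iso-8859-2"))  # both accented letters fit in Latin-2
print(to_display(summary, "ascii"))       # neither fits; each becomes "?"
```

The lossy step happens only at output time, so the database itself keeps one consistent encoding.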
FWIW, Fedora could easily apply iconv(1) to all spec files no matter what rpm does. All the horses have to want to do is drink ...
Apply iconv to convert from what character set? If I check libxml2 out of CVS and attempt to build it, how is my system supposed to guess which random obsolete character set it's using? It just doesn't work -- it's better if the broken specfile doesn't compile on the system of whoever committed it.
(In reply to comment #13)
> You are correct. It is beyond my reach to make that assumption for everyone else
> on the planet. That's why I restricted myself to doing so in a Fedora Core RFE,
> in Fedora bugzilla.
At the risk of beating a dead horse, until either jbj or RedHat decide to part Bugzillas, this is also RPM's bugzilla.
> In the context of _Fedora_ it's perfectly reasonable to label those who refuse
> to use UTF-8 as Luddites.
http://www.nizkor.org/features/fallacies/ad-hominem.html
> You just have to look at the quality of the alternative 'solution' which was
> proposed -- hacking all the RPM formats from specfile through to the database
> to tag data in random formats instead of just storing it in a consistent
> encoding in the first place.
You are using the word "random" in a manner with which I am unfamiliar. Specified and defined character encodings are not "random."
> Since you persist in trolling the Fedora bugzilla and talking about non-Fedora
> issues, I suppose I might as well capitulate and discuss it...
Get this through your head: This is not Fedora bugzilla. This is RedHat bugzilla, which is currently shared between RHEL, RPM, Fedora, RHAS, and RHN, among others. There is no "RPM" product, so the "rpm" component is used. You used it. So here we are.
Furthermore, this is a Bazaar, not a Cathedral. RPM is used by AIX, Solaris, Darwin, and numerous flavors of Linux, not just Fedora. If you have a problem with that, convince the Fedora Deities to use a different package format. Until then, suck it up and deal. Those who develop RPM and related tools concern themselves with numerous operating systems, the majority of which do NOT use UTF-8 by default.
> There's no excuse for avoiding UTF-8 in RPM internals, even outside the context
> of Fedora. That would really be pointless -- there's certainly no need to
> 'extend' its file formats when we can just store data in UTF-8, which can
> represent the older encodings.
Those who fail to learn from the mistakes of history are doomed to repeat them. You have apparently failed to learn from the mistake of assuming that the de facto standard encoding cannot change over time and does not differ between platforms. Right now, UTF-8 is a compelling replacement for Latin encodings (which are NOT obsolete, so stop erroneously using that term). In the future, UTF-8 may be found to be insufficient to the cause. The correct long-term solution is to allow spec files to specify an arbitrary encoding and to use an internal encoding which can store all data any other encoding could contain.
> fix rpmbuild to convert _to_ UTF-8 from the current locale when
> reading the specfile.
There is no relationship between current locale and the encoding of a particular spec file.
> if I check out the current libxml2/devel branch from CVS and attempt to build
> it, for example, it should _fail_. It certainly shouldn't use _my_ locale (and
> it'd fail anyway because of course my locale is UTF-8).
There is nothing whatsoever inherently wrong with any particular encoding. I should be able to create a spec file in UCS-4 or UTF-32 if I so choose. The problem is telling RPM what encoding was used, and the proper solution does not involve ASSuming UTF-8 and failing on an invalid character.
> You'd need a way to handle existing RPM databases, which may contain random data
> in unknown encodings.
"random" You keep using that word. I do not think it means what you think it means.
> And of course you might have to call it 'RPM-CHARSET' instead of 'UTF-8' to
> appease those who have religious objections to UTF-8.
The encoding should be called exactly what it is, be it UTF-8, UCS-4, or any other. My objections are not to UTF-8 itself, and they're technical, not religious.
Now, let's talk technical details to try and save the usefulness of this whole conversation. Jeff and Paul (and other RPM developers), please comment on the following two ideas:
1. Spec files are encoded as US-ASCII/UTF-8 by default. Any containing characters which cannot be encoded thusly must specify their encoding via either a header value ("Encoding: ISO-8859-2") or a macro value ("%define __spec_encoding ISO-8859-2"), whichever you think is better.
2. Values which contain non-ASCII characters should specify encoding similar to the way languages are currently specified. For example, PLD uses Summary(pl): and description -l pl to denote Polish content. This could be expanded to allow Summary(pl.utf8) and description -l pl.utf8.
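Idea 1 above could be prototyped along the following lines. This is a sketch of the proposal only: no "Encoding:" header exists in RPM, the function name `decode_spec` is made up, and the tag is located by scanning raw bytes, so it only works for ASCII-compatible encodings (a UCS-4 or UTF-32 specfile would need a different detection step).

```python
import re

# Hypothetical header from the proposal, e.g. "Encoding: ISO-8859-2".
ENCODING_TAG = re.compile(
    rb"^Encoding:[ \t]*([A-Za-z0-9._-]+)[ \t]*$", re.MULTILINE
)


def decode_spec(data: bytes) -> str:
    """Decode a specfile: honour an Encoding: tag, otherwise require UTF-8."""
    match = ENCODING_TAG.search(data)
    encoding = match.group(1).decode("ascii") if match else "utf-8"
    # Strict decoding: untagged non-UTF-8 input raises UnicodeDecodeError,
    # and an unknown encoding name raises LookupError.
    return data.decode(encoding)


tagged = "Encoding: ISO-8859-2\nSummary(pl): Narz\u0119dzia\n".encode("iso-8859-2")
print(decode_spec(tagged))

untagged = "Summary(pl): Narz\u0119dzia\n".encode("iso-8859-2")
try:
    decode_spec(untagged)
except UnicodeDecodeError as exc:
    print(f"rejected: invalid UTF-8 at byte {exc.start}")
```

Tagged legacy input is accepted and converted, while untagged input that is not valid UTF-8 is refused rather than guessed at.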
(In reply to comment #16)
> At the risk of beating a dead horse, until either jbj or RedHat decide to part
> Bugzillas, this is also RPM's bugzilla.
jbj isn't even Cc'd on this Fedora bug, although as an outsider who happens to have an account he was of course able to make a comment.
> http://www.nizkor.org/features/fallacies/ad-hominem.html
Read it again. You evidently misunderstood it the first time. Note the difference between the following:
A. "You smoke crack. Therefore your opinion is irrelevant".
B. "You have very strange opinions. Therefore I suspect you smoke crack."
The former is the classic ad-hominem fallacy. The latter isn't.
> You are using the word "random" in a manner with which I am unfamiliar.
> Specified and defined character encodings are not "random."
My point is that they _aren't_ specified and defined. In the absence of such tagging, it's line noise. It's essentially random.
> Get this through your head: This is not Fedora bugzilla.
Heh. Mind if I quote you on that? This bug is filed against a specific version of a specific product.
> There is no relationship between current locale and the encoding of a
> particular spec file.
Yes, that's the point I made in the immediately subsequent sentence.
> 1. Spec files are encoded as US-ASCII/UTF-8 by default. Any containing
> characters which cannot be encoded thusly must specify their encoding via either
> a header value ("Encoding: ISO-8859-2") or a macro value ("%define
> __spec_encoding ISO-8859-2"), whichever you think is better.
>
> 2. Values which contain non-ASCII characters should specify encoding similar to
> the way languages are currently specified. For example, PLD uses Summary(pl):
> and description -l pl to denote Polish content. This could be expanded to allow
> Summary(pl.utf8) and description -l pl.utf8.
That's only a partial solution, and it's the uninteresting part of the solution. The more interesting part is what you do with an existing RPM database if it contains random data. And I do mean 'random' -- if it's in untagged character sets it might as well be line noise.
The difference between UTF-8 and the obsolete ISO 8859 encodings is that UTF-8 can represent all languages of the world, so there is no need for supporting anything else.
That viewpoint is a little excessive. It's definitely sane for 'rpmq' to be able to convert to the user's locale when displaying text. It might also make sense to take specfiles in obsolete charsets, if they are clearly marked as such. If that's done, the original Fedora RFE stands, in a slightly modified form -- if it _isn't_ tagged, and if it isn't valid UTF-8, we should reject it. Repeat after me: Untagged data are no better than line noise.
(In reply to comment #17)
> jbj isn't even Cc'd on this Fedora bug, although as an outsider who happens to
> have an account he was of course able to make a comment.
Apparently you aren't familiar with Bugzilla's watch feature or weren't aware that jbj was watching. Regardless, your statement does nothing to contradict or invalidate my point: RPM has used this Bugzilla since its inception and will continue to do so unless and until either jbj or RedHat decide to alter the arrangement.
> Read it again. You evidently misunderstood it the first time.
Funny, I'd say the same about you.
> Note the difference between the following:
>
> A. "You smoke crack. Therefore your opinion is irrelevant".
> B. "You have very strange opinions. Therefore I suspect you smoke crack."
Neither of which is representative of what actually occurred:
C. "You disagree with me. Therefore I will imply that you are a Luddite."
> My point is that they _aren't_ specified and defined. In the absence of such
> tagging, it's line noise. It's essentially random.
It's not even close to random by any sane definition of the word. The contents of a spec file in an unspecified encoding would have at most 1 or 2 bits of entropy per byte of content. A sufficiently determined individual could almost certainly use a known-plaintext attack on certain spec file parts to produce the encoding. If this is true, applying the label "random" is clearly an overstatement. What you mean to say is that the encoding of untagged data cannot be computationally deduced with sufficient certainty to be used with system-critical information such as that stored in an RPM database. And with that, I agree. :-)
My primary goal in all this is two-fold: One, point out that the assumption that UTF-8 is the end-all and be-all encoding for the entire lifespan of the RPM product is potentially as erroneous as the assumption that the C/POSIX locale makes for a sufficient default. Two, open up discussion on mechanisms for tagging to allow for the future.
> Heh. Mind if I quote you on that?
Go ahead, so long as you include the entire context:
1. Fedora is one of many products sharing this Bugzilla.
2. RPM is also sharing this Bugzilla.
3. RPM does not have its own "product" selection.
4. As a result of 2 and 3, any bug filed against the "rpm" component under *any* product may be an upstream issue.
> That's only a partial solution, and it's the uninteresting part of the solution.
"Uninteresting" is often a synonym for "important." In light of goal #1 stated above, I consider it an important point.
> The more interesting part is what you do with an existing RPM database if it
> contains random data.
If you have random data in your RPM database, you have bigger issues than whether or not the random data is UTF-8 encoded randomness.
> And I do mean 'random' -- if it's in untagged character sets it might as well
> be line noise.
Nonsense. The vast majority of textual data in an RPM database is plain old ordinary ASCII, which means it's valid ISO-8859-?? as well as UTF-8. Furthermore, I am having a hard time coming up with examples of RPMDB text data which (1) would contain high-ASCII/multibyte data sequences AND (2) the interpretation of which would have significant material impact on a system. Most situations where encoding counts are things like descriptions and summaries...things that are merely cosmetic.
It seems to me that the following would suffice when upgrading a non-tagged RPMDB:
1. All data which can be interpreted as ASCII is ASCII.
2. If any bytes 0x80 and above are encountered, attempt to process as UTF-8.
3. If invalid UTF-8 is encountered, look for language tags (like fr or de) to deduce encoding (Latin-N, SJIS, BIG5).
4. If no deduction can be made with reasonable certainty, or if a deduction could cause system problems, replace invalid UTF-8 character sequences with some other character and move on.
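The four upgrade steps above can be sketched as follows. The language-to-encoding table and the function name `upgrade_text` are assumptions made for the example; the comment only names Latin-N, SJIS and BIG5 as possibilities. The "could cause system problems" condition in step 4 is not modelled.

```python
# Hypothetical language-tag guesses; not a table RPM ships.
LANG_GUESS = {
    "fr": "iso-8859-1",
    "de": "iso-8859-1",
    "pl": "iso-8859-2",
    "ja": "shift_jis",
}


def upgrade_text(raw: bytes, lang=None) -> str:
    """Convert one untagged database string following the four proposed steps."""
    # Steps 1 and 2: ASCII is a subset of UTF-8, so one strict decode covers both.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # Step 3: fall back to an encoding deduced from the language tag, if any.
    guess = LANG_GUESS.get(lang)
    if guess is not None:
        try:
            return raw.decode(guess)
        except UnicodeDecodeError:
            pass
    # Step 4: give up and substitute U+FFFD for each invalid sequence.
    return raw.decode("utf-8", errors="replace")


print(upgrade_text(b"plain ascii summary"))
print(upgrade_text(b"Narz\xeadzia", lang="pl"))  # deduced as ISO-8859-2
print(upgrade_text(b"caf\xe9"))                  # no tag: byte is replaced
```

Step 3 is the weak point that the rest of the thread argues about: the guess decodes without error whether or not it is the encoding the packager actually used.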
(In reply to comment #18)
> The difference between UTF-8 and the obsolete ISO 8859 encodings is that UTF-8
> can represent all languages of the world, so there is no need for supporting
> anything else.
First off, ISO-8859 encodings are not obsolete. The vast majority of UNIX-like systems in the world still use Latin-N encodings, and that's not going to change any time soon. Fedora developers declaring something obsolete does not make it obsolete; rather, it makes said developers pretentious.
Second, as I've said before, UTF-8 being the "answer to all our encoding problems" now does not mean it will continue to be so in the future. UTF-8 owes its popularity to two compelling but potentially limiting facts: ASCII encodings don't change, and C-style NUL termination doesn't have to change. For legacy code, those are a huge win. But as more and more code becomes multilingual and encoding-agnostic, those factors reduce significantly in importance, lending additional potential to more consistent encodings such as UCS2 and UCS4.
If you really want something to become obsolete, continue thinking with blinders on. Before you know it, your thinking will be obsolete.
(In reply to comment #19)
> That viewpoint is a little excessive. It's definitely sane for 'rpmq' to be able
> to convert to the user's locale when displaying text.
Definitely.
> It might also make sense to take specfiles in obsolete charsets, if they are
> clearly marked as such. If that's done, the original Fedora RFE stands, in a
> slightly modified form -- if it _isn't_ tagged, and if it isn't valid UTF-8, we
> should reject it.
I would have no problem with that, with two provisos:
1. Encoding *should* always be tagged. Relying on a particular default should be discouraged.
2. The RPMDB encoding should be opaque as far as packagers are concerned. While UTF-8 may make sense for now, an alternate format may be preferable in the future.
> Repeat after me: Untagged data are no better than line noise.
s/no/only marginally/
To answer the 2 questions from comment #16: 1) The "default" encoding for non-i18n tags (summary/description/group) is essentially 8bit octets since there is no enforcement of any encoding. 2) For i18n tags (summary/description/group), one can specify any locale one wishes, e.g. Summary(pl.utf8): Summary(pl.wtfencoding) are perfectly permissible, and the associative array used for retrieval of the tags will attempt to key from "pl.utf8" and "pl.wtfencoding" for appropriate strings to be returned. That being said, the real world is far more complicated, and there is no current attempt to translate strings into a different encoding, the current packaging pragma is to assume a 2 letter country code implies an encoding, and the whole mess was obsoleted in RHL 6.2, at least for RH packaging, in favor of specspo (the implementation has not been tested since RHL 6.2, at least by me) which at least *has* an explicit and well specified encoding. Nor are summary/description/group the only data that should have i18n representations. Unfortunately, rpm has no "array of i18n strings" data type, only "string array" and "i18n string" (i.e. an associative array). Adding a new data type to rpm is a very very painful experience; using a specspo-like mechanism to associate strings *WITH* a package rather than *IN* a package is far easier to achieve than actually adding a new data type. I seriously question whether i18n encodings for strings have anything whatsoever to do with package management. Use a key, like Pkgid/Hdrid/NEVRA or whatever to attach glop to the package for display purposes instead.
IIUC, you are saying that you wish for RPM to essentially remain ignorant of encodings and leave that up to the individual packages. Is that correct? Do RPM scriptlets always execute under the current user's locale?
Yes, mapping i18n encodings out of rpm to the greatest extent possible. RPM scriptlets execute without attempting to control for locale. Whatever is in the environment is in the environment.
WONTFIX, then?
At the very least, RPM should refuse to eat untagged non-ASCII data.
Untagged non-ASCII data is syntactically permitted in spec files that are currently widely deployed. So rpmbuild should just FULL STOP because you want utf-8 everywhere? Shall I add your email (and preferred encoding) to the error message before rpmbuild comes FULL STOP?
User pnasrat's account has been closed
Reassigning to owner after bugzilla made a mess, sorry about the noise...
Adding FutureFeature keyword to RFEs.
Moved to upstream tracking: http://rpm.org/ticket/30