Bug 1876946

Summary: rpmbuild complaining about utf-8 encoding
Product: [Fedora] Fedora
Reporter: Ian Collier <imc>
Component: rpm
Assignee: Packaging Maintenance Team <packaging-team-maint>
Status: CLOSED NOTABUG
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: low
Priority: unspecified
Version: 32
CC: igor.raits, mjw, packaging-team-maint, pmatilai, pmoravco, vmukhame
Last Closed: 2020-09-10 09:28:25 UTC
Type: Bug

Description Ian Collier 2020-09-08 14:57:59 UTC
I'm packaging some third-party stuff for easier deployment, and when I build the 
package I'm getting nonsense errors which terminate the build, such as:

error: Package xxx: invalid utf-8 encoding in Classdict: TrueType Font data, digitally signed, 19 tables, 1st "DSIG", 26 names, Macintosh, Digitized data copyright © 2010-2011, Google Corporation.Open SansItalic1.10;1ASC;OpenSans-Ital - Invalid or incomplete multibyte or wide character

There are two problems with this.

1. RPM is not telling me the file with the invalid encoding, so that makes it 
   hard to fix, if this is a real error.

2. There's actually nothing wrong with the file in question.  RPM gets this by
   calling 'file' which outputs in the current locale, but it is trying to
   validate it as UTF-8, which won't work if your locale is not UTF-8.
   So either: (a) if your locale is not UTF-8, RPM should not try to validate
   it as UTF-8; or (b) if RPM wants UTF-8, it should set the locale to UTF-8
   before calling 'file'.
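
The mismatch described in point 2 can be illustrated with a minimal Python sketch. It assumes the description string contains the copyright sign from the error above and that it reaches the check encoded in ISO-8859-1. The helper name `is_valid_utf8` is hypothetical, and this is not rpm's actual code.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Excerpt of the description text from the error message above.
description = "Digitized data copyright © 2010-2011, Google Corporation."

# The same text as bytes in two different encodings.
latin1_bytes = description.encode("iso-8859-1")  # '©' becomes the single byte 0xA9
utf8_bytes = description.encode("utf-8")         # '©' becomes the two bytes 0xC2 0xA9

print(is_valid_utf8(latin1_bytes))  # False: a lone 0xA9 is not valid UTF-8
print(is_valid_utf8(utf8_bytes))    # True
```

The font file is identical in both cases. Only the byte encoding of the description differs.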

Of course there's an obvious workaround, which is to set the locale to UTF-8
before compiling the package.  But this is a confusing error if you are not
aware of that.

Comment 1 Panu Matilainen 2020-09-10 09:28:25 UTC
Such a problem can certainly occur, but there are invalid assumptions in your post: rpm does not call file, it uses libmagic API, and the locale does not affect the outcome because libmagic strings are not translated at all.

Rpm cannot directly tell you the associated file because it's not checked in that context at all (instead, the encoding check is run on the entire header). Just run 'file' manually on the fonts in the buildroot to see what matches (+ possibly fix).
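
A rough sketch of that manual check, assuming Python 3 and the `file` command on PATH: it walks a buildroot, runs `file --brief` on each file, and reports descriptions that are not valid UTF-8. The function names and the injectable `describe` parameter are hypothetical, and `file` is called here only for convenience (rpm itself uses the libmagic API).

```python
import os
import subprocess
from typing import Callable, Iterator, Tuple


def describe_with_file(path: str) -> bytes:
    """Return the raw description bytes printed by `file --brief`."""
    result = subprocess.run(
        ["file", "--brief", path], capture_output=True, check=True
    )
    return result.stdout.strip()


def find_non_utf8_descriptions(
    buildroot: str,
    describe: Callable[[str], bytes] = describe_with_file,
) -> Iterator[Tuple[str, bytes]]:
    """Yield (path, description) for files whose description is not UTF-8."""
    for dirpath, _dirnames, filenames in os.walk(buildroot):
        for name in filenames:
            path = os.path.join(dirpath, name)
            raw = describe(path)
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                yield path, raw
```

Going by comment 2, the result depends on the locale, so such a check would need to run under the same locale as the failing build.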

For Fedora packages, utf-8 is mandatory but for your own purposes... if you don't care about the encoding, it's trivially worked around by adding the following to the spec:

%global _invalid_encoding_terminates_build 0

(after which you can also associate the broken description to the file in question by running 'rpm -q --fileclass <pkg>' if you want to try fixing instead)

So, not a bug, the check is doing exactly what it's meant to do.

Comment 2 Ian Collier 2020-09-10 09:38:48 UTC
"file" and "libmagic" are basically the same thing, and it's trivial to 
demonstrate that locale does make a difference because the exact same package
builds fine with LANG=en_GB.UTF-8 when it errors out with LANG=en_GB.iso8859-1
(the font files came from a third party and are not changed during the build).

So as I said, there is a trivial workaround; but the message is confusing
because the files in the package are not broken.
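
The workaround can be scripted along these lines. This is a minimal sketch that assumes `rpmbuild` is on PATH and that the `en_GB.UTF-8` locale is installed. The helper names are hypothetical.

```python
import os
import subprocess
from typing import Dict, Optional


def utf8_build_env(
    base_env: Optional[Dict[str, str]] = None,
    locale: str = "en_GB.UTF-8",
) -> Dict[str, str]:
    """Return a copy of the environment with LANG forced to a UTF-8 locale."""
    env = dict(os.environ if base_env is None else base_env)
    # LC_ALL and LC_CTYPE take precedence over LANG, so drop them.
    env.pop("LC_ALL", None)
    env.pop("LC_CTYPE", None)
    env["LANG"] = locale
    return env


def build_binary_rpm(spec_path: str) -> None:
    """Run `rpmbuild -bb` on the spec file under the UTF-8 locale."""
    subprocess.run(["rpmbuild", "-bb", spec_path], env=utf8_build_env(), check=True)
```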

Comment 3 Panu Matilainen 2020-09-10 10:11:37 UTC
Right, translation != encoding. I don't see what we could do to help that though, except 
a) better document the issue + workaround (short term)
b) get rid of the libmagic classification strings in the first place (longer term)

Rpm itself couldn't care less about the encoding but the world at large expects utf-8 these days, which is why the check is there to begin with.

P.S. file and libmagic are of course quite literally "the same thing", but technically "calling file" is quite different from using the libmagic API, as you surely know: it involves forks, shells, etc. A number of rpm scripts do "call file" instead, so the distinction matters.