Bug 1254738 - NotImplementedError: libuser does not support non-UTF-8 locales with Python 3 (currently using ANSI_X3.4-1968)
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: libuser
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Miloslav Trmač
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2015-08-18 18:23 UTC by David Shea
Modified: 2015-08-21 21:01 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-08-21 21:01:46 UTC
Type: Bug
Embargoed:


Description David Shea 2015-08-18 18:23:45 UTC
Not even LANG=C? Harsh.

This makes things difficult for anaconda and initial-setup since we need to parse kickstart data in order to configure the locale, and the kickstart module imports libuser.

Comment 1 Miloslav Trmač 2015-08-18 21:13:38 UTC
Yeah, https://fedorahosted.org/libuser/browser/python/libusermodule.c#L216 and https://fedorahosted.org/libuser/ticket/13#comment:17. Even with the C locale libuser is getting UTF-8 from Python code, and I think it is a generally reasonable heuristic to consider system-wide environment variables authoritative for the OS (like /etc/passwd) rather than a run-time environment override (which is what Anaconda changing locale after parsing a config file looks like).

There are nowadays ways to handle the locale encoding using a Python API, but it is rather awkward, especially when trying to keep the bindings also working for Python 2 (e.g. https://lists.linuxcontainers.org/pipermail/lxc-devel/2013-November/006363.html ).


I suppose a good solution would be for Anaconda to boot in C.UTF-8 (#902094) but that is not yet available.


At the time these bindings were written, I specifically discussed the UTF-8 requirement with Vratislav, and verified with F22 Alpha RC3 that Anaconda and its tmux were starting in en_US.UTF-8; so the switch to LANG=C must have been recent. Has anaconda in fact switched, and is it really necessary?

Comment 2 David Shea 2015-08-18 21:19:45 UTC
(In reply to Miloslav Trmač from comment #1)
> At the time these bindings were written, I specifically discussed the
> UTF-8 requirement with Vratislav, and verified with F22 Alpha RC3 that
> Anaconda and its tmux were starting in en_US.UTF-8; so the switch to LANG=C
> must have been recent.

Anaconda starting in en_US.UTF-8 is only true if tmux is in use. Use cases where it is not include s390x, which will use LANG=C (I think, it might be unset), and imageinstall, dirinstall and live (including Workstation), where the initial environment depends on the environment set by the user before running anaconda. Live uses a wrapper script that can exert some control over the initial environment. image and dir installs do not.

There are also downstream projects using anaconda code, such as initial-setup.

> Has anaconda in fact switched, and is it really necessary?

It's possible to work around this if we're more careful about the import, but the assumptions that this would not affect anaconda were not correct.

Comment 3 Miloslav Trmač 2015-08-18 21:46:14 UTC
Yes, this is an assumption which is not true in the full general case of POSIX systems.

I really can’t spend time _truly_ supporting non-UTF-8 locales in libuser now; and at least as far as I understand our customers’ and users’ setups and requirements (which might be wrong), I don’t think it is worth _anyone_ at Red Hat spending any non-trivial time.

And the alternative implementation options linked above are so ugly that I am not even sure I would accept a patch; I don’t want the bindings to end up 60% charset conversion and the associated memory management, and 40% ferrying data between Python and C.


So from libuser’s view the only remaining question is whether to only support UTF-8 implicitly, possibly corrupting data if users somehow run this version on a system which uses a non-UTF-8 locale, or explicitly, refusing to silently corrupt data. This seems like a clear enough decision to me.


I think that anaconda always starting in a consistent environment can be only good for the test matrix, but that is admittedly a somewhat self-serving opinion.

initial-setup is, AFAICS, a pretty ordinary post-install application and should be started by systemd with the correct LANG= value, like every other daemon, and it is pretty much the typical case in which I think libuser is _correct_ in insisting on an UTF-8 locale at program startup (as opposed to either ignoring the locale, or using a later run-time value). I can’t from a quick look see any obvious explicit decision for it to use C. Is there perhaps something in the X startup sequence which is dropping $LANG? If so, that should be fixed.

Comment 4 David Shea 2015-08-18 21:54:58 UTC
(In reply to Miloslav Trmač from comment #3)
> Is there
> perhaps something in the X startup sequence which is dropping $LANG? If so,
> that should be fixed.

Everything has a text mode. And even if we say that both anaconda and initial-setup can safely assume an encoding of utf-8, and must set that encoding before touching libuser, you broke livemedia-creator. It's run from a terminal, by a user, who for whatever reason might have set LANG=C. It's a pretty common thing to do!

Comment 5 Miloslav Trmač 2015-08-18 21:56:46 UTC
And BTW it is a pretty safe bet that almost all non-GTK+ Python extensions break in non-UTF-8 locales: PyArg_ParseTuple*’s “s” format converts Python Unicode objects to UTF-8 C strings regardless of locale. (The exception for GTK+ is because all GLib’s strings should technically be UTF-8, and this is noticeable in GTK+; I bet most non-GUI applications don’t care and do I/O with printf() and the like, mixing GLib’s UTF-8 and system’s LC_CTYPE).

So perhaps consider libuser’s refusal a public service for the benefit of the other C extensions ☺
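
The mismatch described in this comment can be illustrated from pure Python. This is only a sketch, not libuser code: it assumes the C side receives UTF-8 bytes (as the "s" format produces) and then treats them as if they were in the locale's charset. The sample name and the ISO-8859-2 charset are arbitrary choices made for the illustration.

```python
# Illustrative sketch only; nothing here is taken from libuser.
name = "Trmač"  # arbitrary non-ASCII sample

# The bytes a C extension receives from the "s" format: always UTF-8.
utf8_bytes = name.encode("utf-8")

# A consumer that assumes the locale charset is ISO-8859-2 misreads them
# without any error being raised (silent corruption).
misread = utf8_bytes.decode("iso-8859-2")

# A strict ASCII (C locale) consumer cannot represent the name at all.
try:
    name.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(repr(utf8_bytes), repr(misread), ascii_ok)
```

The point of the sketch is that the first failure mode is silent while the second is loud, which is the choice libuser is making explicit.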

Comment 6 Miloslav Trmač 2015-08-18 22:23:22 UTC
(In reply to David Shea from comment #4)
> (In reply to Miloslav Trmač from comment #3)
> > Is there
> > perhaps something in the X startup sequence which is dropping $LANG? If so,
> > that should be fixed.
> 
> Everything has a text mode.

initial-setup in text mode should also inherit systemd’s LANG=*.UTF-8 value; doesn’t it?

> you broke livemedia-creator. It's run from
> a terminal, by a user, who for whatever reason might have set LANG=C. It's a
> pretty common thing to do!

Is it? I wouldn’t know, that setting can’t even display my name (/me shudders at reminiscences of the old [A-Z] vs. LC_COLLATE arguments and people setting LANG=C instead of fixing their programs).

(If only we had telemetry and could measure such things ☺ )


(I don’t care for the blame game; this decision was discussed in some detail and explicitly approved by an Anaconda developer.)


Anyway, so what now?

* Silently corrupting data is not acceptable to me.

* True support of non-UTF-8 is ugly, basically useless and just not worth it, especially considering that almost no other Python extension is doing it.

* I insist that initial-setup running in the C locale is very likely a bug.

* Is the only use we are talking about pyanaconda/users.py? ISTM livemedia-creator has no need for libuser but I haven’t looked for it now. If so, libuser would be breaking livemedia-creator even if livemedia-creator is never calling anything from libuser.

In that case it might be reasonable to delay the “import libuser” in pyanaconda/users.py from top level to the first use, or for libuser to move the UTF-8 check from module import to first function call / object creation.

That would be still somewhat ugly but better than the PyUnicode_FSConverter mess.
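
A rough sketch of the "delay the import to first use" option, written in Python on the caller's side. The helper names (`codeset_is_utf8`, `get_module`) and the `codeset` parameter are hypothetical and exist only for this illustration; whether the real check in libuser looks at exactly `nl_langinfo(CODESET)` is an assumption here.

```python
import codecs
import importlib
import locale


def codeset_is_utf8(codeset):
    """Return True if the given codeset name is an alias of UTF-8."""
    try:
        return codecs.lookup(codeset).name == "utf-8"
    except LookupError:
        # Unknown codeset names are treated as "not UTF-8".
        return False


_module_cache = {}


def get_module(name, codeset=None):
    """Import `name` on first use rather than at module load time.

    `codeset` is a hypothetical override for testing; by default the
    current locale's codeset is queried (Unix only).
    """
    if name not in _module_cache:
        if codeset is None:
            codeset = locale.nl_langinfo(locale.CODESET)
        if not codeset_is_utf8(codeset):
            raise RuntimeError(
                "refusing to import %s in non-UTF-8 locale (%s)" % (name, codeset)
            )
        _module_cache[name] = importlib.import_module(name)
    return _module_cache[name]
```

A caller would replace a top-level `import libuser` with `get_module("libuser")` at the point where users are actually created, so code paths that only parse a kickstart never trigger the check.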

Comment 7 David Shea 2015-08-18 23:25:46 UTC
(In reply to Miloslav Trmač from comment #6)
> (In reply to David Shea from comment #4)
> > (In reply to Miloslav Trmač from comment #3)
> > > Is there
> > > perhaps something in the X startup sequence which is dropping $LANG? If so,
> > > that should be fixed.
> > 
> > Everything has a text mode.
> 
> initial-setup in text mode should also inherit systemd’s LANG=*.UTF-8 value;
> doesn’t it?

Again, systemd does not always have a *.UTF-8 LANG value. s390x more often has LANG=C.

 
> Anyway, so what now?
> 
> * Silently corrupting data is not acceptable to me.
> 
> * True support of non-UTF-8 is ugly, basically useless and just not worth
> it, especially considering that almost no other Python extension is doing it.
> 
> * I insist that initial-setup running in the C locale is very likely a bug.
> 
> * Is the only use we are talking about pyanaconda/users.py? ISTM
> livemedia-creator has no need for libuser but I haven’t looked for it now.
> If so, libuser would be breaking livemedia-creator even if livemedia-creator
> is never calling anything from libuser.
> 
> In that case it might be reasonable to delay the “import libuser” in
> pyanaconda/users.py from top level to the first use, or for libuser to move
> the UTF-8 check from module import to first function call / object creation.
> 
> That would be still somewhat ugly but better than the PyUnicode_FSConverter
> mess.

livemedia-creator uses users.py through anaconda, which, of course, inherits LANG= from the caller. So anaconda needs to ensure that the locale at the time of the libuser import uses a UTF-8 encoding, and that will be added to the pile of workarounds that go along with libuser. That's what I'll end up doing regardless, since I need a fix in the short-term, but that you're unwilling to even entertain the idea of a change to libuser, even a change submitted by someone else, is beyond frustrating and makes me question why we are jumping through these hoops to use libuser in the first place.

Comment 8 Miloslav Trmač 2015-08-21 20:27:27 UTC
(In reply to David Shea from comment #7)
> (In reply to Miloslav Trmač from comment #6)
> > (In reply to David Shea from comment #4)
> > > (In reply to Miloslav Trmač from comment #3)
> > > > Is there
> > > > perhaps something in the X startup sequence which is dropping $LANG? If so,
> > > > that should be fixed.
> > > 
> > > Everything has a text mode.
> > 
> > initial-setup in text mode should also inherit systemd’s LANG=*.UTF-8 value;
> > doesn’t it?
> 
> Again, systemd does not always have a *.UTF-8 LANG value. s390x more often
> has LANG=C.

Why does the architecture matter?

(langtable/languages.xml.gz does not have any non-UTF-8 entry so I assumed these are the only options; also https://bugzilla.redhat.com/show_bug.cgi?id=91235 . But, true, kickstart’s “lang” apparently could be used to configure a system to use LANG=C.)

And what should libuser write to /etc/passwd when initial-setup sends it a non-ASCII user name? (Due to Ctrl-Shift-U Ctrl-Shift-hexchar*, and GTK+ always using UTF-8 internally, this can happen with any locale configuration.)

> > * Is the only use we are talking about pyanaconda/users.py? ISTM
> > livemedia-creator has no need for libuser but I haven’t looked for it now.
> > If so, libuser would be breaking livemedia-creator even if livemedia-creator
> > is never calling anything from libuser.
<reordered>
> livemedia-creator uses users.py through anaconda, which, of course, inherits
> LANG= from the caller.

Sorry, that was my mistake; I confused livemedia-creator with a write-image-to-disk tool, I did not realize it is interpreting a kickstart file. So deferring the LANG interpretation after the kickstart is parsed seems to be necessary in livemedia-creator, regardless of system locale.


> > In that case it might be reasonable to delay the “import libuser” in
> > pyanaconda/users.py from top level to the first use, or for libuser to move
> > the UTF-8 check from module import to first function call / object creation.
> > 
> > That would be still somewhat ugly but better than the PyUnicode_FSConverter
> > mess.
> 
> So anaconda needs to ensure that the locale at the
> time of the libuser import uses a UTF-8 encoding, and that will be added to
> the pile of workarounds that go along with libuser. That's what I'll end up
> doing regardless, since I need a fix in the short-term,

So I have spent several hours researching this, making sure I have the facts straight and all options have been explored, only to be told you can’t wait for a libuser fix anyway… I can’t help thinking my time could have been used for something more productive.

> but that you're
> unwilling to even entertain the idea of a change to libuser, even a change
> submitted by someone else, is beyond frustrating

I _have_ been considering this and checking how to accommodate the request, but the plain truth is that Python 3 just doesn’t support non-UTF-8 locales sufficiently for C extension modules. I have also precisely explained _why_ I objected to that particular FS converter usage, leaving the option open for a patch without this downside.


… AND, I have later realized that using FS encoding _would not even work_ because the FS encoding is set by Python at the time of interpreter startup, so we would end up exactly in the same situation as now, except that a late import of libuser would not work around it.

So the libuser module would have to use PyUnicode_{Encode,Decode}Locale and write its own ParseTuple converter… Even more code which has nothing to do with binding libuser; if this is worth doing at all (and IMHO it isn’t), it is worth doing in python3-libs.
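
The point about the FS encoding being fixed at interpreter startup can be checked from Python. This is a minimal sketch, assuming a Unix-like system; the helper name `fs_encoding_around_setlocale` is made up for the illustration, and the actual encoding names printed depend on the environment.

```python
import locale
import sys


def fs_encoding_around_setlocale(new_locale="C"):
    """Return the filesystem encoding before and after a run-time setlocale()."""
    before = sys.getfilesystemencoding()
    saved = locale.setlocale(locale.LC_ALL)  # query the current setting
    try:
        locale.setlocale(locale.LC_ALL, new_locale)
        after = sys.getfilesystemencoding()
    finally:
        # Restore the original locale so the sketch has no side effects.
        locale.setlocale(locale.LC_ALL, saved)
    return before, after


before, after = fs_encoding_around_setlocale()
print(before, after)
```

Both values are the same because the interpreter decides the filesystem encoding once, at startup; a later locale change by the application does not affect it.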


Besides being unhappy with my opinions on various options, what do _you_ think the right solution is and what do you want me to do about this bug now?
- Close this as WONTFIX-worked-around-in-anaconda?
- Defer the check from import time to the time of the first libuser API call?
- Add a special exception for C, treating C as C.UTF-8?
- Roughly double the size of the bindings to support non-UTF-8 locales?
- Accept your patch which shows this is trivial to support and I have been an idiot missing an obvious solution?
- Add the required C API to Python 3 upstream, then adjust libuser?
- Silently break users’ data in non-UTF-8 locales?
- Something else?

Comment 9 David Shea 2015-08-21 21:01:46 UTC
(In reply to Miloslav Trmač from comment #8)
> (In reply to David Shea from comment #7)
> > (In reply to Miloslav Trmač from comment #6)
> > > (In reply to David Shea from comment #4)
> > > > (In reply to Miloslav Trmač from comment #3)
> > > > > Is there
> > > > > perhaps something in the X startup sequence which is dropping $LANG? If so,
> > > > > that should be fixed.
> > > > 
> > > > Everything has a text mode.
> > > 
> > > initial-setup in text mode should also inherit systemd’s LANG=*.UTF-8 value;
> > > doesn’t it?
> > 
> > Again, systemd does not always have a *.UTF-8 LANG value. s390x more often
> > has LANG=C.
> 
> Why does the architecture matter?

Because it's not using tmux, because it doesn't have a console. I'm sorry if anaconda's initial environment was misrepresented to you, but there are a lot of use cases and the less common ones are easy to forget.

> 
> (langtable/languages.xml.gz does not have any non-UTF-8 entry so I assumed
> these are the only options; also
> https://bugzilla.redhat.com/show_bug.cgi?id=91235 . But, true, kickstart’s
> “lang” apparently could be used to configure a system to use LANG=C.)

1) This is about a library's behavior in a given environment, not what kickstart will allow you to set in the installed system. The crash I was hitting was actually while trying to parse the kickstart.

2) That bug predates Fedora. Some of the information in it might not exactly be up to date.

> 
> And what should libuser write to /etc/passwd when initial-setup sends it a
> non-ASCII user name? (Due to Ctrl-Shift-U Ctrl-Shift-hexchar*, and GTK+
> always using UTF-8 internally, this can happen with any locale
> configuration.)

Based on the behavior of the rest of python, including the print statement, raising UnicodeEncodeError would not be completely outlandish.

> 
> > > * Is the only use we are talking about pyanaconda/users.py? ISTM
> > > livemedia-creator has no need for libuser but I haven’t looked for it now.
> > > If so, libuser would be breaking livemedia-creator even if livemedia-creator
> > > is never calling anything from libuser.
> <reordered>
> > livemedia-creator uses users.py through anaconda, which, of course, inherits
> > LANG= from the caller.
> 
> Sorry, that was my mistake; I confused livemedia-creator with a
> write-image-to-disk tool, I did not realize it is interpreting a kickstart
> file. So deferring the LANG interpretation after the kickstart is parsed
> seems to be necessary in livemedia-creator, regardless of system locale.
> 
> 
> > > In that case it might be reasonable to delay the “import libuser” in
> > > pyanaconda/users.py from top level to the first use, or for libuser to move
> > > the UTF-8 check from module import to first function call / object creation.
> > > 
> > > That would be still somewhat ugly but better than the PyUnicode_FSConverter
> > > mess.
> > 
> > So anaconda needs to ensure that the locale at the
> > time of the libuser import uses a UTF-8 encoding, and that will be added to
> > the pile of workarounds that go along with libuser. That's what I'll end up
> > doing regardless, since I need a fix in the short-term,
> 
> So I have spent several hours researching this, making sure I have the facts
> straight and all options have been explored, only to be told you can’t wait
> for a libuser fix anyway… I can’t help thinking my time could have been used
> for something more productive.
> 
> > but that you're
> > unwilling to even entertain the idea of a change to libuser, even a change
> > submitted by someone else, is beyond frustrating
> 
> I _have_ been considering this and checking how to accommodate the request,
> but the plain truth is that Python 3 just doesn’t support non-UTF-8 locales
> sufficiently for C extension modules. I have also precisely explained _why_
> I objected to that particular FS converter usage, leaving the option open
> for a patch without this downside.
> 
> 
> … AND, I have later realized that using FS encoding _would not even work_
> because the FS encoding is set by Python at the time of interpreter startup,
> so we would end up exactly in the same situation as now, except that a late
> import of libuser would not work around it.

The FS encoding is intended for encoding and decoding PEP 383 filename strings, so I don't see how that's even relevant. You mentioned wanting to use system-wide locale settings for system files, but there is no such thing as system-wide locale settings. That's part of why the C fallback is expected to exist.


> So the libuser module would have to use PyUnicode_{Encode,Decode}Locale and
> write its own ParseTuple converter… Even more code which has nothing to do
> with binding libuser; if this is worth doing at all (and IMHO it isn’t), it
> is worth doing in python3-libs.

It has everything to do with binding libuser, because it is entirely about translating the data from python to data that can be used by libuser.

> 
> 
> Besides being unhappy with my opinions on various options, what do _you_
> think the right solution is and what do you want me to do about this bug now?
> - Close this as WONTFIX-worked-around-in-anaconda?

Fine.

