902094 – C.UTF-8 locale?

Summary: C.UTF-8 locale?

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	glibc
Sub Component:
Version:	rawhide
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Carlos O'Donell
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1241381
Depends On:
Blocks:	1250238 1361965 1365486

Reported:	2013-01-20 20:56 UTC by Bill Nottingham
Modified:	2021-03-01 01:39 UTC
CC List:	25 users
Fixed In Version:	2.22.90-7.fc24
Clone Of:
Environment:
Last Closed:	2015-09-17 16:34:03 UTC
Type:	Bug
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Sourceware	17318	0	P2	NEW	[RFE] Provide a C.UTF-8 locale by default	2020-11-17 17:04:56 UTC

Description Bill Nottingham 2013-01-20 20:56:31 UTC

Description of problem:

Assorted other distributions/OSes appear to carry a C.UTF-8 locale
(http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=636086). We don't, AFAICT.

Could we?

Version-Release number of selected component (if applicable):

2.16-28

How reproducible:

100%

Steps to Reproduce:
1. export LANG=C.utf8
2. Start a GTK application from that shell.
  
Actual results:

(process:14831): Gtk-WARNING **: Locale not supported by C library.
	Using the fallback 'C' locale.


Expected results:


Additional info:

Comment 1 Carlos O'Donell 2013-01-21 03:09:09 UTC

I have no objection to adding a useful `C' locale that has UTF-8 support, and I've seen no objections to this upstream. I've seen proposals for this as early as 2009 (http://www.sourceware.org/ml/libc-alpha/2009-09/msg00042.html) but no action on the part of the submitter.

The best way forward is for someone to do the work to create a builtin C.UTF-8 locale (see glibc/locale/C-*) and submit it upstream; that way it will automatically get pulled into rawhide.

Providing a non-builtin C.UTF-8 locale is less useful, but certainly a simpler first step. Again, it should just be submitted upstream for inclusion and then pulled into rawhide.

Comment 2 Jakub Jelinek 2013-01-21 06:46:25 UTC

Providing a builtin C.UTF-8 locale is IMHO a bad idea; UTF-8 locales are just too large for that.

Comment 3 Carlos O'Donell 2013-01-21 14:06:30 UTC

Jakub,

How big do you estimate one UTF-8 locale would be if it were builtin like the normal C locale?

Comment 4 Jakub Jelinek 2013-01-21 14:17:30 UTC

I don't have a compiled glibc tree at hand, and locale-archive hides the exact file sizes, but just look at the typical sizes of compiled locales for an eight-bit charset vs. UTF-8.  LC_COLLATE typically grows many times over, and LC_CTYPE also grows somewhat.  localedata/POSIX currently defines collation only for the ASCII set, so you'd need to define some collation for all other UTF-8 characters too.

Comment 5 Carlos O'Donell 2013-01-21 14:46:56 UTC

Jakub,

It's ~1.5MB uncompressed for en_US.UTF-8.

I can agree that it's useful to have a universal UTF-8 locale to use without needing to install any other UTF-8 locale. The use cases make some sense to me, but I haven't gone over them with any kind of critical approach.

I think we could get a C.UTF-8 locale down to a smaller size, and I also think we might be able to do something to load it on demand when required.

Comment 6 Fedora Admin XMLRPC Client 2013-01-28 20:09:27 UTC

This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.

Comment 7 Fedora End Of Life 2013-04-03 14:13:23 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle.
Changing version to '19'.

(As we did not run this process for some time, it could also affect bugs from development
cycles before Fedora 19. We are very sorry. It will help us with cleanup during Fedora 19 End Of Life. Thank you.)

More information and reason for this action is here:
https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19

Comment 9 Nick Coghlan 2014-05-28 11:19:11 UTC

One of the challenges with Python 3's Linux integration is that when the OS claims the locale encoding is ASCII, the Python 3 interpreter believes it. While I eventually hope to deal with that problem upstream, it's not a trivial fix due to where the assumption that the locale encoding is accurate occurs during the startup sequence.

In the meantime, being able to do "LANG=C.UTF-8 python3" as a simple alternative to "LANG=C python3" would be a convenient workaround.
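
A rough way to see what the C library reports for a given locale name, sketched in Python. The helper name `codeset_for` is made up for illustration. `locale.nl_langinfo` is only available on Unix-like systems, and the codeset names returned differ between C libraries, so no particular output is assumed here.

```python
import locale


def codeset_for(locale_name):
    """Return the codeset the C library reports for locale_name.

    Returns None if the C library does not support that locale.
    (Hypothetical helper, for illustration only.)
    """
    # Remember the current LC_CTYPE setting so it can be restored.
    saved = locale.setlocale(locale.LC_CTYPE)
    try:
        locale.setlocale(locale.LC_CTYPE, locale_name)
    except locale.Error:
        # Same situation as "Locale not supported by C library".
        return None
    try:
        return locale.nl_langinfo(locale.CODESET)
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)


if __name__ == "__main__":
    for name in ("C", "C.UTF-8"):
        print(name, "->", codeset_for(name))
```

On a system that lacks the C.UTF-8 locale, the second lookup should come back as None on glibc; other C libraries may behave differently.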

Comment 10 Josh Triplett 2014-06-13 22:16:13 UTC

Any update on this issue?  It looks like the last update was in January 2013.  I'm interested in getting this fixed in glibc upstream, to eliminate skew between distributions that have implemented this and distributions that haven't.

Comment 11 Siddhesh Poyarekar 2014-06-18 06:50:31 UTC

Carlos suggested in comment 1 that someone submit a patch upstream to get C.UTF-8 into glibc by default so that it gets pulled into rawhide automatically.  All we need is someone to volunteer to actually do that.  The ideal folks to do this would be maintainers of distributions that ship this locale.

Comment 12 Nick Coghlan 2014-08-27 13:00:23 UTC

Upstream RFE filed: https://sourceware.org/bugzilla/show_bug.cgi?id=17318

If that doesn't get accepted as a reasonable idea, we can reexamine the possibility of a Fedora specific solution.

Comment 13 Nick Coghlan 2015-02-25 23:05:14 UTC

The libc-alpha discussion suggests that upstream considers this a reasonable idea, but it still requires someone with the time and interest to do the work: https://sourceware.org/ml/libc-alpha/2015-02/msg00247.html

Comment 14 Adam Borowski 2015-03-30 23:13:20 UTC

The biggest size concern, that of LC_COLLATE, doesn't apply. At least in Debian's implementation, LC_COLLATE=C.UTF-8 works identically to LC_COLLATE=C, i.e., the collation is strictly lexicographic, based on the unsigned values of successive bytes. Thanks to UTF-8's properties, that also happens to be strictly lexicographic based on the values of Unicode code points.
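
A minimal sketch of that UTF-8 property, in Python (chosen here for illustration; it is not Debian's or glibc's code). The sample strings are arbitrary. Python compares `str` values by code point, so sorting by the encoded bytes and sorting the strings directly should agree.

```python
def utf8_byte_key(text):
    """Sort key: the raw UTF-8 bytes, compared as unsigned values."""
    return text.encode("utf-8")


# Arbitrary sample mixing ASCII, Latin-1 range, CJK and a non-BMP character.
words = ["zebra", "éclair", "apple", "Zürich", "日本語", "😀", "~tilde"]

by_bytes = sorted(words, key=utf8_byte_key)  # byte-wise, like strcmp on UTF-8
by_code_points = sorted(words)               # Python str order is by code point

print(by_bytes)
print(by_bytes == by_code_points)
```

Because the two orders coincide, a byte-wise comparison is enough and no collation table for non-ASCII characters is needed in this scheme.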

Comment 15 Kevin Kofler 2015-08-04 23:30:12 UTC

As indicated in bug #1250238, which Ivan Romanov linked, KDE Plasma now requires this locale and does not work at all if you do not explicitly have a non-C locale set. (Thankfully, most users do, but still…)

Comment 16 Adam Borowski 2015-08-05 00:55:49 UTC

Note that there are two separate issues here:
1. adding the C.UTF-8 locale.  This can be done the old space-costly way.
2. replacing the ancient CTYPE code with something Unicode-centric.  This would reduce the space taken by UTF-8 locales, and make C.UTF-8 come for free.

1. is easy, and is the way Debian and co went.  I suggest you do this for now, as the proper rewrite can take years of waiting for someone to step up.  What needs to be done is making the new locale copy CTYPE from en_US.UTF-8 and all other facets from C/POSIX.

For 2., I think Unicode CTYPE should be hardcoded as opposed to current loadable handling, the way C and POSIX are handled today and ISO-8859-1 and KOI8-R used to be in the past.  This would optimize the Unicode case, allowing getting rid of all the duplication.  There are rumours musl managed to cram that 1.5MB data into 8KB.  That's acceptable even for static linking.  And perhaps non-UTF-8 locales could use this built-in data by converting at runtime, making support for loadable CTYPE data unneeded.

Heck, these days even completely dropping support for legacy locales might be worth considering.  If we can agree on that, the locale handling rewrite would be so much easier.

Comment 17 Jens Petersen 2015-08-05 10:03:27 UTC

Another important use case seems to be containers, which often
have minimal locales defined and could benefit from C.UTF-8.

Comment 18 Carlos O'Donell 2015-09-16 15:16:22 UTC

Thanks to Mike FABIAN we are adding C.UTF-8 to Fedora, not as a built-in (requires pre-mounted /usr), but as a drop-in addition. It is also merged into locale-archive, so there is no performance loss when accessing the locale. However, if locale-archive is purged, C, POSIX, and C.UTF-8 are still present (unless the user deletes /usr/lib/locale/C.utf8).

Comment 19 Carlos O'Donell 2015-09-16 15:19:07 UTC

*** Bug 1241381 has been marked as a duplicate of this bug. ***

Comment 20 Carlos O'Donell 2015-09-17 16:34:03 UTC

This is now fixed in Fedora Rawhide.

We now have an "uninstallable" C.UTF-8 locale that is available even if you delete locale-archive, or change the installed language set for locale-archive.

You can however remove C.UTF-8 if you delete /usr/lib/locale/C.utf8.
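
A small sketch, in Python, of checking whether that drop-in directory is still on disk. The function name `has_standalone_locale` and its parameters are hypothetical; only the path /usr/lib/locale/C.utf8 comes from this bug. It looks at the filesystem only and says nothing about what locale-archive contains.

```python
import os


def has_standalone_locale(name="C.utf8", root="/usr/lib/locale"):
    """Return True if a per-locale directory such as root/C.utf8 exists.

    Hypothetical helper: it checks only for the directory and does not
    consult locale-archive.
    """
    return os.path.isdir(os.path.join(root, name))


if __name__ == "__main__":
    print("C.utf8 directory present:", has_standalone_locale())
```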

Thanks to Mike FABIAN for all the help here.

Comment 21 Kevin Kofler 2018-11-06 01:03:36 UTC

For historical reference, this was fixed in Rawhide in 2.22.90-7.fc24.
The change was later backported to F23 in 2.22-11.fc23 and F22 in 2.21-13.fc22 (the last glibc update that was pushed to the F22 stable updates). All currently supported Fedora releases have this locale.
