Bug 902094
| Field | Value |
|---|---|
| Summary | C.UTF-8 locale? |
| Product | [Fedora] Fedora |
| Component | glibc |
| Version | rawhide |
| Status | CLOSED RAWHIDE |
| Severity | high |
| Priority | unspecified |
| Hardware | Unspecified |
| OS | Unspecified |
| Reporter | Bill Nottingham <notting> |
| Assignee | Carlos O'Donell <codonell> |
| QA Contact | Fedora Extras Quality Assurance <extras-qa> |
| CC | a.badger, bkabrda, codonell, fweimer, jakub, josh, kevin, kilobyte, law, mattdm, mcepl, mfabian, mnewsome, mst, myllynen, ncoghlan, pbrobinson, petersen, pfrankli, pnemade, psatpute, rdieter, redhat-bugzilla, rvokal, schwab |
| Keywords | FutureFeature |
| Fixed In Version | 2.22.90-7.fc24 |
| Doc Type | Enhancement |
| Type | Bug |
| Last Closed | 2015-09-17 16:34:03 UTC |
| Bug Blocks | 1250238, 1361965, 1365486 |

Description
Bill Nottingham
2013-01-20 20:56:31 UTC
I have no objection to adding a useful `C' locale that has UTF-8 support, and I've seen no objections to this upstream. I've seen proposals for this as early as 2009 (http://www.sourceware.org/ml/libc-alpha/2009-09/msg00042.html) but no action on the part of the submitter. The best way forward is for someone to do the work to create a builtin (see glibc/locale/C-*) C.UTF-8 locale and submit it upstream; that way it will automatically get pulled into rawhide. Providing a non-builtin C.UTF-8 locale is less useful but certainly a simpler first step; again, it should just be submitted upstream for inclusion and then pulled into rawhide.

Providing a builtin C.UTF-8 locale is IMHO a bad idea; UTF-8 locales are IMHO just too large for that.

Jakub, how big do you estimate one UTF-8 locale would be if it were builtin like the normal C locale?

I don't have a compiled tree of glibc around me and locale-archive hides the exact file sizes; just look at the typical sizes of compiled locales for an eight-bit charset vs. UTF-8. LC_COLLATE typically grows many times, and LC_CTYPE also somewhat. localedata/POSIX currently defines collation only for the ASCII set; you'd need to define some collation for all other UTF-8 characters too.

Jakub, it's ~1.5MB uncompressed for en_US.UTF-8. I can agree that it's useful to have a universal UTF-8 locale to use without needing to install any other UTF-8 locale. The use cases make some sense to me, but I haven't gone over them with any kind of critical approach. I think we could get a C.UTF-8 locale down to a smaller size, and I also think we might be able to do something to load it on demand when required.

This package has changed ownership in the Fedora Package Database. Reassigning to the new owner of this component.

This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle. Changing version to '19'.
(As we did not run this process for some time, it could also affect pre-Fedora 19 development cycle bugs. We are very sorry. It will help us with cleanup during Fedora 19 End Of Life. Thank you.) More information and the reason for this action are here: https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19

One of the challenges with Python 3's Linux integration is that when the OS claims the locale encoding is ASCII, the Python 3 interpreter believes it. While I eventually hope to deal with that problem upstream, it's not a trivial fix, because the assumption that the locale encoding is accurate is made early in the startup sequence. In the meantime, being able to do "LANG=C.UTF-8 python3" as a simple alternative to "LANG=C python3" would be a convenient workaround.

Any update on this issue? It looks like the last update was in January 2013. I'm interested in getting this fixed in glibc upstream, to eliminate skew between distributions that have implemented this and distributions that haven't.

Carlos suggested in comment 1 that someone submit a patch upstream to get C.UTF-8 into glibc by default so that it gets pulled into rawhide automatically. All we need is someone to volunteer to actually do that. The ideal folks to do this would be maintainers of distributions that ship this locale.

Upstream RFE filed: https://sourceware.org/bugzilla/show_bug.cgi?id=17318 If that doesn't get accepted as a reasonable idea, we can reexamine the possibility of a Fedora-specific solution.
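
The Python 3 workaround described above (running with LANG=C.UTF-8 instead of LANG=C) can be observed from a script. The following is a minimal sketch and not part of the original report: the helper name `encoding_under` is hypothetical, and it sets PYTHONCOERCECLOCALE=0 and PYTHONUTF8=0 on the assumption that the interpreter is Python 3.7 or later, where locale coercion and UTF-8 mode would otherwise mask the effect. The encoding names reported depend on the C library and on whether C.UTF-8 is installed, so no expected output is shown.

```python
import os
import subprocess
import sys


def encoding_under(lang):
    """Return the locale encoding a child Python 3 reports when LANG=lang."""
    env = dict(os.environ)
    # LC_ALL and LC_CTYPE take precedence over LANG, so remove them.
    env.pop("LC_ALL", None)
    env.pop("LC_CTYPE", None)
    env["LANG"] = lang
    # Assumption: Python 3.7+; disable C-locale coercion and UTF-8 mode
    # so the child reports what the C library says about the locale.
    env["PYTHONCOERCECLOCALE"] = "0"
    env["PYTHONUTF8"] = "0"
    probe = "import locale; print(locale.getpreferredencoding(False))"
    result = subprocess.run(
        [sys.executable, "-c", probe],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    for lang in ("C", "C.UTF-8"):
        print(lang, "->", encoding_under(lang))
```

If the C.UTF-8 locale is not installed, the child process typically falls back to the plain C locale, so both calls may report the same encoding.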
The libc-alpha discussion suggests that this is considered a reasonable idea by upstream, but it still requires someone with the time and interest to do the work: https://sourceware.org/ml/libc-alpha/2015-02/msg00247.html

The biggest size concern, that of LC_COLLATE, doesn't apply: at least in Debian's implementation, LC_COLLATE=C.UTF-8 works identically to LC_COLLATE=C, i.e., the collation is strictly lexicographical based on unsigned values of subsequent bytes, which thanks to UTF-8's properties also happens to be strictly lexicographical based on values of Unicode codepoints.

As indicated in bug #1250238 that Ivan Romanov linked, KDE Plasma now requires this locale and does not work at all if you do not explicitly have a non-C locale set. (Thankfully, most users do, but still…)

Note that there are two separate issues here:
1. adding the C.UTF-8 locale. This can be done the old space-costly way.
2. replacing the ancient CTYPE code with something Unicode-centric. This would reduce the space taken by UTF-8 locales, and make C.UTF-8 come for free.

1. is easy, and is the way Debian and co went. I suggest you do this for now, as the proper rewrite can take years of waiting for someone to step up. What needs to be done is making the new locale copy CTYPE from en_US.UTF-8 and all other facets from C/POSIX.

For 2., I think Unicode CTYPE should be hardcoded as opposed to the current loadable handling, the way C and POSIX are handled today and ISO-8859-1 and KOI8-R used to be in the past. This would optimize the Unicode case, allowing getting rid of all the duplication. There are rumours musl managed to cram that 1.5MB of data into 8KB. That's acceptable even for static linking. And perhaps non-UTF-8 locales could use this built-in data by converting at runtime, making support for loadable CTYPE data unneeded. Heck, these days even completely dropping support for legacy locales might be worth considering.
If we can agree on that, the locale handling rewrite would be so much easier.

Another important use case seems to be containers, which often have minimal locales defined and could benefit from C.UTF-8.

Thanks to Mike FABIAN we are adding C.UTF-8 to Fedora, not as a built-in (requires pre-mounted /usr), but as a drop-in addition. It is also merged with locale-archive so there is no performance loss to access the locale. However, if locale-archive is purged, C, POSIX, and C.UTF-8 are still present (unless the user deletes /usr/lib/locale/C.utf8).

*** Bug 1241381 has been marked as a duplicate of this bug. ***

This is now fixed in Fedora Rawhide. We now have an "uninstallable" C.UTF-8 locale that is available even if you delete locale-archive, or change the installed language set for locale-archive. You can however remove C.UTF-8 if you delete /usr/lib/locale/C.utf8. Thanks to Mike FABIAN for all the help here.

For historical reference, this was fixed in Rawhide in 2.22.90-7.fc24. The change was later backported to F23 in 2.22-11.fc23 and F22 in 2.21-13.fc22 (the last glibc update that was pushed to the F22 stable updates). All currently supported Fedora releases have this locale.
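
The collation argument made earlier in this bug, that comparing UTF-8 strings byte by byte as unsigned values gives the same order as comparing Unicode code points, can be sanity-checked with a short script. This is an illustrative sketch, not glibc code: the function names and sample strings are made up for the demonstration, and it relies on Python's own bytes and list comparisons rather than on the locale's strcoll.

```python
def sort_by_utf8_bytes(strings):
    """Sort by the unsigned byte values of each string's UTF-8 encoding."""
    return sorted(strings, key=lambda s: s.encode("utf-8"))


def sort_by_code_points(strings):
    """Sort by the sequence of Unicode code points in each string."""
    return sorted(strings, key=lambda s: [ord(ch) for ch in s])


# Hypothetical sample covering 1-, 2-, 3- and 4-byte UTF-8 sequences,
# plus an empty string and strings that share a common prefix.
SAMPLES = [
    "z", "Z", "a", "ab", "a\u0301", "",
    "\u00e9",      # 2-byte sequence
    "\u20ac",      # 3-byte sequence
    "\uffff",      # highest code point of the Basic Multilingual Plane
    "\U0001F600",  # 4-byte sequence, outside the BMP
]

if __name__ == "__main__":
    by_bytes = sort_by_utf8_bytes(SAMPLES)
    by_code_points = sort_by_code_points(SAMPLES)
    assert by_bytes == by_code_points
    # ascii() keeps the output printable even under an ASCII-only locale.
    print([ascii(s) for s in by_bytes])
```

The check only covers the listed samples; it illustrates the property rather than proving it for all strings.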