Bug 1551073 - glibc: [RFE] Reduce size of locales by attempting to overlap collation tables for common elements.
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Fedora
Classification: Fedora
Component: glibc
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Carlos O'Donell
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-03-02 17:01 UTC by Kevin Fenzi
Modified: 2019-10-15 13:33 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-15 13:33:17 UTC
Type: Bug
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Sourceware 25105 0 None None None 2019-10-15 13:33:16 UTC

Description Kevin Fenzi 2018-03-02 17:01:22 UTC
There seems to be a pretty large increase in size between glibc-2.27-5.fc29 and glibc-2.27.9000-7.fc29. 

glibc-all-langpacks-2.27-5.fc29 is about 108MB on disk
glibc-all-langpacks-2.27.9000-7.fc29 is about 200MB on disk.

All the single langpacks are larger too.

Comment 1 Florian Weimer 2018-03-02 18:00:54 UTC
glibc-2.27-6.fc28 has the same issue.

Mike, do you see an easy way to reduce cross-locale variance in the generated tables, so that the locale archive becomes smaller?

Comment 2 Carlos O'Donell 2018-03-02 18:36:06 UTC
(In reply to Kevin Fenzi from comment #0)
> There seems to be a pretty large increase in size between glibc-2.27-5.fc29
> and glibc-2.27.9000-7.fc29. 
> 
> glibc-all-langpacks-2.27-5.fc29 is about 108MB on disk
> glibc-all-langpacks-2.27.9000-7.fc29 is about 200MB on disk.
> 
> All the single langpacks are larger too.

This is expected, and I have confirmed it. The actual growth is less than you list: ~80MiB of additional space is needed for all the locales to have correct collation matching the newer standards.

In glibc 2.28 (the current development branch) Mike Fabian completed this work:
https://fedoraproject.org/wiki/Changes/Glibc_collation_update_and_sync_with_cldr
It updated glibc to stay in sync with ISO 14651, which now tracks Unicode 9.0 characters.

The benefit is that we now have sorting (collation) for all the new characters added in the past 15 years. The downside is that it takes up an additional ~80MiB for all the locales we support in glibc-all-langpacks.

The collation data alone went from 425KiB to 3.3MiB in the sources stored in the project git repo.

If size is an issue we recommend installing glibc-minimal-langpack (just C/POSIX and C.UTF-8), or the specific language pack you need.

Comment 3 Carlos O'Donell 2018-03-02 18:39:02 UTC
(In reply to Florian Weimer from comment #1)
> glibc-2.27-6.fc28 has the same issue.
> 
> Mike, do you see an easy way to reduce cross-locale variance in the
> generated tables, so that the locale archive becomes smaller?

This is harder to do than you think. The collation tables are mixed up with all the other collation elements and weights, so any new character changes the table for that locale. We would have to invent a new way to segregate those tables and the rules and still arrive at correct results. This would need significant engineering.

However, we might get away with analyzing the final tables after generation and then compressing them or sharing them in some way. That work would operate only on the final generated tables.

This would be an RFE.
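
As a rough illustration of the post-generation sharing idea above, here is a minimal Python sketch. It assumes each locale's final collation table is available as a byte string and that identical fixed-size chunks can be stored once and referenced by every locale that uses them. The function name, the chunk size, and the sample data are all hypothetical; this does not model the real locale archive format or how localedef works.

```python
import hashlib

def archive_sizes(tables, chunk_size=4):
    """Return (naive_bytes, shared_bytes) for per-locale table blobs.

    tables: dict mapping a locale name to its final table as bytes.
    Identical chunks are counted once in shared_bytes.
    Hypothetical helper for illustration only.
    """
    naive = 0
    unique_chunks = {}  # chunk digest -> chunk length
    for blob in tables.values():
        naive += len(blob)
        for off in range(0, len(blob), chunk_size):
            chunk = blob[off:off + chunk_size]
            unique_chunks.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    return naive, sum(unique_chunks.values())

# Made-up data: two locales share a common prefix and differ in the tail.
common = b"AAAABBBB"
example = {
    "en_US": common + b"en..",
    "de_DE": common + b"de..",
}
naive, shared = archive_sizes(example)
print(naive, shared)
```

In this toy input the common prefix is stored once, so the shared size is smaller than the naive sum. Whether real tables contain enough aligned, byte-identical regions for this to pay off is exactly what the analysis would need to establish.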

Comment 4 Carlos O'Donell 2019-10-15 13:33:17 UTC
I am closing this RFE here and we are going to track this upstream:
https://sourceware.org/bugzilla/show_bug.cgi?id=25105

We need to work on this problem upstream and get a solution that works for all downstream distributions.

