There seems to be a pretty large increase in size between glibc-2.27-5.fc29 and glibc-2.27.9000-7.fc29.

glibc-all-langpacks-2.27-5.fc29 is about 108MB on disk
glibc-all-langpacks-2.27-9000-7.fc29 is about 200MB on disk.

All the single langpacks are larger too.
glibc-2.27-6.fc28 has the same issue.

Mike, do you see an easy way to reduce cross-locale variance in the generated tables, so that the locale archive becomes smaller?
(In reply to Kevin Fenzi from comment #0)
> There seems to be a pretty large increase in size between glibc-2.27-5.fc29
> and glibc-2.27.9000-7.fc29.
>
> glibc-all-langpacks-2.27-5.fc29 is about 108MB on disk
> glibc-all-langpacks-2.27-9000-7.fc29 is about 200MB on disk.
>
> All the single langpacks are larger too.

This is expected, and I have confirmed it. The actual growth is less than you list: ~80MiB of additional space is needed for all the locales to have correct collation matching the newer standards.

In glibc 2.28 (the current development branch) Mike Fabian completed this work:
https://fedoraproject.org/wiki/Changes/Glibc_collation_update_and_sync_with_cldr

That change updated glibc to stay in sync with ISO 14651, which now tracks Unicode 9.0 characters. The benefit is that we now have sorting (collation) for all the new characters added in the past 15 years. The downside is that it takes up an additional ~80MiB for all the locales we support in glibc-all-langpacks. The collation data alone went from 425KiB to 3.3MiB in our stored sources in the project git repo.

If size is an issue, we recommend installing glibc-minimal-langpack (just C/POSIX and C.UTF-8), or the specific language pack you need.
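For a rough sense of scale, the figures quoted in this thread can be put side by side. This is only back-of-the-envelope arithmetic on the numbers above, not a measurement; the variable names are made up for illustration, and the reported package sizes are kept in the "MB" unit used in comment #0, which may not be the same unit as the ~80MiB estimate.

```python
# Back-of-the-envelope arithmetic using only figures quoted in this thread.
KIB = 1024
MIB = 1024 * KIB

# Collation source data in the project git repo (from the comment above).
old_collation_bytes = 425 * KIB
new_collation_bytes = 3.3 * MIB
collation_growth_factor = new_collation_bytes / old_collation_bytes

# Installed sizes as reported in comment #0 (unit as reported: "MB").
reported_old_mb = 108
reported_new_mb = 200
reported_growth_mb = reported_new_mb - reported_old_mb

print(f"collation sources grew by a factor of about {collation_growth_factor:.1f}")
print(f"reported on-disk difference: {reported_growth_mb} MB")
```

The source data growing by roughly a factor of eight is consistent with a large increase in every compiled locale, since each locale carries its own generated collation table.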
(In reply to Florian Weimer from comment #1)
> glibc-2.27-6.fc28 has the same issue.
>
> Mike, do you see an easy way to reduce cross-locale variance in the
> generated tables, so that the locale archive becomes smaller?

This is harder to do than you think, because the collation tables are mixed up with all the other collation elements and weights, so any new character changes the table for that locale. We would have to invent a new way to segregate those tables and the rules and still arrive at correct results. This would need some significant engineering.

However, we might get away with analysing and compressing the tables after generation, sharing them in some way, but that would operate on the final generated tables. This would be an RFE.
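As a very rough illustration of what "post-generation sharing" could mean, the sketch below stores byte-identical tables once and points each locale at the shared copy. It is a toy under stated assumptions: the function name `share_identical_tables`, the locale-to-bytes mapping, and the sample data are all hypothetical, and it does not model the real locale archive format or anything localedef does. It also only helps when tables are exactly identical, which is the very problem described above: a single differing character makes two tables distinct.

```python
import hashlib
from typing import Dict, Tuple


def share_identical_tables(
    tables: Dict[str, bytes],
) -> Tuple[Dict[str, bytes], Dict[str, str]]:
    """Store each distinct table once, keyed by its SHA-256 digest.

    Returns (blobs, index): blobs maps digest -> table bytes,
    index maps locale name -> digest of the table it uses.
    """
    blobs: Dict[str, bytes] = {}
    index: Dict[str, str] = {}
    for locale, data in tables.items():
        digest = hashlib.sha256(data).hexdigest()
        blobs.setdefault(digest, data)  # keep only the first copy
        index[locale] = digest
    return blobs, index


# Hypothetical stand-ins for generated collation tables.
tables = {
    "en_US.UTF-8": b"\x01\x02\x03" * 1000,
    "en_GB.UTF-8": b"\x01\x02\x03" * 1000,  # identical to en_US
    "de_DE.UTF-8": b"\x01\x02\x04" * 1000,  # differs by one weight
}

blobs, index = share_identical_tables(tables)
size_before = sum(len(t) for t in tables.values())
size_after = sum(len(b) for b in blobs.values())
print(f"{size_before} bytes before, {size_after} bytes after sharing")
```

Sharing at a finer granularity (for example, common sub-ranges of otherwise different tables) would save more, but that is the part that needs the engineering work mentioned above.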
I am closing this RFE here and we are going to track this upstream:
https://sourceware.org/bugzilla/show_bug.cgi?id=25105

We need to work on this problem upstream and get a solution that works for all downstream distributions.