The current implementation of C.UTF-8 in RHEL9 has several defects in the higher code-point ranges (incorrect sorting). The upstream C.UTF-8 update (v8) fixes all of these defects and also reduces the size of the locale from ~2MiB to ~400KiB (a saving of ~1.6MiB):
https://sourceware.org/pipermail/libc-alpha/2021-August/130501.html

The new version of C.UTF-8 will most likely be included in glibc 2.35, with a backport to glibc 2.34. We should include it in the glibc 2.34 shipped in RHEL9 regardless of the upstream decision to backport.

We want to make this change before RHEL9 GA. After RHEL9 GA we would not want to change the collation of code points, because of the impact it has on sorting data that customers already have (we only change collation at X-stream boundaries). Therefore I think this should be fixed immediately.
Fixed upstream with these two commits.

commit 466f2be6c08070e9113ae2fdc7acd5d8828cba50
Author: Carlos O'Donell <carlos>
Date:   Wed Sep 1 15:19:19 2021 -0400

    Add generic C.UTF-8 locale (Bug 17318)
    ...

commit f5117c6504888fab5423282a4607c552b90fd3f9
Author: Carlos O'Donell <carlos>
Date:   Thu Jul 29 22:45:39 2021 -0400

    Add 'codepoint_collation' support for LC_COLLATE.
    ...
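
For anyone who wants to sanity-check what strict code-point collation should look like, here is a minimal Python sketch. It is not the glibc implementation and does not exercise the upstream patches. It only illustrates the general UTF-8 property that sorting strings by code point gives the same order as sorting their UTF-8 encodings byte by byte, including above U+FFFF. The sample strings and the helper names (`codepoint_key`, `utf8_key`, `sorted_with_c_utf8`) are hypothetical, chosen for illustration.

```python
import locale


def codepoint_key(s):
    # Sort key: the sequence of Unicode code points in the string.
    return [ord(ch) for ch in s]


def utf8_key(s):
    # Sort key: the raw UTF-8 bytes, compared byte-wise (like strcmp).
    return s.encode("utf-8")


def sorted_with_c_utf8(strings):
    # Optional check against the system's C.UTF-8 locale, if installed.
    # Returns None when the locale is unavailable. On a system whose
    # C.UTF-8 has collation defects, the result may differ from
    # code-point order.
    try:
        locale.setlocale(locale.LC_COLLATE, "C.UTF-8")
    except locale.Error:
        return None
    return sorted(strings, key=locale.strxfrm)


# Illustrative sample spanning ASCII, Latin-1, the BMP, and code points
# above U+FFFF.
samples = ["z", "a", "\u00e9", "\u20ac", "\U0001F600",
           "\uffff", "\U00010000", "Z", "ab"]

by_codepoint = sorted(samples, key=codepoint_key)
by_utf8_bytes = sorted(samples, key=utf8_key)

# The two orders coincide for valid UTF-8 text.
print(by_codepoint == by_utf8_bytes)
```

To compare against an installed locale, call `sorted_with_c_utf8(samples)` and check whether the result equals `by_codepoint`. A mismatch in the higher ranges would be consistent with the sorting defects described in this bug.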