Python in FC4 is compiled with UCS4 Unicode strings instead of the upstream
default of UCS2. This changes the exported ABI and renders Fedora's libpython
incompatible with other distributions. That's a serious problem: it means nobody
who wishes to distribute binaries on Linux can realistically embed Python or use
Python C modules in their app!
Red Hat has been shipping Python with UCS4 support since Red Hat Linux 9, if my
memory serves me right. Doing so gives us access to large character sets, and it
also uncovered a number of bugs that have since been fixed upstream. I don't
think we are in a position at this point to go back to UCS2, since we perceive
the future to be UCS4.
The way people solve the ABI incompatibility problem is by shipping packages for
specific distributions (which is probably the right thing to do in general).
If the future is UCS4, why isn't the upstream Python package providing that by
default? Have these patches been pushed there? Distribution-specific packages
for all third-party software embedding Python is a no-go.
I believe the answer is the amount of memory you are willing to use.
Python stores a Unicode string as an array of fixed-size characters. This has
the advantage of simplifying a lot of the string operations, but it doubles or
quadruples the memory requirements if most of what you store is ASCII. The
alternative would have been to use UTF8 for the internal representation of
Unicode strings, which saves memory when your text is ASCII only (or mostly
ASCII with very few non-ASCII characters), but then string operations would be
slower, since computing the length of a string would require examining every
byte to see whether it belongs to a single-byte or a multi-byte char.
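
To put rough numbers on that trade-off, here is a small sketch. It is not how
the interpreter stores strings internally; it only uses the standard codecs to
stand in for the three layouts, and the helper names (encoded_sizes,
utf8_length) are made up for this illustration.

```python
def encoded_sizes(text):
    """Return the number of bytes `text` occupies in each layout."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # utf-16-le / utf-32-le stand in for 2-byte and 4-byte storage.
        "2-byte units": len(text.encode("utf-16-le")),
        "4-byte units": len(text.encode("utf-32-le")),
    }


def utf8_length(data):
    """Count characters in UTF-8 bytes by scanning every byte.

    Continuation bytes look like 10xxxxxx; any other byte starts a new
    character, so the whole buffer has to be walked to get the length.
    """
    return sum(1 for byte in bytearray(data) if (byte & 0xC0) != 0x80)


ascii_text = u"hello"
mixed_text = u"na\u00efve"  # mostly ASCII, one non-ASCII character

print(encoded_sizes(ascii_text))
print(encoded_sizes(mixed_text))
print(utf8_length(mixed_text.encode("utf-8")))
```

For pure ASCII the 2-byte and 4-byte layouts come out at exactly two and four
times the UTF8 size, which is the doubling or quadrupling mentioned above.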
UCS2 uses 2 bytes for each char, while UCS4 uses 4. Unicode defines more
than 65535 characters, so a 2-byte representation is not enough (although, if I
understand correctly, characters outside of UCS2 are not that frequently used).
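
A quick way to see why 2 bytes are not enough, as a sketch only: the UTF-16
codec is used here as a stand-in for a 2-byte layout, where a character above
U+FFFF has to be split into a surrogate pair. The helper name units_needed is
hypothetical.

```python
def units_needed(ch, unit_bytes):
    """Return how many fixed-size units of `unit_bytes` bytes `ch` takes."""
    codec = {2: "utf-16-le", 4: "utf-32-le"}[unit_bytes]
    return len(ch.encode(codec)) // unit_bytes


bmp_char = u"\u00e9"         # U+00E9, fits in 16 bits
astral_char = u"\U00010400"  # U+10400, above the 65535 limit

print(units_needed(bmp_char, 2), units_needed(bmp_char, 4))
print(units_needed(astral_char, 2), units_needed(astral_char, 4))
```

With 4-byte units every character takes exactly one slot; with 2-byte units
the second character needs two, so "one unit per character" no longer holds.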
Moving from UCS2 to UCS4 (or the other way around) is a rather major
undertaking; we decided to make that move some time ago, and we are trying to
preserve the ABI within Red Hat's products and avoid exactly the type of problem
you describe.
There is no one-size-fits-all, unfortunately: people who don't care about
characters outside of UCS2 would probably want the extra memory back; OTOH, some
people want the ability to represent those chars.
That being said: which distros are still shipping UCS2? As far as I know, SuSE
has shipped UCS4 since 9, and Debian ships UCS4. Mandriva 2006 seems to still be
UCS2.
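
For anyone who wants to check a particular distro, a minimal sketch: on the
interpreters discussed here, sys.maxunicode is 65535 on a UCS2 build and
1114111 on a UCS4 build. The function name unicode_build is made up for this
example. Note that much later interpreters (Python 3.3 and up) dropped the
build-time choice and always report 1114111, so the check only says something
on older ones.

```python
import sys


def unicode_build(maxunicode=None):
    """Classify an interpreter as "UCS2" or "UCS4" from its maxunicode."""
    if maxunicode is None:
        # Default to the interpreter running this script.
        maxunicode = sys.maxunicode
    return "UCS4" if maxunicode > 0xFFFF else "UCS2"


print(unicode_build())
```

Running that with each distro's stock python should answer the question
without having to dig through the package's build options.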