Bug 455981

Summary: Missing locl romanian magic
Product: [Fedora] Fedora
Reporter: Nicolas Mailhot <nicolas.mailhot>
Component: dejavu-fonts
Assignee: Ben Laenen <bl.bugs>
Status: CLOSED RAWHIDE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: low
Priority: low
Version: rawhide
CC: fonts-bugs, gaburici, i18n-bugs, quantumburnz
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2008-10-27 03:20:13 UTC
Bug Blocks: 438944    
Attachments (description, flags):
My version of Sans, no features enabled - all test glyphs use combining! (flags: none)
The new ROM rlig feature activated. It looks much better than with mark on! (flags: none)
Turning on both rlig and locl works as expected. (flags: none)
Giant patch ball for Romanian comma and cedilla ligatures. (flags: none)

Description Nicolas Mailhot 2008-07-19 18:33:41 UTC
According to
http://fedoraproject.org/wiki/L10N/Tasks/Ro_fonts

DejaVu has all the glyphs Romanian uses but is missing some necessary locl magic
already available in SIL and proprietary fonts

Comment 1 Ben Laenen 2008-07-19 18:46:54 UTC
I've been told by a Romanian person (as I explicitly asked about this) that 
they don't expect to see the S/T with comma below if they type the S/T with 
cedilla...

Comment 2 Nicolas Mailhot 2008-07-19 19:12:19 UTC
Romanians seem to be in disagreement :(

Comment 3 Ben Laenen 2008-07-19 19:18:48 UTC
My personal view is that it shouldn't be done. The S/T with cedilla code 
points are no longer unified with the Romanian letters S/T with comma below, 
so they should never appear like them anymore, and if they do, it's a bug.

Comment 4 Vasile Gaburici 2008-07-19 21:27:46 UTC
It should be done because locl is an *optional* font feature. The application is
free to request it or not. Unfortunately pango always turns locl on based on
language. It should be configurable at pango's level, preferably in a way that
allows applications to modify it via pango markup. It's okay for the default
pango settings to turn locl on for Romanian because the Romanian Academy
typographic standard requires commas, not cedillas, for Romanian text. The only
possible trouble spot is a Turkish name. But that name could be marked up as
being in the Turkish language in the document [well, not in plain text].
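
As a rough illustration of the behaviour argued for here (not how pango or any
real shaper is implemented), the sketch below models locl as an optional,
language-gated substitution. Real locl lookups replace glyphs inside the font;
code points are used only as stand-ins. The function name `apply_locl`, the
`enabled` flag and the use of the Turkish language tag `TRK` are assumptions
made for the example.

```python
# Toy model of a Romanian 'locl' rule: cedilla forms -> comma-below forms.
# Real 'locl' lookups substitute glyphs inside the font; code points are
# used here only as stand-ins for glyphs.
CEDILLA_TO_COMMA = {
    "\u015E": "\u0218",  # S with cedilla -> S with comma below
    "\u015F": "\u0219",  # s with cedilla -> s with comma below
    "\u0162": "\u021A",  # T with cedilla -> T with comma below
    "\u0163": "\u021B",  # t with cedilla -> t with comma below
}

# OpenType language system tags that receive the substitution in this sketch.
LOCL_LANGUAGES = {"ROM", "MOL"}


def apply_locl(text: str, lang_tag: str, enabled: bool = True) -> str:
    """Return the text as displayed; `enabled` models an application opting out."""
    if not enabled or lang_tag not in LOCL_LANGUAGES:
        return text
    return "".join(CEDILLA_TO_COMMA.get(ch, ch) for ch in text)


if __name__ == "__main__":
    word = "\u015Fi \u0163ar\u0103"  # typed with the cedilla code points
    print(apply_locl(word, "ROM"))         # comma-below forms
    print(apply_locl(word, "TRK"))         # unchanged: run marked up as Turkish
    print(apply_locl(word, "ROM", False))  # unchanged: feature not requested
```

The Turkish case is the "trouble spot" mentioned above: a run tagged with a
different language simply never reaches the substitution.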

BTW, if you ever saw a Romanian document rendered with mixed cedillas and commas
you wouldn't doubt the necessity of locl. Adobe introduced ROM/locl because they
(and 99% of commercial fonts) remap "t with cedilla" to "t with comma"
regardless of locale, based on the assumption that "t with cedilla" is not used
in any language [There's a post on Adobe forums, but I'm too tired to find it
now]. Mixed diacritics look like €rap for Romanian text in the pre-Unicode 3.0
encoding, which sadly is still far more widespread at least on the web [check
with Google]. A [mild] picture of mixed diacritics is here:
http://en.wikipedia.org/wiki/Romanian_alphabet#Adobe.2FLinotype.2FVista_de-facto_standard.
This visual inconsistency is why Adobe Pro fonts can also map "s with cedilla"
to "s with comma" when ROM/locl is turned on. FYI: Vista fonts and the Linotype
fonts [you can check on their site] behave the same way.

Now assume DejaVu, which currently doesn't honor ROM/locl, is used in a document
together with a font that does honor the ROM/locl substitution, not necessarily a
commercial one, e.g. one of the free SIL fonts[*]. You'd get mixed diacritics
again. Granted, this is not in the same font, but it is still in the same
document and it looks bad...

Footnote [*] SIL fonts use ROM/ccmp to do the mapping, but pango turns that on
too. I'm not aware of any other fonts except those from SIL that work this way.
Please don't make dejavu work that way. I'd rather have you adopt the Adobe
standard which is used in hundreds of fonts.


Comment 5 Ben Laenen 2008-07-21 12:05:56 UTC
Ugh, Unicode seems to have made an even bigger mess out of this than I 
originally thought...

So, apparently both U+015E-U+015F, U+0162-U+0163, and U+0218-U+021B can still 
all be used for Romanian. With the extra string attached to U+0218-U+021B that 
they should be used when a distinct shape with comma below is needed. So 
you're still allowed the U+015E-U+015F, U+0162-U+0163 glyphs to write Romanian 
apparently.

And since Unicode only cares about code points, it didn't give any clue on how 
fonts or renderers are supposed to know when distinct glyphs are needed. Yet 
Unicode expects them to clean up the mess they've made.

> It should be done because locl is an *optional* font feature.

I thought it was obligatory if a language was passed to the renderer (but I may 
be wrong on this).

> Adobe introduced ROM/locl because they (and 99% of commercial fonts) remap
> "t with cedilla" to "t with comma" regardless of locale

That's just bad, t with cedilla _is_ used sometimes. I think it was even 
proposed a long time ago to be used in French for when a t sounds like /s/, 
like "relaţion" (didn't catch on unfortunately :-) ). Unicode itself mentions 
Semitic transliteration (but I guess that needs a lot of other glyphs those 
fonts don't have).

So far I've only found three Adobe fonts with Romanian glyphs and two didn't 
have the locl rule, so it looks like Adobe doesn't do it often either. They 
all have indeed t with comma below in the place of t with cedilla. If you have 
documents with mixed diacritics you can blame it on that practice, _not_ the 
absence of locl rules in the font.

I've also checked the MS Vista fonts once (usually they make the de facto 
standard rules since their fonts are most widely spread). Segoe UI and the new 
versions of Arial, Times New Roman etc. don't have locl rules or anything else 
and have t with cedilla at U+0162-U+0163 (I think the old versions known as 
the corefonts were pre-Unicode 3.0). The C-fonts which were made by another 
foundry have t with comma below at U+0162-U+0163 like Adobe fonts, and have a 
salt (stylistic alternate) _and_ a locl feature for s with cedilla glyphs to s 
with comma below for Romanian.

Also, one thing I'm asking myself is: why doesn't Gentium have locl rules (or 
ccmp rules)? It's a more recent font compared to Doulos and Charis, so the SIL 
people seem to have changed their minds about it, and I'd like to know their 
reasons before changing anything in DejaVu.

So, short conclusion: how it's dealt with seems to just depend on the 
foundry that made the fonts, and it also seems to depend on who you ask. So 
far, I haven't seen enough yet to be sure that a locl rule is needed.

Also, don't always assume commercial fonts have it right. As said above, the 
same fonts have t with comma below in place of t with cedilla, together with an 
s with cedilla, which is the worst thing you can do here.


Comment 6 Vasile Gaburici 2008-07-22 10:40:37 UTC
(In reply to comment #5)
> So, apparently both U+015E-U+015F, U+0162-U+0163, and U+0218-U+021B can still 
> all be used for Romanian. With the extra string attached to U+0218-U+021B that 
> they should be used when a distinct shape with comma below is needed. So 
> you're still allowed the U+015E-U+015F, U+0162-U+0163 glyphs to write Romanian 
> apparently.

Microsoft took about 7 years to include U+0218-U+021B in *some* Windows XP
fonts, which happened only after Romanian got into the EU :) Some XP fonts
(Georgia, Courier) still don't have the proper glyphs, even after the update
[http://www.microsoft.com/downloads/details.aspx?familyid=0ec6f335-c3de-44c5-a13d-a1e7cea5ddea&displaylang=en]
(google "EU font expansion update" if that ugly link doesn't work).

The result is that documents using the pre-Unicode 3.0 encoding (U+015E-U+015F,
U+0162-U+0163) still dominate.

> > It should be done because locl is an *optional* font feature.
> 
> I thought it was obligatory if a language was passed to the renderer (but I may
> be wrong on this).

Currently Uniscribe (the XP renderer) doesn't honor it at all. At least in XP SP3.

> > Adobe introduced ROM/locl because they (and 99% of commercial fonts) remap
> > "t with cedilla" to "t with comma" regardless of locale
> 
> That's just bad, t with cedilla _is_ used sometimes. I think it was even 
> proposed a long time ago to be used in French for when a t sounds like /s/, 
> like "relaţion" (didn't catch on unfortunately :-) ). Unicode itself mentions 
> Semitic transliteration (but I guess that needs a lot of other glyphs those 
> fonts don't have).

I agree it's bad. *Very few* commercial fonts have a proper "t with cedilla".
Verdana and Tahoma are the only significant ones. Everything else follows the Adobe
standard. You can check commercial fonts at Linotype's website. Below is a link
that restricts the search to fonts that support the Romanian characters:
[http://www.linotype.com/featuresearch?cf[]=adobece&cf[]=euro&cf[]=latinext]
You have to enter a test string yourself, since that doesn't go in the URL.
Use: aăâiîsştţ€sștț.

> So far I've only found three Adobe fonts with Romanian glyphs and two didn't 
> have the locl rule, so it looks like Adobe doesn't do it often either. They 
> all have indeed t with comma below in the place of t with cedilla. If you have 
> documents with mixed diacritics you can blame it on that practice, _not_ the 
> absence of locl rules in the font.

You probably looked at old fonts. All the Pro fonts they are currently shipping
have complete support for Romanian, with a "t with cedilla" substituted by the
comma variant regardless of locale, and with a ROM/locl feature that
*additionally* substitutes "s with cedilla" with "s with comma". Vista C-series
fonts have exactly the same feature set, as you pointed out.
[http://en.wikipedia.org/wiki/Romanian_alphabet#Adobe.2FLinotype.2FVista_de-facto_standard]
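
To make the mixed-diacritics argument concrete, here is a small sketch of the
font behaviour described above: "t with cedilla" is always drawn with a comma,
while "s with cedilla" is drawn with a comma only when ROM/locl is on. It is a
toy model over code points, not a reading of any actual font's tables, and the
function names `drawn_mark` and `is_mixed` are made up for the example.

```python
# Toy model: which mark shape is *drawn* for each cedilla code point in a
# font that follows the behaviour described above.
CEDILLA_S = {"\u015E", "\u015F"}  # S/s with cedilla
CEDILLA_T = {"\u0162", "\u0163"}  # T/t with cedilla


def drawn_mark(ch: str, locl_on: bool) -> str:
    """Return 'comma' or 'cedilla' for the letters of interest, '' otherwise."""
    if ch in CEDILLA_T:
        return "comma"  # remapped regardless of locale
    if ch in CEDILLA_S:
        return "comma" if locl_on else "cedilla"
    return ""


def is_mixed(text: str, locl_on: bool) -> bool:
    """True when both comma and cedilla shapes would appear in the same text."""
    shapes = {drawn_mark(ch, locl_on) for ch in text} - {""}
    return len(shapes) > 1


if __name__ == "__main__":
    sample = "\u015Fi \u0163ar\u0103"  # pre-Unicode 3.0 style encoding
    print(is_mixed(sample, locl_on=False))  # s keeps its cedilla, t gets a comma
    print(is_mixed(sample, locl_on=True))   # both drawn with a comma
```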

> Also, one thing I'm asking myself is: why doesn't Gentium have locl rules (or 
> ccmp rules)? It's a more recent font compared to Doulos and Charis, so the SIL 
> people seem to have changed their minds about it, and I'd like to know their 
> reasons before changing anything in DejaVu.

You need to ask them. IMHO, their implementation of the remapping via ccmp
violates the OpenType 1.4 standard: ccmp should *not* depend on the language.

> 
> So, short conclusion: how it's dealt with seems to just depend on the 
> foundry that made the fonts, and it also seems to depend on who you ask. So 
> far, I haven't seen enough yet to be sure that a locl rule is needed.

There are some variations, but 99% of commercial fonts follow the Adobe standard.
Check on Linotype's website! Unfortunately you cannot check for locl there. But
the Romanian locl issue has been debated to death on typophile forums, and the
opinion leaders there (folks that run foundries) follow the Adobe standard, locl
included.

> Also, don't always assume commercial fonts have it right. As said above, the 
> same fonts have t with comma below in place of t with cedilla, together with an 
> s with cedilla, which is the worst thing you can do here.

Adobe fonts look ok with locl on. Adobe assumed that Microsoft would implement
locl sooner rather than later. InDesign CS3 supports locl in its own renderer.


Comment 7 Ben Laenen 2008-07-22 11:31:44 UTC
OK, I guess it doesn't break anything else to add this (except your Turkish 
texts when reading in Romanian locale...). But it still goes against my 
philosophy of "don't fix problems of the past, but make sure you don't make 
more problems that you need to fix in the future".

So, is this only for latn{ROM} and latn{MOL}, or are there other dialects that 
need it as well? If you know any, the full list of languages that can be used 
in OpenType is at http://www.microsoft.com/typography/otspec/languagetags.htm 
so you can check if it's there.

Similar issue, the s/t with cedilla code point should be canonically the same 
as s/t + combining cedilla. In short, that would mean that when you write such 
a sequence you need to get a t with comma below as well for Romanian. But I'm 
not entirely sure how to do that yet... Probably a "calt" (contextual 
alternate) feature for the combining cedilla, but that's not applied by 
default in Pango unfortunately, we could misuse "ccmp" (glyph 
composition/decomposition) for it, but I'd like to see "calt" turned on once, 
and this way I can use the Romanian community to push Behdad :-)


Comment 8 Vasile Gaburici 2008-07-22 11:55:00 UTC
(In reply to comment #7)
>
> So, is this only for latn{ROM} and latn{MOL}, or are there other dialects that 
> need it as well? If you know any, the full list of languages that can be used 
> in OpenType is at http://www.microsoft.com/typography/otspec/languagetags.htm 
> so you can check if it's there.

As you pointed out, Adobe's fonts do this for latn{MOL} as well. But Moldavians
have their own academy (and country), so I don't know if this is appropriate or
not. Wikipedia doesn't have a page on their alphabet. I guess Adobe is preparing
Moldavians for an Anschluss ;)

No other languages should need it, or if they do, Adobe ignores them for now...

> 
> Similar issue, the s/t with cedilla code point should be canonically the same 
> as s/t + combining cedilla. In short, that would mean that when you write such 
> a sequence you need to get a t with comma below as well for Romanian. But I'm 
> not entirely sure how to do that yet... Probably a "calt" (contextual 
> alternate) feature for the combining cedilla, but that's not applied by 
> default in Pango unfortunately, we could misuse "ccmp" (glyph 
> composition/decomposition) for it, but I'd like to see "calt" turned on once, 
> and this way I can use the Romanian community to push Behdad :-)

Can you provide a test string for the combining business? Fontmatrix does
not rely on pango for OpenType features, so I can test it there.



Comment 9 Ben Laenen 2008-07-22 12:03:33 UTC
U+015E-U+0163 (s and t with cedilla): Ş ş Ţ ţ
U+0218-U+021B (s and t with comma below): Ş ş Ţ ţ
S/T + U+0327 (combining cedilla): Ş ş Ţ ţ
S/T + U+0326 (combining comma below): Ș ș Ț ț

Comment 10 Ben Laenen 2008-07-22 12:04:48 UTC
oops, second line was wrong. This is the correct list:

U+015E-U+0163 (s and t with cedilla): Ş ş Ţ ţ
U+0218-U+021B (s and t with comma below): Ș ș Ț ț
S/T + U+0327 (combining cedilla): Ş ş Ţ ţ
S/T + U+0326 (combining comma below): Ș ș Ț ț
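
For anyone wanting to reproduce the third and fourth lines of this list, the
sketch below uses Python's standard `unicodedata` module to show that each
base letter plus combining mark composes (under NFC normalization) to the
matching precomposed code point. The helper name `precomposed` is made up for
the example; it only checks Unicode equivalence and says nothing about what a
given font or renderer draws.

```python
import unicodedata

COMBINING_COMMA_BELOW = "\u0326"
COMBINING_CEDILLA = "\u0327"


def precomposed(base: str, mark: str) -> str:
    """Compose base + combining mark with NFC and return the result."""
    return unicodedata.normalize("NFC", base + mark)


if __name__ == "__main__":
    for base in "SsTt":
        for mark in (COMBINING_CEDILLA, COMBINING_COMMA_BELOW):
            ch = precomposed(base, mark)
            # A single code point means the sequence has a precomposed form.
            print(f"{base} + U+{ord(mark):04X} -> U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Because the sequences are canonically equivalent to the precomposed letters,
a renderer that treats them differently will show the same text in two ways.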

Comment 11 Vasile Gaburici 2008-07-22 12:10:51 UTC
(In reply to comment #6)

> But the Romanian locl issue has been debated to death on typophile forums, and the
> opinion leaders there (folks that run foundries) follow the Adobe standard, locl
> included.

For reference purposes, I'm linking to John Hudson's comment on typophile:
[http://www.typophile.com/node/2764#comment-22015]. John is co-founder of Tiro
Typeworks, which jointly registered with Adobe the locl feature tag:
[http://www.microsoft.com/typography/otspec/features_ko.htm#locl]


Comment 12 Ben Laenen 2008-07-22 12:14:34 UTC
There's apparently some issue with the Gagauz language as well. The wikipedia 
page uses comma below, but Unicode people don't seem to know what to use 
http://unicode.org/mail-arch/unicode-ml/y2002-m10/0020.html ...

Comment 13 Vasile Gaburici 2008-07-22 12:52:54 UTC
(In reply to comment #12)
> There's apparently some issue with the Gagauz language as well. The wikipedia 
> page uses comma below, but Unicode people don't seem to know what to use 
> http://unicode.org/mail-arch/unicode-ml/y2002-m10/0020.html ...

I checked the OpenType spec: Gagauz has the language tag GAG. So, you can do
something special for it, assuming you know what to do. So far I haven't seen
any fonts that pay attention to it, so I'd say do nothing now. Like the email
you pointed to said, let some Gagauz speak up before we decide anything for them.


Comment 14 Nicolas Mailhot 2008-07-22 13:07:18 UTC
I think that on this subject, the current opinion of a Unicode guru such as
Everson would be very valuable.

Comment 15 Vasile Gaburici 2008-07-22 14:13:40 UTC
(In reply to comment #14)
> I think that on this subject, the current opinion of a Unicode guru such as
> Everson would be very valuable.

This is an OpenType issue, not a Unicode issue, but surely some expert opinion
would not hurt.



Comment 16 Ben Laenen 2008-07-22 14:24:39 UTC
Everson's document about Gagauz is here: 
http://www.evertype.com/alphabets/gagauz.pdf

Basically he says they use comma below, but some may prefer cedilla...

Comment 17 Vasile Gaburici 2008-07-22 14:37:19 UTC
Full quote:

Gagauzi in Russia use Cyrillic; Gagauzi in Romania use Latin. Note that in
Romania, Gagauz uses the characters S WITH COMMA BELOW and T WITH COMMA BELOW.
In inferior Gagauz typography, the glyphs for these characters are sometimes
drawn with CEDILLAs, but it is strongly recommended to avoid this practice.
However, because Gagauz is a Turkic language, it may be left to the user to
decide whether S WITH COMMA BELOW (as in Romanian) or S WITH CEDILLA (as in
Turkish) is preferred.


Comment 18 Vasile Gaburici 2008-07-22 14:41:03 UTC
Btw, Everson is wrong about the use of quotes in Romanian
(http://www.evertype.com/alphabets/romanian.pdf), so I wouldn't take him as the
ultimate guru...


Comment 19 Nicolas Mailhot 2008-07-22 14:43:20 UTC
Everson is a type designer, not just a Unicode expert. And what I meant was his
opinion on the whole locl thing, not on Gagauz only

Comment 20 Ben Laenen 2008-07-24 11:21:40 UTC
OK, I quickly pushed the locl rules for S/T with cedilla in DejaVu before the 
freeze for the next release this weekend. So please test the latest snapshot 
at http://dejavu.sourceforge.net/snapshots/ and see if it works as expected 
(no need to test the condensed fonts, they'll get updated as well soon). Also 
test out if everything else like ligatures and mark placement (combining 
diacritic placement) still work for Romanian.

I didn't make changes to the combining cedilla yet, I don't know how to 
properly tackle that yet.

Comment 21 Vasile Gaburici 2008-07-24 12:45:09 UTC
I had a quick look at the Sans. Results:
• locl - OK
• mark - HM (S/s cedilla both OK, T/t cedilla still shifted)

The rest were tested with locl on:
• salt - OK (checked J)
• liga - OK (checked ff)
• mark - NO (as expected)

I also found a way to make the combining work, see next message.


Comment 22 Vasile Gaburici 2008-07-24 12:48:49 UTC
To make the combining (i) look good and (ii) work with locl you do not need a
contextual substitution. A "ligature" is enough! See the Adobe feature file doc:
you only need a type-4, not a type-6 substitution for this. I decided to put
these in a rlig table for latn{MOL,ROM}. Of course, this rlig has to come before
the locl, so locl can affect it. I'm attaching some screenshots first and
some patches later (there's still some fiddling to do with those).
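
The ordering point can be sketched as a two-stage pipeline. This is only a toy
model over lists of code points standing in for glyphs: the `RLIG` and `LOCL`
tables and the function names are illustrative and are not taken from the
attached patch.

```python
from typing import List

COMBINING_CEDILLA = "\u0327"

# Stage 1, modelled on a type-4 (ligature) substitution:
# base letter + combining cedilla -> precomposed cedilla letter.
RLIG = {
    ("S", COMBINING_CEDILLA): "\u015E",
    ("s", COMBINING_CEDILLA): "\u015F",
    ("T", COMBINING_CEDILLA): "\u0162",
    ("t", COMBINING_CEDILLA): "\u0163",
}

# Stage 2, modelled on the single substitution done by locl.
LOCL = {"\u015E": "\u0218", "\u015F": "\u0219", "\u0162": "\u021A", "\u0163": "\u021B"}


def apply_rlig(glyphs: List[str]) -> List[str]:
    """Replace each (base, combining cedilla) pair with one precomposed glyph."""
    out: List[str] = []
    i = 0
    while i < len(glyphs):
        pair = tuple(glyphs[i:i + 2])
        if pair in RLIG:
            out.append(RLIG[pair])
            i += 2
        else:
            out.append(glyphs[i])
            i += 1
    return out


def apply_locl(glyphs: List[str]) -> List[str]:
    """Map precomposed cedilla glyphs to their comma-below counterparts."""
    return [LOCL.get(g, g) for g in glyphs]


if __name__ == "__main__":
    run = ["T", COMBINING_CEDILLA, "a"]
    print(apply_locl(apply_rlig(run)))  # rlig first: locl sees the precomposed T
    print(apply_rlig(apply_locl(run)))  # locl first: nothing to match, cedilla stays
```

With the order reversed, locl finds no precomposed glyph to replace, so the
sequence ends up as the cedilla form even in Romanian text.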






Comment 23 Vasile Gaburici 2008-07-24 12:50:10 UTC
Created attachment 312552 [details]
My version of Sans, no features enabled - all test glyphs use combining!

Comment 24 Vasile Gaburici 2008-07-24 12:51:17 UTC
Created attachment 312553 [details]
The new ROM rlig feature activated. It looks much better than with mark on!

Comment 25 Vasile Gaburici 2008-07-24 12:52:12 UTC
Created attachment 312554 [details]
Turning on both rlig and locl works as expected.

Comment 26 Vasile Gaburici 2008-07-24 13:03:59 UTC
Created attachment 312555 [details]
Giant patch ball for Romanian comma and cedilla ligatures.

Comment 27 Vasile Gaburici 2008-07-24 13:05:46 UTC
Comment on attachment 312555 [details]
Giant patch ball for Romanian comma and cedilla ligatures.

You could also add a-breve and a/i-circumflex to the rlig, but I don't know
the combining code points for those...

Comment 28 Ben Laenen 2008-07-24 13:41:29 UTC
No, we prefer using anchors to place diacritics. We don't want them in ccmp, 
liga or rlig features. We've had plenty of discussions about this in the past, 
and even had a lot of these as ligatures in the past, but removed them because 
we thought it was a bad idea. With anchors it's just much more maintainable.

The problem with the T/t with cedilla is just the missing cedilla anchor in 
the T and t glyphs. Easy to make it work, but that's counted as "feature" so 
something for after the release :-)
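
A minimal sketch of the anchor idea, assuming the usual mark-to-base model in
which the mark is shifted so that its anchor lands on the base glyph's anchor
of the same class. All glyph names and coordinates below are hypothetical and
do not come from DejaVu's sources.

```python
from typing import Dict, Optional, Tuple

Point = Tuple[int, int]

# Hypothetical anchor data in font units.
BASE_ANCHORS: Dict[str, Dict[str, Point]] = {
    "S": {"cedilla": (320, 0)},
    "T": {},  # no cedilla anchor: the situation described for T/t
}
MARK_ANCHORS: Dict[str, Tuple[str, Point]] = {
    "cedillacomb": ("cedilla", (-170, 0)),
}


def mark_offset(base: str, mark: str) -> Optional[Point]:
    """Offset that makes the mark's anchor coincide with the base's anchor.

    Returns None when the base has no anchor of the mark's class, in which
    case the mark cannot be positioned by this mechanism.
    """
    anchor_class, (mx, my) = MARK_ANCHORS[mark]
    base_anchor = BASE_ANCHORS.get(base, {}).get(anchor_class)
    if base_anchor is None:
        return None
    bx, by = base_anchor
    return (bx - mx, by - my)


if __name__ == "__main__":
    print(mark_offset("S", "cedillacomb"))  # positioned via the anchor pair
    print(mark_offset("T", "cedillacomb"))  # None: anchor missing on the base
```

Adding one anchor to a base glyph covers every mark of that class, which is
the maintainability argument for anchors over per-combination ligatures.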

Comment 29 Vasile Gaburici 2008-07-24 14:02:26 UTC
How about using abvs and blws tables for these ligatures? 

Tag: 'abvs'
Friendly name: Above-base Substitutions
Registered by: Microsoft
Function: Substitutes a ligature for a base glyph and mark that's above it. 
UI suggestion: This feature should be on by default.

Tag: 'blws'
Friendly name: Below-base Substitutions
Registered by: Microsoft
Function: Produces ligatures that comprise of base glyph and below-base forms. 
UI suggestion: This feature should be on by default.


Comment 30 Vasile Gaburici 2008-07-24 14:09:03 UTC
Btw, your suggestion to use calt for changing presumably just the accent when it
follows S or T in Romanian seems to be a bit different from what the standard
says calt is for:

Tag: 'calt'
Friendly name: Contextual Alternates
Registered by: Adobe
Function: In specified situations, replaces default glyphs with alternate forms
which provide better joining behavior. Used in script typefaces which are
designed to have some or all of their glyphs join.


Comment 31 Vasile Gaburici 2008-07-24 14:28:31 UTC
Okay, I finished reading all the OpenType tag descriptions. The current spec
doesn't have a type-6 table designated for the purpose you want (replacing
diacritics). So, if you want to avoid making ligatures at all costs, Red Hat would
have to register a new OpenType tag...


Comment 32 Nicolas Mailhot 2008-07-24 15:03:17 UTC
(In reply to comment #31)
> So, if you want to avoid making ligatures at all costs, Red Hat would
> have to register a new OpenType tag...

Ben is not @rh. He's one of the top DejaVu people, and was kind enough to get an
account in Fedora bugzilla


Comment 33 Ben Laenen 2008-07-24 15:14:18 UTC
Yeah, I'm not even a Fedora user :-)

abvs and blws are used for Indic scripts only.

I think I've mentioned before somewhere, if there's no good feature, we can 
misuse ccmp for it. It's common practice to misuse ccmp to replace i by 
dotless i before a diacritic above, and renderers also apply ccmp by default 
and can handle this.

calt would still be more beautiful though (also for the dotless i situation), 
but it's not applied by default (even though the specs suggest otherwise).
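
The dotless-i substitution mentioned above can be sketched as a simple
contextual rule. This is a toy model over code points, not DejaVu's actual
ccmp lookup: it uses Unicode's canonical combining class 230 as a stand-in for
"diacritic above", only looks at the immediately following character, and the
function name `dotless_before_above` is made up for the example.

```python
import unicodedata

DOTLESS_I = "\u0131"   # LATIN SMALL LETTER DOTLESS I
ABOVE_CLASS = 230      # canonical combining class of marks placed above


def dotless_before_above(text: str) -> str:
    """Replace 'i' by dotless i when the next character is a mark above.

    Simplification: a mark below sitting between the i and the mark above
    is not handled here.
    """
    out = []
    for idx, ch in enumerate(text):
        nxt = text[idx + 1] if idx + 1 < len(text) else ""
        if ch == "i" and nxt and unicodedata.combining(nxt) == ABOVE_CLASS:
            out.append(DOTLESS_I)
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(dotless_before_above("i\u0301"))  # i + combining acute: dot removed
    print(dotless_before_above("i\u0327"))  # i + combining cedilla: unchanged
```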

Comment 34 Vasile Gaburici 2008-07-24 15:45:15 UTC
Well, I tried to add a calt table: I can enter what should be substituted, but
when I try to click on the box where the replacement should go, FontForge
segfaults. So, this feature will have to wait a little longer. Anyway, given the
complete lack of support in commercial fonts for this feature, I don't think
we'll see users asking about it anytime soon...


Comment 35 Vasile Gaburici 2008-07-28 09:34:05 UTC
Fixed upstream in 2.26.


Comment 36 Tony Fu 2008-09-10 03:08:23 UTC
requested by Jens Petersen (#27995)

Comment 37 Christopher D. Stover 2008-10-27 03:20:13 UTC
My understanding is that this issue is resolved.  Please reopen and assign if I am wrong.