After the re_match fix in glibc-2.2.93, I tested syntax highlighting of UTF-8 files again (a .schemas file, for example). As soon as the regex functions encounter a multibyte UTF-8 character, the highlighting after that point is wrong. Is there anything special that needs to be done to get this to work properly? I'll find out more when I have the time.
OK, I've written a small test app that shows my problem. It runs re_match for '"' on a string containing somebody's name, first in UTF-8 and then in ASCII. The problem seems to be that even though both strings take up the same number of characters on screen, strlen counts the UTF-8 string as one character longer. This seems to be the root of my syntax highlighting problems.

    testing utf8...
    blah "Gustavo Giraldez" blah asdf
    length = 35
    match: 6
    match: 24
    testing ascii...
    blah "Gustavo Giraldez" blah asdf
    length = 34
    match: 6
    match: 23

Any ideas how to fix this?
Created attachment 75484: test code
Well, I don't find anything wrong with the testcase. If you want to handle UTF-8, you need to:

a) make the current LC_CTYPE a UTF-8 locale, e.g. with setlocale(LC_CTYPE, "en_US.UTF-8"); setlocale(LC_ALL, ""); works too. Your testcase doesn't do this.
b) at least when MB_CUR_MAX > 1, use the multibyte string routines (mbrlen and the other mb* routines) instead of the plain string ones.

Regex in recent glibc handles multibyte character sets such as UTF-8 well. For example, if LC_CTYPE is set to some UTF-8 locale, the regex Gir.ldez will match the substring, whereas with LC_CTYPE set to e.g. C it will not (in charsets with MB_CUR_MAX == 1 the accented character counts as 2 letters, so a single . cannot match it).
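As an illustration of point b), here is a minimal sketch of counting characters with mbrlen instead of strlen. The helper name mb_char_count is hypothetical and not part of the attached testcase; the result depends on the LC_CTYPE locale in effect, so the caller is assumed to have called setlocale first.

```c
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the multibyte characters in s according to
 * the current LC_CTYPE locale. Returns (size_t)-1 if s contains an
 * invalid or incomplete multibyte sequence for that locale. */
size_t mb_char_count(const char *s)
{
    mbstate_t state;
    size_t remaining = strlen(s);   /* length in bytes */
    size_t count = 0;

    memset(&state, 0, sizeof state);
    while (remaining > 0) {
        size_t n = mbrlen(s, remaining, &state);
        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;      /* invalid or truncated sequence */
        if (n == 0)
            break;                  /* reached a NUL character */
        s += n;
        remaining -= n;
        count++;
    }
    return count;
}
```

After a successful setlocale(LC_CTYPE, "en_US.UTF-8") (assuming that locale is installed), a name containing one two-byte character should be reported one shorter by this function than by strlen, which matches the 35 versus 34 seen in the test output.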
It's been a while since I looked at the bug in the highlighting code. I just tried setting the locale, but I don't know which library setlocale lives in, so ld complains about an undefined reference. Anyway, LANG is already set to en_US.UTF8 here on Red Hat 8.0. Shouldn't that be enough? My point is that re_search counts the UTF-8 character (the 'a' in Giraldez) as 2 positions: re_search returns 24 as the position of the '"' instead of the correct 23. This is probably because that character is encoded as 2 bytes. Is this the intended behavior? Shouldn't re_search count characters instead of bytes? Or is that part of the code not UTF-8 aware?
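To make the byte-versus-character discrepancy concrete, the following sketch converts a byte offset into a character offset by skipping UTF-8 continuation bytes. The function name utf8_byte_to_char_offset is made up for this illustration, and the code assumes the input is valid UTF-8; it is not taken from glibc or gtksourceview.

```c
#include <stddef.h>

/* Hypothetical helper: convert a byte offset into s to a character
 * (code point) offset, assuming s is valid UTF-8. Continuation bytes
 * have the bit pattern 10xxxxxx and do not start a new character. */
size_t utf8_byte_to_char_offset(const char *s, size_t byte_offset)
{
    size_t chars = 0;
    size_t i;

    for (i = 0; i < byte_offset && s[i] != '\0'; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

For a string with a single two-byte character before the match, every byte offset past that character maps to a character offset that is one smaller, which is the same off-by-one as the 24 versus 23 reported here.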
I just committed a patch to gtksourceview that works around the bug in re_search where it returns the number of bytes matched instead of the number of characters. It uses g_utf8_strlen to check whether the value returned by re_search needs adjusting. gtksourceview now highlights UTF-8 text correctly.

    /* Work around a re_search bug where it returns the number of bytes
     * instead of the number of characters (unicode characters can be
     * more than 1 byte) it matched. See redhat bugzilla #73644.
     */
    len = (gint)g_utf8_strlen (text, -1);
    pos += (pos - (gint)g_utf8_strlen (text, pos));
    match->startpos = re_search (&regex->buf, text, len, pos,
                                 (forward ? len - pos : -pos),
                                 &regex->reg);
    if (match->startpos > -1) {
        len = (gint)g_utf8_strlen (text, match->startpos);
        diff = match->startpos - len;
        match->startpos = len;
        match->endpos = regex->reg.end[0] - diff;
    }
re_search is not a well-defined interface. It's code that Emacs uses. If you want reliable results, stick with the POSIX interfaces.
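For reference, a minimal sketch of the POSIX interface (regcomp, regexec, regfree) is shown below. The wrapper name posix_find is hypothetical, and the choice of REG_EXTENDED is an assumption made for the example rather than something stated in this report.

```c
#include <regex.h>

/* Hypothetical wrapper: find the first match of pattern in text.
 * Returns 0 on a match and stores the match bounds in *start and *end;
 * otherwise returns the nonzero code from regcomp or regexec
 * (REG_NOMATCH when the pattern compiles but does not match). */
int posix_find(const char *pattern, const char *text,
               regoff_t *start, regoff_t *end)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;

    rc = regexec(&re, text, 1, m, 0);
    if (rc == 0) {
        *start = m[0].rm_so;    /* byte offset of match start */
        *end = m[0].rm_eo;      /* byte offset just past the match */
    }
    regfree(&re);
    return rc;
}
```

Note that POSIX defines rm_so and rm_eo as byte offsets, so a caller that needs character positions in a UTF-8 string still has to convert them.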