Bug 73644 - regex UTF-8 still broken
Status: CLOSED NOTABUG
Product: Red Hat Linux
Classification: Retired
Component: glibc
Version: 8.0
Hardware: i386 Linux
Priority: medium Severity: medium
Assigned To: Jakub Jelinek
QA Contact: Brian Brock
Reported: 2002-09-07 14:06 EDT by jeroen
Modified: 2016-11-24 10:22 EST

Last Closed: 2003-04-22 03:12:21 EDT
Attachments
test code (1.42 KB, text/plain), 2002-09-08 08:03 EDT, jeroen

Description jeroen 2002-09-07 14:06:25 EDT
After the re_match fix in glibc-2.2.93, I've tested syntax highlighting of UTF-8
files again (a .schemas file, for example).

As soon as the regex functions encounter a UTF-8 character, the highlighting is
screwed up. Is there anything special that needs to be done to get this to work
properly?

I'll find out more when I have the time.
Comment 1 jeroen 2002-09-08 08:02:12 EDT
OK, I've written a small test app that shows my problem. It does a re_match for
'"' on somebody's name, first in UTF-8 and then in ASCII. The problem seems to be
that even though both strings take up the same number of characters on screen,
strlen counts the UTF-8 string as one character longer. This seems to be the
root of my syntax highlighting problems.

testing utf8...
blah "Gustavo Giráldez" blah asdf
length = 35
match: 6
match: 24

testing ascii...
blah "Gustavo Giraldez" blah asdf
length = 34
match: 6
match: 23

Any ideas how to fix this?
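
The one-unit difference in the reported lengths is what you would expect if one character of the name is encoded as two bytes in UTF-8. A minimal sketch of the distinction follows, assuming the input is valid UTF-8; utf8_char_count and extra_bytes are hypothetical helper names, not part of glibc or of the attached test code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count characters in a valid UTF-8 string by
 * skipping continuation bytes (bit pattern 10xxxxxx). */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Hypothetical helper: how many more bytes than characters the string
 * has. strlen counts bytes, so this is 0 for pure ASCII text. */
size_t extra_bytes(const char *s)
{
    return strlen(s) - utf8_char_count(s);
}
```

Calling extra_bytes on the UTF-8 test string gives the number of positions by which a byte-based offset can run ahead of the on-screen column.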
Comment 2 jeroen 2002-09-08 08:03:00 EDT
Created attachment 75484
test code
Comment 3 Jakub Jelinek 2002-09-08 16:37:00 EDT
Well, I don't find anything wrong with the testcase.
If you want to handle UTF-8, you need to
a) have the current LC_CTYPE set to a UTF-8 locale (e.g. setlocale(LC_CTYPE, "en_US.UTF-8"),
   which your testcase doesn't do); setlocale(LC_ALL, "") works too
b) at least if MB_CUR_MAX > 1, use the multibyte string routines (mbrlen and the
   other mb* routines) instead of the plain string ones
Regex in recent glibc handles multibyte character sets (such as UTF-8) well,
e.g. if LC_CTYPE is set to some UTF-8 locale, then the regex Gir.ldez will match
the substring, whereas if LC_CTYPE were e.g. C it would not (since in MB_CUR_MAX == 1
charsets the accented character is 2 letters).
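
The locale dependence described above can be tried out with a sketch like the following, which uses the POSIX regcomp/regexec calls rather than re_match. The function name matches_in_locale and its return convention are assumptions made for illustration, and whether a given UTF-8 locale is installed depends on the system.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: switch LC_CTYPE to `locale`, then test whether
 * `pattern` matches anywhere in `text`.
 * Returns 1 on match, 0 on no match, -1 if the locale is not available,
 * -2 if the pattern does not compile. */
int matches_in_locale(const char *locale, const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(locale != NULL && pattern != NULL && text != NULL);

    /* The locale must be set before regcomp, since the compiled
     * pattern depends on the character set in effect. */
    if (setlocale(LC_CTYPE, locale) == NULL)
        return -1;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -2;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Going by the explanation above, matches_in_locale("en_US.UTF-8", "Gir.ldez", name) would be expected to report a match when name holds the UTF-8 encoded name, while the same call with "C" would not, because the dot then has to cover two bytes. Note that setlocale is declared in <locale.h> and lives in libc itself, so no extra library is needed at link time.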
Comment 4 jeroen 2003-01-03 10:38:54 EST
It's been a while since I looked at the bug in the highlighting code. I just
tried setting the locale, but I don't know which library the function setlocale
is located in, so ld complains about an undefined reference. Anyway, LANG is
already set to en_US.UTF8 here on Red Hat 8.0. Shouldn't that be enough?

My point is that re_search counts the UTF8 character (the 'a' in Giraldez) as 2
positions: re_search returns 24 as the position of the '"' instead of the
correct 23. This is probably because the Unicode character takes 2 bytes.

Is this the intended behavior? Shouldn't re_search count characters instead of
bytes? Or is that part of the code non-UTF8?
Comment 5 jeroen 2003-01-04 09:00:38 EST
I just committed a patch to gtksourceview that works around the bug in re_search
where it returns the number of bytes matched instead of the number of
characters. It uses g_utf8_strlen to check if the value returned by re_search
needs adjusting. gtksourceview now highlights UTF8 text correctly.

	/* Work around a re_search bug where it returns the number of bytes
	 * instead of the number of characters (unicode characters can be
	 * more than 1 byte) it matched. See redhat bugzilla #73644. */
	len = (gint)g_utf8_strlen (text, -1);
	pos += (pos - (gint)g_utf8_strlen (text, pos));

	match->startpos = re_search (&regex->buf,
				     text,
				     len,
				     pos,
				     (forward ? len - pos : -pos),
				     &regex->reg);

	if (match->startpos > -1) {
		len = (gint)g_utf8_strlen (text, match->startpos);
		diff = match->startpos - len;
		match->startpos = len;
		match->endpos = regex->reg.end[0] - diff;
	}
Comment 6 Ulrich Drepper 2003-04-22 03:12:21 EDT
re_search is not a well-defined interface.  It's some code Emacs uses.  If you want
reliable results, stick with the POSIX interfaces.
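
For anyone who needs character positions rather than byte positions, here is a rough sketch of the same adjustment done on top of the POSIX interface. regexec reports rm_so and rm_eo as byte offsets, so the sketch converts them afterwards. The names utf8_chars_before and search_char_offsets are hypothetical, and the text is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: number of UTF-8 characters that start within the
 * first `byte_off` bytes of `s` (continuation bytes are skipped). */
static size_t utf8_chars_before(const char *s, size_t byte_off)
{
    size_t i, n = 0;

    for (i = 0; i < byte_off && s[i] != '\0'; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Hypothetical helper: search `text` for `pattern` with regcomp/regexec
 * and store the match boundaries as character offsets.
 * Returns 0 on a match, or the nonzero regcomp/regexec error code. */
int search_char_offsets(const char *pattern, const char *text,
                        size_t *start, size_t *end)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    assert(pattern != NULL && text != NULL && start != NULL && end != NULL);

    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;

    rc = regexec(&re, text, 1, m, 0);
    regfree(&re);
    if (rc != 0)
        return rc;

    /* rm_so and rm_eo are byte offsets into text. */
    *start = utf8_chars_before(text, (size_t)m[0].rm_so);
    *end = utf8_chars_before(text, (size_t)m[0].rm_eo);
    return 0;
}
```

The conversion is linear in the offset, so a caller that searches the same buffer repeatedly may prefer to keep byte offsets internally and convert only when a character column is actually needed.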
