Bug 73644

Summary: regex UTF-8 still broken
Product: [Retired] Red Hat Linux
Reporter: jeroen <jeroen>
Component: glibc
Assignee: Jakub Jelinek <jakub>
Status: CLOSED NOTABUG
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 8.0
CC: fweimer
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2003-04-22 07:12:21 UTC
Attachments:
Description Flags
test code none

Description jeroen 2002-09-07 18:06:25 UTC
After the re_match fix in glibc-2.2.93, I've tested syntax highlighting of UTF-8
files again (a .schemas file, for example).

As soon as the regex functions encounter a multibyte UTF-8 character, the
highlighting is screwed up. Is there anything special that needs to be done to
get this to work properly?

I'll find out more when I have the time.

Comment 1 jeroen 2002-09-08 12:02:12 UTC
OK, I've written a small test app that shows my problem. It does a re_match for
'"' on somebody's name, first in UTF-8 and then in ASCII. The problem seems to be
that even though both strings take up the same number of characters on screen,
strlen counts the UTF-8 string as one longer. This seems to be the root of my
syntax highlighting problems.

testing utf8...
blah "Gustavo Giraldez" blah asdf
length = 35
match: 6
match: 24

testing ascii...
blah "Gustavo Giraldez" blah asdf
length = 34
match: 6
match: 23

Any ideas how to fix this?

Comment 2 jeroen 2002-09-08 12:03:00 UTC
Created attachment 75484 [details]
test code
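
The attachment itself is not reproduced in this report. As a rough illustration
of the length difference described in comment 1, here is a minimal sketch that
compares the byte length reported by strlen with a count of UTF-8 characters.
The helper names utf8_char_count and extra_bytes are made up for this example,
and the accented letter is assumed to be U+00E1 (bytes 0xC3 0xA1); neither is
taken from the attached test code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count the characters (code points) in a NUL-terminated UTF-8 string.
 * Continuation bytes look like 10xxxxxx and are not counted.
 * Assumes the input is valid UTF-8. (Hypothetical helper.) */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* How many more bytes strlen reports than there are characters.
 * (Hypothetical helper.) */
size_t extra_bytes(const char *s)
{
    return strlen(s) - utf8_char_count(s);
}
```

For a pure ASCII string extra_bytes returns 0; for a string with one two-byte
character it returns 1, which is the one-position difference seen in the output
above.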

Comment 3 Jakub Jelinek 2002-09-08 20:37:00 UTC
Well, I don't find anything wrong with the testcase.
If you want to handle UTF-8, you need to
a) set the current LC_CTYPE to a UTF-8 locale (e.g. setlocale(LC_CTYPE, "en_US.UTF-8");
   your testcase doesn't do this); setlocale(LC_ALL, ""); works too
b) at least if MB_CUR_MAX > 1, use the multibyte string routines (mbrlen and the
   other mb* routines) instead of the plain string ones
Regex in recent glibc handles multibyte character sets (such as UTF-8) well,
e.g. if LC_CTYPE is set to some UTF-8 locale, then the regex Gir.ldez will match
the substring, whereas if LC_CTYPE were e.g. C, it would not (since in MB_CUR_MAX == 1
charsets the accented letter counts as 2 characters).
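
A minimal sketch of point a) combined with the Gir.ldez example, using the POSIX
regcomp/regexec interface. The helper names select_utf8_ctype and regex_matches
are invented for this illustration, the list of locale names is only a guess at
what a given system may have installed, and the accented letter is assumed to be
U+00E1. The locale has to be selected before regcomp is called.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Try a few common UTF-8 locale names for LC_CTYPE.
 * Returns 1 if one could be selected, 0 otherwise. (Hypothetical helper;
 * which names exist depends on the system.) */
int select_utf8_ctype(void)
{
    static const char *const names[] = { "en_US.UTF-8", "C.UTF-8", "en_US.utf8" };
    size_t i;

    for (i = 0; i < sizeof names / sizeof names[0]; i++) {
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return 1;
    }
    return 0;
}

/* Returns 1 if pattern matches somewhere in text, 0 if it does not,
 * and -1 if the pattern fails to compile. (Hypothetical helper.) */
int regex_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Calling select_utf8_ctype() first and then regex_matches("Gir.ldez", ...) on the
UTF-8 encoded name should report a match, because '.' then consumes the whole
two-byte character.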

Comment 4 jeroen 2003-01-03 15:38:54 UTC
It's been a while since I looked at the bug in the highlighting code. I just
tried setting the locale, but I don't know which library the function setlocale
is located in, so ld complains about an undefined reference. Anyway, LANG is
already set to en_US.UTF8 here on Red Hat 8.0. Shouldn't that be enough?

My point is that re_search counts the UTF-8 character (the accented 'a' in
Giraldez) as 2 positions: re_search returns 24 as the position of the '"'
instead of the correct 23. This is probably because the Unicode character is
encoded as 2 bytes.

Is this the intended behavior? Shouldn't re_search count characters instead of
bytes? Or is that part of the code non-UTF8?
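
To make the byte-versus-character distinction concrete, here is a small sketch
that turns a byte offset, such as one returned by a regex search, into a
character offset. The function name utf8_byte_to_char_offset is hypothetical,
the input is assumed to be valid UTF-8, and the accented letter in the usage
note is assumed to be U+00E1.

```c
#include <assert.h>
#include <stddef.h>

/* Convert a byte offset into a UTF-8 string to a character offset by
 * counting the bytes before it that start a character (anything that is
 * not a 10xxxxxx continuation byte). Stops early at the terminating NUL.
 * (Hypothetical helper, assumes valid UTF-8.) */
size_t utf8_byte_to_char_offset(const char *s, size_t byte_off)
{
    size_t i, chars = 0;

    assert(s != NULL);
    for (i = 0; i < byte_off && s[i] != '\0'; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

With zero-based offsets, the closing '"' in the UTF-8 test string sits at byte
offset 23 but character offset 22, the same one-position gap as the 24 versus 23
reported above.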

Comment 5 jeroen 2003-01-04 14:00:38 UTC
I just committed a patch to gtksourceview that works around the bug in re_search
where it returns the number of bytes matched instead of the number of
characters. It uses g_utf8_strlen to check if the value returned by re_search
needs adjusting. gtksourceview now highlights UTF8 text correctly.

	/* Work around a re_search bug where it returns the number of bytes
	 * instead of the number of characters (unicode characters can be
	 * more than 1 byte) it matched. See redhat bugzilla #73644. */
	len = (gint)g_utf8_strlen (text, -1);
	pos += (pos - (gint)g_utf8_strlen (text, pos));

	match->startpos = re_search (&regex->buf,
				     text,
				     len,
				     pos,
				     (forward ? len - pos : -pos),
				     &regex->reg);

	if (match->startpos > -1) {
		len = (gint)g_utf8_strlen (text, match->startpos);
		diff = match->startpos - len;
		match->startpos = len;
		match->endpos = regex->reg.end[0] - diff;
	}

Comment 6 Ulrich Drepper 2003-04-22 07:12:21 UTC
re_search is not a well-defined interface.  It's some code Emacs uses.  If you
want reliable results, stick with the POSIX interfaces.
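
For reference, a minimal sketch of locating the quote characters with the POSIX
interfaces (regcomp, regexec, regfree) rather than re_search. The function name
find_all and its signature are made up for this example. Note that POSIX defines
the rm_so and rm_eo fields of regmatch_t as byte offsets, so a caller that needs
character positions still has to convert them.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Store the byte offsets of up to max non-overlapping matches of pattern
 * in text. Returns the number of matches stored, or -1 if the pattern does
 * not compile. Scanning stops at the first empty match to avoid looping.
 * (Hypothetical helper.) */
int find_all(const char *pattern, const char *text, size_t *offsets, int max)
{
    regex_t re;
    regmatch_t m;
    size_t base = 0;
    int n = 0;

    assert(pattern != NULL && text != NULL && offsets != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* After the first match we resume mid-string, so tell regexec that the
     * start of the buffer is not the beginning of a line. */
    while (n < max &&
           regexec(&re, text + base, 1, &m, base ? REG_NOTBOL : 0) == 0) {
        offsets[n++] = base + (size_t)m.rm_so;
        if (m.rm_eo == m.rm_so)
            break;
        base += (size_t)m.rm_eo;
    }
    regfree(&re);
    return n;
}
```

Searching for "\"" in the two test strings yields two offsets each; the second
offset is one larger for the UTF-8 string because the offsets count bytes.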