73644 – regex UTF-8 still broken

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	glibc
Sub Component:
Version:	8.0
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Jakub Jelinek
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2002-09-07 18:06 UTC by jeroen
Modified:	2016-11-24 15:22 UTC
CC List:	1 user
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2003-04-22 07:12:21 UTC
Embargoed:

Attachments
test code (1.42 KB, text/plain) 2002-09-08 12:03 UTC, jeroen

Description jeroen 2002-09-07 18:06:25 UTC

After the re_match fix in glibc-2.2.93, I've tested syntax highlighting of UTF-8
files again (a .schemas file, for example).

As soon as the regex methods encounter a UTF-8 character, the highlighting is
screwed up. Is there anything special that needs to be done to get this to work
properly?

I'll find out more when I have the time.

Comment 1 jeroen 2002-09-08 12:02:12 UTC

OK, I've written a small test app that shows my problem. It runs re_match for
'"' on somebody's name, first in UTF-8 and then in ASCII. The problem seems to
be that even though both strings take up the same number of characters on
screen, strlen reports the UTF-8 string as one longer. This seems to be the
root of my syntax highlighting problems.

testing utf8...
blah "Gustavo Giraldez" blah asdf
length = 35
match: 6
match: 24

testing ascii...
blah "Gustavo Giraldez" blah asdf
length = 34
match: 6
match: 23

Any ideas how to fix this?
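
The one-unit difference in the output above is what you would expect if strlen counts bytes while the screen shows characters. A minimal sketch of the distinction follows, assuming the accented letter in the name is U+00E1, encoded in UTF-8 as the two bytes C3 A1 (the report does not show the raw bytes). The helper name utf8_char_count is made up for this illustration, and it assumes valid UTF-8 input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count UTF-8 characters (code points) in s.
 * Continuation bytes always look like 10xxxxxx, so every byte that does
 * NOT match that pattern starts a new character.
 * Assumes s is valid UTF-8; no validation is done. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the string "Gir\xc3\xa1ldez", strlen gives 9 (bytes) while utf8_char_count gives 8 (characters), which mirrors the 35 versus 34 seen in the test output.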

Comment 2 jeroen 2002-09-08 12:03:00 UTC

Created attachment 75484
test code

Comment 3 Jakub Jelinek 2002-09-08 20:37:00 UTC

Well, I don't find anything wrong with the testcase.
If you want to handle UTF-8, you need to
a) have the current LC_CTYPE set to a UTF-8 locale (e.g.
   setlocale(LC_CTYPE, "en_US.UTF-8"); your testcase doesn't do this);
   setlocale(LC_ALL, ""); works too
b) at least if MB_CUR_MAX > 1, use the multibyte string routines (mbrlen and
   the other mb* routines) instead of the plain string ones
Regex in recent glibc handles multibyte character sets (such as UTF-8) well,
e.g. if LC_CTYPE is set to some UTF-8 locale, then the regex Gir.ldez will match
the substring, although if LC_CTYPE was e.g. C, it would not (since in
MB_CUR_MAX == 1 charsets the accented letter counts as 2 characters).
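
The Gir.ldez behaviour described above can be tried with the POSIX regcomp/regexec calls. This is only a sketch: matches_in_locale is a hypothetical helper, not glibc API. The test string again assumes the accented letter is U+00E1 (bytes C3 A1). Whether a given UTF-8 locale name is installed depends on the system, so the helper reports that case separately.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: does `pattern` match `text` when LC_CTYPE is `locale`?
 * Returns 1 on match, 0 on no match, -1 if the locale is not available or
 * the pattern does not compile. The previous LC_CTYPE is restored. */
int matches_in_locale(const char *locale, const char *pattern, const char *text)
{
    const char *cur;
    char *saved = NULL;
    int result = -1;

    assert(locale != NULL && pattern != NULL && text != NULL);

    /* Copy the current locale name; setlocale may reuse its buffer. */
    cur = setlocale(LC_CTYPE, NULL);
    if (cur != NULL) {
        saved = malloc(strlen(cur) + 1);
        if (saved != NULL)
            strcpy(saved, cur);
    }

    if (setlocale(LC_CTYPE, locale) != NULL) {
        regex_t re;
        /* Compile after setlocale so the pattern sees the new charset. */
        if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) == 0) {
            result = (regexec(&re, text, 0, NULL, 0) == 0) ? 1 : 0;
            regfree(&re);
        }
    }

    if (saved != NULL) {
        setlocale(LC_CTYPE, saved);
        free(saved);
    }
    return result;
}
```

Called as matches_in_locale("en_US.UTF-8", "Gir.ldez", "Gir\xc3\xa1ldez"), this should report a match where that locale is installed. With "C" it should not, because there the dot consumes a single byte and the accented letter is two.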

Comment 4 jeroen 2003-01-03 15:38:54 UTC

It's been a while since I looked at the bug in the highlighting code. I just
tried setting the locale, but I don't know which library the function setlocale
lives in, so ld complains about an undefined reference. Anyway, LANG is
already set to en_US.UTF8 here on Red Hat 8.0. Shouldn't that be enough?

My point is that re_search counts the UTF-8 character (the 'a' in Giraldez) as 2
positions: re_search returns 24 as the position of the '"' instead of the
correct 23. This is probably because the Unicode character is 2 bytes.

Is this the intended behavior? Shouldn't re_search count characters instead of
bytes? Or is that part of the code non-UTF8?
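
On the LANG question: a C program starts out in the "C" locale whatever LANG says, and only picks up the environment once it calls setlocale. That function is declared in <locale.h> and is part of the C library itself, so no extra -l flag should be needed. A small sketch of checking this is below. select_ctype_locale is a made-up name, and the function deliberately leaves the new locale in place, the way a program would do once at startup.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: select `name` as the LC_CTYPE locale and return the
 * resulting MB_CUR_MAX (the maximum number of bytes per character).
 * Pass "" to take the locale from the environment (LC_ALL, LC_CTYPE, LANG).
 * Returns 0 if the locale cannot be selected. */
size_t select_ctype_locale(const char *name)
{
    assert(name != NULL);
    if (setlocale(LC_CTYPE, name) == NULL)
        return 0;
    return (size_t)MB_CUR_MAX;
}
```

In the "C" locale this returns 1. After select_ctype_locale("") with LANG pointing at an installed UTF-8 locale it should return something larger, which is the MB_CUR_MAX > 1 case mentioned in comment 3.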

Comment 5 jeroen 2003-01-04 14:00:38 UTC

I just committed a patch to gtksourceview that works around the bug in re_search
where it returns the number of bytes matched instead of the number of
characters. It uses g_utf8_strlen to check if the value returned by re_search
needs adjusting. gtksourceview now highlights UTF8 text correctly.

	/* Work around a re_search bug where it returns the number of bytes
	 * instead of the number of characters (unicode characters can be
	 * more than 1 byte) it matched. See redhat bugzilla #73644. */
	len = (gint)g_utf8_strlen (text, -1);
	pos += (pos - (gint)g_utf8_strlen (text, pos));

	match->startpos = re_search (&regex->buf,
				     text,
				     len,
				     pos,
				     (forward ? len - pos : -pos),
				     &regex->reg);

	if (match->startpos > -1) {
		len = (gint)g_utf8_strlen (text, match->startpos);
		diff = match->startpos - len;
		match->startpos = len;
		match->endpos = regex->reg.end[0] - diff;
	}

Comment 6 Ulrich Drepper 2003-04-22 07:12:21 UTC

re_search is not a well-defined interface.  It's some code Emacs uses.  If you
want reliable results, stick with the POSIX interfaces.
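
For reference, a minimal sketch of what a search through the POSIX interface might look like. The names byte_span and find_first are invented for this example. Note that POSIX defines rm_so and rm_eo as byte offsets, so a caller that wants character positions for UTF-8 text still has to convert them itself.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical result type: found is 1 on a match, 0 if nothing matched
 * (or the search failed), -1 if the pattern did not compile.
 * start/end are BYTE offsets; end is one past the last matched byte. */
typedef struct {
    int found;
    long start;
    long end;
} byte_span;

/* Hypothetical helper: first match of an extended regex in text. */
byte_span find_first(const char *pattern, const char *text)
{
    byte_span span = { -1, -1, -1 };
    regex_t re;
    regmatch_t m[1];

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return span;

    if (regexec(&re, text, 1, m, 0) == 0) {
        span.found = 1;
        span.start = (long)m[0].rm_so;
        span.end = (long)m[0].rm_eo;
    } else {
        span.found = 0;
    }
    regfree(&re);
    return span;
}
```

Searching "blah \"Gir\xc3\xa1ldez\" x" for a quoted string with the pattern "\"[^\"]*\"" gives start 5 and end 16. The quoted text is 10 characters long but 11 bytes, and the offsets reflect the bytes.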
