Bug 73644

Summary:

regex UTF-8 still broken

Product:

[Retired] Red Hat Linux

Reporter:

jeroen <jeroen>

Component:

glibc

Assignee:

Jakub Jelinek <jakub>

Status:

CLOSED NOTABUG

QA Contact:

Brian Brock <bbrock>

Severity:

medium

Docs Contact:

Priority:

medium

Version:

8.0

CC:

fweimer

Target Milestone:

---

Target Release:

---

Hardware:

i386

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2003-04-22 07:12:21 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
test code	none

Description jeroen 2002-09-07 18:06:25 UTC

After the re_match fix in glibc-2.2.93, i've tested syntax highlighting of UTF-8
files again (a .schemas file for example).

As soon as the regex methods encounter a UTF-8 character, the highlighting is
screwed up. Is there anything special that needs to be done to get this to work
properly?

I'll find out more when i have the time.

Comment 1 jeroen 2002-09-08 12:02:12 UTC

OK, i've written a small test app that shows my problem. It does a '"' re_match
on somebody name, first in UTF-8 and then in ASCII. Now the problem seems to be
that even though in UTF-8, the strings use the same number of characters on
screen, strlen counts the UTF-8 string as one character longer. It seems this is
the root of my syntax highlighting problems.

testing utf8...
blah "Gustavo Giraldez" blah asdf
length = 35
match: 6
match: 24

testing ascii...
blah "Gustavo Giraldez" blah asdf
length = 34
match: 6
match: 23

Any ideas how to fix this?

Comment 2 jeroen 2002-09-08 12:03:00 UTC

Created attachment 75484 [details]
test code

Comment 3 Jakub Jelinek 2002-09-08 20:37:00 UTC

Well, I don't find anything wrong with the testcase.
If you want to handle UTF-8, you need to
a) have current LC_CTYPE as UTF-8 locale (like setlocale(LC_CTYPE, "en_US.UTF-8") etc. -
   your testcase doesn't do this) - setlocale(LC_ALL, ""); works too
b) at least if MB_CUR_MAX > 1, use the multibyte string routines instead of
   string ones, like mbrlen and other mb* routines)
Regex in recent glibc handles multibyte character sets (such as UTF-8) well,
e.g. if LC_CTYPE is set to some UTF-8 locale, then Gir.ldez regex will match
the substring, although if LC_CTYPE was e.g. C, it would not (since in MB_CUR_MAX == 1
charsets it is 2 letters).

Comment 4 jeroen 2003-01-03 15:38:54 UTC

It's been a while since i looked at the bug in the highlighting code. I just
tried setting the locale, but i don't know in which library the method setlocale
is located, so ld complains about an undefined reference. Anyway, LANG is
already set to en_US.UTF8 here on red hat 8.0. Shouldn't that be enough?

My point is that re_search counts the UTF8 character (the 'a' in Giraldez) as 2
positions: re_search returns 24 as the position of the '"' instead of the
correct 23. This is probably because the unicode character is 2 bytes.

Is this the intended behavior? Shouldn't re_search count characters instead of
bytes? Or is that part of the code non-UTF8?

Comment 5 jeroen 2003-01-04 14:00:38 UTC

I just committed a patch to gtksourceview that works around the bug in re_search
where it returns the number of bytes matched instead of the number of
characters. It uses g_utf8_strlen to check if the value returned by re_search
needs adjusting. gtksourceview now highlights UTF8 text correctly.

	/* Work around a re_search bug where it returns the number of bytes
	 * instead of the number of characters (unicode characters can be
	 * more than 1 byte) it matched. See redhat bugzilla #73644. */
	len = (gint)g_utf8_strlen (text, -1);
	pos += (pos - (gint)g_utf8_strlen (text, pos));

	match->startpos = re_search (&regex->buf,
				     text,
				     len,
				     pos,
				     (forward ? len - pos : -pos),
				     &regex->reg);

	if (match->startpos > -1) {
		len = (gint)g_utf8_strlen (text, match->startpos);
		diff = match->startpos - len;
		match->startpos = len;
		match->endpos = regex->reg.end[0] - diff;
	}

Comment 6 Ulrich Drepper 2003-04-22 07:12:21 UTC

re_search is no well-defined interface.  It's some code emacs uses.  If you want
reliable results stick with the POSIX interfaces.