Bug 73644
Summary: | regex UTF-8 still broken
---|---
Product: | [Retired] Red Hat Linux
Component: | glibc
Version: | 8.0
Hardware: | i386
OS: | Linux
Status: | CLOSED NOTABUG
Severity: | medium
Priority: | medium
Reporter: | jeroen <jeroen>
Assignee: | Jakub Jelinek <jakub>
QA Contact: | Brian Brock <bbrock>
CC: | fweimer
Doc Type: | Bug Fix
Last Closed: | 2003-04-22 07:12:21 UTC

Description
jeroen
2002-09-07 18:06:25 UTC
OK, I've written a small test app that shows my problem. It runs re_match for '"' against somebody's name, first in UTF-8 and then in ASCII. The problem seems to be that even though the two strings take up the same number of characters on screen, strlen counts the UTF-8 string as one character longer. It seems this is the root of my syntax highlighting problems.

    testing utf8...
    blah "Gustavo Giraldez" blah asdf
    length = 35
    match: 6
    match: 24
    testing ascii...
    blah "Gustavo Giraldez" blah asdf
    length = 34
    match: 6
    match: 23

Any ideas how to fix this?

Created attachment 75484: test code
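
The one-unit difference between the two lengths above is what one would expect if strlen counts bytes and one character of the UTF-8 string is encoded as two bytes. A minimal sketch of that distinction follows. The helper name utf8_char_count is hypothetical (it is not part of the attached test code), and it assumes the input is valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the characters (code points) in a
 * NUL-terminated UTF-8 string. Continuation bytes have the bit
 * pattern 10xxxxxx and are skipped, so every character is counted
 * exactly once. Assumes the input is valid UTF-8. */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the string "Gir\xc3\xa1ldez", where the accented letter is written as an assumed two-byte sequence, strlen returns 9 while utf8_char_count returns 8. For plain ASCII input the two values are equal.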
Well, I don't find anything wrong with the testcase. If you want to handle UTF-8, you need to:

a) have the current LC_CTYPE set to a UTF-8 locale (e.g. setlocale(LC_CTYPE, "en_US.UTF-8"), which your testcase doesn't do); setlocale(LC_ALL, "") works too
b) at least if MB_CUR_MAX > 1, use the multibyte string routines (mbrlen and the other mb* routines) instead of the plain string ones

Regex in recent glibc handles multibyte character sets (such as UTF-8) well. For example, if LC_CTYPE is set to some UTF-8 locale, the regex Gir.ldez will match the substring, whereas if LC_CTYPE were e.g. C, it would not (since in MB_CUR_MAX == 1 charsets that character is 2 letters).

It's been a while since I looked at the bug in the highlighting code. I just tried setting the locale, but I don't know which library setlocale is located in, so ld complains about an undefined reference. Anyway, LANG is already set to en_US.UTF8 here on Red Hat 8.0. Shouldn't that be enough?

My point is that re_search counts the UTF-8 character (the 'a' in Giraldez) as 2 positions: re_search returns 24 as the position of the '"' instead of the correct 23. This is probably because the Unicode character is 2 bytes. Is this the intended behavior? Shouldn't re_search count characters instead of bytes? Or is that part of the code non-UTF-8?

I just committed a patch to gtksourceview that works around the bug in re_search where it returns the number of bytes matched instead of the number of characters. It uses g_utf8_strlen to check whether the value returned by re_search needs adjusting. gtksourceview now highlights UTF-8 text correctly.

    /* Work around a re_search bug where it returns the number of bytes
     * instead of the number of characters (unicode characters can be
     * more than 1 byte) it matched. See redhat bugzilla #73644. */
    len = (gint)g_utf8_strlen (text, -1);
    pos += (pos - (gint)g_utf8_strlen (text, pos));
    match->startpos = re_search (&regex->buf, text, len, pos,
                                 (forward ? len - pos : -pos), &regex->reg);
    if (match->startpos > -1) {
        len = (gint)g_utf8_strlen (text, match->startpos);
        diff = match->startpos - len;
        match->startpos = len;
        match->endpos = regex->reg.end[0] - diff;
    }

re_search is not a well-defined interface. It's some code Emacs uses. If you want reliable results, stick with the POSIX interfaces.
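
For readers following the advice to use the POSIX interfaces, here is a minimal sketch built on regcomp and regexec from <regex.h>. POSIX defines the rm_so and rm_eo fields of regmatch_t as byte offsets, so a caller that needs character positions (as the highlighting code does) still has to convert. The helper names utf8_offset_to_chars and find_match are hypothetical, and the conversion assumes valid UTF-8 with offsets that fall on character boundaries.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: number of UTF-8 characters contained in the
 * first byte_off bytes of s. Continuation bytes (10xxxxxx) are not
 * counted. Assumes valid UTF-8 and byte_off on a character boundary. */
size_t utf8_offset_to_chars(const char *s, size_t byte_off)
{
    size_t chars = 0;

    for (size_t i = 0; i < byte_off; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}

/* Hypothetical helper: find the first match of pattern in text,
 * searching from byte offset start_byte. On success, stores the
 * match start both as a byte offset and as a character offset
 * (each relative to the beginning of text) and returns 0.
 * Returns -1 if the pattern does not compile or does not match. */
int find_match(const char *pattern, const char *text, size_t start_byte,
               size_t *byte_pos, size_t *char_pos)
{
    regex_t re;
    regmatch_t m;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* When resuming mid-string, tell regexec that the start of the
     * buffer is not the beginning of a line. */
    rc = regexec(&re, text + start_byte, 1, &m,
                 start_byte > 0 ? REG_NOTBOL : 0);
    regfree(&re);
    if (rc != 0)
        return -1;

    *byte_pos = start_byte + (size_t)m.rm_so;   /* byte offset */
    *char_pos = utf8_offset_to_chars(text, *byte_pos);
    return 0;
}
```

With the text "blah \"Gir\xc3\xa1ldez\" blah" (the accented letter written as an assumed two-byte sequence), searching for '"' from byte 6 yields byte offset 15 and character offset 14, the same kind of one-position gap described in the report. Keeping both offsets avoids adjusting positions after the fact: the byte offset is what regexec needs to resume a search, and the character offset is what a text widget typically expects.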