Bug 112192 - C++ Wide streams do not work with UTF-8 locale
Status: CLOSED CURRENTRELEASE
Product: Fedora
Classification: Fedora
Component: gcc
Version: 1
Hardware: i386 Linux
Priority: medium
Severity: medium
Assigned To: Benjamin Kosnik
Reported: 2003-12-15 15:53 EST by jouni
Modified: 2013-08-09 01:46 EDT (History)
Doc Type: Bug Fix
Last Closed: 2004-11-27 08:39:27 EST
Attachments:
The mentioned source code (962 bytes, text/plain)
2003-12-15 15:55 EST, jouni

Description jouni 2003-12-15 15:53:41 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4.1)
Gecko/20031114

Description of problem:
I have been trying to get wchar_t streams working on Fedora.
Shouldn't the following code write a correct UTF-8 stream both to
wcout and to the file?

In the real source file, src contains a UTF-8 encoded string.

The output of "printf("\n===\n%s%ls===\n", src, dst);" is two identical,
correctly UTF-8 encoded strings.

However, the 'wcout' screen output stops just before the first
non-ASCII character.  In the file I get only the first line, which
contains only ASCII characters.

How can this be?  I understood that the codecvt used internally by
fstreams calls wcsrtombs itself, so shouldn't wide C++ streams behave
the same way as %ls in C printf?

==========================================

#include <fstream>
#include <iostream>
#include <locale>
#include <string>

#include <errno.h>   // errno
#include <stdio.h>   // printf, perror
#include <stdlib.h>  // mbstowcs

using namespace std;

int main()
{
  wchar_t dst[256];
  unsigned char *ptr = (unsigned char *) dst;

  const char *src = "Tämä on öäåÖÄÅ testi\n";

  locale::global(locale("en_US.UTF-8"));
  //locale::global(locale(""));
  cout << locale().name() << endl;

  wcout.imbue(locale());
  cout.imbue(locale());

  int i = mbstowcs(dst, src, 255);
  printf(">%d< %d\n", i, (int) sizeof(dst));

  for(int x=0; x<i*4; x++) printf("%02hx ", ptr[x]);
  printf("\n===\n%s%ls===\n", src, dst);

  wofstream myfile;
  myfile.imbue(locale());
  myfile.open("log.log");

  wcout  << L"Testi " << endl << dst << endl;
  perror("1"); errno = 0; 
  myfile << L"Rivi 1 " << endl << dst << endl;
  perror("2");

  myfile.close();

}

I get the following output (on a UTF-8 encoded gnome-terminal):

en_US.UTF-8
>21< 1024
54 00 00 00 e4 00 00 00 6d 00 00 00 e4 00 00 00 20 00 00 00 6f 00 00
00 6e 00 00 00 20 00 00 00 f6 00 00 00 e4 00 00 00 e5 00 00 00 d6 00
00 00 c4 00 00 00 c5 00 00 00 20 00 00 00 74 00 00 00 65 00 00 00 73
00 00 00 74 00 00 00 69 00 00 00 0a 00 00 00 00
===
Tämä on öäåÖÄÅ testi
Tämä on öäåÖÄÅ testi
===
Testi
1: Invalid or incomplete multibyte or wide character
2: Invalid or incomplete multibyte or wide character
T


Version-Release number of selected component (if applicable):
gcc-c++-3.3.2-1

How reproducible:
Always

Steps to Reproduce:
1.  Compile the source included above
2.  Run it and see the error message

Additional info:

I wonder if it matters, but my actual /etc/sysconfig/i18n is:

LANG="C"
LC_CTYPE="fi_FI@euro"
LC_PAPER="fi_FI@euro"
LESSCHARSET=latin1
SUPPORTED="en_US.UTF-8:en_US:en:fi_FI.UTF-8:fi_FI:fi"
SYSFONT="lat0-16"

I have also run the program so that locale shows:

# locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
Comment 1 jouni 2003-12-15 15:55:56 EST
Created attachment 96548
The mentioned source code
Comment 2 jouni 2003-12-16 06:02:21 EST
Changed the title to a hopefully more descriptive one
Comment 3 Jakub Jelinek 2003-12-16 06:24:55 EST
I've built the proglet with -Wl,-Bstatic -lstdc++ -Wl,-Bdynamic to look at ltrace.
Here are the interesting lines:
__newlocale(64, "C", NULL)                                                     = 0x003c02e0
...
# The following snippet is outputting "Testi ".  Why is it done a character at a time?
perror("0")                                                                    = <void>
wcslen(0x0808fb40, 0x080a69e0, 0xbfe55088, 0x0804b6a0, 577)                    = 6
__uselocale(0x003c02e0, 11, 0xbfe54f58, 0x0807c11f, 11)                        = -1
wcsrtombs(0xbfe54f30, 0xbfe54f18, 1, 0x080a6554, 11)                           = 1
__uselocale(-1, 0xbfe54f18, 1, 0x080a6554, 11)                                 = 0x003c02e0
fwrite("T\341)", 1, 1, 0x003c0be0)                                             = 1
__uselocale(0x003c02e0, 0xbfe54f30, 0xbfe54f08, 0x0808624a, 0xbfe54f30)        = -1
wcsrtombs(0xbfe54f30, 0xbfe54f18, 1, 0x080a6554, 0xbfe54f30)                   = 1
__uselocale(-1, 0xbfe54f18, 1, 0x080a6554, 0xbfe54f30)                         = 0x003c02e0
fwrite("e\341)", 1, 1, 0x003c0be0)                                             = 1
...
# And here wcout ... << endl << dst:
fwrite("\n", 1, 1, 0x003c0be0)                                                 = 1
fflush(0x003c0be0)                                                             = 0
wcslen(0xbfe55270, 0x080a69e0, 0xbfe5507c, 0, 0x080a69e0)                      = 21
__uselocale(0x003c02e0, 0x002ec9cf, 0x003c0be0, 0xbfe54f70, 1)                 = -1
wcsrtombs(0xbfe54f50, 0xbfe54f38, 1, 0x080a6554, 1)                            = 1
__uselocale(-1, 0xbfe54f38, 1, 0x080a6554, 1)                                  = 0x003c02e0
fwrite("TO\345\277pO\345\277\001", 1, 1, 0x003c0be0)                           = 1
__uselocale(0x003c02e0, 0xbfe54f50, 0xbfe54f28, 0x0808624a, 0xbfe54f50)        = -1
wcsrtombs(0xbfe54f50, 0xbfe54f38, 1, 0x080a6554, 0xbfe54f50)                   = -1
__uselocale(-1, 0xbfe54f38, 1, 0x080a6554, 0xbfe54f50)                         = 0x003c02e0
perror("1")                                                                    = <void>

Apparently, libstdc++ is trying to convert the wide string to ASCII (and a byte at a time).
Benjamin, can you please look at this?
Thanks.
Comment 4 Benjamin Kosnik 2003-12-19 13:35:45 EST
Hey Jakub. This looks like it's fixed in the 3.4 toolchain. I can try
to port this back to 3.3.x. I get:

en_US.UTF-8
>21< 1024
54 00 00 00 e4 00 00 00 6d 00 00 00 e4 00 00 00 20 00 00 00 6f 00 00
00 6e 00 00 00 20 00 00 00 f6 00 00 00 e4 00 00 00 e5 00 00 00 d6 00
00 00 c4 00 00 00 c5 00 00 00 20 00 00 00 74 00 00 00 65 00 00 00 73
00 00 00 74 00 00 00 69 00 00 00 0a 00 00 00 00
===
Tämä on öäåÖÄÅ testi
Tämä on öäåÖÄÅ testi
===
0: Success
Testi
Tämä on öäåÖÄÅ testi
 
1: Illegal seek
2: Success

... which seems to be correct. Anyway, WRT the conversion, I can't
duplicate your results. I think this is also fixed in mainline, but I
can't verify that.

I can't reproduce your ltrace bits with mainline. I get:

_ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_(0x0804abc8,
0x00613758, 0xbffff3f8, 0x080491f2, 0x0804abc8en_US.UTF-8
) = 0x0804abc8
>21< 1024
54 00 00 00 e4 00 00 00 6d 00 00 00 e4 00 00 00 20 00 00 00 6f 00 00
00 6e 00 00 00 20 00 00 00 f6 00 00 00 e4 00 00 00 e5 00 00 00 d6 00
00 00 c4 00 00 00 c5 00 00 00 20 00 00 00 74 00 00 00 65 00 00 00 73
00 00 00 74 00 00 00 69 00 00 00 0a 00 00 00 00
===
Tämä on öäåÖÄÅ testi
Tämä on öäåÖÄÅ testi
===
0: Success
_ZSt4endlIwSt11char_traitsIwEERSt13basic_ostreamIT_T0_ES6_(0x0804ab28,
0x00613758, 0xbffff3f8, 0x0804941b, 0x0804ab28Testi
) = 0x0804ab28
Tämä on öäåÖÄÅ testi
_ZSt4endlIwSt11char_traitsIwEERSt13basic_ostreamIT_T0_ES6_(0x0804ab28,
0x08049851, 0xbffff3f8, 0x0804943e, 0x0804ab28
) = 0x0804ab28
1: Illegal seek
_ZSt4endlIwSt11char_traitsIwEERSt13basic_ostreamIT_T0_ES6_(0xbfffee80,
0x08049880, 0xbffff3f8, 0x0804947d, 0xbfffee80) = 0xbfffee80
_ZSt4endlIwSt11char_traitsIwEERSt13basic_ostreamIT_T0_ES6_(0xbfffee80,
0x08049880, 0xbffff3f8, 0x080494a0, 0xbfffee80) = 0xbfffee80
2: Success
+++ exited (status 0) +++
Comment 5 Jakub Jelinek 2003-12-19 13:45:15 EST
To get usable ltrace output, you need to link libstdc++.a into the
application instead of libstdc++.so.  ltrace unfortunately only handles
PLT slots in the binary.
Comment 6 jouni 2003-12-19 15:07:19 EST
Just to double-check: in the test above, the strings from printf were:

===
Tämä on öäåÖÄÅ testi
Tämä on öäåÖÄÅ testi
===
The wcout output should look the same.  But after "Testi" this was printed:

Tämä on öäåÖÄÅ testi.

All three strings and the file should be printed in the same
character set.  Here they appear to come from different character
sets, at least in my browser (printf in UTF-8 and wcout in Latin-1?).
Maybe this is only due to cut and paste, or I have misunderstood
something.

Another question outside the context of this bug: while testing this
behaviour and learning about streambufs, I noticed in
/usr/include/c++/3.3.2/streambuf that the virtual imbue is not empty:
it sets _M_buf_locale.  Shouldn't that be set in pubimbue before
calling the virtual imbue, so that the streambuf's own locale gets set
even when imbue is overridden?  This is the behaviour I understood
from Langer & Kreft's IOStreams book: "Default imbue does nothing".
Comment 7 jouni 2003-12-20 04:38:29 EST
My last question seems to have just been answered:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=13007

Regarding the other comments, it looks to me as if imbue does not
change the locale used for character conversions.  (Does Jakub have
the traditional "C" locale, where no conversion is possible -> crash,
and Benjamin some Latin-1 (ISO 8859-1) based system default locale,
which converts wchar_t to Latin-1?)

Unfortunately, at the moment I have no Fedora-based server available
that I could reboot to test whether changing /etc/sysconfig/i18n
before booting changes the behaviour.

wcout.imbue() does actually change the stream's locale: if I add "cout
<< wcout.rdbuf()->getloc().name() << endl;" to my test program, I get
the expected "en_US.UTF-8" (and "C" before calling imbue).

Trying briefly to find out how the conversion works, I found that in
/usr/include/c++/3.3.2/bits/fstream.tcc the function
_M_convert_to_external() passes _M_state_cur as the state parameter to
cvt.out.  Except for its initialization to __state_type() in the
constructor, I could not find anything that changes the value of
_M_state_cur afterwards.

If
/usr/include/c++/3.3.2/i386-redhat-linux/bits/codecvt_specializations.h
is the right place to look for the do_out specialization, then
do_out() uses iconv for the actual conversion.  There,
__state._M_get_out_descriptor() is passed as the first argument to
iconv.  __state is the first parameter passed to out (do_out), i.e.
_M_state_cur.

Therefore I wonder: shouldn't _M_state_cur be updated somewhere (in
basic_filebuf's imbue?), together with iconv_close(); iconv_open(the
new codes...)?

Comment 8 jouni 2004-11-27 08:39:27 EST
Works ok with FC3
