Description of problem:
An unexpected error occurs in tcsh.
Version-Release number of selected component (if applicable):
tcsh-6.14.05
How reproducible:
Always
Steps to Reproduce:
1. Create a shell script that contains multibyte characters. I am attaching an example (it is from an EUC-JP environment).
2. Execute the script.
3. An error occurs.
Actual results:
"end: while/foreachの中ではありません。"
Sorry, I do not know how this message is displayed in English.
Expected results:
The script ends normally.
Additional info:
Created attachment 131163
tcsh script (contains EUC-JP characters)
The same error occurs with the development version tcsh-6.14-8 (and with the FC5 package tcsh-6.14-6.fc5.1).
Note: if I convert attachment 131163 from EUC-JP to UTF-8 and execute it, an additional error message appears.
    [tasaka1@localhost Linux]$ iconv -f eucjp -t utf8 attachment.cgi > TEMP.csh
    [tasaka1@localhost Linux]$ tcsh ./TEMP.csh
    (garbled characters from the script): コマンドが見つかりません.
    end: while/foreachの中ではありません.
("コマンドが見つかりません" means "command not found" in English, and "while/foreachの中ではありません." means "Not in while/foreach".)
The cause of this problem seems to be a wrong calculation of l->f_seek in the function btell in sh.lex.c.
    btell(l)
    {
        ...
    #ifdef WIDE_STRINGS
        if (cantell && fseekp >= fbobp && fseekp < feobp) {
            size_t i;
            l->f_seek = fbobp;
            for (i = 0; i < fseekp - fbobp; i++)
                l->f_seek += fclens[i];
        } else
    #endif
l->f_seek represents a byte position. To handle characters that occupy more than one byte (multibyte characters), the code uses the array fclens, which holds the number of bytes in each character. In this code segment, the initial value of l->f_seek is set to fbobp. This is fine at first, while fbobp is zero. However, once the first buffer (size is 4096) has been used up, the second buffer is used and the pointers are updated as follows in the function bgetc.
    bgetc()
    {
        ...
        if (fseekp == feobp) {
            fbobp = feobp;
This update of fbobp is necessary for other parts of the program. However, the updated fbobp cannot be used as the initial value for calculating l->f_seek in btell, because it no longer represents a byte position; it represents a character position. I made a patch to correct the problem. The patch adds a new pointer, fbobp2, which holds the byte position, and replaces some uses of fbobp in the program with fbobp2.
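The character-versus-byte mix-up described above can be illustrated with a small, self-contained sketch. This is not the tcsh source: the struct reader type and the tell_buggy and tell_fixed functions are hypothetical names, and the fields only play the roles that fbobp, fbobp2, and fclens have in the analysis.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of a buffered script reader. This is NOT the tcsh source;
 * the struct and function names are hypothetical.
 */
struct reader {
    size_t buf_start_char;      /* character index where the buffer starts (role of fbobp) */
    size_t buf_start_byte;      /* byte offset where the buffer starts (role of fbobp2) */
    const unsigned char *clens; /* byte length of each character in the buffer (role of fclens) */
};

/* Mirrors the reported bug: a character index seeds a byte offset. */
size_t tell_buggy(const struct reader *r, size_t seek_char)
{
    assert(seek_char >= r->buf_start_char);
    size_t off = r->buf_start_char;
    for (size_t i = 0; i < seek_char - r->buf_start_char; i++)
        off += r->clens[i];
    return off;
}

/* Seeds the sum with the buffer's byte offset instead. */
size_t tell_fixed(const struct reader *r, size_t seek_char)
{
    assert(seek_char >= r->buf_start_char);
    size_t off = r->buf_start_byte;
    for (size_t i = 0; i < seek_char - r->buf_start_char; i++)
        off += r->clens[i];
    return off;
}
```

While the buffer starts at character 0 and byte 0, both functions agree, which is why the error only shows up once a later buffer is in use. For a buffer that starts at character 4 and byte 8 (four 2-byte characters came before it) with character lengths {1, 2, 2}, asking for character 6 yields 7 from tell_buggy and 11 from tell_fixed.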
Created attachment 131717
Patch to fix the problem.
Created attachment 131869
Document and fix buffer offset counting
Thanks for the patch. I'm testing the attached one; the idea is a bit more complicated, but the resulting code is simpler. If you have other scripts affected by the bug, can you please test them with this patch? Also, can we add the test case in attachment 131163 to the tcsh test suite, to be distributed with tcsh under a BSD license, please?
Created attachment 131870
Corrected patch
I'm sorry, attachment 131869 is an incomplete working version; please test this one.
I am sorry for the late reply. I tested it, and the problem did not occur. Thank you.
Umm... Sorry for the late response. Even with the patch (attachment 131870) applied, the problem persists. If I convert the test script (attachment 131870) from EUC-JP to UTF-8, the problem seems to disappear (without the patch, the problem appears even with the UTF-8 script). Is it only me?
Sorry: in comment 8, I meant that the test script is attachment 131163.
> Also, can we add the test case in attachment 131163 to the tcsh test suite, to be distributed with tcsh under a BSD license, please?
Yes, of course.
More precisely, with the patch (attachment 131870) applied, I get the following:
    [tasaka2@localhost tmp]$ echo $LANG
    ja_JP.UTF-8
    [tasaka2@localhost tmp]$ ./test-EUCJP.sh
    end: while/foreachの中ではありません.
    [tasaka2@localhost tmp]$ ./test-ISO2022JP.sh
    AAA
    [tasaka2@localhost tmp]$ ./test-UTF8.sh
    AAA
    [tasaka2@localhost tmp]$ env LC_ALL=ja_JP.eucJP ./test-EUCJP.sh
    AAA
    [tasaka2@localhost tmp]$ env LC_ALL=ja_JP.eucJP ./test-ISO2022JP.sh
    AAA
    [tasaka2@localhost tmp]$ env LC_ALL=ja_JP.eucJP ./test-UTF8.sh
    (character corrupted here): コマンドが見つかりません.
    end: while/foreachの中ではありません.
(Note: test-EUCJP.sh is the same as attachment 131163, and test-ISO2022JP.sh and test-UTF8.sh are the same csh script with its character encoding converted to ISO-2022-JP and UTF-8, respectively.)
Created attachment 131932
Updated patch, to handle encoding mismatches better
Thanks for testing. Technically, using files in a different encoding than the one specified by LC_CTYPE is purely a user error. Nevertheless, it is easy enough to fix for scripts, so please test the updated patch. This won't help if the script is not in a seekable file, e.g. in (cat my_script | tcsh), though.
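To make the "seekable file" caveat concrete, here is a minimal sketch of how a program can tell a regular file from a pipe on a POSIX system. It is only an illustration and does not show how tcsh itself decides this; fd_is_seekable and path_is_seekable are hypothetical helper names.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if fd can report a file offset, 0 otherwise
 * (for example a pipe, where lseek fails with ESPIPE).
 */
int fd_is_seekable(int fd)
{
    assert(fd >= 0);
    return lseek(fd, 0, SEEK_CUR) != (off_t)-1;
}

/* Returns 1 or 0 as above, or -1 if the file cannot be opened. */
int path_is_seekable(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int seekable = fd_is_seekable(fd);
    close(fd);
    return seekable;
}
```

A script opened by path, as in (tcsh ./my_script), is normally a regular file and therefore seekable, whereas in (cat my_script | tcsh) the shell reads from a pipe, for which fd_is_seekable returns 0.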
Umm..., I was not able to patch the file "sh.lex.c".
Message:
    |diff -u tcsh-6.14.00/sh.lex.c tcsh-6.14.00/sh.lex.c
    |--- tcsh-6.14.00/sh.lex.c 2006-07-03 03:46:11.000000000 +0200 <-???miss???
    |+++ tcsh-6.14.00/sh.lex.c 2006-07-05 03:46:11.000000000 +0200
    Hunk #1 FAILED at 1736.
    1 out of 3 hunks FAILED -- saving rejects to file tcsh-6.14.00/sh.lex.c.rej
Thank you.
Hi: I applied attachment 131932 against rawhide tcsh-6.14-8, and it seems to work WELL for all the cases I listed in comment 11. Thanks!!
Nobuhiro-san, the patch applies cleanly for me to the result of (rpmbuild -bp), after tcsh-6.14.00-wide-crash.patch (which also touches wide_read()), among others. Have you perhaps tried to patch the original tcsh-6.14.00 tarball?
> Have you perhaps tried to patch the original tcsh-6.14.00 tarball?
Once the patch applied successfully, I checked it and it worked normally. Thank you so much!!
Thanks for all the testing, tcsh-6.14-6.fc5.2 is now in the updates-testing repository.
... and published as a final update.