Bug 196648

Summary: Emacs cannot edit files in directories with names containing ä
Product: [Fedora] Fedora Reporter: Pawel Salek <pawsa-gpa>
Component: emacs Assignee: Chip Coldwell <coldwell>
Status: CLOSED NEXTRELEASE QA Contact:
Severity: medium Docs Contact:
Priority: medium    
Version: 5   
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2006-11-06 16:48:59 UTC Type: ---
Attachments:
Description Flags
enable multibyte strings in command line arguments none

Description Pawel Salek 2006-06-26 07:40:00 UTC
Description of problem:


Version-Release number of selected component (if applicable):
emacs-21.4-14

How reproducible:
Always

Steps to Reproduce:
1. Execute:
mkdir ä; cd ä; emacs xxx
  
Actual results:
emacs says "File not found and directory write-protected".

Expected results:
emacs should allow editing the file.

Additional info:
The same holds for directory names containing ö, å, ø, but not e.g. ą, ł or ń.

Comment 1 Chip Coldwell 2006-08-03 19:36:52 UTC
Actually, it works fine if you C-x C-f the file from within a running emacs.  So
this must have something to do with expanding PWD.

Chip


Comment 2 Chip Coldwell 2006-08-14 19:40:57 UTC
Oh, joy.  It appears that we're up against the ol' ISO8859-1 (Latin-1) versus
UTF-8 encoding problem.

This does work:

LANG=en_US.ISO8859-1 emacs ä/foo

This doesn't:

LANG=en_US.UTF-8 emacs ä/foo

emacs-21 seems to be a lot happier with Latin-1 strings than with UTF-8 strings.




Comment 3 Chip Coldwell 2006-08-14 19:53:52 UTC
(In reply to comment #0)

> The same holds for directory names containing ö, å, ø, but not e.g. ą, ł or ń.

The difference here seems to be the way they are encoded in UTF-8:

ä: \303\244
å: \303\245
ö: \303\266
ø: \303\270

ą: \304\205
ł: \305\202
ń: \305\204

I would guess that the important difference is the value of the first byte in
the multibyte encoding: \303 (0xC3) doesn't work, \304 (0xC4) and higher does.

Chip
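
For readers who want to check the byte values listed above, here is a minimal C sketch (not part of the original report or of Emacs). The helper names decode_utf8_2byte and octal_escape are hypothetical. It decodes a two-byte UTF-8 sequence to its code point and prints bytes in the same backslash-octal notation. One thing it makes visible: sequences with lead byte \303 decode to code points in the range U+00C0..U+00FF, while lead bytes \304 and \305 decode to code points at U+0100 and above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Decode one two-byte UTF-8 sequence (lead byte 0xC2..0xDF) to a code point.
   Returns -1 if the bytes are not a well-formed two-byte sequence. */
static long decode_utf8_2byte(const unsigned char *s)
{
    assert(s != NULL);
    if (s[0] < 0xC2 || s[0] > 0xDF)   /* not a two-byte lead byte */
        return -1;
    if ((s[1] & 0xC0) != 0x80)        /* not a continuation byte */
        return -1;
    return ((long)(s[0] & 0x1F) << 6) | (long)(s[1] & 0x3F);
}

/* Write the bytes of s as backslash-octal escapes, e.g. "\303\244".
   Returns 0 on success, -1 if out is too small. */
static int octal_escape(const char *s, char *out, size_t outlen)
{
    assert(s != NULL && out != NULL);
    size_t n = strlen(s);
    if (outlen < 4 * n + 1)
        return -1;
    for (size_t i = 0; i < n; i++)
        sprintf(out + 4 * i, "\\%03o", (unsigned)(unsigned char)s[i]);
    out[4 * n] = '\0';
    return 0;
}
```

Calling decode_utf8_2byte on the bytes \303\244 gives 0xE4 (ä), and on \304\205 gives 0x105 (ą).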

Comment 4 Chip Coldwell 2006-10-26 18:13:08 UTC
Amusingly,

emacsclient --no-wait ä/foo

works just fine (if you have the server started), even when

emacs ä/foo

doesn't.

Chip


Comment 5 Chip Coldwell 2006-10-26 19:05:33 UTC
The problem is in src/charset.h, where we have this macro defined:

#define UNIBYTE_STR_AS_MULTIBYTE_P(str, length, bytes)	\
  (((str)[0] < 0x80 || (str)[0] >= 0xA0)		\
   ? (bytes) = 1					\
   : (((bytes) = BYTES_BY_CHAR_HEAD ((str)[0])),	\
      ((bytes) > 1 && (bytes) <= (length)		\
       && (str)[0] != LEADING_CODE_8_BIT_CONTROL	\
       && !CHAR_HEAD_P ((str)[1])			\
       && ((bytes) == 2					\
	   || (!CHAR_HEAD_P ((str)[2])			\
	       && ((bytes) == 3				\
		   || !CHAR_HEAD_P ((str)[3])))))))

Really, only the first three lines matter.  I think the problem is that UTF-8
encoded multibyte sequences can have leading bytes with values > 0xA0, and this
macro will report them as being single byte sequences.

I don't think there is a way of writing such a macro that will work
simultaneously on UTF-8 and ISO-8859-1 (Latin-1) strings.  I think we will have
to check the LANG environment variable.

Chip
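
The mismatch described in this comment can be illustrated with a small C sketch. It is a simplified model, not the Emacs code: macro_first_test_says_one_byte copies only the first condition of the quoted macro, and utf8_length_from_lead and interpretations_disagree are hypothetical helpers added for comparison.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors only the first test of the quoted macro: a first byte below 0x80
   or at/above 0xA0 makes the macro set bytes = 1. */
static int macro_first_test_says_one_byte(unsigned char first)
{
    return first < 0x80 || first >= 0xA0;
}

/* Sequence length implied by a UTF-8 lead byte; 0 if the byte cannot
   start a UTF-8 sequence (continuation bytes, 0xC0, 0xC1, 0xF5..0xFF). */
static int utf8_length_from_lead(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;
}

/* True when the macro's first test reports one byte although the same
   byte, read as UTF-8, starts a longer sequence. */
static int interpretations_disagree(const unsigned char *s)
{
    assert(s != NULL);
    return macro_first_test_says_one_byte(s[0])
           && utf8_length_from_lead(s[0]) > 1;
}
```

In this model both \303 and \304 take the one-byte branch, so the sketch only shows the general UTF-8 mismatch; it does not by itself account for the difference between the two lead bytes noted in comment 3.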


Comment 6 Chip Coldwell 2006-11-01 21:49:30 UTC
I take that back.  It looks like this bug has been fixed in emacs 22, but the
approach is different from what I expected.  It's a combination of C and Lisp
code.  This is in src/emacs.c around line 600:

  Vcommand_line_args = Qnil;

  for (i = argc - 1; i >= 0; i--)
    {
      if (i == 0 || i > skip_args)
	/* For the moment, we keep arguments as is in unibyte strings.
	   They are decoded in the function command-line after we know
	   locale-coding-system.  */
	Vcommand_line_args
	  = Fcons (make_unibyte_string (argv[i], strlen (argv[i])),
		   Vcommand_line_args);
    }

N.B. the command line arguments are parsed as unibyte strings; emacs 21 has a
call to "build_string" here that will attempt to parse the command line
arguments as iso-8859-1 strings.

There's also an additional change in lisp/startup.el:

  ;; Convert the arguments to Emacs internal representation.
  (let ((args (cdr command-line-args)))
    (while args
      (setcar args
	      (decode-coding-string (car args) locale-coding-system t))
      (pop args)))

I'm going to try backporting these changes to emacs-21.
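
The idea behind the two changes above (keep each argument as raw bytes, then decode it once the coding system is known) can be sketched in plain C. This is an illustration under stated assumptions, not the Emacs implementation: the enum coding values and decode_bytes are hypothetical, only Latin-1 and UTF-8 are modelled, and the UTF-8 path does minimal validation (it does not reject overlong forms or surrogates).

```c
#include <assert.h>
#include <stddef.h>

enum coding { CODING_LATIN1, CODING_UTF8 };   /* hypothetical names */

/* Decode len raw bytes into code points according to coding.
   Returns the number of code points written, or -1 on malformed UTF-8
   or if out (capacity cap) is too small. */
static long decode_bytes(const unsigned char *in, size_t len,
                         enum coding coding, unsigned long *out, size_t cap)
{
    assert(in != NULL && out != NULL);
    size_t n = 0, i = 0;
    while (i < len) {
        unsigned long cp;
        size_t extra;                 /* continuation bytes expected */
        unsigned char b = in[i++];
        if (coding == CODING_LATIN1 || b < 0x80) { cp = b; extra = 0; }
        else if (b >= 0xC2 && b <= 0xDF) { cp = b & 0x1F; extra = 1; }
        else if (b >= 0xE0 && b <= 0xEF) { cp = b & 0x0F; extra = 2; }
        else if (b >= 0xF0 && b <= 0xF4) { cp = b & 0x07; extra = 3; }
        else return -1;               /* byte cannot start a sequence */
        while (extra-- > 0) {
            if (i >= len || (in[i] & 0xC0) != 0x80)
                return -1;            /* truncated or bad continuation */
            cp = (cp << 6) | (in[i++] & 0x3F);
        }
        if (n >= cap)
            return -1;
        out[n++] = cp;
    }
    return (long)n;
}
```

With the argument bytes \303\244 followed by /foo, decoding as UTF-8 yields five code points starting with 0xE4 (ä), while decoding the same bytes as Latin-1 yields six, starting with 0xC3 and 0xA4. Deferring the decision until the locale is known avoids committing to the wrong interpretation at argument-parsing time.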

Comment 7 Chip Coldwell 2006-11-01 22:03:28 UTC
The backport seems to have succeeded in fixing the bug.  I have some test rpms up at

http://people.redhat.com/coldwell/bugs/emacs/196648/

Could you please download and test them to verify the fix?

Thanks.

Chip


Comment 8 Pawel Salek 2006-11-01 22:11:54 UTC
The test rpms work very well - thanks for tracking this down!

Comment 9 Chip Coldwell 2006-11-03 15:07:25 UTC
Created attachment 140252
enable multibyte strings in command line arguments

Comment 10 Chip Coldwell 2006-11-06 16:48:59 UTC
FC-5: 21.4-16.2
FC-6: 21.4-17.1
RAWHIDE: 21.4-20