Bug 2182144 - cpp: UTF-8 identifier printing problems that do not occur with GCC cpp
Summary: cpp: UTF-8 identifier printing problems that do not occur with GCC cpp
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: clang
Version: 36
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Tom Stellard
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-03-27 16:36 UTC by Jason Vas Dias
Modified: 2023-05-25 19:32 UTC (History)
12 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2023-05-25 19:32:16 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
C file to reproduce issue when pre-processed (191 bytes, text/plain), 2023-03-27 16:36 UTC, Jason Vas Dias
How the C source file is displayed by UTF-8 compliant software (85.82 KB, image/png), 2023-03-27 16:44 UTC, Jason Vas Dias
My Macro header necessary for demo program (7.50 KB, text/x-csrc), 2023-03-31 00:18 UTC, Jason Vas Dias
demo program (1.08 KB, text/x-csrc), 2023-03-31 00:19 UTC, Jason Vas Dias
improved demo program (1.33 KB, text/x-csrc), 2023-03-31 05:17 UTC, Jason Vas Dias
More recent M.h (7.58 KB, text/x-csrc), 2023-03-31 05:39 UTC, Jason Vas Dias
better demo program (1.36 KB, text/x-csrc), 2023-03-31 05:57 UTC, Jason Vas Dias
Current demo program (1.36 KB, text/plain), 2023-03-31 06:00 UTC, Jason Vas Dias
Current M.h (7.58 KB, text/plain), 2023-03-31 06:01 UTC, Jason Vas Dias
Final / Best M.h (9.18 KB, text/plain), 2023-03-31 19:21 UTC, Jason Vas Dias
Final / Best tUTF.c (570 bytes, text/plain), 2023-03-31 19:23 UTC, Jason Vas Dias
Actual version of tUTF.c compiled above (597 bytes, application/octet-stream), 2023-04-01 17:29 UTC, Jason Vas Dias
version of M.h compiled above (11.41 KB, application/octet-stream), 2023-04-01 17:32 UTC, Jason Vas Dias

Description Jason Vas Dias 2023-03-27 16:36:35 UTC
Created attachment 1953967 [details]
C file to reproduce issue when pre-processed

Description of problem:


[jvd@jvdspc]:/tmp [3295] 17:32:29 [#:861!:20407]{0}	
$ clang -E clang-utf8-issue.c 
# 1 "clang-utf8-issue.c"
# 1 "<built-in>" 1
# 1 "<built-in>" 3
# 357 "<built-in>" 3
# 1 "<command line>" 1
# 1 "<built-in>" 2
# 1 "clang-utf8-issue.c" 2


typedef unsigned int U32_t;
# 13 "clang-utf8-issue.c"
U32_t
  c1 = ὸA
 ,c2 = ﹩
 ,c3 = ὝF
 ,c4 = Ὗ2
  ;
[jvd@jvdspc]:/tmp [3295] 17:33:41 [#:862!:20408]{0}	
$ gcc -E clang-utf8-issue.c 
# 0 "clang-utf8-issue.c"
# 0 "<built-in>"
# 0 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 0 "<command-line>" 2
# 1 "clang-utf8-issue.c"


typedef unsigned int U32_t;
# 13 "clang-utf8-issue.c"
U32_t
  c1 = \U00001f78A
 ,c2 = \U0000fe69
 ,c3 = \U00001f5dF
 ,c4 = \U00001f5f2
  ;


Why is clang not printing out the UTF-8 correctly, while GCC is ?

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Jason Vas Dias 2023-03-27 16:41:00 UTC
Comment on attachment 1953967 [details]
C file to reproduce issue when pre-processed

Oops, it appears Bugzilla has UTF-8 issues too, so it doesn't display
my C file correctly. Please download 'clang-utf8-issue.c' and open it in
an editor. I will append a screenshot of how the file is meant to look,
as it does in Emacs.

Comment 2 Jason Vas Dias 2023-03-27 16:44:08 UTC
Created attachment 1953969 [details]
How the C source file is displayed by UTF-8 compliant software

This shows how the C source file SHOULD display and print.

Comment 3 Stephan Bergmann 2023-03-28 06:26:06 UTC
Clang behaves correctly here.  Note that `\u1F78A` is two characters, the universal character name `\u1F78` followed by a plain `A`.  You probably want `\U0001F78A` instead.

Comment 4 Jason Vas Dias 2023-03-28 21:50:47 UTC
Sorry, I don't agree this is 'NOT_A_BUG' :

In what way is clang's output, shown above : 

    # 13 "clang-utf8-issue.c"
     U32_t
      c1 = ὸA
     ,c2 = ﹩
     ,c3 = ὝF
     ,c4 = Ὗ2
     ;

"correct" wrt GCC's output :

    # 13 "clang-utf8-issue.c"
     U32_t
      c1 = \U00001f78A
     ,c2 = \U0000fe69
     ,c3 = \U00001f5dF
     ,c4 = \U00001f5f2
     ;

This is really the crux of this whole bug report,
and what I was complaining about /
would really like to know: why is
 clang's cpp displaying these bogus strings like
 'ὸA' and 'Ὗ2', whereas GCC's cpp is displaying
 correct UCN names?

How can users be confident, seeing that clang CPP output,
that clang is passing the values of c[1-4] correctly
to the backend ?

I can reproduce this problem either in the LC_ALL=POSIX
locale or in my default en_GB.UTF-8 locale.

Comment 5 Jason Vas Dias 2023-03-28 21:59:54 UTC
And sorry, no you are wrong :
> Note that `\u1F78A` is two characters, the universal character name `\u1F78` followed by a plain `A`

Then why doesn't GCC have the same problem parsing \u1F78A as clang does ? 
Perhaps the problem is then that clang is not parsing that UTF-8 code correctly ?


No :
    128906	1F78A	(7 0)	🞊	'WHITE CIRCLE CONTAINING BLACK SMALL CIRCLE'

This is printed by the SBCL LISP :

    (ql:quickload "cl-unicode")
    (defvar tab (make-string 1 :element-type 'character :initial-element #\tab))
    (defun print-all-unicode-chars (&optional (stream *standard-output*))
     (loop :for i :below cl-unicode:+code-point-limit+
          :for name := (cl-unicode:unicode-name i)
          :when name :do
          (format stream "~10d~a~x~a~a~a~a~a'~a'~%"
           i                     tab
           i                     tab
           (cl-unicode:age i)    tab
	   (cl-unicode:character-named name) tab
           name
          )
     )
    )
    (print-all-unicode-chars)

So why doesn't clang's CPP parse the utf-8 \u1F78A correctly, the way GCC does ?

Comment 6 Jason Vas Dias 2023-03-28 22:15:03 UTC
From : https://en.wikipedia.org/wiki/Escape_sequences_in_C :
"
Universal character names

From the C99 standard, C has also supported escape sequences that denote Unicode code points in string literals. Such escape sequences are called universal character names, and have the form \uhhhh or \Uhhhhhhhh, where h stands for a hex digit. Unlike the other escape sequences considered, a universal character name may expand into more than one code unit.

The sequence \uhhhh denotes the code point hhhh, interpreted as a hexadecimal number. The sequence \Uhhhhhhhh denotes the code point hhhhhhhh, interpreted as a hexadecimal number. (Therefore, code points located at U+10000 or higher must be denoted with the \U syntax, whereas lower code points may use \u or \U.) The code point is converted into a sequence of code units in the encoding of the destination type on the target system. For example (where the encoding is UTF-8, and UTF-16 for wchar_t):

char s1[] = "\xC0"; // A single byte with the value 0xC0, not valid UTF-8
char s2[] = "\u00C0"; // Two bytes with values 0xC3, 0x80, the UTF-8 encoding of U+00C0
wchar_t s3[] = L"\xC0"; // A single wchar_t with the value 0x00C0
wchar_t s4[] = L"\u00C0"; // A single wchar_t with the value 0x00C0

A value greater than \U0000FFFF may be represented by a single wchar_t if the UTF-32 encoding is used, or two if UTF-16 is used.

Importantly, the universal character name \u00C0 always denotes the character "À", regardless of what kind of string literal it is used in, or the encoding in use. The octal and hex escape sequences always denote certain sequences of numerical values, regardless of encoding. Therefore, universal character names are complementary to octal and hex escape sequences; while octal and hex escape sequences represent code units, universal character names represent code points, which may be thought of as "logical" characters. 
"


So clang's behaviour is incorrect: \u1F78A should be the same character as \U1F78A, as GCC agrees.

Why differ from GCC in this respect if GCC does it this way ?

Most annoying: now all my code that uses UTF-8 names must be changed to compile under clang.

Comment 7 Jason Vas Dias 2023-03-28 22:26:45 UTC
And if I do change the program to use '\Uxxxxx' instead of '\uxxxxx', then:

[jvd@jvdspc]:/tmp [3295] 23:22:47 [#:1059!:20605]{0}	
$ gcc -E clang-utf8-issue.c 
# 0 "clang-utf8-issue.c"
# 0 "<built-in>"
# 0 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 0 "<command-line>" 2
# 1 "clang-utf8-issue.c"


typedef unsigned int U32_t;
# 13 "clang-utf8-issue.c"
U32_t
  c1 = \U1F78A
 ,c2 = \U0000fe69
 ,c3 = \U1F5DF
 ,c4 = \U1F5F2
  ;
[jvd@jvdspc]:/tmp [3295] 23:22:59 [#:1060!:20606]{0}	
$ clang -E clang-utf8-issue.c 
# 1 "clang-utf8-issue.c"
# 1 "<built-in>" 1
# 1 "<built-in>" 3
# 357 "<built-in>" 3
# 1 "<command line>" 1
# 1 "<built-in>" 2
# 1 "clang-utf8-issue.c" 2


typedef unsigned int U32_t;clang-utf8-issue.c:5:16: warning: incomplete universal character name; treating as '\' followed by identifier [-Wunicode]
#define <U+1F78A>() \U1F78A
                    ^
clang-utf8-issue.c:7:16: warning: incomplete universal character name; treating as '\' followed by identifier [-Wunicode]
#define <U+1F5DF>() \U1F5DF
                    ^
clang-utf8-issue.c:11:16: warning: incomplete universal character name; treating as '\' followed by identifier [-Wunicode]
#define <U+1F5F2>() \U1F5F2
                    ^

# 13 "clang-utf8-issue.c"
U32_t
  c1 = \U1F78A
 ,c2 = ﹩
 ,c3 = \U1F5DF
 ,c4 = \U1F5F2
  ;
3 warnings generated.


i.e. the program now reads:


typedef unsigned int U32_t;

#define 🞊() \U1F78A

#define 🗟() \U1F5DF

#define ﹩() \uFE69

#define 🗲() \U1F5F2

U32_t
  c1 = 🞊()
 ,c2 = ﹩()
 ,c3 = 🗟()
 ,c4 = 🗲()
  ;


and still clang is not interpreting the UTF-8 correctly, in the same way as GCC is.
Surely I should not be required to specify leading zeros in a hex string ?

Comment 8 Jason Vas Dias 2023-03-28 22:30:19 UTC
I'd just like clang to behave the same way GCC does, here.

Why introduce incompatibilities over such a silly non-issue ?

Comment 9 Tom Stellard 2023-03-28 23:08:34 UTC
The part you quoted from the Wikipedia article says that the sequence must have either 4 or 8 hexadecimal digits.  In your example, you have 5, as @sbergman pointed out.  I looked it up in the official spec, and it seems to confirm what was in Wikipedia: it must be either 4 or 8 digits.  So to me, it looks like clang is correct.  I get that it would be more convenient not to have to add the leading zeros (which appears to be what gcc allows you to do), but this is not what the spec says.

Comment 10 Jason Vas Dias 2023-03-29 04:16:13 UTC
Every other C escape handler ignores leading zeros in hex strings.

I think GCC is prioritizing the C & C++ Escape Handling standards here, 
while Clang is idiosyncratically for no good reason prioritizing the Unicode standard -
which precise Unicode standard is that ?

I think that particular aspect of that particular standard is very silly in this respect: 

  IFF you are specifying a \U form name, you are NEVER specifying MORE THAN ONE UTF-8 character.

OK, if you are using a \u form name, and it has more than 4 hex digits, then you are specifying
two UTF-8 characters. But the long \U form cannot specify more than one UTF-8 character !

Clang is just wrong here , and GCC is taking the sensible approach .

Please fix !

Comment 11 Jason Vas Dias 2023-03-29 04:39:32 UTC
If you won't, I'll have to add a new Clang flag:

   -fgcc-compatible-unicode-escapes

or something to the Clang compilers I use . And I suppose GCC now needs a:

   -fclang-compatible-unicode-escapes

option.

I also want a  

   -funicode-identifier-whitelist=$filename 

option for both GCC and clang, to allow UTF-8 chars in identifiers not formally allowed by the standards.

Investigating how hard it would be to develop such patches ...

Comment 12 corentin.jabot 2023-03-29 09:02:52 UTC
Stephan Bergmann is absolutely correct.


That `\u` calls for 4 digits and `\U` for 8 has nothing to do with trying to conform with Unicode (and it's certainly unrelated to UTF-8), and everything to do with the need for these escapes to be parsable, and to conform with the C and C++ grammar.
If `\u1F78A` were somehow greedy, there would be no way to spell some sequences of code points, such as `ὸ` followed by `A` in your example, as any character in the range [A-F] following a universal character name would be considered part of that universal character name escape sequence.
Which is incidentally how \x works: \x is greedy and will consume all of your string literal if you let it. You can get around that by splitting your string, "\xABC" "DEF", which is not something you can do with identifiers,
so it's a good thing that universal character names require exactly 4 or 8 hexadecimal digits.

The code under consideration here is not conforming, and the solution is to either
 * use `\U00001f78A`
 * use the new-in-C++23 delimited escape sequence `\u{1f78A}`


GCC makes the incorrect assumption that you want to call your identifier `🞊` (which as an aside is not a valid identifier), instead of `ὸA` which is not a valid assumption to make. and `ὸA` is a perfectly valid identifier name. maybe there is an `ὸB`, who knows)

Comment 13 Jakub Jelinek 2023-03-29 16:08:45 UTC
> GCC makes the incorrect assumption that you want to call your identifier `🞊` (which as an aside is not a valid identifier), instead of `ὸA` which is not a valid assumption to make. and `ὸA` is a perfectly valid identifier name. maybe there is an `ὸB`, who knows)

That is not the case.
The only difference is that gcc -E chooses to print those universal character names as (canonicalized to \U) UCNs, while clang prints them as the unicode characters.
In that
U32_t
  c1 = \U00001f78A
 ,c2 = \U00001f5dF
 ,c3 = \U0000fe69
 ,c4 = \U00001f5f2
  ;
you can see that, say, \u1F78 has been printed as \U00001f78.
If you preprocess with gcc and preprocess with clang and then compile each preprocessed file with gcc and clang, you'll see both are rejected the same:
typedef unsigned int U32_t;
#define 🞊() \u1F78A
#define 🗟() \u1F5DF
#define ﹩() \uFE69
#define 🗲() \u1F5F2

U32_t
  c1 = 🞊()
 ,c2 = 🗟()
 ,c3 = ﹩()
 ,c4 = 🗲()
  ;
gcc -E -o /tmp/1g.i /tmp/1.c
clang -E -o /tmp/1c.i /tmp/1.c
gcc -S -o /tmp/1g.s /tmp/1g.i
/tmp/1.c:8:8: error: ‘ὸA’ undeclared here (not in a function)
    8 |   c1 = 🞊()
      |        ^~~     
/tmp/1.c:9:8: error: ‘ὝF’ undeclared here (not in a function)
    9 |  ,c2 = 🗟()
      |        ^~~     
/tmp/1.c:10:8: error: ‘﹩’ undeclared here (not in a function)
   10 |  ,c3 = ﹩()
      |        ^~~~     
/tmp/1.c:11:8: error: ‘Ὗ2’ undeclared here (not in a function)
   11 |  ,c4 = 🗲()
      |        ^~~     
gcc -S -o /tmp/1c.s /tmp/1c.i
/tmp/1.c:8:8: error: ‘ὸA’ undeclared here (not in a function)
    8 |   c1 = 🞊()
      |        ^
/tmp/1.c:9:8: error: ‘ὝF’ undeclared here (not in a function)
    9 |  ,c2 = 🗟()
      |        ^
/tmp/1.c:10:8: error: ‘﹩’ undeclared here (not in a function)
   10 |  ,c3 = ﹩()
      |        ^~
/tmp/1.c:11:8: error: ‘Ὗ2’ undeclared here (not in a function)
   11 |  ,c4 = 🗲()
      |        ^
clang -S -o /tmp/1g.s /tmp/1g.i
/tmp/1.c:8:8: error: use of undeclared identifier 'ὸA'
  c1 = \U00001f78A
       ^
/tmp/1.c:9:8: error: use of undeclared identifier 'ὝF'
 ,c2 = \U00001f5dF
       ^
/tmp/1.c:10:8: error: use of undeclared identifier '﹩'
 ,c3 = \U0000fe69
       ^
/tmp/1.c:11:8: error: use of undeclared identifier 'Ὗ2'
 ,c4 = \U00001f5f2
       ^
4 errors generated.
clang -S -o /tmp/1c.s /tmp/1c.i
/tmp/1.c:8:8: error: use of undeclared identifier 'ὸA'
  c1 = ὸA
       ^
/tmp/1.c:9:8: error: use of undeclared identifier 'ὝF'
 ,c2 = ὝF
       ^
/tmp/1.c:10:8: error: use of undeclared identifier '﹩'
 ,c3 = ﹩
       ^
/tmp/1.c:11:8: error: use of undeclared identifier 'Ὗ2'
 ,c4 = Ὗ2
       ^
4 errors generated.

Comment 14 Jason Vas Dias 2023-03-29 16:41:39 UTC
Thanks for the comments !
I was really initially just wondering why GCC's & Clang's CPP output differed so much -
I was not trying to produce an executable file, I guess I should have tried - if I had,
I would have written :

"
#define _🙶(_X_)  #_X_
#define  🙶(_X_)  _🙶(_X_)

typedef unsigned int U32_t;

#define utf8_char(_U_) ((🙶(_U_))[0])

#define 🞊() utf8_char(\U0001F78A)

#define 🗟() utf8_char(\U0001F5DF)

#define ﹩() utf8_char(\uFE69)

#define 🗲() utf8_char(\U0001F5F2)

static const U32_t
  c1 = 🞊()
 ,c2 = ﹩()
 ,c3 = 🗟()
 ,c4 = 🗲()
  ;
"

which does actually compile with NO warnings or errors to produce an object file on both GCC and Clang.

And all those characters are legal identifiers, else neither compiler would compile it.

But if I remove the leading zeros from the \U names, they cannot compile it at all .

Thanks for the UTF-8 & C clarification discussion! Best Regards.

Comment 15 corentin.jabot 2023-03-29 16:46:42 UTC
(In reply to Jakub Jelinek from comment #13)
> > GCC makes the incorrect assumption that you want to call your identifier `🞊` (which as an aside is not a valid identifier), instead of `ὸA` which is not a valid assumption to make. and `ὸA` is a perfectly valid identifier name. maybe there is an `ὸB`, who knows)
> 
> That is not the case.
> The only difference is that gcc -E chooses to print those universal
> character names as (canonicalized to \U) UCNs, while clang prints them as
> the unicode characters.

You are right, i managed to confuse myself, sorry!

Comment 16 Tom Honermann 2023-03-29 18:46:35 UTC
I'm chiming in to note agreement with Stephan, Jakub, and Corentin; both compilers are behaving properly as far as I can tell. Since the C and C++ standards do not specify anything about preprocessed output, there are no conformance concerns regarding their differing behavior. Both production of UTF-8 (Clang's behavior) and production of ASCII with UCNs (gcc's behavior) are useful. It would make sense to allow an encoding for preprocessed output to be specified via a command line option; I'm not aware if any current compilers support such an option. I wouldn't be surprised if some compilers transcode to the current locale encoding when generating preprocessed output but I don't know of any examples.

Jason, the syntactical requirements for universal character names (UCNs) are specified for C++ in [lex.charset] (http://eel.is/c++draft/lex.charset#nt:universal-character-name); the grammar clearly specifies that \u must be followed by exactly one hex-quad and that \U must be followed by exactly two hex-quads.

As Corentin noted, C++23 added support for an additional delimited UCN form in which the hex sequence can be of any length and is delimited by '{' and '}'. The corresponding proposal is P2290 (https://wg21.link/p2290). There is also a proposal for C, but it has not yet been accepted; it is N2785 (https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2785.pdf).

The C and C++ standard committees changed the set of characters that are allowed in identifiers during the C23 and C++23 development cycles. The relevant proposals are:
- C++: P1949 (https://wg21.link/p1949)
- C: N2836 (https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2836.pdf)

Corentin already noted that some of the characters in the test case are not valid in identifiers (following adoption of the above papers), because these characters have neither the `XID_Start` nor the `XID_Continue` Unicode property. These include:
- 🞊 U+1F78A (WHITE CIRCLE CONTAINING BLACK SMALL CIRCLE)
- 🗟 U+1F5DF (PAGE WITH CIRCLED TEXT)
- ﹩ U+FE69 (SMALL DOLLAR SIGN)
- 🗲 U+1F5F2 (LIGHTNING MOOD)

The C++ specification for what characters are allowed in identifiers is specified by [lex.name] (http://eel.is/c++draft/lex.name).

Comment 17 Jason Vas Dias 2023-03-31 00:16:33 UTC
Many thanks for all the informative responses!

Just in case anyone else is troubled by these issues,
here is what I consider to be the optimum solution --
use of my "M.h" macros file (attached),
 and new UTF8 macros like those demonstrated by :

<code>

#include <stdio.h>

#include "M.h"

#define DEFAULT_𝕌8N() 0

#define 𝕌8N﹩(_n7_,_n6_,_n5_,_n4_, _n3_,_n2_,_n1_,_n0_)\
﹩(\U,﹩(🗌﹖(𝕌8N,_n7_),﹩(🗌﹖(𝕌8N,_n6_),﹩(🗌﹖(𝕌8N,_n5_),﹩(🗌﹖(𝕌8N,_n4_),﹩(🗌﹖(𝕌8N,_n3_),﹩(🗌﹖(𝕌8N,_n2_),﹩(🗌﹖(𝕌8N,_n1_),🗌﹖(𝕌8N,_n0_)))))))))

#define 𝕌8N(_n7_,_n6_,_n5_,_n4_, _n3_,_n2_,_n1_,_n0_)\
﹩(0x,﹩(🗌﹖(𝕌8N,_n7_),﹩(🗌﹖(𝕌8N,_n6_),﹩(🗌﹖(𝕌8N,_n5_),﹩(🗌﹖(𝕌8N,_n4_),﹩(🗌﹖(𝕌8N,_n3_),﹩(🗌﹖(𝕌8N,_n2_),﹩(🗌﹖(𝕌8N,_n1_),🗌﹖(𝕌8N,_n0_)))))))))

typedef unsigned char byte;
typedef unsigned int  U32_t;

const byte
  s[]=🙶(𝕌8N﹩(,,,,F,F,,4));

const U32_t
  c = 𝕌8N(,,,,F,F,,4);

int  main(int argc, char *const* argv)
{
U32_t
  c2=  (((U32_t)(s[7]))<<28)|(((U32_t)(s[6]))<<24)|(((U32_t)(s[5]))<<20)|(((U32_t)(s[4]))<<16)|
       (((U32_t)(s[3]))<<12)|(((U32_t)(s[2]))<<8) |(((U32_t)(s[1]))<<4)|((U32_t)(s[0]))
;

 printf("UTF-8 char %s has unicode name 0x%x and is represented by 0x%x.\n",
         s, c, c2
        );
  return 1;
}

</code>

Which prints:

$ gcc -I. -Wall -o tUTF tUTF.c 
[jvd@jvdspc]:~/J [3295] 01:15:52 [#:1632!:21178]{0}	
$ ./tUTF 
UTF-8 char $ has unicode name 0xff04 and is represented by 0xff48fef.

Comment 18 Jason Vas Dias 2023-03-31 00:18:15 UTC
Created attachment 1954814 [details]
My Macro header necessary for demo program

Comment 19 Jason Vas Dias 2023-03-31 00:19:08 UTC
Created attachment 1954815 [details]
demo program

demo program

Comment 20 Jason Vas Dias 2023-03-31 00:46:03 UTC
Here is clang building same file:

$ clang -I. -Wall -o tUTF_CL tUTF.c 
In file included from tUTF.c:3:
./M.h:134:10: warning: treating Unicode character <U+FF06> as identifier character rather than as '&' symbol [-Wunicode-homoglyph]
#define _&11 1
         ^~
./M.h:135:10: warning: treating Unicode character <U+FF06> as identifier character rather than as '&' symbol [-Wunicode-homoglyph]
#define _&10 0
         ^~
./M.h:136:10: warning: treating Unicode character <U+FF06> as identifier character rather than as '&' symbol [-Wunicode-homoglyph]
#define _&01 0
         ^~
./M.h:137:10: warning: treating Unicode character <U+FF06> as identifier character rather than as '&' symbol [-Wunicode-homoglyph]
#define _&00 0
         ^~
./M.h:139:10: warning: treating Unicode character <U+FF5C> as identifier character rather than as '|' symbol [-Wunicode-homoglyph]
#define _|11 1
         ^~
./M.h:140:10: warning: treating Unicode character <U+FF5C> as identifier character rather than as '|' symbol [-Wunicode-homoglyph]
#define _|10 1
         ^~
./M.h:141:10: warning: treating Unicode character <U+FF5C> as identifier character rather than as '|' symbol [-Wunicode-homoglyph]
#define _|01 1
         ^~
./M.h:142:10: warning: treating Unicode character <U+FF5C> as identifier character rather than as '|' symbol [-Wunicode-homoglyph]
#define _|00 0
         ^~
./M.h:144:10: warning: treating Unicode character <U+FF06> as identifier character rather than as '&' symbol [-Wunicode-homoglyph]
#define _&_(_A_,_B_)/**/﹩(_&,﹩(𝔹(_A_),𝔹(_B_)))/**/
         ^~
./M.h:144:32: warning: treating Unicode character <U+FF06> as identifier character rather than as '&' symbol [-Wunicode-homoglyph]
#define _&_(_A_,_B_)/**/﹩(_&,﹩(𝔹(_A_),𝔹(_B_)))/**/
                             ^~
./M.h:146:10: warning: treating Unicode character <U+FF5C> as identifier character rather than as '|' symbol [-Wunicode-homoglyph]
#define _|_(_A_,_B_)/**/﹩(_|,﹩(𝔹(_A_),𝔹(_B_)))/**/
         ^~
./M.h:146:32: warning: treating Unicode character <U+FF5C> as identifier character rather than as '|' symbol [-Wunicode-homoglyph]
#define _|_(_A_,_B_)/**/﹩(_|,﹩(𝔹(_A_),𝔹(_B_)))/**/
                             ^~
tUTF.c:8:6: warning: \U used with no following hex digits; treating as '\' followed by identifier [-Wunicode]
﹩(\U,﹩(<U+1F5CC>﹖(𝕌8N,_n7_),﹩(<U+1F5CC>﹖(𝕌8N,_n6_),﹩(<U+1F5CC>﹖(𝕌8N,_n5_),﹩(<U+1F5CC>﹖(𝕌8N,_n4_),﹩(<U+1F5CC>﹖(𝕌8N,_n3_),﹩(<U+1F5CC>﹖(𝕌8N,_n2_),﹩(<U+1F5CC>﹖(𝕌8N,_n1_),<U+1F5CC>﹖(𝕌8N,_n0_)))))))))
    ^
13 warnings generated.



I'd like to find a way of making Clang NOT generate these spurious warnings, and to interpret the UTF-8 correctly, which it is not doing.

Comment 21 Jason Vas Dias 2023-03-31 05:17:22 UTC
Created attachment 1954825 [details]
improved demo program

Now this version of the demo program is slightly more elegant:

<code>
#include <stdio.h>

#include "M.h"

#define DEFAULT_𝕌8N()/**/0/**/

#define _𝕌N(_c_,_n_)/**/🗌﹖(𝕌8N,_n_)_c_()/**/

#define _𝕌﹩(_c_,_n_)/**/_ǝ(_n_)_c_()/**/

#define _𝕌8N(_c_,_n7_,_n6_,_n5_,_n4_, _n3_,_n2_,_n1_,_n0_)\
   _Ɐ_(_𝕌﹩,_c_,_Ɐ_(_𝕌N,_c_,_n7_,_n6_,_n5_,_n4_, _n3_,_n2_,_n1_,_n0_))

#define 𝕌8N﹩(_n7_,_n6_,_n5_,_n4_, _n3_,_n2_,_n1_,_n0_)\
   ﹩(\U,_𝕌8N(_🗌_,_n7_,_n6_,_n5_,_n4_, _n3_,_n2_,_n1_,_n0_))

#define 𝕌𝓃𝒽(_n7_,_n6_,_n5_,_n4_, _n3_,_n2_,_n1_,_n0_,_u_)\
  ((((U32_t)(_n7_))&0xF)<<28)|((((U32_t)(_n6_))&0xF)<<24)|((((U32_t)(_n5_))&0xF)<<20)|((((U32_t)(_n4_))&0xF)<<16)|\
  ((((U32_t)(_n3_))&0xF)<<12)|((((U32_t)(_n2_))&0xF)<<8)|((((U32_t)(_n1_))&0xF)<<4)|(((U32_t)(_n0_))&0xF)

#define _𝕌𝓃𝒽_() 𝕌𝓃𝒽
#define _𝓍(_c_,_X_)﹩(0x,_X_)_c_()

#define 𝕌8N(_n7_,_n6_,_n5_,_n4_, _n3_,_n2_,_n1_,_n0_)\
   _⦅⦆🞲_(_𝕌𝓃𝒽_,_Ɐ_(_𝓍,_,_,_𝕌8N(_,_,_n7_,_n6_,_n5_,_n4_,_n3_,_n2_,_n1_,_n0_)))

typedef unsigned char byte;
typedef unsigned int  U32_t;

const byte
  s[]=🙶(𝕌8N﹩(,,,,F,F,,4));

const U32_t
  c = 𝕌8N(,,,,F,F,,4);

int  main(int argc, char *const* argv)
{
U32_t
  c2= 𝕌𝓃𝒽(s[7],s[6],s[5],s[4],s[3],s[2],s[1],s[0],)
;

  printf("UTF-8 char %s has unicode name 0x%x and is represented by 0x%x.\n",
         s, c, c2
        );
  return 0;
}

</code>

I could even collapse the 'c2=' statement into a '_Ɐ_()' invocation.


But what really bugs me, is that while GCC compiles the above without 
complaint:

$ gcc -I. -Wall -o tUTF tUTF.c 
[jvd@jvdspc]:~/J [3295] 06:00:06 [#:1708!:21254]{0}	
$ ./tUTF
UTF-8 char $ has unicode name 0xff04 and is represented by 0xf404cf.
[jvd@jvdspc]:~/J [3295] 06:00:12 [#:1709!:21255]{0}	

Clang emits this spurious warning, even when I disable its
'unicode-homoglyph' warnings :

$ clang -I. -Wall -Wno-unicode-homoglyph -o tUTF_CL tUTF.c 
tUTF.c:15:9: warning: \U used with no following hex digits; treating as '\' followed by identifier [-Wunicode]
   ﹩(\U,_𝕌8N(_<U+1F5CC>_,_n7_,_n6_,_n5_,_n4_, _n3_,_n2_,_n1_,_n0_))
       ^
1 warning generated.	
$ ./tUTF_CL
UTF-8 char $ has unicode name 0xff04 and is represented by 0xf404cf.

The input parser should know not to evaluate macro arguments, but
to pass them through to CPP, for it to evaluate.

Why does clang complain about this when GCC does not ?

Comment 22 Jason Vas Dias 2023-03-31 05:39:03 UTC
Created attachment 1954826 [details]
More recent M.h

Added _,_().

Comment 23 Jason Vas Dias 2023-03-31 05:57:55 UTC
Created attachment 1954827 [details]
better demo program

Oops, c2 should have been:
 𝕌𝓃𝒽(s[0]>>4,s[0]&0xF,s[1]>>4,s[1]&0xF,s[2]>>4,s[2]&0xF,s[3]>>4,s[3]&0xF,)
It now prints the encoded UTF-8 hex correctly:

$ ./tUTF
UTF-8 char $ has unicode name 0xff04 and is represented by 0xefbc8400.

Comment 24 Jason Vas Dias 2023-03-31 06:00:32 UTC
Created attachment 1954828 [details]
Current demo program

Comment 25 Jason Vas Dias 2023-03-31 06:01:11 UTC
Created attachment 1954829 [details]
Current M.h

Comment 26 Jason Vas Dias 2023-03-31 15:40:33 UTC
RE: Comment #16: Thanks so much @Tom Honermann : those specs are exactly what I was looking for !

My only concern here is not so much standards compliance but more compatibility between default
/ vanilla / "no options" behaviour of GCC with Clang.

All my code has to work on both Linux and Android, I need to find a set of options that make
Clang & GCC behave as identically as possible under our use-cases.

Through this I learned I need the Clang '-Wno-unicode-homoglyph' option to achieve the above.

In general, I'd suggest making Clang disable some of the more pedantic UTF-8 standards compliance
warnings unless some extra '-Wunicode-standards' option is given - '-Wunicode' should be reserved for
complaining about really broken / unparsable unicode, IMHO - just my 2cents worth.

Comment 27 Jason Vas Dias 2023-03-31 16:00:09 UTC
And this spurious Clang warning:
  $ clang -I. -Wall -Wno-unicode-homoglyph -o tUTF_CL tUTF.c 
    tUTF.c:15:9: warning: \U used with no following hex digits; treating as '\' followed by identifier [-Wunicode]
    ﹩(\U,_𝕌8N(_<U+1F5CC>_,_n7_,_n6_,_n5_,_n4_, _n3_,_n2_,_n1_,_n0_))


is worrying because it suggests Clang is not ignoring input text within macro expansions as it should be.

Why does Clang's cpp complain about stringifying (\U) while GCC's does not ? 

Note there is no way in standards compliant CPP to paste a single escape character '\' / '??/' with anything
without a space being inserted .

And as for not using '\U' without leading zeros, clang itself is doing so in its output, like '<U+1F5CC>';
Clang should be consistent, using in its output the format it would expect to see in its input.

Comment 28 Jason Vas Dias 2023-03-31 19:21:29 UTC
Created attachment 1954952 [details]
Final / Best M.h

Incorporated UTF-8 macros into M.h - final version

Comment 29 Jason Vas Dias 2023-03-31 19:23:32 UTC
Created attachment 1954954 [details]
Final / Best tUTF.c

Much Better ! Now:

<code>

#include <stdio.h>

#include "M.h"

typedef unsigned char byte;
typedef unsigned int  U32_t;

const byte
  s[4]=🙶(𝕌8N﹩(,,,,F,F,,4));

const U32_t
  c = 𝕌8N(,,,,F,F,,4);

int  main(int argc, char *const* argv)
{
U32_t
  c2=𝕌𝕍﹩𝒷(&s);
;

  printf("UTF-8 char %s has unicode name 0x%x and is represented by 0x%x ( %X ).\n",
         s, c,
         ((((U32_t)(*((B4_t)(&s)))[0])<<24)|
          (((U32_t)(*((B4_t)(&s)))[1])<<16)|
          (((U32_t)(*((B4_t)(&s)))[2])<<8)|
           ((U32_t)(*((B4_t)(&s)))[3])
         ), c2
        );
  return 0;
}

</code>

Comment 30 Tom Honermann 2023-03-31 21:13:23 UTC
Jason, it is not clear to me what you are trying to accomplish or what problem(s) you are trying to solve. This isn't the right forum for discussing preprocessor tricks; I suggest posting to stackoverflow.com or another similar site.

> UTF-8 char $ has unicode name 0xff04 and is represented by 0xff48fef.

That is not correct. '$' (U+FF04, FULLWIDTH DOLLAR SIGN) is encoded in UTF-8 as the three hex bytes EF BC 84.

>   printf("UTF-8 char %s has unicode name 0x%x and is represented by 0x%x ( %X ).\n",
>          s, c,
>          ((((U32_t)(*((B4_t)(&s)))[0])<<24)|
>           (((U32_t)(*((B4_t)(&s)))[1])<<16)|
>           (((U32_t)(*((B4_t)(&s)))[2])<<8)|
>            ((U32_t)(*((B4_t)(&s)))[3])
>          ), c2
>         );

If this is intended to show the UTF-8 encoding of a select code point, well, it doesn't; that isn't how UTF-8 works. Wikipedia has a decent introductory explanation; https://en.wikipedia.org/wiki/UTF-8.

> $ clang -I. -Wall -Wno-unicode-homoglyph -o tUTF_CL tUTF.c 
> tUTF.c:15:9: warning: \U used with no following hex digits; treating as '\' followed by identifier [-Wunicode]
>    ﹩(\U,_𝕌8N(_<U+1F5CC>_,_n7_,_n6_,_n5_,_n4_, _n3_,_n2_,_n1_,_n0_))
>        ^

That warning is issued because the code is not valid C++. Per [cpp.concat]p3 (http://eel.is/c++draft/cpp.concat#3), construction of a universal-character-name via token concatenation has undefined behavior. Clang is being nice by telling you how it is interpreting the code and by warning you that the code is not portable.

Comment 31 Jason Vas Dias 2023-04-01 17:20:05 UTC
RE: Comment #30: Sorry, Tom, you picked up an obsoleted version of the file which did have that error.

The current version (Attachment #1954954 [details] : tUTF8.c, Attachment #1954952 [details]: M.h) prints:
$ gcc -I. -std=c17 -Wall -o tUTF tUTF.c
[jvd@jvdspc]:~/J [3295] 18:05:20 [#:1921!:21467]{0}	
$ ./tUTF
UTF-8 char $ has unicode name 0xff04 and is represented by 0xefbc8400 ( EFBC84 ).


My point is that it is rather difficult to elegantly work around the ambiguities,
self-contradictions, and inadequacies caused by slavish adherence to UTF-8
standards at the expense of C/C++ standards.

Ask yourself what happens when the compiler encounters:

  unsigned long x = 0x1234123412341234123412341234123412341234123412341234123412341234;
or
  char c='\x1234123412341234123412341234123412341234123412341234123412341234';

and why the behaviour is needlessly and confusingly so different for UTF-8.

I wrote an inline escape and UTF-8 parser / deparser in < 150 lines of code;
it was so simple because it insisted that only ONE character can be represented by ONE escape.

And as I needed to solve this problem for a work-related project, and package my CPP tricks
header somehow for work, I found these issues and a way of fixing them that works, 
"killing two birds with one stone".

I do think it is a bug that Clang says:

$ clang -I. -Wno-unicode-homoglyph -std=c17 -Wall -o tUTF tUTF.c
In file included from tUTF.c:3:
./M.h:228:12: warning: \U used with no following hex digits; treating as '\' followed by identifier [-Wunicode]
*/_𝕌8N(\U,_n7_,_n6_,_n5_,_n4_, _n3_,_n2_,_n1_,_n0_)/**/
        ^
1 warning generated.

when I am NOT compiling in C++ mode - I will get one of these warnings for EVERY use of my macro in Clang.
If you could please suggest a macro that avoids this warning, and still works, I will accept that this is
not a bug.
 
I am not using C++ , I explicitly tell the compilers I want C17 standard compliance.

Comment 32 Jason Vas Dias 2023-04-01 17:29:09 UTC
Created attachment 1955103 [details]
Actual version of tUTF.c compiled above

$ sha256sum tUTF.c
8ca539859d32f58c5371158a5946f4e8074a2e1f9606099790b73b5f37764684  tUTF.c

Comment 33 Jason Vas Dias 2023-04-01 17:32:28 UTC
Created attachment 1955104 [details]
version of M.h compiled above

$ sha256sum M.h
57f074cf166f20d6782d0ac4784d9c43a909750683ebe84c4f18da319b2ce3e6  M.h

Both above attachments are of type 'application/octet-stream', unlike
other attachments - maybe BZ was somehow messing up the UTF-8 ? :-)

Comment 34 Jason Vas Dias 2023-04-01 17:52:45 UTC
Another point is that clang is in no position to know how '\U' is going
to be used when it issues that particular warning - it should NOT
be issuing it during parsing of the header, for sure.

I guess I'll have to start debugging Clang's CPP and develop a patch
for it.

Comment 35 Jason Vas Dias 2023-04-01 17:54:37 UTC
Clang's CPP does not appear to understand that text in a macro definition
is not a "Usage" - a basic, fundamental, non-standards-compliant error.

Comment 36 Jason Vas Dias 2023-04-01 18:39:08 UTC
RE: Comment #30:

You don't need to reference any of the above code to see that what you said:
> If this is intended to show the UTF-8 encoding of a select code point, well, it doesn't;
is incorrect:


[jvd@jvdspc]:~/J [3295] 19:26:35 [#:1950!:21496]{0}	
$ cexp -d 'typedef unsigned int U32' -d 'const unsigned char *dollar="\U0000FF04"' -p '%hhx %hhx %hhx %x\n' 'dollar[0], dollar[1], dollar[2], ( ((((U32)(*dollar))&0xFFU)<<16) | ((((U32)(*(dollar+1)))&0xFFU)<<8) | ( ((U32)(*(dollar+2)))&0xFFU) )'
ef bc 84 efbc84

removed '/tmp/rHLSz7.c'
removed '/tmp/rHLSz7'
[jvd@jvdspc]:~/J [3295] 19:26:42 [#:1951!:21497]{0}	
$ cexp -k tUTF_decode -d 'typedef unsigned int U32' -d 'const unsigned char *dollar="\U0000FF04"' -p '%hhx %hhx %hhx %x\n' 'dollar[0], dollar[1], dollar[2], ( ((((U32)(*dollar))&0xFFU)<<16) | ((((U32)(*(dollar+1)))&0xFFU)<<8) | ( ((U32)(*(dollar+2)))&0xFFU) )'
ef bc 84 efbc84

[jvd@jvdspc]:~/J [3295] 19:26:53 [#:1952!:21498]{0}	
$ cat tUTF_decode.c
#include <sys/types.h>
#include <stdint.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
int main(int argc, const char *const* argv, const char *const* envp)
{
	typedef unsigned int U32 ;
	const unsigned char *dollar="\U0000FF04" ;
	printf("%hhx %hhx %hhx %x\n", 	dollar[0], dollar[1], dollar[2], ( ((((U32)(*dollar))&0xFFU)<<16) | ((((U32)(*(dollar+1)))&0xFFU)<<8) | ( ((U32)(*(dollar+2)))&0xFFU) ));
}

Now, if I can write:
  "\U0000FF04", by all the C and C++ standards I've read, I can also write:
  "\U" "0000FF04".

Clang should not complain about usage of ("\U"...) in a CPP macro parameter, is all I am saying.


Indeed, both GCC and Clang compile the above fine.

Comment 37 Jason Vas Dias 2023-04-01 19:23:36 UTC
Here's the thing: both GCC and Clang backends rightly barf with the input:

  "\U" "0000FF04"

But, my M.h code does NOT actually generate this input to the backend -
by the time the code reaches the backend, CPP has already coalesced
the strings into

  "\U0000FF04"

which is a 100% valid UTF-8 string.

This is because the use of the 🙶() macro ensures that ("\U" "0000FF04",...)
becomes ("\U0000FF04",...) so that :

const byte
  s[4]=🙶(𝕌8N﹩(,,,,F,F,,4));

emits

const byte
  s[4]="\U0000FF04";

to the compiler backend, in both cases with GCC and Clang - I can prove this 
with strace logs if desired.

That is why it is particularly galling that I still cannot use my macro
package at work, because it would produce too many warnings under Clang,
when the problem is Clang's spurious warning generation.

So why is Clang's pre-processor generating a spurious warning for the occurrence
of ("\U",...) in a macro definition, when actual uses of the macro are all within
other macro calls, so that string coalescing does occur and a correct UTF-8 string 
is formed?

It is because Clang's pre-processor is fundamentally flawed & broken in this
respect only: either it is incorrectly applying the valid-UTF-8 input checks
to macro parameters that have not been used, or it is applying them at the
wrong stage of macro expansion; it should ONLY be applying them to the final
result of macro expansion, not to any intermediate stage.

Please explain to me how this is not a bug, I don't get it.

Comment 38 Tom Honermann 2023-04-03 05:35:24 UTC
Jason, I'm not going to engage in any further discussion regarding your M.h header or in any examples that use it; if you believe that the compilers are not behaving correctly, please provide a minimal test case. I find the contents of the M.h header inscrutable.

(In reply to Jason Vas Dias from comment #31)
> I do think it is a bug that Clang says:
> 
> $ clang -I. -Wno-unicode-homoglyph -std=c17 -Wall -o tUTF tUTF.c
> In file included from tUTF.c:3:
> ./M.h:228:12: warning: \U used with no following hex digits; treating as '\'
> followed by identifier [-Wunicode]
> */_𝕌8N(\U,_n7_,_n6_,_n5_,_n4_, _n3_,_n2_,_n1_,_n0_)/**/
>         ^
> 1 warning generated.

All major compilers issue similar warnings or errors for use of a stray '\U' sequence or for construction of a UCN via token concatenation in their C17 modes. See https://godbolt.org/z/bbKTeTxeK.

> I'll guess I'll have to start debugging Clang's CPP and develop a patch for it.

For a patch to be accepted, you will have to demonstrate that the compiler is failing to conform to the C standard.

> Please explain to me how this is not a bug, I don't get it.

UCNs are processed during translation phase 4. In C17, the wording states in 5.1.1.2 (Translation phases):

"Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed. If a character sequence that matches the syntax of a universal character name is produced by token concatenation (6.10.3.3), the behavior is undefined. A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively. All preprocessing directives are then deleted."

Please take note of the second sentence.

Both the C and C++ standards currently state that production of a UCN via token concatenation results in undefined behavior. As I stated earlier, Clang is kindly advising you that your code is not portable because it relies on undefined behavior.

There is a proposal for C++ to change the specification so that production of a UCN via token concatenation will become well-formed. See https://wg21.link/p2621. That proposal currently targets C++26. I am not aware of a corresponding proposal for C, but if the C++ proposal is adopted, I expect a similar one to materialize for C. As of right now, the proposal would only make such code well-formed in C++26, not for any earlier standards modes.

Comment 39 Jason Vas Dias 2023-04-03 12:28:07 UTC
RE: Comment #38:  Thanks so much Tom for your most informative responses.

Yes, "CPP Tips and Tricks": https://github.com/pfultz2/Cloak/wiki/C-Preprocessor-tricks,-tips,-and-idioms ,
which lines 175-138 of my M.h faithfully paraphrase, takes time to grasp. A remarkable piece of work!

But this bug has nothing to do with such advanced CPP usage, EXCEPT in the respect 
THAT IT IS POSSIBLE TO FORM A WELL-FORMED UTF-8 STRING ESCAPE SEQUENCE USING CPP MACROS ALONE .
                                                ^^^^^^

Note I said 'STRING', not 'CHARACTER' .

My M.h code does demonstrate that it is possible to form well-formed standards conformant
UTF-8 double-quoted STRINGS with macros, at no point does my code above form UTF-8
CHARACTER CONSTANTS with macros.

Yes, all C / C++ standards DO mandate 
  "If a character sequence that matches the syntax of a universal character name is produced 
   by token concatenation (6.10.3.3), the behavior is undefined."
.

But none of my code above is forming CHARACTER escape sequences, it is forming STRING escape
sequences. There is a big difference, which is why BOTH Clang and GCC DO allow the STRING
tokens it produces through.

They DO NOT work for SYMBOL NAMES:

 const byte
  s[4]=🙶(𝕌8N﹩(,,,,F,F,,4));

 becomes

 const byte
  s[4]="\U0000FF04";


Whereas, IFF that part of the standard quoted above were NOT in effect, I WOULD be able to do:
A.
  const byte
   𝕌8N﹩(,,,,F,F,,4) [4] = 🙶(𝕌8N﹩(,,,,F,F,,4));

Which, because the line in the standards above is ENFORCED by both GCC and Clang, becomes
B.
  const byte
   \U 0000FF04 [4] = "\U0000FF04";

You can see this with output from 'gcc -E ...' / 'clang -E ...' .

So I am ONLY complaining about the fact that Clang issues a warning in the (A.) case above,
when GCC correctly DOES NOT, with maximum warning level options, because my (\U,...)
occurs only within a Macro Definition, NOT in a Usage.

Text within Macro Parameters, Macro Definitions, and Macro Expansions must be 
completely ignored with respect to correctness until the FINAL RESULT of the expansion
is known; then UTF-8 correctness / undefined behavior checks can be made only
on the FINAL RESULT of the expansion, not on any macro definition or
intermediate phase.

THERE IS NO UNDEFINED BEHAVIOUR involved with any statements generated by M.h or tUTF.c
(latest versions attached above).

Clang's CPP is getting the results right, but it should not be issuing a warning about
them. This is my only point here.

Comment 40 Tom Honermann 2023-04-03 17:06:21 UTC
> THAT IT IS POSSIBLE TO FORM A WELL-FORMED UTF-8 STRING ESCAPE SEQUENCE USING CPP MACROS ALONE .
>                                                 ^^^^^^

That is incorrect.

The C and C++ standards do not specify a "UTF-8 string escape sequence"; they specify universal-character-names (UCNs) of the form `\uXXXX` and `\UXXXXXXXX`. UCNs are recognized in three contexts: 1) in string literals, 2) in character constants/literals, and 3) in pp-tokens/identifiers. The standards do specify some different requirements for those contexts; for example, a UCN in a pp-token/identifier is not allowed to specify a code point corresponding to a member of the basic character set, while a UCN in the other contexts may.

The wording that I provided applies to the preprocessor. Please read it again. Construction of a UCN via token concatenation is (currently) undefined behavior, even as an intermediate result of macro expansion. Both standards are quite clear on this point.

Again, I will not spend time looking at examples that are incomplete or that depend on your M.h header. If you want to discuss a specific example, provide a complete, minimal, standalone example that uses simple identifiers.

Consider this minimal test case and the behavior compilers exhibit at https://godbolt.org/z/MPb6nYPGr.

#define DOSTRINGIFY(X) #X
#define STRINGIFY(X) DOSTRINGIFY(X)
#define CONCAT(A,B) A##B
STRINGIFY(CONCAT(\u,1234))

Clang issues a warning, icc issues an error, and gcc and MSVC accept the code with no diagnostic. These are all conforming behaviors. The code has undefined behavior, so implementors can do whatever they like with it. I prefer the icc behavior personally, followed by the Clang behavior. The fact that three of the four selected compilers behave as you expect is motivation for the changes proposed in https://wg21.link/p2621.

Comment 41 Ben Cotton 2023-04-25 18:26:06 UTC
This message is a reminder that Fedora Linux 36 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 36 on 2023-05-16.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '36'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version. Note that the version field may be hidden.
Click the "Show advanced fields" button if you do not see it.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 36 reached end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 42 Ludek Smid 2023-05-25 19:32:16 UTC
Fedora Linux 36 entered end-of-life (EOL) status on 2023-05-16.

Fedora Linux 36 is no longer maintained, which means that it
will not receive any further security or bug fix updates. As a result we
are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux
please feel free to reopen this bug against that version. Note that the version
field may be hidden. Click the "Show advanced fields" button if you do not see
the version field.

If you are unable to reopen this bug, please file a new report against an
active release.

Thank you for reporting this bug and we are sorry it could not be fixed.
