Bug 922433 - Fedora 19 bugs cannot be reported because the server side cannot handle the release name "Schrödinger's Cat"
Fedora 19 bugs cannot be reported because the server side cannot handle the r...
Status: CLOSED RAWHIDE
Product: Fedora
Classification: Fedora
Component: libreport (Show other bugs)
19
All All
unspecified Severity urgent
: ---
: ---
Assigned To: abrt
Fedora Extras Quality Assurance
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2013-03-16 16:58 EDT by Adam Williamson
Modified: 2013-09-03 15:02 EDT (History)
22 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-03-20 04:46:38 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments (Terms of Use)

  None (edit)
Description Adam Williamson 2013-03-16 16:58:51 EDT
[root@adam ccpp-2013-03-16-13:53:55-21780]# abrt-cli report ./
Server side error: 'Validation failed: error validating 'os': error validating 'version': string "" contains illegal characters'

File 'os-release' reads:

Fedora release 19 (Schrödinger's Cat)

If I edit os-release to read:

Fedora release 19 (Schrodingers Cat)

I can get past the error. (Didn't check if it's the ' or the ö that causes the problem, sorry).
Comment 1 Adam Williamson 2013-03-16 17:04:16 EDT
It's the ' that causes the trouble.
Comment 2 Adam Williamson 2013-03-16 17:24:50 EDT
actually, it's both.
Comment 3 Michal Toman 2013-03-18 06:54:10 EDT
The problem is in libreport src/lib/problem_data.c, function is_text_file().
The file containing "Schrödinger's Cat" is considered binary although it should be text (because of both ö and ').
Comment 4 Sergio Monteiro Basto 2013-03-18 19:09:08 EDT
ah , same here, no reports for F19 
I can't change the name
Comment 5 Dave Malcolm 2013-03-19 02:35:01 EDT
Quoting "man os-release"

"The /etc/os-release file contains operating system identification data."

"The basic file format of os-release is a newline-separated list of environment-like shell-compatible variable assignments."

[...snip...]

"Variable assignment values should be enclosed in double or single quotes if they include spaces, semicolons or other special characters outside of A-Z, a-z, 0-9."

[...snip...]

"All strings should be in UTF-8 format"

[...snip...]

"If double or single quotes or backslashes are to be used within variable assignments they should be escaped with backslashes, following shell style"
Comment 6 Denys Vlasenko 2013-03-19 05:43:38 EDT
Fixed in upstream git:

commit ea1f6ed6b1d742eca33ce9a598fd0b568e0fb0cb
Author: Denys Vlasenko <dvlasenk@redhat.com>
Date:   Mon Mar 18 22:20:06 2013 +0100

    is_text_file(): bump allowable non-ASCII chars from 2% to 10%. Closes rhbz#922433

    This change was tested to treat
    os_release = "Fedora release 19 (Schrödinger's Cat)"
    as text.
    I feel this is a stop-gap measure, though...
Comment 7 Adam Jackson 2013-03-19 08:45:27 EDT
Why don't you just require UTF8 like the spec says?
Comment 8 Adam Williamson 2013-03-19 12:50:11 EDT
yeah, just based on the info presented here, that sounds like all kinds of crazy. there are many many non-ASCII characters which are 'text'. Any function called 'is_text_file' should be perfectly happy with a string/file/whatever that contains nothing but Unicode. Okay, Unicode text characters if that's a concept that's easily expressed/consumed.
Comment 9 Jiri Moskovcak 2013-03-20 04:36:27 EDT
Ok, to make this more clear for everyone who might be interested why this  happened to abrt. The function is_text_file() is used to determine whether the file is text or binary and abrt uses this information when deciding how to treat this file - text files are usually shown in comments and binaries (if sent) are attachments. In this case the heuristics in is_text_file() failed, marked the file containing the os release as binary and didn't send it to abrt server. I think that using heuristics is a wrong idea, because deciding whether the file is unicode 'text' file or a binary which happens to be also a valid unicode file can't be 100% reliable.
Comment 10 David Woodhouse 2013-03-20 05:17:42 EDT
You shouldn't just be checking for bytes >= 0x80. You should be checking for bytes >= 0x80 *which are not part of a valid UTF-8 byte sequence*.

And your percentage threshold can probably go back to 2% or even lower. In the 21st century it's no longer such a gamble. In the olden days you had to allow a certain percentage to account for false positives from genuine non-ASCII characters in the legacy 8-bit encodings. But with the above change, those non-ASCII characters will no longer cause false positives.
Comment 11 Miloslav Trmač 2013-03-20 10:58:27 EDT
The canonical "is a file text or binary?" used throughout GNU is simply "does it have NUL (=0x00) bytes?".  E.g. http://www.gnu.org/software/diffutils/manual/html_node/Binary.html .
Comment 12 Jan Pazdziora 2013-03-22 02:57:11 EDT
(In reply to comment #11)
> The canonical "is a file text or binary?" used throughout GNU is simply
> "does it have NUL (=0x00) bytes?".  E.g.
> http://www.gnu.org/software/diffutils/manual/html_node/Binary.html .

It really depends on what you plan to do with that file / sequence of bytes. For handling it with C libraries, the ASCIIZ is good criteria. If you need to decide whether to Base64 that string or not when sending it as part of some XML(RPC) somewhere, the criteria has to be much more strict because XML is based on Unicode and not all sequences of non-0x00 bytes are valid.
Comment 13 Daniel Mach 2013-03-22 04:01:15 EDT
I'd probably avoid reinventing the wheel and rely on something which already works just fine:

$ cat /etc/system-release
Fedora release 19 (Schrödinger's Cat)

$ file /etc/system-release --dereference --mime --brief
text/plain; charset=utf-8
Comment 14 Thorsten Glaser 2013-03-22 12:41:10 EDT
Just use this one instead, then:

Schrödinger’s Cat

That’s even typographically better…
Comment 15 Nicolas Mailhot 2013-03-22 16:37:48 EDT
(In reply to comment #11)
> The canonical "is a file text or binary?" used throughout GNU is simply
> "does it have NUL (=0x00) bytes?".  E.g.
> http://www.gnu.org/software/diffutils/manual/html_node/Binary.html .

And it has been terminally since unicode started to be used
It triggers so often on a system using a non-English locale people routinely ignore it as a nuisance in most parts of the world
Comment 16 Denys Vlasenko 2013-03-23 18:10:21 EDT
(In reply to comment #13)
> I'd probably avoid reinventing the wheel and rely on something which already
> works just fine:
> 
> $ cat /etc/system-release
> Fedora release 19 (Schrödinger's Cat)
> 
> $ file /etc/system-release --dereference --mime --brief
> text/plain; charset=utf-8

We tried that. A string of simply one characher "0" isn't considered text by file tool (and by its underlying library):

# echo -n 0 >0
# file 0 
0: very short file (no magic)
Comment 17 Denys Vlasenko 2013-03-23 18:14:48 EDT
(In reply to comment #11)
> The canonical "is a file text or binary?" used throughout GNU is simply
> "does it have NUL (=0x00) bytes?".  E.g.
> http://www.gnu.org/software/diffutils/manual/html_node/Binary.html .

We can adopt something like that.
Currently, NULs are taken as "this file is binary", as well as excessive amounts of chars > 0x7e.
Just dropping the second condition would do what you propose.
But it may become too permissive (too biased to declare stuff to be "text").

How about this: not only NUL, but any control char that is not a whitespace laeds to a decision that file is "binary".
Comment 18 Jan Pazdziora 2013-03-25 03:29:11 EDT
(In reply to comment #17)
> 
> We can adopt something like that.
> Currently, NULs are taken as "this file is binary", as well as excessive
> amounts of chars > 0x7e.
> Just dropping the second condition would do what you propose.
> But it may become too permissive (too biased to declare stuff to be "text").
> 
> How about this: not only NUL, but any control char that is not a whitespace
> laeds to a decision that file is "binary".

Before you modify the algorithm -- what is the purpose of this "string is binary" test? What does the code do differently based on that test? Without knowing what codepaths your code will start taking if you modify the test, it's pretty dangerous to do the change.
Comment 19 Denys Vlasenko 2013-03-25 04:14:06 EDT
(In reply to comment #18)
> (In reply to comment #17)
> > 
> > We can adopt something like that.
> > Currently, NULs are taken as "this file is binary", as well as excessive
> > amounts of chars > 0x7e.
> > Just dropping the second condition would do what you propose.
> > But it may become too permissive (too biased to declare stuff to be "text").
> > 
> > How about this: not only NUL, but any control char that is not a whitespace
> > laeds to a decision that file is "binary".
> 
> Before you modify the algorithm -- what is the purpose of this "string is
> binary" test? What does the code do differently based on that test? Without
> knowing what codepaths your code will start taking if you modify the test,
> it's pretty dangerous to do the change.

The code shows text data in GUI textboxes, it puts in into emails it composes,
prints on stdout when asked to show detailed problem data.

Whereas binary files are excluded from all of these uses. When reporting bugs,
they are uploaded as attachments, not as parts of textual comments.
Comment 20 Jan Pazdziora 2013-03-25 04:21:30 EDT
(In reply to comment #19)
> 
> The code shows text data in GUI textboxes, it puts in into emails it
> composes,
> prints on stdout when asked to show detailed problem data.

Which means that you don't care that much about text / binary distinction but about valid UTF-8 / the rest. You can well have valid ISO-8859-1 text string but if you handle it as text and put into GUI without conversion, the result will be wrong.
Comment 21 Dave Malcolm 2013-03-25 11:09:13 EDT
(In reply to comment #7)
> Why don't you just require UTF8 like the spec says?

To reiterate Ajax's question: why were you trying to guess text vs binary?  The spec says that this file is UTF-8, and it is.
Comment 22 Denys Vlasenko 2013-03-25 12:17:59 EDT
commit ed2911f292275e879c80c7147fcd13a79f4e00b7
Author: Denys Vlasenko <dvlasenk@redhat.com>
Date:   Mon Mar 25 11:22:43 2013 +0100

    improve is_text_file() to not treat valid Unicode as bad_chars
    
    In valid Unicode, first non-ASCII byte is always 11xxxxxx
    and the following bytes are always 10xxxxxx.
    Random binary data is likely to violate this rule *a lot*.
    
    This patch makes bad_chars++ decision only if it sees a non-ASCII
    byte which violates the rule described above.
    
    Also, it makes immediate decision "it's binary"
    not only on NULs, but on any non-whitespace control chars.
    
    This hopefully improves handling of the problem
    identified in rhbz#922433
    
    Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
    Signed-off-by: Jakub Filak <jfilak@redhat.com>

diff --git a/src/lib/problem_data.c b/src/lib/problem_data.c
index 2e5e55f..77c46dd 100644
--- a/src/lib/problem_data.c
+++ b/src/lib/problem_data.c
@@ -288,7 +288,7 @@ static char* is_text_file(const char *name, ssize_t *sz)
     }
     lseek(fd, 0, SEEK_SET);
 
-    char *buf = (char*)xmalloc(*sz);
+    unsigned char *buf = xmalloc(*sz);
     ssize_t r = full_read(fd, buf, *sz);
     close(fd);
     if (r < 0)
@@ -306,7 +306,7 @@ static char* is_text_file(const char *name, ssize_t *sz)
     {
         base++;
         if (is_in_string_list(base, (char**)always_text_files))
-            return buf;
+            return (char*)buf;
     }
 
     /* Every once in a while, even a text file contains a few garbled
@@ -316,28 +316,44 @@ static char* is_text_file(const char *name, ssize_t *sz)
      * os_release = "Schrödinger's Cat". Bumped to 10%.
      * Alternatives: add os_release to always_text_files[]
      * or add "if it is valid Unicode, then it's text" check here.
+     *
+     * Replaced crude "buf[r] > 0x7e is bad" logic with
+     * "if it is a broken Unicode, then it's bad".
      */
     const unsigned RATIO = 10;
     unsigned total_chars = r + RATIO;
     unsigned bad_chars = 1; /* 1 prevents division by 0 later */
-    while (--r >= 0)
+    bool prev_was_unicode = 0;
+    ssize_t i = -1;
+    while (++i < r)
     {
-        if (buf[r] >= 0x7f
-         /* among control chars, only '\t','\n' etc are allowed */
-         || (buf[r] < ' ' && !isspace(buf[r]))
-        ) {
-            if (buf[r] == '\0')
-            {
-                /* We don't like NULs very much. Not text for sure! */
-                free(buf);
-                return NULL;
-            }
+        /* Among control chars, only '\t','\n' etc are allowed */
+        if (buf[i] < ' ' && !isspace(buf[i]))
+        {
+            /* We don't like NULs and other control chars very much.
+             * Not text for sure!
+             */
+            free(buf);
+            return NULL;
+        }
+        if (buf[i] == 0x7f)
             bad_chars++;
+        else if (buf[i] > 0x7f)
+        {
+            /* We test two possible bad cases with one comparison:
+             * (1) prev byte was unicode AND cur byte is 11xxxxxx:
+             * BAD - unicode start byte can't be in the middle of unicode char
+             * (2) prev byte wasnt unicode AND cur byte is 10xxxxxx:
+             * BAD - unicode continuation byte can't start unicode char
+             */
+            if (prev_was_unicode == ((buf[i] & 0x40) == 0x40))
+                bad_chars++;
         }
+        prev_was_unicode = (buf[i] > 0x7f);
     }
 
     if ((total_chars / bad_chars) >= RATIO)
-        return buf; /* looks like text to me */
+        return (char*)buf; /* looks like text to me */
 
     free(buf);
     return NULL; /* it's binary */
Comment 23 Jiri Moskovcak 2013-03-25 12:24:38 EDT
(In reply to comment #21)
> (In reply to comment #7)
> > Why don't you just require UTF8 like the spec says?
> 
> To reiterate Ajax's question: why were you trying to guess text vs binary? 
> The spec says that this file is UTF-8, and it is.

We (ABRT) don't have hardcoded knowledge that the /etc/fedora-release is utf8. It's just copied with other information to /var/tmp/abrt/<somedir> and processed when user decides to report the problem and by that time abrt doesn't know where the file came from so it tries to guess if it's text and uses this information when creating the report (the binary files are send as attachments and text files usually ends up in comment fields).
Comment 24 Dave Malcolm 2013-03-25 12:28:54 EDT
(In reply to comment #23)
> (In reply to comment #21)
> > (In reply to comment #7)
> > > Why don't you just require UTF8 like the spec says?
> > 
> > To reiterate Ajax's question: why were you trying to guess text vs binary? 
> > The spec says that this file is UTF-8, and it is.
> 
> We (ABRT) don't have hardcoded knowledge that the /etc/fedora-release is
> utf8. It's just copied with other information to /var/tmp/abrt/<somedir> and
> processed when user decides to report the problem and by that time abrt
> doesn't know where the file came from so it tries to guess if it's text and
> uses this information when creating the report (the binary files are send as
> attachments and text files usually ends up in comment fields).

Am I right in understanding that you're *not* tracking where all the files came from?  That sounds like something that should be fixed - it seems like extremely pertinent information to me.
Comment 25 David Woodhouse 2013-03-25 12:31:11 EDT
(In reply to comment #23)
> We (ABRT) don't have hardcoded knowledge that the /etc/fedora-release is
> utf8. It's just copied with other information to /var/tmp/abrt/<somedir> and
> processed when user decides to report the problem and by that time abrt
> doesn't know where the file came from so it tries to guess if it's text and
> uses this information when creating the report (the binary files are send as
> attachments and text files usually ends up in comment fields).

In that case, 'text' and 'utf-8' are fairly much synonymous, surely? If it's legacy 20th century 8-bit crap like ISO8859-1, it wants to be an attachment rather than wrongly included in a "text" part of the email anyway.

Note You need to log in before you can comment on or make changes to this bug.