Bug 1281940 - Unicode decoding and encoding errors when mixing unicode and utf8 encoded strings
Summary: Unicode decoding and encoding errors when mixing unicode and utf8 encoded str...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.17.11
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ovirt-3.6.2
: 4.17.16
Assignee: Nir Soffer
QA Contact: Gonza
URL:
Whiteboard:
Depends On:
Blocks: 1260131
TreeView+ depends on / blocked
 
Reported: 2015-11-13 21:00 UTC by Nir Soffer
Modified: 2019-10-10 10:30 UTC (History)
9 users (show)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2016-02-18 11:10:10 UTC
oVirt Team: Infra
Embargoed:
oourfali: ovirt-3.6.z?
rule-engine: planning_ack?
oourfali: devel_ack+
pstehlik: testing_ack+


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 48565 0 master ABANDONED compat: Use simplejson instead of json Never
oVirt gerrit 48661 0 master MERGED startup: Change system default encoding to utf8 2015-12-21 14:24:45 UTC
oVirt gerrit 51273 0 ovirt-3.6 MERGED startup: Change system default encoding to utf8 2016-01-04 08:24:27 UTC

Description Nir Soffer 2015-11-13 21:00:41 UTC
Description of problem:

jsonrpc pass unicode objects to vdsm apis when it should pass plain strings:

2015-11-13 20:17:15,690 INFO [dispatcher] (jsonrpc.Executor/0) Run and protect: connectStoragePool(spUUID=u'591475db-6fa9-455d-9c05-7f6e30fb06d5', hostID=1, msdUUID=u'6c77adb1-74fc-4fa9-a0ac-3b5a4b789318', masterVersion=3, domainsMap={u'869b6036-a9ca-43eb-9287-828803822783': u'active', u'6267179b-bd65-43c1-8a1e-f33cf12582fa': u'attached', u'e43d9b66-aef1-49d7-bfb4-bc06ddc9f025': u'attached', u'6c77adb1-74fc-4fa9-a0ac-3b5a4b789318': u'active'}, options=None) [logUtils:48(wrapper)]

In this example, arguments like spUUID are pass as a unicode object
u'591475db-6fa9-455d-9c05-7f6e30fb06d5' instead of the plain string:
'591475db-6fa9-455d-9c05-7f6e30fb06d5'.

These unicode strings may cause failures when combining or formatting them
with non-ascii strings:

>>> "ascii" + "\xd7\x90"
'ascii\xd7\x90'

>>> u"ascii" + "\xd7\x90"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 0: ordinal not in range(128)

Since u"ascii" is unicode string, Python try to convert "\xd7\x90" assuming
the default encoding (ascii), and fails.

Another example is logging:

>>> import logging
>> logging.error("ascii=%s utf8=%s", u"ascii", "\xd7\x90")
Traceback (most recent call last):
  File "/usr/lib64/python2.7/logging/__init__.py", line 859, in emit
    msg = self.format(record)
  File "/usr/lib64/python2.7/logging/__init__.py", line 732, in format
    return fmt.format(record)
  File "/usr/lib64/python2.7/logging/__init__.py", line 471, in format
    record.message = record.getMessage()
  File "/usr/lib64/python2.7/logging/__init__.py", line 335, in getMessage
    msg = msg % self.args
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 0: ordinal not in range(128)

Again, works fine with using string object:

>>> logging.error("ascii=%s utf8=%s", "ascii", "\xd7\x90")
ERROR:root:ascii=ascii utf8=א

The root cause is using the json library from the Python standard library.

>>> import json
>>> json.loads(json.dumps({"key": unicode("\xd7\x90", 'utf8')}))
{u'key': u'\u05d0'}

Here the string "key" was converted to u"key".

When running on Python 2.6, we used the simplejson library, which
does not have this issue:

>>> json.loads(json.dumps({"key": unicode("\xd7\x90", 'utf8')}))
{'key': u'\u05d0'}

Note that "key" is a plain string as it should be.

We do use simplejson when it is available:

# lib/vdsm/compat.py
try:
    # on RHEL/Centos 6.x, the JSON module in the python standard
    # library does not include significant speedups:
    # stdlib is based on simplejson 1.9, speedups were added on 2.0.9.
    # In general, speedups are first found on the
    # simplejson package.
    import simplejson as json
    json  # make pyflakes happy
except ImportError:
    # no big deal, fallback to standard libary
    import json
    json  # yep, this is needed twice.

But we do not require simplejson since this patch:

commit fb1c44af320fbeb99f1cba406eae70e653b6de2f
Author: Yaniv Bronhaim <ybronhei>
Date:   Sun Jun 14 14:03:26 2015 +0300

    Remove el6 specific code from spec

    Change-Id: Ifef95b7c61b8bd8c86ef2642e74275da87659cb8
    Signed-off-by: Yaniv Bronhaim <ybronhei>
    Reviewed-on: https://gerrit.ovirt.org/42317
    Continuous-Integration: Jenkins CI
    Reviewed-by: Dima Kuznetsov <dkuznets>
    Reviewed-by: Dan Kenigsberg <danken>

Since the library is not available, we use fallback to builtin
json.

Version-Release number of selected component (if applicable):
Since 3.5, when running on EL7 or Fedora (both use Python 2.7)

How reproducible:
Always

Comment 1 Nir Soffer 2015-11-13 21:05:08 UTC
Setting severity to high since this can fail some flows when non-ascii
values are sent from engine.

Changing component to core, since jsonrpc is not the root cause, it just
happens when using it instead of xmlrpc.

Comment 2 Oved Ourfali 2015-11-14 07:20:20 UTC
Nir, can I assign to you? 
Should it be on post?

Comment 3 Nir Soffer 2015-11-14 11:14:20 UTC
Taken

Comment 4 Oved Ourfali 2015-11-14 12:20:19 UTC
(In reply to Nir Soffer from comment #3)
> Taken

Thanks. 
Highly appreciated!

Comment 5 Dan Kenigsberg 2015-11-14 15:17:42 UTC
(In reply to Nir Soffer from comment #1)
> Setting severity to high since this can fail some flows when non-ascii
> values are sent from engine.

This seems the underlying bug: Engine sending unadorned binary data as strings violates jsonrpc spec. If we take the suggested patch as a stop-gap measure to avoid a 3.6 regression, we should include a logging mechanism to catch such cases and fix them in Engine.

Comment 6 Nir Soffer 2015-11-14 15:55:46 UTC
(In reply to Dan Kenigsberg from comment #5)
> (In reply to Nir Soffer from comment #1)
> > Setting severity to high since this can fail some flows when non-ascii
> > values are sent from engine.
> 
> This seems the underlying bug: Engine sending unadorned binary data as
> strings violates jsonrpc spec.

JSON string is unicode [1] - can you show where the jsonrpc spec [2]
forbids unicode strings?

Also vdsm schema does not mention anything about unicode values. If we
forbid them it should be documented.

> If we take the suggested patch as a stop-gap
> measure to avoid a 3.6 regression, we should include a logging mechanism
> to catch such cases and fix them in Engine.

Warning about unicode values is expensive, requires recursive search in 
arguments dict in jsonrpc, and manual checking of the argument in xmlrpc.

We can enhance the password protection code to warn about unicode values.
But I think we better spend time on useful features and handle this in 
the schema.

[1] http://www.json.org/
[2] http://www.jsonrpc.org/specification

Comment 7 Nir Soffer 2015-11-16 17:46:56 UTC
We are investigating a different direction, that will allow using the
built-in json library safely, and be more compatible with Python 3.

Comment 8 Nir Soffer 2015-11-17 10:34:00 UTC
String to use the json built-in library expose this issue, but it is not the
root cause. The root cause is mixing unicode and utf8 encoded strings, 
while python default encoding in Python 2 is ascii.

Each time a string is encoded using str() or decoded using unicode() or 
x.decode() without specifying the encoding, python default to 'ascii'. 

It seems that changing the default encoding is a quick fix avoiding 
these issues, until vdsm is fixed to never mix unicode and encoded strings.

Comment 10 Sandro Bonazzola 2015-12-23 15:04:05 UTC
Missing backport to 3.6 branch, moving to post

Comment 11 Oved Ourfali 2016-01-03 07:40:23 UTC
Nir - what's the status of this one?

Comment 12 Nir Soffer 2016-01-03 08:18:38 UTC
(In reply to Oved Ourfali from comment #11)
> Nir - what's the status of this one?
Fixed in master.

This fix is required for 
supporting non-ascii vm name, but
libvirt does does not support
it yet.

See https://gerrit.ovirt.org/48570

Do you want to backport it?

Comment 13 Oved Ourfali 2016-01-03 08:28:26 UTC
Okay. Moving to 3.6.3, hoping this will get into libvirt by then.
If you can backport the merged patch that would be great.

Comment 14 Nir Soffer 2016-01-03 22:49:23 UTC
(In reply to Oved Ourfali from comment #13)
> Okay. Moving to 3.6.3, hoping this will get into libvirt by then.

If bug 1285720 is resolved, this fix is good enough to allow vms with non-acsii
names.

> If you can backport the merged patch that would be great.

Patch is ready for merging to 3.6.

Comment 15 Gonza 2016-01-21 12:57:03 UTC
Verified with:
vdsm-4.17.18-0.el7ev.noarch

From vdsm log:
Thread-47::INFO::2016-01-21 13:04:06,489::logUtils::48::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='00000001-0001-0001-0001-0000000001a8', hostID=1, msdUUID='ad2d4bfe-5683-4d40-84e1-80522e030890', masterVersion=1, domainsMap={'ad2d4bfe-5683-4d40-84e1-80522e030890': 'active'}, options=None)


Note You need to log in before you can comment on or make changes to this bug.