Bug 1448334 - [GSS]glusterfind pre crashes with "UnicodeDecodeError: 'utf8' codec can't decode" error when the `--no-encode` option is used
Summary: [GSS]glusterfind pre crashes with "UnicodeDecodeError: 'utf8' codec can't dec...
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: glusterfind
Version: rhgs-3.1
Hardware: All
OS: All
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Aravinda VK
QA Contact: Vinayak Papnoi
Whiteboard: rebase
Depends On:
Blocks: 1451724 RHGS-3.4-GSS-proposed-tracker 1503135 1572570
Reported: 2017-05-05 08:48 UTC by Riyas Abdulrasak
Modified: 2018-09-07 09:15 UTC (History)

Fixed In Version: glusterfs-3.12.2-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1451724
Last Closed: 2018-09-04 06:32:21 UTC
Target Upstream Version:

System                  ID              Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata  RHSA-2018:2607  0        None      None    None     2018-09-04 06:33:53 UTC

Description Riyas Abdulrasak 2017-05-05 08:48:57 UTC
Description of problem:

The glusterfind pre command crashes with the backtrace below when it is run with the "--no-encode" option and Russian filenames are present.

utf_8.py:16:decode:UnicodeDecodeError: 'utf8' codec can't decode byte 0xf0 in position 19: invalid continuation byte

Traceback (most recent call last):
  File "/usr/libexec/glusterfs/glusterfind/changelog.py", line 402, in <module>
    actual_end = changelog_crawl(args.brick, start, end, args)
  File "/usr/libexec/glusterfs/glusterfind/changelog.py", line 345, in changelog_crawl
    return get_changes(brick, working_dir, log_file, start, end, args)
  File "/usr/libexec/glusterfs/glusterfind/changelog.py", line 296, in get_changes
    parse_changelog_to_db(changelog_data, change, args)
  File "/usr/libexec/glusterfs/glusterfind/changelog.py", line 223, in parse_changelog_to_db
    changelog_data.when_create_mknod_mkdir(changelogfile, data)
  File "/usr/libexec/glusterfs/glusterfind/changelogdata.py", line 333, in when_create_mknod_mkdir
    bn1 = bn1.decode("utf-8").strip()
  File "/usr/lib64/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf0 in position 19: invalid continuation byte

Local variables in innermost frame:
input: '83477_201704261953_\xf0\xe5\xe7.csv'
errors: 'strict'
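
The failing input can be checked in isolation. The following is a minimal Python 3 sketch (the crash itself was on Python 2.7) that uses the byte string from the innermost frame. The variable names are illustrative, and the Windows-1251 decoding at the end is only an assumption about how the name might have been encoded; the report states no more than that the filenames are Russian.

```python
# Byte string copied from the "input" local variable in the traceback.
name = b"83477_201704261953_\xf0\xe5\xe7.csv"

try:
    name.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The offending byte is the first one after the ASCII prefix.
print(error.start, hex(name[error.start]))  # 19 0xf0

# ASSUMPTION: the name was written in a legacy Cyrillic code page such as
# Windows-1251. Under that assumption the three bytes are ordinary letters.
guess = name.decode("cp1251")
```

If the assumption holds, the name was never UTF-8 in the first place, which is why a strict UTF-8 decode cannot succeed.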


Version-Release number of selected component (if applicable):

Red Hat Gluster Storage 3.1.3

How reproducible:

Steps to Reproduce:

Will be added later. 

Actual results:

Glusterfind is crashing with --no-encode. 

Expected results:

glusterfind should not crash

Additional info:

count:          5
reason:         utf_8.py:16:decode:UnicodeDecodeError: 'utf8' codec can't decode byte 0xf0 in position 19: invalid continuation byte
package:        glusterfs-server-3.7.9-12.el7rhgs
pkg_vendor:     Red Hat, Inc.
cmdline:        python /usr/libexec/glusterfs/glusterfind/changelog.py bacula-inc ria22_media /rhgs/ria22_media/brick /usr/var/lib/misc/glusterfsd/glusterfind/bacula-inc/ria22_media/tmp_output_1 1493315988 --output-prefix /mnt/rhs/rhg_ria22_media --no-encode
executable:     /usr/libexec/glusterfs/glusterfind/changelog.py
reporter:       libreport-

Comment 11 Aravinda VK 2017-05-17 11:34:14 UTC
Upstream patch sent for review

Comment 25 Aravinda VK 2017-06-16 10:22:30 UTC
RCA: All file names are quoted (urllib.quote_plus) by the changelog parser. This quoting also covers some of the Unicode characters.

Simple reproducer:

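# Python 2 (print statement, urllib.unquote_plus)
import urllib

filename = u'c3db87ab-3cd8-4e85-921c-5f6e74b3c36d%2F227799_201705301938_%F0%E5%E7.csv'
# Raises UnicodeDecodeError: the unquoted bytes are not valid UTF-8
print urllib.unquote_plus(filename.encode("utf-8")).decode("utf-8")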

filename is shown as it is stored in the parsed changelog (already quoted by libgfchangelog).

unquote_plus("%2F") is "/".

Note that part of the file basename is also quoted: "201705301938_%F0%E5%E7.csv".

filename.encode("utf-8") has no effect because the non-ASCII bytes are already percent-quoted, so the string is plain ASCII. unquote_plus then yields the raw bytes \xf0\xe5\xe7, which were never UTF-8 encoded, so decode("utf-8") fails.
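
The two observations above can be checked directly. This is a minimal Python 3 sketch, not the glusterfind code: it uses urllib.parse.unquote_to_bytes in place of the Python 2 urllib.unquote_plus so that the raw bytes stay visible, and the variable names are illustrative.

```python
from urllib.parse import unquote_to_bytes

# Quoted name as it appears in the parsed changelog (from the reproducer).
quoted = "c3db87ab-3cd8-4e85-921c-5f6e74b3c36d%2F227799_201705301938_%F0%E5%E7.csv"

# Encoding the quoted form is a no-op in practice: every character is ASCII.
encode_is_noop = quoted.encode("utf-8") == quoted.encode("ascii")

# Unquoting restores the raw bytes of the original name.
raw = unquote_to_bytes(quoted)
basename = raw.rsplit(b"/", 1)[-1]
print(basename)  # b'227799_201705301938_\xf0\xe5\xe7.csv'

# Those bytes are not valid UTF-8, so a strict decode is rejected.
try:
    basename.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False
```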

Solution: Changing the order of unquote_plus and encode("utf-8") should fix the issue: first unquote_plus, then encode only if the name has to be quoted again. (Encoding is required if quote_plus is to be used later; otherwise it fails with KeyError.)

Note: I have yet to verify the solution.
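
One way to picture the "unquote first" ordering is the Python 3 sketch below. It is not the upstream patch. The helpers unquote_name and requote_name are hypothetical, and the surrogateescape error handler is a substituted technique (assumed here so that names in any encoding survive a round trip), not something the RCA proposes.

```python
from urllib.parse import quote_plus, unquote_to_bytes


def unquote_name(quoted):
    """Hypothetical helper: unquote a changelog name without assuming UTF-8."""
    # quote_plus writes spaces as '+'; unquote_to_bytes does not undo that,
    # so convert them first (a literal '+' is stored as %2B and is unaffected).
    raw = unquote_to_bytes(quoted.replace("+", " "))
    # ASSUMPTION: surrogateescape keeps undecodable bytes instead of failing.
    return raw.decode("utf-8", errors="surrogateescape")


def requote_name(name):
    """Hypothetical inverse: recover the original bytes, then quote again."""
    return quote_plus(name.encode("utf-8", errors="surrogateescape"))


quoted = "c3db87ab-3cd8-4e85-921c-5f6e74b3c36d%2F227799_201705301938_%F0%E5%E7.csv"
name = unquote_name(quoted)      # no exception, bytes preserved as surrogates
round_trip = requote_name(name)  # identical to the stored form
```

A name that really is UTF-8 decodes normally through the same path, so the helpers treat both cases uniformly.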

Comment 27 Atin Mukherjee 2017-07-03 11:25:05 UTC
Upstream patch: https://review.gluster.org/#/c/17674/

Comment 30 Vinayak Papnoi 2018-04-18 10:13:21 UTC
Build number : glusterfs-3.12.2-7.el7rhgs.x86_64

Created files and directories with Russian filenames, and with spaces and newlines in the file/dir names.

glusterfind pre works as expected with and without the --no-encode option. No traceback seen. File/dir names are displayed correctly in the outfile both with and without the --no-encode option.

[root@dhcp43-18 ~]# glusterfind pre alpha-sess-1 alpha --no-encode /tmp/out.txt
Generated output file /tmp/out.txt
[root@dhcp43-18 ~]# glusterfind pre alpha-sess-1 alpha --regenerate-outfile /tmp/out1.txt
Generated output file /tmp/out.txt

Hence, moving bug to verified.

Comment 31 errata-xmlrpc 2018-09-04 06:32:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

