1448334 – [GSS]glusterfind pre crashes with "UnicodeDecodeError: 'utf8' codec can't decode" error when the `--no-encode` is used


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	glusterfind
Sub Component:
Version:	rhgs-3.1
Hardware:	All
OS:	All
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	RHGS 3.4.0
Assignee:	Aravinda VK
QA Contact:	Vinayak Papnoi
Docs Contact:
URL:
Whiteboard:	rebase
Depends On:
Blocks:	1451724 RHGS-3.4-GSS-proposed-tracker 1503135 1572570

Reported:	2017-05-05 08:48 UTC by Riyas Abdulrasak
Modified:	2021-12-10 15:02 UTC
CC List:	9 users
Fixed In Version:	glusterfs-3.12.2-1
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1451724
Environment:
Last Closed:	2018-09-04 06:32:21 UTC
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2018:2607	0	None	None	None	2018-09-04 06:33:53 UTC

Description Riyas Abdulrasak 2017-05-05 08:48:57 UTC

Description of problem:

The glusterfind pre command crashes with the backtrace below when it is run with the `--no-encode` option and the file names are Russian.

~~~~~~~~
utf_8.py:16:decode:UnicodeDecodeError: 'utf8' codec can't decode byte 0xf0 in position 19: invalid continuation byte

Traceback (most recent call last):
  File "/usr/libexec/glusterfs/glusterfind/changelog.py", line 402, in <module>
    actual_end = changelog_crawl(args.brick, start, end, args)
  File "/usr/libexec/glusterfs/glusterfind/changelog.py", line 345, in changelog_crawl
    return get_changes(brick, working_dir, log_file, start, end, args)
  File "/usr/libexec/glusterfs/glusterfind/changelog.py", line 296, in get_changes
    parse_changelog_to_db(changelog_data, change, args)
  File "/usr/libexec/glusterfs/glusterfind/changelog.py", line 223, in parse_changelog_to_db
    changelog_data.when_create_mknod_mkdir(changelogfile, data)
  File "/usr/libexec/glusterfs/glusterfind/changelogdata.py", line 333, in when_create_mknod_mkdir
    bn1 = bn1.decode("utf-8").strip()
  File "/usr/lib64/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf0 in position 19: invalid continuation byte

Local variables in innermost frame:
input: '83477_201704261953_\xf0\xe5\xe7.csv'
errors: 'strict'

~~~~~~~~
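
The bytes \xf0\xe5\xe7 shown in the local variables above are not valid UTF-8, which would be consistent with a file name written in a legacy single-byte Cyrillic encoding. The following Python 3 sketch replays the failing decode on that input. The cp1251 guess is an assumption (the report only says the names are Russian); other encodings such as KOI8-R would also decode these bytes without error but give different text.

```python
# Byte string copied from "Local variables in innermost frame" in the report.
name = b"83477_201704261953_\xf0\xe5\xe7.csv"

# Strict UTF-8 decoding fails, as in the traceback.
try:
    name.decode("utf-8")
    error_offset = None
except UnicodeDecodeError as exc:
    error_offset = exc.start

# ASSUMPTION: the name was created in Windows-1251 (cp1251).
# The report does not state the original encoding.
guessed = name.decode("cp1251")

print("UTF-8 decode failed at offset:", error_offset)
print("Decoded as cp1251:", guessed)
```

The point is only that the brick can hold names whose bytes are not UTF-8, so any code path that decodes them strictly can raise.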


Version-Release number of selected component (if applicable):

Red Hat Gluster Storage 3.1.3

How reproducible:



Steps to Reproduce:

Will be added later. 

Actual results:

Glusterfind is crashing with --no-encode. 

Expected results:

glusterfind should not crash

Additional info:

count:          5
reason:         utf_8.py:16:decode:UnicodeDecodeError: 'utf8' codec can't decode byte 0xf0 in position 19: invalid continuation byte
package:        glusterfs-server-3.7.9-12.el7rhgs
pkg_vendor:     Red Hat, Inc.
cmdline:        python /usr/libexec/glusterfs/glusterfind/changelog.py bacula-inc ria22_media /rhgs/ria22_media/brick /usr/var/lib/misc/glusterfsd/glusterfind/bacula-inc/ria22_media/tmp_output_1 1493315988 --output-prefix /mnt/rhs/rhg_ria22_media --no-encode
executable:     /usr/libexec/glusterfs/glusterfind/changelog.py
reporter:       libreport-2.1.11.1

Comment 11 Aravinda VK 2017-05-17 11:34:14 UTC

Upstream patch sent for review
https://review.gluster.org/17317

Comment 25 Aravinda VK 2017-06-16 10:22:30 UTC

RCA: The changelog parser quotes all file names (urllib.quote_plus). As a result, the non-ASCII characters in a name are quoted as well.

Simple reproducer:

import urllib

filename = u'c3db87ab-3cd8-4e85-921c-5f6e74b3c36d%2F227799_201705301938_%F0%E5%E7.csv'
print urllib.unquote_plus(filename.encode("utf-8")).decode("utf-8")

filename is the name as it is stored in the parsed changelog (already quoted by libgfchangelog).

unquote_plus("%2F") is "/"

Note that part of the file basename is also quoted: "201705301938_%F0%E5%E7.csv".

filename.encode("utf-8") has no effect, because the non-ASCII characters are already percent-quoted and the string is plain ASCII. unquote_plus then returns the raw bytes \xf0\xe5\xe7, which were never produced by encode("utf-8") and are not valid UTF-8, so decode("utf-8") fails.
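
The same sequence can be reproduced in Python 3, where the byte/text split is explicit. This is a minimal sketch, not the glusterfind code: it uses urllib.parse.unquote_to_bytes in place of the Python 2 urllib.unquote_plus call, and the variable names are illustrative.

```python
from urllib.parse import unquote_to_bytes

# Quoted name as stored in the parsed changelog (from the reproducer above).
quoted = "c3db87ab-3cd8-4e85-921c-5f6e74b3c36d%2F227799_201705301938_%F0%E5%E7.csv"

# Percent-decoding yields raw bytes: %2F becomes "/" and
# %F0%E5%E7 becomes the three bytes \xf0\xe5\xe7.
raw = unquote_to_bytes(quoted)
basename = raw.rsplit(b"/", 1)[1]

# Those three bytes are not a valid UTF-8 sequence, so a strict decode raises.
try:
    basename.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    failing_byte = basename[exc.start]
    print("decode failed on byte", hex(failing_byte), "at offset", exc.start)
```

Encoding the quoted string before unquoting changes nothing here, since every character in it is ASCII; the failure comes entirely from the bytes that unquoting produces.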

Solution: Changing the order of unquote_plus and encode("utf-8") should fix the issue: unquote_plus first, then encode only if the name has to be quoted again. (Encode is required if quote_plus is used later; otherwise it fails with KeyError.)

Note: I am yet to verify the solution.
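
One way to picture the "unquote first, re-quote only when needed" idea is the Python 3 sketch below. It is not the upstream patch, and the helper names unquote_name and display_name are hypothetical. It assumes the name is kept as raw bytes after unquoting, and that a text form is needed only for output, where the surrogateescape error handler avoids a crash on non-UTF-8 bytes.

```python
from urllib.parse import quote_plus, unquote_to_bytes


def unquote_name(quoted):
    """Hypothetical helper: percent-decode a changelog name to raw bytes."""
    # unquote_to_bytes does not map "+" to a space, so do that first.
    return unquote_to_bytes(quoted.replace("+", " "))


def display_name(raw):
    """Hypothetical helper: text form that never raises on non-UTF-8 bytes."""
    return raw.decode("utf-8", errors="surrogateescape")


quoted = "c3db87ab-3cd8-4e85-921c-5f6e74b3c36d%2F227799_201705301938_%F0%E5%E7.csv"

raw = unquote_name(quoted)    # raw bytes, exactly as stored on the brick
text = display_name(raw)      # safe to build, even for non-UTF-8 names
requoted = quote_plus(raw)    # quoting the bytes again needs no decode step

print(requoted == quoted)
```

Keeping the bytes untouched between unquoting and re-quoting means the original name can always be recovered, whatever encoding it was created in.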

Comment 27 Atin Mukherjee 2017-07-03 11:25:05 UTC

upstream patch : https://review.gluster.org/#/c/17674/

Comment 30 Vinayak Papnoi 2018-04-18 10:13:21 UTC

Build number : glusterfs-3.12.2-7.el7rhgs.x86_64

Created files and directories with Russian names, and with spaces and newlines in the file/dir names.

Glusterfind pre works as expected with and without --no-encode option. No traceback seen. File/dir names are displayed in the outfile appropriately with/without --no-encode option.

[root@dhcp43-18 ~]# glusterfind pre alpha-sess-1 alpha --no-encode /tmp/out.txt
Generated output file /tmp/out.txt
[root@dhcp43-18 ~]# glusterfind pre alpha-sess-1 alpha --regenerate-outfile /tmp/out1.txt
Generated output file /tmp/out.txt


Hence, moving bug to verified.

Comment 31 errata-xmlrpc 2018-09-04 06:32:21 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607
