Description of problem:
The glusterfind pre command crashes with the backtrace below when it is used with the "--no-encode" option on Russian filenames.

~~~~~~~~
utf_8.py:16:decode:UnicodeDecodeError: 'utf8' codec can't decode byte 0xf0 in position 19: invalid continuation byte

Traceback (most recent call last):
  File "/usr/libexec/glusterfs/glusterfind/changelog.py", line 402, in <module>
    actual_end = changelog_crawl(args.brick, start, end, args)
  File "/usr/libexec/glusterfs/glusterfind/changelog.py", line 345, in changelog_crawl
    return get_changes(brick, working_dir, log_file, start, end, args)
  File "/usr/libexec/glusterfs/glusterfind/changelog.py", line 296, in get_changes
    parse_changelog_to_db(changelog_data, change, args)
  File "/usr/libexec/glusterfs/glusterfind/changelog.py", line 223, in parse_changelog_to_db
    changelog_data.when_create_mknod_mkdir(changelogfile, data)
  File "/usr/libexec/glusterfs/glusterfind/changelogdata.py", line 333, in when_create_mknod_mkdir
    bn1 = bn1.decode("utf-8").strip()
  File "/usr/lib64/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf0 in position 19: invalid continuation byte

Local variables in innermost frame:
input: '83477_201704261953_\xf0\xe5\xe7.csv'
errors: 'strict'
~~~~~~~~

Version-Release number of selected component (if applicable):
Red Hat Gluster Storage 3.1.3

How reproducible:

Steps to Reproduce:
Will be added later.

Actual results:
glusterfind crashes with --no-encode.

Expected results:
glusterfind should not crash.

Additional info:
count: 5
reason: utf_8.py:16:decode:UnicodeDecodeError: 'utf8' codec can't decode byte 0xf0 in position 19: invalid continuation byte
package: glusterfs-server-3.7.9-12.el7rhgs
pkg_vendor: Red Hat, Inc.
cmdline: python /usr/libexec/glusterfs/glusterfind/changelog.py bacula-inc ria22_media /rhgs/ria22_media/brick /usr/var/lib/misc/glusterfsd/glusterfind/bacula-inc/ria22_media/tmp_output_1 1493315988 --output-prefix /mnt/rhs/rhg_ria22_media --no-encode
executable: /usr/libexec/glusterfs/glusterfind/changelog.py
reporter: libreport-2.1.11.1
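
The byte and position named in the error message can be checked against the `input` value shown for the innermost frame. The following is a minimal sketch in Python 3 (the crash itself happened under Python 2.7); the variable names are illustrative and are not taken from glusterfind.

```python
# Byte string copied from "Local variables in innermost frame" in the report.
raw = b"83477_201704261953_\xf0\xe5\xe7.csv"

try:
    raw.decode("utf-8")  # strict error handling is the default
    error = None
except UnicodeDecodeError as exc:
    error = exc

if error is not None:
    # error.start is the offset of the offending byte within raw.
    print(error.start, hex(raw[error.start]), error.reason)
```

The 19 ASCII characters of the prefix "83477_201704261953_" occupy offsets 0 to 18, so the first non-ASCII byte sits at offset 19, which matches the position in the report.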
Upstream patch sent for review https://review.gluster.org/17317
RCA:
All file names are quoted (urllib.quote_plus) by the changelog parser, so the non-ASCII characters in a file name are percent-quoted as well.

Simple reproducer:

    import urllib
    filename = u'c3db87ab-3cd8-4e85-921c-5f6e74b3c36d%2F227799_201705301938_%F0%E5%E7.csv'
    print urllib.unquote_plus(filename.encode("utf-8")).decode("utf-8")

filename is how the name is stored in the parsed changelog (already quoted by libgfchangelog). unquote_plus("%2F") is "/". Note that part of the file basename is also quoted: "201705301938_%F0%E5%E7.csv".

filename.encode("utf-8") has no effect here, because the non-ASCII characters are still percent-quoted and encode only sees ASCII text. unquote_plus then yields raw bytes that were never produced by encode("utf-8"), so the following decode("utf-8") fails.

Solution:
Changing the order of unquote_plus and encode("utf-8") should fix the issue: call unquote_plus first, then encode only if the name has to be quoted again. (Encode is required if quote_plus is to be used later; otherwise it fails with KeyError.)

Note: I have yet to verify the solution.
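
The failure described in the RCA can be sketched in Python 3 as follows. This is an illustration only, not the upstream patch: the reproducer above is Python 2, `unquote_plus_bytes` is a hypothetical helper name, and the sketch assumes the stored name is plain ASCII with percent-quoted bytes, as in the reproducer.

```python
from urllib.parse import quote_plus, unquote_to_bytes

# Name as stored in the parsed changelog (copied from the reproducer).
QUOTED = ("c3db87ab-3cd8-4e85-921c-5f6e74b3c36d"
          "%2F227799_201705301938_%F0%E5%E7.csv")


def unquote_plus_bytes(quoted):
    """Hypothetical helper: undo quote_plus but keep the result as bytes."""
    # unquote_to_bytes does not translate '+' to a space, so do it first.
    return unquote_to_bytes(quoted.replace("+", " "))


raw = unquote_plus_bytes(QUOTED)

# The unquoted bytes are not valid UTF-8, so a strict decode fails.
try:
    raw.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# Kept as bytes, the name can be quoted again without any decode step.
requoted = quote_plus(raw)

print(strict_decode_ok)
print(requoted == QUOTED)
```

Keeping the unquoted name as bytes avoids the assumption that a file name on the brick is valid UTF-8, which is the assumption the strict decode in the traceback relies on.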
Upstream patch: https://review.gluster.org/#/c/17674/
Build number: glusterfs-3.12.2-7.el7rhgs.x86_64

Created files and directories with Russian filenames, spaces and newlines in the file/dir names. Glusterfind pre works as expected with and without the --no-encode option. No traceback seen. File/dir names are displayed in the outfile appropriately with and without the --no-encode option.

[root@dhcp43-18 ~]# glusterfind pre alpha-sess-1 alpha --no-encode /tmp/out.txt
Generated output file /tmp/out.txt
[root@dhcp43-18 ~]# glusterfind pre alpha-sess-1 alpha --regenerate-outfile /tmp/out1.txt
Generated output file /tmp/out.txt

Hence, moving bug to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607