Description of problem: A program generated from the following code snippet: -------------------------------------------------- #include <stdio.h> int main() { FILE *fp; char broken_path[16]; char broken_char = 4294967252; sprintf(broken_path, "/home/user/%c", broken_char); fp = fopen(broken_path, "w"); fclose(fp); } ------------------------------------------------- will write a file on the filesystem with a corrupted path. The FS yet allows that. When inspecting a disk containing such corrupted path using Python bindings, APIs listing the file name will produce a Segmentation Fault. Affected API tested so far: * find * ls * readdir * filesystem_walk After further investigation, it seems the affected function is PyUnicode_FromString(). PyString_FromString() seems not to be affected. Tested with: * libguestfs 1.34 * Python 3.5 * Debian Stretch
PyUnicode_FromFormat() does not segfault but has the side effect or mangling the string thus loosing the actual file name. PyBytes_FromString() does not segfault but it might break backward compatibility with previous Python 3 API. Yet this allows to get the correct (corrupted) file name.
Reproducer: $ guestfish -N fs -m /dev/sda1 touch "/$(echo -ne '\xd4')" $ guestfish --ro -a test1.img -m /dev/sda1 ls / lost+found � Now try to examine the disk from Python 3: $ python3 -c 'import guestfs; g = guestfs.GuestFS (); g.set_trace (1); g.add_drive_opts ("test1.img", format="raw", readonly=1); g.launch (); g.mount_ro ("/dev/sda1", "/"); files = g.ls ("/"); print (", ".join (files));' ... libguestfs: trace: ls "/" libguestfs: trace: ls0 "/" "/tmp/libguestfsspW42E/ls3" libguestfs: trace: ls0 = 0 libguestfs: trace: ls = ["lost+found", "\xd4"] Segmentation fault
The solution might be in the following bug report: http://bugs.python.org/issue29039 Did a quick search with GitHub in the code and didn't find any trace of Py_Initialize() being called.
As I understand it, extensions shouldn't call Py_Initialize. Only programs embedding the Python interpreter would call it, and libguestfs is not such a program In this case the function should be called by the python3 binary itself. BTW the test case in that bug also declares the string incorrectly.
You're right, the initialization is not necessary. The incorrect string declaration is a wrong copy-paste. My initial assumption was wrong, it's not PyUnicode_FromString() causing the segfault. The PyUnicode_FromString() returns NULL if it cannot create a Unicode object. Therefore the segfault is caused by PyDict_SetItemString() which receives a NULL pointer instead of the expected PyObject.
Not too easy to fix. A simple fix would be: diff --git a/python/handle.c b/python/handle.c index f4cc8ca..3c5c628 100644 --- a/python/handle.c +++ b/python/handle.c @@ -336,7 +336,7 @@ guestfs_int_py_put_string_list (char * const * const argv) #ifdef HAVE_PYSTRING_ASSTRING PyList_SetItem (list, i, PyString_FromString (argv[i])); #else - PyList_SetItem (list, i, PyUnicode_FromString (argv[i])); + PyList_SetItem (list, i, PyBytes_FromString (argv[i])); #endif } ie. returning bytes instead of unicode strings. Filenames are not necessarily unicode strings, so this is more correct. (We'd have to make similar changes throughout the python extension) However this greatly changes the API from the caller's point of view. For example you could no longer write (as in the reproducer): files = g.ls ("/") print (", ".join (files)) as that fails with: TypeError: sequence item 0: expected str instance, bytes found
That was the point I was trying to make few comments above. IMHO using bytes might be the correct approach but it would break Python 3 based code. Python 2 should not be affected. Yet looking at the problem from the Python side: $ touch "/$(echo -ne '/tmp/\xd4')" $ python3 Python 3.4.2 (default, Oct 8 2014, 10:45:20) [GCC 4.9.1] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import os >>> os.listdir('/tmp') [ ... '\udcd4', ... ] It seems that filenames (paths) are threated as unicode strings anyway. Therefore libguestfs would not behave as a User would expect.
Interesting. Apparently U+DCD4 is in the "low surrogates" block (https://en.wikipedia.org/wiki/Universal_Character_Set_characters#Surrogates) As far as I can tell this is basically illegal in Unicode, but the Wikipedia page also notes that "In the Python programming language, individual surrogate codes are used to embed undecodable bytes in Unicode strings" and provides a link to PEP383 (https://www.python.org/dev/peps/pep-0383/) which also notes the use of the isolated (and illegal) low surrogate for this purpose. Anyway, if the Python core 'os' library is doing this, it's good enough for me too. It looks like we should be using one of the functions: https://docs.python.org/3/c-api/unicode.html#file-system-encoding Unfortunately that means we need to know if a string represents a filesystem path or some other string, and the current generator doesn't support that information for returned strings.
The following patch fixes the issue. diff --git a/python/handle.c b/python/handle.c index f4cc8ca0d..3633c9651 100644 --- a/python/handle.c +++ b/python/handle.c @@ -336,7 +336,7 @@ guestfs_int_py_put_string_list (char * const * const argv) #ifdef HAVE_PYSTRING_ASSTRING PyList_SetItem (list, i, PyString_FromString (argv[i])); #else - PyList_SetItem (list, i, PyUnicode_FromString (argv[i])); + PyList_SetItem (list, i, PyUnicode_DecodeFSDefault (argv[i])); #endif } $ python3 Python 3.5.3rc1 (default, Jan 3 2017, 04:40:57) [GCC 6.3.0 20161229] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import guestfs; g = guestfs.GuestFS (); g.set_trace (1); g.add_drive_opts ("test1.img", format="raw", readonly=1); g.launch (); g.mount_ro ("/dev/sda1", "/"); files = g.ls ("/"); libguestfs: trace: add_drive "test1.img" "readonly:true" "format:raw" libguestfs: trace: get_tmpdir libguestfs: trace: get_tmpdir = "/home/noxdafox/development/libguestfs/tmp" libguestfs: trace: disk_create "/home/noxdafox/development/libguestfs/tmp/libguestfsXDUL4H/overlay1" "qcow2" -1 "backingfile:/home/noxdafox/development/libguestfs/test1.img" "backingformat:raw" libguestfs: trace: disk_create = 0 libguestfs: trace: add_drive = 0 libguestfs: trace: launch libguestfs: trace: get_backend_setting "force_tcg" libguestfs: trace: get_backend_setting = NULL (error) libguestfs: trace: get_cachedir libguestfs: trace: get_cachedir = "/home/noxdafox/development/libguestfs/tmp" libguestfs: trace: get_cachedir libguestfs: trace: get_cachedir = "/home/noxdafox/development/libguestfs/tmp" libguestfs: trace: get_sockdir libguestfs: trace: get_sockdir = "/run/user/1000" libguestfs: trace: get_backend_setting "gdb" libguestfs: trace: get_backend_setting = NULL (error) libguestfs: trace: launch = 0 libguestfs: trace: mount_ro "/dev/sda1" "/" libguestfs: trace: mount_ro = 0 libguestfs: trace: ls "/" libguestfs: trace: ls0 "/" "/home/noxdafox/development/libguestfs/tmp/libguestfsXDUL4H/ls2" libguestfs: trace: ls0 = 0 libguestfs: trace: ls = ["lost+found", "\xd4"] >>> files ['lost+found', '\udcd4'] >>> More info here: https://www.python.org/dev/peps/pep-0383/ https://www.python.org/dev/peps/pep-0529/#id8
Sure, I agree that makes the problem go away, but it's not a fix for this issue since: (a) It's not complete -- many other places return strings. (b) There's not enough information in the API to make the decision on whether to do this transformation on the string. Point (b) is the hard one to fix.
I agree. A more generic approach would be using the function PyUnicode_DecodeLocale instead of PyUnicode_FromString. PyUnicode_FromString assumes the string to be already UTF8 which is not true. PyUnicode_DecodeLocale (string, "surrogateescape") works better in our case as it does not start from the assumption the input is UTF8 and reproduce Python 3 behaviour. It is better than PyUnicode_FromFormat as it looses information and it is better than PyBytes_FromString as it breaks backward compatibility. Yet it is a tricky change and requires some thinking.
This patch addresses the segfault issue. https://www.redhat.com/archives/libguestfs/2017-March/msg00216.html For a long term solution I propose the following. The PyUnicode_FromString expects UTF-8 encoded strings as input which is not always the case. The function would be replaced by PyUnicode_DecodeLocale which allows to set error policies. The default policy would remain the same (raise an error) thus preserving the API behaviour. Libguestfs would then expose a method to override the default policy for a given GuestFS handler in case the User would like to still get the output even when containing non UTF-8 encoded characters. The API could be something like: GuestFS.codec_error_handler() <-- returns the current policy GuestFS.codec_error_handler("surrogateescape") <-- sets the new policy Or to be more Pythonic: GuestFS.codec_error_handler <-- policy getter GuestFS.codec_error_handler = "surrogateescape" <-- policy setter
Fixed by Matteo's patch: https://github.com/libguestfs/libguestfs/commit/401c44563623f7badc2cfdd63f9ff6de569f3fcb which is in libguestfs >= 1.37.14.