Description of problem: Segmentation fault when using libguestfs Version-Release number of selected component (if applicable): 1.40.2 - libguestfs-python 1.40.2 - libguestfs How reproducible: Use guestfs_int_py_event_callback_wrapper to log events. Steps to Reproduce: 1. Use guestfs_int_py_event_callback_wrapper Actual results: Segmentation fault Expected results: No Segmentation fault Additional info: Hi, I believe that I am one of the few that are using g.set_event_callback(cb, guestfs.EVENT_ALL) in Python to log libguestfs as a part of my Python solution so this is why this bug was not revealed by now. I have a race condition where I get segmentation faults from time to time (depends on how I change my code) and I believe the problem is the implementation of guestfs_int_py_event_callback_wrapper in handle.c for Python. Here is the stack trace that I got: Program received signal SIGSEGV, Segmentation fault. 0x00000000005bd8fc in ?? () (gdb) bt #0 0x00000000005bd8fc in ?? () #1 0x00000000005b32bb in PyDict_Clear () #2 0x00000000005b33a9 in ?? () #3 0x00000000005265ce in ?? () #4 0x0000000000527b6e in ?? () #5 0x0000000000597b71 in PyTuple_New () #6 0x000000000053429e in ?? () #7 0x0000000000534ddc in ?? () #8 0x0000000000534eee in _Py_BuildValue_SizeT () #9 0x00007ffff6c0c2da in guestfs_int_py_event_callback_wrapper (g=<optimized out>, callback=0x7ffff737c588, event=128, event_handle=0, flags=<optimized out>, buf=0x7ffff6b8776c "get_backend_setting", buf_len=19, array=0x0, array_len=0) at handle.c:136 #10 0x00007ffff6b43875 in ?? () from /usr/lib/x86_64-linux-gnu/libguestfs.so.0 #11 0x00007ffff6ad971c in guestfs_get_backend_setting () from /usr/lib/x86_64-linux-gnu/libguestfs.so.0 #12 0x00007ffff6b49a68 in ?? () from /usr/lib/x86_64-linux-gnu/libguestfs.so.0 #13 0x00007ffff6b4ecdc in ?? () from /usr/lib/x86_64-linux-gnu/libguestfs.so.0 #14 0x00007ffff6b4dd12 in ?? () from /usr/lib/x86_64-linux-gnu/libguestfs.so.0 #15 0x00007ffff6ada0d0 in guestfs_launch () from /usr/lib/x86_64-linux-gnu/libguestfs.so.0 #16 0x00007ffff6bf8627 in guestfs_int_py_launch (self=<optimized out>, args=<optimized out>) at actions-3.c:2315 #17 0x00000000005d7a32 in _PyMethodDef_RawFastCallKeywords () #18 0x000000000054ae30 in ?? () #19 0x0000000000551f0a in _PyEval_EvalFrameDefault () #20 0x00000000005d847c in _PyFunction_FastCallKeywords () #21 0x000000000054e221 in _PyEval_EvalFrameDefault () #22 0x00000000005d847c in _PyFunction_FastCallKeywords () #23 0x000000000054dff0 in _PyEval_EvalFrameDefault () #24 0x00000000005d847c in _PyFunction_FastCallKeywords () #25 0x000000000054dff0 in _PyEval_EvalFrameDefault () #26 0x000000000054b7c2 in _PyEval_EvalCodeWithName () #27 0x000000000054dae3 in PyEval_EvalCode () #28 0x0000000000630af2 in ?? () #29 0x0000000000630ba7 in PyRun_FileExFlags () #30 0x000000000063180f in PyRun_SimpleFileExFlags () #31 0x0000000000653f7e in ?? () #32 0x00000000006542de in _Py_UnixMain () #33 0x00007ffff79df09b in __libc_start_main (main=0x4bc7b0 <main>, argc=4, argv=0x7fffffffe5c8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe5b8) at ../csu/libc-start.c:308 #34 0x00000000005dfe9a in _start () I think that the problem is that in handle.c: guestfs_int_py_event_callback_wrapper we use if (PyEval_ThreadsInitialized ()) py_save = PyGILState_Ensure (); too late. Notice that we create run this part: py_array = PyList_New (array_len); for (i = 0; i < array_len; ++i) { a = PyLong_FromLongLong (array[i]); PyList_SET_ITEM (py_array, i, a); } /* XXX As with Perl we don't pass the guestfs_h handle here. */ args = Py_BuildValue ("(Kis#O)", (unsigned PY_LONG_LONG) event, event_handle, buf, buf_len, py_array); Py_INCREF (args); When the GIL is unlocked so this might corrupt internal python data structures. When I move PyGILState_Ensure to the beginning of the function the problem seems to go away.
Maybe also move: if (PyEval_ThreadsInitialized ()) PyGILState_Release (py_save); To bottom.
Python version is 3.7.3
The explanation could make sense. Do the crashes go away when you actually change the code like that? Do you have a minimal reproducer? (valgrind might help with a reproducer since it should cause the code to crash "reliably" if it's really doing something wrong).
Well I just implemented g.set_event_callback(cb, guestfs.EVENT_ALL) and I write the output to stdout (buf and array). Then I inspect any OS and I get a crash 95% of the time. If I add some logging in the callback itself (i.e. print to stdout) then the crash rate goes down 10% but still happens. In any case you should not be be allowed to call any Py_ functions without holding the GIL (unless specifically allowed) so it makes sense to fix this. I'll submit a fix on the mailing list Sam
I was able to reproduce this using a simple program which sets up a callback and inspects the OS: import guestfs def cb (ev, eh, buf, array): print ("hello cb") g = guestfs.GuestFS (python_return_dict=True) g.set_event_callback(cb, guestfs.EVENT_ALL) g.add_drive ("/var/tmp/fedora-31.img") g.launch () g.inspect_os () ./run python3 test.py core dumped (after many callbacks) with: #0 0x00007f099add1516 in new_threadstate () from /lib64/libpython3.7m.so.1.0 #1 0x00007f099acb1da0 in PyGILState_Ensure.cold () from /lib64/libpython3.7m.so.1.0 #2 0x00007f099a4d58f5 in guestfs_int_py_event_callback_wrapper () from /usr/lib64/python3.7/site-packages/libguestfsmod.cpython-37m-x86_64-linux-gnu.so #3 0x00007f0999d04407 in guestfs_int_call_callbacks_message (g=0x55f1dd8896b0, event=128, buf=0x7f0999d4ce97 "internal_autosync", buf_len=17) at events.c:117 #4 0x00007f0999ce2d05 in guestfs_internal_autosync (g=0x55f1dd8896b0) at actions-6.c:3125 #5 0x00007f0999d09a5c in shutdown_backend (g=g@entry=0x55f1dd8896b0, check_for_errors=check_for_errors@entry=0) at handle.c:467 #6 0x00007f0999d0a14f in guestfs_close (g=0x55f1dd8896b0) at handle.c:357 #7 0x00007f0999d0a20d in close_handles () at handle.c:500 #8 0x00007f099aface87 in __run_exit_handlers () from /lib64/libc.so.6 #9 0x00007f099afad040 in exit () from /lib64/libc.so.6 #10 0x00007f099af951aa in __libc_start_main () from /lib64/libc.so.6 #11 0x000055f1dbc4908e in _start () I tried your patch: https://www.redhat.com/archives/libguestfs/2019-November/msg00089.html It's still not quite right however. With your patch I see valgrind errors like: ==1978475== Invalid read of size 8 ==1978475== at 0x4BF6516: ??? (in /usr/lib64/libpython3.7m.so.1.0) ==1978475== by 0x4AD6D9F: ??? (in /usr/lib64/libpython3.7m.so.1.0) ==1978475== by 0x622FA64: guestfs_int_py_event_callback_wrapper (handle.c:125) ==1978475== by 0x630D406: guestfs_int_call_callbacks_message (events.c:117) ==1978475== by 0x62EBD04: guestfs_internal_autosync (actions-6.c:3125) ==1978475== by 0x6312A5B: shutdown_backend (handle.c:467) ==1978475== by 0x631314E: guestfs_close (handle.c:357) ==1978475== by 0x631320C: close_handles (handle.c:500) ==1978475== by 0x48BDE86: __run_exit_handlers (in /usr/lib64/libc-2.30.so) ==1978475== by 0x48BE03F: exit (in /usr/lib64/libc-2.30.so) ==1978475== by 0x48A61A9: (below main) (in /usr/lib64/libc-2.30.so) ==1978475== Address 0xa00 is not stack'd, malloc'd or (recently) free'd followed by a segfault later. I think your patch is going in the right direction, but there must be something else we haven't fixed yet. It's interesting that both crashes happened during guestfs_close. In OCaml there's a function you can call (from OCaml) to have it fully check the heap. I wonder if there's a similar thing in Python that we could insert into the callback so that we see the error at the earliest possible opportunity.
I think it's a different bug. Try calling g.close() in the end.
Ah yes it's a different bug, I think we must add: if (_Py_IsFinalizing()) return; In the callback. You didn't call g.close() - so when exiting (long after the Python interpreter has died) libguestfs writes a log message - but the callback is still configured. Then this callback calls Python code - but the interpreter no longer exists. _Py_IsFinalizing() exists just for this reason (https://docs.python.org/3/c-api/init.html#c.PyGILState_Ensure) So I think that we found and fixed two bugs :-) Can you try adding the snippet before PyGILState_Ensure. Sam
(In reply to Sam Eiderman from comment #6) > I think it's a different bug. > > Try calling g.close() in the end. Ah indeed it is. I pushed your original patch now. (In reply to Sam Eiderman from comment #7) > Ah yes it's a different bug, I think we must add: > > if (_Py_IsFinalizing()) > return; > > In the callback. Yes, this works thanks. I pushed a follow up to fix this as well.
Nice, thanks! Will it be possible to release a new version of guestfs-python in http://download.libguestfs.org/python/?
Eventually it'll appear in a stable release, but not soon. You can build the Python tarball yourself (make -C python setup-build IIRC) although personally I can never understand why people don't use the distro version.