Bug 1773520 - Segfault in python bindings for guestfs_int_py_event_callback_wrapper
Summary: Segfault in python bindings for guestfs_int_py_event_callback_wrapper
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Virtualization Tools
Classification: Community
Component: libguestfs
Version: unspecified
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Richard W.M. Jones
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-11-18 10:49 UTC by Sam Eiderman
Modified: 2019-11-19 08:15 UTC (History)
2 users (show)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2019-11-18 15:35:37 UTC
Embargoed:


Attachments (Terms of Use)

Description Sam Eiderman 2019-11-18 10:49:14 UTC
Description of problem:

Segmentation fault when using libguestfs

Version-Release number of selected component (if applicable):
1.40.2 - libguestfs-python
1.40.2 - libguestfs

How reproducible:
Use guestfs_int_py_event_callback_wrapper to log events.


Steps to Reproduce:
1. Use guestfs_int_py_event_callback_wrapper

Actual results:
Segmentation fault


Expected results:
No Segmentation fault


Additional info:

Hi,

I believe that I am one of the few that are using g.set_event_callback(cb, guestfs.EVENT_ALL) in Python to log libguestfs as a part of my Python solution so this is why this bug was not revealed by now.

I have a race condition where I get segmentation faults from time to time (depends on how I change my code) and I believe the problem is the implementation of guestfs_int_py_event_callback_wrapper in handle.c for Python.

Here is the stack trace that I got:

Program received signal SIGSEGV, Segmentation fault.
0x00000000005bd8fc in ?? ()
(gdb) bt
#0  0x00000000005bd8fc in ?? ()
#1  0x00000000005b32bb in PyDict_Clear ()
#2  0x00000000005b33a9 in ?? ()
#3  0x00000000005265ce in ?? ()
#4  0x0000000000527b6e in ?? ()
#5  0x0000000000597b71 in PyTuple_New ()
#6  0x000000000053429e in ?? ()
#7  0x0000000000534ddc in ?? ()
#8  0x0000000000534eee in _Py_BuildValue_SizeT ()
#9  0x00007ffff6c0c2da in guestfs_int_py_event_callback_wrapper (g=<optimized out>, callback=0x7ffff737c588, event=128, event_handle=0, flags=<optimized out>, 
    buf=0x7ffff6b8776c "get_backend_setting", buf_len=19, array=0x0, array_len=0) at handle.c:136
#10 0x00007ffff6b43875 in ?? () from /usr/lib/x86_64-linux-gnu/libguestfs.so.0
#11 0x00007ffff6ad971c in guestfs_get_backend_setting () from /usr/lib/x86_64-linux-gnu/libguestfs.so.0
#12 0x00007ffff6b49a68 in ?? () from /usr/lib/x86_64-linux-gnu/libguestfs.so.0
#13 0x00007ffff6b4ecdc in ?? () from /usr/lib/x86_64-linux-gnu/libguestfs.so.0
#14 0x00007ffff6b4dd12 in ?? () from /usr/lib/x86_64-linux-gnu/libguestfs.so.0
#15 0x00007ffff6ada0d0 in guestfs_launch () from /usr/lib/x86_64-linux-gnu/libguestfs.so.0
#16 0x00007ffff6bf8627 in guestfs_int_py_launch (self=<optimized out>, args=<optimized out>) at actions-3.c:2315
#17 0x00000000005d7a32 in _PyMethodDef_RawFastCallKeywords ()
#18 0x000000000054ae30 in ?? ()
#19 0x0000000000551f0a in _PyEval_EvalFrameDefault ()
#20 0x00000000005d847c in _PyFunction_FastCallKeywords ()
#21 0x000000000054e221 in _PyEval_EvalFrameDefault ()
#22 0x00000000005d847c in _PyFunction_FastCallKeywords ()
#23 0x000000000054dff0 in _PyEval_EvalFrameDefault ()
#24 0x00000000005d847c in _PyFunction_FastCallKeywords ()
#25 0x000000000054dff0 in _PyEval_EvalFrameDefault ()
#26 0x000000000054b7c2 in _PyEval_EvalCodeWithName ()
#27 0x000000000054dae3 in PyEval_EvalCode ()
#28 0x0000000000630af2 in ?? ()
#29 0x0000000000630ba7 in PyRun_FileExFlags ()
#30 0x000000000063180f in PyRun_SimpleFileExFlags ()
#31 0x0000000000653f7e in ?? ()
#32 0x00000000006542de in _Py_UnixMain ()
#33 0x00007ffff79df09b in __libc_start_main (main=0x4bc7b0 <main>, argc=4, argv=0x7fffffffe5c8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe5b8)
    at ../csu/libc-start.c:308
#34 0x00000000005dfe9a in _start ()

I think that the problem is that in handle.c: guestfs_int_py_event_callback_wrapper we use

  if (PyEval_ThreadsInitialized ())
    py_save = PyGILState_Ensure ();

too late.

Notice that we create run this part:

  py_array = PyList_New (array_len);
  for (i = 0; i < array_len; ++i) {
    a = PyLong_FromLongLong (array[i]);
    PyList_SET_ITEM (py_array, i, a);
  }

  /* XXX As with Perl we don't pass the guestfs_h handle here. */
  args = Py_BuildValue ("(Kis#O)",
                        (unsigned PY_LONG_LONG) event, event_handle,
                        buf, buf_len, py_array);
  Py_INCREF (args);

When the GIL is unlocked so this might corrupt internal python data structures.

When I move PyGILState_Ensure to the beginning of the function the problem seems to go away.

Comment 1 Sam Eiderman 2019-11-18 11:00:57 UTC
Maybe also move:

  if (PyEval_ThreadsInitialized ())
    PyGILState_Release (py_save);

To bottom.

Comment 2 Sam Eiderman 2019-11-18 11:05:50 UTC
Python version is 3.7.3

Comment 3 Richard W.M. Jones 2019-11-18 11:56:12 UTC
The explanation could make sense.  Do the crashes go away when you actually
change the code like that?  Do you have a minimal reproducer?  (valgrind might
help with a reproducer since it should cause the code to crash "reliably" if
it's really doing something wrong).

Comment 4 Sam Eiderman 2019-11-18 12:28:45 UTC
Well I just implemented g.set_event_callback(cb, guestfs.EVENT_ALL) and I write the output to stdout (buf and array).
Then I inspect any OS and I get a crash 95% of the time.
If I add some logging in the callback itself (i.e. print to stdout) then the crash rate goes down 10% but still happens.

In any case you should not be be allowed to call any Py_ functions without holding the GIL (unless specifically allowed) so it makes sense to fix this.

I'll submit a fix on the mailing list

Sam

Comment 5 Richard W.M. Jones 2019-11-18 13:23:46 UTC
I was able to reproduce this using a simple program which sets up a
callback and inspects the OS:

import guestfs

def cb (ev, eh, buf, array):
    print ("hello cb")

g = guestfs.GuestFS (python_return_dict=True)
g.set_event_callback(cb, guestfs.EVENT_ALL)
g.add_drive ("/var/tmp/fedora-31.img")
g.launch ()
g.inspect_os ()

./run python3 test.py

core dumped (after many callbacks) with:

#0  0x00007f099add1516 in new_threadstate () from /lib64/libpython3.7m.so.1.0
#1  0x00007f099acb1da0 in PyGILState_Ensure.cold () from /lib64/libpython3.7m.so.1.0
#2  0x00007f099a4d58f5 in guestfs_int_py_event_callback_wrapper () from /usr/lib64/python3.7/site-packages/libguestfsmod.cpython-37m-x86_64-linux-gnu.so
#3  0x00007f0999d04407 in guestfs_int_call_callbacks_message (g=0x55f1dd8896b0, event=128, buf=0x7f0999d4ce97 "internal_autosync", buf_len=17) at events.c:117
#4  0x00007f0999ce2d05 in guestfs_internal_autosync (g=0x55f1dd8896b0) at actions-6.c:3125
#5  0x00007f0999d09a5c in shutdown_backend (g=g@entry=0x55f1dd8896b0, check_for_errors=check_for_errors@entry=0) at handle.c:467
#6  0x00007f0999d0a14f in guestfs_close (g=0x55f1dd8896b0) at handle.c:357
#7  0x00007f0999d0a20d in close_handles () at handle.c:500
#8  0x00007f099aface87 in __run_exit_handlers () from /lib64/libc.so.6
#9  0x00007f099afad040 in exit () from /lib64/libc.so.6
#10 0x00007f099af951aa in __libc_start_main () from /lib64/libc.so.6
#11 0x000055f1dbc4908e in _start ()

I tried your patch:
https://www.redhat.com/archives/libguestfs/2019-November/msg00089.html

It's still not quite right however.  With your patch I see valgrind errors like:

==1978475== Invalid read of size 8
==1978475==    at 0x4BF6516: ??? (in /usr/lib64/libpython3.7m.so.1.0)
==1978475==    by 0x4AD6D9F: ??? (in /usr/lib64/libpython3.7m.so.1.0)
==1978475==    by 0x622FA64: guestfs_int_py_event_callback_wrapper (handle.c:125)
==1978475==    by 0x630D406: guestfs_int_call_callbacks_message (events.c:117)
==1978475==    by 0x62EBD04: guestfs_internal_autosync (actions-6.c:3125)
==1978475==    by 0x6312A5B: shutdown_backend (handle.c:467)
==1978475==    by 0x631314E: guestfs_close (handle.c:357)
==1978475==    by 0x631320C: close_handles (handle.c:500)
==1978475==    by 0x48BDE86: __run_exit_handlers (in /usr/lib64/libc-2.30.so)
==1978475==    by 0x48BE03F: exit (in /usr/lib64/libc-2.30.so)
==1978475==    by 0x48A61A9: (below main) (in /usr/lib64/libc-2.30.so)
==1978475==  Address 0xa00 is not stack'd, malloc'd or (recently) free'd

followed by a segfault later.

I think your patch is going in the right direction, but there must be something
else we haven't fixed yet.

It's interesting that both crashes happened during guestfs_close.

In OCaml there's a function you can call (from OCaml) to have it fully check
the heap.  I wonder if there's a similar thing in Python that we could insert
into the callback so that we see the error at the earliest possible opportunity.

Comment 6 Sam Eiderman 2019-11-18 15:08:13 UTC
I think it's a different bug.

Try calling g.close() in the end.

Comment 7 Sam Eiderman 2019-11-18 15:15:52 UTC
Ah yes it's a different bug, I think we must add:

    if (_Py_IsFinalizing())
        return;

In the callback.

You didn't call g.close() - so when exiting (long after the Python interpreter has died) libguestfs writes a log message - but the callback is still configured.
Then this callback calls Python code - but the interpreter no longer exists.

_Py_IsFinalizing() exists just for this reason (https://docs.python.org/3/c-api/init.html#c.PyGILState_Ensure)

So I think that we found and fixed two bugs :-)

Can you try adding the snippet before PyGILState_Ensure.

Sam

Comment 8 Richard W.M. Jones 2019-11-18 15:35:37 UTC
(In reply to Sam Eiderman from comment #6)
> I think it's a different bug.
> 
> Try calling g.close() in the end.

Ah indeed it is.

I pushed your original patch now.

(In reply to Sam Eiderman from comment #7)
> Ah yes it's a different bug, I think we must add:
> 
>     if (_Py_IsFinalizing())
>         return;
> 
> In the callback.

Yes, this works thanks.  I pushed a follow up to fix this as well.

Comment 9 Sam Eiderman 2019-11-18 15:37:58 UTC
Nice, thanks!

Will it be possible to release a new version of guestfs-python in http://download.libguestfs.org/python/?

Comment 10 Richard W.M. Jones 2019-11-19 08:15:00 UTC
Eventually it'll appear in a stable release, but not soon.  You can build the Python
tarball yourself (make -C python setup-build IIRC) although personally I can never understand why
people don't use the distro version.


Note You need to log in before you can comment on or make changes to this bug.