Bug 843188 - Ruby bindings segfault in event callbacks when the Ruby GC is invoked.
Ruby bindings segfault in event callbacks when the Ruby GC is invoked.
Status: CLOSED CANTFIX
Product: Virtualization Tools
Classification: Community
Component: libguestfs (Show other bugs)
unspecified
Unspecified Unspecified
unspecified Severity unspecified
: ---
: ---
Assigned To: Richard W.M. Jones
:
Depends On:
Blocks: 843199
  Show dependency treegraph
 
Reported: 2012-07-25 16:12 EDT by Richard W.M. Jones
Modified: 2014-05-13 09:06 EDT (History)
6 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 843199 (view as bug list)
Environment:
Last Closed: 2012-11-30 09:49:35 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments (Terms of Use)
bz843188.rb (537 bytes, text/plain)
2012-07-25 17:34 EDT, Richard W.M. Jones
no flags Details
bz843188.rb (227 bytes, text/plain)
2012-07-25 18:19 EDT, Richard W.M. Jones
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
JBoss Issue Tracker BGBUILD-371 Major Open segfault in boxgrinder-build 0.10.2 after update to ruby-libguestfs.i686 1:1.16.28-1.fc16 2015-09-10 02:16:58 EDT

  None (edit)
Description Richard W.M. Jones 2012-07-25 16:12:22 EDT
Description of problem:

Initial report is here, with a partial stack trace:
https://issues.jboss.org/browse/BGBUILD-371

Version-Release number of selected component (if applicable):

libguestfs 1.16.28

How reproducible:

100%

Steps to Reproduce:
1. unknown at present
Comment 1 Richard W.M. Jones 2012-07-25 17:34:14 EDT
Created attachment 600404 [details]
bz843188.rb

This small script reproduces the bug.

Save it to a file, chmod +x the script and run it.  It will
cause a segfault.

If you comment out the line that calls g.set_event_callback
then it won't segfault, proving that the problem is with
events.
Comment 2 Richard W.M. Jones 2012-07-25 18:19:24 EDT
Created attachment 600405 [details]
bz843188.rb

Smaller and faster reproducer.

The crash happens during final cleanup.  Adding g.close()
at the end prevents the crash.

I'm not sure this is the same as the crash seen in the
downstream bug report, but this one certainly needs to be fixed ...
Comment 3 Richard W.M. Jones 2012-07-25 18:40:25 EDT
The following patch fixes this bug, although I don't
precisely understand why.

diff --git a/generator/generator_ruby.ml b/generator/generator_ruby.ml
index 7ab2177..36af4f9 100644
--- a/generator/generator_ruby.ml
+++ b/generator/generator_ruby.ml
@@ -276,9 +276,10 @@ ruby_event_callback_wrapper_wrapper (VALUE argvv)
   /* Check the Ruby callback still exists.  For reasons which are not
    * fully understood, even though we registered this as a global root,
    * it is still possible for the callback to go away (fn value remains
-   * but its type changes from T_DATA to T_NONE).  (RHBZ#733297)
+   * but its type changes from T_DATA to T_NONE or T_ZOMBIE).
+   * (RHBZ#733297, RHBZ#843188)
    */
-  if (rb_type (fn) != T_NONE) {
+  if (rb_type (fn) != T_NONE && rb_type (fn) != T_ZOMBIE) {
     eventv = argv[1];
     event_handlev = argv[2];
     bufv = argv[3];
Comment 4 Richard W.M. Jones 2012-07-27 03:28:11 EDT
Although the fix in comment 3 certainly fixes a bug, it cannot
possibly be the fix for this bug since T_ZOMBIE does not exist
in the older version of Ruby used in Fedora 16.  Therefore I
believe there are at least two bugs involved here.

I am not able to reproduce the original bug, by running
Boxgrinder directly, nor using test scripts that emulate what
BG appears to do.

Therefore I'm going back to the core dump and stack trace since
that is the only reliable data available.

Unfortunately the debuginfo package wasn't available when the core
dump was taken.  So I installed a F16 32 bit VM and retrieved
a complete stack trace:

#0  0xb7709424 in __kernel_vsyscall ()
(gdb) bt
#0  0xb7709424 in __kernel_vsyscall ()
#1  0x4987998f in __GI_raise (sig=6)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#2  0x4987b2d5 in __GI_abort () at abort.c:91
#3  0x49ad6dda in rb_bug (fmt=0x49b79cf6 "Segmentation fault") at error.c:213
#4  0x49b47c1b in sigsegv (sig=<optimized out>) at signal.c:634
#5  sigsegv (sig=11) at signal.c:622
#6  <signal handler called>
#7  st_foreach (table=0xffffffff, func=0x49afc6a0 <mark_entry>, arg=0)
    at st.c:485
#8  0x49afbc7a in mark_tbl (tbl=<optimized out>, lev=<optimized out>)
    at gc.c:716
#9  0x49b0de33 in mark_load_arg (ptr=0xbfa71ccc) at marshal.c:841
#10 0x49afcbbb in gc_mark_children (ptr=3067931800, lev=1) at gc.c:1025
#11 0x49afc761 in mark_locations_array (x=0xbfa71ce4, n=<optimized out>)
    at gc.c:684
#12 0x49afd33f in garbage_collect () at gc.c:1476
#13 0x49afdd4d in ruby_xmalloc (size=<optimized out>) at gc.c:147
#14 0x49b4c8f9 in rb_str_buf_new (capa=128) at string.c:234
#15 0x49b489ac in rb_str_format (argc=7, argv=0xb44d1d4, fmt=3075887580)
    at sprintf.c:298
#16 0x49b4b6e2 in rb_str_format_m (str=3075836620, arg=3067886260)
    at string.c:469
#17 0x49ae4f4d in call_cfunc (argv=0xbfa6bea0, argc=1, len=1, recv=3075836620, 
    func=0x49b4b690 <rb_str_format_m>) at eval.c:5784
#18 rb_call0 (klass=3077477380, recv=3075836620, id=37, oid=37, argc=1, 
    argv=0xbfa6bea0, body=0xb76e9224, flags=0) at eval.c:5928
#19 0x49ae51e2 in rb_call (klass=3077477380, recv=3075836620, mid=37, argc=1, 
    argv=0xbfa6bea0, scope=0, self=<optimized out>) at eval.c:6176
#20 0x49aea9c7 in rb_eval (self=3069007920, n=0xb7564d90) at eval.c:3506
#21 0x49ae4511 in rb_call0 (klass=3075836720, recv=3069007920, id=5521, 
    oid=5521, argc=<optimized out>, argv=<optimized out>, body=0xb7564f5c, 
    flags=0) at eval.c:6079
#22 0x49ae51e2 in rb_call (klass=3075836720, recv=3069007920, mid=5521, 
    argc=4, argv=0xbfa6c620, scope=0, self=<optimized out>) at eval.c:6176
#23 0x49aea9c7 in rb_eval (self=3069007880, n=0xb7565330) at eval.c:3506
#24 0x49ae4511 in rb_call0 (klass=3075862520, recv=3069007880, id=22529, 
    oid=22529, argc=<optimized out>, argv=<optimized out>, body=0xb75652e0, 
    flags=2) at eval.c:6079
#25 0x49ae51e2 in rb_call (klass=3075862520, recv=3069007880, mid=22529, 
    argc=4, argv=0xbfa6cda0, scope=1, self=<optimized out>) at eval.c:6176
#26 0x49aeab24 in rb_eval (self=3069007880, n=0xb7566dfc) at eval.c:3521
#27 0x49aea911 in rb_eval (self=3069007880, n=0xb7566dd4) at eval.c:3501
#28 0x49ae4511 in rb_call0 (klass=3075862520, recv=3069007880, id=5561, 
    oid=5561, argc=<optimized out>, argv=<optimized out>, body=0xb75673b0, 
    flags=0) at eval.c:6079
#29 0x49ae51e2 in rb_call (klass=3075862520, recv=3069007880, mid=5561, 
    argc=3, argv=0xbfa6d990, scope=1, self=<optimized out>) at eval.c:6176
#30 0x49aeab24 in rb_eval (self=3069007880, n=0xb756beb0) at eval.c:3521
#31 0x49aeedfd in block_pass (self=3069007880, node=0xb756bec4) at eval.c:9193
#32 0x49aeb2b7 in rb_eval (self=3069007880, n=0xb756bec4) at eval.c:3222
#33 0x49ae4511 in rb_call0 (klass=3075862520, recv=3069007880, id=16065, 
    oid=16065, argc=<optimized out>, argv=<optimized out>, body=0xb756bf78, 
    flags=0) at eval.c:6079
#34 0x49ae51e2 in rb_call (klass=3075862520, recv=3069007880, mid=16065, 
    argc=1, argv=0xbfa6eaf4, scope=1, self=<optimized out>) at eval.c:6176
#35 0x49ae53ec in rb_f_send (argc=1, argv=0xbfa6eaf4, recv=3069007880)
    at eval.c:6224
#36 0x49ae4f7e in call_cfunc (argv=0xbfa6eaf0, argc=2, len=-1, 
    recv=3069007880, func=0x49ae5320 <rb_f_send>) at eval.c:5778
#37 rb_call0 (klass=3077487780, recv=3069007880, id=4225, oid=4225, argc=2, 
    argv=0xbfa6eaf0, body=0xb76e99e0, flags=0) at eval.c:5928
#38 0x49ae51e2 in rb_call (klass=3077487780, recv=3069007880, mid=4225, 
    argc=2, argv=0xbfa6eaf0, scope=0, self=<optimized out>) at eval.c:6176
#39 0x49aea9c7 in rb_eval (self=3069008180, n=0xb756a7b8) at eval.c:3506
#40 0x49ae4511 in rb_call0 (klass=3075611220, recv=3069008180, id=4057, 
    oid=4057, argc=<optimized out>, argv=<optimized out>, body=0xb756a754, 
    flags=0) at eval.c:6079
#41 0x49ae51e2 in rb_call (klass=3075611220, recv=3069008180, mid=4057, 
    argc=2, argv=0xbfa6f2b0, scope=1, self=<optimized out>) at eval.c:6176
#42 0x49ae559c in rb_funcall2 (recv=3069008180, mid=4057, argc=2, 
    argv=0xbfa6f2b0) at eval.c:6312
#43 0x49ae865b in method_missing (obj=3069008180, id=<optimized out>, 
    argc=<optimized out>, argv=0xbfa6f3c0, call_status=0) at eval.c:5758
#44 0x49ae5248 in rb_call (klass=3075611220, recv=3069008180, mid=16065, 
    argc=1, argv=0xbfa6f3c0, scope=0, self=<optimized out>) at eval.c:6155
#45 0x49aea9c7 in rb_eval (self=3068469600, n=0xb73e2814) at eval.c:3506
#46 0x49ae2fcc in rb_yield_0 (val=3067887080, self=3068469600, klass=0, 
    flags=2, avalue=2) at eval.c:5095
#47 0x49ae353b in proc_invoke (proc=3068466840, args=3067887080, self=6, 
    klass=0) at eval.c:8895
#48 0x49ae4fa7 in call_cfunc (argv=0xbfa6ff90, argc=4, len=-2, 
    recv=3068466840, func=0x49af15a0 <rb_proc_call>) at eval.c:5775
#49 rb_call0 (klass=3077420980, recv=3068466840, id=5521, oid=5521, argc=4, 
    argv=0xbfa6ff90, body=0xb76db700, flags=0) at eval.c:5928
#50 0x49ae51e2 in rb_call (klass=3077420980, recv=3068466840, mid=5521, 
    argc=4, argv=0xbfa6ff90, scope=1, self=<optimized out>) at eval.c:6176
#51 0x49ae598c in vafuncall (recv=3068466840, mid=5521, n=4, ar=0xbfa7000c)
    at eval.c:6253
#52 0x49ae5b48 in rb_funcall (recv=3068466840, mid=5521, n=4) at eval.c:6270
#53 0xb720167b in ruby_event_callback_wrapper_wrapper (argvv=3215393420)
    at _guestfs.c:272
#54 0x49adf288 in rb_rescue2 (
    b_proc=0xb7201600 <ruby_event_callback_wrapper_wrapper>, data1=3215393420, 
    r_proc=0xb72015b0 <ruby_event_callback_handle_exception>, data2=4)
    at eval.c:5491
#55 0x49adf488 in rb_rescue (
    b_proc=0xb7201600 <ruby_event_callback_wrapper_wrapper>, data1=3215393420, 
    r_proc=0xb72015b0 <ruby_event_callback_handle_exception>, data2=4)
    at eval.c:5527
#56 0xb72015a1 in ruby_event_callback_wrapper (g=0xa52fc68, data=0xa531198, 
    event=16, event_handle=0, flags=0, 
    buf=0xbfa70350 "guestfsd: receive_file: reading length word\r\nguestfsd: receive_file: got chunk: cancel = 0x0, len = 8192, buf = 0xdf979e8\r\nguestfsd: receive_file: reading length word\r\nx0, len = 8192, buf = 0xdf979e8\r"..., 
    buf_len=168, array=0x0, array_len=0) at _guestfs.c:249
#57 0xb7190afe in guestfs___call_callbacks_message (g=0xa52fc68, event=16, 
    buf=0xbfa70350 "guestfsd: receive_file: reading length word\r\nguestfsd: receive_file: got chunk: cancel = 0x0, len = 8192, buf = 0xdf979e8\r\nguestfsd: receive_file: reading length word\r\nx0, len = 8192, buf = 0xdf979e8\r"..., 
    buf_len=168) at events.c:109
#58 0xb719c7e0 in read_log_message_or_eof (g=0xa52fc68, fd=<optimized out>, 
    error_if_eof=0) at proto.c:286
#59 0xb719d025 in guestfs___send_to_daemon (g=0xa52fc68, v_buf=0xb45d820, 
    n=8204) at proto.c:474
#60 0xb719d361 in send_file_chunk (g=0xa52fc68, cancel=<optimized out>, 
    buf=0xbfa7285f "ffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffff"..., buflen=8192)
    at proto.c:1008
#61 0xb719e315 in send_file_data (len=8192, 
    buf=0xbfa7285f "ffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffff"..., g=0xa52fc68)
    at proto.c:947
#62 guestfs___send_file (g=0xa52fc68, 
    filename=0xb3a0118 "build/appliances/i686/centos/6/quarri-poq/1.0/centos-plugin/tmp/Packages.dump") at proto.c:907
#63 0xb714bdcd in guestfs_upload (g=0xa52fc68, 
    filename=0xb3a0118 "build/appliances/i686/centos/6/quarri-poq/1.0/centos-plugin/tmp/Packages.dump", remotefilename=0xa2a9340 "/tmp/Packages.dump")
    at actions.c:10879
#64 0xb72092d0 in ruby_guestfs_upload (gv=3068468020, filenamev=3067931000, 
    remotefilenamev=3067930780) at _guestfs.c:7118
#65 0x49ae4f31 in call_cfunc (argv=0xbfa74ca0, argc=2, len=2, recv=3068468020, 
    func=0xb7209270 <ruby_guestfs_upload>) at eval.c:5787
#66 rb_call0 (klass=3074266180, recv=3068468020, id=45073, oid=45073, argc=2, 
    argv=0xbfa74ca0, body=0xb73d793c, flags=0) at eval.c:5928

[more Ruby call frames elided]
Comment 5 Richard W.M. Jones 2012-08-03 08:34:35 EDT
The last thing that happens before it crashes is
it constructs a formatted string:

#13 0x49afdd4d in ruby_xmalloc (size=<optimized out>) at gc.c:147
#14 0x49b4c8f9 in rb_str_buf_new (capa=128) at string.c:234
#15 0x49b489ac in rb_str_format (argc=7, argv=0xb44d1d4, fmt=3075887580)
    at sprintf.c:298
#16 0x49b4b6e2 in rb_str_format_m (str=3075836620, arg=3067886260)
    at string.c:469

At frame #16, str is:

(gdb) print *(struct RString *)str
$4 = {basic = {flags = 8263, klass = 3077477380}, len = 26, 
  ptr = 0x9cad898 "%s, [%s#%d] %5s -- %s: %s\n", aux = {capa = -1219079716, 
    shared = 3075887580}}

This format string occurs in the ruby core Logger class
in a subclass called Formatter which constructs the string
to be logged.  Since LogHelper in Boxgrinder calls Logger,
I conclude that it's executing one of the log.* methods in
the event callback (default_callback).
Comment 6 Richard W.M. Jones 2012-08-03 09:30:29 EDT
I would say this is a fault in the Ruby garbage collector.

Some observations:

(1) Ruby scans the stack using an imprecise pointer detector
(is_pointer_to_heap in gc.c)

(2) At the point where the crash occurs, it is scanning an
8192 byte buffer that happens to be on the stack somewhere
in the middle of libguestfs, and happens to contains random
data related to the upload operation.  Ruby really should not
be scanning this buffer.

(3) The actual part of this buffer that is being scanned
(0xbfa71ce4) looks like a valid stack pointer.  It isn't -- it's
just some random data that happens to look like that.

(4) If we interpret this random data as a Ruby VALUE, it becomes
clearer what's going on:

(gdb) print/x *0xbfa71ce4
$15 = 0xb6dcec98
(gdb) print *(RVALUE *)0xb6dcec98
$17 = {
  as = {
    free = {
      flags = 98,   # to Ruby this looks like T_DATA
[...]
    data = {
      basic = {
        flags = 98, 
        klass = 3077482900
      }, 
      dmark = 0x49b0de10 <mark_load_arg>, 
      dfree = 0, 
      data = 0xbfa71ccc
    }, 
[...]

(5) As you can see from the stack trace, Ruby follows the
completely bogus "data" pointer 0xbfa71ccc:

#9  0x49b0de33 in mark_load_arg (ptr=0xbfa71ccc) at marshal.c:841

and eventually this causes a crash.

(6) WTF?
Comment 7 Richard W.M. Jones 2012-08-03 12:47:04 EDT
I get it now.  This issue is the same as
http://bugs.ruby-lang.org/issues/4339
and should be fixed by the same patch.
Comment 8 Richard W.M. Jones 2012-11-20 09:59:52 EST
(In reply to comment #7)
> I get it now.  This issue is the same as
> http://bugs.ruby-lang.org/issues/4339
> and should be fixed by the same patch.

Ignore comment 7.  This patch does not fix the problem.

The solution is either:

(a) Update to Ruby 1.9.  It has a different GC impl.

(b) Don't keep buffers of data on the stack and cross
your fingers that nothing else on the stack looks like
a root.
Comment 9 Richard W.M. Jones 2012-11-30 09:49:35 EST
This is a fundamental bug in Ruby 1.8.  On this basis, I'm
closing as CANTFIX.
Comment 10 Lukas Zapletal 2014-04-29 05:12:34 EDT
Hey Richard,

when I was trying to reproduce this one in RHEL 6.5 the reproducer (the latter script) sometimes runs nice, but sometimes it exits with weird error:

-bash-4.1# ruby test.rb
-bash-4.1# ruby test.rb
-bash-4.1# ruby test.rb
test.rb:4:in `new': expecting 0 or 1 arguments (ArgumentError)
        from test.rb:4

Is this behavior expected today?
Comment 11 Richard W.M. Jones 2014-04-29 05:29:16 EDT
(In reply to Lukas Zapletal from comment #10)
> Hey Richard,
> 
> when I was trying to reproduce this one in RHEL 6.5 the reproducer (the
> latter script) sometimes runs nice, but sometimes it exits with weird error:
> 
> -bash-4.1# ruby test.rb
> -bash-4.1# ruby test.rb
> -bash-4.1# ruby test.rb
> test.rb:4:in `new': expecting 0 or 1 arguments (ArgumentError)
>         from test.rb:4
> 
> Is this behavior expected today?

That's probably this bug:

https://bugzilla.redhat.com/show_bug.cgi?id=1072062

which should be fixed in RHEL 6.6.  I don't believe we have
a package for RHEL 6.6 that fixes it right now, but you could
ask ptoscano about it.
Comment 12 Vít Ondruch 2014-05-13 03:54:11 EDT
Rich, have you noticed this change?

https://github.com/ruby/ruby/commit/fa4c17e71283e72100c12b8824aafe176372241d

It was not exported, but it was used: https://github.com/ruby/ruby/blob/ruby_1_8_7/gc.c#L1124

Not sure if that might be relevant.
Comment 13 Lukas Zapletal 2014-05-13 04:00:37 EDT
Info for googlers: To reproduce this bug on a host with large free memory, add this line to the script:

GC.stress = true

And then you are able to reproduce everytime:

[root@hp-ml370g4-01 ~]# ruby test.rb
libguestfs: exception in callback!
Comment 14 Vít Ondruch 2014-05-13 07:33:05 EDT
Also this post might be helpful:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/60741

It suggest to use RB_GC_GUARD preferably to volatile.
Comment 15 Richard W.M. Jones 2014-05-13 09:06:49 EDT
(In reply to Vít Ondruch from comment #14)
> Also this post might be helpful:
> 
> http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/60741
> 
> It suggest to use RB_GC_GUARD preferably to volatile.

Thanks, I have filed a bug to do this:
https://bugzilla.redhat.com/show_bug.cgi?id=1097272

Note You need to log in before you can comment on or make changes to this bug.