Description of problem: Initial report is here, with a partial stack trace: https://issues.jboss.org/browse/BGBUILD-371 Version-Release number of selected component (if applicable): libguestfs 1.16.28 How reproducible: 100% Steps to Reproduce: 1. unknown at present
Created attachment 600404 [details] bz843188.rb This small script reproduces the bug. Save it to a file, chmod +x the script and run it. It will cause a segfault. If you comment out the line that calls g.set_event_callback then it won't segfault, proving that the problem is with events.
Created attachment 600405 [details] bz843188.rb Smaller and faster reproducer. The crash happens during final cleanup. Adding g.close() at the end prevents the crash. I'm not sure this is the same as the crash seen in the downstream bug report, but this one certainly needs to be fixed ...
The following patch fixes this bug, although I don't precisely understand why. diff --git a/generator/generator_ruby.ml b/generator/generator_ruby.ml index 7ab2177..36af4f9 100644 --- a/generator/generator_ruby.ml +++ b/generator/generator_ruby.ml @@ -276,9 +276,10 @@ ruby_event_callback_wrapper_wrapper (VALUE argvv) /* Check the Ruby callback still exists. For reasons which are not * fully understood, even though we registered this as a global root, * it is still possible for the callback to go away (fn value remains - * but its type changes from T_DATA to T_NONE). (RHBZ#733297) + * but its type changes from T_DATA to T_NONE or T_ZOMBIE). + * (RHBZ#733297, RHBZ#843188) */ - if (rb_type (fn) != T_NONE) { + if (rb_type (fn) != T_NONE && rb_type (fn) != T_ZOMBIE) { eventv = argv[1]; event_handlev = argv[2]; bufv = argv[3];
Although the fix in comment 3 certainly fixes a bug, it cannot possibly be the fix for this bug since T_ZOMBIE does not exist in the older version of Ruby used in Fedora 16. Therefore I believe there are at least two bugs involved here. I am not able to reproduce the original bug, by running Boxgrinder directly, nor using test scripts that emulate what BG appears to do. Therefore I'm going back to the core dump and stack trace since that is the only reliable data available. Unfortunately the debuginfo package wasn't available when the core dump was taken. So I installed a F16 32 bit VM and retrieved a complete stack trace: #0 0xb7709424 in __kernel_vsyscall () (gdb) bt #0 0xb7709424 in __kernel_vsyscall () #1 0x4987998f in __GI_raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 #2 0x4987b2d5 in __GI_abort () at abort.c:91 #3 0x49ad6dda in rb_bug (fmt=0x49b79cf6 "Segmentation fault") at error.c:213 #4 0x49b47c1b in sigsegv (sig=<optimized out>) at signal.c:634 #5 sigsegv (sig=11) at signal.c:622 #6 <signal handler called> #7 st_foreach (table=0xffffffff, func=0x49afc6a0 <mark_entry>, arg=0) at st.c:485 #8 0x49afbc7a in mark_tbl (tbl=<optimized out>, lev=<optimized out>) at gc.c:716 #9 0x49b0de33 in mark_load_arg (ptr=0xbfa71ccc) at marshal.c:841 #10 0x49afcbbb in gc_mark_children (ptr=3067931800, lev=1) at gc.c:1025 #11 0x49afc761 in mark_locations_array (x=0xbfa71ce4, n=<optimized out>) at gc.c:684 #12 0x49afd33f in garbage_collect () at gc.c:1476 #13 0x49afdd4d in ruby_xmalloc (size=<optimized out>) at gc.c:147 #14 0x49b4c8f9 in rb_str_buf_new (capa=128) at string.c:234 #15 0x49b489ac in rb_str_format (argc=7, argv=0xb44d1d4, fmt=3075887580) at sprintf.c:298 #16 0x49b4b6e2 in rb_str_format_m (str=3075836620, arg=3067886260) at string.c:469 #17 0x49ae4f4d in call_cfunc (argv=0xbfa6bea0, argc=1, len=1, recv=3075836620, func=0x49b4b690 <rb_str_format_m>) at eval.c:5784 #18 rb_call0 (klass=3077477380, recv=3075836620, id=37, oid=37, argc=1, argv=0xbfa6bea0, body=0xb76e9224, flags=0) at eval.c:5928 #19 0x49ae51e2 in rb_call (klass=3077477380, recv=3075836620, mid=37, argc=1, argv=0xbfa6bea0, scope=0, self=<optimized out>) at eval.c:6176 #20 0x49aea9c7 in rb_eval (self=3069007920, n=0xb7564d90) at eval.c:3506 #21 0x49ae4511 in rb_call0 (klass=3075836720, recv=3069007920, id=5521, oid=5521, argc=<optimized out>, argv=<optimized out>, body=0xb7564f5c, flags=0) at eval.c:6079 #22 0x49ae51e2 in rb_call (klass=3075836720, recv=3069007920, mid=5521, argc=4, argv=0xbfa6c620, scope=0, self=<optimized out>) at eval.c:6176 #23 0x49aea9c7 in rb_eval (self=3069007880, n=0xb7565330) at eval.c:3506 #24 0x49ae4511 in rb_call0 (klass=3075862520, recv=3069007880, id=22529, oid=22529, argc=<optimized out>, argv=<optimized out>, body=0xb75652e0, flags=2) at eval.c:6079 #25 0x49ae51e2 in rb_call (klass=3075862520, recv=3069007880, mid=22529, argc=4, argv=0xbfa6cda0, scope=1, self=<optimized out>) at eval.c:6176 #26 0x49aeab24 in rb_eval (self=3069007880, n=0xb7566dfc) at eval.c:3521 #27 0x49aea911 in rb_eval (self=3069007880, n=0xb7566dd4) at eval.c:3501 #28 0x49ae4511 in rb_call0 (klass=3075862520, recv=3069007880, id=5561, oid=5561, argc=<optimized out>, argv=<optimized out>, body=0xb75673b0, flags=0) at eval.c:6079 #29 0x49ae51e2 in rb_call (klass=3075862520, recv=3069007880, mid=5561, argc=3, argv=0xbfa6d990, scope=1, self=<optimized out>) at eval.c:6176 #30 0x49aeab24 in rb_eval (self=3069007880, n=0xb756beb0) at eval.c:3521 #31 0x49aeedfd in block_pass (self=3069007880, node=0xb756bec4) at eval.c:9193 #32 0x49aeb2b7 in rb_eval (self=3069007880, n=0xb756bec4) at eval.c:3222 #33 0x49ae4511 in rb_call0 (klass=3075862520, recv=3069007880, id=16065, oid=16065, argc=<optimized out>, argv=<optimized out>, body=0xb756bf78, flags=0) at eval.c:6079 #34 0x49ae51e2 in rb_call (klass=3075862520, recv=3069007880, mid=16065, argc=1, argv=0xbfa6eaf4, scope=1, self=<optimized out>) at eval.c:6176 #35 0x49ae53ec in rb_f_send (argc=1, argv=0xbfa6eaf4, recv=3069007880) at eval.c:6224 #36 0x49ae4f7e in call_cfunc (argv=0xbfa6eaf0, argc=2, len=-1, recv=3069007880, func=0x49ae5320 <rb_f_send>) at eval.c:5778 #37 rb_call0 (klass=3077487780, recv=3069007880, id=4225, oid=4225, argc=2, argv=0xbfa6eaf0, body=0xb76e99e0, flags=0) at eval.c:5928 #38 0x49ae51e2 in rb_call (klass=3077487780, recv=3069007880, mid=4225, argc=2, argv=0xbfa6eaf0, scope=0, self=<optimized out>) at eval.c:6176 #39 0x49aea9c7 in rb_eval (self=3069008180, n=0xb756a7b8) at eval.c:3506 #40 0x49ae4511 in rb_call0 (klass=3075611220, recv=3069008180, id=4057, oid=4057, argc=<optimized out>, argv=<optimized out>, body=0xb756a754, flags=0) at eval.c:6079 #41 0x49ae51e2 in rb_call (klass=3075611220, recv=3069008180, mid=4057, argc=2, argv=0xbfa6f2b0, scope=1, self=<optimized out>) at eval.c:6176 #42 0x49ae559c in rb_funcall2 (recv=3069008180, mid=4057, argc=2, argv=0xbfa6f2b0) at eval.c:6312 #43 0x49ae865b in method_missing (obj=3069008180, id=<optimized out>, argc=<optimized out>, argv=0xbfa6f3c0, call_status=0) at eval.c:5758 #44 0x49ae5248 in rb_call (klass=3075611220, recv=3069008180, mid=16065, argc=1, argv=0xbfa6f3c0, scope=0, self=<optimized out>) at eval.c:6155 #45 0x49aea9c7 in rb_eval (self=3068469600, n=0xb73e2814) at eval.c:3506 #46 0x49ae2fcc in rb_yield_0 (val=3067887080, self=3068469600, klass=0, flags=2, avalue=2) at eval.c:5095 #47 0x49ae353b in proc_invoke (proc=3068466840, args=3067887080, self=6, klass=0) at eval.c:8895 #48 0x49ae4fa7 in call_cfunc (argv=0xbfa6ff90, argc=4, len=-2, recv=3068466840, func=0x49af15a0 <rb_proc_call>) at eval.c:5775 #49 rb_call0 (klass=3077420980, recv=3068466840, id=5521, oid=5521, argc=4, argv=0xbfa6ff90, body=0xb76db700, flags=0) at eval.c:5928 #50 0x49ae51e2 in rb_call (klass=3077420980, recv=3068466840, mid=5521, argc=4, argv=0xbfa6ff90, scope=1, self=<optimized out>) at eval.c:6176 #51 0x49ae598c in vafuncall (recv=3068466840, mid=5521, n=4, ar=0xbfa7000c) at eval.c:6253 #52 0x49ae5b48 in rb_funcall (recv=3068466840, mid=5521, n=4) at eval.c:6270 #53 0xb720167b in ruby_event_callback_wrapper_wrapper (argvv=3215393420) at _guestfs.c:272 #54 0x49adf288 in rb_rescue2 ( b_proc=0xb7201600 <ruby_event_callback_wrapper_wrapper>, data1=3215393420, r_proc=0xb72015b0 <ruby_event_callback_handle_exception>, data2=4) at eval.c:5491 #55 0x49adf488 in rb_rescue ( b_proc=0xb7201600 <ruby_event_callback_wrapper_wrapper>, data1=3215393420, r_proc=0xb72015b0 <ruby_event_callback_handle_exception>, data2=4) at eval.c:5527 #56 0xb72015a1 in ruby_event_callback_wrapper (g=0xa52fc68, data=0xa531198, event=16, event_handle=0, flags=0, buf=0xbfa70350 "guestfsd: receive_file: reading length word\r\nguestfsd: receive_file: got chunk: cancel = 0x0, len = 8192, buf = 0xdf979e8\r\nguestfsd: receive_file: reading length word\r\nx0, len = 8192, buf = 0xdf979e8\r"..., buf_len=168, array=0x0, array_len=0) at _guestfs.c:249 #57 0xb7190afe in guestfs___call_callbacks_message (g=0xa52fc68, event=16, buf=0xbfa70350 "guestfsd: receive_file: reading length word\r\nguestfsd: receive_file: got chunk: cancel = 0x0, len = 8192, buf = 0xdf979e8\r\nguestfsd: receive_file: reading length word\r\nx0, len = 8192, buf = 0xdf979e8\r"..., buf_len=168) at events.c:109 #58 0xb719c7e0 in read_log_message_or_eof (g=0xa52fc68, fd=<optimized out>, error_if_eof=0) at proto.c:286 #59 0xb719d025 in guestfs___send_to_daemon (g=0xa52fc68, v_buf=0xb45d820, n=8204) at proto.c:474 #60 0xb719d361 in send_file_chunk (g=0xa52fc68, cancel=<optimized out>, buf=0xbfa7285f "ffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffff"..., buflen=8192) at proto.c:1008 #61 0xb719e315 in send_file_data (len=8192, buf=0xbfa7285f "ffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffffffd8ffff"..., g=0xa52fc68) at proto.c:947 #62 guestfs___send_file (g=0xa52fc68, filename=0xb3a0118 "build/appliances/i686/centos/6/quarri-poq/1.0/centos-plugin/tmp/Packages.dump") at proto.c:907 #63 0xb714bdcd in guestfs_upload (g=0xa52fc68, filename=0xb3a0118 "build/appliances/i686/centos/6/quarri-poq/1.0/centos-plugin/tmp/Packages.dump", remotefilename=0xa2a9340 "/tmp/Packages.dump") at actions.c:10879 #64 0xb72092d0 in ruby_guestfs_upload (gv=3068468020, filenamev=3067931000, remotefilenamev=3067930780) at _guestfs.c:7118 #65 0x49ae4f31 in call_cfunc (argv=0xbfa74ca0, argc=2, len=2, recv=3068468020, func=0xb7209270 <ruby_guestfs_upload>) at eval.c:5787 #66 rb_call0 (klass=3074266180, recv=3068468020, id=45073, oid=45073, argc=2, argv=0xbfa74ca0, body=0xb73d793c, flags=0) at eval.c:5928 [more Ruby call frames elided]
The last thing that happens before it crashes is it constructs a formatted string: #13 0x49afdd4d in ruby_xmalloc (size=<optimized out>) at gc.c:147 #14 0x49b4c8f9 in rb_str_buf_new (capa=128) at string.c:234 #15 0x49b489ac in rb_str_format (argc=7, argv=0xb44d1d4, fmt=3075887580) at sprintf.c:298 #16 0x49b4b6e2 in rb_str_format_m (str=3075836620, arg=3067886260) at string.c:469 At frame #16, str is: (gdb) print *(struct RString *)str $4 = {basic = {flags = 8263, klass = 3077477380}, len = 26, ptr = 0x9cad898 "%s, [%s#%d] %5s -- %s: %s\n", aux = {capa = -1219079716, shared = 3075887580}} This format string occurs in the ruby core Logger class in a subclass called Formatter which constructs the string to be logged. Since LogHelper in Boxgrinder calls Logger, I conclude that it's executing one of the log.* methods in the event callback (default_callback).
I would say this is a fault in the Ruby garbage collector. Some observations: (1) Ruby scans the stack using an imprecise pointer detector (is_pointer_to_heap in gc.c) (2) At the point where the crash occurs, it is scanning an 8192 byte buffer that happens to be on the stack somewhere in the middle of libguestfs, and happens to contains random data related to the upload operation. Ruby really should not be scanning this buffer. (3) The actual part of this buffer that is being scanned (0xbfa71ce4) looks like a valid stack pointer. It isn't -- it's just some random data that happens to look like that. (4) If we interpret this random data as a Ruby VALUE, it becomes clearer what's going on: (gdb) print/x *0xbfa71ce4 $15 = 0xb6dcec98 (gdb) print *(RVALUE *)0xb6dcec98 $17 = { as = { free = { flags = 98, # to Ruby this looks like T_DATA [...] data = { basic = { flags = 98, klass = 3077482900 }, dmark = 0x49b0de10 <mark_load_arg>, dfree = 0, data = 0xbfa71ccc }, [...] (5) As you can see from the stack trace, Ruby follows the completely bogus "data" pointer 0xbfa71ccc: #9 0x49b0de33 in mark_load_arg (ptr=0xbfa71ccc) at marshal.c:841 and eventually this causes a crash. (6) WTF?
I get it now. This issue is the same as http://bugs.ruby-lang.org/issues/4339 and should be fixed by the same patch.
(In reply to comment #7) > I get it now. This issue is the same as > http://bugs.ruby-lang.org/issues/4339 > and should be fixed by the same patch. Ignore comment 7. This patch does not fix the problem. The solution is either: (a) Update to Ruby 1.9. It has a different GC impl. (b) Don't keep buffers of data on the stack and cross your fingers that nothing else on the stack looks like a root.
This is a fundamental bug in Ruby 1.8. On this basis, I'm closing as CANTFIX.
Hey Richard, when I was trying to reproduce this one in RHEL 6.5 the reproducer (the latter script) sometimes runs nice, but sometimes it exits with weird error: -bash-4.1# ruby test.rb -bash-4.1# ruby test.rb -bash-4.1# ruby test.rb test.rb:4:in `new': expecting 0 or 1 arguments (ArgumentError) from test.rb:4 Is this behavior expected today?
(In reply to Lukas Zapletal from comment #10) > Hey Richard, > > when I was trying to reproduce this one in RHEL 6.5 the reproducer (the > latter script) sometimes runs nice, but sometimes it exits with weird error: > > -bash-4.1# ruby test.rb > -bash-4.1# ruby test.rb > -bash-4.1# ruby test.rb > test.rb:4:in `new': expecting 0 or 1 arguments (ArgumentError) > from test.rb:4 > > Is this behavior expected today? That's probably this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1072062 which should be fixed in RHEL 6.6. I don't believe we have a package for RHEL 6.6 that fixes it right now, but you could ask ptoscano about it.
Rich, have you noticed this change? https://github.com/ruby/ruby/commit/fa4c17e71283e72100c12b8824aafe176372241d It was not exported, but it was used: https://github.com/ruby/ruby/blob/ruby_1_8_7/gc.c#L1124 Not sure if that might be relevant.
Info for googlers: To reproduce this bug on a host with large free memory, add this line to the script: GC.stress = true And then you are able to reproduce everytime: [root@hp-ml370g4-01 ~]# ruby test.rb libguestfs: exception in callback!
Also this post might be helpful: http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/60741 It suggest to use RB_GC_GUARD preferably to volatile.
(In reply to Vít Ondruch from comment #14) > Also this post might be helpful: > > http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/60741 > > It suggest to use RB_GC_GUARD preferably to volatile. Thanks, I have filed a bug to do this: https://bugzilla.redhat.com/show_bug.cgi?id=1097272