Hide Forgot
Description of problem: >> Writing PID to /var/run/aeolus-conductor/thin.pid >> Changing process privilege to aeolus:aeolus Using gem require instead of bundler /usr/share/aeolus-conductor/app/util/deployable_xml.rb:172: warning: already initialized constant DEPLOYABLE_VERSION >> Thin web server (v1.2.11 codename Bat-Shit Crazy) >> Maximum connections set to 1024 >> Listening on 127.0.0.1:3000, CTRL+C to stop >> Stopping ... >> Exiting! >> Writing PID to /var/run/aeolus-conductor/thin.pid >> Changing process privilege to aeolus:aeolus Using gem require instead of bundler /usr/share/aeolus-conductor/app/util/deployable_xml.rb:172: warning: already initialized constant DEPLOYABLE_VERSION >> Thin web server (v1.2.11 codename Bat-Shit Crazy) >> Maximum connections set to 1024 >> Listening on 127.0.0.1:3000, CTRL+C to stop /usr/lib/ruby/gems/1.8/gems/nokogiri-1.5.0.beta.4/lib/nokogiri/xml/node_set.rb:237: [BUG] rb_gc_mark(): unknown data type 0x0(0x3d435d0) non object ruby 1.8.7 (2011-06-30 patchlevel 352) [x86_64-linux] [root@qeblade30 ~]# rpm -qa | grep aeolus aeolus-configure-2.5.0-11.el6.noarch aeolus-all-0.8.0-17.el6.noarch rubygem-aeolus-image-0.3.0-7.el6.noarch aeolus-conductor-0.8.0-17.el6.noarch rubygem-aeolus-cli-0.3.0-7.el6.noarch aeolus-conductor-doc-0.8.0-17.el6.noarch aeolus-conductor-daemons-0.8.0-17.el6.noarch [root@qeblade30 ~]# rpm -qa | grep ruby rubygem-rails_warden-0.5.5-1.el6.noarch ruby-augeas-0.3.0-1.el6.x86_64 rubygem-nokogiri-1.5.0-0.5.beta4.el6.x86_64 rubygem-tzinfo-0.3.26-1.el6_0.noarch rubygem-sass-3.1.4-4.el6.noarch rubygem-treetop-1.4.9-2.el6.noarch rubygem-erubis-2.6.6-1.el6.noarch rubygem-http_connection-1.4.1-3.el6_1.noarch rubygem-actionmailer-3.0.10-1.el6.noarch rubygem-addressable-2.2.6-2.el6.noarch rubygem-arel-2.0.9-1.el6_0.noarch rubygem-xml-simple-1.0.12-3.el6_0.noarch ruby-qpid-qmf-0.12-6.el6.x86_64 rubygem-builder-2.1.2-4.el6_0.noarch rubygem-oauth-0.4.4-1.el6.noarch rubygem-rack-restful_submit-1.2.1-4.el6.noarch rubygem-sinatra-1.2.6-2.el6.noarch rubygem-daemons-1.0.10-3.el6_0.noarch rubygem-yard-0.7.2-1.el6.noarch rubygem-haml-3.1.2-2.el6.noarch rubygem-highline-1.5.1-4.el6.noarch rubygem-aeolus-image-0.3.0-7.el6.noarch ruby-1.8.7.352-4.el6_2.x86_64 rubygems-1.8.10-1.el6.noarch rubygem-activesupport-3.0.10-1.el6.noarch rubygem-uuidtools-2.1.1-2.el6_0.noarch rubygem-json-1.4.3-3.el6_0.x86_64 rubygem-rest-client-1.6.1-2.el6_0.noarch rubygem-will_paginate-3.0-0.1.pre2.el6.noarch rubygem-ruby-net-ldap-0.0.4-5.el6.noarch rubygem-tilt-1.3.2-3.el6.noarch rubygem-vcr-1.10.0-1.el6.noarch rubygem-eventmachine-0.12.10-5.el6_0.x86_64 rubygem-trollop-1.16.2-1.el6_0.noarch rubygem-crack-0.1.8-2.el6.noarch rubygem-actionpack-3.0.10-1.el6.noarch rubygem-maruku-0.6.0-4.el6.noarch rubygem-sqlite3-1.3.3-5.el6.x86_64 rubygem-bundler-1.0.15-1.el6.noarch rubygem-aws-2.5.5-3.1.el6_1.noarch libselinux-ruby-2.0.94-5.2.el6.x86_64 rubygem-imagefactory-console-0.4.0-1.el6.noarch rubygem-rails-3.0.10-1.el6.noarch ruby-libs-1.8.7.352-4.el6_2.x86_64 ruby-rdoc-1.8.7.352-4.el6_2.x86_64 rubygem-i18n-0.5.0-1.el6_0.noarch rubygem-activemodel-3.0.10-1.el6.noarch rubygem-rack-test-0.6.0-2.el6.noarch rubygem-mime-types-1.16-4.el6_0.noarch rubygem-rdoc-3.8-3.el6.noarch rubygem-warden-1.0.5-1.el6.noarch rubygem-mail-2.3.0-1.el6.noarch rubygem-hpricot-0.8.4-1.el6.x86_64 rubygem-net-ssh-2.0.23-6.el6_0.noarch rubygem-curb-0.7.10-1.el6.x86_64 rubygem-fastercsv-1.5.1-2.el6.noarch rubygem-rack-mount-0.7.1-6.el6.noarch rubygem-syntax-1.0.0-4.el6.noarch rubygem-webmock-1.6.4-1.el6.noarch rubygem-thor-0.14.6-3.el6.noarch rubygem-activerecord-3.0.10-2.el6.noarch rubygem-aeolus-cli-0.3.0-7.el6.noarch ruby-irb-1.8.7.352-4.el6_2.x86_64 rubygem-rake-0.9.2-1.el6.noarch rubygem-fssm-0.2.7-1.el6.noarch rubygem-rack-accept-0.4.3-6.el6_0.noarch rubygem-text-format-1.0.0-4.el6.noarch rubygem-thin-1.2.11-3.el6.x86_64 rubygem-regin-0.3.8-2.el6.noarch rubygem-diff-lcs-1.1.2-4.el6_0.noarch rubygem-timecop-0.3.5-2.el6.noarch ruby-shadow-1.4.1-13.el6.x86_64 rubygem-rack-1.3.0-1.el6.noarch rubygem-activeresource-3.0.10-1.el6.noarch rubygem-deltacloud-client-0.5.0-1.rc2.el6.noarch rubygem-polyglot-0.3.1-1.el6.noarch rubygem-abstract-1.0.0-3.el6.noarch rubygem-rbvmomi-1.2.3-3.el6.noarch rubygem-simple-navigation-3.0.0-3.el6_0.noarch rubygem-pg-0.11.0-5.el6.x86_64 rubygem-railties-3.0.10-1.el6.noarch
*** This bug has been marked as a duplicate of bug 782573 ***
This is not a duplicate of bug 782573, which is now fixed, and which affected deltacloud. This one affects conductor/thin. I too confused them and added the following comment+backtrace to that other bug, whereas it is relevant only to this one: https://bugzilla.redhat.com/show_bug.cgi?id=782573#c20 Here's the top of that backtrace: Program received signal SIGSEGV, Segmentation fault. gc_mark (ptr=125780103885412, lev=1) at gc.c:804 804 if (obj->as.basic.flags == 0) return; /* free cell */ (gdb) source /root/.meyering/.gdbinit-ruby Really redefine built-in command "eval"? (y or n) [answered Y; input not from terminal] **** Loaded Ruby Macros Successfully **** (gdb) bt #0 gc_mark (ptr=125780103885412, lev=1) at gc.c:804 #1 0x00000036d82511a2 in gc_mark_children (ptr=<value optimized out>, lev=1) at gc.c:1063 #2 0x00000036d822ee32 in blk_mark (data=0x4a1b270) at eval.c:8526 #3 0x00000036d82511a2 in gc_mark_children (ptr=<value optimized out>, lev=1) at gc.c:1063 #4 0x00000036d8250d4f in mark_locations_array (x=<value optimized out>, n=76152) at gc.c:690 #5 0x00000036d82515f7 in garbage_collect () at gc.c:1493 #6 0x00000036d825226f in ruby_xmalloc (size=<value optimized out>) at gc.c:153 #7 0x00000036d822e742 in frame_dup (frame=0x574bfe0) at eval.c:8575 #8 0x00000036d822f881 in blk_copy_prev (block=0x54d4cb0) at eval.c:8594 #9 0x00000036d8231878 in proc_alloc (klass=139637772402520, proc=0) at eval.c:8820 #10 0x00000036d823857b in rb_eval (self=139637682160400, n=0x7efff1dff410) at eval.c:3793
*** Bug 788392 has been marked as a duplicate of this bug. ***
This doesn't meet the beta blocker or urgent criteria -- it does not prevent customers or testers from using the product. Moving to RC blocker/high severity.
I'm running thin via rdebug like this: RAILS_ENV=production rdebug -- /usr/bin/thin start -c /usr/share/aeolus-conductor -l /var/log/aeolus-conductor/thin.log -P /var/run/aeolus-conductor/thin.pid -a 127.0.0.1 -e production --user aeolus -g aeolus --prefix=/conductor --rackup /usr/share/aeolus-conductor/config.ru [then hit "c" to continue it] after manually starting these services: service mongod start service iwhd start service postgresql start service httpd start service qpidd start service deltacloud-core start service libvirtd start Then, I manually started the remaining services: service conductor-dbomatic start service imagefactory start service ntpd start However, now, when I visit the UI, currently here (admin/password) https://qeblade31.rhq.lab.eng.bos.redhat.com/conductor/pools the UI renders as if javascript were disabled, which means I can't use it to delete an instance. I'll attach a .pdf of what it looks like. [back story, I need ruby debug access, since gdb-debugging is not feasible due to consistent stack corruption. The usual methods of printing stack trace info do not work. At least with the preceding ruby-segfault bug, I could usually get a rb_bt (i.e., readable) backtrace. ]
Created attachment 560682 [details] unusable UI, when thin is invoked via rdebug
Thanks to help from Chris Alfonso, I discovered that I must also set the RAILS_RELATIVE_URL_ROOT envvar. With this, it now works: RAILS_RELATIVE_URL_ROOT=/conductor RAILS_ENV=production rdebug -- /usr/bin/thin start -c /usr/share/aeolus-conductor -l /var/log/aeolus-conductor/thin.log -P /var/run/aeolus-conductor/thin.pid -a 127.0.0.1 -e production --user aeolus -g aeolus --prefix=/conductor --rackup /usr/share/aeolus-conductor/config.ru
FYI, running under rdebug, I stopped at /usr/share/aeolus-conductor/app/controllers/deployments_controller.rb:248 turned on linetracing ("tr on"), but it went through the deletion process without a segfault. However, the deletion did fail: "Errors: The application a could not be deleted" So I tried again, this time without enabling linetracing mode (which, by definition changes how ruby works, and probably influences GC and/or thread timing enough not to trigger the failure). On the other hand, since I'll have no idea where in ruby land it dies if I trigger the failure, I've also attached gdb to the rdebug process that's driving thin. This time I did hit the expected GC bug (though here it's an abort not segv). (rdb:1) n /usr/share/aeolus-conductor/app/controllers/deployments_controller.rb:249 failed = [] (rdb:1) /usr/share/aeolus-conductor/app/controllers/deployments_controller.rb:250 Deployment.find(params[:deployments_selected] || []).each do |deployment| (rdb:1) /usr/lib/ruby/gems/1.8/gems/rest-client-1.6.1/lib/restclient/net_http_ext.rb:12: [BUG] rb_gc_mark(): unknown data type 0x13(0x4e24fd0) non object ruby 1.8.7 (2011-06-30 patchlevel 352) [x86_64-linux] Full backtrace attached.
Created attachment 560872 [details] backtrace ("bt full" from gdb)
Created attachment 560909 [details] a *useful* backtrace (pre-corruption) I found that approximately the 7th garbage_collect call after delete-button-press is the one that fails. So here's a stack trace from there. As shown, completing that function via gdb's "finish" command evoked the usual segfault: This implicates nokogiri and its XML-parsing code: #0 garbage_collect () at gc.c:1441 #1 0x00000036d825226f in ruby_xmalloc (size=<value optimized out>) at gc.c:153 #2 0x00000036e165bd7c in xmlHashCreate () from /usr/lib64/libxml2.so.2 #3 0x00000036e1686de6 in xmlXPathNewContext () from /usr/lib64/libxml2.so.2 #4 0x00007fa4f6a95893 in ?? () from /usr/lib64/ruby/site_ruby/1.8/x86_64-linux/nokogiri/nokogiri.so #5 0x00000036d823c6e5 in rb_call0 (klass=140346496273280, recv=140346496273320, id=3361, oid=<value optimized out>, argc=1, argv=0x7fff68a83230, body=0x7fa4f70e4b30, flags=0) at eval.c:5968 ... "bt full" output attached
Created attachment 560928 [details] ruby-backtrace, (but not immediately- pre-segv) for the record, this was not the GC call that failed, so it merely gives an idea of what such a stack trace looks like. I have to disable the gc breakpoint in order to run rb_bt (which produces this style of backtrace), and forgot to reenable it afterwards.
See the partial backtrace, below. Note the contents of buf: /usr/lib/ruby/gems/1.8/gems/nokogiri-1.5.0/lib/nokogiri/xml/node_set.rb:237 The code there is not interesting: # Iterate over each node, yielding to +block+ def each(&block) 0.upto(length - 1) do |x| yield self[x] end end But the fact that once again it's in nokogiri/xml land is. (gdb) bt full #0 0x00000036d8632885 in raise () from /lib64/libc.so.6 No symbol table info available. #1 0x00000036d8634065 in abort () from /lib64/libc.so.6 No symbol table info available. #2 0x00000036d822db63 in rb_bug ( fmt=0x36d82b1948 "rb_gc_mark(): unknown data type 0x%lx(0x%lx) %s") at error.c:213 buf = "/usr/lib/ruby/gems/1.8/gems/nokogiri-1.5.0/lib/nokogiri/xml/node_set.rb:237: ", '\000' <repeats 11 times>"\203, \016%\330\066\000\000\000T\033+\330\066\000\000\000pt\223\002\000\000\000\000\320o\223\002\000\000\000\000\000\023%\330\066", '\000' <repeats 19 times>, "H\000\000\000\000\000\000\000\016\023%\330\066\000\000\000T\033+\330\066\000\000\000]\224)\330\066", '\000' <repeats 11 times>"\321, \r%\330\v\000\000\000T\033+\330\066\000\000\000\240o\223\002\000\000\000\000\360m\223\002\000\000\000\000\000\023%\330\066\000\000\000\003", '\000' <repeats 15 times>... args = {{gp_offset = 32, fp_offset = 48, overflow_arg_area = 0x7fffa255ff20, reg_save_area = 0x7fffa255fe40}} out = 0x36d899b860 len = 77 #3 0x00000036d8250e16 in gc_mark_children (ptr=79817872, lev=2) at gc.c:1080 obj = 0x4c1ec90 #4 0x00000036d82511a2 in gc_mark_children (ptr=<value optimized out>, lev=1) at gc.c:1063 n = <value optimized out> vars = 0x4352ee0 obj = <value optimized out> #5 0x00000036d822ee32 in blk_mark (data=0x46dd940) at eval.c:8526 No locals. #6 0x00000036d82511a2 in gc_mark_children (ptr=<value optimized out>, lev=1) at gc.c:1063 n = <value optimized out> vars = 0x3ee0b08 obj = <value optimized out> #7 0x00000036d8250d4f in mark_locations_array (x=<value optimized out>, n=76152) at gc.c:690 v = <value optimized out> #8 0x00000036d82515f7 in garbage_collect () at gc.c:1493 list = <value optimized out> frame = 0x0 save_regs_gc_mark = {{__jmpbuf = {80, 7611797768948620192, 0, 0, 0, 0, 7611797768998951840, -7624456336105333856}, __mask_was_saved = 0, __saved_mask = {__val = {235557375328, 140342351364096, 140735916934864, 0, 128, 0, 1, 0, 128, 0, 0, 0, 210453397510, 532575944795, 472446402679, 532575944795}}}} stack_end = 0x7fffa2560010
Created attachment 560935 [details] ruby-backtrace, (this time just prior to segv)
Created attachment 560945 [details] avoid one Nokogiri/xml use, patch by Francesco Vollero
Created attachment 560946 [details] ruby backtrace, using fvollero's patch Just before yet another segfault. Note that now it is no longer happening in Nokogiri::XML. Could simply be difference in GC, or it could be that we're avoiding a bug in the eliminated code.
As a test, I shut thin down and ran aeolus-conductor using WEBrick (rails server). We still see the segv issue. This is clearly not occurring due to thin.
For the record, here's a backtrace from Chris' system: #0 0x0000003c79a32885 in raise () from /lib64/libc.so.6 #1 0x0000003c79a34065 in abort () from /lib64/libc.so.6 #2 0x0000003c7a62db63 in rb_bug (fmt=0x3c7a6c634c "Segmentation fault") at error.c:213 #3 0x0000003c7a69602c in sigsegv (sig=11) at signal.c:634 #4 <signal handler called> #5 gc_mark (ptr=10575260, lev=1) at gc.c:804 #6 0x0000003c7a6511a2 in gc_mark_children (ptr=<value optimized out>, lev=1) at gc.c:1063 #7 0x0000003c7a62ee32 in blk_mark (data=0x34de650) at eval.c:8526 #8 0x0000003c7a6511a2 in gc_mark_children (ptr=<value optimized out>, lev=1) at gc.c:1063 #9 0x0000003c7a650d4f in mark_locations_array (x=<value optimized out>, n=79228) at gc.c:690 #10 0x0000003c7a6515f7 in garbage_collect () at gc.c:1493 #11 0x0000003c7a65226f in ruby_xmalloc (size=<value optimized out>) at gc.c:153 #12 0x0000003c7a62e742 in frame_dup (frame=0x4015cd0) at eval.c:8575 #13 0x0000003c7a62f881 in blk_copy_prev (block=0x4011990) at eval.c:8594 #14 0x0000003c7a63316a in proc_clone (self=139880507237440) at eval.c:8636 #15 0x0000003c7a6448ef in rb_mod_define_method (argc=<value optimized out>, argv=<value optimized out>, mod=139880561592160) at eval.c:10075 #16 0x0000003c7a63c6e5 in rb_call0 (klass=139880602065440, recv=139880561592160, id=4393, oid=<value optimized out>, argc=1, argv=0x7fff3e5727d0, body=0x7f387d988c00, flags=2) at eval.c:5968 #17 0x0000003c7a63cb07 in rb_call (klass=139880602065440, recv=139880561592160, mid=4393, argc=1, argv=0x7fff3e5727d0, scope=1, self=<value optimized out>) at eval.c:6251 ...
As it turns out there were issues in our Ruby code that were tickling the flaw in the interpreter that produced the segfault. The segfault should still be fixed, but it's no longer a Cloudforms-1.0.0 issue -- so pushing to cloudforms-1.1.0
For the record, I found that MBARI 1..3 patches from https://sites.google.com/site/brentsrubypatches/ when applied to our 1.8.7 rpm solved this problem. Here's the SRPM I tested: http://people.redhat.com/meyering/tmp/ruby-1.8.7.352-11.fc16.src.rpm
aeolus moving to ruby 1.9... this should be retested in a version of ruby higher than 1.8.7
Can we get verification of whether or not this is still an issue? We don't seem to be hitting it any longer as far as I can tell.
QE haven't ran into this issue for a while, so it's pretty safe to say this issue is resolved by: [root@ibm-x3690x5-01 ~]# rpm -qa | grep "conductor" aeolus-conductor-doc-0.13.14-1.el6cf.noarch aeolus-conductor-0.13.14-1.el6cf.noarch aeolus-conductor-daemons-0.13.14-1.el6cf.noarch