Bug 786432

Summary: aeolus-conductor thin crash
Product: [Retired] CloudForms Cloud Engine Reporter: wes hayutin <whayutin>
Component: aeolus-conductorAssignee: Jim Meyering <meyering>
Status: CLOSED CURRENTRELEASE QA Contact: Rehana <aeolus-qa-list>
Severity: high Docs Contact:
Priority: unspecified    
Version: 1.0.0CC: akarol, calfonso, deltacloud-maint, dgao, hbrock, lutter, morazi, ssachdev, vondruch
Target Milestone: rcKeywords: Reopened, Triaged
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-12-13 19:49:38 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description Flags
unusable UI, when thin is invoked via rdebug
none
backtrace ("bt full" from gdb)
none
a *useful* backtrace (pre-corruption)
none
ruby-backtrace, (but not immediately- pre-segv)
none
ruby-backtrace, (this time just prior to segv)
none
avoid one Nokogiri/xml use, patch by Francesco Vollero
none
ruby backtrace, using fvollero's patch none

Description wes hayutin 2012-02-01 12:43:44 UTC
Description of problem:
>> Writing PID to /var/run/aeolus-conductor/thin.pid
>> Changing process privilege to aeolus:aeolus
Using gem require instead of bundler
/usr/share/aeolus-conductor/app/util/deployable_xml.rb:172: warning: already initialized constant DEPLOYABLE_VERSION
>> Thin web server (v1.2.11 codename Bat-Shit Crazy)
>> Maximum connections set to 1024
>> Listening on 127.0.0.1:3000, CTRL+C to stop
>> Stopping ...
>> Exiting!
>> Writing PID to /var/run/aeolus-conductor/thin.pid
>> Changing process privilege to aeolus:aeolus
Using gem require instead of bundler
/usr/share/aeolus-conductor/app/util/deployable_xml.rb:172: warning: already initialized constant DEPLOYABLE_VERSION
>> Thin web server (v1.2.11 codename Bat-Shit Crazy)
>> Maximum connections set to 1024
>> Listening on 127.0.0.1:3000, CTRL+C to stop
/usr/lib/ruby/gems/1.8/gems/nokogiri-1.5.0.beta.4/lib/nokogiri/xml/node_set.rb:237: [BUG] rb_gc_mark(): unknown data type 0x0(0x3d435d0) non object
ruby 1.8.7 (2011-06-30 patchlevel 352) [x86_64-linux]

[root@qeblade30 ~]# rpm -qa | grep aeolus
aeolus-configure-2.5.0-11.el6.noarch
aeolus-all-0.8.0-17.el6.noarch
rubygem-aeolus-image-0.3.0-7.el6.noarch
aeolus-conductor-0.8.0-17.el6.noarch
rubygem-aeolus-cli-0.3.0-7.el6.noarch
aeolus-conductor-doc-0.8.0-17.el6.noarch
aeolus-conductor-daemons-0.8.0-17.el6.noarch
[root@qeblade30 ~]# rpm -qa | grep ruby
rubygem-rails_warden-0.5.5-1.el6.noarch
ruby-augeas-0.3.0-1.el6.x86_64
rubygem-nokogiri-1.5.0-0.5.beta4.el6.x86_64
rubygem-tzinfo-0.3.26-1.el6_0.noarch
rubygem-sass-3.1.4-4.el6.noarch
rubygem-treetop-1.4.9-2.el6.noarch
rubygem-erubis-2.6.6-1.el6.noarch
rubygem-http_connection-1.4.1-3.el6_1.noarch
rubygem-actionmailer-3.0.10-1.el6.noarch
rubygem-addressable-2.2.6-2.el6.noarch
rubygem-arel-2.0.9-1.el6_0.noarch
rubygem-xml-simple-1.0.12-3.el6_0.noarch
ruby-qpid-qmf-0.12-6.el6.x86_64
rubygem-builder-2.1.2-4.el6_0.noarch
rubygem-oauth-0.4.4-1.el6.noarch
rubygem-rack-restful_submit-1.2.1-4.el6.noarch
rubygem-sinatra-1.2.6-2.el6.noarch
rubygem-daemons-1.0.10-3.el6_0.noarch
rubygem-yard-0.7.2-1.el6.noarch
rubygem-haml-3.1.2-2.el6.noarch
rubygem-highline-1.5.1-4.el6.noarch
rubygem-aeolus-image-0.3.0-7.el6.noarch
ruby-1.8.7.352-4.el6_2.x86_64
rubygems-1.8.10-1.el6.noarch
rubygem-activesupport-3.0.10-1.el6.noarch
rubygem-uuidtools-2.1.1-2.el6_0.noarch
rubygem-json-1.4.3-3.el6_0.x86_64
rubygem-rest-client-1.6.1-2.el6_0.noarch
rubygem-will_paginate-3.0-0.1.pre2.el6.noarch
rubygem-ruby-net-ldap-0.0.4-5.el6.noarch
rubygem-tilt-1.3.2-3.el6.noarch
rubygem-vcr-1.10.0-1.el6.noarch
rubygem-eventmachine-0.12.10-5.el6_0.x86_64
rubygem-trollop-1.16.2-1.el6_0.noarch
rubygem-crack-0.1.8-2.el6.noarch
rubygem-actionpack-3.0.10-1.el6.noarch
rubygem-maruku-0.6.0-4.el6.noarch
rubygem-sqlite3-1.3.3-5.el6.x86_64
rubygem-bundler-1.0.15-1.el6.noarch
rubygem-aws-2.5.5-3.1.el6_1.noarch
libselinux-ruby-2.0.94-5.2.el6.x86_64
rubygem-imagefactory-console-0.4.0-1.el6.noarch
rubygem-rails-3.0.10-1.el6.noarch
ruby-libs-1.8.7.352-4.el6_2.x86_64
ruby-rdoc-1.8.7.352-4.el6_2.x86_64
rubygem-i18n-0.5.0-1.el6_0.noarch
rubygem-activemodel-3.0.10-1.el6.noarch
rubygem-rack-test-0.6.0-2.el6.noarch
rubygem-mime-types-1.16-4.el6_0.noarch
rubygem-rdoc-3.8-3.el6.noarch
rubygem-warden-1.0.5-1.el6.noarch
rubygem-mail-2.3.0-1.el6.noarch
rubygem-hpricot-0.8.4-1.el6.x86_64
rubygem-net-ssh-2.0.23-6.el6_0.noarch
rubygem-curb-0.7.10-1.el6.x86_64
rubygem-fastercsv-1.5.1-2.el6.noarch
rubygem-rack-mount-0.7.1-6.el6.noarch
rubygem-syntax-1.0.0-4.el6.noarch
rubygem-webmock-1.6.4-1.el6.noarch
rubygem-thor-0.14.6-3.el6.noarch
rubygem-activerecord-3.0.10-2.el6.noarch
rubygem-aeolus-cli-0.3.0-7.el6.noarch
ruby-irb-1.8.7.352-4.el6_2.x86_64
rubygem-rake-0.9.2-1.el6.noarch
rubygem-fssm-0.2.7-1.el6.noarch
rubygem-rack-accept-0.4.3-6.el6_0.noarch
rubygem-text-format-1.0.0-4.el6.noarch
rubygem-thin-1.2.11-3.el6.x86_64
rubygem-regin-0.3.8-2.el6.noarch
rubygem-diff-lcs-1.1.2-4.el6_0.noarch
rubygem-timecop-0.3.5-2.el6.noarch
ruby-shadow-1.4.1-13.el6.x86_64
rubygem-rack-1.3.0-1.el6.noarch
rubygem-activeresource-3.0.10-1.el6.noarch
rubygem-deltacloud-client-0.5.0-1.rc2.el6.noarch
rubygem-polyglot-0.3.1-1.el6.noarch
rubygem-abstract-1.0.0-3.el6.noarch
rubygem-rbvmomi-1.2.3-3.el6.noarch
rubygem-simple-navigation-3.0.0-3.el6_0.noarch
rubygem-pg-0.11.0-5.el6.x86_64
rubygem-railties-3.0.10-1.el6.noarch

Comment 1 wes hayutin 2012-02-01 12:45:41 UTC

*** This bug has been marked as a duplicate of bug 782573 ***

Comment 2 Jim Meyering 2012-02-08 08:13:03 UTC
This is not a duplicate of bug 782573, which is now fixed, and which affected deltacloud.  This one affects conductor/thin.

I too confused them and added the following comment+backtrace to that
other bug, whereas it is relevant only to this one:
https://bugzilla.redhat.com/show_bug.cgi?id=782573#c20

Here's the top of that backtrace:

Program received signal SIGSEGV, Segmentation fault.
gc_mark (ptr=125780103885412, lev=1) at gc.c:804
804         if (obj->as.basic.flags == 0) return;       /* free cell */
(gdb) source /root/.meyering/.gdbinit-ruby
Really redefine built-in command "eval"? (y or n) [answered Y; input not from terminal]
**** Loaded Ruby Macros Successfully ****
(gdb) bt
#0  gc_mark (ptr=125780103885412, lev=1) at gc.c:804
#1  0x00000036d82511a2 in gc_mark_children (ptr=<value optimized out>, lev=1)
    at gc.c:1063
#2  0x00000036d822ee32 in blk_mark (data=0x4a1b270) at eval.c:8526
#3  0x00000036d82511a2 in gc_mark_children (ptr=<value optimized out>, lev=1)
    at gc.c:1063
#4  0x00000036d8250d4f in mark_locations_array (x=<value optimized out>,
    n=76152) at gc.c:690
#5  0x00000036d82515f7 in garbage_collect () at gc.c:1493
#6  0x00000036d825226f in ruby_xmalloc (size=<value optimized out>) at gc.c:153
#7  0x00000036d822e742 in frame_dup (frame=0x574bfe0) at eval.c:8575
#8  0x00000036d822f881 in blk_copy_prev (block=0x54d4cb0) at eval.c:8594
#9  0x00000036d8231878 in proc_alloc (klass=139637772402520, proc=0)
    at eval.c:8820
#10 0x00000036d823857b in rb_eval (self=139637682160400, n=0x7efff1dff410)
    at eval.c:3793

Comment 3 wes hayutin 2012-02-08 13:30:39 UTC
*** Bug 788392 has been marked as a duplicate of this bug. ***

Comment 4 Hugh Brock 2012-02-08 15:10:59 UTC
This doesn't meet the beta blocker or urgent criteria  -- it does not prevent customers or testers from using the product. Moving to RC blocker/high severity.

Comment 5 Jim Meyering 2012-02-09 18:29:53 UTC
I'm running thin via rdebug like this:

RAILS_ENV=production rdebug -- /usr/bin/thin start -c /usr/share/aeolus-conductor -l /var/log/aeolus-conductor/thin.log -P /var/run/aeolus-conductor/thin.pid -a 127.0.0.1 -e production --user aeolus -g aeolus --prefix=/conductor --rackup /usr/share/aeolus-conductor/config.ru
[then hit "c" to continue it]

after manually starting these services:

service mongod start
service iwhd start
service postgresql start
service httpd start
service qpidd start
service deltacloud-core start
service libvirtd start

Then, I manually started the remaining services:

service conductor-dbomatic start
service imagefactory start
service ntpd start

However, now, when I visit the UI, currently here (admin/password)
https://qeblade31.rhq.lab.eng.bos.redhat.com/conductor/pools
the UI renders as if javascript were disabled, which means
I can't use it to delete an instance.  I'll attach a .pdf of what it looks like.

[back story, I need ruby debug access, since gdb-debugging is not feasible due to consistent stack corruption.  The usual methods of printing stack trace info do not work.  At least with the preceding ruby-segfault bug, I could usually get a rb_bt (i.e., readable) backtrace. ]

Comment 6 Jim Meyering 2012-02-09 18:32:31 UTC
Created attachment 560682 [details]
unusable UI, when thin is invoked via rdebug

Comment 7 Jim Meyering 2012-02-09 19:48:26 UTC
Thanks to help from Chris Alfonso, I discovered that
I must also set the RAILS_RELATIVE_URL_ROOT envvar.
With this, it now works:

RAILS_RELATIVE_URL_ROOT=/conductor RAILS_ENV=production rdebug -- /usr/bin/thin start -c /usr/share/aeolus-conductor -l /var/log/aeolus-conductor/thin.log -P /var/run/aeolus-conductor/thin.pid -a 127.0.0.1 -e production --user aeolus -g aeolus --prefix=/conductor --rackup /usr/share/aeolus-conductor/config.ru

Comment 8 Jim Meyering 2012-02-10 10:59:35 UTC
FYI, running under rdebug, I stopped at
/usr/share/aeolus-conductor/app/controllers/deployments_controller.rb:248
turned on linetracing ("tr on"), but it went through the deletion process
without a segfault.  However, the deletion did fail:

   "Errors: The application a could not be deleted"

So I tried again, this time without enabling linetracing mode
(which, by definition changes how ruby works, and probably
influences GC and/or thread timing enough not to trigger the failure).
On the other hand, since I'll have no idea where in ruby land it dies
if I trigger the failure, I've also attached gdb to the rdebug process
that's driving thin.

This time I did hit the expected GC bug (though here it's an abort not segv).

    (rdb:1) n
    /usr/share/aeolus-conductor/app/controllers/deployments_controller.rb:249
    failed = []
    (rdb:1)
    /usr/share/aeolus-conductor/app/controllers/deployments_controller.rb:250
    Deployment.find(params[:deployments_selected] || []).each do |deployment|
    (rdb:1)
    /usr/lib/ruby/gems/1.8/gems/rest-client-1.6.1/lib/restclient/net_http_ext.rb:12: [BUG] rb_gc_mark(): unknown data type 0x13(0x4e24fd0) non object
    ruby 1.8.7 (2011-06-30 patchlevel 352) [x86_64-linux]

Full backtrace attached.

Comment 9 Jim Meyering 2012-02-10 11:00:30 UTC
Created attachment 560872 [details]
backtrace ("bt full" from gdb)

Comment 10 Jim Meyering 2012-02-10 14:56:13 UTC
Created attachment 560909 [details]
a *useful* backtrace (pre-corruption)

I found that approximately the 7th garbage_collect call after
delete-button-press is the one that fails.
So here's a stack trace from there.
As shown, completing that function via gdb's "finish"
command evoked the usual segfault:

This implicates nokogiri and its XML-parsing code:

#0  garbage_collect () at gc.c:1441
#1  0x00000036d825226f in ruby_xmalloc (size=<value optimized out>) at gc.c:153
#2  0x00000036e165bd7c in xmlHashCreate () from /usr/lib64/libxml2.so.2
#3  0x00000036e1686de6 in xmlXPathNewContext () from /usr/lib64/libxml2.so.2
#4  0x00007fa4f6a95893 in ?? ()
   from /usr/lib64/ruby/site_ruby/1.8/x86_64-linux/nokogiri/nokogiri.so
#5  0x00000036d823c6e5 in rb_call0 (klass=140346496273280,
    recv=140346496273320, id=3361, oid=<value optimized out>, argc=1,
    argv=0x7fff68a83230, body=0x7fa4f70e4b30, flags=0) at eval.c:5968
...

"bt full" output attached

Comment 11 Jim Meyering 2012-02-10 15:54:27 UTC
Created attachment 560928 [details]
ruby-backtrace, (but not immediately- pre-segv)

for the record, this was not the GC call that failed, so it merely gives an idea of what such a stack trace looks like.  I have to disable the gc breakpoint in order to run rb_bt (which produces this style of backtrace), and forgot to reenable it afterwards.

Comment 12 Jim Meyering 2012-02-10 17:07:49 UTC
See the partial backtrace, below.
Note the contents of buf:

  /usr/lib/ruby/gems/1.8/gems/nokogiri-1.5.0/lib/nokogiri/xml/node_set.rb:237

The code there is not interesting:

      # Iterate over each node, yielding  to +block+
      def each(&block)
        0.upto(length - 1) do |x|
          yield self[x]
        end
      end

But the fact that once again it's in nokogiri/xml land is.



(gdb) bt full
#0  0x00000036d8632885 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00000036d8634065 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x00000036d822db63 in rb_bug (
    fmt=0x36d82b1948 "rb_gc_mark(): unknown data type 0x%lx(0x%lx) %s")
    at error.c:213
        buf = "/usr/lib/ruby/gems/1.8/gems/nokogiri-1.5.0/lib/nokogiri/xml/node_set.rb:237: ", '\000' <repeats 11 times>"\203, \016%\330\066\000\000\000T\033+\330\066\000\000\000pt\223\002\000\000\000\000\320o\223\002\000\000\000\000\000\023%\330\066", '\000' <repeats 19 times>, "H\000\000\000\000\000\000\000\016\023%\330\066\000\000\000T\033+\330\066\000\000\000]\224)\330\066", '\000' <repeats 11 times>"\321, \r%\330\v\000\000\000T\033+\330\066\000\000\000\240o\223\002\000\000\000\000\360m\223\002\000\000\000\000\000\023%\330\066\000\000\000\003", '\000' <repeats 15 times>...
        args = {{gp_offset = 32, fp_offset = 48,
            overflow_arg_area = 0x7fffa255ff20, reg_save_area = 0x7fffa255fe40}}
        out = 0x36d899b860
        len = 77
#3  0x00000036d8250e16 in gc_mark_children (ptr=79817872, lev=2) at gc.c:1080
        obj = 0x4c1ec90
#4  0x00000036d82511a2 in gc_mark_children (ptr=<value optimized out>, lev=1)
    at gc.c:1063
        n = <value optimized out>
        vars = 0x4352ee0
        obj = <value optimized out>
#5  0x00000036d822ee32 in blk_mark (data=0x46dd940) at eval.c:8526
No locals.
#6  0x00000036d82511a2 in gc_mark_children (ptr=<value optimized out>, lev=1)
    at gc.c:1063
        n = <value optimized out>
        vars = 0x3ee0b08
        obj = <value optimized out>
#7  0x00000036d8250d4f in mark_locations_array (x=<value optimized out>,
    n=76152) at gc.c:690
        v = <value optimized out>
#8  0x00000036d82515f7 in garbage_collect () at gc.c:1493
        list = <value optimized out>
        frame = 0x0
        save_regs_gc_mark = {{__jmpbuf = {80, 7611797768948620192, 0, 0, 0, 0,
              7611797768998951840, -7624456336105333856}, __mask_was_saved = 0,
            __saved_mask = {__val = {235557375328, 140342351364096,
                140735916934864, 0, 128, 0, 1, 0, 128, 0, 0, 0, 210453397510,
                532575944795, 472446402679, 532575944795}}}}
        stack_end = 0x7fffa2560010

Comment 13 Jim Meyering 2012-02-10 17:22:15 UTC
Created attachment 560935 [details]
ruby-backtrace, (this time just prior to segv)

Comment 14 Jim Meyering 2012-02-10 18:28:45 UTC
Created attachment 560945 [details]
avoid one Nokogiri/xml use, patch by Francesco Vollero

Comment 15 Jim Meyering 2012-02-10 18:30:36 UTC
Created attachment 560946 [details]
ruby backtrace, using fvollero's patch

Just before yet another segfault.
Note that now it is no longer happening in Nokogiri::XML.
Could simply be difference in GC, or it could be that we're
avoiding a bug in the eliminated code.

Comment 16 chris alfonso 2012-02-10 21:13:43 UTC
As a test, I shut thin down and ran aeolus-conductor using WEBrick (rails server).  We still see the segv issue.  This is clearly not occurring due to thin.

Comment 17 Jim Meyering 2012-02-10 21:37:14 UTC
For the record, here's a backtrace from Chris' system:

#0  0x0000003c79a32885 in raise () from /lib64/libc.so.6
#1  0x0000003c79a34065 in abort () from /lib64/libc.so.6
#2  0x0000003c7a62db63 in rb_bug (fmt=0x3c7a6c634c "Segmentation fault")
    at error.c:213
#3  0x0000003c7a69602c in sigsegv (sig=11) at signal.c:634
#4  <signal handler called>
#5  gc_mark (ptr=10575260, lev=1) at gc.c:804
#6  0x0000003c7a6511a2 in gc_mark_children (ptr=<value optimized out>, lev=1)
    at gc.c:1063
#7  0x0000003c7a62ee32 in blk_mark (data=0x34de650) at eval.c:8526
#8  0x0000003c7a6511a2 in gc_mark_children (ptr=<value optimized out>, lev=1)
    at gc.c:1063
#9  0x0000003c7a650d4f in mark_locations_array (x=<value optimized out>,
    n=79228) at gc.c:690
#10 0x0000003c7a6515f7 in garbage_collect () at gc.c:1493
#11 0x0000003c7a65226f in ruby_xmalloc (size=<value optimized out>) at gc.c:153
#12 0x0000003c7a62e742 in frame_dup (frame=0x4015cd0) at eval.c:8575
#13 0x0000003c7a62f881 in blk_copy_prev (block=0x4011990) at eval.c:8594
#14 0x0000003c7a63316a in proc_clone (self=139880507237440) at eval.c:8636
#15 0x0000003c7a6448ef in rb_mod_define_method (argc=<value optimized out>,
    argv=<value optimized out>, mod=139880561592160) at eval.c:10075
#16 0x0000003c7a63c6e5 in rb_call0 (klass=139880602065440,
    recv=139880561592160, id=4393, oid=<value optimized out>, argc=1,
    argv=0x7fff3e5727d0, body=0x7f387d988c00, flags=2) at eval.c:5968
#17 0x0000003c7a63cb07 in rb_call (klass=139880602065440, recv=139880561592160,
    mid=4393, argc=1, argv=0x7fff3e5727d0, scope=1, self=<value optimized out>)
    at eval.c:6251
...

Comment 18 Hugh Brock 2012-02-22 13:46:40 UTC
As it turns out there were issues in our Ruby code that were tickling the flaw in the interpreter that produced the segfault. The segfault should still be fixed, but it's no longer a Cloudforms-1.0.0 issue -- so pushing to cloudforms-1.1.0

Comment 19 Jim Meyering 2012-03-05 10:02:28 UTC
For the record, I found that MBARI 1..3 patches from
https://sites.google.com/site/brentsrubypatches/
when applied to our 1.8.7 rpm solved this problem.

Here's the SRPM I tested:
http://people.redhat.com/meyering/tmp/ruby-1.8.7.352-11.fc16.src.rpm

Comment 20 wes hayutin 2012-05-07 14:50:36 UTC
aeolus moving to ruby 1.9... this should be retested in a version of ruby higher than 1.8.7

Comment 21 Mike Orazi 2012-08-15 15:38:40 UTC
Can we get verification of whether or not this is still an issue?  We don't seem to be hitting it any longer as far as I can tell.

Comment 22 dgao 2012-09-25 13:28:00 UTC
QE haven't ran into this issue for a while, so it's pretty safe to say this issue is resolved by:


[root@ibm-x3690x5-01 ~]# rpm -qa | grep "conductor"
aeolus-conductor-doc-0.13.14-1.el6cf.noarch
aeolus-conductor-0.13.14-1.el6cf.noarch
aeolus-conductor-daemons-0.13.14-1.el6cf.noarch