| Summary: | aeolus-conductor thin crash | ||
|---|---|---|---|
| Product: | [Retired] CloudForms Cloud Engine | Reporter: | wes hayutin <whayutin> |
| Component: | aeolus-conductor | Assignee: | Jim Meyering <meyering> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Rehana <aeolus-qa-list> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 1.0.0 | CC: | akarol, calfonso, deltacloud-maint, dgao, hbrock, lutter, morazi, ssachdev, vondruch |
| Target Milestone: | rc | Keywords: | Reopened, Triaged |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2012-12-13 19:49:38 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | |||
|
Description
wes hayutin
2012-02-01 12:43:44 UTC
*** This bug has been marked as a duplicate of bug 782573 *** This is not a duplicate of bug 782573, which is now fixed, and which affected deltacloud. This one affects conductor/thin. I too confused them and added the following comment+backtrace to that other bug, whereas it is relevant only to this one: https://bugzilla.redhat.com/show_bug.cgi?id=782573#c20 Here's the top of that backtrace: Program received signal SIGSEGV, Segmentation fault. gc_mark (ptr=125780103885412, lev=1) at gc.c:804 804 if (obj->as.basic.flags == 0) return; /* free cell */ (gdb) source /root/.meyering/.gdbinit-ruby Really redefine built-in command "eval"? (y or n) [answered Y; input not from terminal] **** Loaded Ruby Macros Successfully **** (gdb) bt #0 gc_mark (ptr=125780103885412, lev=1) at gc.c:804 #1 0x00000036d82511a2 in gc_mark_children (ptr=<value optimized out>, lev=1) at gc.c:1063 #2 0x00000036d822ee32 in blk_mark (data=0x4a1b270) at eval.c:8526 #3 0x00000036d82511a2 in gc_mark_children (ptr=<value optimized out>, lev=1) at gc.c:1063 #4 0x00000036d8250d4f in mark_locations_array (x=<value optimized out>, n=76152) at gc.c:690 #5 0x00000036d82515f7 in garbage_collect () at gc.c:1493 #6 0x00000036d825226f in ruby_xmalloc (size=<value optimized out>) at gc.c:153 #7 0x00000036d822e742 in frame_dup (frame=0x574bfe0) at eval.c:8575 #8 0x00000036d822f881 in blk_copy_prev (block=0x54d4cb0) at eval.c:8594 #9 0x00000036d8231878 in proc_alloc (klass=139637772402520, proc=0) at eval.c:8820 #10 0x00000036d823857b in rb_eval (self=139637682160400, n=0x7efff1dff410) at eval.c:3793 *** Bug 788392 has been marked as a duplicate of this bug. *** This doesn't meet the beta blocker or urgent criteria -- it does not prevent customers or testers from using the product. Moving to RC blocker/high severity. I'm running thin via rdebug like this: RAILS_ENV=production rdebug -- /usr/bin/thin start -c /usr/share/aeolus-conductor -l /var/log/aeolus-conductor/thin.log -P /var/run/aeolus-conductor/thin.pid -a 127.0.0.1 -e production --user aeolus -g aeolus --prefix=/conductor --rackup /usr/share/aeolus-conductor/config.ru [then hit "c" to continue it] after manually starting these services: service mongod start service iwhd start service postgresql start service httpd start service qpidd start service deltacloud-core start service libvirtd start Then, I manually started the remaining services: service conductor-dbomatic start service imagefactory start service ntpd start However, now, when I visit the UI, currently here (admin/password) https://qeblade31.rhq.lab.eng.bos.redhat.com/conductor/pools the UI renders as if javascript were disabled, which means I can't use it to delete an instance. I'll attach a .pdf of what it looks like. [back story, I need ruby debug access, since gdb-debugging is not feasible due to consistent stack corruption. The usual methods of printing stack trace info do not work. At least with the preceding ruby-segfault bug, I could usually get a rb_bt (i.e., readable) backtrace. ] Created attachment 560682 [details]
unusable UI, when thin is invoked via rdebug
Thanks to help from Chris Alfonso, I discovered that I must also set the RAILS_RELATIVE_URL_ROOT envvar. With this, it now works: RAILS_RELATIVE_URL_ROOT=/conductor RAILS_ENV=production rdebug -- /usr/bin/thin start -c /usr/share/aeolus-conductor -l /var/log/aeolus-conductor/thin.log -P /var/run/aeolus-conductor/thin.pid -a 127.0.0.1 -e production --user aeolus -g aeolus --prefix=/conductor --rackup /usr/share/aeolus-conductor/config.ru FYI, running under rdebug, I stopped at
/usr/share/aeolus-conductor/app/controllers/deployments_controller.rb:248
turned on linetracing ("tr on"), but it went through the deletion process
without a segfault. However, the deletion did fail:
"Errors: The application a could not be deleted"
So I tried again, this time without enabling linetracing mode
(which, by definition changes how ruby works, and probably
influences GC and/or thread timing enough not to trigger the failure).
On the other hand, since I'll have no idea where in ruby land it dies
if I trigger the failure, I've also attached gdb to the rdebug process
that's driving thin.
This time I did hit the expected GC bug (though here it's an abort not segv).
(rdb:1) n
/usr/share/aeolus-conductor/app/controllers/deployments_controller.rb:249
failed = []
(rdb:1)
/usr/share/aeolus-conductor/app/controllers/deployments_controller.rb:250
Deployment.find(params[:deployments_selected] || []).each do |deployment|
(rdb:1)
/usr/lib/ruby/gems/1.8/gems/rest-client-1.6.1/lib/restclient/net_http_ext.rb:12: [BUG] rb_gc_mark(): unknown data type 0x13(0x4e24fd0) non object
ruby 1.8.7 (2011-06-30 patchlevel 352) [x86_64-linux]
Full backtrace attached.
Created attachment 560872 [details]
backtrace ("bt full" from gdb)
Created attachment 560909 [details]
a *useful* backtrace (pre-corruption)
I found that approximately the 7th garbage_collect call after
delete-button-press is the one that fails.
So here's a stack trace from there.
As shown, completing that function via gdb's "finish"
command evoked the usual segfault:
This implicates nokogiri and its XML-parsing code:
#0 garbage_collect () at gc.c:1441
#1 0x00000036d825226f in ruby_xmalloc (size=<value optimized out>) at gc.c:153
#2 0x00000036e165bd7c in xmlHashCreate () from /usr/lib64/libxml2.so.2
#3 0x00000036e1686de6 in xmlXPathNewContext () from /usr/lib64/libxml2.so.2
#4 0x00007fa4f6a95893 in ?? ()
from /usr/lib64/ruby/site_ruby/1.8/x86_64-linux/nokogiri/nokogiri.so
#5 0x00000036d823c6e5 in rb_call0 (klass=140346496273280,
recv=140346496273320, id=3361, oid=<value optimized out>, argc=1,
argv=0x7fff68a83230, body=0x7fa4f70e4b30, flags=0) at eval.c:5968
...
"bt full" output attached
Created attachment 560928 [details]
ruby-backtrace, (but not immediately- pre-segv)
for the record, this was not the GC call that failed, so it merely gives an idea of what such a stack trace looks like. I have to disable the gc breakpoint in order to run rb_bt (which produces this style of backtrace), and forgot to reenable it afterwards.
See the partial backtrace, below.
Note the contents of buf:
/usr/lib/ruby/gems/1.8/gems/nokogiri-1.5.0/lib/nokogiri/xml/node_set.rb:237
The code there is not interesting:
# Iterate over each node, yielding to +block+
def each(&block)
0.upto(length - 1) do |x|
yield self[x]
end
end
But the fact that once again it's in nokogiri/xml land is.
(gdb) bt full
#0 0x00000036d8632885 in raise () from /lib64/libc.so.6
No symbol table info available.
#1 0x00000036d8634065 in abort () from /lib64/libc.so.6
No symbol table info available.
#2 0x00000036d822db63 in rb_bug (
fmt=0x36d82b1948 "rb_gc_mark(): unknown data type 0x%lx(0x%lx) %s")
at error.c:213
buf = "/usr/lib/ruby/gems/1.8/gems/nokogiri-1.5.0/lib/nokogiri/xml/node_set.rb:237: ", '\000' <repeats 11 times>"\203, \016%\330\066\000\000\000T\033+\330\066\000\000\000pt\223\002\000\000\000\000\320o\223\002\000\000\000\000\000\023%\330\066", '\000' <repeats 19 times>, "H\000\000\000\000\000\000\000\016\023%\330\066\000\000\000T\033+\330\066\000\000\000]\224)\330\066", '\000' <repeats 11 times>"\321, \r%\330\v\000\000\000T\033+\330\066\000\000\000\240o\223\002\000\000\000\000\360m\223\002\000\000\000\000\000\023%\330\066\000\000\000\003", '\000' <repeats 15 times>...
args = {{gp_offset = 32, fp_offset = 48,
overflow_arg_area = 0x7fffa255ff20, reg_save_area = 0x7fffa255fe40}}
out = 0x36d899b860
len = 77
#3 0x00000036d8250e16 in gc_mark_children (ptr=79817872, lev=2) at gc.c:1080
obj = 0x4c1ec90
#4 0x00000036d82511a2 in gc_mark_children (ptr=<value optimized out>, lev=1)
at gc.c:1063
n = <value optimized out>
vars = 0x4352ee0
obj = <value optimized out>
#5 0x00000036d822ee32 in blk_mark (data=0x46dd940) at eval.c:8526
No locals.
#6 0x00000036d82511a2 in gc_mark_children (ptr=<value optimized out>, lev=1)
at gc.c:1063
n = <value optimized out>
vars = 0x3ee0b08
obj = <value optimized out>
#7 0x00000036d8250d4f in mark_locations_array (x=<value optimized out>,
n=76152) at gc.c:690
v = <value optimized out>
#8 0x00000036d82515f7 in garbage_collect () at gc.c:1493
list = <value optimized out>
frame = 0x0
save_regs_gc_mark = {{__jmpbuf = {80, 7611797768948620192, 0, 0, 0, 0,
7611797768998951840, -7624456336105333856}, __mask_was_saved = 0,
__saved_mask = {__val = {235557375328, 140342351364096,
140735916934864, 0, 128, 0, 1, 0, 128, 0, 0, 0, 210453397510,
532575944795, 472446402679, 532575944795}}}}
stack_end = 0x7fffa2560010
Created attachment 560935 [details]
ruby-backtrace, (this time just prior to segv)
Created attachment 560945 [details]
avoid one Nokogiri/xml use, patch by Francesco Vollero
Created attachment 560946 [details]
ruby backtrace, using fvollero's patch
Just before yet another segfault.
Note that now it is no longer happening in Nokogiri::XML.
Could simply be difference in GC, or it could be that we're
avoiding a bug in the eliminated code.
As a test, I shut thin down and ran aeolus-conductor using WEBrick (rails server). We still see the segv issue. This is clearly not occurring due to thin. For the record, here's a backtrace from Chris' system:
#0 0x0000003c79a32885 in raise () from /lib64/libc.so.6
#1 0x0000003c79a34065 in abort () from /lib64/libc.so.6
#2 0x0000003c7a62db63 in rb_bug (fmt=0x3c7a6c634c "Segmentation fault")
at error.c:213
#3 0x0000003c7a69602c in sigsegv (sig=11) at signal.c:634
#4 <signal handler called>
#5 gc_mark (ptr=10575260, lev=1) at gc.c:804
#6 0x0000003c7a6511a2 in gc_mark_children (ptr=<value optimized out>, lev=1)
at gc.c:1063
#7 0x0000003c7a62ee32 in blk_mark (data=0x34de650) at eval.c:8526
#8 0x0000003c7a6511a2 in gc_mark_children (ptr=<value optimized out>, lev=1)
at gc.c:1063
#9 0x0000003c7a650d4f in mark_locations_array (x=<value optimized out>,
n=79228) at gc.c:690
#10 0x0000003c7a6515f7 in garbage_collect () at gc.c:1493
#11 0x0000003c7a65226f in ruby_xmalloc (size=<value optimized out>) at gc.c:153
#12 0x0000003c7a62e742 in frame_dup (frame=0x4015cd0) at eval.c:8575
#13 0x0000003c7a62f881 in blk_copy_prev (block=0x4011990) at eval.c:8594
#14 0x0000003c7a63316a in proc_clone (self=139880507237440) at eval.c:8636
#15 0x0000003c7a6448ef in rb_mod_define_method (argc=<value optimized out>,
argv=<value optimized out>, mod=139880561592160) at eval.c:10075
#16 0x0000003c7a63c6e5 in rb_call0 (klass=139880602065440,
recv=139880561592160, id=4393, oid=<value optimized out>, argc=1,
argv=0x7fff3e5727d0, body=0x7f387d988c00, flags=2) at eval.c:5968
#17 0x0000003c7a63cb07 in rb_call (klass=139880602065440, recv=139880561592160,
mid=4393, argc=1, argv=0x7fff3e5727d0, scope=1, self=<value optimized out>)
at eval.c:6251
...
As it turns out there were issues in our Ruby code that were tickling the flaw in the interpreter that produced the segfault. The segfault should still be fixed, but it's no longer a Cloudforms-1.0.0 issue -- so pushing to cloudforms-1.1.0 For the record, I found that MBARI 1..3 patches from https://sites.google.com/site/brentsrubypatches/ when applied to our 1.8.7 rpm solved this problem. Here's the SRPM I tested: http://people.redhat.com/meyering/tmp/ruby-1.8.7.352-11.fc16.src.rpm aeolus moving to ruby 1.9... this should be retested in a version of ruby higher than 1.8.7 Can we get verification of whether or not this is still an issue? We don't seem to be hitting it any longer as far as I can tell. QE haven't ran into this issue for a while, so it's pretty safe to say this issue is resolved by: [root@ibm-x3690x5-01 ~]# rpm -qa | grep "conductor" aeolus-conductor-doc-0.13.14-1.el6cf.noarch aeolus-conductor-0.13.14-1.el6cf.noarch aeolus-conductor-daemons-0.13.14-1.el6cf.noarch |