| Summary: | deltacloud-core segfaults | | |
|---|---|---|---|
| Product: | [Retired] CloudForms Cloud Engine | Reporter: | wes hayutin <whayutin> |
| Component: | deltacloud-core | Assignee: | Jim Meyering <meyering> |
| Status: | CLOSED ERRATA | QA Contact: | wes hayutin <whayutin> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 1.0.0 | CC: | cpelland, hbrock, jeckersb, jguiditt, jlaska, lutter, morazi, rananda, rjones, rlandy, whayutin |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 789120 (view as bug list) | Environment: | |
| Last Closed: | 2012-05-15 20:32:39 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | 786229 | | |
| Bug Blocks: | 772708, 783195, 789120 | | |
| Attachments: | | | |
Description
wes hayutin
2012-01-17 18:25:51 UTC
So, from what I can tell, restarting deltacloud with just the init script works, but restarting with aeolus-restart-services fails. Hrm... it may not be aeolus-restart-services after all, which didn't make sense to me anyway. Looking into it.

Adding some color to the bug: prior to detecting that deltacloud segfaulted, I was removing and adding datastores/clusters in the RHEVM realm.

Created attachment 555838 [details]
rails log
thin server (localhost:3002) [deltacloud-mock][28856]: Accept: application/xml
thin server (localhost:3002) [deltacloud-mock][28856]:
Processing /api (for 127.0.0.1 at Tue Jan 17 13:53:34 -0500 2012) [GET] [Mock]
thin server (localhost:3002) [deltacloud-mock][28856]: Parameters: {}
thin server (localhost:3002) [deltacloud-mock][28856]: Authentication: Basic
thin server (localhost:3002) [deltacloud-mock][28856]: Server: thin 1.2.11 codename Bat-Shit Crazy
thin server (localhost:3002) [deltacloud-mock][28856]: Accept: application/xml
thin server (localhost:3002) [deltacloud-mock][28856]:
Processing /api (for 127.0.0.1 at Tue Jan 17 13:53:34 -0500 2012) [GET] [Mock]
thin server (localhost:3002) [deltacloud-mock][28856]: Parameters: {}
thin server (localhost:3002) [deltacloud-mock][28856]: Authentication: Basic
thin server (localhost:3002) [deltacloud-mock][28856]: Server: thin 1.2.11 codename Bat-Shit Crazy
thin server (localhost:3002) [deltacloud-mock][28856]: Accept: application/xml
thin server (localhost:3002) [deltacloud-mock][28856]:
Completed in 0.000000 | 0.000000 | 200 | application/xml | http://localhost:3002/api
thin server (localhost:3002) [deltacloud-mock][28856]:
Completed in 0.000000 | 0.000000 | 200 | application/xml | http://localhost:3002/api
thin server (localhost:3002) [deltacloud-mock][28856]:
Completed in 0.000000 | 0.000000 | 200 | application/xml | http://localhost:3002/api
thin server (localhost:3002) [deltacloud-mock][28856]:
Processing /api (for 127.0.0.1 at Tue Jan 17 13:53:34 -0500 2012) [GET] [EC2]
thin server (localhost:3002) [deltacloud-mock][28856]: Parameters: {}
thin server (localhost:3002) [deltacloud-mock][28856]: Authentication: Basic
thin server (localhost:3002) [deltacloud-mock][28856]: Server: thin 1.2.11 codename Bat-Shit Crazy
thin server (localhost:3002) [deltacloud-mock][28856]: Accept: application/xml
thin server (localhost:3002) [deltacloud-mock][28856]:
Processing /api (for 127.0.0.1 at Tue Jan 17 13:53:34 -0500 2012) [GET] [RHEVM]
thin server (localhost:3002) [deltacloud-mock][28856]: Parameters: {}
thin server (localhost:3002) [deltacloud-mock][28856]: Authentication: Basic
thin server (localhost:3002) [deltacloud-mock][28856]: Server: thin 1.2.11 codename Bat-Shit Crazy
thin server (localhost:3002) [deltacloud-mock][28856]: Accept: application/xml
thin server (localhost:3002) [deltacloud-mock][28856]:
Processing /api (for 127.0.0.1 at Tue Jan 17 13:53:34 -0500 2012) [GET] [VSphere]
thin server (localhost:3002) [deltacloud-mock][28856]: Parameters: {}
thin server (localhost:3002) [deltacloud-mock][28856]: Authentication: Basic
thin server (localhost:3002) [deltacloud-mock][28856]: Server: thin 1.2.11 codename Bat-Shit Crazy
thin server (localhost:3002) [deltacloud-mock][28856]: Accept: application/xml
/usr/lib/ruby/gems/1.8/gems/aws-2.4.5/lib/awsbase/errors.rb:32: [BUG] Segmentation fault
ruby 1.8.7 (2011-06-30 patchlevel 352) [x86_64-linux]
Just using the deltacloud init script shows the following status for the service:

>> Start service
[root@localhost /]# /etc/init.d/deltacloud-core start
Starting deltacloud-core (via systemctl):  [  OK  ]
[root@localhost /]# /etc/init.d/deltacloud-core status
deltacloud-core.service - LSB: Deltacloud Core API deamon
          Loaded: loaded (/etc/rc.d/init.d/deltacloud-core)
          Active: active (running) since Tue, 17 Jan 2012 08:03:43 -0500; 5h 46min ago
         Process: 1177 ExecStart=/etc/rc.d/init.d/deltacloud-core start (code=exited, status=0/SUCCESS)
        Main PID: 1379 (deltacloudd)
          CGroup: name=systemd:/system/deltacloud-core.service
                  └ 1379 thin server (localhost:3002) [deltacloud-ec2] ...

>> Stop service
[root@localhost /]# /etc/init.d/deltacloud-core stop
Stopping deltacloud-core (via systemctl):  [  OK  ]
[root@localhost /]# /etc/init.d/deltacloud-core status
deltacloud-core.service - LSB: Deltacloud Core API deamon
          Loaded: loaded (/etc/rc.d/init.d/deltacloud-core)
          Active: failed since Tue, 17 Jan 2012 13:50:23 -0500; 3s ago
         Process: 6232 ExecStop=/etc/rc.d/init.d/deltacloud-core stop (code=exited, status=0/SUCCESS)
         Process: 1177 ExecStart=/etc/rc.d/init.d/deltacloud-core start (code=exited, status=0/SUCCESS)
        Main PID: 1379 (code=killed, signal=KILL)
          CGroup: name=systemd:/system/deltacloud-core.service

>> Restart service
[root@localhost /]# /etc/init.d/deltacloud-core start
Starting deltacloud-core (via systemctl):  [  OK  ]
[root@localhost /]# /etc/init.d/deltacloud-core status
deltacloud-core.service - LSB: Deltacloud Core API deamon
          Loaded: loaded (/etc/rc.d/init.d/deltacloud-core)
          Active: active (exited) since Tue, 17 Jan 2012 13:51:06 -0500; 7s ago
         Process: 6232 ExecStop=/etc/rc.d/init.d/deltacloud-core stop (code=exited, status=0/SUCCESS)
         Process: 6261 ExecStart=/etc/rc.d/init.d/deltacloud-core start (code=exited, status=0/SUCCESS)
        Main PID: 6265 (code=exited, status=0/SUCCESS)
          CGroup: name=systemd:/system/deltacloud-core.service
                  └ 6266 thin server (localhost:3002) [deltacloud-ec2] ...

The init script tested has 'kill -9' in the stop code:

stop() {
    echo -n $"Shutting down $prog: "
    ...
    kill -9 `cat $PIDFILE`
    retval=$?
    if [ $retval -eq 0 ] && rm -f $LOCKFILE $PIDFILE; then
        ...
    fi
}

Playing with gdb with fvollero reveals these details:

#0  0x0000003b01232885 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003b01234065 in abort () at abort.c:92
#2  0x0000003b00e2dac3 in rb_bug (fmt=0x3b00ec608c "Segmentation fault") at error.c:213
#3  0x0000003b00e95ddc in sigsegv (sig=11) at signal.c:634
#4  <signal handler called>
#5  0x0000003b00e991c4 in st_foreach (table=0x3b00e38bdc, func=0x3b00e51260 <mark_entry>, arg=0) at st.c:486
#6  0x0000003b00e50caf in mark_locations_array (x=<value optimized out>, n=55194) at gc.c:690
#7  0x0000003b00e2ec39 in thread_mark (th=0x2929cb0) at eval.c:10541
#8  0x0000003b00e50fb0 in gc_mark_children (ptr=<value optimized out>, lev=10) at gc.c:1012
#9  0x0000003b00e5126e in mark_entry (key=<value optimized out>, value=<value optimized out>, lev=<value optimized out>) at gc.c:712
#10 0x0000003b00e991dd in st_foreach (table=0x244e3b0, func=0x3b00e51260 <mark_entry>, arg=9) at st.c:487
#11 0x0000003b00e50e88 in mark_tbl (ptr=140710230565000, lev=9) at gc.c:722
#12 gc_mark_children (ptr=140710230565000, lev=9) at gc.c:998
#13 0x0000003b00e5126e in mark_entry (key=<value optimized out>, value=<value optimized out>, lev=<value optimized out>) at gc.c:712
#14 0x0000003b00e991dd in st_foreach (table=0x217f710, func=0x3b00e51260 <mark_entry>, arg=8) at st.c:487
#15 0x0000003b00e50e88 in mark_tbl (ptr=140710237645400, lev=8) at gc.c:722
#16 gc_mark_children (ptr=140710237645400, lev=8) at gc.c:998
#17 0x0000003b00e50efe in gc_mark_children (ptr=140710237499080, lev=7) at gc.c:882
#18 0x0000003b00e5126e in mark_entry (key=<value optimized out>, value=<value optimized out>, lev=<value optimized out>) at gc.c:712
#19 0x0000003b00e991dd in st_foreach (table=0x217fa30, func=0x3b00e51260 <mark_entry>, arg=6) at st.c:487
#20 0x0000003b00e50e70 in mark_tbl (ptr=140710237645320, lev=6) at gc.c:722
#21 gc_mark_children (ptr=140710237645320, lev=6) at gc.c:997
#22 0x0000003b00e5126e in mark_entry (key=<value optimized out>, value=<value optimized out>, lev=<value optimized out>) at gc.c:712
#23 0x0000003b00e991dd in st_foreach (table=0x217ffa0, func=0x3b00e51260 <mark_entry>, arg=5) at st.c:487
#24 0x0000003b00e50e88 in mark_tbl (ptr=140710237645200, lev=5) at gc.c:722
#25 gc_mark_children (ptr=140710237645200, lev=5) at gc.c:998
#26 0x0000003b00e50d31 in gc_mark_children (ptr=140710237624280, lev=4) at gc.c:992
#27 0x0000003b00e50d31 in gc_mark_children (ptr=140710237624320, lev=3) at gc.c:992
#28 0x0000003b00e50d31 in gc_mark_children (ptr=140710204844320, lev=2) at gc.c:992
#29 0x0000003b00e50efe in gc_mark_children (ptr=140710204844360, lev=1) at gc.c:882
#30 0x0000003b00e514a8 in garbage_collect () at gc.c:1473
#31 0x0000003b00e521cf in ruby_xmalloc (size=<value optimized out>) at gc.c:153
#32 0x0000003b00e9ac80 in rb_str_buf_new (capa=128) at string.c:234
#33 0x0000003b00e581cc in appendline (fptr=0x3a91cd0, delim=10, strp=0x7fff89b26ab8) at io.c:1567
#34 0x0000003b00e5833f in rb_io_getline_fast (fptr=0x3a91cd0, delim=10 '\n') at io.c:1690
#35 0x0000003b00e7a597 in lex_getline () at parse.y:2743
#36 nextc () at parse.y:2795
#37 ruby_yylex () at parse.y:3469
#38 ruby_yyparse () at parse.c:4376
#39 0x0000003b00e7f11b in yycompile (f=0x3a8dd30 "/usr/lib/ruby/gems/1.8/gems/activesupport-3.0.10/lib/active_support/core_ext/module/synchronization.rb", line=<value optimized out>) at parse.y:2692
#40 0x0000003b00e95791 in load_file (fname=0x3a8dd30 "/usr/lib/ruby/gems/1.8/gems/activesupport-3.0.10/lib/active_support/core_ext/module/synchronization.rb", script=0) at ruby.c:982
#41 0x0000003b00e4a1ef in rb_load (fname=140710230753280, wrap=<value optimized out>) at eval.c:7135
#42 0x0000003b00e4a90a in rb_require_safe (fname=140710204844320, safe=0) at eval.c:7519

To us this looks like a bug in ruby's garbage collector. No clue about how to fix it :-( Any clues?

Michal found this blog post; it appears there is a problem with MRI ruby itself in how it interacts with gcc, or at least in the API it provides for C extensions. This appears to be very bad news:
http://timetobleed.com/the-broken-promises-of-mrireeyarv/

Might be related: https://bugzilla.redhat.com/show_bug.cgi?id=772708

I've got that narrowed down to just the ruby interpreter - on the one machine where this is reproducible, it happens when running the Deltacloud server under WEBrick and using json/pure. Investigation still ongoing.

*** Bug 785166 has been marked as a duplicate of this bug. ***

*** Bug 786432 has been marked as a duplicate of this bug. ***

*** Bug 786499 has been marked as a duplicate of this bug. ***

Additional details from QE are in the three bugs closed above. Currently one can probably recreate the segfault by starting and stopping applications in the conductor UI.

Moving back to assigned -- Wes found another reproducer.

My error - Jim informs me that this one is in fact fixed, and that there is a different issue, https://bugzilla.redhat.com/show_bug.cgi?id=786432, that we should be tracking. Moving this back to on_qa.

Created attachment 560050 [details]
thin C backtrace
Wes H. set this up.
Provoked by running aeolus-restart-services, and then "delete an ec2 instance from the web-UI".
TL;DR: thin's app-deletion code has been fixed so that it no longer triggers the still-present bug in ruby.

----------------------

We've been plagued by ruby-1.8.7 (RHEL-6.2) bugs. First it was one that made deltacloud segfault consistently. That was fixed by this (year-old!) marshalling patch: http://bugs.ruby-lang.org/issues/4339 [resolved in CE by http://bugzilla.redhat.com/782573]

Then there was a harder-to-reproduce one. One way to reproduce it is to run a conductor test repeatedly:

  while :; do cucumber features/provider.feature; sleep 2; done

Another way is to set up all services, provision and create an EC2 application, and delete that app via the UI. Just like the preceding segfaults, it happens hundreds of levels down the stack (400-600) in garbage collection, usually during the marking phase. Turning on GC-always makes the tool so slow that it is unusable.

At first, we suspected that some gem with C-compiled bits was making the same mistake fixed by the marshalling patch. Last I heard, David Lutterkort had eliminated most such gems as the possible cause. I also audited ruby itself, looking for abuse similar to what was fixed by the marshalling patch. prelude.c had one instance, but it's very special and, in fact, ok.

Michal Fojtik pointed me to this site, which seemed very promising: https://sites.google.com/site/brentsrubypatches/ I've applied 4 patches (1, 1a, 2, 3) from his git repo and made a new rhel6 rpm. With it, there is no more segfault. [FTR, applying just 1 and 1a did not help. I don't remember whether I also tried with just 1..2, so it's quite possible that patch 3 is not needed.]

With that segfault out of the way, we found that we could now exercise the app-deletion code, which had a problem: the deletion request that was segfaulting before now completes normally, but takes 2m44s, during which thin spawns over 5000 threads. Fixing the thread storm (done by http://bugzilla.redhat.com/791195) makes it so this ruby problem is no longer triggered.

Here's the description from the brentsrubypatches link above for the patch that seems most likely to be fixing our problem:

MBARI3.patch: Ruby's conservative garbage collector cannot tell whether machine words on the 'C' stack are object pointers or integers, etc., because there is no type information associated with them. A conservative collector works by "conserving" every object to which there could possibly be a reference. In the 1.8 and 1.6 series Ruby implementations, this means scanning the stack of each Thread and Continuation, treating every word as an object pointer if its value could be so interpreted. In practice, this is not as bad as it may seem, as Ruby's collector does not consider pointers "inside" an object to be valid -- only those that point to its exact base address. So, even assuming thousands of live objects, a 32-bit address space will remain very sparsely populated with valid object pointers.

The garbage collector's leaking memory is not really its own fault. The trouble is that the 'C' machine stack is filled with object references. The main reason for this is that gcc compilers create overly large stack frames and do not initialize many values in them. Certain 'C' constructs used in the Ruby interpreter's core recursive expression evaluator generate especially large, sparse stack frames. The function rb_eval() is the worst offender, creating kilobyte-sized stack frames for each invocation of a function that may call itself hundreds of times.
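To make this concrete, here is a rough sketch in C. It is purely illustrative -- not the actual MRI or MBARI3 code, and every identifier in it is invented. Part (a) shows how a conservative collector scans the C stack and why stale "ghost" words keep dead objects alive; part (b) shows the kind of stack-clearing mitigation that the rest of the patch description below explains.

/*
 * Illustrative sketch only -- NOT the actual MRI or MBARI3 code.
 * All names and helpers are made up for explanation.
 */
#include <stddef.h>
#include <string.h>

typedef unsigned long VALUE;   /* one machine word, as in MRI */

/* Placeholder: the real interpreter checks heap-slot bounds and
 * alignment; this stub just says "no" so the sketch is self-contained. */
static int looks_like_object_base(VALUE word) { (void)word; return 0; }

/* Placeholder for the GC mark routine. */
static void mark_object(VALUE obj) { (void)obj; }

/* (a) Conservative scan: every word between the current stack pointer
 * and the stack's origin is treated as a possible object reference.
 * (Assumes a downward-growing stack, as on x86_64.)  A stale word left
 * by an earlier, deeper call that happens to equal an object's base
 * address is indistinguishable from a real root.                     */
static void mark_c_stack(VALUE *stack_origin, VALUE *current_sp)
{
    for (VALUE *p = current_sp; p < stack_origin; p++)
        if (looks_like_object_base(*p))
            mark_object(*p);      /* may be a real root -- or a ghost */
}

/* (b) MBARI3-style stack clearing: at critical points, if the stack is
 * shallower than at the previous check, overwrite the abandoned region
 * with zeros so (a) can no longer mistake stale words for references. */
static VALUE *last_sp;            /* stack pointer at the previous check */

static void wipe_dead_stack(VALUE *current_sp)
{
    if (last_sp && last_sp < current_sp)            /* stack has shrunk */
        memset(last_sp, 0, (char *)current_sp - (char *)last_sp);
    last_sp = current_sp;
}

In this sketch the cost of (b) is a memset proportional to how far the stack shrank since the last check, which is consistent with the note below that patched interpreters may run slightly slower in exchange for much smaller long-term process size.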
This results in stacks that are hundreds of kilobytes, often full of old, dead object references that may never go away. If there were a gcc compiler option to initialize all local variables to zero whenever a new stack frame is built, that would let the collector do its work properly, but no such option exists.

The MBARI3 patch tracks the maximum depth of the 'C' stack at critical points in the interpreter's execution so that, whenever the stack shrinks, its old contents can be replaced with zeros. One might think this would have no effect, because each new stack frame would simply overwrite all the zeros. But, as noted in the previous paragraph, gcc does not optimize that way; rather, it optimizes for time by writing only those words in each frame actually needed for a specific execution path through the function. If those values are left uninitialized, the garbage collector interprets some of them as valid object pointers that must be preserved, even though these "ghost values" were written by previous invocations of functions that are no longer active. This patch ensures that almost all those hitherto uninitialized values are zero when the collector runs.

In practice, for MBARI's ESP application, I observed that the process size increased slowly without bound over many hours of operation. After patching, the process size stayed below 10 MB for days; before, I had seen it increase to more than 24 MB. (Our ARM targets have only 32 MB of DRAM!) Your mileage may vary, but I expect every long-running application will see a significant reduction in process size, although some may run slightly slower due to the overhead of clearing the stack.

In case anyone wants to play, RPMs (src and x86_64) of the patched ruby are here: http://people.redhat.com/meyering/tmp/

Unable to recreate in ^C

[root@qeblade30 ~]# rpm -qa | grep aeolus
rubygem-aeolus-image-0.3.0-10.el6.noarch
aeolus-conductor-doc-0.8.0-36.el6.noarch
rubygem-aeolus-cli-0.3.0-10.el6.noarch
aeolus-all-0.8.0-36.el6.noarch
aeolus-conductor-0.8.0-36.el6.noarch
aeolus-configure-2.5.0-15.el6.noarch
aeolus-conductor-daemons-0.8.0-36.el6.noarch
[root@qeblade30 ~]#

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2012-0587.html

This is another good reason not to use Ruby 1.8.7: https://bugzilla.redhat.com/show_bug.cgi?id=843188#c6