Bug 782573 (dc-ruby-segfault)

Summary: deltacloud-core segfaults
Product: [Retired] CloudForms Cloud Engine Reporter: wes hayutin <whayutin>
Component: deltacloud-coreAssignee: Jim Meyering <meyering>
Status: CLOSED ERRATA QA Contact: wes hayutin <whayutin>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 1.0.0CC: cpelland, hbrock, jeckersb, jguiditt, jlaska, lutter, morazi, rananda, rjones, rlandy, whayutin
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 789120 (view as bug list) Environment:
Last Closed: 2012-05-15 20:32:39 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On: 786229    
Bug Blocks: 772708, 783195, 789120    
Attachments:
Description Flags
rails log
none
thin C backtrace none

Description wes hayutin 2012-01-17 18:25:51 UTC
Description of problem:

recreate:


[root@qeblade31 ~]# /etc/init.d/deltacloud-core start


Starting deltacloud-core-mock: >> Deleting stale PID file /var/run/deltacloud/deltacloud-core-mock.pid
[  OK  ]
[root@qeblade31 ~]# ll /var/run/deltacloud/deltacloud-core-mock.pid
-rw-r--r--. 1 root root 5 Jan 17 13:21 /var/run/deltacloud/deltacloud-core-mock.pid
[root@qeblade31 ~]# ps -ef |grep deltacloud
nobody   27041     1  2 13:21 ?        00:00:00 thin server (localhost:3002) [deltacloud-mock]                                                                                                                 
root     27046 24791  0 13:22 pts/0    00:00:00 grep deltacloud


[root@qeblade31 ~]# tail /var/log/deltacloud-core/mock.log 
thin server (localhost:3002) [deltacloud-mock][26967]: 127.0.0.1 - - [17/Jan/2012 13:20:11] "GET /api/instances/ HTTP/1.1" 404 18 0.0041
/usr/lib/ruby/1.8/set.rb:201: [BUG] Segmentation fault
ruby 1.8.7 (2011-06-30 patchlevel 352) [x86_64-linux]

>> Writing PID to /var/run/deltacloud/deltacloud-core-mock.pid
>> Changing process privilege to nobody:nobody
thin server (localhost:3002) [deltacloud-mock][27041]: >> Thin web server (v1.2.11 codename Bat-Shit Crazy)
thin server (localhost:3002) [deltacloud-mock][27041]: >> Debugging ON
thin server (localhost:3002) [deltacloud-mock][27041]: >> Maximum connections set to 1024
thin server (localhost:3002) [deltacloud-mock][27041]: >> Listening on localhost:3002, CTRL+C to stop
[root@qeblade31 ~]# 
[root@qeblade31 ~]# 
[root@qeblade31 ~]# /etc/init.d/deltacloud-core restart
Shutting down deltacloud-core: [  OK  ]
Starting deltacloud-core-mock: [  OK  ]

[root@qeblade31 ~]# tail /var/log/deltacloud-core/mock.log 
thin server (localhost:3002) [deltacloud-mock][27041]: >> Debugging ON
thin server (localhost:3002) [deltacloud-mock][27041]: >> Maximum connections set to 1024
thin server (localhost:3002) [deltacloud-mock][27041]: >> Listening on localhost:3002, CTRL+C to stop
thin server (localhost:3002) [deltacloud-mock][27041]: >> Exiting!
>> Writing PID to /var/run/deltacloud/deltacloud-core-mock.pid
>> Changing process privilege to nobody:nobody
thin server (localhost:3002) [deltacloud-mock][27065]: >> Thin web server (v1.2.11 codename Bat-Shit Crazy)
thin server (localhost:3002) [deltacloud-mock][27065]: >> Debugging ON
thin server (localhost:3002) [deltacloud-mock][27065]: >> Maximum connections set to 1024
thin server (localhost:3002) [deltacloud-mock][27065]: >> Listening on localhost:3002, CTRL+C to stop

[root@qeblade31 ~]# ps -ef | grep deltacloud
nobody   27065     1  5 13:22 ?        00:00:00 thin server (localhost:3002) [deltacloud-mock]                                                                                                                 
root     27069 24791  0 13:22 pts/0    00:00:00 grep deltacloud
[root@qeblade31 ~]# 
[root@qeblade31 ~]# 
[root@qeblade31 ~]# 
[root@qeblade31 ~]# 
[root@qeblade31 ~]# ####### NOW HERE"S THE PROBLEM ###########################^C
[root@qeblade31 ~]# 
[root@qeblade31 ~]# aeolus-restart-services 

Stoping ntpd ...
 Success: Shutting down ntpd: [  OK  ]

Stoping imagefactory ...
 Success: Stopping imagefactory: [  OK  ]

Stoping conductor-dbomatic ...
[  OK  ]: Shutting down conductor-dbomatic: [  OK  ]

Stoping aeolus-conductor ...
 Success: Shutting down aeolus-conductor: [  OK  ]

Stoping libvirtd ...
 Success: Stopping libvirtd daemon: [  OK  ]

Stoping deltacloud-core ...
 Success: Shutting down deltacloud-core: [  OK  ]

Stoping qpidd ...
 Success: Stopping Qpid AMQP daemon: [  OK  ]

Stoping httpd ...
 Success: Stopping httpd: [  OK  ]

Stoping postgresql ...
 Success: Stopping postgresql service: [  OK  ]

Stoping iwhd ...
[  OK  ]: Stopping iwhd daemon:  [  OK  ]

Stoping mongod ...
 Success: Stopping mongod: [  OK  ]

Starting mongod ...
 Success: Starting mongod: [  OK  ]

Starting iwhd ...
 Success: waiting for mongod to listen on localhost:27017[  OK  ]
Starting iwhd daemon: [  OK  ]

Starting postgresql ...
 Success: Starting postgresql service: [  OK  ]

Starting httpd ...
 Success: Starting httpd: [  OK  ]

Starting qpidd ...
 Success: Starting Qpid AMQP daemon: [  OK  ]

Starting deltacloud-core ...
 Success: Starting deltacloud-core-mock: [  OK  ]

Starting libvirtd ...
 Success: Starting libvirtd daemon: [  OK  ]

Starting aeolus-conductor ...
 Success: Starting aeolus-conductor: [  OK  ]

Starting conductor-dbomatic ...
[  OK  ]: Starting conductor-dbomatic: [  OK  ]

Starting imagefactory ...
 Success: Starting imagefactory: [  OK  ]

Starting ntpd ...
 Success: Starting ntpd: [  OK  ]

Statusing mongod ...
 Success: mongod (pid 27258) is running...
[root@qeblade31 ~]# tail /var/log/deltacloud-core/mock.log 
thin server (localhost:3002) [deltacloud-mock][27065]: >> Debugging ON
thin server (localhost:3002) [deltacloud-mock][27065]: >> Maximum connections set to 1024
thin server (localhost:3002) [deltacloud-mock][27065]: >> Listening on localhost:3002, CTRL+C to stop
thin server (localhost:3002) [deltacloud-mock][27065]: >> Exiting!
>> Writing PID to /var/run/deltacloud/deltacloud-core-mock.pid
>> Changing process privilege to nobody:nobody
thin server (localhost:3002) [deltacloud-mock][27421]: >> Thin web server (v1.2.11 codename Bat-Shit Crazy)
thin server (localhost:3002) [deltacloud-mock][27421]: >> Debugging ON
thin server (localhost:3002) [deltacloud-mock][27421]: >> Maximum connections set to 1024
thin server (localhost:3002) [deltacloud-mock][27421]: >> Listening on localhost:3002, CTRL+C to stop
[root@qeblade31 ~]# 
[root@qeblade31 ~]# 
[root@qeblade31 ~]# 
[root@qeblade31 ~]# /etc/init.d/deltacloud-core status
deltacloud-core dead but pid file exists
[root@qeblade31 ~]# tail /var/log/deltacloud-core/mock.log 
thin server (localhost:3002) [deltacloud-mock][27065]: >> Exiting!
>> Writing PID to /var/run/deltacloud/deltacloud-core-mock.pid
>> Changing process privilege to nobody:nobody
thin server (localhost:3002) [deltacloud-mock][27421]: >> Thin web server (v1.2.11 codename Bat-Shit Crazy)
thin server (localhost:3002) [deltacloud-mock][27421]: >> Debugging ON
thin server (localhost:3002) [deltacloud-mock][27421]: >> Maximum connections set to 1024
thin server (localhost:3002) [deltacloud-mock][27421]: >> Listening on localhost:3002, CTRL+C to stop
/usr/lib/ruby/gems/1.8/gems/aws-2.4.5/lib/awsbase/utils.rb:100: [BUG] Segmentation fault
ruby 1.8.7 (2011-06-30 patchlevel 352) [x86_64-linux]

[root@qeblade31 ~]# 


[root@qeblade31 ~]# rpm -qa | grep aeolus
aeolus-conductor-daemons-0.8.0-7.el6.noarch
aeolus-conductor-doc-0.8.0-7.el6.noarch
rubygem-aeolus-cli-0.3.0-3.el6.noarch
rubygem-aeolus-image-0.3.0-2.el6.noarch
aeolus-all-0.8.0-7.el6.noarch
aeolus-configure-2.5.0-4.el6.noarch
aeolus-conductor-0.8.0-7.el6.noarch
[root@qeblade31 ~]# rpm -qa | grep delta
deltacloud-core-ec2-0.5.0-1.rc1.el6.noarch
deltacloud-core-vsphere-0.5.0-1.rc1.el6.noarch
deltacloud-core-0.5.0-1.rc1.el6.noarch
rubygem-deltacloud-client-0.4.0-3.el6.noarch
deltacloud-core-rhevm-0.5.0-1.rc1.el6.noarch

Comment 1 wes hayutin 2012-01-17 18:27:02 UTC
So.. from what I can tell.. restarting dc w/ just the init script works..

restart w/ aeolus-restart-services and it fails.. wth?

Comment 2 wes hayutin 2012-01-17 18:46:01 UTC
hrm.. it may not be aeolus-restart-services.. which didnt make sense to me anyway

looking into it..

Comment 3 wes hayutin 2012-01-17 18:49:41 UTC
adding some color to the bug..

prior to detecting that deltacloud segfaulted I was removing and adding datastores/clusters to the rhevm realm

Comment 4 wes hayutin 2012-01-17 18:51:33 UTC
Created attachment 555838 [details]
rails log

Comment 5 wes hayutin 2012-01-17 18:54:53 UTC
thin server (localhost:3002) [deltacloud-mock][28856]: Accept: application/xml
thin server (localhost:3002) [deltacloud-mock][28856]: 
ESC[1;29mProcessing /apiESC[0m (for 127.0.0.1 at Tue Jan 17 13:53:34 -0500 2012) [GET] [ESC[1;29mMockESC[0m]
thin server (localhost:3002) [deltacloud-mock][28856]: Parameters: {}
thin server (localhost:3002) [deltacloud-mock][28856]: Authentication: Basic
thin server (localhost:3002) [deltacloud-mock][28856]: Server: thin 1.2.11 codename Bat-Shit Crazy
thin server (localhost:3002) [deltacloud-mock][28856]: Accept: application/xml
thin server (localhost:3002) [deltacloud-mock][28856]: 
ESC[1;29mProcessing /apiESC[0m (for 127.0.0.1 at Tue Jan 17 13:53:34 -0500 2012) [GET] [ESC[1;29mMockESC[0m]
thin server (localhost:3002) [deltacloud-mock][28856]: Parameters: {}
thin server (localhost:3002) [deltacloud-mock][28856]: Authentication: Basic
thin server (localhost:3002) [deltacloud-mock][28856]: Server: thin 1.2.11 codename Bat-Shit Crazy
thin server (localhost:3002) [deltacloud-mock][28856]: Accept: application/xml
thin server (localhost:3002) [deltacloud-mock][28856]: 
Completed in ESC[1;29m0.000000ESC[0m | 0.000000 | 200 | ESC[1;36mapplication/xmlESC[0m | http://localhost:3002/api
thin server (localhost:3002) [deltacloud-mock][28856]: 
Completed in ESC[1;29m0.000000ESC[0m | 0.000000 | 200 | ESC[1;36mapplication/xmlESC[0m | http://localhost:3002/api
thin server (localhost:3002) [deltacloud-mock][28856]: 
Completed in ESC[1;29m0.000000ESC[0m | 0.000000 | 200 | ESC[1;36mapplication/xmlESC[0m | http://localhost:3002/api
thin server (localhost:3002) [deltacloud-mock][28856]: 
ESC[1;29mProcessing /apiESC[0m (for 127.0.0.1 at Tue Jan 17 13:53:34 -0500 2012) [GET] [ESC[1;29mEC2
ESC[0m]
thin server (localhost:3002) [deltacloud-mock][28856]: Parameters: {}
thin server (localhost:3002) [deltacloud-mock][28856]: Authentication: Basic
thin server (localhost:3002) [deltacloud-mock][28856]: Server: thin 1.2.11 codename Bat-Shit Crazy
thin server (localhost:3002) [deltacloud-mock][28856]: Accept: application/xml
thin server (localhost:3002) [deltacloud-mock][28856]: 
ESC[1;29mProcessing /apiESC[0m (for 127.0.0.1 at Tue Jan 17 13:53:34 -0500 2012) [GET] [ESC[1;29mRHEVMESC[0m]
thin server (localhost:3002) [deltacloud-mock][28856]: Parameters: {}
thin server (localhost:3002) [deltacloud-mock][28856]: Authentication: Basic
thin server (localhost:3002) [deltacloud-mock][28856]: Server: thin 1.2.11 codename Bat-Shit Crazy
thin server (localhost:3002) [deltacloud-mock][28856]: Accept: application/xml
thin server (localhost:3002) [deltacloud-mock][28856]: 
ESC[1;29mProcessing /apiESC[0m (for 127.0.0.1 at Tue Jan 17 13:53:34 -0500 2012) [GET] [ESC[1;29mVSphereESC[0m]
thin server (localhost:3002) [deltacloud-mock][28856]: Parameters: {}
thin server (localhost:3002) [deltacloud-mock][28856]: Authentication: Basic
thin server (localhost:3002) [deltacloud-mock][28856]: Server: thin 1.2.11 codename Bat-Shit Crazy
thin server (localhost:3002) [deltacloud-mock][28856]: Accept: application/xml
/usr/lib/ruby/gems/1.8/gems/aws-2.4.5/lib/awsbase/errors.rb:32: [BUG] Segmentation fault
ruby 1.8.7 (2011-06-30 patchlevel 352) [x86_64-linux]

Comment 6 Ronelle Landy 2012-01-17 19:01:07 UTC
Just using the deltacloud init script shows the following status for the
service:

>> Start service

[root@localhost /]# /etc/init.d/deltacloud-core start
Starting deltacloud-core (via systemctl):                  [  OK  ]

[root@localhost /]# /etc/init.d/deltacloud-core status
deltacloud-core.service - LSB: Deltacloud Core API deamon
   Loaded: loaded (/etc/rc.d/init.d/deltacloud-core)
   Active: active (running) since Tue, 17 Jan 2012 08:03:43 -0500; 5h 46min ago
  Process: 1177 ExecStart=/etc/rc.d/init.d/deltacloud-core start (code=exited,
status=0/SUCCESS)
 Main PID: 1379 (deltacloudd)
   CGroup: name=systemd:/system/deltacloud-core.service
    └ 1379 thin server (localhost:3002) [deltacloud-ec2]     ...

>> Stop service

[root@localhost /]# /etc/init.d/deltacloud-core stop
Stopping deltacloud-core (via systemctl):                  [  OK  ]

[root@localhost /]# /etc/init.d/deltacloud-core status
deltacloud-core.service - LSB: Deltacloud Core API deamon
   Loaded: loaded (/etc/rc.d/init.d/deltacloud-core)
   Active: failed since Tue, 17 Jan 2012 13:50:23 -0500; 3s ago
  Process: 6232 ExecStop=/etc/rc.d/init.d/deltacloud-core stop (code=exited,
status=0/SUCCESS)
  Process: 1177 ExecStart=/etc/rc.d/init.d/deltacloud-core start (code=exited,
status=0/SUCCESS)
 Main PID: 1379 (code=killed, signal=KILL)
   CGroup: name=systemd:/system/deltacloud-core.service

>> Restart service

[root@localhost /]# /etc/init.d/deltacloud-core start
Starting deltacloud-core (via systemctl):                  [  OK  ]

[root@localhost /]# /etc/init.d/deltacloud-core status
deltacloud-core.service - LSB: Deltacloud Core API deamon
   Loaded: loaded (/etc/rc.d/init.d/deltacloud-core)
   Active: active (exited) since Tue, 17 Jan 2012 13:51:06 -0500; 7s ago
  Process: 6232 ExecStop=/etc/rc.d/init.d/deltacloud-core stop (code=exited,
status=0/SUCCESS)
  Process: 6261 ExecStart=/etc/rc.d/init.d/deltacloud-core start (code=exited,
status=0/SUCCESS)
 Main PID: 6265 (code=exited, status=0/SUCCESS)
   CGroup: name=systemd:/system/deltacloud-core.service
    └ 6266 thin server (localhost:3002) [deltacloud-ec2]     ...

Init script tested has 'kill -9' in the stop code:

stop() {
    echo -n $"Shutting down $prog: "
    ...

      kill -9 `cat $PIDFILE`
      retval=$?
      if [ $retval -eq 0 ] && rm -f $LOCKFILE $PIDFILE; then
       ...
    fi
}

Comment 7 Michal Fojtik 2012-01-18 16:52:15 UTC
Playing with gdb using fvollero reveals these details:

#0  0x0000003b01232885 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003b01234065 in abort () at abort.c:92
#2  0x0000003b00e2dac3 in rb_bug (fmt=0x3b00ec608c "Segmentation fault") at error.c:213
#3  0x0000003b00e95ddc in sigsegv (sig=11) at signal.c:634
#4  <signal handler called>
#5  0x0000003b00e991c4 in st_foreach (table=0x3b00e38bdc, func=0x3b00e51260 <mark_entry>, arg=0) at st.c:486
#6  0x0000003b00e50caf in mark_locations_array (x=<value optimized out>, n=55194) at gc.c:690
#7  0x0000003b00e2ec39 in thread_mark (th=0x2929cb0) at eval.c:10541
#8  0x0000003b00e50fb0 in gc_mark_children (ptr=<value optimized out>, lev=10) at gc.c:1012
#9  0x0000003b00e5126e in mark_entry (key=<value optimized out>, value=<value optimized out>, lev=<value optimized out>) at gc.c:712
#10 0x0000003b00e991dd in st_foreach (table=0x244e3b0, func=0x3b00e51260 <mark_entry>, arg=9) at st.c:487
#11 0x0000003b00e50e88 in mark_tbl (ptr=140710230565000, lev=9) at gc.c:722
#12 gc_mark_children (ptr=140710230565000, lev=9) at gc.c:998
#13 0x0000003b00e5126e in mark_entry (key=<value optimized out>, value=<value optimized out>, lev=<value optimized out>) at gc.c:712
#14 0x0000003b00e991dd in st_foreach (table=0x217f710, func=0x3b00e51260 <mark_entry>, arg=8) at st.c:487
#15 0x0000003b00e50e88 in mark_tbl (ptr=140710237645400, lev=8) at gc.c:722
#16 gc_mark_children (ptr=140710237645400, lev=8) at gc.c:998
#17 0x0000003b00e50efe in gc_mark_children (ptr=140710237499080, lev=7) at gc.c:882
#18 0x0000003b00e5126e in mark_entry (key=<value optimized out>, value=<value optimized out>, lev=<value optimized out>) at gc.c:712
#19 0x0000003b00e991dd in st_foreach (table=0x217fa30, func=0x3b00e51260 <mark_entry>, arg=6) at st.c:487
#20 0x0000003b00e50e70 in mark_tbl (ptr=140710237645320, lev=6) at gc.c:722
#21 gc_mark_children (ptr=140710237645320, lev=6) at gc.c:997
#22 0x0000003b00e5126e in mark_entry (key=<value optimized out>, value=<value optimized out>, lev=<value optimized out>) at gc.c:712
#23 0x0000003b00e991dd in st_foreach (table=0x217ffa0, func=0x3b00e51260 <mark_entry>, arg=5) at st.c:487
#24 0x0000003b00e50e88 in mark_tbl (ptr=140710237645200, lev=5) at gc.c:722
#25 gc_mark_children (ptr=140710237645200, lev=5) at gc.c:998
#26 0x0000003b00e50d31 in gc_mark_children (ptr=140710237624280, lev=4) at gc.c:992
#27 0x0000003b00e50d31 in gc_mark_children (ptr=140710237624320, lev=3) at gc.c:992
#28 0x0000003b00e50d31 in gc_mark_children (ptr=140710204844320, lev=2) at gc.c:992
#29 0x0000003b00e50efe in gc_mark_children (ptr=140710204844360, lev=1) at gc.c:882
#30 0x0000003b00e514a8 in garbage_collect () at gc.c:1473
#31 0x0000003b00e521cf in ruby_xmalloc (size=<value optimized out>) at gc.c:153
#32 0x0000003b00e9ac80 in rb_str_buf_new (capa=128) at string.c:234
#33 0x0000003b00e581cc in appendline (fptr=0x3a91cd0, delim=10, strp=0x7fff89b26ab8) at io.c:1567
#34 0x0000003b00e5833f in rb_io_getline_fast (fptr=0x3a91cd0, delim=10 '\n') at io.c:1690
#35 0x0000003b00e7a597 in lex_getline () at parse.y:2743
#36 nextc () at parse.y:2795
#37 ruby_yylex () at parse.y:3469
#38 ruby_yyparse () at parse.c:4376
#39 0x0000003b00e7f11b in yycompile (f=0x3a8dd30 "/usr/lib/ruby/gems/1.8/gems/activesupport-3.0.10/lib/active_support/core_ext/module/synchronization.rb", line=<value optimized out>) at parse.y:2692
#40 0x0000003b00e95791 in load_file (fname=0x3a8dd30 "/usr/lib/ruby/gems/1.8/gems/activesupport-3.0.10/lib/active_support/core_ext/module/synchronization.rb", script=0) at ruby.c:982
#41 0x0000003b00e4a1ef in rb_load (fname=140710230753280, wrap=<value optimized out>) at eval.c:7135
#42 0x0000003b00e4a90a in rb_require_safe (fname=140710204844320, safe=0) at eval.c:7519


For us it seems like a 'ruby' bug in garbage collector. No clue about how to fix it :-( Any clues?

Comment 8 Jason Guiditta 2012-01-24 16:13:40 UTC
Michal found this blog post, it appears there is a problem with mri ruby itself in how it interacts with gcc, or at least the api it provides for c extensions.  This appears to be very bad news:
http://timetobleed.com/the-broken-promises-of-mrireeyarv/

Comment 9 wes hayutin 2012-01-24 17:21:05 UTC
might be related https://bugzilla.redhat.com/show_bug.cgi?id=772708

Comment 11 David Lutterkort 2012-01-24 20:57:10 UTC
I've got that narrowed down to just the ruby interpreter - on the one machine where this is reproducible, it happens when running the Deltacloud server under WEBRick and using json/pure

Investigation still ongoing

Comment 12 wes hayutin 2012-01-27 14:00:34 UTC
*** Bug 785166 has been marked as a duplicate of this bug. ***

Comment 13 wes hayutin 2012-02-01 12:45:41 UTC
*** Bug 786432 has been marked as a duplicate of this bug. ***

Comment 14 wes hayutin 2012-02-01 15:48:46 UTC
*** Bug 786499 has been marked as a duplicate of this bug. ***

Comment 15 wes hayutin 2012-02-01 15:49:28 UTC
additional details from QE in the three above closed bugs

Comment 16 wes hayutin 2012-02-01 15:58:31 UTC
currently one can probably recreate the segfault by starting and stopping applications in the conductor ui

Comment 18 Hugh Brock 2012-02-07 20:17:08 UTC
Moving back to assigned -- Wes found another reproducer.

Comment 19 Hugh Brock 2012-02-07 20:26:03 UTC
My error - Jim informs me that this one is in fact fixed, and that there is a different issue, https://bugzilla.redhat.com/show_bug.cgi?id=786432, that we should be tracking. Moving this back to on_qa.

Comment 20 Jim Meyering 2012-02-07 20:33:37 UTC
Created attachment 560050 [details]
thin C backtrace

Wes H. set this up.
Provoked by running aeolus-restart-services, and then
"delete an ec2 instance from the web-UI"

Comment 21 Jim Meyering 2012-02-27 15:09:34 UTC
TL;DR thin's app-deletion code has been fixed so that
it no longer triggers the still-present bug in ruby.
----------------------

We've been plagued by ruby-1.8.7 (RHEL-6.2) bugs.
First it was one that made deltacloud segfault consistently.
That was fixed by this (year-old!) marshalling patch:

  http://bugs.ruby-lang.org/issues/4339
  [resolved in CE by http://bugzilla.redhat.com/782573]

Then there was a harder-to-reproduce one.
One way is the to run a conductor test repeatedly:

  while :; do cucumber features/provider.feature; sleep 2; done

Another way to reproduce it is to set up all services, provision
and create an EC2 application, and delete that app via the UI.
Just like the preceding segfaults, it happens hundreds of levels
down the stack (400-600) in garbage collection, usually during the
marking phase.  Turning on GC-always makes the tool so slow that
it is unusable.  At first, we suspected that some gem-with-C-compiled
bits was making the same mistake fixed by the marshalling patch.
Last I heard, David Lutterkort had eliminated most such gems as
the possible cause.  I also audited ruby itself, looking for similar
abuse to what was fixed by the marshalling patch.  prelude.c had one,
but it's very special, and in fact, ok.

Michal Fojtik pointed me to this site:
  https://sites.google.com/site/brentsrubypatches/
which seemed very promising.
I've applied 4 patches (1,1a,2,3) from his git repo
and made a new rhel6 rpm.  With it, there is no more segfault.
[ FTR, applying just 1 and 1a did not help.
  I don't remember if I tried also with just 1..2,
  so it's quite possible that patch 3 is not needed. ]

With that segfault out of the way, we found that we could now
exercise the app-deletion code, which had a problem:

  The deletion request that was segfaulting before now completes
  normally, but takes 2m44s, during which thin spawns over 5000 threads.

Fixing the thread storm (done by http://bugzilla.redhat.com/791195)
makes it so this ruby problem is no longer triggered.

Here's description from the brentsrubypatches link above
for the patch that seems most likely to be fixing our problem:

  MBARI3.patch:
  Ruby's conservative garbage collector cannot tell whether machine words
  on the 'C' stack are object pointers or integers, etc. because there is no
  type information associated with them.  A conservative collector works by
  "conserving" every object to which there could possibly be a reference.
  In the 1.8 and 1.6 series Ruby implementations, this means scanning the
  stack of each Thread and Continuation assuming that every word is an
  object pointer if it has a value could be so interpreted.  In practice,
  this is not as bad is it may seem, as Ruby's collector does not consider
  pointers "inside" an object to be valid -- only those that point to
  its exact base address.  So, even assuming thousands of live objects,
  a 32-bit address space will remain very sparsely populated with valid
  object pointers.

  The garbage collector's leaking memory is not really its own fault.  The
  trouble is that the 'C' machine stack is filled with object references.
  The main reason for this is that gcc compilers create overly large
  stack frames and do not initialize many values in them.  Certain 'C'
  constructs used in the Ruby interpreter's core recursive expression
  evaluator generate especially large, sparse stack frames.  The function
  rb_eval() is the worst offender, creating kilobyte sized stack frames for
  each invocation of a function that may call itself hundreds of times.
  This results in stacks that are hundreds of kilobytes, often full of
  old, dead object references that may never go away.  If there were a gcc
  compiler option to initialize all local variables to zero whever a new
  stack frame is built, that would let the collector do its work properly,
  but no such option exists.

  The MBARI3 patch tracks the maximum depth of the 'C' stack at critical
  points in the interpreter's execution so that, whenever the stack
  shrinks, its old contents can be replaced with zeros.  One might think
  this would have no effect because each new stack frame would simply
  overwrite all the zeros.  But, as noted in the previous paragraph,
  gcc does not optimize that way, rather it optimizes for time by writing
  only those words in each frame actually needed for a specific execution
  path through the function.  If those values are left uninitialized,
  when the garbage collector runs, the collector interprets some of them
  as valid object pointers that must be preserved.  Even though these
  "ghost values"  were written by previous iterations of functions that
  are no longer active.  This patch ensures that almost all those hitherto
  uninitialized values are zero when the collector runs.

  In practice, for MBARI's ESP application, I observed that the process
  size increased slowly without bound over many hours of operation.
  After patching, the process size stayed below 10MB for days.  Before,
  I had seen it increase to >24Mbytes.  (Our ARM targets have only
  32MB of DRAM!)  Your mileage may vary, but I expect every long running
  application will see significant reduction in process size although some
  may run slightly slower due to the overhead of clearing the stack.

In case anyone wants to play, RPMs (src and x86_64) of the patched
ruby are here:

  http://people.redhat.com/meyering/tmp/

Comment 22 wes hayutin 2012-02-27 19:22:42 UTC
unable to recreate in ^C
[root@qeblade30 ~]# rpm -qa | grep aeolus
rubygem-aeolus-image-0.3.0-10.el6.noarch
aeolus-conductor-doc-0.8.0-36.el6.noarch
rubygem-aeolus-cli-0.3.0-10.el6.noarch
aeolus-all-0.8.0-36.el6.noarch
aeolus-conductor-0.8.0-36.el6.noarch
aeolus-configure-2.5.0-15.el6.noarch
aeolus-conductor-daemons-0.8.0-36.el6.noarch
[root@qeblade30 ~]#

Comment 23 errata-xmlrpc 2012-05-15 20:32:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2012-0587.html

Comment 24 Richard W.M. Jones 2012-08-03 14:31:32 UTC
This is another good reason not to use Ruby 1.8.7:
https://bugzilla.redhat.com/show_bug.cgi?id=843188#c6