Bug 848639 - Semaphore Sets of gears that no longer exist are left around causing semaphore set exhaustion
Status: CLOSED CURRENTRELEASE
Product: OpenShift Origin
Classification: Red Hat
Component: Containers
Version: 1.x
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assigned To: Rob Millner
QA Contact: libra bugs
Keywords: Regression
Depends On:
Blocks:
Reported: 2012-08-16 01:47 EDT by Xiaoli Tian
Modified: 2013-11-17 19:42 EST
CC List: 3 users

See Also:
Fixed In Version: libra_ami #2037
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-09-17 17:29:03 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Xiaoli Tian 2012-08-16 01:47:16 EDT
Description of problem:

Creating any type of app on INT fails with the following error:
Node execution failure (invalid exit code from node). If the problem persists please contact Red Hat support.

#rhc app create -a 7v0kxgnaag -t python-2.6 -l xtian+int1@redhat.com -p 123456 -d 
Submitting form:
debug: true
rhlogin: xtian+int1@redhat.com
Contacting https://int.openshift.redhat.com
Creating application: 7v0kxgnaag in olipvj2585
Contacting https://int.openshift.redhat.com
Problem reported from server. Response code was 500.
DEBUG:
Initialized empty Git repository in /var/lib/stickshift/ef685948d2de4c179a07270179e465c6/git/7v0kxgnaag.git/
/var/lib/stickshift/ef685948d2de4c179a07270179e465c6/git/7v0kxgnaag.git /tmp
/tmp
httpd not running, trying to start
Failed to restart master httpd, please contact support
Cartridge return code: 120

Exit Code: 143
broker_c: namespacerhloginsshapp_uuiddebugaltercartridgecart_typeactionapp_nameapi
api_c: placeholder
API version: 1.1.3

RESULT:
Node execution failure (invalid exit code from node). If the problem persists please contact Red Hat support.


Version-Release number of selected component (if applicable):
INT(devenv_1987)

How reproducible:
80%

Steps to Reproduce:
1. Create any kind of scaling or non-scaling app on the current INT (note: this currently only happens on INT)
  
Actual results:
The following error is returned: "Node execution failure (invalid exit code from node). If the problem persists please contact Red Hat support."

Expected results:
The application is created successfully.

Additional info:
Comment 1 Xiaoli Tian 2012-08-17 02:29:28 EDT
Have not hit this error today; it appears to be fixed.
Comment 2 Xiaoli Tian 2012-08-17 02:34:01 EDT
Tested on latest INT (devenv_2007)
#rhc app create -a 6x52j3uib3 -t perl-5.10 -l xtian+test6@redhat.com -p 123456 -d 
Submitting form:
rhlogin: xtian+test6@redhat.com
debug: true
Contacting https://int.openshift.redhat.com
Creating application: 6x52j3uib3 in 83sdqiw02s
Contacting https://int.openshift.redhat.com
Response from server:
DEBUG:

Exit Code: 0
broker_c: ["namespace", "rhlogin", "ssh", "app_uuid", "debug", "alter", "cartridge", "cart_type", "action", "app_name", "api"]
api_c: ["placeholder"]
API version: 1.1.3
DEBUG: '6x52j3uib3' creation returned success.
Now your new domain name is being propagated worldwide (this might take a minute)...
Pulling new repo down
git clone --quiet ssh://9ee7f9387b4541a1bd2edd6bfe3403d2@6x52j3uib3-83sdqiw02s.int.rhcloud.com/~/git/6x52j3uib3.git/ 6x52j3uib3
Warning: Permanently added '6x52j3uib3-83sdqiw02s.int.rhcloud.com' (RSA) to the list of known hosts. 
Checking if the application is available #1
Application 6x52j3uib3 is available at: http://6x52j3uib3-83sdqiw02s.int.rhcloud.com/
Git URL: ssh://9ee7f9387b4541a1bd2edd6bfe3403d2@6x52j3uib3-83sdqiw02s.int.rhcloud.com/~/git/6x52j3uib3.git/
To make changes to '6x52j3uib3', commit to 6x52j3uib3/.
Successfully created application: 6x52j3uib3r
Comment 3 Rony Gong 2012-08-20 01:36:29 EDT
Reopening, since it can be reproduced on INT (devenv_2021) with the same error.
Comment 4 Thomas Wiest 2012-08-20 11:08:37 EDT
This happens because gears that were destroyed still have their semaphore sets listed in ipcs.

Eventually, the ex-node runs out of semaphore sets and then httpd can't restart because it can't create a semaphore set.

Here's the log from the gear creation of ef685948d2de4c179a07270179e465c6 (from the description above):

I, [2012-08-16T01:17:56.430621 #23636]  INFO -- : stickshift.rb:373:in `cartridge_do_action' cartridge_do_action ERROR (120)
------
Initialized empty Git repository in /var/lib/stickshift/ef685948d2de4c179a07270179e465c6/git/7v0kxgnaag.git/
/var/lib/stickshift/ef685948d2de4c179a07270179e465c6/git/7v0kxgnaag.git /tmp
/tmp
httpd not running, trying to start
Failed to restart master httpd, please contact support

------)




Here's the httpd error log:
[Thu Aug 16 01:17:56 2012] [crit] (28)No space left on device: mod_rewrite: Parent could not create RewriteLock file /dev/shm/httpd-rewritelock.lock

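^--- This is actually caused by semaphore set exhaustion, not by a lack of space: errno 28 (ENOSPC, "No space left on device") is also what semget() returns when the system-wide limit on semaphore sets would be exceeded, and there is plenty of space on /dev/shm.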
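To see how close a node is to that limit, the kernel's semaphore limits can be compared with the number of sets currently allocated. The sketch below assumes a Linux node where `/proc/sys/kernel/sem` holds the four limits SEMMSL, SEMMNS, SEMOPM and SEMMNI, and `/proc/sysvipc/sem` lists one allocated set per line after a header row. The function names are hypothetical and were not part of the original investigation.

```python
# Hypothetical helper: how many more System V semaphore sets can this node create?
# Assumes Linux procfs: /proc/sys/kernel/sem = "SEMMSL SEMMNS SEMOPM SEMMNI",
# /proc/sysvipc/sem = one header line followed by one line per semaphore set.


def parse_sem_limits(text):
    """Split /proc/sys/kernel/sem into its four named limits."""
    semmsl, semmns, semopm, semmni = (int(v) for v in text.split())
    return {"SEMMSL": semmsl, "SEMMNS": semmns, "SEMOPM": semopm, "SEMMNI": semmni}


def count_semaphore_sets(sysvipc_sem_text):
    """Count allocated sets: every non-blank line except the header."""
    lines = [line for line in sysvipc_sem_text.splitlines() if line.strip()]
    return max(len(lines) - 1, 0)


def sets_left(limits_text, sysvipc_sem_text):
    """Sets remaining before semget() starts failing with ENOSPC on SEMMNI.

    (SEMMNS, the total number of semaphores, can also trigger ENOSPC;
    it is not checked here.)
    """
    semmni = parse_sem_limits(limits_text)["SEMMNI"]
    return semmni - count_semaphore_sets(sysvipc_sem_text)


def check_node():
    """Read the live tables on the current node and return sets_left()."""
    with open("/proc/sys/kernel/sem") as f:
        limits = f.read()
    with open("/proc/sysvipc/sem") as f:
        sets = f.read()
    return sets_left(limits, sets)
```

A value at or near zero from `check_node()` would match the httpd failure above.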




Here's a snippet of ipcs:

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
[... SNIP ...]
0x00000000 213549990  3029       600        1         
0x00000000 213746599  3054       600        1         
0x00000000 214565800  3049       600        1         
0x00000000 214008745  3061       600        1         
0x00000000 214139818  3056       600        1         
0x00000000 214270891  ca717c02ef 600        1    # <--- this gear still exists, the rest in this list don't
0x00000000 214598572  3039       600        1         
0x00000000 214729645  3066       600        1         
0x00000000 215385006  3075       600        1         
0x00000000 215417775  3083       600        1         
0x00000000 216859568  3084       600        1         
0x00000000 216892337  3088       600        1         
0x00000000 216925106  3076       600        1        
[... SNIP ...]




Steps to reproduce:
1) Create a lot of gears
2) run ipcs -s
3) notice that all of the semaphores have proper owners
4) Remove most of the gears
5) run ipcs -s
6) notice that the semaphore sets of the gears that were removed are still in the list
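Steps 5 and 6 can be automated. This is a minimal sketch that assumes the util-linux `ipcs -s` layout shown in this comment (key, semid, owner, perms, nsems) and assumes that `ipcs` prints a numeric UID in the owner column when the UID has no passwd entry. That would explain why the removed gears appear as `3029`, `3054`, and so on. The sketch lists the semids whose owner no longer resolves to a user. The helper names are hypothetical.

```python
import pwd
import subprocess


def parse_ipcs_semaphores(ipcs_output):
    """Return (semid, owner) pairs from `ipcs -s` output."""
    rows = []
    for line in ipcs_output.splitlines():
        fields = line.split()
        # Data rows start with the key column (e.g. 0x00000000); headers
        # and "[... SNIP ...]" markers do not.
        if len(fields) >= 5 and fields[0].startswith("0x"):
            rows.append((int(fields[1]), fields[2]))
    return rows


def uid_exists(uid):
    """True if the UID still has a passwd entry."""
    try:
        pwd.getpwuid(uid)
        return True
    except KeyError:
        return False


def orphaned_semids(ipcs_output, uid_is_known=uid_exists):
    """semids whose owner is a bare UID that no longer maps to a user."""
    orphans = []
    for semid, owner in parse_ipcs_semaphores(ipcs_output):
        if owner.isdigit() and not uid_is_known(int(owner)):
            orphans.append(semid)
    return orphans


def live_orphans():
    """Run `ipcs -s` on this node (as root) and return the orphaned semids."""
    out = subprocess.run(["ipcs", "-s"], capture_output=True,
                         text=True, check=True).stdout
    return orphaned_semids(out)
```

The second `getpwuid` check guards against the unlikely case of a user whose name is made up only of digits.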
Comment 5 Thomas Wiest 2012-08-20 12:48:57 EDT
OK, so this seems to be sporadic: simply creating an app and then destroying it doesn't leave a semaphore set behind.

Here is the list of cartridges that create a semaphore set when the app is created:
Ruby-1.8
Ruby-1.9
Python-2.6
Php-5.3
Perl-5.10


As you can see, it's basically all of the cartridges that use httpd inside the gear.


So, the reproduction steps above might not work.

This might only happen when the gear's httpd fails to stop properly during a gear destroy and leaves its semaphore set behind.
Comment 6 Rob Millner 2012-08-20 14:14:32 EDT
The gear destroy now removes IPC entities owned by the gear uuid.

https://github.com/openshift/crankcase/pull/407
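The actual change is in the linked pull requests and is not reproduced here. As a rough illustration of the same idea, here is a hypothetical Python sketch that finds every System V IPC object (semaphore set, shared memory segment, or message queue) owned by a gear's UID and builds the matching `ipcrm` commands. It reads `/proc/sysvipc/{sem,shm,msg}` instead of `ipcs` because those tables report numeric owner UIDs, whereas `ipcs` truncates user names (the `ca717c02ef` owner in the snippet above appears to be a cut-down gear UUID). The function names and the dry-run default are assumptions.

```python
import subprocess

# ipcrm type flag -> (/proc/sysvipc table, name of its id column)
IPC_SOURCES = {
    "-s": ("/proc/sysvipc/sem", "semid"),
    "-m": ("/proc/sysvipc/shm", "shmid"),
    "-q": ("/proc/sysvipc/msg", "msqid"),
}


def ids_owned_by(proc_text, id_column, uid):
    """Return the ids in a /proc/sysvipc/* table whose owner uid matches."""
    lines = proc_text.strip().splitlines()
    if not lines:
        return []
    # Locate columns by header name so the differing layouts all work.
    header = lines[0].split()
    id_idx, uid_idx = header.index(id_column), header.index("uid")
    ids = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) > max(id_idx, uid_idx) and int(fields[uid_idx]) == uid:
            ids.append(int(fields[id_idx]))
    return ids


def ipcrm_commands(tables, uid):
    """Build ipcrm argv lists for every IPC object owned by uid.

    tables maps an ipcrm flag ("-s", "-m", "-q") to that table's text;
    missing flags are treated as empty tables.
    """
    cmds = []
    for flag, (_, id_column) in IPC_SOURCES.items():
        for ipc_id in ids_owned_by(tables.get(flag, ""), id_column, uid):
            cmds.append(["ipcrm", flag, str(ipc_id)])
    return cmds


def remove_gear_ipc(uid, dry_run=True):
    """Read the live tables and print (or, if dry_run is False, run) ipcrm."""
    tables = {}
    for flag, (path, _) in IPC_SOURCES.items():
        with open(path) as f:
            tables[flag] = f.read()
    for cmd in ipcrm_commands(tables, uid):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Running this only after the gear's processes have been stopped would avoid pulling a semaphore out from under an httpd that is still shutting down, which Comment 5 suggests is when the leak occurs.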
Comment 7 Rob Millner 2012-08-20 14:37:22 EDT
Updated pull request.
https://github.com/openshift/crankcase/pull/408
Comment 8 Rob Millner 2012-08-20 19:17:50 EDT
Pull request merged.
Comment 9 Xiaoli Tian 2012-08-21 06:29:15 EDT
Have not hit the above error again after a round of testing against INT; moving it to VERIFIED.
Comment 10 Xiaoli Tian 2012-08-21 21:37:03 EDT
This bug is happening again in INT after the update:
#rhc app create -a g1ergtshrg -t nodejs-0.6 -l xtian+test2@redhat.com -p 123456 -d 
Submitting form:
rhlogin: xtian+test2@redhat.com
debug: true
Contacting https://int.openshift.redhat.com
Creating application: g1ergtshrg in kqnjeqs4yw
Contacting https://int.openshift.redhat.com
Problem reported from server. Response code was 500.

DEBUG:
Initialized empty Git repository in /var/lib/stickshift/af30fe4947144907ad020ccf25311ee5/git/g1ergtshrg.git/
/var/lib/stickshift/af30fe4947144907ad020ccf25311ee5/git/g1ergtshrg.git /tmp
/tmp
Failed to restart master httpd, please contact support
Cartridge return code: 120

Exit Code: 143
api_c: ["placeholder"]
broker_c: ["namespace", "rhlogin", "ssh", "app_uuid", "debug", "alter", "cartridge", "cart_type", "action", "app_name", "api"]
API version: 1.1.3

RESULT:
Node execution failure (invalid exit code from node). If the problem persists please contact Red Hat support.

Command Return: 143
WARNING: the 1st attempt failed => let's try it again.

Running Command - rm -rf g1ergtshrg && rhc app create -a g1ergtshrg -t nodejs-0.6 -l xtian+test2@redhat.com -p 123456 -d 
Submitting form:
rhlogin: xtian+test2@redhat.com
debug: true
Contacting https://int.openshift.redhat.com
Creating application: g1ergtshrg in kqnjeqs4yw
Contacting https://int.openshift.redhat.com
Problem reported from server. Response code was 500.

DEBUG:
Initialized empty Git repository in /var/lib/stickshift/00e7cb6d83d24f549d137a58ee7eae91/git/g1ergtshrg.git/
/var/lib/stickshift/00e7cb6d83d24f549d137a58ee7eae91/git/g1ergtshrg.git /tmp
/tmp
Failed to restart master httpd, please contact support
Cartridge return code: 120

Exit Code: 143
broker_c: ["namespace", "rhlogin", "ssh", "app_uuid", "debug", "alter", "cartridge", "cart_type", "action", "app_name", "api"]
api_c: ["placeholder"]
API version: 1.1.3

RESULT:
Node execution failure (invalid exit code from node). If the problem persists please contact Red Hat support.

Command Return: 143
Comment 11 Xiaoli Tian 2012-08-21 22:01:25 EDT
After confirming with Thomas: this recurrence is not caused by semaphore set exhaustion but by a separate httpd configuration error, which Thomas has already fixed. Moving it to ON_QA to test it again.
Comment 12 Xiaoli Tian 2012-08-22 05:14:24 EDT
It works on the current INT now.
Comment 13 Thomas Wiest 2012-08-22 10:36:14 EDT
Just to add more confirmation, yesterday we cleared out all of the unowned semaphore sets in INT.

This morning, I checked and there were 0 unowned semaphore sets in INT, so I believe this bug has been fixed.
