838365 – "rhc app force-stop -a {appName}" throws error consistently for one of our users: Node execution failure (invalid exit code from node)

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OKD
Classification:	Red Hat
Component:	Containers
Sub Component:
Version:	2.x
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	low
Target Milestone:	---
Target Release:	---
Assignee:	Rob Millner
QA Contact:	libra bugs
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	839086 844736
Depends On:
Blocks:

Reported:	2012-07-08 20:49 UTC by Nam Duong
Modified:	2015-05-14 22:56 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2012-08-07 20:42:23 UTC
Target Upstream Version:
Embargoed:

Description Nam Duong 2012-07-08 20:49:45 UTC

Description of problem:
See forum post: 
https://openshift.redhat.com/community/forums/openshift/not-able-to-access-app-getting-error-500#comment-22270

I worked with the user for a short while on IRC and wasn't able to get to his gear.  He is having issues reaching his app in the following ways:
1) rhc app force-stop -d -a main
RESULT:
Node execution failure (invalid exit code from node).  If the problem persists please contact Red Hat support.
See http://pastebin.com/A4G9R7Mm

2) rhc app tidy -d -a main
RESULT:
Node execution failure (invalid exit code from node).  If the problem persists please contact Red Hat support.
See http://pastebin.com/h8082puk

3) ssh -v 96903524327743dfa45ff9a2f54df967.com
It immediately exits
See ssh -v 96903524327743dfa45ff9a2f54df967.com

4) https://main-pingbox.rhcloud.com/ is only responsive on the login splash screen.  Subsequent db related txns will fail with 500 errors.

Usually, in cases where we get a node execution failure, we try to stop the app and run tidy to clear resources before trying to ssh onto the machine to do some debugging (review app/db logs, etc).  We're not able to do any of that.  Please reach out to gbabun for more details.  

In the meantime, I've contacted Ops (rharrison/mmcgrath) to try to restart his app and it was suggested to open a bug as well.

Comment 1 Clayton Coleman 2012-07-09 14:07:37 UTC

NEF

Comment 2 Rob Millner 2012-07-10 23:56:49 UTC

It's likely the app in question exceeded either the limit on the number of its processes or the memory limit for its gear size.

Comment 3 Rob Millner 2012-07-11 00:07:09 UTC

Followed up on the thread.

Comment 4 Rob Millner 2012-07-11 01:20:24 UTC

A useful fix on our end would be for the force-stop functionality to eventually kill all processes owned by the gear's user.

That way, an app can be brought down to the point where it's manageable and diagnosable by the end user even if it has swamped its resources.
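A minimal sketch of that idea, assuming a Linux node and a privileged caller: it finds every process owned by the gear's UID by scanning /proc and sends SIGKILL until none remain. The function names, the retry count, and the delay are illustrative assumptions, not the Crankcase implementation.

```python
import os
import signal
import time


def pids_owned_by(uid, proc_root="/proc"):
    """Return the sorted PIDs whose /proc entry is owned by `uid`."""
    pids = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            if os.stat(os.path.join(proc_root, name)).st_uid == uid:
                pids.append(int(name))
        except OSError:
            continue  # process exited (or entry unreadable) during the scan
    return sorted(pids)


def kill_all_for_uid(uid, rounds=5, delay=0.2,
                     kill=os.kill, list_pids=pids_owned_by):
    """Send SIGKILL to every process owned by `uid`, rescanning a few times.

    Returns True once no processes remain, False if some survive every round.
    `rounds` and `delay` are arbitrary values chosen for this sketch.
    """
    for _ in range(rounds):
        pids = list_pids(uid)
        if not pids:
            return True
        for pid in pids:
            try:
                kill(pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # already gone
        time.sleep(delay)  # give the kernel time to reap before rescanning
    return not list_pids(uid)
```

Because the scan and the signals come from a privileged process rather than from one started under the gear's UID, the sketch does not need a free process slot for that user. Rescanning matters because a runaway app may fork new children between the scan and the kill.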

Comment 5 Rob Millner 2012-07-11 01:22:01 UTC

*** Bug 839086 has been marked as a duplicate of this bug. ***

Comment 6 Rob Millner 2012-07-11 23:14:16 UTC

Crankcase commit 6f99f9015 changes the force-stop function so that it doesn't fail if the gear UID is out of resources.  It also makes setting the application state not rely on being able to run as the user.
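As an illustration of setting state without running as the user, the sketch below writes a state file from the calling (privileged) process and then hands ownership to the gear user. The function name, the state-file path, and the file format are hypothetical; the actual change in the commit may look quite different.

```python
import os
import tempfile


def set_app_state(state_file, state, uid, gid):
    """Record `state` in `state_file` from a privileged process.

    The file is written by the caller and then handed to the gear user
    with chown, so no process has to be started under the gear's UID.
    """
    directory = os.path.dirname(state_file) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(state + "\n")
        os.chown(tmp_path, uid, gid)
        os.replace(tmp_path, state_file)  # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp_path)  # do not leave a partial temp file behind
        raise
```

The point of the design is that nothing here forks or switches to the gear's UID, so it still works when that UID has no process slots left.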

Will submit a pull request after the STG cut.

Comment 7 Rob Millner 2012-07-17 17:17:31 UTC

Pull request #245.

Comment 8 Rob Millner 2012-07-17 21:39:43 UTC

Pull request accepted into Crankcase.

Comment 9 Jianwei Hou 2012-07-18 07:52:35 UTC

Verified on devenv_1899.

Steps:
1. Create an application
  rhc app create -a app1 -t diy-0.1
  rhc app cartridge add -a app1 -c postgresql-8.4
2. Add test script to app and git push (according to case: [US1155][rhc-cartridge] Implement Force Stop to kill apps)
3. ssh into application, run script to exceed app's process limit
  ps -ef
  note down UID of all processes
  ./multifork.py -c 300 -D 600
4. On the node, monitor all processes of UID in step 3
  top -u 500
5. Force stop application and tidy application
  rhc app force-stop -a app1
  rhc app tidy -a app1
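
The multifork.py script is not included in this report. A hypothetical stand-in for exhausting a gear's process limit might look like the sketch below; the function name and the meaning of its two parameters (a child count and a sleep duration in seconds) are assumptions, not a description of the original script.

```python
import os
import time


def fork_sleepers(count, duration):
    """Fork up to `count` children that each sleep for `duration` seconds.

    Returns the PIDs of the children that were started; forking stops early
    if the kernel refuses a new process (for example, a process limit is hit).
    """
    pids = []
    for _ in range(count):
        try:
            pid = os.fork()
        except OSError:
            break  # most likely EAGAIN: the per-user process limit was reached
        if pid == 0:
            time.sleep(duration)
            os._exit(0)  # child exits without running parent cleanup
        pids.append(pid)
    return pids
```

Run inside the gear with a count above its process limit, something like this should leave the gear's UID unable to start new processes, which is the condition under which force-stop previously failed.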

Results:
hjw@my app1$ rhc app force-stop -a app1
Password: ******


RESULT:
Success

hjw@my app1$ rhc app tidy -a app1
Password: ******


MESSAGES:
Stopping app...
Running 'git prune'
Running 'git gc --aggressive'
Emptying log dir: /var/lib/stickshift/2da3269bd9df4536b7fda900d5b0da39/app1/logs/
Emptying tmp dir: /tmp/
Emptying tmp dir: /var/lib/stickshift/2da3269bd9df4536b7fda900d5b0da39/app1/tmp/
Starting app...


RESULT:
Success


The application is stopped.
On the node, all processes are terminated.

Additional Info:
I reproduced the reported problem on an older instance, and the problem is now gone.

Fixed in devenv_1899

Comment 10 Rob Millner 2012-07-31 18:02:57 UTC

*** Bug 844736 has been marked as a duplicate of this bug. ***
