1003684 – App becomes unresponsive, cartridge gets deleted

Summary: App becomes unresponsive, cartridge gets deleted

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Online
Classification:	Red Hat
Component:	Containers
Sub Component:
Version:	2.x
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Jhon Honce
QA Contact:	libra bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2013-09-02 18:02 UTC by Lokesh Mandvekar
Modified:	2015-05-14 23:27 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2014-01-24 03:23:03 UTC
Target Upstream Version:
Embargoed:

Description Lokesh Mandvekar 2013-09-02 18:02:08 UTC

Description of problem:
App becomes unresponsive, rhc ssh and app restart fail as well.

How reproducible:
This currently occurs for a user trying to build a Haskell cartridge.

Additional info:
The user was using the CDK to build the cartridge.

Comment 1 Clayton Coleman 2013-09-03 14:29:00 UTC

This has been noticed on other CDK apps - the process appears to die or get wedged, without any relevant input from the user.  It does not appear to be related to idling.  Andy Grimm has looked at this with me before, but it appears to be related to some sort of wedge in the process inside the gear.

Comment 2 gideon 2013-09-04 10:54:08 UTC

I've been suffering from this with my CDK apps since the update at the end of August - they were fine before then. Sometimes the unresponsive app responds to rhc app restart, sometimes it's like the description. (Once it gets stuck, the behaviour seems to be constant - the difference is in how stuck it gets in the first place.)

Comment 3 Clayton Coleman 2013-09-04 15:30:59 UTC

Gideon, the next time it gets stuck can you email me at ccoleman immediately and we'll attempt to debug it?

Comment 4 Clayton Coleman 2013-09-30 16:52:36 UTC

Debug from Andy:

It was killed by oomkiller, but then watchman tried to restart it twice within a few seconds, which is weird:

Sep 30 09:35:34 ex-med-node5 oo-admin-ctl-gears: Restarted: 5222ebe05973ca378f00008a
Sep 30 09:35:34 ex-med-node5 rhc-watchman[1160]: watchman restarted user 5222ebe05973ca378f00008a: application haskell
Sep 30 09:35:40 ex-med-node5 oo-admin-ctl-gears: Restarted: 5222ebe05973ca378f00008a
Sep 30 09:35:40 ex-med-node5 rhc-watchman[1160]: watchman restarted user 5222ebe05973ca378f00008a: application haskell

I'm guessing that in the second restart, there was a race condition where the new process tried to bind to port 8080 before the port was freed by the process that was shutting down.  I restarted the app and it's okay now.  This should be something that we can reproduce in another cdk app instance.  I'll give it a shot.

Comment 5 Andy Grimm 2013-10-03 16:09:41 UTC

Watchman greps /var/log/messages for a given time interval, and if it sees two lines, like this:

Sep 30 09:35:13 ex-med-node5 kernel: Task in /openshift/5222ebe05973ca378f00008a killed as a result of limit of /openshift/5222ebe05973ca378f00008a
Sep 30 09:35:14 ex-med-node5 kernel: Task in /openshift/5222ebe05973ca378f00008a killed as a result of limit of /openshift/5222ebe05973ca378f00008a

then it will restart the gear twice, as long as the timestamps are different.  There's even a comment in "cache_incident" contemplating this situation:

# Repeat death. Should there be an additional delay here?

It seems like we should construct a set of uuids to restart and then iterate through the set.  (or I should get back to my eventfd-based watchman rewrite...)
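
A minimal sketch of the "build a set of uuids, then iterate" idea, in Python for illustration only. It is not the watchman code: the names gears_to_restart and restart_oom_killed and the restart callback are hypothetical, and the pattern is modelled only on the two kernel lines quoted above.

```python
import re

# Pattern modelled on the kernel lines quoted in this comment; the cgroup
# path layout (/openshift/<uuid>) is assumed from those lines alone.
OOM_LINE = re.compile(
    r"kernel: Task in /openshift/(?P<uuid>[0-9a-f]+) killed as a result of limit of"
)


def gears_to_restart(log_lines):
    """Return the set of gear UUIDs that were OOM-killed in the given lines."""
    uuids = set()
    for line in log_lines:
        match = OOM_LINE.search(line)
        if match:
            # A set keeps one entry per gear, however many kill lines
            # (with different timestamps) appear in the interval.
            uuids.add(match.group("uuid"))
    return uuids


def restart_oom_killed(log_lines, restart):
    """Call restart(uuid) once per OOM-killed gear; return the UUIDs restarted."""
    restarted = sorted(gears_to_restart(log_lines))
    for uuid in restarted:
        restart(uuid)
    return restarted
```

Because the uuids are collected before any restart is issued, two kill messages for the same gear within one scan interval lead to a single restart call.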

Comment 6 Andy Grimm 2013-10-03 16:36:08 UTC

Pull request:
https://github.com/openshift/li/pull/1948

Comment 7 openshift-github-bot 2013-10-04 16:29:38 UTC

Commit pushed to master at https://github.com/openshift/li

https://github.com/openshift/li/commit/82932afb43855929e0ce95e2e819887e8d81a9fb
Watchman Bug 1003684 - Avoid double restarts of OOM-killed apps

De-dupe OOM kill messages within a time interval to avoid
multiple-restart scenarios.

Comment 8 Meng Bo 2013-10-23 11:51:49 UTC

Checked on devenv_3934,

When the gear's memory usage goes over the limit, the process is killed and the gear is restarted only once.

Oct 23 07:44:11 ip-10-73-144-158 kernel: java invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0, oom_score_adj=0
Oct 23 07:44:11 ip-10-73-144-158 kernel: [<ffffffff8111d2c2>] ? oom_kill_process+0x82/0x2a0
Oct 23 07:44:11 ip-10-73-144-158 kernel: [<ffffffff81173854>] ? mem_cgroup_handle_oom+0x274/0x2a0
Oct 23 07:44:11 ip-10-73-144-158 kernel: [<ffffffff81171290>] ? memcg_oom_wake_function+0x0/0xa0
Oct 23 07:44:11 ip-10-73-144-158 kernel: [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
Oct 23 07:45:44 ip-10-73-144-158 oo-admin-ctl-gears: Restarted: 52678e53b104d78029000001
Oct 23 07:45:44 ip-10-73-144-158 rhc-watchman[2032]: watchman restarted user 52678e53b104d78029000001: application jbeap1
