Bug 1003684 - App becomes unresponsive, cartridge gets deleted
Status: CLOSED CURRENTRELEASE
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Version: 2.x
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Assigned To: Jhon Honce
QA Contact: libra bugs
Reported: 2013-09-02 14:02 EDT by Lokesh Mandvekar
Modified: 2015-05-14 19:27 EDT
CC: 6 users

Doc Type: Bug Fix
Last Closed: 2014-01-23 22:23:03 EST
Type: Bug


Attachments: None
Description Lokesh Mandvekar 2013-09-02 14:02:08 EDT
Description of problem:
App becomes unresponsive, rhc ssh and app restart fail as well.

How reproducible:
This currently occurs for a user trying to build a Haskell cartridge.

Additional info:
The user was using the CDK to build the cartridge.
Comment 1 Clayton Coleman 2013-09-03 10:29:00 EDT
This has been noticed on other CDK apps - the process appears to die or get wedged, without any relevant input from the user.  It does not appear to be related to idling.  Andy Grimm has looked at this with me before, but it appears to be related to some sort of wedge in the process inside the gear.
Comment 2 gideon 2013-09-04 06:54:08 EDT
I've been suffering from this with my CDK apps since the update at the end of August - they were fine before then. Sometimes the unresponsive app responds to rhc app restart, sometimes it's like the description. (Once it gets stuck, the behaviour seems to be constant - the difference is in how stuck it gets in the first place.)
Comment 3 Clayton Coleman 2013-09-04 11:30:59 EDT
Gideon, the next time it gets stuck can you email me at ccoleman@redhat.com immediately and we'll attempt to debug it?
Comment 4 Clayton Coleman 2013-09-30 12:52:36 EDT
Debug from Andy:

It was killed by the OOM killer, but then watchman tried to restart it twice within a few seconds, which is weird:

Sep 30 09:35:34 ex-med-node5 oo-admin-ctl-gears: Restarted: 5222ebe05973ca378f00008a
Sep 30 09:35:34 ex-med-node5 rhc-watchman[1160]: watchman restarted user 5222ebe05973ca378f00008a: application haskell
Sep 30 09:35:40 ex-med-node5 oo-admin-ctl-gears: Restarted: 5222ebe05973ca378f00008a
Sep 30 09:35:40 ex-med-node5 rhc-watchman[1160]: watchman restarted user 5222ebe05973ca378f00008a: application haskell

I'm guessing that in the second restart there was a race condition where the new process tried to bind to port 8080 before it had been released by the process shutting down.  I restarted the app and it's okay now.  This should be something that we can reproduce in another CDK app instance.  I'll give it a shot.
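A minimal illustration of that suspected race (a sketch only, not OpenShift or cartridge code; the try_bind helper is hypothetical): if the replacement process tries to bind the gear's port 8080 while the previous listener still holds it, the bind fails with EADDRINUSE and the gear is left without a listener.

import errno
import socket

# Sketch only, not OpenShift code: a restarted process that tries to bind
# the gear port before the old listener has released it gets EADDRINUSE,
# which matches the suspected restart-vs-shutdown race described above.
def try_bind(port=8080):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("127.0.0.1", port))
        s.listen(5)
        return s
    except OSError as e:
        if e.errno == errno.EADDRINUSE:
            print("port %d still in use; restart raced with shutdown" % port)
            return None
        raise

old_listener = try_bind()  # stands in for the gear process still shutting down
new_listener = try_bind()  # the second restart loses the race and gets None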
Comment 5 Andy Grimm 2013-10-03 12:09:41 EDT
Watchman greps /var/log/messages for a given time interval, and if it sees two lines, like this:

Sep 30 09:35:13 ex-med-node5 kernel: Task in /openshift/5222ebe05973ca378f00008a killed as a result of limit of /openshift/5222ebe05973ca378f00008a
Sep 30 09:35:14 ex-med-node5 kernel: Task in /openshift/5222ebe05973ca378f00008a killed as a result of limit of /openshift/5222ebe05973ca378f00008a

then it will restart twice, as long as the timestamps are different.  There's even a comment in "cache_incident" contemplating this situation:

# Repeat death. Should there be an additional delay here?

It seems like we should construct a set of uuids to restart and then iterate through the set.  (or I should get back to my eventfd-based watchman rewrite...)
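A minimal sketch of that de-dupe approach (illustrative only; the regex and helper names are hypothetical, not taken from watchman, whose actual fix is in the pull request in the next comment): collect each gear UUID at most once from the OOM-kill lines in the scanned interval, then restart each affected gear exactly once.

import re

# Sketch only; the pattern and names are hypothetical, not watchman's code.
OOM_KILL = re.compile(r"Task in /openshift/(\w+) killed as a result of limit")

def gears_to_restart(log_lines):
    # Collect each gear UUID once, no matter how many OOM-kill lines
    # (with different timestamps) fall inside the scanned interval.
    uuids = set()
    for line in log_lines:
        match = OOM_KILL.search(line)
        if match:
            uuids.add(match.group(1))
    return uuids

def restart_gear(uuid):
    # Stand-in for the real restart call (the logs above show watchman
    # restarting gears via oo-admin-ctl-gears); here we only record it.
    print("would restart gear %s" % uuid)

with open("/var/log/messages") as messages:
    for uuid in gears_to_restart(messages):
        restart_gear(uuid)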
Comment 6 Andy Grimm 2013-10-03 12:36:08 EDT
Pull request:
https://github.com/openshift/li/pull/1948
Comment 7 openshift-github-bot 2013-10-04 12:29:38 EDT
Commit pushed to master at https://github.com/openshift/li

https://github.com/openshift/li/commit/82932afb43855929e0ce95e2e819887e8d81a9fb
Watchman Bug 1003684 - Avoid double restarts of OOM-killed apps

De-dupe OOM kill messages within a time interval to avoid
multiple-restart scenarios.
Comment 8 Meng Bo 2013-10-23 07:51:49 EDT
Checked on devenv_3934,

When gear memory usage goes over the limit, the process is killed and the gear is restarted only once.

Oct 23 07:44:11 ip-10-73-144-158 kernel: java invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0, oom_score_adj=0
Oct 23 07:44:11 ip-10-73-144-158 kernel: [<ffffffff8111d2c2>] ? oom_kill_process+0x82/0x2a0
Oct 23 07:44:11 ip-10-73-144-158 kernel: [<ffffffff81173854>] ? mem_cgroup_handle_oom+0x274/0x2a0
Oct 23 07:44:11 ip-10-73-144-158 kernel: [<ffffffff81171290>] ? memcg_oom_wake_function+0x0/0xa0
Oct 23 07:44:11 ip-10-73-144-158 kernel: [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
Oct 23 07:45:44 ip-10-73-144-158 oo-admin-ctl-gears: Restarted: 52678e53b104d78029000001
Oct 23 07:45:44 ip-10-73-144-158 rhc-watchman[2032]: watchman restarted user 52678e53b104d78029000001: application jbeap1
