Bug 1003684

Summary: App becomes unresponsive, cartridge gets deleted
Product: OpenShift Online
Reporter: Lokesh Mandvekar <lsm5>
Component: Containers
Assignee: Jhon Honce <jhonce>
Status: CLOSED CURRENTRELEASE
QA Contact: libra bugs <libra-bugs>
Severity: high
Priority: unspecified
Version: 2.x
CC: agrimm, bmeng, ccoleman, dmcphers, gideon, jkeck
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-01-24 03:23:03 UTC

Description Lokesh Mandvekar 2013-09-02 18:02:08 UTC
Description of problem:
App becomes unresponsive; rhc ssh and app restart fail as well.

How reproducible:
This currently occurs for a user trying to build a Haskell cartridge.

Additional info:
User was building the cartridge with the CDK (cartridge development kit).

Comment 1 Clayton Coleman 2013-09-03 14:29:00 UTC
This has been noticed on other CDK apps - the process appears to die or get wedged, without any relevant input from the user.  It does not appear to be related to idling.  Andy Grimm has looked at this with me before, but it appears to be related to some sort of wedge in the process inside the gear.

Comment 2 gideon 2013-09-04 10:54:08 UTC
I've been suffering from this with my CDK apps since the update at the end of August - they were fine before then. Sometimes the unresponsive app responds to rhc app restart, sometimes it's like the description. (Once it gets stuck, the behaviour seems to be constant - the difference is in how stuck it gets in the first place.)

Comment 3 Clayton Coleman 2013-09-04 15:30:59 UTC
Gideon, the next time it gets stuck can you email me at ccoleman immediately and we'll attempt to debug it?

Comment 4 Clayton Coleman 2013-09-30 16:52:36 UTC
Debug from Andy:

It was killed by the OOM killer, but then watchman tried to restart it twice within a few seconds, which is weird:

Sep 30 09:35:34 ex-med-node5 oo-admin-ctl-gears: Restarted: 5222ebe05973ca378f00008a
Sep 30 09:35:34 ex-med-node5 rhc-watchman[1160]: watchman restarted user 5222ebe05973ca378f00008a: application haskell
Sep 30 09:35:40 ex-med-node5 oo-admin-ctl-gears: Restarted: 5222ebe05973ca378f00008a
Sep 30 09:35:40 ex-med-node5 rhc-watchman[1160]: watchman restarted user 5222ebe05973ca378f00008a: application haskell

I'm guessing that in the second restart there was a race condition: the new process tried to bind to port 8080 before the process shutting down had released it.  I restarted the app and it's okay now.  This should be something we can reproduce in another CDK app instance.  I'll give it a shot.
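
For illustration, one way a restart path could avoid that bind race is to wait for the old process to release the port before starting the new one. A minimal Ruby sketch, assuming a hypothetical wait_for_port_free helper called between the stop and start halves of a restart (the actual fix in comment 7 takes a different approach):

require 'socket'
require 'timeout'

# Hypothetical helper, not part of watchman: block until nothing is bound to the
# gear's internal IP:port, so the restarted process doesn't race the one shutting down.
def wait_for_port_free(ip, port, timeout_secs = 30)
  Timeout.timeout(timeout_secs) do
    loop do
      begin
        TCPServer.new(ip, port).close   # bind succeeds only once the port is released
        return true
      rescue Errno::EADDRINUSE
        sleep 1
      end
    end
  end
rescue Timeout::Error
  false   # caller can log that the restart may still race
end

# Example: wait_for_port_free('127.5.25.1', 8080) before starting the gear again.
# (The IP is a made-up example of a gear-internal address.)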

Comment 5 Andy Grimm 2013-10-03 16:09:41 UTC
Watchman greps /var/log/messages over a given time interval, and if it sees two lines like these:

Sep 30 09:35:13 ex-med-node5 kernel: Task in /openshift/5222ebe05973ca378f00008a killed as a result of limit of /openshift/5222ebe05973ca378f00008a
Sep 30 09:35:14 ex-med-node5 kernel: Task in /openshift/5222ebe05973ca378f00008a killed as a result of limit of /openshift/5222ebe05973ca378f00008a

then it will restart the gear twice, as long as the timestamps are different.  There's even a comment in "cache_incident" contemplating this situation:

# Repeat death. Should there be an additional delay here?

It seems like we should construct a set of uuids to restart and then iterate through the set.  (or I should get back to my eventfd-based watchman rewrite...)
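
A minimal Ruby sketch of that approach, with hypothetical helper names (not the actual watchman code): collect the distinct gear uuids from the OOM-kill lines seen in the scan window, then restart each gear once.

require 'set'

# Matches kernel lines like:
#   Task in /openshift/5222ebe05973ca378f00008a killed as a result of limit of ...
OOM_KILL = %r{Task in /openshift/(?<uuid>[0-9a-f]+) killed as a result of limit}

# Hypothetical: build the set of gears to restart from the lines seen this interval.
def gears_to_restart(log_lines)
  log_lines.each_with_object(Set.new) do |line, uuids|
    m = OOM_KILL.match(line)
    uuids << m[:uuid] if m          # repeated kill messages collapse in the set
  end
end

# Hypothetical restart hook; the syslog excerpts above show oo-admin-ctl-gears
# doing the actual restart, so invoking it per uuid is an assumption.
def restart_gears(uuids)
  uuids.each { |uuid| system('oo-admin-ctl-gears', 'restartgear', uuid) }
end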

Comment 6 Andy Grimm 2013-10-03 16:36:08 UTC
Pull request:
https://github.com/openshift/li/pull/1948

Comment 7 openshift-github-bot 2013-10-04 16:29:38 UTC
Commit pushed to master at https://github.com/openshift/li

https://github.com/openshift/li/commit/82932afb43855929e0ce95e2e819887e8d81a9fb
Watchman Bug 1003684 - Avoid double restarts of OOM-killed apps

De-dupe OOM kill messages within a time interval to avoid
multiple-restart scenarios.
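
The commit itself isn't quoted here, but the idea it describes can be sketched as a per-gear incident cache (hypothetical names and interval, not the actual watchman code): a second OOM-kill message for the same gear inside the window counts as the same incident and does not trigger another restart.

SCAN_INTERVAL = 60  # seconds; assumed window, not necessarily what watchman uses

LAST_INCIDENT = {}  # gear uuid => Time of the last OOM kill already handled

# Hypothetical restart hook, as in the sketch under comment 5.
def restart_gear(uuid)
  system('oo-admin-ctl-gears', 'restartgear', uuid)
end

# Hypothetical: decide whether an OOM-kill line with timestamp `at` warrants a restart.
def handle_oom_incident(uuid, at)
  last = LAST_INCIDENT[uuid]
  return if last && (at - last) < SCAN_INTERVAL   # "Repeat death" inside the window: skip
  LAST_INCIDENT[uuid] = at
  restart_gear(uuid)
end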

Comment 8 Meng Bo 2013-10-23 11:51:49 UTC
Checked on devenv_3934,

When gear memory usage goes over the limit, the process is killed and the gear is restarted only once.

Oct 23 07:44:11 ip-10-73-144-158 kernel: java invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0, oom_score_adj=0
Oct 23 07:44:11 ip-10-73-144-158 kernel: [<ffffffff8111d2c2>] ? oom_kill_process+0x82/0x2a0
Oct 23 07:44:11 ip-10-73-144-158 kernel: [<ffffffff81173854>] ? mem_cgroup_handle_oom+0x274/0x2a0
Oct 23 07:44:11 ip-10-73-144-158 kernel: [<ffffffff81171290>] ? memcg_oom_wake_function+0x0/0xa0
Oct 23 07:44:11 ip-10-73-144-158 kernel: [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
Oct 23 07:45:44 ip-10-73-144-158 oo-admin-ctl-gears: Restarted: 52678e53b104d78029000001
Oct 23 07:45:44 ip-10-73-144-158 rhc-watchman[2032]: watchman restarted user 52678e53b104d78029000001: application jbeap1