Red Hat Bugzilla – Bug 1003684
App becomes unresponsive, cartridge gets deleted
Last modified: 2015-05-14 19:27:33 EDT
Description of problem:
App becomes unresponsive, rhc ssh and app restart fail as well.
This currently occurs for a user trying to build a Haskell cartridge.
The user was using the CDK to build the cartridge.
This has been noticed on other CDK apps - the process appears to die or get wedged, without any relevant input from the user. It does not appear to be related to idling. Andy Grimm has looked at this with me before, but it appears to be related to some sort of wedge in the process inside the gear.
I've been suffering from this with my CDK apps since the update at the end of August - they were fine before then. Sometimes the unresponsive app responds to rhc app restart, sometimes it's like the description. (Once it gets stuck, the behaviour seems to be constant - the difference is in how stuck it gets in the first place.)
Gideon, the next time it gets stuck can you email me at email@example.com immediately and we'll attempt to debug it?
Debug from Andy:
It was killed by oomkiller, but then watchman tried to restart it twice within a few seconds, which is weird:
Sep 30 09:35:34 ex-med-node5 oo-admin-ctl-gears: Restarted: 5222ebe05973ca378f00008a
Sep 30 09:35:34 ex-med-node5 rhc-watchman: watchman restarted user 5222ebe05973ca378f00008a: application haskell
Sep 30 09:35:40 ex-med-node5 oo-admin-ctl-gears: Restarted: 5222ebe05973ca378f00008a
Sep 30 09:35:40 ex-med-node5 rhc-watchman: watchman restarted user 5222ebe05973ca378f00008a: application haskell
I'm guessing that in the second restart, there was a race condition where the new process tried to bind to 8080 before it was freed by the process shutting down. I restarted the app and it's okay now. This should be something that we can reproduce in another cdk app instance. I'll give it a shot.
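The suspected race can be sketched in a few lines of Ruby. This is an illustration, not watchman code: an ephemeral port stands in for the gear's port 8080 so the sketch is runnable anywhere.

```ruby
require 'socket'

# Sketch of the suspected race: the restarted process tries to bind the
# gear's HTTP port while the old process is still listening on it, so the
# bind fails with EADDRINUSE and the new process dies.
old_proc = TCPServer.new('127.0.0.1', 0)  # "old" process, still shutting down
port = old_proc.addr[1]                   # stands in for 8080

raced = false
begin
  TCPServer.new('127.0.0.1', port)        # "new" process, restarted too soon
rescue Errno::EADDRINUSE
  raced = true                            # this is what kills the second restart
end
puts(raced ? "bind to port #{port} failed: still in use" : 'bind succeeded')

old_proc.close
```

With the old listener still active, the second bind reliably raises `Errno::EADDRINUSE`, matching the "restarted twice within a few seconds" symptom above.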
Watchman greps /var/log/messages for a given time interval, and if it sees two lines, like this:
Sep 30 09:35:13 ex-med-node5 kernel: Task in /openshift/5222ebe05973ca378f00008a killed as a result of limit of /openshift/5222ebe05973ca378f00008a
Sep 30 09:35:14 ex-med-node5 kernel: Task in /openshift/5222ebe05973ca378f00008a killed as a result of limit of /openshift/5222ebe05973ca378f00008a
then it will restart twice, as long as the timestamps are different. There's even a comment in "cache_incident" contemplating this situation:
# Repeat death. Should there be an additional delay here?
It seems like we should construct a set of uuids to restart and then iterate through the set. (or I should get back to my eventfd-based watchman rewrite...)
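The proposed fix can be sketched as follows. The log format is taken verbatim from the report; the method name and restart step are illustrative assumptions, not watchman's real API.

```ruby
require 'set'

# Hypothetical sketch: instead of issuing one restart per matching syslog
# line, collect the gear uuids from all OOM-kill messages seen in the
# interval into a Set, then restart each gear once.
OOM_LINE = %r{Task in /openshift/(\h+) killed as a result of limit}

def gears_to_restart(log_lines)
  uuids = Set.new
  log_lines.each do |line|
    if (m = OOM_LINE.match(line))
      uuids << m[1]   # de-dupe: repeated kills of the same gear collapse
    end
  end
  uuids
end

# The two kill lines from the report, one second apart:
lines = [
  'Sep 30 09:35:13 ex-med-node5 kernel: Task in /openshift/5222ebe05973ca378f00008a killed as a result of limit of /openshift/5222ebe05973ca378f00008a',
  'Sep 30 09:35:14 ex-med-node5 kernel: Task in /openshift/5222ebe05973ca378f00008a killed as a result of limit of /openshift/5222ebe05973ca378f00008a'
]

gears_to_restart(lines).each do |uuid|
  # one restart per gear, however many kill lines appeared in the interval
  puts "restart #{uuid}"
end
```

Because the Set keys on the uuid rather than the timestamp, the two kill lines above collapse into a single restart.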
Commit pushed to master at https://github.com/openshift/li
Watchman Bug 1003684 - Avoid double restarts of OOM-killed apps
De-dupe OOM kill messages within a time interval to avoid restarting the same gear twice.
Checked on devenv_3934.
When gear memory usage goes over the limit, the process is killed and the gear is restarted only once.
Oct 23 07:44:11 ip-10-73-144-158 kernel: java invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0, oom_score_adj=0
Oct 23 07:44:11 ip-10-73-144-158 kernel: [<ffffffff8111d2c2>] ? oom_kill_process+0x82/0x2a0
Oct 23 07:44:11 ip-10-73-144-158 kernel: [<ffffffff81173854>] ? mem_cgroup_handle_oom+0x274/0x2a0
Oct 23 07:44:11 ip-10-73-144-158 kernel: [<ffffffff81171290>] ? memcg_oom_wake_function+0x0/0xa0
Oct 23 07:44:11 ip-10-73-144-158 kernel: [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
Oct 23 07:45:44 ip-10-73-144-158 oo-admin-ctl-gears: Restarted: 52678e53b104d78029000001
Oct 23 07:45:44 ip-10-73-144-158 rhc-watchman: watchman restarted user 52678e53b104d78029000001: application jbeap1