Bug 1003684
Summary: | App becomes unresponsive, cartridge gets deleted | |
---|---|---|---
Product: | OpenShift Online | Reporter: | Lokesh Mandvekar <lsm5>
Component: | Containers | Assignee: | Jhon Honce <jhonce>
Status: | CLOSED CURRENTRELEASE | QA Contact: | libra bugs <libra-bugs>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 2.x | CC: | agrimm, bmeng, ccoleman, dmcphers, gideon, jkeck
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2014-01-24 03:23:03 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Lokesh Mandvekar
2013-09-02 18:02:08 UTC
This has been noticed on other CDK apps: the process appears to die or get wedged without any relevant input from the user. It does not appear to be related to idling. Andy Grimm has looked at this with me before, but it appears to be related to some sort of wedge in the process inside the gear.

I've been suffering from this with my CDK apps since the update at the end of August; they were fine before then. Sometimes the unresponsive app responds to rhc app restart, sometimes it's like the description. (Once it gets stuck, the behaviour seems to be constant; the difference is in how stuck it gets in the first place.) Gideon, the next time it gets stuck can you email me at ccoleman immediately and we'll attempt to debug it?

Debug from Andy: It was killed by the OOM killer, but then watchman tried to restart it twice within a few seconds, which is weird:

Sep 30 09:35:34 ex-med-node5 oo-admin-ctl-gears: Restarted: 5222ebe05973ca378f00008a
Sep 30 09:35:34 ex-med-node5 rhc-watchman[1160]: watchman restarted user 5222ebe05973ca378f00008a: application haskell
Sep 30 09:35:40 ex-med-node5 oo-admin-ctl-gears: Restarted: 5222ebe05973ca378f00008a
Sep 30 09:35:40 ex-med-node5 rhc-watchman[1160]: watchman restarted user 5222ebe05973ca378f00008a: application haskell

I'm guessing that in the second restart there was a race condition where the new process tried to bind to 8080 before it was free from the process shutting down. I restarted the app and it's okay now. This should be something that we can reproduce in another CDK app instance. I'll give it a shot.
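The bind race hypothesized above is easy to demonstrate in isolation: while the old process still holds the listen port, a second process's bind fails with EADDRINUSE. A minimal sketch (Python here purely for illustration; port 18080 is an arbitrary stand-in for the gear's 8080):

```python
import errno
import socket

def try_bind(port):
    """Attempt to open a listening socket; return (socket_or_None, error_or_None)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("127.0.0.1", port))
        s.listen(1)
        return s, None
    except OSError as e:
        s.close()
        return None, e

# The "old" process is still shutting down and holds the port...
old, _ = try_bind(18080)
assert old is not None

# ...so the freshly restarted process loses the race and fails to bind.
new, err = try_bind(18080)
assert new is None and err.errno == errno.EADDRINUSE

old.close()
```

After the old socket is closed (and, for real TCP servers, once TIME_WAIT clears), the same bind succeeds, which is why a short delay or a single restart avoids the failure.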
Watchman greps /var/log/messages for a given time interval, and if it sees two lines like this:

Sep 30 09:35:13 ex-med-node5 kernel: Task in /openshift/5222ebe05973ca378f00008a killed as a result of limit of /openshift/5222ebe05973ca378f00008a
Sep 30 09:35:14 ex-med-node5 kernel: Task in /openshift/5222ebe05973ca378f00008a killed as a result of limit of /openshift/5222ebe05973ca378f00008a

then it will restart the gear twice, as long as the timestamps are different. There's even a comment in "cache_incident" contemplating this situation:

# Repeat death. Should there be an additional delay here?

It seems like we should construct a set of uuids to restart and then iterate through the set. (Or I should get back to my eventfd-based watchman rewrite...)

Pull request: https://github.com/openshift/li/pull/1948

Commit pushed to master at https://github.com/openshift/li
https://github.com/openshift/li/commit/82932afb43855929e0ce95e2e819887e8d81a9fb
Watchman
Bug 1003684 - Avoid double restarts of OOM-killed apps
De-dupe OOM kill messages within a time interval to avoid multiple-restart scenarios.

Checked on devenv_3934. When gear memory usage goes over the limit, it kills the process and restarts the gear only once:

Oct 23 07:44:11 ip-10-73-144-158 kernel: java invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0, oom_score_adj=0
Oct 23 07:44:11 ip-10-73-144-158 kernel: [<ffffffff8111d2c2>] ? oom_kill_process+0x82/0x2a0
Oct 23 07:44:11 ip-10-73-144-158 kernel: [<ffffffff81173854>] ? mem_cgroup_handle_oom+0x274/0x2a0
Oct 23 07:44:11 ip-10-73-144-158 kernel: [<ffffffff81171290>] ? memcg_oom_wake_function+0x0/0xa0
Oct 23 07:44:11 ip-10-73-144-158 kernel: [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
Oct 23 07:45:44 ip-10-73-144-158 oo-admin-ctl-gears: Restarted: 52678e53b104d78029000001
Oct 23 07:45:44 ip-10-73-144-158 rhc-watchman[2032]: watchman restarted user 52678e53b104d78029000001: application jbeap1
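The de-dupe described in the commit message ("construct a set of uuids to restart and then iterate through the set") can be sketched as follows. This is a Python illustration of the idea only, not the actual fix (the real Watchman in openshift/li is Ruby, and the helper name here is hypothetical): collect the gear uuids from the matching kernel lines into a set, so two OOM-kill lines with different timestamps for the same gear still yield a single restart.

```python
import re

# Matches the kernel OOM-kill lines Watchman greps out of /var/log/messages.
OOM_LINE = re.compile(r"Task in /openshift/(?P<uuid>[0-9a-f]+) killed")

def gears_to_restart(log_lines):
    """De-dupe OOM kill messages: one restart per gear uuid per scan interval."""
    uuids = set()
    for line in log_lines:
        m = OOM_LINE.search(line)
        if m:
            uuids.add(m.group("uuid"))
    return sorted(uuids)

messages = [
    "Sep 30 09:35:13 ex-med-node5 kernel: Task in /openshift/5222ebe05973ca378f00008a "
    "killed as a result of limit of /openshift/5222ebe05973ca378f00008a",
    "Sep 30 09:35:14 ex-med-node5 kernel: Task in /openshift/5222ebe05973ca378f00008a "
    "killed as a result of limit of /openshift/5222ebe05973ca378f00008a",
]

# Two kill lines, different timestamps, same gear: exactly one restart.
print(gears_to_restart(messages))  # → ['5222ebe05973ca378f00008a']
```

Restarting from a set rather than per matching line is what removes the second restart, and with it the port-8080 bind race seen in the original incident.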