Bug 999641

Summary: [watchman] detect gears hitting memsw limit
Product: OpenShift Online
Reporter: Andy Grimm <agrimm>
Component: Containers
Assignee: Fotios Lindiakos <fotios>
Status: CLOSED UPSTREAM
QA Contact: libra bugs <libra-bugs>
Severity: medium
Priority: medium
Version: 2.x
CC: agrimm, dmcphers, jgoulding, jhonce, jkeck
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-09-25 17:46:11 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Andy Grimm 2013-08-21 18:40:15 UTC
Gears which hit their cgroup's memsw limit may get "stuck" in a state where ssh logins are not possible, gear control operations do not work, and requests are not being served.  Where possible, we should "heal" the gear.

I have found that in several cases, gears in this state had "defunct" processes which were simply trying to exit. Raising memory.memsw.limit_in_bytes slightly (from 612MB to 640MB, for example) allowed the defunct process to exit, and the gear returned to a "normal" state, able to serve requests.

My suggestion would be to check for gears where memory.memsw.usage_in_bytes is within some threshold of memory.memsw.limit_in_bytes (they may not be exactly equal), and then watch for memory.memsw.failcnt to be increasing.  For gears meeting these criteria (and maybe some others), raise memory.memsw.limit_in_bytes by a particular value or percentage, and give the gear some amount of time to recover.
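A minimal sketch of the suggested check, in Python for illustration (the actual watchman plugin language is not shown here). The threshold, bump percentage, and function names are assumptions, not part of any existing API; the cgroup control-file names are the ones cited above.

```python
import os

# Illustrative values, not tuned: flag gears within 5% of their memsw
# limit, and bump the limit by 5% to let a defunct process finish exiting.
USAGE_THRESHOLD = 0.95
LIMIT_BUMP = 1.05

def read_int(cgroup_path, name):
    """Read a single integer from a cgroup control file."""
    with open(os.path.join(cgroup_path, name)) as f:
        return int(f.read().strip())

def gear_near_limit(cgroup_path, prev_failcnt):
    """Return (stuck?, current failcnt): stuck means usage is within the
    threshold of the limit AND failcnt has risen since the last check."""
    usage = read_int(cgroup_path, "memory.memsw.usage_in_bytes")
    limit = read_int(cgroup_path, "memory.memsw.limit_in_bytes")
    failcnt = read_int(cgroup_path, "memory.memsw.failcnt")
    stuck = usage >= USAGE_THRESHOLD * limit and failcnt > prev_failcnt
    return stuck, failcnt

def raise_limit(cgroup_path):
    """Bump memory.memsw.limit_in_bytes so a stuck gear can recover."""
    limit = read_int(cgroup_path, "memory.memsw.limit_in_bytes")
    new_limit = int(limit * LIMIT_BUMP)
    with open(os.path.join(cgroup_path, "memory.memsw.limit_in_bytes"), "w") as f:
        f.write(str(new_limit))
    return new_limit
```

A real plugin would persist the previous failcnt per gear between polling intervals, and would likely restore the original limit once the gear serves requests again.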

Comment 1 Jhon Honce 2013-09-25 17:42:58 UTC
Does a force-stop clear the issue?

Comment 2 Jhon Honce 2013-09-25 17:46:11 UTC
Track status via https://trello.com/c/wxsVyli6