Bug 999641 - [watchman] detect gears hitting memsw limit
Summary: [watchman] detect gears hitting memsw limit
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Version: 2.x
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Fotios Lindiakos
QA Contact: libra bugs
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2013-08-21 18:40 UTC by Andy Grimm
Modified: 2016-11-08 03:47 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-09-25 17:46:11 UTC
Target Upstream Version:
Embargoed:



Description Andy Grimm 2013-08-21 18:40:15 UTC
Gears which hit their cgroup's memsw limit may get "stuck" in a state where ssh logins are not possible, gear control operations do not work, and requests are not being served.  Where possible, we should "heal" the gear.

I have found that in several cases, gears in this state had "defunct" processes which were simply trying to exit. Raising memory.memsw.limit_in_bytes slightly (for example, from 612MB to 640MB) allowed the defunct processes to exit, and the gear went back to a "normal" state, able to serve requests.
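
The manual workaround above could be sketched roughly as follows, in Python. This is only an illustration, not watchman code: the function name raise_memsw_limit, the cgroup_dir argument, and the 28MB default bump (the difference between 612MB and 640MB in the example) are assumptions. Writing to a real cgroup file requires root, and the kernel may round the stored value.

```python
from pathlib import Path

MIB = 1024 * 1024


def raise_memsw_limit(cgroup_dir, bump_bytes=28 * MIB):
    """Raise memory.memsw.limit_in_bytes by bump_bytes; return the value written.

    cgroup_dir is the gear's memory cgroup directory (hypothetical argument;
    the real path depends on how the node mounts its cgroups).
    """
    limit_file = Path(cgroup_dir) / "memory.memsw.limit_in_bytes"
    current = int(limit_file.read_text().strip())
    new_limit = current + bump_bytes
    # cgroup v1 accepts a plain byte count written to the control file.
    limit_file.write_text(str(new_limit))
    return new_limit
```

A caller would pass the gear's own cgroup directory and then give the gear some time to recover before checking it again.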

My suggestion would be to check for gears where memory.memsw.usage_in_bytes is within some threshold of memory.memsw.limit_in_bytes (they may not be exactly equal), and then watch for memory.memsw.failcnt to increase. For gears meeting these criteria (and maybe some others), raise memory.memsw.limit_in_bytes by a fixed value or a percentage, and give the gear some amount of time to recover.
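
A minimal sketch of that check, in Python, might look like the following. Everything here is an assumption made for illustration: the function names, the 95% threshold, and the 5% bump are not taken from watchman, and the caller is assumed to have already read the counters from the gear's cgroup files at two or more points in time.

```python
def is_near_limit(usage, limit, threshold=0.95):
    """True when usage is within the threshold fraction of the limit."""
    return usage >= limit * threshold


def failcnt_increasing(samples):
    """True when memory.memsw.failcnt grew between the first and last sample."""
    return len(samples) >= 2 and samples[-1] > samples[0]


def needs_healing(usage, limit, failcnt_samples, threshold=0.95):
    """Combine both criteria: usage near the limit and failcnt still rising."""
    return is_near_limit(usage, limit, threshold) and failcnt_increasing(failcnt_samples)


def bumped_limit(limit, percent=5):
    """New limit after raising the current one by a whole-number percentage."""
    return limit + limit * percent // 100
```

Requiring both conditions avoids touching gears that are merely running close to their limit without failing allocations. A gear flagged by needs_healing would get its limit set to bumped_limit(limit) and then be left alone for a grace period.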

Comment 1 Jhon Honce 2013-09-25 17:42:58 UTC
Does a force-stop clear the issue?

Comment 2 Jhon Honce 2013-09-25 17:46:11 UTC
Track status via https://trello.com/c/wxsVyli6

