Bug 999641 - [watchman] detect gears hitting memsw limit
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Priority: medium
Severity: medium
Assigned To: Fotios Lindiakos
QA Contact: libra bugs
Reported: 2013-08-21 14:40 EDT by Andy Grimm
Modified: 2016-11-07 22:47 EST
Doc Type: Bug Fix
Last Closed: 2013-09-25 13:46:11 EDT
Type: Bug

Description Andy Grimm 2013-08-21 14:40:15 EDT
Gears which hit their cgroup's memsw limit may get "stuck" in a state where ssh logins are not possible, gear control operations do not work, and requests are not being served.  Where possible, we should "heal" the gear.

I have found that in several cases, gears in this state had "defunct" processes which were simply trying to exit. Raising memory.memsw.limit_in_bytes slightly (from 612MB to 640MB, for example) allowed the defunct process to exit, and the gear went back to a "normal" state, able to serve requests.

My suggestion would be to check for gears where memory.memsw.usage_in_bytes is within some threshold of memory.memsw.limit_in_bytes (they may not be exactly equal), and then watch for memory.memsw.failcnt to be increasing.  For gears meeting these criteria (and maybe some others), raise memory.memsw.limit_in_bytes by a particular value or percentage, and give the gear some amount of time to recover.
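
The check suggested above could be sketched roughly as follows. This is only an illustration in Python, not Watchman's actual implementation: the function names, the 4MB default threshold, and the percentage bump are all assumptions, and the caller is assumed to supply the gear's cgroup directory and a series of failcnt samples taken over time. Only the memory.memsw.* file names come from the cgroup memory controller itself.

```python
import os

def read_counter(cgroup_dir, name):
    """Read one integer value from a cgroup memory controller file."""
    with open(os.path.join(cgroup_dir, name)) as f:
        return int(f.read().strip())

def near_limit(usage, limit, threshold_bytes):
    """True when usage is within threshold_bytes of the limit."""
    return limit - usage <= threshold_bytes

def failcnt_increasing(samples):
    """True when successive failcnt samples show at least one increase."""
    return any(later > earlier for earlier, later in zip(samples, samples[1:]))

def raised_limit(limit, percent):
    """New limit after raising by percent, rounded up to a whole byte."""
    return limit + (limit * percent + 99) // 100

def needs_healing(cgroup_dir, failcnt_samples, threshold_bytes=4 * 1024 * 1024):
    """Apply both criteria: usage close to the limit and failcnt rising.

    cgroup_dir is a hypothetical path to the gear's memory cgroup;
    the default threshold is an arbitrary illustrative value.
    """
    usage = read_counter(cgroup_dir, "memory.memsw.usage_in_bytes")
    limit = read_counter(cgroup_dir, "memory.memsw.limit_in_bytes")
    return near_limit(usage, limit, threshold_bytes) and failcnt_increasing(failcnt_samples)
```

A monitor built on this would sample memory.memsw.failcnt a few times, call needs_healing, and if it returns true write raised_limit(limit, percent) back to memory.memsw.limit_in_bytes before waiting for the gear to recover. Whether the bump should be a fixed value or a percentage, and how long to wait, is left open here just as it is in the suggestion above.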
Comment 1 Jhon Honce 2013-09-25 13:42:58 EDT
Does a force-stop clear the issue?
Comment 2 Jhon Honce 2013-09-25 13:46:11 EDT
Track status via https://trello.com/c/wxsVyli6
