999641 – [watchman] detect gears hitting memsw limit

Keywords:
Status:	CLOSED UPSTREAM
Alias:	None
Product:	OpenShift Online
Classification:	Red Hat
Component:	Containers
Sub Component:
Version:	2.x
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Fotios Lindiakos
QA Contact:	libra bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2013-08-21 18:40 UTC by Andy Grimm
Modified:	2016-11-08 03:47 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2013-09-25 17:46:11 UTC
Target Upstream Version:
Embargoed:

Description Andy Grimm 2013-08-21 18:40:15 UTC

Gears that hit their cgroup's memsw limit may get "stuck" in a state where ssh logins are not possible, gear control operations do not work, and requests are not served. Where possible, we should "heal" the gear.

In several cases, I have found that gears in this state had "defunct" processes that were simply trying to exit. Raising memory.memsw.limit_in_bytes slightly (from 612MB to 640MB, for example) allowed the defunct processes to exit, and the gear went back to a "normal" state, able to serve requests.

My suggestion would be to check for gears where memory.memsw.usage_in_bytes is within some threshold of memory.memsw.limit_in_bytes (they may not be exactly equal), and then watch for memory.memsw.failcnt to be increasing. For gears meeting these criteria (and maybe some others), raise memory.memsw.limit_in_bytes by a particular value or percentage, and give the gear some amount of time to recover.
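
A rough sketch of that check, written in Python for illustration rather than as watchman's actual code. Only the three memory.memsw.* file names come from this report. The gear's cgroup directory (cgroup_dir), the 95% threshold, the 5-second sampling interval, the 5% bump, and all function names are assumptions made up for the example.

```python
import os
import time


def read_counter(cgroup_dir, name):
    """Read one integer-valued cgroup v1 control file, e.g. memory.memsw.failcnt."""
    with open(os.path.join(cgroup_dir, name)) as f:
        return int(f.read().strip())


def near_limit(usage, limit, threshold=0.95):
    """True when usage is within the (assumed) threshold fraction of the limit."""
    return limit > 0 and usage >= limit * threshold


def raised_limit(limit, percent=5, page_size=4096):
    """Return limit bumped by `percent`, rounded up to a whole page."""
    bumped = limit + limit * percent // 100
    return -(-bumped // page_size) * page_size


def needs_healing(cgroup_dir, interval=5.0, threshold=0.95, sleep=time.sleep):
    """Usage is near the memsw limit AND failcnt grew during the interval."""
    usage = read_counter(cgroup_dir, "memory.memsw.usage_in_bytes")
    limit = read_counter(cgroup_dir, "memory.memsw.limit_in_bytes")
    if not near_limit(usage, limit, threshold):
        return False
    before = read_counter(cgroup_dir, "memory.memsw.failcnt")
    sleep(interval)  # sampling window; injectable so the check can be tested
    after = read_counter(cgroup_dir, "memory.memsw.failcnt")
    return after > before


def heal(cgroup_dir, percent=5):
    """Raise memory.memsw.limit_in_bytes by `percent` and return the new value."""
    limit = read_counter(cgroup_dir, "memory.memsw.limit_in_bytes")
    new_limit = raised_limit(limit, percent)
    with open(os.path.join(cgroup_dir, "memory.memsw.limit_in_bytes"), "w") as f:
        f.write(str(new_limit))
    return new_limit
```

A caller would run needs_healing() for each gear's cgroup directory, call heal() on those that return True, and then wait some time before re-checking whether the gear has recovered. Whether the limit should later be lowered back to its original value is left open here, as it is in the suggestion above.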

Comment 1 Jhon Honce 2013-09-25 17:42:58 UTC

Does a force-stop clear the issue?

Comment 2 Jhon Honce 2013-09-25 17:46:11 UTC

Track status via https://trello.com/c/wxsVyli6