Gears which hit their cgroup's memsw limit may get "stuck" in a state where ssh logins are not possible, gear control operations do not work, and requests are not being served. Where possible, we should "heal" the gear. I have found that in several cases, gears in this state had "defunct" processes which were simply trying to exit. Raising memory.memsw.limit_in_bytes slightly (from 612MB to 640MB, for example) allowed the defunct process to exit, and the gear went back to a "normal" state, able to serve requests. My suggestion would be to check for gears where memory.memsw.usage_in_bytes is within some threshold of memory.memsw.limit_in_bytes (they may not be exactly equal), and then watch for memory.memsw.failcnt to be increasing. For gears meeting these criteria (and maybe some others), raise memory.memsw.limit_in_bytes by a particular value or percentage, and give the gear some amount of time to recover.
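To make the suggestion concrete, here is a rough Python sketch of the check and the "heal" step. It is only an illustration and makes several assumptions: the function names, the 98% threshold, the 5 second sampling interval, and the 5% bump are all made up for this example, and the caller is expected to pass in the gear's cgroup v1 memory directory (the actual path depends on how the node mounts cgroups). Writing the limit would normally require root.

```python
import os
import time


def read_counter(cgroup_dir, name):
    """Read one integer from a cgroup v1 memory control file."""
    with open(os.path.join(cgroup_dir, name)) as f:
        return int(f.read().strip())


def near_limit(usage, limit, threshold=0.98):
    """True when usage is within the threshold fraction of the limit.

    The 0.98 default is an arbitrary illustrative value; usage and limit
    may not be exactly equal even when the gear is stuck.
    """
    return usage >= limit * threshold


def raised_limit(limit, percent=5):
    """Return the limit increased by a whole-number percentage (assumed 5%)."""
    return limit + limit * percent // 100


def needs_healing(cgroup_dir, threshold=0.98, interval=5.0, sleep=time.sleep):
    """Check whether a gear looks stuck at its memsw limit.

    A gear qualifies when memsw usage is near the limit AND
    memory.memsw.failcnt grows between two samples taken
    `interval` seconds apart.
    """
    usage = read_counter(cgroup_dir, "memory.memsw.usage_in_bytes")
    limit = read_counter(cgroup_dir, "memory.memsw.limit_in_bytes")
    if not near_limit(usage, limit, threshold):
        return False
    before = read_counter(cgroup_dir, "memory.memsw.failcnt")
    sleep(interval)
    after = read_counter(cgroup_dir, "memory.memsw.failcnt")
    return after > before


def heal(cgroup_dir, percent=5):
    """Raise memory.memsw.limit_in_bytes by `percent` and return the new value."""
    limit = read_counter(cgroup_dir, "memory.memsw.limit_in_bytes")
    new_limit = raised_limit(limit, percent)
    with open(os.path.join(cgroup_dir, "memory.memsw.limit_in_bytes"), "w") as f:
        f.write(str(new_limit))
    return new_limit
```

A watchdog could call needs_healing() for each gear's cgroup directory, call heal() on the ones that qualify, and then wait some amount of time before re-checking whether the gear recovered. Whether the raised limit should later be restored to its original value is left open here.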
Does a force-stop clear the issue?
Track status via https://trello.com/c/wxsVyli6