Description of problem:
Under heavy memory pressure, oom-kill starts killing non-interactive processes based on sleep average. Unfortunately, sshd is usually one of the early processes to get killed, which means admin must hike out to datacenter floor or hopefully have a console switch in place. The attached patch to /etc/init.d/sshd uses the rhel5 kernel's oom_adj tunable to provide moderate protection from oom-kill so that sshd remains running for a bit longer.

Version-Release number of selected component (if applicable):
all versions beginning with rhel 5.0 (this tunable was added in rhel 5.0)

Steps to patch initscript:
1. cp sshd-oom_adj-example.patch /etc/init.d
2. cd /etc/init.d
3. cp sshd{,.orig}
4. patch -p0 < sshd-oom_adj-example.patch

Expected results:
sshd gains a tunable amount of protection from oom-kill

See also:
https://bugzilla.redhat.com/show_bug.cgi?id=244739 HOWTO - oom kill policy tuning
https://bugzilla.redhat.com/show_bug.cgi?id=239313 Document oom_adj & oom_score
Created attachment 244601
patch for /etc/init.d/sshd to use oom-kill tunable
This BZ is an enhancement request for the sshd init script in openssh as shipped by Red Hat. It has no effect on normal production; it only comes into play when oom-kill starts killing non-interactive PIDs.
As the oom_adj value seems to be inherited from parent process to child I propose a slightly different approach: echo 3 >/proc/self/oom_adj just before the ssh daemon is executed.
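A minimal sketch of this approach, assuming the init script starts the daemon from a shell function. The names set_oom_adj and start_sshd, the optional file argument, and the /usr/sbin/sshd path are hypothetical and not taken from the attached patch; the value 3 is simply the one proposed above, so check the kernel documentation for how your kernel interprets it. Because echo is a shell builtin, /proc/self refers to the shell doing the write, and the daemon started from that shell inherits the value.

```shell
#!/bin/sh
# Hypothetical helper (not from the attached patch): write an oom_adj
# value for the calling shell. The second argument exists only so the
# function can be exercised against a scratch file.
set_oom_adj() {
    value="$1"
    file="${2:-/proc/self/oom_adj}"
    # Kernels without the tunable have no such file; skip quietly.
    [ -w "$file" ] || return 0
    # echo is a builtin, so /proc/self is this shell, not a child.
    echo "$value" > "$file"
}

# Hypothetical start wrapper: adjust inside a subshell so the init
# script's own shell keeps its original value, then run the daemon.
# The sshd path is an assumption.
start_sshd() {
    (
        set_oom_adj 3
        exec /usr/sbin/sshd "$@"
    )
}
```

The subshell's exit status is that of the daemon it execs, so the caller can still capture it with RETVAL=$? right after calling start_sshd.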
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Excellent! Your suggestion also gets around the nastiness of interfering with RETVAL from starting sshd. Have you tested it yet, or should I?
On rawhide it seems to work fine.
Also works on rhel5 updated to the 2.6.18-8.1.15.el5 kernel; I will kickstart a 5.0 box and test there as well. To be thorough, I also ran `find /proc -name oom_adj -exec cat {} \; | grep 3' to make sure only sshd and its task had the new oom_adj value (as expected).
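For anyone repeating this check, here is a rough sketch of a stricter variant. The find pipeline prints only the values, and grep 3 would also match 13 or -3; the hypothetical list_oom_adj function below compares the value exactly and prints the PID and process name. The function name and the optional proc-root argument are assumptions added for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print "PID NAME" for every process whose oom_adj
# equals the given value exactly. The optional second argument (proc
# root) exists only so the function can be run against a test tree.
list_oom_adj() {
    want="$1"
    root="${2:-/proc}"
    for f in "$root"/[0-9]*/oom_adj; do
        [ -r "$f" ] || continue
        # A process can exit between the glob and the read.
        val=$(cat "$f" 2>/dev/null) || continue
        if [ "$val" = "$want" ]; then
            dir=${f%/oom_adj}
            # /proc/PID/status carries the process name on its "Name:" line.
            name=$(sed -n 's/^Name:[[:space:]]*//p' "$dir/status" 2>/dev/null)
            echo "${dir##*/} $name"
        fi
    done
}
```

Running list_oom_adj 3 as root after starting sshd should then show which processes carry the adjusted value.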
Works on rhel 5.0, too. Unfortunately, inheritance means that ssh client sessions pick up the oom_adj for their bash shell, too. This means that oom_adj propagates, which is not the intention imho. Hmmmm....
Larry, any comments or suggestions? TIA.
After experimenting with the effects of oom_adj on an ssh client session, it appears that this is a useful side effect. That is, on an active session, oom_score gets artificially raised above the normal sleep avg adjustment. This is good since it actually gives the admin a chance to get a server under control if it's undergoing oom-kill. On an idle session, having oom_adj=3 artificially lowers the normal sleep avg adjustment. This, too, seems good since an idle shell is not being used to recover the machine. The net effect on a box that's trying not to die seems like a positive. Comments?
No go; restarting a service from within the ssh session causes the restarted service to have oom_adj=3, as well. Impacts normal production dangerously. It seems there is no transparent way to protect sshd and just sshd at the moment.
Should not implement; closing as WONTFIX.