Description of problem:
Create a gear on a node and run a script that uses up its memory. Restart openshift-watchman while the OOM plugin is working on the gear. Watchman dies and cannot be restarted.

Version-Release number of selected component (if applicable):
devenv_5079

How reproducible:
always

Steps to Reproduce:
1. Create a gear on a node.
2. Make the gear use up its memory:
   # perl -np -e '$x="0123456789"x1000000' < /dev/zero &
3. Check the syslog and restart openshift-watchman when the OOM plugin reports the gear as under OOM.

Actual results:
Watchman dies and cannot be restarted.

Expected results:
Watchman should not die.

Additional info:
Logs in syslog:
Aug 15 11:21:55 ip-10-231-2-147 watchman[16010]: OOM Plugin: Found gear 53ee2515ae91192737000027 under OOM.
Aug 15 11:21:55 ip-10-231-2-147 watchman[16010]: OOM Plugin: Increasing memory for gear 53ee2515ae91192737000027 to 705901363 and restarting
Aug 15 11:22:07 ip-10-231-2-147 watchman[16010]: OOM Plugin: Failed to lower memsw limit for gear 53ee2515ae91192737000027 from 705901363 to 641728512
Aug 15 11:22:07 ip-10-231-2-147 watchman[16010]: starting 53ee2515ae91192737000027
Aug 15 11:23:37 ip-10-231-2-147 watchman[16010]: OOM Plugin: Start failed for gear 53ee2515ae91192737000027: Shell command '/usr/sbin/oo-admin-ctl-gears startgear 53ee2515ae91192737000027' exceeded timeout of 90"
Aug 15 11:23:37 ip-10-231-2-147 watchman[16010]: SystemExit raised from Watchman plugin #<OomPlugin:0x00000002cbc6c0>:#012/opt/rh/ruby193/root/usr/share/gems/gems/daemons-1.0.10/lib/daemons/application.rb:164:in `exit'#012/opt/rh/ruby193/root/usr/share/gems/gems/daemons-1.0.10/lib/daemons/application.rb:164:in `block in start_load'#012/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.29.3/lib/openshift-origin-node/utils/shell_exec.rb:164:in `call'#012/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.29.3/lib/openshift-origin-node/utils/shell_exec.rb:164:in `select'#012/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.29.3/lib/openshift-origin-node/utils/shell_exec.rb:164:in `block in read_results'#012/opt/rh/ruby193/root/usr/share/ruby/timeout.rb:69:in `timeout'#012/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.29.3/lib/openshift-origin-node/utils/shell_exec.rb:162:in `read_results'#012/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.29.3/lib/openshift-origin-node/utils/shell_exec.rb:127:in `block (2 levels) in oo_spawn'#012/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.29.3/lib/openshift-origin-node/utils/shell_exec.rb:96:in `pipe'#012/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.29.3/lib/openshift-origin-node/utils/shell_exec.rb:96:in `block in oo_spawn'#012/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.29.3/lib/openshift-origin-node/utils/shell_exec.rb:95:in `pipe'#012/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.29.3/lib/openshift-origin-node/utils/shell_exec.rb:95:in `oo_spawn'#012/etc/openshift/watchman/plugins.d/oom_plugin.rb:94:in `block in apply'#012/etc/openshift/watchman/plugins.d/oom_plugin.rb:55:in `each'#012/etc/openshift/watchman/plugins.d/oom_plugin.rb:55:in `apply'#012/usr/sbin/oo-watchman:165:in `block (2 levels) in apply'#012/usr/sbin/oo-watchman:163:in `each'#012/usr/sbin/oo-watchman:163:in `block in apply'#012/usr/sbin/oo-watchman:160:in `loop'#012/us

The watchman config is as follows:
RETRY_DELAY=30
RETRY_PERIOD=90
STATE_CHANGE_DELAY=60
OOM_CHECK_PERIOD=0
WATCHMAN_DEBUG=true

# oo-accept-node
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19001 cgroups controller: memory
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19002 cgroups controller: memory
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19010 cgroups controller: memory
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19011 cgroups controller: memory
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19012 cgroups controller: memory
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19013 cgroups controller: memory
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19014 cgroups controller: memory
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19001 cgroups controller: freezer
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19002 cgroups controller: freezer
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19010 cgroups controller: freezer
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19011 cgroups controller: freezer
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19012 cgroups controller: freezer
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19013 cgroups controller: freezer
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19014 cgroups controller: freezer
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19001 cgroups controller: net_cls
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19002 cgroups controller: net_cls
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19010 cgroups controller: net_cls
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19011 cgroups controller: net_cls
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19012 cgroups controller: net_cls
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19013 cgroups controller: net_cls
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19014 cgroups controller: net_cls
Once the user deletes the gear, an admin can restart watchman.
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/06e9c67e9b731dc9410b5763fe867388d8d6fa21
Bug 1130488 - Capture StandardError not Exception
* Capturing Exception prevented SystemExit from operating
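
For illustration, a minimal Ruby sketch of why rescuing StandardError instead of Exception matters here. This is not the actual watchman or plugin code: the class ExitingPlugin and both run_* methods are hypothetical names made up for this example. It relies only on the fact that Kernel#exit raises SystemExit, which inherits from Exception but not from StandardError.

```ruby
# Hypothetical stand-in for a watchman plugin; not the real OomPlugin.
class ExitingPlugin
  def apply
    # Kernel#exit raises SystemExit, simulating a shutdown request
    # that arrives while the plugin is busy.
    exit(0)
  end
end

# Rescuing Exception also catches SystemExit, so the exit request
# is swallowed and the caller carries on as if nothing happened.
def run_rescuing_exception(plugin)
  plugin.apply
  'finished'
rescue Exception => e
  "swallowed #{e.class}"
end

# StandardError does not include SystemExit, so the exit request
# propagates past this rescue clause to the caller.
def run_rescuing_standard_error(plugin)
  plugin.apply
  'finished'
rescue StandardError => e
  "swallowed #{e.class}"
end

puts run_rescuing_exception(ExitingPlugin.new) # prints "swallowed SystemExit"

begin
  run_rescuing_standard_error(ExitingPlugin.new)
rescue SystemExit
  puts 'SystemExit propagated' # reached: the exit was not swallowed
end
```

The outer begin/rescue is only there so the example can report the result instead of terminating; in a daemon the propagated SystemExit would normally be left to end the process.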
Restarting watchman while the OOM plugin is working no longer kills watchman. Verified on devenv_5556.