Bug 1130488

Summary: Restarting watchman while the oom plugin is working on a gear causes watchman to die, and it cannot be restarted
Product: OpenShift Online
Component: Containers
Version: 2.x
Target Release: 2.x
Hardware: Unspecified
OS: Unspecified
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Reporter: Meng Bo <bmeng>
Assignee: Jhon Honce <jhonce>
QA Contact: libra bugs <libra-bugs>
CC: dmcphers, jokerman, mmccomas, nicholas_schuetz
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-07-07 23:48:58 UTC

Description Meng Bo 2014-08-15 11:43:25 UTC
Description of problem:
Create a gear on the node and run a script that uses up its memory. Restart openshift-watchman while the oom plugin is working on the gear. Watchman dies and cannot be restarted afterwards.

Version-Release number of selected component (if applicable):
devenv_5079

How reproducible:
always

Steps to Reproduce:
1. Create a gear on the node
2. Make the gear use up its memory
# perl -np -e '$x="0123456789"x1000000' < /dev/zero &
3. Watch the syslog and restart openshift-watchman as soon as the oom plugin reports the gear as under OOM

Actual results:
Watchman dies and cannot be restarted.

Expected results:
Restarting watchman should not cause it to die.

Additional info:
Logs in syslog:
Aug 15 11:21:55 ip-10-231-2-147 watchman[16010]: OOM Plugin: Found gear 53ee2515ae91192737000027 under OOM.
Aug 15 11:21:55 ip-10-231-2-147 watchman[16010]: OOM Plugin: Increasing memory for gear 53ee2515ae91192737000027 to 705901363 and restarting
Aug 15 11:22:07 ip-10-231-2-147 watchman[16010]: OOM Plugin: Failed to lower memsw limit for gear 53ee2515ae91192737000027 from 705901363 to 641728512
Aug 15 11:22:07 ip-10-231-2-147 watchman[16010]: starting 53ee2515ae91192737000027
Aug 15 11:23:37 ip-10-231-2-147 watchman[16010]: OOM Plugin: Start failed for gear 53ee2515ae91192737000027: Shell command '/usr/sbin/oo-admin-ctl-gears startgear 53ee2515ae91192737000027' exceeded timeout of 90"  
Aug 15 11:23:37 ip-10-231-2-147 watchman[16010]: SystemExit raised from Watchman plugin #<OomPlugin:0x00000002cbc6c0>:#012/opt/rh/ruby193/root/usr/share/gems/gems/daemons-1.0.10/lib/daemons/application.rb:164:in `exit'#012/opt/rh/ruby193/root/usr/share/gems/gems/daemons-1.0.10/lib/daemons/application.rb:164:in `block in start_load'#012/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.29.3/lib/openshift-origin-node/utils/shell_exec.rb:164:in `call'#012/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.29.3/lib/openshift-origin-node/utils/shell_exec.rb:164:in `select'#012/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.29.3/lib/openshift-origin-node/utils/shell_exec.rb:164:in `block in read_results'#012/opt/rh/ruby193/root/usr/share/ruby/timeout.rb:69:in `timeout'#012/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.29.3/lib/openshift-origin-node/utils/shell_exec.rb:162:in `read_results'#012/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.29.3/lib/openshift-origin-node/utils/shell_exec.rb:127:in `block (2 levels) in oo_spawn'#012/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.29.3/lib/openshift-origin-node/utils/shell_exec.rb:96:in `pipe'#012/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.29.3/lib/openshift-origin-node/utils/shell_exec.rb:96:in `block in oo_spawn'#012/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.29.3/lib/openshift-origin-node/utils/shell_exec.rb:95:in `pipe'#012/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.29.3/lib/openshift-origin-node/utils/shell_exec.rb:95:in `oo_spawn'#012/etc/openshift/watchman/plugins.d/oom_plugin.rb:94:in `block in apply'#012/etc/openshift/watchman/plugins.d/oom_plugin.rb:55:in `each'#012/etc/openshift/watchman/plugins.d/oom_plugin.rb:55:in `apply'#012/usr/sbin/oo-watchman:165:in `block (2 levels) in apply'#012/usr/sbin/oo-watchman:163:in `each'#012/usr/sbin/oo-watchman:163:in 
`block in apply'#012/usr/sbin/oo-watchman:160:in `loop'#012/us
                                                 

The watchman config is as follows:
RETRY_DELAY=30
RETRY_PERIOD=90
STATE_CHANGE_DELAY=60
OOM_CHECK_PERIOD=0
WATCHMAN_DEBUG=true


# oo-accept-node 
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19001 cgroups controller: memory
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19002 cgroups controller: memory
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19010 cgroups controller: memory
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19011 cgroups controller: memory
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19012 cgroups controller: memory
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19013 cgroups controller: memory
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19014 cgroups controller: memory
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19001 cgroups controller: freezer
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19002 cgroups controller: freezer
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19010 cgroups controller: freezer
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19011 cgroups controller: freezer
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19012 cgroups controller: freezer
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19013 cgroups controller: freezer
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19014 cgroups controller: freezer
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19001 cgroups controller: net_cls
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19002 cgroups controller: net_cls
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19010 cgroups controller: net_cls
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19011 cgroups controller: net_cls
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19012 cgroups controller: net_cls
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19013 cgroups controller: net_cls
FAIL: 53ee2515ae91192737000027 has a process missing from cgroups: 19014 cgroups controller: net_cls

Comment 1 Meng Bo 2014-08-15 11:45:02 UTC
Once the user deletes the gear, an admin can restart watchman.

Comment 2 openshift-github-bot 2015-06-19 00:18:57 UTC
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/06e9c67e9b731dc9410b5763fe867388d8d6fa21
Bug 1130488 - Capture StandardError not Exception

* Capturing Exception prevented SystemExit from operating
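
The commit message above can be illustrated with a minimal Ruby sketch. It is not the actual oo-watchman code: ExitingPlugin and both run_* helpers are hypothetical names made up for this example. It relies only on standard Ruby behaviour: exit raises SystemExit, which is a subclass of Exception but not of StandardError.

```ruby
# Hypothetical stand-in for a Watchman plugin; not the real OomPlugin.
class ExitingPlugin
  def apply
    # Kernel#exit raises SystemExit, much like a daemon being told to stop.
    exit(0)
  end
end

# Old behaviour (assumed shape): rescuing Exception also catches SystemExit,
# so the exit request is swallowed and the caller keeps running.
def run_rescuing_exception(plugin)
  plugin.apply
  :finished
rescue Exception => e # rubocop:disable Lint/RescueException
  "swallowed #{e.class}"
end

# New behaviour (assumed shape): StandardError does not cover SystemExit,
# so the exit request propagates to the caller.
def run_rescuing_standard_error(plugin)
  plugin.apply
  :finished
rescue StandardError => e
  "swallowed #{e.class}"
end

puts run_rescuing_exception(ExitingPlugin.new)

begin
  run_rescuing_standard_error(ExitingPlugin.new)
rescue SystemExit => e
  puts "SystemExit propagated (status #{e.status})"
end
```

With the narrower rescue, ordinary plugin errors such as a command timeout are still caught, while a requested exit is allowed to end the process.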

Comment 3 Meng Bo 2015-06-24 09:27:56 UTC
Restarting watchman while the oom plugin is working on a gear no longer kills watchman.

Verified on devenv_5556