Bug 1121217

Summary: watchman takes up gigs of memory, times out on restart
Product: OpenShift Online
Reporter: Sten Turpin <sten>
Component: Containers
Assignee: Jhon Honce <jhonce>
Status: CLOSED CURRENTRELEASE
QA Contact: libra bugs <libra-bugs>
Severity: medium
Priority: medium
Version: 2.x
CC: agrimm, bmeng, jhonce, jokerman, mmccomas
Target Release: 2.x
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Clones: 1127714
Last Closed: 2014-10-10 00:49:11 UTC
Type: Bug
Bug Blocks: 1127714

Description Sten Turpin 2014-07-18 16:29:58 UTC
Description of problem: watchman uses a large amount of memory and times out when a restart is attempted


Version-Release number of selected component (if applicable): openshift-origin-node-util-1.26.3-1.el6oso.noarch


How reproducible: rarely


Steps to Reproduce:
1. $ ps aux | grep -i watchman
root      10096  4.3 12.3 2332684 930992 ?      Sl   Jun23 1599:38 watchman

2. $ sudo service openshift-watchman restart
Stopping Watchman.................................................Watchman operation timed out
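
For readers who want the figures from step 1 in friendlier units, here is a minimal Ruby sketch that parses the `ps aux` line shown above. It assumes the usual Linux column order (USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND) with VSZ and RSS reported in KiB; the names `PsSample` and `parse_ps_aux_line` are made up for illustration.

```ruby
# One row of `ps aux`, with sizes kept in KiB as ps reports them.
PsSample = Struct.new(:pid, :vsz_kib, :rss_kib, :command) do
  # Virtual size in MiB, rounded to one decimal place.
  def vsz_mib
    (vsz_kib / 1024.0).round(1)
  end

  # Resident set size in MiB, rounded to one decimal place.
  def rss_mib
    (rss_kib / 1024.0).round(1)
  end
end

def parse_ps_aux_line(line)
  # Split into at most 11 fields so a command containing spaces stays whole.
  fields = line.strip.split(/\s+/, 11)
  PsSample.new(Integer(fields[1]), Integer(fields[4]), Integer(fields[5]), fields[10])
end

line = 'root      10096  4.3 12.3 2332684 930992 ?      Sl   Jun23 1599:38 watchman'
sample = parse_ps_aux_line(line)
puts "pid=#{sample.pid} vsz=#{sample.vsz_mib} MiB rss=#{sample.rss_mib} MiB"
# prints: pid=10096 vsz=2278.0 MiB rss=909.2 MiB
```

Under those assumptions the process holds roughly 909 MiB resident and about 2.2 GiB of virtual address space, which matches the "gigs of memory" in the summary.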


Actual results:


Expected results:
Watchman should not use this much memory and should restart without timing out.

Additional info:

Comment 2 Rajat Chopra 2014-07-29 19:27:08 UTC
Added debug messages that print memory information after each watchman plugin is invoked. The messages go to /var/log/messages, and debug mode can be enabled by setting the environment variable 'WATCHMAN_DEBUG' to true.

Hopefully this lets us narrow down which plugin causes the leak.

https://github.com/openshift/origin-server/pull/5670
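
As an illustration only, per-plugin memory reporting of this kind could look roughly like the Ruby sketch below. This is not the code from the pull request: the `apply` plugin interface, the `run_plugins` and `rss_kb` helpers, and the choice of VmRSS from /proc/self/status as the memory figure are all assumptions.

```ruby
# Hypothetical sketch; not the code from the pull request.
# Extract the resident set size (kB) from the text of /proc/<pid>/status.
def rss_kb(status_text)
  match = status_text.match(/^VmRSS:\s+(\d+)\s+kB/)
  match ? Integer(match[1]) : nil
end

# Run each plugin, then log a memory figure when debugging is enabled.
# `plugins` is any list of objects responding to #apply (assumed interface).
# `read_status` and `log` are injected so the sketch runs without syslog.
def run_plugins(plugins, debug:, read_status:, log:)
  plugins.each do |plugin|
    plugin.apply
    next unless debug
    log.call("Memory : #{rss_kb(read_status.call)}, Plugin : #{plugin.class.name}")
  end
end

# Stand-in plugin that does nothing.
class NoopPlugin
  def apply; end
end

debug = ENV['WATCHMAN_DEBUG'] == 'true'
read_status = -> { File.exist?('/proc/self/status') ? File.read('/proc/self/status') : '' }
run_plugins([NoopPlugin.new], debug: debug, read_status: read_status, log: ->(msg) { puts msg })
```

Logging after every plugin rather than once per pass is what makes it possible to attribute growth to a single plugin.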

Comment 3 Meng Bo 2014-08-04 06:37:16 UTC
Checked on devenv-stage_946; the debug option has been added to the watchman config.

# cat /etc/sysconfig/watchman
WATCHMAN_DEBUG=true

# tail -f /var/log/messages
Aug 12 00:14:58 ip-10-99-163-60 watchman[21483]: Watchman debug is set to true
Aug 12 00:14:58 ip-10-99-163-60 watchman[21483]: Memory : 36484, Plugin : JbossPlugin
Aug 12 00:14:58 ip-10-99-163-60 watchman[21483]: Memory : 36560, Plugin : OomPlugin
Aug 12 00:14:58 ip-10-99-163-60 watchman[21483]: Memory : 36608, Plugin : EnvPlugin
Aug 12 00:14:58 ip-10-99-163-60 watchman[21483]: Memory : 36608, Plugin : ThrottlerPlugin
Aug 12 00:14:58 ip-10-99-163-60 watchman[21483]: Memory : 36688, Plugin : GearStatePlugin
Aug 12 00:14:58 ip-10-99-163-60 watchman[21483]: Memory : 36688, Plugin : MetricsPlugin
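
To make the debug output easier to compare, the following Ruby sketch extracts the memory figure from each line and computes the change since the previous plugin. It relies only on the `Memory : N, Plugin : Name` format seen above; the helper name `memory_deltas` is hypothetical, and the unit of the figure is not stated in the log.

```ruby
LOG_LINE = /Memory : (\d+), Plugin : (\w+)/

# Return [plugin, memory, delta_from_previous_line] rows for matching lines.
# The first matching line has no predecessor, so its delta is reported as 0.
def memory_deltas(lines)
  previous = nil
  lines.each_with_object([]) do |line, rows|
    match = LOG_LINE.match(line)
    next unless match
    memory = Integer(match[1])
    rows << [match[2], memory, previous ? memory - previous : 0]
    previous = memory
  end
end

log = <<~LOG
  Aug 12 00:14:58 ip-10-99-163-60 watchman[21483]: Watchman debug is set to true
  Aug 12 00:14:58 ip-10-99-163-60 watchman[21483]: Memory : 36484, Plugin : JbossPlugin
  Aug 12 00:14:58 ip-10-99-163-60 watchman[21483]: Memory : 36560, Plugin : OomPlugin
  Aug 12 00:14:58 ip-10-99-163-60 watchman[21483]: Memory : 36608, Plugin : EnvPlugin
  Aug 12 00:14:58 ip-10-99-163-60 watchman[21483]: Memory : 36608, Plugin : ThrottlerPlugin
  Aug 12 00:14:58 ip-10-99-163-60 watchman[21483]: Memory : 36688, Plugin : GearStatePlugin
  Aug 12 00:14:58 ip-10-99-163-60 watchman[21483]: Memory : 36688, Plugin : MetricsPlugin
LOG

memory_deltas(log.lines).each do |plugin, memory, delta|
  puts format('%-16s %6d %+d', plugin, memory, delta)
end
```

A single pass like this one is not enough to identify a leak. The useful signal is the same plugin showing a positive delta pass after pass.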

Comment 4 Jhon Honce 2014-08-06 23:30:44 UTC
Fixed in https://github.com/openshift/origin-server/pull/5695

Comment 5 openshift-github-bot 2014-08-07 01:46:48 UTC
Commits pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/a0149a176f417aee7cc82190b90859158a38c09d
Bug 1121217 - Symbol leak in Throttler cgroup code

* Enhance debugging output
* Remove to_sym in keys

https://github.com/openshift/origin-server/commit/e00d653b764334fb5da6c2b301b5dd52629c9234
Bug 1121217 - Symbol leak in Throttler cgroup code

* fix tests
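
For background on why removing `to_sym` helps: in Ruby versions before 2.2, symbols are never garbage collected, so calling `to_sym` on an unbounded stream of distinct strings grows the symbol table for the life of the process. The Ruby sketch below shows the general pattern of keeping hash keys as strings. It is not the Throttler code; `parse_pairs` and the sample keys are hypothetical.

```ruby
# Hypothetical parser for "key value" lines; not the actual Throttler code.
# Keys stay Strings, so parsing never interns new symbols.
def parse_pairs(text)
  text.each_line.each_with_object({}) do |line, result|
    key, value = line.split
    next if key.nil? || value.nil? # skip blank or malformed lines
    result[key] = Integer(value)
  end
end

stats = parse_pairs("usage 1200\nthrottled_time 35\n")
puts stats.inspect

# For contrast: every distinct string passed to to_sym adds a symbol.
before = Symbol.all_symbols.size
interned = (1..1000).map { |i| "hypothetical_key_#{i}".to_sym }
puts "symbols added by to_sym: #{Symbol.all_symbols.size - before}"
puts "references held: #{interned.size}"
```

String keys are ordinary objects and are reclaimed once the hash is dropped, which is why they are the safer choice for keys derived from runtime data.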

Comment 6 Andy Grimm 2014-08-07 13:04:44 UTC
*** Bug 1096270 has been marked as a duplicate of this bug. ***

Comment 7 Meng Bo 2014-08-08 10:21:32 UTC
Checked on devenv-stage_952, with about 80 gears running on an m3.medium node.

With the following config in sysconfig:

# cat /etc/sysconfig/watchman 
GEAR_RETRIES=3
RETRY_DELAY=30
RETRY_PERIOD=60
STATE_CHANGE_DELAY=10
STATE_CHECK_PERIOD=1
THROTTLER_CHECK_PERIOD=1
OOM_CHECK_PERIOD=1
WATCHMAN_DEBUG=true

Watchman runs at about 50% CPU usage, its memory usage does not exceed 10%, and it can be restarted.

Also ran regression tests for the throttle plugin, gear_state_plugin, and oom_plugin. All of them work well.

Moving bug to verified.