Description of problem:
watchman takes up lots of memory and times out when attempting a restart

Version-Release number of selected component (if applicable):
openshift-origin-node-util-1.26.3-1.el6oso.noarch

How reproducible:
rarely

Steps to Reproduce:
1. $ ps aux | grep -i watchman
   root 10096 4.3 12.3 2332684 930992 ? Sl Jun23 1599:38 watchman
2. $ sudo service openshift-watchman restart
   Stopping Watchman.................................................Watchman operation timed out

Actual results:

Expected results:
Watchman should not use so much memory, or fail to restart

Additional info:
Added debug messages that print memory information after each watchman plugin is invoked. The messages go to /var/log/messages, and debug mode can be enabled by setting the environment variable 'WATCHMAN_DEBUG' to true. Hopefully this narrows down which plugin causes the leak.
https://github.com/openshift/origin-server/pull/5670
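For readers who want to see the shape of this kind of per-plugin memory logging, here is a minimal Ruby sketch. It is not the code from the pull request: `rss_kb`, `run_plugins`, the `apply` method on plugins, and the choice of VmRSS from /proc/self/status as the memory figure are all assumptions made for illustration. Only the WATCHMAN_DEBUG variable and the "Memory : ..., Plugin : ..." message format are taken from this report.

```ruby
# Hypothetical helper: extract the resident set size (in kB) from the text
# of /proc/<pid>/status. Returns nil when the VmRSS field is absent.
def rss_kb(status_text)
  match = status_text.match(/^VmRSS:\s+(\d+)\s+kB/)
  match ? match[1].to_i : nil
end

# Hypothetical wrapper: run each plugin, then report memory if debug is on.
# `plugins` is any list of objects responding to #apply (an assumed interface);
# `log` is any callable that accepts one message string.
def run_plugins(plugins, log, debug = ENV['WATCHMAN_DEBUG'] == 'true')
  plugins.each do |plugin|
    plugin.apply
    next unless debug

    status = '/proc/self/status'
    mem = File.exist?(status) ? rss_kb(File.read(status)) : nil
    log.call("Memory : #{mem}, Plugin : #{plugin.class.name}")
  end
end
```

A steadily growing number next to one plugin name across iterations would point at that plugin as the likely source of the leak.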
Checked on devenv-stage_946, the debug option was added to watchman config.

# cat /etc/sysconfig/watchman
WATCHMAN_DEBUG=true

# tail -f /var/log/messages
Aug 12 00:14:58 ip-10-99-163-60 watchman[21483]: Watchman debug is set to true
Aug 12 00:14:58 ip-10-99-163-60 watchman[21483]: Memory : 36484, Plugin : JbossPlugin
Aug 12 00:14:58 ip-10-99-163-60 watchman[21483]: Memory : 36560, Plugin : OomPlugin
Aug 12 00:14:58 ip-10-99-163-60 watchman[21483]: Memory : 36608, Plugin : EnvPlugin
Aug 12 00:14:58 ip-10-99-163-60 watchman[21483]: Memory : 36608, Plugin : ThrottlerPlugin
Aug 12 00:14:58 ip-10-99-163-60 watchman[21483]: Memory : 36688, Plugin : GearStatePlugin
Aug 12 00:14:58 ip-10-99-163-60 watchman[21483]: Memory : 36688, Plugin : MetricsPlugin
Fixed in https://github.com/openshift/origin-server/pull/5695
Commits pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/a0149a176f417aee7cc82190b90859158a38c09d
Bug 1121217 - Symbol leak in Throttler cgroup code
* Enhance debugging output
* Remove to_sym in keys

https://github.com/openshift/origin-server/commit/e00d653b764334fb5da6c2b301b5dd52629c9234
Bug 1121217 - Symbol leak in Throttler cgroup code
* fix tests
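
As background on why removing to_sym from hash keys can fix a leak: on Ruby versions before 2.2, symbols are never garbage collected, so converting an unbounded set of strings to symbols grows the symbol table for the life of the process. The sketch below illustrates the general pattern only. The actual Throttler cgroup code is not shown in this report, and `parse_stats` and its input format are hypothetical.

```ruby
# Hypothetical parser for "key value" lines, e.g. stat-style text output.
# With symbolize = true, every distinct key is interned as a Symbol, which
# older Rubies (before 2.2) keep in memory forever. With the default,
# keys stay ordinary Strings that the garbage collector can reclaim.
def parse_stats(text, symbolize = false)
  stats = {}
  text.each_line do |line|
    key, value = line.split
    next if key.nil? # skip blank lines

    stats[symbolize ? key.to_sym : key] = value.to_i
  end
  stats
end

# Leak-prone on old Rubies when key names vary over time:
#   parse_stats(output, true)
# Safer: keep string keys and look values up with strings:
#   parse_stats(output)["usage"]
```

The trade-off is that callers must index the resulting hash with strings instead of symbols, which is presumably why the follow-up commit had to fix tests.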
*** Bug 1096270 has been marked as a duplicate of this bug. ***
Checked on devenv-stage_952, with about 80 gears running on an m3.medium node, with the following config in sysconfig:

# cat /etc/sysconfig/watchman
GEAR_RETRIES=3
RETRY_DELAY=30
RETRY_PERIOD=60
STATE_CHANGE_DELAY=10
STATE_CHECK_PERIOD=1
THROTTLER_CHECK_PERIOD=1
OOM_CHECK_PERIOD=1
WATCHMAN_DEBUG=true

Watchman runs at about 50% CPU usage, and its memory usage does not exceed 10%. Watchman can also be restarted.

Also did regression testing for the throttle plugin, gear_state_plugin and oom_plugin. All of them work well.

Moving bug to verified.