Bug 1134206
Summary: | watchman shouldn't restart Jenkins-slave gears | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Anping Li <anli> | |
Component: | Containers | Assignee: | John W. Lamb <jolamb> | |
Status: | CLOSED ERRATA | QA Contact: | libra bugs <libra-bugs> | |
Severity: | high | Docs Contact: | ||
Priority: | high | |||
Version: | 2.1.0 | CC: | adellape, anli, bleanhar, erich, jbuchta, jokerman, libra-onpremise-devel, misalunk, mmccomas, nicholas_schuetz | |
Target Milestone: | --- | Keywords: | Upstream | |
Target Release: | --- | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | openshift-origin-node-util-1.35.1.1-1.el6op | Doc Type: | Bug Fix | |
Doc Text: | Previously, Jenkins slave (or builder) gears were incorrectly restarted by Watchman after 15 minutes, or after the interval set in the STATE_CHANGE_DELAY parameter in the /etc/sysconfig/watchman file on nodes. This was due to Watchman not including the builder processes in its gear process list. This bug fix adds a condition to prevent Watchman from excluding the builder processes, and as a result Jenkins slave gears are no longer incorrectly restarted in this way. |
Story Points: | --- | |
Clone Of: | ||||
: | 1134686 | Environment: | |
Last Closed: | 2015-04-06 17:05:54 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | 1134686 | |||
Bug Blocks: | 1150026 |
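The behavior summarized in the Doc Text can be illustrated with a minimal Python sketch. This is not watchman's own code: the `Proc` record, the `is_builder` flag, and both function names are hypothetical, and the 900-second default assumes STATE_CHANGE_DELAY is given in seconds, which is consistent with the 15-minute figure. The sketch only models why a builder gear with no daemon-like processes (children of PID 1) appeared to be handling no services, and how counting builder processes avoids the restart.

```python
from dataclasses import dataclass


# Hypothetical process record; watchman's real data structures differ.
@dataclass(frozen=True)
class Proc:
    pid: int
    ppid: int
    uid: int
    is_builder: bool = False  # assumed marker for a Jenkins builder process


def gear_processes(procs, gear_uid, include_builders):
    """Return the processes counted as services for one gear.

    Daemon-like processes (children of PID 1) are always counted.
    Builder processes are counted only when include_builders is True,
    which models the condition added by the fix.
    """
    selected = []
    for p in procs:
        if p.uid != gear_uid:
            continue
        if p.ppid == 1 or (include_builders and p.is_builder):
            selected.append(p)
    return selected


def needs_restart(state, procs, seconds_in_state, state_change_delay=900):
    """Model of the check: an active gear with no counted processes for
    longer than state_change_delay seconds is restarted."""
    active = state in {"building", "started"}
    return active and not procs and seconds_in_state > state_change_delay


# A builder process whose parent is not PID 1.
procs = [Proc(pid=4321, ppid=4300, uid=1001, is_builder=True)]
before_fix = gear_processes(procs, 1001, include_builders=False)
after_fix = gear_processes(procs, 1001, include_builders=True)
print(needs_restart("building", before_fix, 1000))  # True
print(needs_restart("building", after_fix, 1000))   # False
```

Keeping the filter and the restart decision as separate functions mirrors the report: the fix changes only which processes are counted, not the delay logic.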
Description
Anping Li
2014-08-27 06:40:06 UTC
This is fixed upstream: https://github.com/openshift/origin-server/commit/cc0961110326ed25ab13691b20ed4a6a88a295ab

Because the Jenkins builder processes don't resemble daemons (i.e. they tend not to be children of PID 1), they weren't being included in watchman's gear process list. This made it appear that the gear was in an incorrect state: the gear was BUILDING, STARTED, etc., but as far as watchman knew, the gear wasn't handling any services. After 15 minutes, or whatever interval is set by the STATE_CHANGE_DELAY option in /etc/sysconfig/watchman, watchman restarts the gear to see if that brings it back into a correct state.

The fix adds a condition to prevent watchman from excluding Jenkins builder processes from gear process lists.

Verified and passed on puddle-2-2-2015-03-16. No bldr gears were restarted by watchman.

1) Set the watchman retry times to low values.
    cat /etc/sysconfig/watchman |grep =
    GEAR_RETRIES=3
    RETRY_DELAY=180
    RETRY_PERIOD=360
    STATE_CHANGE_DELAY=60
2) Create Jenkins bldr gears.
    rhc apps|grep bldr
    jbosseap6bldr @ http://jbosseap6bldr-anlidom.ose22-app-201503162.com.cn/ (uuid: 5507b1b64add718596000071)
    Git URL: ssh://5507b1b64add718596000071.com.cn/~/git/jbosseap6bldr.git/
    SSH: 5507b1b64add718596000071.com.cn
    php53bldr @ http://php53bldr-anlidom.ose22-app-201503162.com.cn/ (uuid: 5507b1814add718596000059)
    Git URL: ssh://5507b1814add718596000059.com.cn/~/git/php53bldr.git/
    SSH: 5507b1814add718596000059.com.cn
    ruby19bldr @ http://ruby19bldr-anlidom.ose22-app-201503162.com.cn/ (uuid: 5507b19d4add71c6dc0000f3)
    Git URL: ssh://5507b19d4add71c6dc0000f3.com.cn/~/git/ruby19bldr.git/
    SSH: 5507b19d4add71c6dc0000f3.com.cn
3) Wait 15 minutes until the bldr gears disappear.
    [anli@broker ruby19]$ rhc apps|grep bldr
    [anli@broker ruby19]$
4) Check the logs: no gears were restarted.
    tailf /var/log/messages|grep watchman
    Mar 17 12:54:08 dhcp-128-178 watchman[21392]: Gears: 7, Memory: 41456, Plugin: all, Symbols: 10672, Objects: 182280
    Mar 17 12:54:28 dhcp-128-178 watchman[21392]: Gears: 7, Memory: 41456, Plugin: <--snip--->
    Mar 17 12:57:49 dhcp-128-178 watchman[21392]: Gears: 7, Memory: 41524, Plugin: all, Symbols: 10672, Objects: 182280
    Mar 17 12:58:09 dhcp-128-178 watchman[21392]: Gears: 7, Memory: 41528, Plugin: all, Symbols: 10672, Objects: 182280
    Mar 17 12:58:29 dhcp-128-178 watchman[21392]: Gears: 7, Memory: 41528, Plugin: <--snip--->
    Mar 17 13:19:53 dhcp-128-178 watchman[21392]: Gears: 4, Memory: 41540, Plugin: all, Symbols: 10672, Objects: 182280
    Mar 17 13:20:13 dhcp-128-178 watchman[21392]: Gears: 4, Memory: 41540, Plugin: all, Symbols: 10672, Objects: 182280
    Mar 17 13:20:33 dhcp-128-178 watchman[21392]: Gears: 4, Memory: 41540, Plugin: all, Symbols: 10672, Objects: 182280
    Mar 17 13:20:53 dhcp-128-178 watchman[21392]: Gears: 4, Memory: 41540, Plugin: all, Symbols: 10672, Objects: 182280

Verified. All web cartridges can do Jenkins builds successfully. watchman did stop jenkins gears. Bug 115002 also works well.

One thing to highlight is that the .stop_lock file in a Jenkins slave gear may be deleted by watchman, but that doesn't affect the Jenkins build, so I am moving the bug to verified.

1. Set STATE_CHANGE_DELAY=20 and STATE_CHECK_PERIOD=30 in /etc/sysconfig/watchman.
2. Create a web cartridge with Jenkins and git push changes.
3. Check the log messages; they will look like the following. Watchman deleted the stop lock, and watchman did not restart the jenkins-bldr gears.
    [root@node2 ~]# cat /var/log/messages|grep watchman
    Apr 3 17:05:39 node2 openshift-platform[28484]: watchman deleted stop lock for gear anlidom-perl510bldr-1 because the state of the gear was building
    Apr 3 17:05:39 node2 openshift-platform[28484]: watchman deleted stop lock for gear anlidom-perl510bldr-1 because the state of the gear was building
    Apr 3 17:09:00 node2 openshift-platform[28484]: watchman deleted stop lock for gear anlidom-python33bldr-1 because the state of the gear was building
    Apr 3 17:09:00 node2 openshift-platform[28484]: watchman deleted stop lock for gear anlidom-python33bldr-1 because the state of the gear was building
    Apr 3 17:11:41 node2 openshift-platform[28484]: watchman deleted stop lock for gear anlidom-ruby20bldr-1 because the state of the gear was building
    Apr 3 17:11:41 node2 openshift-platform[28484]: watchman deleted stop lock for gear anlidom-ruby20bldr-1 because the state of the gear was building
    Apr 3 17:36:46 node2 openshift-platform[23559]: watchman deleted stop lock for gear anlidom-jbosseap6bldr-1 because the state of the gear was building
    Apr 3 17:36:46 node2 openshift-platform[23559]: watchman deleted stop lock for gear anlidom-jbosseap6bldr-1 because the state of the gear was building

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0779.html
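The log checks in the verification comments could also be done mechanically. The following is a rough Python sketch, assuming the two message formats quoted in this report (the periodic "Gears: N" status line and the "deleted stop lock" line) are the only ones of interest; the `summarize` function and the regular expressions are illustrative, not part of watchman or any OpenShift tool.

```python
import re

# Patterns modeled on the log lines quoted in this report.
STATUS_RE = re.compile(r"watchman\[\d+\]: Gears: (\d+),")
STOP_LOCK_RE = re.compile(
    r"watchman deleted stop lock for gear (\S+) "
    r"because the state of the gear was (\w+)"
)


def summarize(lines):
    """Return (gear counts in order, gears whose stop lock was deleted).

    Repeated stop-lock messages for the same gear are reported once.
    """
    gear_counts = []
    stop_lock_gears = []
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            gear_counts.append(int(match.group(1)))
            continue
        match = STOP_LOCK_RE.search(line)
        if match and match.group(1) not in stop_lock_gears:
            stop_lock_gears.append(match.group(1))
    return gear_counts, stop_lock_gears


sample = [
    "Mar 17 12:54:08 dhcp-128-178 watchman[21392]: Gears: 7, Memory: 41456, "
    "Plugin: all, Symbols: 10672, Objects: 182280",
    "Mar 17 13:19:53 dhcp-128-178 watchman[21392]: Gears: 4, Memory: 41540, "
    "Plugin: all, Symbols: 10672, Objects: 182280",
    "Apr 3 17:05:39 node2 openshift-platform[28484]: watchman deleted stop lock "
    "for gear anlidom-perl510bldr-1 because the state of the gear was building",
    "Apr 3 17:05:39 node2 openshift-platform[28484]: watchman deleted stop lock "
    "for gear anlidom-perl510bldr-1 because the state of the gear was building",
]
counts, gears = summarize(sample)
print(counts)  # [7, 4]
print(gears)   # ['anlidom-perl510bldr-1']
```

A drop in the gear count (7 to 4 in the first test run) together with the absence of any other watchman messages is what the verification steps treated as evidence that the builder gears went away without being restarted.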