Description of problem:
In /var/log/messages, watchman was observed restarting several Jenkins slave gears. After a Jenkins slave gear is created, it stays in the "building" state until it is destroyed. Maybe watchman should ignore the "building" state. Or, at the cartridge level, the Jenkins slave gear should be set to the started state once building is finished.

Version-Release number of selected component (if applicable):
ose-2.1z; Online may have the same issue

How reproducible:
Always

Steps to Reproduce:
1. Create an app and add the Jenkins-client cartridge
2. Modify the app and git push
3. Observe /var/log/messages

Actual results:
watchman restarted several bldr gears.

[root@node1 openshift]# cat /var/log/messages|grep watchman|grep bldr
Aug 26 04:03:44 node1 watchman[1356]: watchman restarted user 53fc3b99fa838ef802000113: application php1bldr (retries: 0)
Aug 26 05:11:36 node1 watchman[1356]: watchman restarted user 53fc4b88fa838ef802000138: application php1bldr (retries: 0)
Aug 26 23:31:40 node1 watchman[19112]: watchman restarted user 53fd4d66fa838ef802000165: application php1bldr (retries: 0)
Aug 27 00:42:13 node1 watchman[19112]: watchman restarted user 53fd5decfa838ef80200018b: application php1bldr (retries: 0)

Expected results:
watchman doesn't restart Jenkins slave gears

Additional info:
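For anyone who wants to summarize these restarts rather than eyeball the grep output, here is a minimal sketch in Python. It assumes the message format shown in the excerpt above; the function name `bldr_restarts` and the regular expression are illustrative and not part of watchman.

```python
import re

# Pattern for the restart message as it appears in the log excerpt above.
RESTART_RE = re.compile(
    r"watchman\[\d+\]: watchman restarted user (?P<uuid>[0-9a-f]+): "
    r"application (?P<app>\S+) \(retries: (?P<retries>\d+)\)"
)

def bldr_restarts(lines):
    """Return (uuid, app) pairs for watchman restarts of builder gears."""
    hits = []
    for line in lines:
        m = RESTART_RE.search(line)
        # Same filter as `grep bldr`: keep only applications whose name contains "bldr".
        if m and "bldr" in m.group("app"):
            hits.append((m.group("uuid"), m.group("app")))
    return hits

sample = [
    "Aug 26 04:03:44 node1 watchman[1356]: watchman restarted user "
    "53fc3b99fa838ef802000113: application php1bldr (retries: 0)",
    "Aug 26 05:11:36 node1 watchman[1356]: watchman restarted user "
    "53fc4b88fa838ef802000138: application php1bldr (retries: 0)",
]

for uuid, app in bldr_restarts(sample):
    print(uuid, app)
```

On a node, the same function could be fed `open("/var/log/messages")` instead of the sample list.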
This is fixed upstream:
https://github.com/openshift/origin-server/commit/cc0961110326ed25ab13691b20ed4a6a88a295ab

Because the Jenkins builder processes don't resemble daemons (i.e. they tend not to be children of PID 1), they weren't being included in watchman's gear process list. This made it appear that the gear was in an incorrect state: the gear was BUILDING, STARTED, etc., but as far as watchman knew, the gear wasn't handling any services. After 15 minutes (or whatever interval the STATE_CHANGE_DELAY option in /etc/sysconfig/watchman sets), watchman restarted the gear to see if that brought it back into a correct state.

The fix adds a condition to prevent watchman from excluding Jenkins builder processes from gear process lists.
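The reasoning above can be illustrated with a small sketch. This is not watchman's code (the real change is in the linked commit); `Proc`, `is_jenkins_builder`, the command-line marker it checks, and `looks_idle` are all hypothetical names chosen only to show why a gear with non-daemon processes looked idle before the fix.

```python
from dataclasses import dataclass

@dataclass
class Proc:
    pid: int
    ppid: int
    cmdline: str

def is_jenkins_builder(proc):
    # Hypothetical marker for a builder process; the real condition is in the upstream commit.
    return "jenkins" in proc.cmdline.lower()

def gear_processes(procs, include_builders):
    """Keep daemon-like processes (children of PID 1); optionally keep builder processes too."""
    return [
        p for p in procs
        if p.ppid == 1 or (include_builders and is_jenkins_builder(p))
    ]

def looks_idle(state, procs, include_builders):
    """True when the gear claims an active state but no process is attributed to it."""
    active = state in {"building", "started"}
    return active and not gear_processes(procs, include_builders)

# A builder process whose parent is not PID 1 (made-up PIDs and command line).
procs = [Proc(pid=4242, ppid=4100, cmdline="java -jar jenkins-slave.jar")]

print(looks_idle("building", procs, include_builders=False))  # before the fix
print(looks_idle("building", procs, include_builders=True))   # after the fix
```

With builders excluded, the gear has an empty process list while in the building state, which is the condition that led to the restart; once builders are counted, the gear no longer looks idle.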
Verified; passes on puddle-2-2-2015-03-16. No bldr gears were restarted by watchman.

1) Set the watchman retry settings to low values.
cat /etc/sysconfig/watchman |grep =
GEAR_RETRIES=3
RETRY_DELAY=180
RETRY_PERIOD=360
STATE_CHANGE_DELAY=60

2) Create Jenkins bldr gears.
rhc apps|grep bldr
jbosseap6bldr @ http://jbosseap6bldr-anlidom.ose22-app-201503162.com.cn/ (uuid: 5507b1b64add718596000071)
Git URL: ssh://5507b1b64add718596000071.com.cn/~/git/jbosseap6bldr.git/
SSH: 5507b1b64add718596000071.com.cn
php53bldr @ http://php53bldr-anlidom.ose22-app-201503162.com.cn/ (uuid: 5507b1814add718596000059)
Git URL: ssh://5507b1814add718596000059.com.cn/~/git/php53bldr.git/
SSH: 5507b1814add718596000059.com.cn
ruby19bldr @ http://ruby19bldr-anlidom.ose22-app-201503162.com.cn/ (uuid: 5507b19d4add71c6dc0000f3)
Git URL: ssh://5507b19d4add71c6dc0000f3.com.cn/~/git/ruby19bldr.git/
SSH: 5507b19d4add71c6dc0000f3.com.cn

3) Wait 15 minutes until the bldr gears disappear.
[anli@broker ruby19]$ rhc apps|grep bldr
[anli@broker ruby19]$

4) Check the logs: no gears were restarted.
tailf /var/log/messages|grep watchman
Mar 17 12:54:08 dhcp-128-178 watchman[21392]: Gears: 7, Memory: 41456, Plugin: all, Symbols: 10672, Objects: 182280
Mar 17 12:54:28 dhcp-128-178 watchman[21392]: Gears: 7, Memory: 41456, Plugin:
<--snip--->
Mar 17 12:57:49 dhcp-128-178 watchman[21392]: Gears: 7, Memory: 41524, Plugin: all, Symbols: 10672, Objects: 182280
Mar 17 12:58:09 dhcp-128-178 watchman[21392]: Gears: 7, Memory: 41528, Plugin: all, Symbols: 10672, Objects: 182280
Mar 17 12:58:29 dhcp-128-178 watchman[21392]: Gears: 7, Memory: 41528, Plugin:
<--snip--->
Mar 17 13:19:53 dhcp-128-178 watchman[21392]: Gears: 4, Memory: 41540, Plugin: all, Symbols: 10672, Objects: 182280
Mar 17 13:20:13 dhcp-128-178 watchman[21392]: Gears: 4, Memory: 41540, Plugin: all, Symbols: 10672, Objects: 182280
Mar 17 13:20:33 dhcp-128-178 watchman[21392]: Gears: 4, Memory: 41540, Plugin: all, Symbols: 10672, Objects: 182280
Mar 17 13:20:53 dhcp-128-178 watchman[21392]: Gears: 4, Memory: 41540, Plugin: all, Symbols: 10672, Objects: 182280
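The check in this log can be sketched in Python as below. The gear count drops from 7 to 4, which is consistent with the three bldr gears going away, and no "watchman restarted" line appears. The helper names `gear_counts` and `has_restart` are illustrative, and the pattern assumes the status-line format shown above.

```python
import re

# Pattern for the periodic watchman status line shown in the log above.
STATUS_RE = re.compile(r"watchman\[\d+\]: Gears: (\d+), Memory: (\d+)")

def gear_counts(lines):
    """Return the gear count from each watchman status line, in order."""
    counts = []
    for line in lines:
        m = STATUS_RE.search(line)
        if m:
            counts.append(int(m.group(1)))
    return counts

def has_restart(lines):
    """True if any line reports that watchman restarted a gear."""
    return any("watchman restarted" in line for line in lines)

sample = [
    "Mar 17 12:54:08 dhcp-128-178 watchman[21392]: Gears: 7, Memory: 41456, "
    "Plugin: all, Symbols: 10672, Objects: 182280",
    "Mar 17 13:19:53 dhcp-128-178 watchman[21392]: Gears: 4, Memory: 41540, "
    "Plugin: all, Symbols: 10672, Objects: 182280",
]

counts = gear_counts(sample)
print(counts, "restart seen:", has_restart(sample))
```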
Verified. All web cartridges can do Jenkins builds successfully. watchman no longer restarts the Jenkins builder gears. Bug 115002 also works well. One thing to highlight: watchman may delete the .stop_lock file in a Jenkins slave gear, but that doesn't affect the Jenkins build, so I am moving the bug to VERIFIED.
1. Set STATE_CHANGE_DELAY=20 and STATE_CHECK_PERIOD=30 in /etc/sysconfig/watchman.
2. Create a web cartridge app with Jenkins and git push changes.
3. Check the log messages; they look like the following. watchman deleted the stop lock but did not restart the Jenkins bldr gears.

[root@node2 ~]# cat /var/log/messages|grep watchman
Apr 3 17:05:39 node2 openshift-platform[28484]: watchman deleted stop lock for gear anlidom-perl510bldr-1 because the state of the gear was building
Apr 3 17:05:39 node2 openshift-platform[28484]: watchman deleted stop lock for gear anlidom-perl510bldr-1 because the state of the gear was building
Apr 3 17:09:00 node2 openshift-platform[28484]: watchman deleted stop lock for gear anlidom-python33bldr-1 because the state of the gear was building
Apr 3 17:09:00 node2 openshift-platform[28484]: watchman deleted stop lock for gear anlidom-python33bldr-1 because the state of the gear was building
Apr 3 17:11:41 node2 openshift-platform[28484]: watchman deleted stop lock for gear anlidom-ruby20bldr-1 because the state of the gear was building
Apr 3 17:11:41 node2 openshift-platform[28484]: watchman deleted stop lock for gear anlidom-ruby20bldr-1 because the state of the gear was building
Apr 3 17:36:46 node2 openshift-platform[23559]: watchman deleted stop lock for gear anlidom-jbosseap6bldr-1 because the state of the gear was building
Apr 3 17:36:46 node2 openshift-platform[23559]: watchman deleted stop lock for gear anlidom-jbosseap6bldr-1 because the state of the gear was building
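Each stop-lock message in this log is written twice, so a quick way to see which gears were affected is to collapse the lines per gear. A minimal Python sketch follows, assuming the message format shown above; `stop_lock_deletions` is an illustrative name, not an OpenShift tool.

```python
import re

# Pattern for the stop-lock message as it appears in the log above.
STOP_LOCK_RE = re.compile(
    r"watchman deleted stop lock for gear (?P<gear>\S+) "
    r"because the state of the gear was (?P<state>\w+)"
)

def stop_lock_deletions(lines):
    """Map each gear name to the set of states for which its stop lock was deleted."""
    seen = {}
    for line in lines:
        m = STOP_LOCK_RE.search(line)
        if m:
            seen.setdefault(m.group("gear"), set()).add(m.group("state"))
    return seen

sample = [
    "Apr 3 17:05:39 node2 openshift-platform[28484]: watchman deleted stop lock "
    "for gear anlidom-perl510bldr-1 because the state of the gear was building",
    "Apr 3 17:05:39 node2 openshift-platform[28484]: watchman deleted stop lock "
    "for gear anlidom-perl510bldr-1 because the state of the gear was building",
    "Apr 3 17:09:00 node2 openshift-platform[28484]: watchman deleted stop lock "
    "for gear anlidom-python33bldr-1 because the state of the gear was building",
]

for gear, states in sorted(stop_lock_deletions(sample).items()):
    print(gear, sorted(states))
```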
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0779.html