Bug 1134206

Summary: watchman shouldn't restart Jenkins-slave gears
Product: OpenShift Container Platform
Reporter: Anping Li <anli>
Component: Containers
Assignee: John W. Lamb <jolamb>
Status: CLOSED ERRATA
QA Contact: libra bugs <libra-bugs>
Severity: high
Docs Contact:
Priority: high
Version: 2.1.0
CC: adellape, anli, bleanhar, erich, jbuchta, jokerman, libra-onpremise-devel, misalunk, mmccomas, nicholas_schuetz
Target Milestone: ---
Keywords: Upstream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openshift-origin-node-util-1.35.1.1-1.el6op
Doc Type: Bug Fix
Doc Text:
Previously, Jenkins slave (or builder) gears were incorrectly restarted by Watchman after 15 minutes, or after the interval set in the STATE_CHANGE_DELAY parameter in the /etc/sysconfig/watchman file on nodes. This was due to Watchman not including the builder processes in its gear process list. This bug fix adds a condition to prevent Watchman from excluding the builder processes, and as a result Jenkins slave gears are no longer incorrectly restarted in this way.
Story Points: ---
Clone Of:
: 1134686
Environment:
Last Closed: 2015-04-06 17:05:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1134686    
Bug Blocks: 1150026    

Description Anping Li 2014-08-27 06:40:06 UTC
Description of problem:
As observed in /var/log/messages, watchman restarted several Jenkins slave gears.

After a Jenkins-slave gear is created, it is moved to the "building" state and stays there until it is destroyed. Maybe watchman should ignore the "building" state. Or, at the cartridge level, the Jenkins slave gear should be set to the started state once building is finished.
 

Version-Release number of selected component (if applicable):
ose-2.1z; Online may have the same issue

How reproducible:
Always

Steps to Reproduce:
1. Create an app and add Jenkins-client cartridges
2. Modify the app and git push
3. Observe /var/log/messages

Actual results:
We can see that watchman restarted several bldr gears.

[root@node1 openshift]# cat /var/log/messages|grep watchman|grep bldr
Aug 26 04:03:44 node1 watchman[1356]: watchman restarted user 53fc3b99fa838ef802000113: application php1bldr (retries: 0)
Aug 26 05:11:36 node1 watchman[1356]: watchman restarted user 53fc4b88fa838ef802000138: application php1bldr (retries: 0)
Aug 26 23:31:40 node1 watchman[19112]: watchman restarted user 53fd4d66fa838ef802000165: application php1bldr (retries: 0)
Aug 27 00:42:13 node1 watchman[19112]: watchman restarted user 53fd5decfa838ef80200018b: application php1bldr (retries: 0)
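
The grep above can also be done programmatically when counts per application are wanted. The following is a minimal sketch, assuming the log line format quoted above; the function name builder_restarts and the regular expression are illustrative and not part of watchman.

```python
import re
from collections import Counter

# Pattern modeled on the log lines quoted above (an assumption);
# other watchman versions may log restarts differently.
RESTART_RE = re.compile(
    r"watchman restarted user (?P<uuid>[0-9a-f]+): "
    r"application (?P<app>\S+) \(retries: (?P<retries>\d+)\)"
)


def builder_restarts(lines):
    """Count watchman restarts per application whose name ends in 'bldr'."""
    counts = Counter()
    for line in lines:
        match = RESTART_RE.search(line)
        if match and match.group("app").endswith("bldr"):
            counts[match.group("app")] += 1
    return counts


sample = [
    "Aug 26 04:03:44 node1 watchman[1356]: watchman restarted user "
    "53fc3b99fa838ef802000113: application php1bldr (retries: 0)",
    "Aug 26 05:11:36 node1 watchman[1356]: watchman restarted user "
    "53fc4b88fa838ef802000138: application php1bldr (retries: 0)",
]
print(builder_restarts(sample))
```

On a node, the same function could be fed the lines of /var/log/messages instead of the sample list.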


Expected results:
watchman doesn't restart Jenkins-slave gears 

Additional info:

Comment 8 John W. Lamb 2015-02-25 20:09:59 UTC
This is fixed upstream: https://github.com/openshift/origin-server/commit/cc0961110326ed25ab13691b20ed4a6a88a295ab

Because the Jenkins builder processes don't resemble daemons (i.e. they tend not to be children of PID 1) they weren't being included in watchman's gear process list. This made it appear that the gear was in an incorrect state: the gear was BUILDING, STARTED, etc., but as far as watchman knew, the gear wasn't handling any services. After 15 minutes - or whatever interval is set in /etc/sysconfig/watchman option STATE_CHANGE_DELAY - watchman restarts the gear to see if that brings it back into a correct state.

The fix adds a condition to prevent watchman from excluding jenkins builder processes from gear process lists.
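
To illustrate the explanation above, here is a toy model in Python. It is not the upstream code (see the commit link for that): the Proc type, the is_builder flag, and the include_builders switch are all hypothetical, and "daemon-like" is approximated as "direct child of PID 1".

```python
from dataclasses import dataclass


@dataclass
class Proc:
    pid: int
    ppid: int
    is_builder: bool  # hypothetical flag; how watchman detects a builder is not shown here


def gear_processes(procs, include_builders):
    """Return the processes counted as services of a gear.

    Daemon-like processes are approximated as direct children of PID 1.
    With include_builders=True, builder processes are kept as well,
    mirroring the kind of condition the fix is described as adding.
    """
    return [
        p for p in procs
        if p.ppid == 1 or (include_builders and p.is_builder)
    ]


# A builder process started by some other process, so its parent is not PID 1.
procs = [Proc(pid=4100, ppid=4000, is_builder=True)]
before = gear_processes(procs, include_builders=False)
after = gear_processes(procs, include_builders=True)
print(len(before), len(after))
```

In this model the gear looks empty before the change, which corresponds to the state mismatch that led watchman to restart the gear after STATE_CHANGE_DELAY.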

Comment 12 Anping Li 2015-03-17 05:29:28 UTC
Verified; passes on puddle-2-2-2015-03-16.
No bldr gears were restarted by watchman.

1) Set the watchman retry settings to low values.
cat /etc/sysconfig/watchman |grep =
GEAR_RETRIES=3
RETRY_DELAY=180
RETRY_PERIOD=360
STATE_CHANGE_DELAY=60

2) Create Jenkins bldr gears
rhc apps|grep bldr
jbosseap6bldr @ http://jbosseap6bldr-anlidom.ose22-app-201503162.com.cn/ (uuid: 5507b1b64add718596000071)
  Git URL:    ssh://5507b1b64add718596000071.com.cn/~/git/jbosseap6bldr.git/
  SSH:        5507b1b64add718596000071.com.cn
php53bldr @ http://php53bldr-anlidom.ose22-app-201503162.com.cn/ (uuid: 5507b1814add718596000059)
  Git URL:    ssh://5507b1814add718596000059.com.cn/~/git/php53bldr.git/
  SSH:        5507b1814add718596000059.com.cn
ruby19bldr @ http://ruby19bldr-anlidom.ose22-app-201503162.com.cn/ (uuid: 5507b19d4add71c6dc0000f3)
  Git URL:    ssh://5507b19d4add71c6dc0000f3.com.cn/~/git/ruby19bldr.git/
  SSH:        5507b19d4add71c6dc0000f3.com.cn

3) Wait 15 minutes until the bldr gears disappear.
[anli@broker ruby19]$ rhc apps|grep bldr
[anli@broker ruby19]$

4) Check the logs; no gears were restarted.

tailf  /var/log/messages|grep watchman
Mar 17 12:54:08 dhcp-128-178 watchman[21392]: Gears: 7, Memory: 41456, Plugin: all, Symbols: 10672, Objects: 182280
Mar 17 12:54:28 dhcp-128-178 watchman[21392]: Gears: 7, Memory: 41456, Plugin: 
<--snip--->
Mar 17 12:57:49 dhcp-128-178 watchman[21392]: Gears: 7, Memory: 41524, Plugin: all, Symbols: 10672, Objects: 182280
Mar 17 12:58:09 dhcp-128-178 watchman[21392]: Gears: 7, Memory: 41528, Plugin: all, Symbols: 10672, Objects: 182280
Mar 17 12:58:29 dhcp-128-178 watchman[21392]: Gears: 7, Memory: 41528, Plugin: 
<--snip--->
Mar 17 13:19:53 dhcp-128-178 watchman[21392]: Gears: 4, Memory: 41540, Plugin: all, Symbols: 10672, Objects: 182280
Mar 17 13:20:13 dhcp-128-178 watchman[21392]: Gears: 4, Memory: 41540, Plugin: all, Symbols: 10672, Objects: 182280
Mar 17 13:20:33 dhcp-128-178 watchman[21392]: Gears: 4, Memory: 41540, Plugin: all, Symbols: 10672, Objects: 182280
Mar 17 13:20:53 dhcp-128-178 watchman[21392]: Gears: 4, Memory: 41540, Plugin: all, Symbols: 10672, Objects: 182280
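
The settings lowered in step 1 can be read back with a few lines of Python before running a test like this. This is a sketch that assumes /etc/sysconfig/watchman contains plain KEY=VALUE lines as in the output above; parse_sysconfig is an illustrative helper, not an OpenShift API.

```python
def parse_sysconfig(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"')
    return settings


# Values copied from step 1 above.
sample = """\
GEAR_RETRIES=3
RETRY_DELAY=180
RETRY_PERIOD=360
STATE_CHANGE_DELAY=60
"""
settings = parse_sysconfig(sample)
print(int(settings["STATE_CHANGE_DELAY"]))
```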

Comment 17 Anping Li 2015-04-03 09:34:21 UTC
Verified. 
All web cartridges can complete Jenkins builds successfully. watchman did stop jenkins gears. Bug 115002 also works well.

One thing to highlight: watchman may delete the .stop_lock file in a Jenkins slave gear, but that doesn't affect the Jenkins build.

So I am moving the bug to verified.

Comment 18 Anping Li 2015-04-03 10:13:11 UTC
1. Set STATE_CHANGE_DELAY=20 and STATE_CHECK_PERIOD=30 in /etc/sysconfig/watchman.
2. Create a web cartridge with Jenkins and git push changes.
3. Check the log messages; they look like the following. watchman deleted the stop lock, and watchman did not restart the Jenkins bldr gears.

[root@node2 ~]# cat /var/log/messages|grep watchman
Apr  3 17:05:39 node2 openshift-platform[28484]: watchman deleted stop lock for gear anlidom-perl510bldr-1 because the state of the gear was building
Apr  3 17:05:39 node2 openshift-platform[28484]: watchman deleted stop lock for gear anlidom-perl510bldr-1 because the state of the gear was building
Apr  3 17:09:00 node2 openshift-platform[28484]: watchman deleted stop lock for gear anlidom-python33bldr-1 because the state of the gear was building
Apr  3 17:09:00 node2 openshift-platform[28484]: watchman deleted stop lock for gear anlidom-python33bldr-1 because the state of the gear was building
Apr  3 17:11:41 node2 openshift-platform[28484]: watchman deleted stop lock for gear anlidom-ruby20bldr-1 because the state of the gear was building
Apr  3 17:11:41 node2 openshift-platform[28484]: watchman deleted stop lock for gear anlidom-ruby20bldr-1 because the state of the gear was building
Apr  3 17:36:46 node2 openshift-platform[23559]: watchman deleted stop lock for gear anlidom-jbosseap6bldr-1 because the state of the gear was building
Apr  3 17:36:46 node2 openshift-platform[23559]: watchman deleted stop lock for gear anlidom-jbosseap6bldr-1 because the state of the gear was building
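
The check in step 3 can be automated along these lines. This is a sketch that assumes the two message formats quoted in this bug ("watchman deleted stop lock for gear ..." and "watchman restarted user ..."); the summarize helper is illustrative only.

```python
import re

# Pattern modeled on the stop-lock messages quoted above (an assumption).
STOP_LOCK_RE = re.compile(
    r"watchman deleted stop lock for gear (?P<gear>\S+) "
    r"because the state of the gear was (?P<state>\w+)"
)


def summarize(lines):
    """Return (sorted gears whose stop lock was deleted, count of restart messages)."""
    gears = set()
    restarts = 0
    for line in lines:
        match = STOP_LOCK_RE.search(line)
        if match:
            gears.add(match.group("gear"))  # repeated messages collapse to one gear
        elif "watchman restarted" in line:
            restarts += 1
    return sorted(gears), restarts


log = [
    "Apr  3 17:05:39 node2 openshift-platform[28484]: watchman deleted stop lock "
    "for gear anlidom-perl510bldr-1 because the state of the gear was building",
    "Apr  3 17:05:39 node2 openshift-platform[28484]: watchman deleted stop lock "
    "for gear anlidom-perl510bldr-1 because the state of the gear was building",
    "Apr  3 17:09:00 node2 openshift-platform[28484]: watchman deleted stop lock "
    "for gear anlidom-python33bldr-1 because the state of the gear was building",
]
gears, restarts = summarize(log)
print(gears, restarts)
```

A restart count of zero alongside a non-empty gear list matches the verified behavior described above.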

Comment 20 errata-xmlrpc 2015-04-06 17:05:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0779.html