Bug 1134206 - watchman shouldn't restart Jenkins-slave gears
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers
Version: 2.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Assigned To: John W. Lamb
libra bugs
: Upstream
Depends On: 1134686
Blocks: 1150026
Reported: 2014-08-27 02:40 EDT by Anping Li
Modified: 2015-04-06 13:05 EDT
Fixed In Version: openshift-origin-node-util-1.35.1.1-1.el6op
Doc Type: Bug Fix
Doc Text:
Previously, Jenkins slave (or builder) gears were incorrectly restarted by Watchman after 15 minutes, or after the interval set in the STATE_CHANGE_DELAY parameter in the /etc/sysconfig/watchman file on nodes. This was due to Watchman not including the builder processes in its gear process list. This bug fix adds a condition to prevent Watchman from excluding the builder processes, and as a result Jenkins slave gears are no longer incorrectly restarted in this way.
Clone Of:
: 1134686 (view as bug list)
Last Closed: 2015-04-06 13:05:54 EDT
Type: Bug
External Trackers
Tracker ID | Priority | Status | Summary | Last Updated
Red Hat Bugzilla 1150026 | None | None | None | Never
Red Hat Bugzilla 1176649 | None | None | None | Never
Red Hat Product Errata RHBA-2015:0779 | normal | SHIPPED_LIVE | Red Hat OpenShift Enterprise 2.2.5 bug fix and enhancement update | 2015-04-06 17:05:45 EDT

Description Anping Li 2014-08-27 02:40:06 EDT
Description of problem:
/var/log/messages shows that watchman restarted several Jenkins slave gears.

After a Jenkins slave gear is created, it stays in the "building" state until it is destroyed. Maybe watchman should ignore the "building" state. Alternatively, at the cartridge level, the Jenkins slave gear could be set to the "started" state once the build has finished.
 

Version-Release number of selected component (if applicable):
ose-2.1z; OpenShift Online may have the same issue

How reproducible:
Always

Steps to Reproduce:
1. Create an app and add the Jenkins client cartridge
2. Modify the app and git push
3. Observe /var/log/messages

Actual results:
watchman restarted several bldr gears.

[root@node1 openshift]# cat /var/log/messages|grep watchman|grep bldr
Aug 26 04:03:44 node1 watchman[1356]: watchman restarted user 53fc3b99fa838ef802000113: application php1bldr (retries: 0)
Aug 26 05:11:36 node1 watchman[1356]: watchman restarted user 53fc4b88fa838ef802000138: application php1bldr (retries: 0)
Aug 26 23:31:40 node1 watchman[19112]: watchman restarted user 53fd4d66fa838ef802000165: application php1bldr (retries: 0)
Aug 27 00:42:13 node1 watchman[19112]: watchman restarted user 53fd5decfa838ef80200018b: application php1bldr (retries: 0)
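
The grep above can also be done with a short script when the uuid, application name, and retry count are needed as separate fields. This is a minimal sketch, not part of watchman: the regular expression is inferred only from the sample lines in this report, and treating any application name ending in "bldr" as a builder gear is an assumption based on those samples. The function name is made up for illustration.

```python
import re

# Pattern inferred from the restart lines quoted in this report only.
RESTART_RE = re.compile(
    r"watchman\[(?P<pid>\d+)\]: watchman restarted user (?P<uuid>[0-9a-f]+): "
    r"application (?P<app>\S+) \(retries: (?P<retries>\d+)\)"
)


def find_builder_restarts(lines):
    """Return (uuid, app, retries) for each restart of an app whose name ends in 'bldr'."""
    hits = []
    for line in lines:
        m = RESTART_RE.search(line)
        # Assumption: builder gears are recognizable by the 'bldr' suffix.
        if m and m.group("app").endswith("bldr"):
            hits.append((m.group("uuid"), m.group("app"), int(m.group("retries"))))
    return hits


sample = [
    "Aug 26 04:03:44 node1 watchman[1356]: watchman restarted user "
    "53fc3b99fa838ef802000113: application php1bldr (retries: 0)",
    "Mar 17 12:54:08 dhcp-128-178 watchman[21392]: Gears: 7, Memory: 41456, "
    "Plugin: all, Symbols: 10672, Objects: 182280",
]
restarts = find_builder_restarts(sample)
```

An empty result for a node's log would correspond to the expected behaviour described below.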


Expected results:
watchman does not restart Jenkins slave gears.

Additional info:
Comment 8 John W. Lamb 2015-02-25 15:09:59 EST
This is fixed upstream: https://github.com/openshift/origin-server/commit/cc0961110326ed25ab13691b20ed4a6a88a295ab

Because the Jenkins builder processes don't resemble daemons (i.e. they tend not to be children of PID 1) they weren't being included in watchman's gear process list. This made it appear that the gear was in an incorrect state: the gear was BUILDING, STARTED, etc., but as far as watchman knew, the gear wasn't handling any services. After 15 minutes - or whatever interval is set in /etc/sysconfig/watchman option STATE_CHANGE_DELAY - watchman restarts the gear to see if that brings it back into a correct state.

The fix adds a condition to prevent watchman from excluding jenkins builder processes from gear process lists.
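
The behaviour described in this comment can be illustrated roughly as follows. This is a hedged sketch in Python, not the upstream Ruby code: the `Proc` type, the `is_builder` check (matching "slave.jar" in the command line), the list of active states, and the assumption that the delay is counted in seconds are all hypothetical and chosen only to show the shape of the bug and the fix.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    pid: int
    ppid: int
    command: str


def is_builder(proc):
    # Hypothetical marker for a Jenkins builder process; the real check may differ.
    return "slave.jar" in proc.command


def gear_processes(procs, include_builders):
    """Keep daemon-like processes (children of PID 1); optionally keep builders too."""
    return [p for p in procs if p.ppid == 1 or (include_builders and is_builder(p))]


def should_restart(state, procs, seconds_in_state, state_change_delay=15 * 60):
    """Restart when an active gear shows no processes after the delay has elapsed.

    The default of 15 minutes comes from the comment above; expressing it in
    seconds is an assumption.
    """
    active = state in ("building", "started")
    return active and not procs and seconds_in_state >= state_change_delay


# A builder process whose parent is not PID 1.
procs = [Proc(pid=4242, ppid=4100, command="java -jar slave.jar")]
before_fix = gear_processes(procs, include_builders=False)  # builder is dropped
after_fix = gear_processes(procs, include_builders=True)    # builder is kept
```

With the builder dropped from the list, `should_restart("building", before_fix, 900)` is true, which matches the unwanted restarts in the description. With the builder kept, the same call on `after_fix` is false.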
Comment 12 Anping Li 2015-03-17 01:29:28 EDT
Verified; passes on puddle-2-2-2015-03-16.
No bldr gears were restarted by watchman.

1) Set the watchman retry and delay values low.
cat /etc/sysconfig/watchman |grep =
GEAR_RETRIES=3
RETRY_DELAY=180
RETRY_PERIOD=360
STATE_CHANGE_DELAY=60

2) Create Jenkins bldr gears
rhc apps|grep bldr
jbosseap6bldr @ http://jbosseap6bldr-anlidom.ose22-app-201503162.com.cn/ (uuid: 5507b1b64add718596000071)
  Git URL:    ssh://5507b1b64add718596000071@jbosseap6bldr-anlidom.ose22-app-201503162.com.cn/~/git/jbosseap6bldr.git/
  SSH:        5507b1b64add718596000071@jbosseap6bldr-anlidom.ose22-app-201503162.com.cn
php53bldr @ http://php53bldr-anlidom.ose22-app-201503162.com.cn/ (uuid: 5507b1814add718596000059)
  Git URL:    ssh://5507b1814add718596000059@php53bldr-anlidom.ose22-app-201503162.com.cn/~/git/php53bldr.git/
  SSH:        5507b1814add718596000059@php53bldr-anlidom.ose22-app-201503162.com.cn
ruby19bldr @ http://ruby19bldr-anlidom.ose22-app-201503162.com.cn/ (uuid: 5507b19d4add71c6dc0000f3)
  Git URL:    ssh://5507b19d4add71c6dc0000f3@ruby19bldr-anlidom.ose22-app-201503162.com.cn/~/git/ruby19bldr.git/
  SSH:        5507b19d4add71c6dc0000f3@ruby19bldr-anlidom.ose22-app-201503162.com.cn

3) Wait 15 minutes until the bldr gears disappear.
[anli@broker ruby19]$ rhc apps|grep bldr
[anli@broker ruby19]$

4) Check the logs; no gears were restarted.

tailf  /var/log/messages|grep watchman
Mar 17 12:54:08 dhcp-128-178 watchman[21392]: Gears: 7, Memory: 41456, Plugin: all, Symbols: 10672, Objects: 182280
Mar 17 12:54:28 dhcp-128-178 watchman[21392]: Gears: 7, Memory: 41456, Plugin: 
<--snip--->
Mar 17 12:57:49 dhcp-128-178 watchman[21392]: Gears: 7, Memory: 41524, Plugin: all, Symbols: 10672, Objects: 182280
Mar 17 12:58:09 dhcp-128-178 watchman[21392]: Gears: 7, Memory: 41528, Plugin: all, Symbols: 10672, Objects: 182280
Mar 17 12:58:29 dhcp-128-178 watchman[21392]: Gears: 7, Memory: 41528, Plugin: 
<--snip--->
Mar 17 13:19:53 dhcp-128-178 watchman[21392]: Gears: 4, Memory: 41540, Plugin: all, Symbols: 10672, Objects: 182280
Mar 17 13:20:13 dhcp-128-178 watchman[21392]: Gears: 4, Memory: 41540, Plugin: all, Symbols: 10672, Objects: 182280
Mar 17 13:20:33 dhcp-128-178 watchman[21392]: Gears: 4, Memory: 41540, Plugin: all, Symbols: 10672, Objects: 182280
Mar 17 13:20:53 dhcp-128-178 watchman[21392]: Gears: 4, Memory: 41540, Plugin: all, Symbols: 10672, Objects: 182280
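
For scripted verification, the values set in step 1 can be read back from the sysconfig file rather than grepped by eye. The sketch below assumes the file holds plain KEY=VALUE lines, as the `grep =` output in step 1 suggests; `parse_sysconfig` is a made-up helper, and the sample text is an illustrative copy of those values rather than the real file.

```python
def parse_sysconfig(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional double quotes around the value.
        settings[key.strip()] = value.strip().strip('"')
    return settings


# Illustrative copy of the values shown in step 1.
sample = """\
# watchman settings used for this verification
GEAR_RETRIES=3
RETRY_DELAY=180
RETRY_PERIOD=360
STATE_CHANGE_DELAY=60
"""
settings = parse_sysconfig(sample)
state_change_delay = int(settings["STATE_CHANGE_DELAY"])
```

On a node, the text would come from reading /etc/sysconfig/watchman instead of the inline sample.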
Comment 17 Anping Li 2015-04-03 05:34:21 EDT
Verified. 
All web cartridges can do jenkins build successfully. watchman did stop jenkins gears. Bug 115002 also works well. 

One thing to highlight: the .stop_lock file in a Jenkins slave gear may be deleted by watchman, but that does not affect the Jenkins build.

So I am moving the bug to VERIFIED.
Comment 18 Anping Li 2015-04-03 06:13:11 EDT
1. Set STATE_CHANGE_DELAY=20 and STATE_CHECK_PERIOD=30 in /etc/sysconfig/watchman.
2. Create a web cartridge application with Jenkins and git push changes.
3. Check the log messages. They look like the following: watchman deleted the stop lock but did not restart the Jenkins bldr gears.

[root@node2 ~]# cat /var/log/messages|grep watchman
Apr  3 17:05:39 node2 openshift-platform[28484]: watchman deleted stop lock for gear anlidom-perl510bldr-1 because the state of the gear was building
Apr  3 17:05:39 node2 openshift-platform[28484]: watchman deleted stop lock for gear anlidom-perl510bldr-1 because the state of the gear was building
Apr  3 17:09:00 node2 openshift-platform[28484]: watchman deleted stop lock for gear anlidom-python33bldr-1 because the state of the gear was building
Apr  3 17:09:00 node2 openshift-platform[28484]: watchman deleted stop lock for gear anlidom-python33bldr-1 because the state of the gear was building
Apr  3 17:11:41 node2 openshift-platform[28484]: watchman deleted stop lock for gear anlidom-ruby20bldr-1 because the state of the gear was building
Apr  3 17:11:41 node2 openshift-platform[28484]: watchman deleted stop lock for gear anlidom-ruby20bldr-1 because the state of the gear was building
Apr  3 17:36:46 node2 openshift-platform[23559]: watchman deleted stop lock for gear anlidom-jbosseap6bldr-1 because the state of the gear was building
Apr  3 17:36:46 node2 openshift-platform[23559]: watchman deleted stop lock for gear anlidom-jbosseap6bldr-1 because the state of the gear was building
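
Because each stop-lock message appears twice in the output above, a small script can reduce the log to one entry per gear. This is only a sketch: the regular expression is inferred from the lines quoted in this comment, and the function name is made up for illustration.

```python
import re

# Pattern inferred from the stop-lock lines quoted in this comment only.
STOP_LOCK_RE = re.compile(
    r"watchman deleted stop lock for gear (?P<gear>\S+) "
    r"because the state of the gear was (?P<state>\w+)"
)


def gears_with_deleted_stop_lock(lines):
    """Map each gear name (first-seen order) to the state logged for it."""
    seen = {}
    for line in lines:
        m = STOP_LOCK_RE.search(line)
        if m:
            # setdefault keeps the first occurrence and ignores repeats.
            seen.setdefault(m.group("gear"), m.group("state"))
    return seen


sample = [
    "Apr  3 17:05:39 node2 openshift-platform[28484]: watchman deleted stop lock "
    "for gear anlidom-perl510bldr-1 because the state of the gear was building",
    "Apr  3 17:05:39 node2 openshift-platform[28484]: watchman deleted stop lock "
    "for gear anlidom-perl510bldr-1 because the state of the gear was building",
]
stop_locks = gears_with_deleted_stop_lock(sample)
```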
Comment 20 errata-xmlrpc 2015-04-06 13:05:54 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0779.html
