Bug 811667

Summary: jenkins builder app not cleaned up on node creation failure
Product: OKD
Reporter: Bill DeCoste <wdecoste>
Component: Containers
Assignee: Dan Mace <dmace>
Status: CLOSED CURRENTRELEASE
QA Contact: libra bugs <libra-bugs>
Severity: medium
Docs Contact:
Priority: medium
Version: 2.x
CC: abhgupta, bmeng, dmace, dmcphers, jhonce, jhou, rmillner, szhang, wsun, xjia
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: devenv_2826+
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-03-15 14:13:22 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Bill DeCoste 2012-04-11 16:47:25 UTC
Description of problem:
Files are left in /var/lib/stickshift if builder creation fails.


Comment 1 Jhon Honce 2012-06-13 22:00:23 UTC
Do you have a method for reproducing the failure?

Comment 2 Bill DeCoste 2012-06-14 00:38:41 UTC
When node creation fails (e.g. because of a DNS timeout), the jenkins node is not created but the Shift builder is. Since the 15 min lifecycle is controlled by the node, the builder lasts forever, taking up a gear. The workaround is to rebuild and hope DNS resolves, or to delete the builder manually. The fix is to delete the Shift builder application via the java client from the jenkins plugin in the event of a DNS timeout.

Steps to recreate:
1) Create a jenkins app and a jboss app with jenkins enabled
2) Go into the configuration of the jboss job and change the timeout from 300000 to something small like 3000
3) Kick off a build. The builder will be created, DNS will time out, and no node will be created

Expected behavior:
On DNS timeout the Shift builder is deleted and doesn't hang around taking up a gear indefinitely
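
A minimal sketch of the cleanup described above, assuming the provisioning steps can be passed in as callables. All names here (provision_node, DnsTimeout, and the four callables) are hypothetical and are not the jenkins plugin's or the java client's actual API; the sketch only illustrates deleting the builder when the node cannot be created.

```python
class DnsTimeout(Exception):
    """Hypothetical error: the builder's hostname did not resolve in time."""


def provision_node(create_builder, wait_for_dns, create_node, delete_builder):
    """Create a builder and its jenkins node; delete the builder on failure.

    The four callables are stand-ins (assumptions) for whatever the plugin
    uses to talk to the broker and to Jenkins.
    """
    builder = create_builder()
    try:
        # Either step may fail, e.g. wait_for_dns raising DnsTimeout.
        wait_for_dns(builder)
        return create_node(builder)
    except Exception:
        # No node exists to enforce the 15 min lifecycle, so remove the
        # builder here instead of leaving it holding a gear.
        delete_builder(builder)
        raise
```

Re-raising after the cleanup keeps the build marked as failed while still freeing the gear.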

Comment 3 Abhishek Gupta 2012-07-16 20:13:21 UTC
Assigning this back to Bill based on my conversation with him to fix in the jenkins plugin.

Comment 4 Dan Mace 2013-02-14 20:24:58 UTC
Please re-test with latest devenvs. There have been many improvements to the Jenkins plugin's failure handling which might resolve this.

Comment 5 Jianwei Hou 2013-02-16 08:39:07 UTC
Tested on devenv_2816
openshift-origin-cartridge-jenkins-client-1.4-1.4.2-1.git.82.1888878.el6.noarch
jenkins-plugin-openshift-0.6.14-0.el6_3.x86_64
openshift-origin-cartridge-jenkins-1.4-1.5.2-1.git.79.1888878.el6.noarch

The previous problem is gone, but builds of a jbossas application now fail even without any configuration changes.

The errors logged in the jenkins console indicate a load error for open4:
Started by user Jenkins System Builder
Building remotely on as1bldr in workspace jbossas-7/ci/jenkins/workspace/as1-build
Checkout:as1-build / jbossas-7/ci/jenkins/workspace/as1-build - hudson.remoting.Channel@6d8c1:as1bldr
Using strategy: Default
Checkout:as1-build / jbossas-7/ci/jenkins/workspace/as1-build - hudson.remoting.LocalChannel@1e2b3c
Cloning the remote Git repository
Cloning repository origin
Fetching upstream changes from ssh://511f424f7e88cea7c2000129.rhcloud.com/~/git/as1.git/
Seen branch in repository origin/HEAD
Seen branch in repository origin/master
Commencing build of Revision ec968694256a345985edaaf8abd225555661f5a8 (origin/HEAD, origin/master)
Checking out Revision ec968694256a345985edaaf8abd225555661f5a8 (origin/HEAD, origin/master)
No change to record in branch origin/HEAD
No change to record in branch origin/master
[as1-build] $ /bin/sh -xe /tmp/hudson454816972224339925.sh
+ source /usr/libexec/openshift/cartridges/abstract/info/lib/jenkins_util
+ jenkins_rsync '511f424f7e88cea7c2000129.rhcloud.com:~/.m2/' /var/lib/openshift/511f42fe7e88cea7c200016f/.m2/
+ rsync --delete-after -az -e /usr/libexec/openshift/cartridges/jenkins-1.4/info/bin/git_ssh_wrapper.sh '511f424f7e88cea7c2000129.rhcloud.com:~/.m2/' /var/lib/openshift/511f42fe7e88cea7c200016f/.m2/
+ . ci_build.sh
++ set +x
Running .openshift/action_hooks/pre_build
/usr/bin/oo-cgroup-read:7:in `require': no such file to load -- open4 (LoadError)
	from /usr/bin/oo-cgroup-read:7
Build step 'Execute shell' marked build as failure
Archiving artifacts
Finished: FAILURE


[root@ip-10-195-191-40 ~]# oo-cgroup-read 
/usr/bin/oo-cgroup-read:7:in `require': no such file to load -- open4 (LoadError)
	from /usr/bin/oo-cgroup-read:7
You have new mail in /var/spool/mail/root

[root@ip-10-195-191-40 ~]# gem list open4

*** LOCAL GEMS ***

open4 (1.3.0)

Comment 6 xjia 2013-02-17 08:27:36 UTC
Because "oo-cgroup-read" is used when snapshotting the app, the snapshot function does not work at all.
Changing the Severity to "Medium" accordingly.

Comment 7 Jianwei Hou 2013-02-18 06:43:46 UTC
Considering the importance of oo-cgroup-read, I have filed bug 912215 to keep track.
Please set this bug to ON_QA and I will verify this bug when the open4 issue is fixed, thanks!

Comment 8 Dan Mace 2013-02-18 15:51:10 UTC
Bug 912215 is resolved in devenv_2826; updating this one to ON_QA.

Comment 9 Jianwei Hou 2013-02-19 02:33:16 UTC
Bug 912215 still has an SELinux issue; waiting for its fix before verifying this bug.

Comment 10 Wei Sun 2013-03-01 07:19:10 UTC
Bug 912215 still has an SELinux issue; waiting for its fix before verifying this bug.

Comment 11 Meng Bo 2013-03-04 11:17:55 UTC
Moving this bug to Verified, since bug 912215 has been fixed.

Comment 12 Meng Bo 2013-03-05 03:22:28 UTC
Checked on devenv_2894:

1. Create a jboss app with jenkins enabled.
2. Change the builder timeout to a small value.
3. Push a build.
4. Check the jenkins log.

The build fails because of the low timeout.

Jenkins log (newest entries first):
Mar 04, 2013 10:13:56 PM hudson.plugins.openshift.OpenShiftCloud cancelItem
WARNING: Build app1-build app1bldr has been canceled

Mar 04, 2013 10:13:56 PM hudson.plugins.openshift.OpenShiftCloud cancelItem
INFO: Cancelling Item 

Mar 04, 2013 10:13:56 PM hudson.plugins.openshift.OpenShiftCloud provision
INFO: Provisioned 0 new nodes

Mar 04, 2013 10:13:46 PM hudson.plugins.openshift.OpenShiftSlave _terminate
INFO: Terminating OpenShift application...

Mar 04, 2013 10:13:46 PM hudson.plugins.openshift.OpenShiftSlave _terminate
INFO: Terminating slave app1bldr (uuid: 51356197a11aca9e1a000091)

Mar 04, 2013 10:13:46 PM hudson.plugins.openshift.OpenShiftCloud provisionSlave
INFO: Slave exists without corresponding builder. Deleting slave

Mar 04, 2013 10:13:46 PM hudson.plugins.openshift.OpenShiftCloud builderExists
INFO: Found an existing builder.  Not provisioning...

Mar 04, 2013 10:13:46 PM hudson.plugins.openshift.OpenShiftCloud builderExists
INFO: Capacity remaining - checking for existing type...

Mar 04, 2013 10:13:46 PM hudson.plugins.openshift.OpenShiftCloud getSlave
INFO: slaveExists app1bldr app1bldr

Mar 04, 2013 10:13:46 PM hudson.plugins.openshift.OpenShiftCloud getSlaves
INFO: Found existing slave for: app1bldr

Mar 04, 2013 10:13:45 PM hudson.plugins.openshift.OpenShiftCloud getOpenShiftConnection
INFO: Initiating Java Client Service - Configured for OpenShift Server https://localhost

Mar 04, 2013 10:13:45 PM hudson.plugins.openshift.OpenShiftCloud provision
INFO: Provisioning new node for workload = 2 and label = app1-build

Mar 04, 2013 10:13:45 PM hudson.slaves.NodeProvisioner update
INFO: app1-build provisioning successfully completed. We have now 1 computer(s)


After the build failed, I checked /var/lib/openshift/; no builder files were left.