Bug 978592 - Jenkins cloud integration fails to start new builder slave
Jenkins cloud integration fails to start new builder slave
Status: CLOSED CURRENTRELEASE
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers (Show other bugs)
1.1.1
Unspecified Unspecified
unspecified Severity high
: ---
: ---
Assigned To: Brenton Leanhardt
libra bugs
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2013-06-26 17:15 EDT by Matt Johnson
Modified: 2017-03-08 12 EST (History)
4 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-09-11 11:36:16 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description Matt Johnson 2013-06-26 17:15:41 EDT
Description of problem:
It appears the application/gear creation when a build slave is needed is doomed to fail, as the application does not have the needed bootstrap slave.jar based on the ENV variables of the application. (more info below)

Attempting to connect slave...
Transferring slave.jar file...

ERROR: Unexpected error in launching a slave. This is probably a bug in Jenkins
java.io.IOException: Download of slave.jar failed.  Return code = 4
	at hudson.plugins.openshift.OpenShiftComputerLauncher.launch(OpenShiftComputerLauncher.java:97)
	at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:222)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
	at java.util.concurrent.FutureTask.run(FutureTask.java:166)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:679)
 
Version-Release number of selected component (if applicable):
Latest openshift enterprise 30 day trial


How reproducible:
Always

Steps to Reproduce... The below is my log of debugging which I was sending to a colleague, it shows the issue, what we tried to fix/hack/understand more about the problem. Hopefully it will provide all info needed.

the file exists at;

 https://jenkins1-mj.poc.matt-j.co.uk/jnlpJars/slave.jar
so that env variable for the slave SHOULD be the hostname of the 'master' jenkins instace it's attached to? I'd think.

Build slave variables as follows

[MJCITestbldr-mj.poc.matt-j.co.uk ~]\> export
declare -x HISTFILE="/var/lib/openshift/945379cddf004aaf91d6993437a9bcb4/app-root/data/.bash_history"
declare -x HOME="/var/lib/openshift/945379cddf004aaf91d6993437a9bcb4"
declare -x JENKINS_PASSWORD="mVrHmRGXTqkb"
declare -x JENKINS_URL="https://jenkins1-mj.poc.matt-j.co.uk/"
declare -x JENKINS_USERNAME="system_builder"
declare -x LANG="en_GB.UTF-8"
declare -x LOGNAME="945379cddf004aaf91d6993437a9bcb4"
declare -x MAIL="/var/mail/945379cddf004aaf91d6993437a9bcb4"
declare -x OLDPWD
declare -x OPENSHIFT_APP_DNS="MJCITestbldr-mj.poc.matt-j.co.uk"
declare -x OPENSHIFT_APP_NAME="MJCITestbldr"
declare -x OPENSHIFT_APP_UUID="945379cddf004aaf91d6993437a9bcb4"
declare -x OPENSHIFT_DATA_DIR="/var/lib/openshift/945379cddf004aaf91d6993437a9bcb4/app-root/data/"
declare -x OPENSHIFT_GEAR_DNS="MJCITestbldr-mj.poc.matt-j.co.uk"
declare -x OPENSHIFT_GEAR_NAME="MJCITestbldr"
declare -x OPENSHIFT_GEAR_UUID="945379cddf004aaf91d6993437a9bcb4"
declare -x OPENSHIFT_HOMEDIR="/var/lib/openshift/945379cddf004aaf91d6993437a9bcb4/"
declare -x OPENSHIFT_INTERNAL_IP="127.0.253.1"
declare -x OPENSHIFT_INTERNAL_PORT="8080"
declare -x OPENSHIFT_PHP_IP="127.0.253.1"
declare -x OPENSHIFT_PHP_LOG_DIR="/var/lib/openshift/945379cddf004aaf91d6993437a9bcb4/php-5.3/logs/"
declare -x OPENSHIFT_PHP_PORT="8080"
declare -x OPENSHIFT_REPO_DIR="/var/lib/openshift/945379cddf004aaf91d6993437a9bcb4/app-root/runtime/repo/"
declare -x OPENSHIFT_TMP_DIR="/tmp/"
declare -x PATH="/usr/libexec/openshift/cartridges/php-5.3/info/bin/:/usr/libexec/openshift/cartridges/abstract-httpd/info/bin/:/usr/libexec/openshift/cartridges/abstract/info/bin/:/sbin:/usr/sbin:/bin:/usr/bin"
declare -x PS1="[MJCITestbldr-mj.poc.matt-j.co.uk \\W]\\> "
declare -x PWD="/var/lib/openshift/945379cddf004aaf91d6993437a9bcb4"
declare -x SELINUX_LEVEL_REQUESTED=""
declare -x SELINUX_ROLE_REQUESTED=""
declare -x SELINUX_USE_CURRENT_RANGE=""
declare -x SHELL="/bin/bash"
declare -x SHLVL="1"
declare -x SSH_CLIENT="128.107.239.233 16345 22"
declare -x SSH_CONNECTION="128.107.239.233 16345 172.31.3.71 22"
declare -x SSH_TTY="/dev/pts/0"
declare -x TERM="xterm"
declare -x TMOUT="300"
declare -x USER="945379cddf004aaf91d6993437a9bcb4"


Which means the slave is trying to pull from https://mjcitestbldr-mj.poc.matt-j.co.uk/jnlpJars/slave.jar instead

not from https://jenkins1-mj.poc.matt-j.co.uk/jnlpJars/slave.jar

Which DOESNT appear to have the file;

Service Temporarily Unavailable

The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.

Maybe it SHOULD work... but the builder app appears stopped

[MJCITestbldr-mj.poc.matt-j.co.uk ~]\> ctl_all restart
httpd (no pid file) not running
Done
Application is explicitly stopped!  Use 'rhc app start -a MJCITestbldr' to start back up.

Done
but if it cant start without getting the file from itself then it's never going to start????


But the builder application goes away dynamically it seems after a while. Clicking 'build now' in Jenkins forks a new instance with the same '<appname>bldr' instance name... 


Your client tools are now configured.
matjohn2-mac:~ matt$ rhc app start -a MJCITestbldr
Password: *********

Application 'MJCITestbldr' not found


So i kicked off a 'build now' in jenkins
which made the MJCITestbldr application come back

again, 503 service unavailable because it didnt start properly (because it couldnt get the jar?)

so i then manually started from rh tools, just like before it auto dissapeared

matjohn2-mac:~ matt$ rhc app start -a MJCITestbldr
Password: *********


RESULT:
MJCITestbldr started

....and now the app is up (from a HTTP perspective)

https://mjcitestbldr-mj.poc.matt-j.co.uk/jnlpJars/slave.jar
Not Found

The requested URL /jnlpJars/slave.jar was not found on this server.

From the Log:

Apache/2.2.15 (Red Hat) Server at mjcitestbldr-mj.poc.matt-j.co.uk Port 80
But the file doesnt exist
[Wed Jun 26 15:00:24 2013] [notice] Digest: done
[Wed Jun 26 15:00:24 2013] [notice] Apache/2.2.15 (Unix) configured -- resuming normal operations
[Wed Jun 26 15:01:52 2013] [notice] SELinux policy enabled; httpd running as context system_u:system_r:openshift_t:s0:c0,c506
[Wed Jun 26 15:01:52 2013] [notice] mod_bw : Memory Allocated 32 bytes (each conf takes 32 bytes)
[Wed Jun 26 15:01:52 2013] [notice] mod_bw : Version 0.8 - Initialized [1 Confs]
[Wed Jun 26 15:01:52 2013] [notice] Digest: generating secret for digest authentication ...
[Wed Jun 26 15:01:52 2013] [notice] Digest: done
[Wed Jun 26 15:01:53 2013] [notice] Apache/2.2.15 (Unix) configured -- resuming normal operations
[Wed Jun 26 15:02:04 2013] [error] [client 127.0.253.1] File does not exist: /var/lib/openshift/689bf7ed2b854a789b187ae98505fe35/app-root/runtime/repo/php/jnlpJars
[Wed Jun 26 15:02:07 2013] [error] [client 127.0.253.1] File does not exist: /var/lib/openshift/689bf7ed2b854a789b187ae98505fe35/app-root/runtime/repo/php/jnlpJars
[Wed Jun 26 15:03:59 2013] [error] [client 127.0.253.1] File does not exist: /var/lib/openshift/689bf7ed2b854a789b187ae98505fe35/app-root/runtime/repo/php/jnlpJars
[Wed Jun 26 15:04:00 2013] [error] [client 127.0.253.1] File does not exist: /var/lib/openshift/689bf7ed2b854a789b187ae98505fe35/app-root/runtime/repo/php/jnlpJars

So the slave.jar should be on the MJCITestbldr app, but isnt. Lets test by putting it there;

Tried manually SCPing the jar to the slave jenkins node to see if that does actually work

the gear dissapeared again so tried to rebuild on jenkins again, hence the ID's for the SCP below have changed;

matjohn2-mac:~ matt$ mkdir jnlpJars
matjohn2-mac:~ matt$ mv Downloads/slave.jar jnlpJars/.
matjohn2-mac:~ matt$ scp -r jnlpJars 86e98aadd6d44930b2e972b2e36c89cd@MJCITestbldr-mj.poc.matt-j.co.uk:/var/lib/openshift/86e98aadd6d44930b2e972b2e36c89cd/app-root/runtime/repo/php/.
slave.jar                                                                                                                                                                             100%  278KB 277.5KB/s   00:00

start the app again
matjohn2-mac:~ matt$ rhc app start -a MJCITestbldr
Password: *********


RESULT:
MJCITestbldr started

https://mjcitestbldr-mj.poc.matt-j.co.uk/jnlpJars/slave.jar

This request now works... so lets try building in Jenkins again

WTF... Still fails
Attempting to connect slave...
Transferring slave.jar file...

ERROR: Unexpected error in launching a slave. This is probably a bug in Jenkins
java.io.IOException: Download of slave.jar failed.  Return code = 4
	at hudson.plugins.openshift.OpenShiftComputerLauncher.launch(OpenShiftComputerLauncher.java:97)
	at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:222)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
	at java.util.concurrent.FutureTask.run(FutureTask.java:166)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:679)


but the file IS on the box. Finding the code for this plugin in GITHUB to check what it should be doing; 

https://github.com/openshift/jenkins-cloud-plugin/blob/master/src/main/java/hudson/plugins/openshift/OpenShiftComputerLauncher.java


source code looks like it should be resolving, can you double check my understanding of this section;

  // A command for the initial slave setup
            StringBuilder execCommand = new StringBuilder();
            execCommand.append("mkdir -p $OPENSHIFT_DATA_DIR/jenkins")
                    .append(" && cd $OPENSHIFT_DATA_DIR/jenkins")
                    .append(" && rm -f slave.jar").append(" && wget -q --no-check-certificate https://")
                    .append(getGearDNS(hostName))
                    .append("/jnlpJars/slave.jar");

looks like it WILL result in https://mjcitestbldr-mj.poc.matt-j.co.uk/jnlpJars/slave.jar

which works from a browser now i've SCP'ed it across.

Definitley isnt downloading, as nothing appears on node in;

ls /var/lib/openshift/86e98aadd6d44930b2e972b2e36c89cd/app-root/data/jenkins/

I wonder if it's DNS Resolution of the jenkins app (not quick enough etc?)

hacking the ENV to an IP address doesn't work as it the node needs the DNS name to know which app to route the request too (node has many gears);

OPENSHIFT_GEAR_DNS=54.229.72.85

So thats a no go, try adding to hosts to rule out DNS;


[root@node ~]# wget --no-check-certificate https://mjcitestbldr-mj.poc.matt-j.co.uk/jnlpJars/slave.jar
--2013-06-26 15:44:42--  https://mjcitestbldr-mj.poc.matt-j.co.uk/jnlpJars/slave.jar
Resolving mjcitestbldr-mj.poc.matt-j.co.uk... 54.229.72.85
Connecting to mjcitestbldr-mj.poc.matt-j.co.uk|54.229.72.85|:443... connected.
WARNING: cannot verify mjcitestbldr-mj.poc.matt-j.co.uk’s certificate, issued by “/C=--/ST=SomeState/L=SomeCity/O=SomeOrganization/OU=SomeOrganizationalUnit/CN=node.poc.matt-j.co.uk/emailAddress=root@node.poc.matt-j.co.uk”:
  Self-signed certificate encountered.
WARNING: certificate common name “node.poc.matt-j.co.uk” doesn’t match requested host name “mjcitestbldr-mj.poc.matt-j.co.uk”.
HTTP request sent, awaiting response... 200 OK
Length: 284160 (278K) [application/x-java-archive]
Saving to: “slave.jar”

100%[==============================================================================================================================================================================>] 284,160      485K/s   in 0.6s

2013-06-26 15:44:42 (485 KB/s) - “slave.jar” saved [284160/284160]

So DNS now 100% sure resolves for WGET on the node running our gears, lets see if we have success;

Attempting to connect slave...
Transferring slave.jar file...
ERROR: Unexpected error in launching a slave. This is probably a bug in Jenkins
java.io.IOException: Download of slave.jar failed.  Return code = 4
	at hudson.plugins.openshift.OpenShiftComputerLauncher.launch(OpenShiftComputerLauncher.java:97)
	at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:222)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
	at java.util.concurrent.FutureTask.run(FutureTask.java:166)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:679)


OK Now i'm confused, i'll log a RH bug for now, as everything that is needed is there! that i can see and it Still isnt succeeding.


  Status
 Delete Slave
 Configure
 Build History
 Load Statistics
 Script Console
 Log
 System Information
Build Executor Status
#	
Status
	

Attempting to connect slave...
Transferring slave.jar file...

ERROR: Unexpected error in launching a slave. This is probably a bug in Jenkins
java.io.IOException: Download of slave.jar failed.  Return code = 4
	at hudson.plugins.openshift.OpenShiftComputerLauncher.launch(OpenShiftComputerLauncher.java:97)
	at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:222)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
	at java.util.concurrent.FutureTask.run(FutureTask.java:166)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:679)


Actual results:
Jenkins fails to start worker

Expected results:
Build
Comment 2 Trastle 2013-06-26 17:58:05 EDT
The following is a work around and not a solution but knowing that it works may help you debug.


In order to work around this at the moment we have rebuilt the plugin.
The changes made have been to OpenShiftComputerLauncher.java

https://github.com/openshift/jenkins-cloud-plugin/blob/master/src/main/java/hudson/plugins/openshift/OpenShiftComputerLauncher.java


We've switched out the generate URL for slave.jar to a hard-coded URL we know to be good.

.append(" && rm -f slave.jar").append(" && wget -q --no-check-certificate https://")
.append(getGearDNS(hostName))
.append("/jnlpJars/slave.jar");

Ends up as 

.append(" && rm -f slave.jar").append(" && wget -q --no-check-certificate ") 
.append("https://dl.dropboxusercontent.com/u/214102/jars/slave.jar");	

This is allowing us to get slaves up and running.
Jenkins is now building the apps correctly.

Thanks.
Comment 3 Luke Meyer 2013-06-27 12:20:00 EDT
This is likely resolved with OSE 1.2 (coming soon). Not sure when we'll get a chance to look at this.
Comment 4 Brenton Leanhardt 2013-08-15 16:35:28 EDT
I'm wondering if this is related to Bug #984608.  If you are willing to test we could give you the updated jenkins-plugin-openshift that we're shipping in 1.2.2.
Comment 6 Brenton Leanhardt 2013-09-11 11:36:16 EDT
This is almost certainly related to Bug #984608 that has been fixed in 1.2.  I'm marking this as CURRENTRELEASE.  If the reporter is willing to debug further I can be available to debug their environment.

Note You need to log in before you can comment on or make changes to this bug.