Bug 1333133 - better retry in accessing replication controllers from openshift jenkins-plugin
Summary: better retry in accessing replication controllers from openshift jenkins-plugin
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: ImageStreams
Version: 3.2.0
Hardware: All
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Gabe Montero
QA Contact: Wang Haoran
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-05-04 17:53 UTC by Gabe Montero
Modified: 2016-06-06 19:06 UTC (History)
8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-06-06 19:06:38 UTC
Target Upstream Version:
Embargoed:
gmontero: needinfo-


Attachments
config verify openshift deployment (36.49 KB, image/png)
2016-05-26 03:39 UTC, wewang
no flags
verify Openshift build UI (36.97 KB, image/png)
2016-05-26 05:39 UTC, wewang
no flags


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2016:1206 0 normal SHIPPED_LIVE Moderate: jenkins security update 2016-06-06 23:06:23 UTC

Description Gabe Montero 2016-05-04 17:53:14 UTC
During online testing, when attempting to verify deployments, a timing issue occurs where, if we attempt to retrieve the ReplicationController (RC) before it is created, we get:

ERROR: Build step failed with exception
com.openshift.restclient.OpenShiftException: Could not get resource frontend-prod-1 in namespace gmontero-online-hackday: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"replicationcontrollers \"frontend-prod-1\" not found","reason":"NotFound","details":{"name":"frontend-prod-1","kind":"replicationcontrollers"},"code":404}

	at com.openshift.internal.restclient.DefaultClient.createOpenShiftException(DefaultClient.java:482)
	at com.openshift.internal.restclient.DefaultClient.get(DefaultClient.java:306)
	at com.openshift.jenkins.plugins.pipeline.IOpenShiftPlugin.getLatestReplicationController(IOpenShiftPlugin.java:64)
	at com.openshift.jenkins.plugins.pipeline.OpenShiftDeploymentVerifier.coreLogic(OpenShiftDeploymentVerifier.java:101)
	at com.openshift.jenkins.plugins.pipeline.IOpenShiftPlugin.doItCore(IOpenShiftPlugin.java:97)
	at com.openshift.jenkins.plugins.pipeline.IOpenShiftPlugin.doIt(IOpenShiftPlugin.java:111)
	at com.openshift.jenkins.plugins.pipeline.OpenShiftBaseStep.perform(OpenShiftBaseStep.java:89)
	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:782)
	at hudson.model.Build$BuildExecution.build(Build.java:205)
	at hudson.model.Build$BuildExecution.doRun(Build.java:162)
	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:534)
	at hudson.model.Run.execute(Run.java:1738)
	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
	at hudson.model.ResourceController.execute(ResourceController.java:98)
	at hudson.model.Executor.run(Executor.java:410)
Caused by: com.openshift.internal.restclient.http.NotFoundException: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"replicationcontrollers \"frontend-prod-1\" not found","reason":"NotFound","details":{"name":"frontend-prod-1","kind":"replicationcontrollers"},"code":404}

	at com.openshift.internal.restclient.http.UrlConnectionHttpClient.createException(UrlConnectionHttpClient.java:230)
	at com.openshift.internal.restclient.http.UrlConnectionHttpClient.request(UrlConnectionHttpClient.java:165)
	at com.openshift.internal.restclient.http.UrlConnectionHttpClient.request(UrlConnectionHttpClient.java:141)
	at com.openshift.internal.restclient.http.UrlConnectionHttpClient.get(UrlConnectionHttpClient.java:103)
	at com.openshift.internal.restclient.DefaultClient.get(DefaultClient.java:302)
	... 14 more
Caused by: java.io.FileNotFoundException: https://openshift.default.svc.cluster.local/api/v1/namespaces/gmontero-online-hackday/replicationcontrollers/frontend-prod-1
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1836)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1441)
	at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:254)
	at com.openshift.internal.restclient.http.UrlConnectionHttpClient.request(UrlConnectionHttpClient.java:161)
	... 17 more


We need to ensure that getLatestReplicationController() catches exceptions from the restclient and simply returns null, so that the higher-level retry logic can operate correctly.
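
For illustration only, here is a minimal sketch of that approach. The method name and the OpenShiftException type come from the stack trace above, but the restclient call signature, the ResourceKind constant, and the surrounding retry loop are assumptions made for this sketch, not the actual plugin code:

    import com.openshift.restclient.IClient;
    import com.openshift.restclient.OpenShiftException;
    import com.openshift.restclient.ResourceKind;
    import com.openshift.restclient.model.IReplicationController;

    public class ReplicationControllerLookup {

        // Catch the restclient's OpenShiftException (e.g. the 404/NotFound above)
        // and return null, so the caller can keep polling instead of failing the step.
        public static IReplicationController getLatestReplicationController(
                IClient client, String namespace, String rcName) {
            try {
                // Assumed restclient call; constant and signature may differ by version.
                return client.get(ResourceKind.REPLICATION_CONTROLLER, rcName, namespace);
            } catch (OpenShiftException e) {
                // RC not created yet, or a transient API error: report "not there yet".
                return null;
            }
        }

        // Hypothetical higher-level retry loop: poll once per second until the RC
        // exists or the timeout expires.
        public static IReplicationController waitForReplicationController(
                IClient client, String namespace, String rcName, long timeoutMillis)
                throws InterruptedException {
            long deadline = System.currentTimeMillis() + timeoutMillis;
            while (System.currentTimeMillis() < deadline) {
                IReplicationController rc =
                        getLatestReplicationController(client, namespace, rcName);
                if (rc != null) {
                    return rc;
                }
                Thread.sleep(1000);
            }
            return null; // caller reports a verification failure after the timeout
        }
    }

The key point is only that a NotFound from the API server becomes a null return, which the higher-level verification logic can poll on until the deployment's ReplicationController actually exists or a timeout is reached.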

Comment 1 Gabe Montero 2016-05-04 19:59:52 UTC
Fix pushed to openshift/jenkins-plugin with commit https://github.com/openshift/jenkins-plugin/commit/1c7c55083cdafe9db2f3074e4ef707186545b06b

A Jenkins image update will probably start within the week anyway because of future work.

Comment 2 Gabe Montero 2016-05-13 20:43:31 UTC
RHEL Jenkins images with v1.0.10 of the plugin, which has the fix for this bug, are now available on brew-pulp.

Comment 4 wewang 2016-05-25 02:40:14 UTC
@Gabe Montero, could you add steps on how to reproduce this? I am a little confused about how to verify the bug, thanks.

Comment 5 Gabe Montero 2016-05-25 13:55:42 UTC
@Wen Wang - unfortunately, this is a tricky timing window that will be hard to reproduce.

When I found it and then locally verified my fix, there were timing issues during the Online Hackathon, given the heavy usage, which delayed the actual creation of ReplicationControllers when a deployment was initiated and the plugin had a "Verify OpenShift Deployment" step.

If you can somehow replicate that delay, and run the "Verify OpenShift Deployment" step with "Allow for verbose logging during this build step plug-in" turned on, you should see an exception like the one in the description, but retries should occur, and if the deployment is ultimately successful, the "Verify OpenShift Deployment" step will report success.

My best guess at being able to artificially manufacture this delay is to define a DeploymentConfig with a pre lifecycle hook that, say, sleeps for 60 seconds.

It is not my area of expertise, but if I'm reading the code right, that could mimic this delay.

Then, create a Jenkins job which deploys this DeploymentConfig and then attempts to verify the deployment, with verbose logging, so you see that the exception occurs initially, but then ultimately the ReplicationController is created and the verify succeeds.

Comment 6 wewang 2016-05-26 03:39:40 UTC
Created attachment 1161707 [details]
config verify openshift deployment

Comment 7 wewang 2016-05-26 03:43:14 UTC
Tested with openshift3/jenkins-1-rhel7 8fe7d109f5bd
My environment is:
 [root@dhcp-128-91 build]# oc get dc
NAME            REVISION   REPLICAS   TRIGGERED BY
frontend        1          1          config,image(origin-nodejs-sample:latest)
frontend-prod   0          1          config,image(origin-nodejs-sample:prod)
jenkins         1          1          config,image(jenkins:latest)
[root@dhcp-128-91 build]# oc get rc
NAME         DESIRED   CURRENT   AGE
frontend-1   1         1         6m
jenkins-1    1         1         12m
[root@dhcp-128-91 build]# oc get pods
NAME               READY     STATUS      RESTARTS   AGE
frontend-1-build   0/1       Completed   0          9m
frontend-1-nts1f   1/1       Running     0          6m
jenkins-1-yxomb    1/1       Running     0          12m

When I configure "Verify OpenShift Deployment" as in the attachment and build the job, it fails with this error: http://pastebin.test.redhat.com/377590

Also, a question: what command is this build step equivalent to in the background?

Comment 8 wewang 2016-05-26 05:39:10 UTC
Created attachment 1161725 [details]
verify Openshift build UI

Comment 9 wewang 2016-05-26 05:41:26 UTC
And when I configure "Verify OpenShift Deployment" as in attachment 1161725 [details], the build succeeds, but the dc, rc, and pods show no change. Please see the console output: http://pastebin.test.redhat.com/377596

Comment 10 Gabe Montero 2016-05-26 15:02:27 UTC
The plugin reacted correctly, in my opinion.  You never started a frontend-prod deployment.  The `oc get rc` and `oc get pods` output confirms that.

I could see making the message a bit clearer when a deployment is not available, but I don't think we should gate this bugzilla's verification on that.

Add a "Tag OpenShift Image" step prior to the "Verify OpenShift Deployment" step, where you tag origin-nodejs-sample:latest to origin-nodejs-sample:prod.

Also, with the screenshots you posted, unless you scale the deployment out to 3 before running the verify, that failure will be noted.

Lastly, I saw no indication that you attempted the sabotage I articulated in Comment #5.  If that is too involved, and you simply want to do some regression testing to make sure I did not break the typical mainline path, I am OK with that.  Just wanted to confirm that is what you were thinking.

Comment 11 wewang 2016-05-27 06:12:11 UTC
@Gabe Montero, I verified the following:
  
 1. Configure "Tag OpenShift Image" to set the new tag prod on origin-nodejs-sample:
 # oc get is
NAME                   DOCKER REPO                                       TAGS          UPDATED
nodejs-010-rhel7       172.30.153.230:5000/wewang/nodejs-010-rhel7                     
origin-nodejs-sample   172.30.153.230:5000/wewang/origin-nodejs-sample   prod,latest   19 seconds ago
2. So there is now the rc frontend-prod-1:
   # oc get rc
NAME              DESIRED   CURRENT   AGE
frontend-1        1         1         2h
frontend-prod-1   1         1         2m
jenkins-1         1         1         2h
3. Configure "Verify OpenShift Deployment" and build the job; the build completes:
   "Verify OpenShift Deployment" successfully; deployment "frontend-prod-1" has completed with status:  [Complete].
 Console output: http://pastebin.test.redhat.com/378066

4. Also checked using the template below (with the timeout changed to 60):
  $ oc new-app -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/image/language-image-templates/python-34-rhel7-stibuild.json
  Configure "Verify OpenShift Deployment" and build the job; the build completes.

Is there anything else I should verify? If not, I will change the status to "verified". I am not sure I fully understand Comment 5, so I will wait for your reply before closing out this bug.

Comment 13 Gabe Montero 2016-06-02 14:46:14 UTC
Yeah, at this point, let's not worry about Comment #5.  I wasn't 100% sure it was a viable sabotage anyway.  And as I said before, I was able to try this change in the unstable online env the day I found this.

Go ahead and mark this verified.

Comment 15 errata-xmlrpc 2016-06-06 19:06:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:1206

