Bug 1333133

Summary: better retry in accessing replication controllers from the openshift jenkins-plugin
Product: OpenShift Container Platform
Component: ImageStreams
Version: 3.2.0
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Reporter: Gabe Montero <gmontero>
Assignee: Gabe Montero <gmontero>
QA Contact: Wang Haoran <haowang>
CC: aos-bugs, bparees, gmontero, jokerman, mmccomas, tdawson, wewang, wzheng
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Keywords: Rebase
Flags: gmontero: needinfo-
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-06-06 19:06:38 UTC
Attachments:
  config verify openshift deployment
  verify Openshift build UI

Description Gabe Montero 2016-05-04 17:53:14 UTC
During online testing, when attempting to verify deployments, a timing issue occurs where, if we attempt to retrieve the RC before it is created, we get:

ERROR: Build step failed with exception
com.openshift.restclient.OpenShiftException: Could not get resource frontend-prod-1 in namespace gmontero-online-hackday: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"replicationcontrollers \"frontend-prod-1\" not found","reason":"NotFound","details":{"name":"frontend-prod-1","kind":"replicationcontrollers"},"code":404}

	at com.openshift.internal.restclient.DefaultClient.createOpenShiftException(DefaultClient.java:482)
	at com.openshift.internal.restclient.DefaultClient.get(DefaultClient.java:306)
	at com.openshift.jenkins.plugins.pipeline.IOpenShiftPlugin.getLatestReplicationController(IOpenShiftPlugin.java:64)
	at com.openshift.jenkins.plugins.pipeline.OpenShiftDeploymentVerifier.coreLogic(OpenShiftDeploymentVerifier.java:101)
	at com.openshift.jenkins.plugins.pipeline.IOpenShiftPlugin.doItCore(IOpenShiftPlugin.java:97)
	at com.openshift.jenkins.plugins.pipeline.IOpenShiftPlugin.doIt(IOpenShiftPlugin.java:111)
	at com.openshift.jenkins.plugins.pipeline.OpenShiftBaseStep.perform(OpenShiftBaseStep.java:89)
	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:782)
	at hudson.model.Build$BuildExecution.build(Build.java:205)
	at hudson.model.Build$BuildExecution.doRun(Build.java:162)
	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:534)
	at hudson.model.Run.execute(Run.java:1738)
	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
	at hudson.model.ResourceController.execute(ResourceController.java:98)
	at hudson.model.Executor.run(Executor.java:410)
Caused by: com.openshift.internal.restclient.http.NotFoundException: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"replicationcontrollers \"frontend-prod-1\" not found","reason":"NotFound","details":{"name":"frontend-prod-1","kind":"replicationcontrollers"},"code":404}

	at com.openshift.internal.restclient.http.UrlConnectionHttpClient.createException(UrlConnectionHttpClient.java:230)
	at com.openshift.internal.restclient.http.UrlConnectionHttpClient.request(UrlConnectionHttpClient.java:165)
	at com.openshift.internal.restclient.http.UrlConnectionHttpClient.request(UrlConnectionHttpClient.java:141)
	at com.openshift.internal.restclient.http.UrlConnectionHttpClient.get(UrlConnectionHttpClient.java:103)
	at com.openshift.internal.restclient.DefaultClient.get(DefaultClient.java:302)
	... 14 more
Caused by: java.io.FileNotFoundException: https://openshift.default.svc.cluster.local/api/v1/namespaces/gmontero-online-hackday/replicationcontrollers/frontend-prod-1
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1836)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1441)
	at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:254)
	at com.openshift.internal.restclient.http.UrlConnectionHttpClient.request(UrlConnectionHttpClient.java:161)
	... 17 more


Need to ensure that getLatestReplicationController() catches exceptions from the restclient and simply returns null, so that higher-level retry logic can operate correctly.
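A minimal sketch of that idea follows. The helper name matches the stack trace above, but the signature, the IClient lookup call, and the retry helper are illustrative assumptions, not the plugin's exact code:

import com.openshift.restclient.IClient;
import com.openshift.restclient.OpenShiftException;
import com.openshift.restclient.ResourceKind;
import com.openshift.restclient.model.IReplicationController;

public class ReplicationControllerRetrySketch {

    // Treat "the RC is not there yet" (the 404 above) like any other transient
    // restclient failure: return null instead of blowing up the build step.
    static IReplicationController getLatestReplicationController(
            IClient client, String namespace, String rcName) {
        try {
            return client.get(ResourceKind.REPLICATION_CONTROLLER, rcName, namespace);
        } catch (OpenShiftException e) {
            // NotFound or a transient API error; let the caller's retry loop keep polling.
            return null;
        }
    }

    // Higher-level retry logic (simplified): poll until the RC appears or we give up.
    static IReplicationController waitForReplicationController(
            IClient client, String namespace, String rcName,
            int maxTries, long sleepMillis) throws InterruptedException {
        for (int i = 0; i < maxTries; i++) {
            IReplicationController rc = getLatestReplicationController(client, namespace, rcName);
            if (rc != null) {
                return rc;
            }
            Thread.sleep(sleepMillis);
        }
        return null; // still missing; the build step can then report a failure
    }
}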

Comment 1 Gabe Montero 2016-05-04 19:59:52 UTC
Fix pushed to openshift/jenkins-plugin with commit https://github.com/openshift/jenkins-plugin/commit/1c7c55083cdafe9db2f3074e4ef707186545b06b

A Jenkins image update will probably start within the week anyway because of future work.

Comment 2 Gabe Montero 2016-05-13 20:43:31 UTC
RHEL Jenkins images with v1.0.10 of the plugin, which has the fix for this bug, are now available on brew-pulp.

Comment 4 wewang 2016-05-25 02:40:14 UTC
@Gabe Montero, could you add steps on how to reproduce this? I am a little confused about how to verify the bug. Thanks.

Comment 5 Gabe Montero 2016-05-25 13:55:42 UTC
@Wen Wang - unfortunately, this is a tricky timing window that will be hard to reproduce.

When I found this and then locally verified my fix, heavy usage during the Online Hackathon was delaying the actual creation of ReplicationControllers after a deployment was initiated, which is what tripped up the plugin's "Verify OpenShift Deployment" step.

If you can somehow replicate that delay and run "Verify OpenShift Deployment" with "Allow for verbose logging during this build step plug-in" turned on, you should see an exception like the one in the description, but retries should occur, and if the deployment is ultimately successful, the "Verify OpenShift Deployment" step will ultimately report success.

My best guess at artificially manufacturing this delay is to define a DeploymentConfig with a pre lifecycle hook that sleeps for, say, 60 seconds.

It is not my area of expertise, but if I'm reading the code right, that could mimic this delay.

Then create a Jenkins job that deploys this DeploymentConfig and then attempts to verify the deployment, with verbose logging enabled, so you can see that the exception occurs initially but the ReplicationController is ultimately created and the verify succeeds.
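For illustration, the hook I have in mind would look roughly like this in the DeploymentConfig's strategy (an untested suggestion only; the strategy type and containerName below are placeholders and should match whatever the frontend-prod config actually uses):

  strategy:
    type: Rolling
    rollingParams:
      pre:
        failurePolicy: Abort
        execNewPod:
          # hypothetical container name - use the real container from the DC
          containerName: nodejs-helloworld
          # sleep long enough for the verify step to race ahead of RC creation
          command: ["sleep", "60"]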

Comment 6 wewang 2016-05-26 03:39:40 UTC
Created attachment 1161707 [details]
config verify openshift deployment

Comment 7 wewang 2016-05-26 03:43:14 UTC
Tested with openshift3/jenkins-1-rhel7 8fe7d109f5bd
My env is:
 [root@dhcp-128-91 build]# oc get dc
NAME            REVISION   REPLICAS   TRIGGERED BY
frontend        1          1          config,image(origin-nodejs-sample:latest)
frontend-prod   0          1          config,image(origin-nodejs-sample:prod)
jenkins         1          1          config,image(jenkins:latest)
[root@dhcp-128-91 build]# oc get rc
NAME         DESIRED   CURRENT   AGE
frontend-1   1         1         6m
jenkins-1    1         1         12m
[root@dhcp-128-91 build]# oc get pods
NAME               READY     STATUS      RESTARTS   AGE
frontend-1-build   0/1       Completed   0          9m
frontend-1-nts1f   1/1       Running     0          6m
jenkins-1-yxomb    1/1       Running     0          12m

When I configure "Verify OpenShift Deployment" as in the attachment and build the job, it fails with this error: http://pastebin.test.redhat.com/377590

And one question: which command is this build step equivalent to in the background?

Comment 8 wewang 2016-05-26 05:39:10 UTC
Created attachment 1161725 [details]
verify Openshift build UI

Comment 9 wewang 2016-05-26 05:41:26 UTC
And when I configure "Verify OpenShift Deployment" as in attachment 1161725 [details], the build succeeds, but the dc, rc, and pods show no change. Please see the console output: http://pastebin.test.redhat.com/377596

Comment 10 Gabe Montero 2016-05-26 15:02:27 UTC
The plugin reacted correctly in my opinion.  You never started a frontend-prod deployment.  The `oc get rc` and `oc get pods` output confirms that.

I could see making the message a bit clearer when a deployment is not available, but I don't think we should gate this bugzilla's verification on that.

Add a "Tag OpenShift Image" step prior to the "Verify OpenShift Deployment", were you tag origin-nodejs-sample:latest to origin-nodejs-sample:prod.

Also, based on the screenshots you posted, unless you scale the deployment out to 3 replicas before running the verify, that failure will be reported.

Lastly, I saw no indication that you attempted the sabotage I articulated in #Comment 5.  If that is too involved, and you simply want to do some regression testing to make sure I did not break the typical mainline path, I am OK with that.  Just wanted to confirm that is what you were thinking.

Comment 11 wewang 2016-05-27 06:12:11 UTC
@Gabe Montero, I verified the following:
  
 1. Configured "Tag OpenShift Image" to set a new tag, prod, on origin-nodejs-sample:
 # oc get is
NAME                   DOCKER REPO                                       TAGS          UPDATED
nodejs-010-rhel7       172.30.153.230:5000/wewang/nodejs-010-rhel7                     
origin-nodejs-sample   172.30.153.230:5000/wewang/origin-nodejs-sample   prod,latest   19 seconds ago
2. So there is now an rc named frontend-prod-1:
   # oc get rc
NAME              DESIRED   CURRENT   AGE
frontend-1        1         1         2h
frontend-prod-1   1         1         2m
jenkins-1         1         1         2h
3. Configured "Verify OpenShift Deployment" and built the job; the build completed with:
   "Verify OpenShift Deployment" successfully; deployment "frontend-prod-1" has completed with status:  [Complete].
   Console output: http://pastebin.test.redhat.com/378066

4. Also checked using the template below (with the timeout changed to 60):
   $ oc new-app -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/image/language-image-templates/python-34-rhel7-stibuild.json
   Then configured "Verify OpenShift Deployment" and built the job; the build completed.

Is there anything else I should verify? If not, I will change the status to "verified". I am not sure I fully understand comment 5, so I will wait for your reply on how to deal with the bug.

Comment 13 Gabe Montero 2016-06-02 14:46:14 UTC
Yeah, at this point, let's not worry about Comment #5.  I wasn't 100% sure it was a viable sabotage anyway.  And as I said before, I was able to try this change in the unstable online env the day I found this.

Go ahead and mark this verified.

Comment 15 errata-xmlrpc 2016-06-06 19:06:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:1206