|Summary:||Jobs from Jenkins are not stopped when jenkins build pod is killed|
|Product:||OpenShift Container Platform||Reporter:||Nicolas Nosenzo <nnosenzo>|
|Component:||Build||Assignee:||Gabe Montero <gmontero>|
|Status:||CLOSED ERRATA||QA Contact:||Wenjing Zheng <wzheng>|
|Version:||3.4.1||CC:||aivaras.laimikis, aos-bugs, bparees, jupierce, smunilla, wzheng, xiuwang|
|Fixed In Version:||Doc Type:||Bug Fix|
Cause: build delete watch events, and current Jenkins job being canceled when a build was canceled in OpenShift, were not handled Consequence: Various negative, inconsistent Jenkins job results occurred along with a slew is exception stack traces in the Jenkins system log Fix: Jenkins jobs are halted as soon as the build watch event detects that a build was deleted as the result of a build cancel action taken within OpenShift. Result: Consistent, sensible behavior for the Jenkins users when builds are canceled/deleted.
|Last Closed:||2017-11-28 22:04:10 UTC||Type:||Bug|
|oVirt Team:||---||RHEL 7.3 requirements from Atomic Host:|
|Cloudforms Team:||---||Target Upstream Version:|
Description Nicolas Nosenzo 2017-07-20 13:47:32 UTC
Description of problem: If Jenkins pod is killed and recreated then you can't start again pipeline and we had to remove jobs from Jenkins to make it work again. Log from Jenkins: Jul 20, 2017 7:23:04 AM io.fabric8.jenkins.openshiftsync.BuildConfigWatcher onInitialBuildConfigs SEVERE: Failed to update job java.lang.IllegalArgumentException: aila-myfirstpipeline already exists at hudson.model.Items.verifyItemDoesNotAlreadyExist(Items.java:462) at hudson.model.ItemGroupMixIn.createProjectFromXML(ItemGroupMixIn.java:263) at jenkins.model.Jenkins.createProjectFromXML(Jenkins.java:3781) at io.fabric8.jenkins.openshiftsync.BuildConfigWatcher$2.call(BuildConfigWatcher.java:214) at io.fabric8.jenkins.openshiftsync.BuildConfigWatcher$2.call(BuildConfigWatcher.java:159) at hudson.security.ACL.impersonate(ACL.java:221) at io.fabric8.jenkins.openshiftsync.BuildConfigWatcher.upsertJob(BuildConfigWatcher.java:159) at io.fabric8.jenkins.openshiftsync.BuildConfigWatcher.onInitialBuildConfigs(BuildConfigWatcher.java:129) at io.fabric8.jenkins.openshiftsync.BuildConfigWatcher.access$200(BuildConfigWatcher.java:65) at io.fabric8.jenkins.openshiftsync.BuildConfigWatcher$1.doRun(BuildConfigWatcher.java:88) at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:50) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) Version-Release number of selected component (if applicable): OCP 3.4 jenkins-plugin-openshift-sync-0.1.7-1.el7.x86_64 Jenkins image: https://access.redhat.com/containers/#/registry.access.redhat.com/openshift3/jenkins-2-rhel7/images/2.32-8 How reproducible: 100% Steps to Reproduce: 1. Start a Jenkins build 2. Kill the build pod 3. Re-start the pipeline Actual results: io.fabric8.jenkins.openshiftsync.BuildConfigWatcher onInitialBuildConfigs SEVERE: Failed to update job java.lang.IllegalArgumentException: aila-myfirstpipeline already exists Expected results: Pipeline execute without issues Additional info:
Comment 1 Nicolas Nosenzo 2017-07-20 13:48:18 UTC
Seem there is a PR for this already: https://github.com/openshift/jenkins-sync-plugin/issues/104
Comment 2 Ben Parees 2017-07-20 20:32:51 UTC
Since that issue was resolved, you can try updating to the new sync plugin (0.1.23 is the latest i believe) which would contain the fix. We'll have a new image out w/ that version of the plugin as part of 3.6.
Comment 3 Gabe Montero 2017-07-21 14:29:23 UTC
updated target release and moving to modified
Comment 4 Gabe Montero 2017-07-24 11:51:13 UTC
the actual PR for this was https://github.com/openshift/jenkins-sync-plugin/pull/144 However, Ben and I decided to hold this out of the 3.6 openshift Jenkins images. Ben - are we going to build a 3.6.1 image ? Pending that answer, it won't be in a *RHEL* openshift jenkins image until 3.7.
Comment 19 XiuJuan Wang 2017-09-14 09:14:03 UTC
This issue doesn't fix completely. Create jenkins pod with persistent volume. Recreate jenkins pod when pipeline build is runing. Then the broken build will be stuck in ‘Resuming build at Thu Sep 14 09:09:12 UTC 2017 after Jenkins restart’ step. There is no error as comment #0 when start a new pipeline build which will keep new status. sync plugin version is 0.1.26 pipeline plgin version is 1.0.50
Comment 20 Gabe Montero 2017-09-14 11:34:47 UTC
Hey XiuJuan I'll certainly try to recreate wit ha PV (I did not use that in my testing), but at first blush I'm not sure why that would make a difference. Just in case it is a timing issue, can you please provide the jenkins pod logs, the job run console logs, and the output from `oc get build <build name> -o json`. thanks
Comment 21 Gabe Montero 2017-09-14 20:03:20 UTC
Ok, yeah, I saw mixed results when running this a few times. Bottom line, Jenkins has gaps/limitations on its restart / resume capability. I got some successful restarts. Other times, based on timing and when *exactly* I recreated the jenkins deployment, it would fail with exceptions that the sync plugin cannot do anything to work around. For example: Caught: java.lang.Exception: Resume after a restart not supported for non-blocking synchronous steps [Pipeline] End of Pipeline java.lang.Exception: Resume after a restart not supported for non-blocking synchronous steps at org.jenkinsci.plugins.workflow.steps.AbstractSynchronousNonBlockingStepExecution.onResume(AbstractSynchronousNonBlockingStepExecution.java:70) at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ItemListenerImpl$1.onSuccess(FlowExecutionList.java:185) at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ItemListenerImpl$1.onSuccess(FlowExecutionList.java:180) at com.google.common.util.concurrent.Futures$6.run(Futures.java:975) at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:253) My steps: 1) provision jenkins with persistent template 2) oc new-app https://raw.githubusercontent.com/openshift/origin/master/examples/jenkins/pipeline/samplepipeline.yaml 3) oc start-build sample-pipeline let the build get to running state 4) oc rollout latest dc/jenkins 5) jenkins redeploys ... the build seemed to stay in running state 6) when jenkins comes back up, I see its failure recovery make some noise: Sep 14, 2017 7:42:37 PM jenkins.model.RunIdMigrator migrate INFO: Migrating build records in /var/lib/jenkins/jobs/myproject-sample-pipeline/builds 7) then I see the job restarted, including the slave launch Sep 14, 2017 7:44:17 PM org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud provision INFO: Excess workload after pending Spot instances: 1 Sep 14, 2017 7:44:17 PM hudson.slaves.NodeProvisioner$StandardStrategyImpl apply INFO: Started provisioning Kubernetes Pod Template from openshift with 1 executors. Remaining excess workload: 0 Sep 14, 2017 7:44:18 PM org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud$ProvisioningCallback call INFO: Created Pod: nodejs-6d8537f5d56 Sep 14, 2017 7:44:18 PM org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud$ProvisioningCallback call INFO: Waiting for Pod to be scheduled (0/100): nodejs-6d8537f5d56 Sep 14, 2017 7:44:21 PM hudson.TcpSlaveAgentListener$ConnectionHandler run INFO: Accepted JNLP4-connect connection #1 from /172.17.0.4:51586 Sep 14, 2017 7:44:24 PM org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud$ProvisioningCallback call INFO: Waiting for slave to connect (0/100): nodejs-6d8537f5d56 8) and the job /build finishes Sep 14, 2017 7:46:07 PM org.jenkinsci.plugins.workflow.job.WorkflowRun finish INFO: myproject-sample-pipeline #1 completed: SUCCESS Sep 14, 2017 7:46:08 PM io.fabric8.jenkins.openshiftsync.BuildSyncRunListener onCompleted INFO: onCompleted job/myproject-sample-pipeline/1/ Sep 14, 2017 7:46:08 PM io.fabric8.jenkins.openshiftsync.BuildSyncRunListener onFinalized INFO: onFinalized job/myproject-sample-pipeline/1/ sample-pipeline-1 JenkinsPipeline Complete 11 minutes ago So, @XiuJuan, please provide the data I asked for in Comment #20, as well as the specifics of each step like I noted above. Based on the data you get, as long as with my attempts to reproduce using your recipe, we'll see if there are some recovery scenarios the sync plugin can assist, or if we have to simply cite this as a "jenkins shortcoming". We'll also need to decide what if any warnings we put in our docs / readme's for this space.
Comment 22 XiuJuan Wang 2017-09-15 05:21:07 UTC
@Gabe,My steps are exactly same with you. 1.My first build went to cancelled finally after redeploy jenkins pod first time.And other new status pipeline builds went to completed. Unfortunately,I don't save jenkins log before redeploy it again :( Here is pipeline build log from webconsole Resuming build at Thu Sep 14 08:56:49 UTC 2017 after Jenkins restart Resuming build at Thu Sep 14 09:09:12 UTC 2017 after Jenkins restart [Pipeline] // timeout [Pipeline] echo in catch block [Pipeline] echo Caught: org.jenkinsci.plugins.workflow.steps.FlowInterruptedException [Pipeline] End of Pipeline Timeout has been exceeded Finished: ABORTED $oc get builds sample-pipeline-1 -o json -n xiu1 http://pastebin.test.redhat.com/516371 2.But today all pipeline builds go to failed in few mins after redeploy jenkins pod. Here is jenkins pod logs and pipeline build log http://pastebin.test.redhat.com/516374
Comment 23 Gabe Montero 2017-09-15 15:55:23 UTC
Thanks XiuJuan. Your second pastebin showed you saw the same basic core issue I did when re-deploy of jenkins did not allow the build to continue: java.lang.Exception: Resume after a restart not supported for non-blocking synchronous steps Again, this is a limitation wrt Jenkins. Also, I should note that the re-deploy of jenkins is different than say the canceling of the build, which was the crux of the exception in the bugzilla description and what the fix addressed. If you were able to do that and succeed, I think we should mark this bug verified. And if need be we create a separate documentation bug to clarify the behavior, and cite the limitations of jenkins, if you restart jenkins during a pipeline strategy build. Sending back to ON_QA to see if you agree. Also, Ben (on cc:), certainly let us know if your disagree with my rationale here. thanks
Comment 24 XiuJuan Wang 2017-09-18 08:18:58 UTC
Agree. The failed build has complained the failed reason, I am ok to verified this bug.
Comment 27 errata-xmlrpc 2017-11-28 22:04:10 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188