Bug 1473329 - Jobs from Jenkins are not stopped when jenkins build pod is killed
Summary: Jobs from Jenkins are not stopped when jenkins build pod is killed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Build
Version: 3.4.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.7.0
Assignee: Gabe Montero
QA Contact: Wenjing Zheng
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-07-20 13:47 UTC by Nicolas Nosenzo
Modified: 2021-09-09 12:27 UTC
7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Build delete watch events, and the cancellation of the current Jenkins job when a build was canceled in OpenShift, were not handled.
Consequence: Various negative, inconsistent Jenkins job results occurred, along with a slew of exception stack traces in the Jenkins system log.
Fix: Jenkins jobs are halted as soon as the build watch event detects that a build was deleted as the result of a build cancel action taken within OpenShift.
Result: Consistent, sensible behavior for Jenkins users when builds are canceled/deleted.
Clone Of:
Environment:
Last Closed: 2017-11-28 22:04:10 UTC
Target Upstream Version:
Embargoed:
xiuwang: needinfo-


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:3188 0 normal SHIPPED_LIVE Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update 2017-11-29 02:34:54 UTC

Description Nicolas Nosenzo 2017-07-20 13:47:32 UTC
Description of problem:
If the Jenkins pod is killed and recreated, the pipeline cannot be started again; we had to remove the jobs from Jenkins to make it work. Log from Jenkins:

Jul 20, 2017 7:23:04 AM io.fabric8.jenkins.openshiftsync.BuildConfigWatcher onInitialBuildConfigs
SEVERE: Failed to update job
java.lang.IllegalArgumentException: aila-myfirstpipeline already exists
        at hudson.model.Items.verifyItemDoesNotAlreadyExist(Items.java:462)
        at hudson.model.ItemGroupMixIn.createProjectFromXML(ItemGroupMixIn.java:263)
        at jenkins.model.Jenkins.createProjectFromXML(Jenkins.java:3781)
        at io.fabric8.jenkins.openshiftsync.BuildConfigWatcher$2.call(BuildConfigWatcher.java:214)
        at io.fabric8.jenkins.openshiftsync.BuildConfigWatcher$2.call(BuildConfigWatcher.java:159)
        at hudson.security.ACL.impersonate(ACL.java:221)
        at io.fabric8.jenkins.openshiftsync.BuildConfigWatcher.upsertJob(BuildConfigWatcher.java:159)
        at io.fabric8.jenkins.openshiftsync.BuildConfigWatcher.onInitialBuildConfigs(BuildConfigWatcher.java:129)
        at io.fabric8.jenkins.openshiftsync.BuildConfigWatcher.access$200(BuildConfigWatcher.java:65)
        at io.fabric8.jenkins.openshiftsync.BuildConfigWatcher$1.doRun(BuildConfigWatcher.java:88)
        at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:50)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
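The IllegalArgumentException comes from Jenkins refusing to create an item whose name already exists; the eventual sync-plugin fix was, in effect, to update an existing job rather than unconditionally calling createProjectFromXML. A minimal sketch of that update-or-create guard, using a plain map as a hypothetical stand-in for the Jenkins item registry (none of these names are the plugin's actual API):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a map stands in for the Jenkins item registry so the
// update-or-create control flow can be shown in isolation.
class UpsertSketch {
    private final Map<String, String> jobs = new HashMap<>(); // job name -> config XML

    // Update the job if it exists; create it otherwise. An unconditional
    // create is what triggers "<name> already exists" in Jenkins.
    String upsertJob(String name, String configXml) {
        boolean existed = jobs.containsKey(name);
        jobs.put(name, configXml);
        return existed ? "updated" : "created";
    }

    public static void main(String[] args) {
        UpsertSketch sync = new UpsertSketch();
        System.out.println(sync.upsertJob("aila-myfirstpipeline", "<flow/>")); // prints "created"
        System.out.println(sync.upsertJob("aila-myfirstpipeline", "<flow/>")); // prints "updated"
    }
}
```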


Version-Release number of selected component (if applicable):
OCP 3.4
jenkins-plugin-openshift-sync-0.1.7-1.el7.x86_64
Jenkins image: 
https://access.redhat.com/containers/#/registry.access.redhat.com/openshift3/jenkins-2-rhel7/images/2.32-8

How reproducible:
100%

Steps to Reproduce:
1. Start a Jenkins build
2. Kill the build pod
3. Re-start the pipeline

Actual results:

io.fabric8.jenkins.openshiftsync.BuildConfigWatcher onInitialBuildConfigs
SEVERE: Failed to update job
java.lang.IllegalArgumentException: aila-myfirstpipeline already exists

Expected results:

Pipeline executes without issues.

Additional info:

Comment 1 Nicolas Nosenzo 2017-07-20 13:48:18 UTC
It seems there is a PR for this already:

https://github.com/openshift/jenkins-sync-plugin/issues/104

Comment 2 Ben Parees 2017-07-20 20:32:51 UTC
Since that issue was resolved, you can try updating to the new sync plugin (0.1.23 is the latest, I believe), which contains the fix. We'll have a new image out with that version of the plugin as part of 3.6.

Comment 3 Gabe Montero 2017-07-21 14:29:23 UTC
updated target release and moving to modified

Comment 4 Gabe Montero 2017-07-24 11:51:13 UTC
the actual PR for this was https://github.com/openshift/jenkins-sync-plugin/pull/144

However, Ben and I decided to hold this out of the 3.6 openshift Jenkins images.

Ben - are we going to build a 3.6.1 image?

Pending that answer, it won't be in a *RHEL* openshift jenkins image until 3.7.

Comment 19 XiuJuan Wang 2017-09-14 09:14:03 UTC
This issue isn't completely fixed.
Created a Jenkins pod with a persistent volume.

Recreated the Jenkins pod while a pipeline build was running. The broken build then gets stuck at the ‘Resuming build at Thu Sep 14 09:09:12 UTC 2017 after Jenkins restart’ step. There is no error like the one in comment #0 when starting a new pipeline build, which keeps its New status.

sync plugin version is 0.1.26
pipeline plugin version is 1.0.50

Comment 20 Gabe Montero 2017-09-14 11:34:47 UTC
Hey XiuJuan

I'll certainly try to recreate with a PV (I did not use one in my testing), but at first blush I'm not sure why that would make a difference.

Just in case it is a timing issue, can you please provide the jenkins pod logs, the job run console logs, and the output from `oc get build <build name> -o json`.

thanks

Comment 21 Gabe Montero 2017-09-14 20:03:20 UTC
Ok, yeah, I saw mixed results when running this a few times. Bottom line, Jenkins has gaps/limitations in its restart/resume capability.

I got some successful restarts. Other times, based on timing and on when *exactly* I recreated the jenkins deployment, it would fail with exceptions that the sync plugin cannot do anything to work around. For example:

Caught: java.lang.Exception: Resume after a restart not supported for non-blocking synchronous steps
[Pipeline] End of Pipeline
java.lang.Exception: Resume after a restart not supported for non-blocking synchronous steps
	at org.jenkinsci.plugins.workflow.steps.AbstractSynchronousNonBlockingStepExecution.onResume(AbstractSynchronousNonBlockingStepExecution.java:70)
	at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ItemListenerImpl$1.onSuccess(FlowExecutionList.java:185)
	at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ItemListenerImpl$1.onSuccess(FlowExecutionList.java:180)
	at com.google.common.util.concurrent.Futures$6.run(Futures.java:975)
	at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:253)
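For context on the exception above: Jenkins Pipeline steps that run synchronously keep no persisted program state, so after a controller restart their executions can only fail on resume, whereas durable steps can re-attach to saved state. A toy illustration of that distinction, using simplified stand-ins rather than the actual workflow-step-api classes:

```java
// Toy illustration of the Jenkins restart limitation hit above. These are
// simplified stand-ins, not the actual workflow-step-api classes.
abstract class StepExecution {
    abstract void onResume() throws Exception;
}

// Mirrors AbstractSynchronousNonBlockingStepExecution: a synchronous step
// keeps no persisted state, so after a restart it can only fail on resume.
class SynchronousStep extends StepExecution {
    @Override
    void onResume() throws Exception {
        throw new Exception("Resume after a restart not supported for non-blocking synchronous steps");
    }
}

// A durable step, by contrast, re-attaches to state saved on disk.
class DurableStep extends StepExecution {
    @Override
    void onResume() { /* reload saved state and continue */ }
}

class ResumeSketch {
    public static void main(String[] args) {
        for (StepExecution step : new StepExecution[] { new DurableStep(), new SynchronousStep() }) {
            try {
                step.onResume();
                System.out.println(step.getClass().getSimpleName() + ": resumed");
            } catch (Exception e) {
                System.out.println(step.getClass().getSimpleName() + ": " + e.getMessage());
            }
        }
    }
}
```

The sync plugin sits outside this mechanism, which is why it cannot work around the failure.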


My steps:

1) provision jenkins with persistent template

2) oc new-app https://raw.githubusercontent.com/openshift/origin/master/examples/jenkins/pipeline/samplepipeline.yaml

3) oc start-build sample-pipeline

   let the build get to running state

4) oc rollout latest dc/jenkins

5)   jenkins redeploys ... the build seemed to stay in running state 

6) when jenkins comes back up, I see its failure recovery make some noise:

Sep 14, 2017 7:42:37 PM jenkins.model.RunIdMigrator migrate
INFO: Migrating build records in /var/lib/jenkins/jobs/myproject-sample-pipeline/builds


7) then I see the job restarted, including the slave launch

Sep 14, 2017 7:44:17 PM org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud provision
INFO: Excess workload after pending Spot instances: 1
Sep 14, 2017 7:44:17 PM hudson.slaves.NodeProvisioner$StandardStrategyImpl apply
INFO: Started provisioning Kubernetes Pod Template from openshift with 1 executors. Remaining excess workload: 0
Sep 14, 2017 7:44:18 PM org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud$ProvisioningCallback call
INFO: Created Pod: nodejs-6d8537f5d56
Sep 14, 2017 7:44:18 PM org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud$ProvisioningCallback call
INFO: Waiting for Pod to be scheduled (0/100): nodejs-6d8537f5d56
Sep 14, 2017 7:44:21 PM hudson.TcpSlaveAgentListener$ConnectionHandler run
INFO: Accepted JNLP4-connect connection #1 from /172.17.0.4:51586
Sep 14, 2017 7:44:24 PM org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud$ProvisioningCallback call
INFO: Waiting for slave to connect (0/100): nodejs-6d8537f5d56


8) and the job /build finishes

Sep 14, 2017 7:46:07 PM org.jenkinsci.plugins.workflow.job.WorkflowRun finish
INFO: myproject-sample-pipeline #1 completed: SUCCESS
Sep 14, 2017 7:46:08 PM io.fabric8.jenkins.openshiftsync.BuildSyncRunListener onCompleted
INFO: onCompleted job/myproject-sample-pipeline/1/
Sep 14, 2017 7:46:08 PM io.fabric8.jenkins.openshiftsync.BuildSyncRunListener onFinalized
INFO: onFinalized job/myproject-sample-pipeline/1/


sample-pipeline-1          JenkinsPipeline                 Complete   11 minutes ago   


So, @XiuJuan, please provide the data I asked for in Comment #20, as well as the specifics of each step as I noted above. Based on the data you get, along with my attempts to reproduce using your recipe, we'll see if there are some recovery scenarios the sync plugin can assist with, or if we simply have to cite this as a "jenkins shortcoming".

We'll also need to decide what warnings, if any, we put in our docs/READMEs for this space.

Comment 22 XiuJuan Wang 2017-09-15 05:21:07 UTC
@Gabe, my steps are exactly the same as yours.

1. My first build finally went to Cancelled after redeploying the Jenkins pod the first time, and other new pipeline builds went to Complete. Unfortunately, I didn't save the Jenkins log before redeploying it again :(

Here is the pipeline build log from the web console:

Resuming build at Thu Sep 14 08:56:49 UTC 2017 after Jenkins restart
Resuming build at Thu Sep 14 09:09:12 UTC 2017 after Jenkins restart
[Pipeline] // timeout
[Pipeline] echo
in catch block
[Pipeline] echo
Caught: org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
[Pipeline] End of Pipeline
Timeout has been exceeded
Finished: ABORTED

$oc get builds sample-pipeline-1  -o json  -n xiu1
http://pastebin.test.redhat.com/516371

2. But today all pipeline builds went to Failed within a few minutes after redeploying the Jenkins pod.

Here are the jenkins pod logs and the pipeline build log:
http://pastebin.test.redhat.com/516374

Comment 23 Gabe Montero 2017-09-15 15:55:23 UTC
Thanks XiuJuan. Your second pastebin shows you hit the same basic core issue I did when the re-deploy of jenkins did not allow the build to continue:

java.lang.Exception: Resume after a restart not supported for non-blocking synchronous steps

Again, this is a limitation wrt Jenkins.

Also, I should note that re-deploying jenkins is different from, say, canceling the build, which was the crux of the exception in the bugzilla description and what the fix addressed.
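The behavior of that fix, stopping the Jenkins run as soon as the build watch reports a delete, can be sketched roughly as below. The event and run types are simplified stand-ins for the fabric8 watch callback and the Jenkins Run API; all names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: the event and run types below are simplified
// stand-ins for the fabric8 build watch callback and the Jenkins Run API.
class BuildWatchSketch {
    enum Action { ADDED, MODIFIED, DELETED }

    static class Run {
        boolean building = true;
        void abort() { building = false; } // stand-in for halting the Jenkins job
    }

    private final Map<String, Run> runsByBuild = new HashMap<>();

    void track(String buildName, Run run) { runsByBuild.put(buildName, run); }

    // Watch callback: halt the Jenkins run as soon as its build is deleted
    // (which is what an OpenShift-side cancel produces).
    void eventReceived(Action action, String buildName) {
        if (action == Action.DELETED) {
            Run run = runsByBuild.remove(buildName);
            if (run != null && run.building) {
                run.abort();
            }
        }
    }

    public static void main(String[] args) {
        BuildWatchSketch watcher = new BuildWatchSketch();
        Run run = new Run();
        watcher.track("sample-pipeline-1", run);
        watcher.eventReceived(Action.DELETED, "sample-pipeline-1");
        System.out.println("building=" + run.building); // prints "building=false"
    }
}
```

A jenkins re-deploy never produces such a delete event for the build, which is why it exercises the separate restart/resume path instead.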

If you were able to do that and it succeeded, I think we should mark this bug verified.
And if need be, we can create a separate documentation bug to clarify the behavior and cite the limitations of jenkins if you restart it during a pipeline strategy build.

Sending back to ON_QA to see if you agree.

Also, Ben (on cc:), certainly let us know if you disagree with my rationale here.

thanks

Comment 24 XiuJuan Wang 2017-09-18 08:18:58 UTC
Agree.
The failed build reports the reason for the failure, so I am OK with marking this bug verified.

Comment 27 errata-xmlrpc 2017-11-28 22:04:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188

