Bug 1293859
| Summary: | Failed to trigger jobs more than once in the Jenkins web console | | |
|---|---|---|---|
| Product: | OKD | Reporter: | XiuJuan Wang <xiuwang> |
| Component: | Deployments | Assignee: | Dan Mace <dmace> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | zhou ying <yinzhou> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.x | CC: | akokshar, aos-bugs, erich, gmontero, jkaur, misalunk, mmccomas, pep, xiuwang, yinzhou |
| Target Milestone: | --- | Flags: | gmontero: needinfo-, xiuwang: needinfo- |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | openshift3/jenkins-1-rhel7:1.642-30 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-08-18 11:32:15 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1339502 | | |
Description (XiuJuan Wang, 2015-12-23 09:53:48 UTC)

---
Gabe, I know you dug into this previously with the Platform Management team. Does this need to go to them, or is there something needed on our side?

---

Nothing needed on our side. I think it is a question of the tester updating the OpenShift images he's running against (the deployer image in particular, I believe) to get the fix from the Platform Management team. In the last couple of weeks, with the most up-to-date images, I no longer see this, at least with CentOS. I don't know whether the Platform Management team's fixes for this maxSurge/maxUnavailable issue have made it into the RHEL images, so sending this to Platform Management to confirm which RHEL images got the fix seems like the logical next step. Looking at the components list here in Bugzilla, I'm not 100% sure which one is best ... "Deployments", perhaps? Apologies if I've sent this to the wrong component.

Could someone from the Platform Management team confirm which levels of the deployer image on RHEL have the fixes for the

    F1223 08:29:14.688773 1 deployer.go:65] one of maxSurge or maxUnavailable must be specified

error? At least for myself and the QA contact, these errors occur after the initial deployment. Some historical context: Dan Mace, myself, and others emailed about this on openshift-sme around the Oct 20 to Nov 2 timeframe (the subject was "maxSurge or maxUnavailable must be specified"), and I've seen the issue resolved on the CentOS images for the last couple of weeks. Ben Parees was able to reproduce with OpenShift images from around three weeks old, I believe. Thanks.

---

This is the same as https://github.com/openshift/origin/issues/5787. The bug occurs when the deployment's replica count is 0. Waiting for this upstream patch to be vetted before patching Origin: https://github.com/kubernetes/kubernetes/pull/19366

---

*** Bug 1300295 has been marked as a duplicate of this bug. ***

---

Tested in devenv-rhel7_3301: after scaling the dc to 0 and triggering a build, the new replica count in the rc/dc is still 0, so no pod is deployed. So the issue from my description (comment #0) still exists: the second job fails in the Jenkins web console, since Jenkins first scales the frontend dc to 0, then builds and checks whether the new rc is at 1.

---

The bug resolved by the referenced PRs corrects the error produced by the deployment:

    I1223 08:29:13.640629 1 deployer.go:198] Deploying from test/frontend-1 to test/frontend-2 (replicas: 0)
    I1223 08:29:14.675486 1 rolling.go:232] RollingUpdater: Continuing update with existing controller frontend-2.
    F1223 08:29:14.688773 1 deployer.go:65] one of maxSurge or maxUnavailable must be specified

If the error no longer occurs and the deployment succeeds, then the PRs fixed the intended bug. It's not clear to me why you're expecting a new rc with a scale of 1 when the scale prior to the deployment is 0. The underlying bug is that the deployment subsystem refused to deploy with a scale of 0, which is now fixed. Can you elaborate on your expectations?

---

To build on Dan Mace's comment: in the context of the OpenShift Sample Job in the Jenkins image (the use case I believe the test used to produce the maxSurge issue), if you look at the complete set of build steps, there is in fact an explicit scale to 1 *after* the build step. The build step does *not* automatically increase the replica count of the rc. If you previously scale down to 0 and then build, you have to explicitly scale back up to 1. I suspect this issue is in fact fixed (the second scale to 0 would previously fail), but let's wait on Xiujuan Wang to respond.
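As an aside for readers following along, the Sample Job steps described above map roughly to the `oc` sequence below. This is a sketch only: the dc name `frontend` and project `test` are taken from the deployer log in this thread, and the real job drives these steps through Jenkins rather than the CLI.

```sh
oc project test

# Scale the deployment down before building (on the first run no rc
# exists yet, so this is effectively a no-op).
oc scale dc/frontend --replicas=0

# Run the build; on completion a new deployment starts (typically via the
# dc's image-change trigger) while the replica count is still 0. Before
# the fix, the deployer pod died at this point with:
#   one of maxSurge or maxUnavailable must be specified
oc start-build frontend --follow

# The job scales back up explicitly afterwards -- the build/deploy does
# NOT restore the replica count on its own.
oc scale dc/frontend --replicas=1
```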
Ahh ... I just tried running the sample job twice, and I may be seeing what Xiujuan Wang is referring to. There is a "Check Deployment Success" step between the build step and the scale-to-1 step in the Sample OpenShift job (come to think of it, I may remember Ben and I discussing removing this). It succeeds the first time I run the job, but it fails the second time. I suspect we simply need to remove that step from the job, though it would be good to understand the differing behavior. Dan, I'll do a little more digging (maybe even see if I can reproduce the behavior using `oc`) and ping you on IRC.

---

OK, I confirmed what is going on and synced up with Dan. The net is that this bug should be verified, and a new bug should be opened against Image/Cartridge so I can update the sample job in the Jenkins image and remove that problematic deployment-verification step. The basic flow is:

- On the first run, no rc exists, so the initial scale to 0 does nothing.
- The build then runs, and the default rc/deploy behavior occurs.
- On the second run, the rc does in fact exist, and its replica count is set to 0.
- Then, after the build, the deployment verification before the scale to 1 is what is problematic.

Xiujuan Wang, would you like to open the Image/Cartridge defect so I can fix the sample job in the Jenkins image, or would you prefer I open it? Thanks.

---

Gabe, I appreciate your deep digging. I will open a new bug in the image component and verify this one. Dan, thanks for your hard work fixing this bug. :)

Marking this bug as verified.

---

@Gabe I can't reproduce the second-job failure from the deployment check in the Jenkins web console on devenv-rhel7-3315, with either jenkins-1-centos7 (066a52bb8fa4) or jenkins-1-rhel7 (e9e1ffba0334), so I will hold off on opening a new bug in the image component.

---

@XiuJuan I had some time yesterday and implemented the fix. :-) The two Jenkins images you tried have the fix (you'll notice there is no "Check Deployment Success" step between the build step and the scale-to-1 step). I double-checked them myself as well. If you'd still like to open a defect for historical/tracking purposes, go ahead, and you and I can move it through to verification. Otherwise, I think we are good to go. Thanks.

---

*** Bug 1307013 has been marked as a duplicate of this bug. ***

---

The deployer-related change from comment #7 was released with OpenShift Enterprise 3.2.0: https://access.redhat.com/errata/RHSA-2016:1064

The Jenkins image changes referenced in comment #15 were released in the openshift3/jenkins-1-rhel7:1.642-30 image.
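For anyone re-verifying the deployer fix outside of Jenkins, a minimal sketch follows. It assumes the same `test`/`frontend` DeploymentConfig as above and uses `oc deploy --latest`, the 3.x-era command for manually starting a deployment.

```sh
oc scale dc/frontend --replicas=0   # reproduce the pre-deployment state
oc deploy frontend --latest         # previously aborted with the maxSurge/
                                    #   maxUnavailable error; now succeeds
oc get rc                           # the new frontend-N rc keeps replicas=0,
                                    #   matching the scale before deployment
oc scale dc/frontend --replicas=1   # an explicit scale-up is still required
```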