Bug 1293859
| Summary: | Failed to trigger jobs more than once in the Jenkins web console | | |
|---|---|---|---|
| Product: | OKD | Reporter: | XiuJuan Wang <xiuwang> |
| Component: | Deployments | Assignee: | Dan Mace <dmace> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | zhou ying <yinzhou> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.x | CC: | akokshar, aos-bugs, erich, gmontero, jkaur, misalunk, mmccomas, pep, xiuwang, yinzhou |
| Target Milestone: | --- | Flags: | gmontero: needinfo-, xiuwang: needinfo- |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | openshift3/jenkins-1-rhel7:1.642-30 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-08-18 11:32:15 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1339502 | | |
Description (XiuJuan Wang, 2015-12-23 09:53:48 UTC)

---
Gabe, I know you dug into this previously with the Platform Management team. Does this need to go to them, or is there something needed on our side?

---

Nothing needed on our side. I think it is a question of the tester updating the OpenShift images he's running against (the deployer image in particular, I believe) to get the fix from the Platform Management team. In the last couple of weeks, with the most up-to-date images, I no longer see this, at least with CentOS. I don't know whether the Platform Management team's fixes for this maxSurge/maxUnavailable issue have made it into the RHEL images, so sending this to Platform Management to confirm which RHEL images got the fix seems like the logical next step. Looking at the components list here in Bugzilla, I'm not 100% sure which one is best ... "Deployments", perhaps? Apologies if I've sent this to the wrong component.

Could someone from the Platform Management team confirm which levels of the deployer image on RHEL have the fixes for the

    F1223 08:29:14.688773 1 deployer.go:65] one of maxSurge or maxUnavailable must be specified

error? At least for myself and the QA contact, these errors occur after the initial deployment. Some historical context: Dan Mace, myself, and others emailed about this on openshift-sme around the Oct 20 to Nov 2 timeframe (the subject was "maxSurge or maxUnavailable must be specified"), and I've seen the issue resolved on the CentOS images for the last couple of weeks. Ben Parees was able to reproduce with OpenShift images from around three weeks old, I believe. Thanks.

---

This is the same as https://github.com/openshift/origin/issues/5787. The bug occurs when the deployment's replica count is 0. Waiting for this upstream patch to be vetted before patching Origin: https://github.com/kubernetes/kubernetes/pull/19366

---

*** Bug 1300295 has been marked as a duplicate of this bug. ***

---

Tested in devenv-rhel7_3301: after scaling the dc to 0 and triggering a build, the new replica count in the rc/dc is still 0, so no pod is deployed. So the issue from my description (comment #0) still exists: the second job fails in the Jenkins web console, since Jenkins first scales the frontend dc to 0, then builds and checks whether the new rc is at 1.

---

The bug resolved by the referenced PRs corrects the error produced by the deployment:

    I1223 08:29:13.640629 1 deployer.go:198] Deploying from test/frontend-1 to test/frontend-2 (replicas: 0)
    I1223 08:29:14.675486 1 rolling.go:232] RollingUpdater: Continuing update with existing controller frontend-2.
    F1223 08:29:14.688773 1 deployer.go:65] one of maxSurge or maxUnavailable must be specified

If the error no longer occurs and the deployment succeeds, then the PRs fixed the intended bug. It's not clear to me why you're expecting a new rc with a scale of 1 when the scale prior to the deployment is 0. The underlying bug is that the deployment subsystem refused to deploy with a scale of 0, which is now fixed. Can you elaborate on your expectations?

---

To build on Dan Mace's comment: in the context of the OpenShift Sample Job in the Jenkins image (the use case I believe the test used to produce the maxSurge issue), if you look at the complete set of build steps, there is in fact an explicit scale to 1 *after* the build step. The build step does *not* automatically increase the replica count of the rc. If you previously scale down to 0 and then build, you have to explicitly scale back up to 1. I suspect this issue is in fact fixed (the second scale to 0 would previously fail), but let's wait on Xiujuan Wang to respond.
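As an aside for readers following along, the Sample Job steps described above map roughly to the `oc` sequence below. This is a sketch only: the dc name `frontend` and project `test` are taken from the deployer log in this thread, and the real job drives these steps through Jenkins rather than the CLI.

```sh
oc project test

# Scale the deployment down before building (on the first run no rc
# exists yet, so this is effectively a no-op).
oc scale dc/frontend --replicas=0

# Run the build; on completion a new deployment starts (typically via the
# dc's image-change trigger) while the replica count is still 0. Before
# the fix, the deployer pod died at this point with:
#   one of maxSurge or maxUnavailable must be specified
oc start-build frontend --follow

# The job scales back up explicitly afterwards -- the build/deploy does
# NOT restore the replica count on its own.
oc scale dc/frontend --replicas=1
```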
Ahh ... I just tried running the sample job twice, and I may be seeing what Xiujuan Wang is referring to. There is a "Check Deployment Success" step between the build step and the scale-to-1 step in the Sample OpenShift job (come to think of it, I may remember Ben and I discussing removing this). It succeeds the first time I run the job, but it fails the second time. I suspect we simply need to remove that step from the job, though it would be good to understand the differing behavior. Dan, I'll do a little more digging (maybe even see if I can reproduce the behavior using `oc`) and ping you on IRC.

---

OK, I confirmed what is going on and synced up with Dan. The net is that this bug should be verified, and a new bug should be opened against Image/Cartridge so I can update the sample job in the Jenkins image and remove that problematic deployment-verification step. The basic flow is:

- On the first run, no rc exists, so the initial scale to 0 does nothing.
- The build then runs, and the default rc/deploy behavior occurs.
- On the second run, the rc does in fact exist, and its replica count is set to 0.
- Then, after the build, the deployment verification before the scale to 1 is what is problematic.

Xiujuan Wang, would you like to open the Image/Cartridge defect so I can fix the sample job in the Jenkins image, or would you prefer I open it? Thanks.

---

Gabe, I appreciate your deep digging. I will open a new bug in the image component and verify this one. Dan, thanks for your hard work fixing this bug. :)

Marking this bug as verified.

---

@Gabe I can't reproduce the second-job failure from the deployment check in the Jenkins web console on devenv-rhel7-3315, with either jenkins-1-centos7 (066a52bb8fa4) or jenkins-1-rhel7 (e9e1ffba0334), so I will hold off on opening a new bug in the image component.

---

@XiuJuan I had some time yesterday and implemented the fix. :-) The two Jenkins images you tried have the fix (you'll notice there is no "Check Deployment Success" step between the build step and the scale-to-1 step). I double-checked them myself as well. If you'd still like to open a defect for historical/tracking purposes, go ahead, and you and I can move it through to verification. Otherwise, I think we are good to go. Thanks.

---

*** Bug 1307013 has been marked as a duplicate of this bug. ***

---

The deployer-related change from comment #7 was released with OpenShift Enterprise 3.2.0: https://access.redhat.com/errata/RHSA-2016:1064

The Jenkins image changes referenced in comment #15 were released in the openshift3/jenkins-1-rhel7:1.642-30 image.
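For anyone re-verifying the deployer fix outside of Jenkins, a minimal sketch follows. It assumes the same `test`/`frontend` DeploymentConfig as above and uses `oc deploy --latest`, the 3.x-era command for manually starting a deployment.

```sh
oc scale dc/frontend --replicas=0   # reproduce the pre-deployment state
oc deploy frontend --latest         # previously aborted with the maxSurge/
                                    #   maxUnavailable error; now succeeds
oc get rc                           # the new frontend-N rc keeps replicas=0,
                                    #   matching the scale before deployment
oc scale dc/frontend --replicas=1   # an explicit scale-up is still required
```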