| Summary: | Could not cancel the deployment successfully | | |
|---|---|---|---|
| Product: | OKD | Reporter: | Wei Sun <wsun> |
| Component: | Deployments | Assignee: | Dan Mace <dmace> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | zhou ying <yinzhou> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.x | CC: | abhgupta, aos-bugs, maszulik, mkargaki |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-05-12 17:11:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Wei Sun
2016-03-18 06:54:50 UTC
Most probably a race between the deployer pod controller (the component responsible for transitioning the deployment between phases) and `oc deploy --cancel`. It seems that the deployment is marked for cancellation (still running) while, at the same time, the deployer finishes successfully and transitions the deployment to Complete. Opened https://github.com/openshift/origin/pull/8163, which takes a stab at it.

We decided that since --cancel is a best-effort call, emitting an event in case of a failed cancel and cleaning up the cancel annotations is enough. Since this is a corner case and already fixed in https://github.com/openshift/origin/pull/8163, I am dropping the priority.

Commit pushed to master at https://github.com/openshift/origin: https://github.com/openshift/origin/commit/5cca4224cf3156aa7e3c13ab1535f599005358e0 (Bug 1318920: emit events for failed cancellations. Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1318920)

The issue only reproduced on Online:

```
[root@zhouy testjson-for-int]# oc get pods
NAME             READY     STATUS    RESTARTS   AGE
hooks-1-lenv2    1/1       Running   0          14m
hooks-2-deploy   0/1       Error     0          13m
hooks-3-deploy   1/1       Running   0          3m
[root@zhouy testjson-for-int]# oc deploy hooks --cancel
No deployments are in progress (latest deployment #3 running 3 minutes ago)
[root@zhouy testjson-for-int]# oc get pods
NAME             READY     STATUS    RESTARTS   AGE
hooks-1-lenv2    1/1       Running   0          14m
hooks-2-deploy   0/1       Error     0          13m
hooks-3-deploy   1/1       Running   0          3m
```

Zhou, what you hit is different from the reported issue. Reassigning to the Online team.

The issue described in https://bugzilla.redhat.com/show_bug.cgi?id=1318920#c5 is different from the issue this bug is tracking. I've opened https://bugzilla.redhat.com/show_bug.cgi?id=1323710 to track the newly discovered behavior. I'm putting this issue back ON_QA so it can be verified against origin, where it was originally reported. Let's keep the scope of testing limited to the reported bug and open new bugs as necessary.
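The fix described above (when the cancel loses the race against a completing deployer, clean up the cancel annotation and emit an event rather than failing silently) can be sketched in Go. This is an illustrative sketch, not origin's actual code: the `deployment` struct, `reconcileCancel`, and the event wording are assumptions, though `openshift.io/deployment.cancelled` is the annotation that `oc deploy --cancel` sets on the replication controller.

```go
package main

import "fmt"

// Annotation set on the deployment's replication controller by
// `oc deploy --cancel` to request cancellation.
const cancelAnnotation = "openshift.io/deployment.cancelled"

// deployment is a hypothetical stand-in for the replication controller
// plus the deployment phase it carries.
type deployment struct {
	phase       string // "Pending", "Running", "Complete", "Failed"
	annotations map[string]string
}

// reconcileCancel models the best-effort semantics: if the deployer
// already finished, the cancel request cannot be honoured, so the
// annotation is removed and an event message is returned for the user.
func reconcileCancel(d *deployment) (cancelled bool, event string) {
	if _, ok := d.annotations[cancelAnnotation]; !ok {
		return false, "" // nothing requested cancellation
	}
	if d.phase == "Complete" {
		// Lost the race: deployer succeeded before the cancel landed.
		delete(d.annotations, cancelAnnotation)
		return false, "FailedCancellation: deployment completed before it could be cancelled"
	}
	return true, "" // cancel can still proceed
}

func main() {
	d := &deployment{
		phase:       "Complete",
		annotations: map[string]string{cancelAnnotation: "true"},
	}
	cancelled, event := reconcileCancel(d)
	fmt.Println(cancelled, event)
}
```

The point of the sketch is that the only sensible recovery from a lost race is to undo the cancel mark and tell the user via an event, which is exactly what PR 8163 settled on.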
Hi Dan: The issue only reproduces on Online and is not fixed yet. When the cancel fails and you try to cancel again, you will see https://bugzilla.redhat.com/show_bug.cgi?id=1318920#c5, so in my opinion https://bugzilla.redhat.com/show_bug.cgi?id=1323710 is the same issue as this bug. Please see:

```
[root@zhouy roottest]# oc get pods
NAME             READY     STATUS    RESTARTS   AGE
hooks-1-lenv2    1/1       Running   0          4d
hooks-2-deploy   0/1       Error     0          4d
hooks-3-deploy   0/1       Error     0          4d
You have new mail in /var/spool/mail/root
[root@zhouy roottest]# oc deploy hooks --latest
Started deployment #4
[root@zhouy roottest]# oc get pods
NAME             READY     STATUS    RESTARTS   AGE
hooks-1-lenv2    1/1       Running   0          4d
hooks-2-deploy   0/1       Error     0          4d
hooks-3-deploy   0/1       Error     0          4d
hooks-4-deploy   1/1       Running   0          <invalid>
[root@zhouy roottest]# oc deploy hooks --cancel
Cancelled deployment #4
[root@zhouy roottest]# oc get pods
NAME             READY     STATUS    RESTARTS   AGE
hooks-1-lenv2    1/1       Running   0          4d
hooks-2-deploy   0/1       Error     0          4d
hooks-3-deploy   0/1       Error     0          4d
hooks-4-deploy   1/1       Running   0          <invalid>
[root@zhouy roottest]# oc deploy hooks --cancel
No deployments are in progress (latest deployment #4 running less than a second ago)
[root@zhouy roottest]# oc get pods
NAME             READY     STATUS    RESTARTS   AGE
hooks-1-lenv2    1/1       Running   0          4d
hooks-2-deploy   0/1       Error     0          4d
hooks-3-deploy   0/1       Error     0          4d
hooks-4-deploy   1/1       Running   0          <invalid>
```

No, it's not the same issue. `deploy --cancel` marks the deployment (replication controller) as cancelled; the deployment controller then picks it up and terminates the deployer pod. So it's impossible to observe the deployer pod terminating as soon as you --cancel, unless we moved the deployment controller functionality into oc, which was discussed and rejected (and even then we would only minimize the race window). What you are observing in https://bugzilla.redhat.com/show_bug.cgi?id=1323710 is that the post-cancellation message is wrong because we rely on the deployment phase, which may still be Pending or Running.
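The two-actor flow described in the reply above (the client only writes a cancel mark on the replication controller; a separate controller loop later notices it and terminates the deployer pod) can be sketched like this. All type and function names here are illustrative assumptions, not origin's real API; the gap between the two steps is the race window the bug is about.

```go
package main

import "fmt"

// rc is a hypothetical stand-in for the replication controller that
// backs a deployment, including the phase annotation it carries.
type rc struct {
	name        string
	annotations map[string]string
	phase       string // "Pending", "Running", "Complete", "Failed"
}

// cancel models what `oc deploy --cancel` does: it only marks the RC.
// No pod is touched here; this is why cancellation is best-effort.
func cancel(r *rc) {
	r.annotations["openshift.io/deployment.cancelled"] = "true"
}

// syncDeployment models the deployment controller's next resync: it
// sees the mark and only then terminates the deployer pod.
func syncDeployment(r *rc, killDeployerPod func(podName string)) {
	if r.annotations["openshift.io/deployment.cancelled"] == "true" && r.phase == "Running" {
		killDeployerPod(r.name + "-deploy")
		r.phase = "Failed" // a cancelled deployment ultimately lands in Failed
	}
}

func main() {
	r := &rc{name: "hooks-4", annotations: map[string]string{}, phase: "Running"}
	cancel(r)
	// Between cancel() and the controller's resync the deployer pod is
	// still running; if it completes in that window, the cancel fails.
	syncDeployment(r, func(podName string) { fmt.Println("deleting", podName) })
	fmt.Println(r.phase)
}
```

This separation is why `oc get pods` can still show the deployer pod `Running` immediately after a successful-looking `--cancel`: the client returned as soon as the mark was written, before the controller acted on it.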
The component responsible for transitioning the deployment phase is the deployer pod controller, not --cancel. Also keep in mind that --cancel is a best-effort call (noted in the docs at https://docs.openshift.org/latest/dev_guide/deployments.html#canceling-a-deployment). When a user tries to --cancel and the deployer pod has just succeeded (what this issue reports), we will just emit an event letting the user know that their --cancel failed.

After waiting a few minutes, the --cancel completed; will verify this bug.

```
[root@zhouy ~]# oc get pods
NAME             READY     STATUS    RESTARTS   AGE
hooks-1-lenv2    1/1       Running   1          4d
hooks-2-deploy   0/1       Error     0          4d
hooks-3-deploy   0/1       Error     0          4d
hooks-4-deploy   0/1       Error     0          23h
hooks-5-deploy   1/1       Running   0          5m
[root@zhouy ~]# oc get pods
NAME             READY     STATUS    RESTARTS   AGE
hooks-1-lenv2    1/1       Running   1          5d
hooks-2-deploy   0/1       Error     0          5d
hooks-3-deploy   0/1       Error     0          4d
hooks-4-deploy   0/1       Error     0          23h
hooks-5-deploy   0/1       Error     0          9m
```