Bug 1685322 - Builds delayed with multiple occurrences of the error "build_controller.go:1289] Giving up retrying <build name>: invalid phase transition <build name> (Running) -> Pending"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Build
Version: 3.10.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.10.z
Assignee: Gabe Montero
QA Contact: wewang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-03-05 00:22 UTC by Venkata Tadimarri
Modified: 2019-12-26 02:39 UTC (History)
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Slow progression through a build's init container steps could result in a build being marked "Running", after which the build controller attempted to mark the build back to "Pending".
Consequence: The build still eventually completes, but undue churn occurs in build controller processing and unneeded warning messages are presented to the user.
Fix: The build controller now better prevents these erroneous transitions and logs a more useful diagnostic event for edge or unexpected cases.
Result: Clean presentation to the user.
Clone Of:
Environment:
Last Closed: 2019-06-27 16:41:12 UTC
Target Upstream Version:
Embargoed:


Attachments
master-logs-controllers_02279152 (521.85 KB, text/plain), 2019-07-30 23:34 UTC, David
master-logs_api.txt (1.31 MB, text/plain), 2019-07-30 23:35 UTC, David
journalctl_atomic-openshift.txt (65.65 KB, text/plain), 2019-07-30 23:36 UTC, David
master-logs_controllers.tx (1.44 MB, text/plain), 2019-07-30 23:36 UTC, David
events.txt (20 bytes, text/plain), 2019-07-30 23:36 UTC, David


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:1607 0 None None None 2019-06-27 16:41:23 UTC

Description Venkata Tadimarri 2019-03-05 00:22:31 UTC
Description of problem:

All builds on the cluster are delayed by the error "build_controller.go:1289] Giving up retrying <build name>: invalid phase transition <build name> (Running) -> Pending". The error appears halfway through a build; the build does eventually complete, but later than expected.


Error log:

message:     I0218 18:06:31.384110       1 build_controller.go:1289] Giving up retrying <namespace/buildnumber>: invalid phase transition <namespace/buildnumber> (Running) -> Pending

I0218 18:44:12.971040       1 build_controller.go:1289] Giving up retrying customer-identities/customer-identities-49: invalid phase transition customer-identities/customer-identities-49 (Running) -> Pending

I0218 18:07:35.182839       1 build_controller.go:1289] Giving up retrying platform/test-jenkins-ee-s2i-414-49-1: invalid phase transition platform/test-jenkins-ee-s2i-414-49-1 (Running) -> Pending

I0218 15:07:28.570112       1 build_controller.go:1289] Giving up retrying <namespace>/python-s2i-3-3.5-test-123: invalid phase transition <namespace>/python-s2i-3-3.5-test-123 (Running) -> Pending

I0218 15:07:06.084165       1 build_controller.go:1289] Giving up retrying 765300/ngrxtest-79: invalid phase transition 765300/ngrxtest-79 (Running) -> Pending
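
To see which builds are affected, the build names can be pulled out of controller log lines like the ones above. The following is a minimal sketch, assuming the controller log has been saved to a file; the function name `stuck_builds` is hypothetical and not part of any OpenShift tooling.

```shell
# stuck_builds (hypothetical helper): read controller log lines on stdin and
# print each distinct namespace/build that logged the
# "(Running) -> Pending" invalid phase transition message.
stuck_builds() {
  sed -n 's/.*Giving up retrying \([^:]*\): invalid phase transition .* (Running) -> Pending.*/\1/p' \
    | sort -u
}
```

It could be used as `stuck_builds < controllers.log`, where `controllers.log` stands in for wherever the controller log was captured.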

Version-Release number of selected component (if applicable):

atomic-openshift-3.10.72-1.git.0.3cb2fdc.el7.x86_64


Actual results:

Build succeeds with error logs and delay

Expected results:

Build completes successfully without errors and in the time it should usually take.


Additional info:

-> Attached the template of one build that is also displaying this behaviour.
-> Attached sosreport.
-> Build config

Comment 8 Mitchell Rollinson 2019-03-12 19:13:34 UTC
@Adam - attaching logs as requested.

Comment 10 Adam Kaplan 2019-03-19 13:39:52 UTC
@Mitchell @Venkata based on the logs, builds are able to complete. However, at some point there is a build whose state cannot be transitioned, and eventually the build controller gives up reporting state. This causes the next build for the given BuildConfig to be delayed.

As a work-around, the customer can cancel the individual builds that are generating the invalid state transition messages. If that fails to cancel, they can force the build to be deleted via `oc delete build <build-config>-<build-number>`.
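
The work-around described here could be scripted roughly as follows. This is a dry-run sketch that only prints the commands instead of running them. `oc cancel-build` and `oc delete build` are existing `oc` subcommands; the function name `print_workaround` and the `namespace/build` input format (as it appears in the log messages) are assumptions made for illustration.

```shell
# print_workaround (hypothetical helper): for each "namespace/build" line on
# stdin, print the cancel command followed by the delete command that serves
# as a fallback if the cancel does not take effect. Nothing is executed.
print_workaround() {
  while IFS=/ read -r ns build; do
    # Skip lines that do not have both a namespace and a build name.
    [ -n "$ns" ] && [ -n "$build" ] || continue
    echo "oc cancel-build $build -n $ns"
    echo "oc delete build $build -n $ns"
  done
}
```

For example, `echo "platform/test-jenkins-ee-s2i-414-49-1" | print_workaround` prints the two commands for that build so they can be reviewed before being run by hand.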

We will need additional time to investigate why the build is trying to transition from Running to Pending in the first place.

Comment 20 Gabe Montero 2019-04-17 18:54:31 UTC
PR https://github.com/openshift/origin/pull/22585 is marked for merging in the master/4.x branch, and cherry picks have been requested for 3.11 and 3.10.

Will report back when the PRs for those cherrypick requests are up.

Comment 23 wewang 2019-05-08 01:39:19 UTC
Tested in 3.10, 3.11, and 4.1; the "invalid phase transition" error no longer appears.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-06-011159   True        False         20h     Cluster version is 4.1.0-0.nightly-2019-05-06-011159

Comment 25 errata-xmlrpc 2019-06-27 16:41:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1607

Comment 31 David 2019-07-30 23:34:25 UTC
Created attachment 1594861
master-logs-controllers_02279152

Comment 32 David 2019-07-30 23:35:49 UTC
Created attachment 1594862
master-logs_api.txt

Comment 33 David 2019-07-30 23:36:13 UTC
Created attachment 1594863
journalctl_atomic-openshift.txt

Comment 34 David 2019-07-30 23:36:37 UTC
Created attachment 1594864
master-logs_controllers.tx

Comment 35 David 2019-07-30 23:36:57 UTC
Created attachment 1594865
events.txt

Comment 36 Gabe Montero 2019-08-01 15:23:41 UTC
Yeah, build_controller.go:1354 does not line up with the 3.11 version that has the fix. That log comes from line 1111 in the 3.11 version, and line 1058 in the 3.10 version.

So the log from https://bugzilla.redhat.com/show_bug.cgi?id=1685322#c30 does not come from a system with the fix.
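
One way to repeat this comparison on another cluster is to list which build_controller.go source lines emit the message in that cluster's controller log. The following is a small sketch, assuming the message in question is the "Giving up retrying" line in the log format shown in the description; the function name `giving_up_lines` is hypothetical.

```shell
# giving_up_lines (hypothetical helper): read controller log lines on stdin
# and print the distinct build_controller.go line numbers that logged a
# "Giving up retrying" message, in ascending numeric order.
giving_up_lines() {
  sed -n 's/.*build_controller\.go:\([0-9][0-9]*\)] Giving up retrying.*/\1/p' \
    | sort -un
}
```

The printed numbers can then be compared by hand with the line numbers quoted in this comment.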

