Description of problem:
All builds on the cluster are delayed because of the error "build_controller.go:1289] Giving up retrying <build name>: invalid phase transition <build name> (Running) -> Pending". The error appears halfway through the build, which eventually completes, but with a delay.

Error log:
message:
I0218 18:06:31.384110 1 build_controller.go:1289] Giving up retrying <namespace/buildnumber>: invalid phase transition <namespace/buildnumber> (Running) -> Pending
I0218 18:44:12.971040 1 build_controller.go:1289] Giving up retrying customer-identities/customer-identities-49: invalid phase transition customer-identities/customer-identities-49 (Running) -> Pending
I0218 18:07:35.182839 1 build_controller.go:1289] Giving up retrying platform/test-jenkins-ee-s2i-414-49-1: invalid phase transition platform/test-jenkins-ee-s2i-414-49-1 (Running) -> Pending
I0218 15:07:28.570112 1 build_controller.go:1289] Giving up retrying <namespace>/python-s2i-3-3.5-test-123: invalid phase transition <namespace>/python-s2i-3-3.5-test-123 (Running) -> Pending
I0218 15:07:06.084165 1 build_controller.go:1289] Giving up retrying 765300/ngrxtest-79: invalid phase transition 765300/ngrxtest-79 (Running) -> Pending

Version-Release number of selected component (if applicable):
atomic-openshift-3.10.72-1.git.0.3cb2fdc.el7.x86_64

Actual results:
Build succeeds, but with error logs and a delay.

Expected results:
Build completes successfully without errors and in the time it should usually take.

Additional info:
-> Attached the template of one build that is also displaying this behaviour.
-> Attached sosreport.
-> Build config
@Adam - attaching logs as requested.
@Mitchell @Venkata based on the logs, builds are able to complete. However, at some point there is a build whose state cannot be transitioned, and eventually the build controller gives up reporting its state. This causes the next build for the given BuildConfig to be delayed. As a workaround, the customer can cancel the individual builds that are generating the invalid state transition messages. If a build fails to cancel, they can force it to be deleted via `oc delete build <build-config>-<build-number>`. We will need additional time to investigate why the build is trying to transition from Running to Pending in the first place.
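To illustrate why the controller rejects the update, here is a minimal sketch of a phase-ordering check. It is not the code from build_controller.go: the `Phase` type, the reduced set of phases, the `rank` table, and the `validTransition` function are all assumptions made for this example. The idea is only that a build may move forward (New, Pending, Running, then a terminal phase) but never backward, so Running -> Pending is refused.

```go
package main

import "fmt"

// Phase is a simplified, hypothetical stand-in for the build phase type.
type Phase string

const (
	New      Phase = "New"
	Pending  Phase = "Pending"
	Running  Phase = "Running"
	Complete Phase = "Complete"
	Failed   Phase = "Failed"
)

// rank orders the non-terminal phases; a build may only move to a higher rank.
var rank = map[Phase]int{New: 0, Pending: 1, Running: 2}

// isTerminal reports whether a phase is final in this simplified model.
func isTerminal(p Phase) bool {
	return p == Complete || p == Failed
}

// validTransition reports whether a build may move from one phase to another.
// Assumed rules: same-phase updates are allowed, terminal phases never change,
// any non-terminal phase may become terminal, and otherwise only forward moves
// are allowed.
func validTransition(from, to Phase) bool {
	if from == to {
		return true
	}
	if isTerminal(from) {
		return false
	}
	if isTerminal(to) {
		return true
	}
	return rank[to] > rank[from]
}

func main() {
	fmt.Println(validTransition(Pending, Running)) // forward move
	fmt.Println(validTransition(Running, Pending)) // backward move, as in the log
}
```

Under these assumed rules the first call returns true and the second returns false, which corresponds to the "invalid phase transition ... (Running) -> Pending" message in the controller log.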
PR https://github.com/openshift/origin/pull/22585 is marked for merging in the master/4.x branch, and cherry-picks have been requested for 3.11 and 3.10. Will report back when the PRs for those cherry-pick requests are up.
Tested in 3.10, 3.11, and 4.1; the "invalid phase transition" error no longer appears.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-06-011159   True        False         20h     Cluster version is 4.1.0-0.nightly-2019-05-06-011159
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:1607
Created attachment 1594861: master-logs-controllers_02279152
Created attachment 1594862: master-logs_api.txt
Created attachment 1594863: journalctl_atomic-openshift.txt
Created attachment 1594864: master-logs_controllers.tx
Created attachment 1594865: events.txt
Yeah, build_controller.go:1354 does not line up with the 3.11 version that has the fix. That log comes from line 1111 in the 3.11 version and from line 1058 in the 3.10 version. So the log from https://bugzilla.redhat.com/show_bug.cgi?id=1685322#c30 does not come from a system with the fix.
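The comparison above depends on reading the source file and line number out of the log header. A rough sketch of how that could be extracted from a controller log line follows. The `sourceLocation` helper and the regular expression are assumptions made for this example, based only on the layout of the log lines quoted in this report, and are not part of any OpenShift tooling.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// header matches the start of a log line such as
// "I0218 18:44:12.971040 1 build_controller.go:1289] ..."
// and captures the source file name and line number.
var header = regexp.MustCompile(`^[IWEF]\d{4} \d{2}:\d{2}:\d{2}\.\d+\s+\d+ ([^\s:]+):(\d+)\]`)

// sourceLocation is a hypothetical helper that returns the file and line
// that emitted a log message, or ok=false if the header is not recognized.
func sourceLocation(line string) (file string, lineNo int, ok bool) {
	m := header.FindStringSubmatch(line)
	if m == nil {
		return "", 0, false
	}
	n, err := strconv.Atoi(m[2])
	if err != nil {
		return "", 0, false
	}
	return m[1], n, true
}

func main() {
	// Sample line taken from the error log in the description.
	sample := "I0218 18:44:12.971040 1 build_controller.go:1289] Giving up retrying " +
		"customer-identities/customer-identities-49: invalid phase transition " +
		"customer-identities/customer-identities-49 (Running) -> Pending"
	file, n, ok := sourceLocation(sample)
	fmt.Println(file, n, ok)
}
```

The extracted line number can then be compared by hand against the line numbers of the log statement in the fixed 3.10 and 3.11 sources mentioned above.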