Bug 1719792

Summary: Build fail with "Unable to look up the service account secrets for this build" because SA token generation is delayed
Product: OpenShift Container Platform
Reporter: Alberto Gonzalez de Dios <algonzal>
Component: Build
Assignee: Gabe Montero <gmontero>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: wewang <wewang>
Severity: high
Docs Contact:
Priority: high
Version: 3.11.0
CC: adam.kaplan, ahoness, aos-bugs, apjagtap, bparees, gmontero, jokerman, mfojtik, mmccomas, rdiazgav, sparpate, wzheng
Target Milestone: ---
Keywords: Reopened
Target Release: 3.11.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1729508 1729509
Environment:
Last Closed: 2019-11-13 13:55:11 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1729508, 1729509    

Description Alberto Gonzalez de Dios 2019-06-12 14:55:18 UTC
Description of problem:
Service account (SA) token generation is sometimes delayed after a project is created. If a new build is started in that window (e.g. via oc new-app), the build fails because the builder SA token is not yet present. The problem is that the build never recovers: there is no retry to get the builder token, so the build stays in the New state.

# oc describe builds/testapp-1 -n ns
Name:		testapp-1
Namespace:	ns
Created:	33 minutes ago
Labels:		app=testapp
		buildconfig=testapp
		openshift.io/build-config.name=testapp
		openshift.io/build.start-policy=Serial
Annotations:	openshift.io/build-config.name=testapp
		openshift.io/build.number=1

Status:		New (Unable to look up the service account secrets for this build.)
Duration:	waiting for 33m4s


Background:
BZ 1462542 "API token is not present at build time":
If new-app is run immediately after a project is created, it races with service account token generation. This has been the case since 3.0 (cf. https://bugzilla.redhat.com/show_bug.cgi?id=1318917#c5)

The time it takes for the service account and tokens to be provisioned has been greatly improved, but the fundamental race still exists.
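
One way a script could sidestep the race described above is to wait until the builder token is readable before starting the first build. The sketch below is only an illustration: the function name wait_for_builder_token and its default attempt count and delay are assumptions, and the only oc call it relies on is the `oc serviceaccounts get-token builder` command used in the reproduction steps of this bug.

```shell
# Hypothetical helper (not part of oc): poll until the builder service account
# token can be read in the given project, or give up after a number of attempts.
# Usage: wait_for_builder_token <namespace> [attempts] [delay_seconds]
wait_for_builder_token() {
  local ns="$1" attempts="${2:-30}" delay="${3:-2}" i=1
  while [ "$i" -le "$attempts" ]; do
    # get-token exits non-zero while no token secret exists for the SA yet.
    if oc serviceaccounts get-token builder -n "$ns" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "builder token not available in $ns after $attempts attempts" >&2
  return 1
}
```

A script would call this between `oc new-project` and `oc new-app`, for example `wait_for_builder_token myproject && oc new-app ...`, where myproject is a placeholder. This only narrows the window on the client side; it does not change how the build controller behaves.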


Version-Release number of selected component (if applicable):
3.11.104

How reproducible:
Run a loop that creates a new project and fetches the builder SA token

Steps to Reproduce:
1. Run a loop that creates a project, fetches the builder SA token, and deletes the project:
for i in {1..200}; do echo "Test: $i" >> test-sa-debug && date >> test-sa-debug 2>&1 && oc new-project appu-1640 >> test-sa-debug 2>&1 && oc serviceaccounts get-token builder -n appu-1640 --loglevel 10 >> test-sa-debug 2>&1 && oc delete project appu-1640 --loglevel 10 >> test-sa-debug 2>&1; done


Actual results:
In HA environments, fetching the SA token fails after some iterations of the loop.

Comment 3 Adam Kaplan 2019-07-10 13:16:21 UTC
This is a bug in the build controller - we need to wait or retry fetching the builder service account secrets.

Unfortunately the best work-around in this situation is to cancel the first build, then start a new one for the given BuildConfig.
If the build is started via a script (`oc start-build`, `oc new-app`, etc.), the script may need to add a loop to check the status of the build and cancel+restart if the build does not move past the New state after a set period of time - 5 minutes is reasonable.
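
The cancel-and-restart loop suggested above could look roughly like the following. This is a minimal sketch, not a tested workaround: the function name restart_if_stuck, its argument order, and the 10-second poll interval are assumptions made for illustration. The 300-second default matches the 5 minutes suggested in this comment.

```shell
# Hypothetical helper (not part of oc): wait for a build to leave the New phase;
# if it is still New after the timeout, cancel it and start a fresh build from
# the same BuildConfig.
# Usage: restart_if_stuck <namespace> <buildconfig> <build> [timeout_s] [poll_s]
restart_if_stuck() {
  local ns="$1" bc="$2" build="$3" timeout="${4:-300}" interval="${5:-10}"
  local waited=0 phase
  while [ "$waited" -lt "$timeout" ]; do
    # .status.phase is New, Pending, Running, Complete, Failed, Error or Cancelled.
    phase=$(oc get build "$build" -n "$ns" -o jsonpath='{.status.phase}') || return 1
    if [ "$phase" != "New" ]; then
      echo "build $build is in phase $phase"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "build $build still New after ${timeout}s; cancelling and restarting"
  oc cancel-build "$build" -n "$ns"
  oc start-build "$bc" -n "$ns"
}
```

For the build shown in the description, the call would be `restart_if_stuck ns testapp testapp-1`. The helper restarts at most once; a script that needs several attempts would have to call it again with the name of the new build.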

Comment 9 Ben Parees 2019-07-30 16:42:59 UTC
> This is a bug in the build controller - we need to wait or retry fetching the builder service account secrets.


The controller does retry this, so I'm not sure why this would be happening.

Whenever we set that message, we return an error:
https://github.com/openshift/openshift-controller-manager/blob/master/pkg/build/controller/build/build_controller.go#L1079
https://github.com/openshift/openshift-controller-manager/blob/master/pkg/build/controller/build/build_controller.go#L1109
https://github.com/openshift/openshift-controller-manager/blob/master/pkg/build/controller/build/build_controller.go#L1132

That error bubbles up to the sync loop, which ultimately causes the key to be retried in the queue:

https://github.com/openshift/openshift-controller-manager/blob/master/pkg/build/controller/build/build_controller.go#L1675

If someone can reproduce this with level 4 logging enabled in the OpenShift controller manager, it might shed some light on what is happening here.

Comment 13 Ben Parees 2019-07-31 18:19:20 UTC
It does not look like those controller logs had loglevel 4 enabled.

Comment 21 Adam Kaplan 2019-10-31 14:25:50 UTC
*** Bug 1729509 has been marked as a duplicate of this bug. ***

Comment 23 Gabe Montero 2019-11-13 13:55:11 UTC
Customer case has been closed.