Description of problem:

Concurrent builds on scale-lab: 250 2xlarge workers. #builds = 250, 500, 750, 1000, 2000

1. Create n projects with a nodejs app; during the creation, each build is triggered automatically once.
2. When all builds from step 1 have completed, trigger them again with `oc start-build -n ...`

We monitored the number of builds with status "Running":

A. After the creation, we saw many builds with status "New", e.g.:
   svt-nodejs-1987   nodejs-mongodb-example-1   Source   Git   New (CannotRetrieveServiceAccount)

B. After `oc start-build`, at most 200 builds were running (about 10% of the total). The rest stayed in "New" or "Pending".

All builds completed eventually, but this differs from what we saw on OCP 3.11. It looks like there is some kind of throttling on the number of running builds (especially for B above). Is this expected behavior, or is there a parameter we can tune?

How reproducible: always

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

### pbench node
# oc version
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.0", GitCommit:"da48e8391", GitTreeState:"", BuildDate:"2019-04-18T14:04:32Z", GoVersion:"", Compiler:"", Platform:""}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+81fc896", GitCommit:"81fc896", GitTreeState:"clean", BuildDate:"2019-04-21T23:18:54Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

### jump node for project >= 250
# oc version
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.0-201904280051+49f08ba-dirty", GitCommit:"49f08ba", GitTreeState:"dirty", BuildDate:"2019-04-28T05:26:29Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+81fc896", GitCommit:"81fc896", GitTreeState:"clean", BuildDate:"2019-04-21T23:18:54Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

# oc get clusterversions.config.openshift.io
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-04-22-005054   True        False         5d15h   Cluster version is 4.1.0-0.nightly-2019-04-22-005054

# oc get pod -n openshift-image-registry image-registry-8ffd766d8-z6sld -o wide
NAME                             READY   STATUS    RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
image-registry-8ffd766d8-z6sld   1/1     Running   2          5d21h   10.128.5.253   ip-10-0-129-208.us-west-2.compute.internal   <none>           <none>
# no build pods running on this node

# oc get node ip-10-0-129-208.us-west-2.compute.internal
NAME                                         STATUS   ROLES   AGE     VERSION
ip-10-0-129-208.us-west-2.compute.internal   Ready    infra   5d21h   v1.13.4+da48e8391

# oc get configs.imageregistry.operator.openshift.io cluster -o json | jq -r .spec.storage
{
  "s3": {
    "bucket": "image-registry-us-west-2-XXX",
    "encrypt": true,
    "keyID": "",
    "region": "us-west-2",
    "regionEndpoint": ""
  }
}
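For reference, this is one way to watch how the builds are distributed across phases during a run (a minimal sketch; it assumes a client without the -A shorthand, hence --all-namespaces, and counts builds per phase every 30 seconds):

$ watch -n 30 "oc get builds --all-namespaces --no-headers -o custom-columns=PHASE:.status.phase | sort | uniq -c"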
If they are in New (CannotRetrieveServiceAccount), it is because the SA controller has not created the SAs for your new project yet. That is either "working as designed" or a bug against the SA controller to create the SAs faster, depending on how much latency you are seeing between project creation and SA creation.

If they are in New for any other reason, we'd need to see the details of the build yaml and the controller logs to understand why.

If they are in Pending, you should look for the associated build pod and see why it is not running; most likely your cluster doesn't have enough resources to schedule the pod. The pod yaml/description plus events should provide this information.
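For example, something like the following should show why a build pod is stuck (a sketch; the namespace and pod name are placeholders taken from this report):

$ oc get pod -n svt-nodejs-861 nodejs-mongodb-example-1-build -o yaml      # pod spec + status
$ oc describe pod -n svt-nodejs-861 nodejs-mongodb-example-1-build         # includes recent events
$ oc get events -n svt-nodejs-861 --sort-by=.lastTimestamp                 # scheduling/resource messages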
Based on these logs it looks like you may have attempted this multiple times. Can we get a clean run with timestamps for when you kicked off the run (created the buildconfigs)?

It would also help if you could enable loglevel=4 on the openshift controller manager. You can do this by running:

  oc edit openshiftcontrollermanagers.operator.openshift.io/cluster

and setting spec.logLevel: "4", then waiting for the controller manager pods to roll out. Then kick off your run. Once you see some build pods in a Complete state with their corresponding build still reporting "New" or "Pending", gather the controller logs and tell us the exact namespace + name of the build in question.

My current theory (based on nothing in particular) is that the build controller is falling way behind on processing the pod events and getting the builds updated. So it will also be useful to know whether, over time, your builds eventually make it into a Completed state.
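If a non-interactive equivalent is easier to script, something like this should work (a sketch; it assumes spec.logLevel takes the string "4" as described above, so drop the quotes around 4 if the field turns out to be an integer):

$ oc patch openshiftcontrollermanagers.operator.openshift.io/cluster --type=merge -p '{"spec":{"logLevel":"4"}}'
$ oc get pods -n openshift-controller-manager -w    # wait for the new controller manager pods to roll out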
# oc get pod -n svt-nodejs-861 nodejs-mongodb-example-8-build
NAME                             READY   STATUS      RESTARTS   AGE
nodejs-mongodb-example-8-build   0/1     Completed   0          4m55s

# oc get build -n svt-nodejs-861 nodejs-mongodb-example-8
NAME                       TYPE     FROM          STATUS    STARTED   DURATION
nodejs-mongodb-example-8   Source   Git@e59fe75   Pending

Logs are on the way.
Some data points/info on this:

The builds do eventually complete. It appears we have a backlog of events to process and the build controller eventually works its way through them. It is not clear (due to insufficient log tracing) whether the issue is that we aren't getting the events fast enough, or that we aren't processing them fast enough.

We effectively process 4800 events in 5 minutes (both build and pod create/update events), which amounts to about 16/sec. Since we have 5 queue workers, that's really about 3/sec per worker, or roughly a third of a second per event. That sounds a bit slow to me, but there's quite a range of work to be done: in some cases a no-op, in other cases pods have to be created.

It's possible we are also getting client-throttled, which makes a worker take longer to process events that require API calls. Based on output like this:

  I0430 21:46:40.074363       1 request.go:530] Throttling request took 293.9029ms, request: GET:https://172.30.0.1:443/api/v1/namespaces/svt-nodejs-939/serviceaccounts/builder

I think we are getting throttled. Need to figure out how to unthrottle the build controller's client.
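As a rough way to see how much of that per-event time is client-side throttling, the "Throttling request took" messages can be counted in the controller manager logs (a sketch; pod names will differ per cluster):

$ for p in $(oc get pods -n openshift-controller-manager -o name); do
    echo "$p: $(oc logs -n openshift-controller-manager "$p" | grep -c 'Throttling request took')"
  done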
Throttling is configured for controller clients here: https://github.com/openshift/origin/blob/b73bbf3477951439d917732ba636252d66f8d48a/pkg/cmd/openshift-controller-manager/controller/interfaces.go#L55-L64
David, did we change how client throttling works, or how client rate limiting is configured for controllers, in any meaningful way in 4.1? If I'm reading the code correctly, it seems like all controllers in the process share a single rate limit of 5 (burst to 10) for each openshift client type (builds, images, etc.) and then share another rate limit of 5/10 for all k8s resources.
OK, I dumped the rest config that the build controller is using; it looks like it is configured as QPS: 16, Burst: 31.
https://github.com/openshift/origin/pull/22734
Tested with 11 m5.2xlarge workers (including one dedicated to the image-registry pod); the scale-lab cluster is gone.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-07-085358   True        False         107m    Cluster version is 4.1.0-0.nightly-2019-05-07-085358

Results: a much bigger number of concurrent builds (500-700) showed up:
http://file.rdu.redhat.com/~hongkliu/test_result/bz1704722/20190507/Screenshot%20from%202019-05-07%2009-17-31.png

Although it is not 1000 as expected, I think that could be related to the limited computing resources:
http://file.rdu.redhat.com/~hongkliu/test_result/bz1704722/20190507/Screenshot%20from%202019-05-07%2009-53-52.png

The build pods stayed in Pending and Init status for quite a long time, so this can be fully confirmed only when I get hold of the scale lab again next round. That being said, I am convinced that Problem B in this bug is verified.

During the test I still saw Problem A (CannotRetrieveServiceAccount); I will file a new bug if I encounter it again next round.
Recovering need info
The service account bit would need to go to the master team. The issue there is that you are creating new projects and then, before the service accounts are created, you are creating builds in those projects. The build needs the service account to proceed, so it sits in New until the SA exists. It's not a bug per se, but if you think it's taking too long for the SAs to get created, then you'd have to ask the master team to look into it as a separate bug.
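If it helps the test tooling, one possible client-side workaround is to wait for the builder service account before triggering builds in a freshly created project (a sketch; the namespace and buildconfig name are placeholders based on this report):

$ until oc get sa builder -n svt-nodejs-1987 >/dev/null 2>&1; do sleep 2; done
$ oc start-build nodejs-mongodb-example -n svt-nodejs-1987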
All clear, Ben. Thanks for the information. I am not concerned about the SA issue either (since it won't be a problem after creation); I mention it here because we never hit it on OCP 3.11. I am testing more with fewer concurrent builds (since I have a much smaller cluster now) and will compare the build/push times against the last round (since you fixed the threshold, I want to know if it has any impact). Will update here.
I'd be surprised if it has any impact on build/push time.
This is what I got before your fix in the last round (4.0.0-0.nightly-2019-04-10-182914), #500 (n=2), s3:

2019-04-11 20:09:59,640 - build_test - MainThread - INFO - Failed builds: 2
2019-04-11 20:09:59,640 - build_test - MainThread - INFO - Invalid builds: 0
2019-04-11 20:09:59,640 - build_test - MainThread - INFO - Good builds included in stats: 998
2019-04-11 20:09:59,640 - build_test - MainThread - INFO - Average build time, all good builds: 1386
2019-04-11 20:09:59,640 - build_test - MainThread - INFO - Minimum build time, all good builds: 299
2019-04-11 20:09:59,640 - build_test - MainThread - INFO - Maximum build time, all good builds: 1850
2019-04-11 20:09:59,640 - build_test - MainThread - INFO - Average push time, all good builds: 13.8296593186
2019-04-11 20:09:59,640 - build_test - MainThread - INFO - Minimum push time, all good builds: 2.0
2019-04-11 20:09:59,640 - build_test - MainThread - INFO - Maximum push time, all good builds: 94.0

Result today with your fix (4.1.0-0.nightly-2019-05-07-085358):

2019-05-07 19:03:14,219 - build_test - MainThread - INFO - Failed builds: 92
2019-05-07 19:03:14,219 - build_test - MainThread - INFO - Invalid builds: 0
2019-05-07 19:03:14,219 - build_test - MainThread - INFO - Good builds included in stats: 908
2019-05-07 19:03:14,219 - build_test - MainThread - INFO - Average build time, all good builds: 1374
2019-05-07 19:03:14,219 - build_test - MainThread - INFO - Minimum build time, all good builds: 719
2019-05-07 19:03:14,219 - build_test - MainThread - INFO - Maximum build time, all good builds: 2164
2019-05-07 19:03:14,219 - build_test - MainThread - INFO - Average push time, all good builds: 12.0947136564
2019-05-07 19:03:14,219 - build_test - MainThread - INFO - Minimum push time, all good builds: 2.0
2019-05-07 19:03:14,219 - build_test - MainThread - INFO - Maximum push time, all good builds: 77.0

The build/push times do not change much. In the test we see the number of concurrent builds/pods at 250/500 ... exactly like 3.11.
http://file.rdu.redhat.com/~hongkliu/test_result/bz1704722/20190507/round2/

About the build time, results on OCP 3.11 (500 builds):

2018-09-26 16:25:15,514 - build_test - MainThread - INFO - Average build time, all good builds: 240
2018-09-26 16:25:15,514 - build_test - MainThread - INFO - Minimum build time, all good builds: 70
2018-09-26 16:25:15,514 - build_test - MainThread - INFO - Maximum build time, all good builds: 332
2018-09-26 16:25:15,514 - build_test - MainThread - INFO - Average push time, all good builds: 30.2643333333
2018-09-26 16:25:15,514 - build_test - MainThread - INFO - Minimum push time, all good builds: 1.0
2018-09-26 16:25:15,514 - build_test - MainThread - INFO - Maximum push time, all good builds: 147.0

The average build time on 3.11 is much smaller, but I think that is caused by the docker -> cri-o/buildah change from 3.11 to 4.0.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days