Description of problem:
Running 10, 100, 200, 300 concurrent builds in Online Pro production. This cluster has 60 computes, 3 registries and uses S3 as the backend storage for the registry. At the 300 concurrent build level a significant (20%) number of builds start failing with either 404 errors for the push or a cryptic EOF message.

Sample logs from failed builds are attached. I'll add a link to complete build logs in a private comment. Registry pod logs and the syslogs of the nodes where the registry is running are attached.

Version-Release number of selected component (if applicable):
3.5.5.50

How reproducible:
Always when running 300 concurrent builds (5 builds per compute)

Steps to Reproduce:
1. Online Pro production
2. Create 300 projects with a quickstart - we used nodejs-mongodb
3. Run all 300 builds concurrently. Looping the projects with oc start-build will work or you can use the svt build_test tool: https://github.com/openshift/svt/blob/master/openshift_performance/ose3_perf/scripts/build_test-README.md

Actual results:
20% of the builds fail with 404 errors on the push or with EOF messages like this:

2017-07-03T21:25:12.365716675Z error: build error: Failed to push image: Put https://172.30.118.67:5000/v2/proj44/nodejs-mongodb-example/blobs/uploads/c6348934-ce2a-480f-8307-d20a06a0a0d9?_state=RVk_IZciwmaWU3X7URbhnPJDO4CzFFnefRROMuybp-F7Ik5hbWUiOiJwcm9qNDQvbm9kZWpzLW1vbmdvZGItZXhhbXBsZSIsIlVVSUQiOiJjNjM0ODkzNC1jZTJhLTQ4MGYtODMwNy1kMjBhMDZhMGEwZDkiLCJPZmZzZXQiOjgwNjAsIlN0YXJ0ZWRBdCI6IjIwMTctMDctMDNUMjE6MjQ6MTJaIn0%3D&digest=sha256%3Ae8ccea6b2ec2c6cca16c742a2777017e86a223fb36961385d1ddfaba573de1b8: EOF

Expected results:
Less than 1% failures is the target
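For anyone reproducing without the svt tool, the "loop the projects with oc start-build" option in step 3 might look roughly like the sketch below. It assumes the projects are named proj0 through proj299 and that each one has a BuildConfig called nodejs-mongodb-example; both names are inferred from the push URL in the failure message rather than confirmed, so adjust them to the actual setup. The helper names and the worker count are illustrative only.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Assumption: BuildConfig name inferred from the push URL in the failed build log.
BUILD_CONFIG = "nodejs-mongodb-example"


def start_build_command(project, build_config=BUILD_CONFIG):
    """Return the argv list that starts one build in one project."""
    return ["oc", "start-build", build_config, "-n", project]


def start_all_builds(projects, run=subprocess.call, max_workers=50):
    """Start a build in every project and return {project: exit code}.

    `run` is injectable so the loop can be dry-run without a cluster.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        codes = list(pool.map(lambda p: run(start_build_command(p)), projects))
    return dict(zip(projects, codes))


# Example usage (assumed project naming, requires a logged-in oc client):
#   projects = ["proj%d" % i for i in range(300)]
#   results = start_all_builds(projects)
#   failed_to_start = [p for p, code in results.items() if code != 0]
```

Without --follow or --wait, oc start-build returns once the build has been created, so the builds themselves overlap even if the start calls are issued from a limited thread pool. The exit codes only say whether each build was started, not whether its push succeeded.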
Created attachment 1293989 build log for EOF failure
Created attachment 1293990 build log for 404 failure
Created attachment 1293991 registry pod logs and syslog for registry nodes
Pushing this out of the blocker list as we should not hit ~300 concurrent builds in prod at launch and users can retry their builds.
Mike, do the rate limiting options added to the registry resolve this? That is, can you configure a rate limit that allows 300 concurrent builds to succeed, though obviously they will take longer to get pushed? Note that we've also added automatic retry logic to build pushes, so even if pushes are failing, builds should eventually succeed unless the build gives up (it tries a total of 6 times, I believe).
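To make the retry point concrete, here is a minimal sketch of a bounded retry loop plus the arithmetic behind "should eventually succeed". This is not the builder's actual code: push is a hypothetical callable, the exponential delay is an assumption, and only the total of 6 attempts comes from the comment above. The failure-rate helper also assumes attempts fail independently, which is optimistic when the registry is under sustained load.

```python
import time


def push_with_retry(push, attempts=6, base_delay=1.0, sleep=time.sleep):
    """Call push() until it succeeds or `attempts` tries are used up.

    `push` is a hypothetical callable that raises on a failed push.
    The doubling delay between tries is an assumption for illustration.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return push()
        except Exception as err:
            last_error = err
            # Wait before the next try, but not after the final one.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error


def residual_failure_rate(per_attempt_failure, attempts=6):
    """Chance that every attempt fails, assuming independent attempts."""
    return per_attempt_failure ** attempts
```

With the 20% per-push failure rate seen at 300 concurrent builds, residual_failure_rate(0.2) works out to 0.2 to the 6th power, or 0.0064%, which would be well under the 1% target if the independence assumption held. In practice retries add load to the same registries, so the real number is likely to be higher.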