Bug 1467441

Summary: [paid][online-prod] Registry pushes failing with 404s and EOF errors for Online Pro with 300 concurrent builds
Product: OpenShift Online
Reporter: Mike Fiedler <mifiedle>
Component: Image Registry
Assignee: Alexey Gladkov <agladkov>
Status: CLOSED DEFERRED
QA Contact: ge liu <geliu>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 3.x
CC: abhgupta, aos-bugs, bparees, dyan, hongkliu, mifiedle
Target Milestone: ---
Keywords: OnlinePro, UpcomingRelease
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-10-30 14:02:59 UTC
Type: Bug
Attachments (Description: Flags):
build log for EOF failure: none
build log for 404 failure: none
registry pod logs and syslog for registry nodes: none

Description Mike Fiedler 2017-07-03 22:49:22 UTC
Description of problem:

We ran 10, 100, 200, and 300 concurrent builds in Online Pro production. The cluster has 60 compute nodes and 3 registries, and uses S3 as the backend storage for the registry.

At 300 concurrent builds, a significant share (20%) of the builds fail with either a 404 error on the push or a cryptic EOF message. Sample logs from failed builds are attached. I'll add a link to the complete build logs in a private comment. Registry pod logs and the syslogs of the nodes where the registries run are also attached.

Version-Release number of selected component (if applicable): 3.5.5.50


How reproducible: Always when running 300 concurrent builds (5 builds per compute)


Steps to Reproduce:
1.  Use the Online Pro production cluster.
2.  Create 300 projects with a quickstart (we used nodejs-mongodb).
3.  Run all 300 builds concurrently. Looping over the projects with oc start-build will work, or you can use the svt build_test tool: https://github.com/openshift/svt/blob/master/openshift_performance/ose3_perf/scripts/build_test-README.md
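
The oc start-build loop mentioned in step 3 could be sketched roughly as follows. This is only an illustration, not the svt build_test tool: the helper names (start_build, start_all_builds), the worker count, and the idea that the projects are named proj0 through proj299 are assumptions. The build config name nodejs-mongodb-example and the project proj44 are taken from the error log in this report.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def start_build(project, build_config="nodejs-mongodb-example", run=subprocess.run):
    """Trigger one build in the given project via the oc CLI.

    Returns True if `oc start-build` exited with status 0. This only
    confirms the build was started, not that it (or its push) succeeded.
    """
    cmd = ["oc", "start-build", build_config, "-n", project]
    result = run(cmd, capture_output=True, text=True)
    return result.returncode == 0


def start_all_builds(projects, max_workers=50, run=subprocess.run):
    """Trigger builds in all projects at roughly the same time.

    max_workers is an assumed value; it bounds how many oc invocations
    run in parallel on the client, not how many builds run in the cluster.
    """
    projects = list(projects)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outcomes = list(pool.map(lambda p: start_build(p, run=run), projects))
    return dict(zip(projects, outcomes))
```

With an assumed naming scheme, a call such as start_all_builds(f"proj{i}" for i in range(300)) would trigger all 300 builds; the run parameter exists only so the oc invocation can be replaced in a dry run.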

Actual results:

20% of the builds fail with 404 errors on the push or with EOF messages like this:

2017-07-03T21:25:12.365716675Z error: build error: Failed to push image: Put https://172.30.118.67:5000/v2/proj44/nodejs-mongodb-example/blobs/uploads/c6348934-ce2a-480f-8307-d20a06a0a0d9?_state=RVk_IZciwmaWU3X7URbhnPJDO4CzFFnefRROMuybp-F7Ik5hbWUiOiJwcm9qNDQvbm9kZWpzLW1vbmdvZGItZXhhbXBsZSIsIlVVSUQiOiJjNjM0ODkzNC1jZTJhLTQ4MGYtODMwNy1kMjBhMDZhMGEwZDkiLCJPZmZzZXQiOjgwNjAsIlN0YXJ0ZWRBdCI6IjIwMTctMDctMDNUMjE6MjQ6MTJaIn0%3D&digest=sha256%3Ae8ccea6b2ec2c6cca16c742a2777017e86a223fb36961385d1ddfaba573de1b8: EOF


Expected results:

The target is a failure rate below 1%.

Comment 1 Mike Fiedler 2017-07-03 22:50:25 UTC
Created attachment 1293989 [details]
build log for EOF failure

Comment 2 Mike Fiedler 2017-07-03 22:50:48 UTC
Created attachment 1293990 [details]
build log for 404 failure

Comment 3 Mike Fiedler 2017-07-03 22:52:37 UTC
Created attachment 1293991 [details]
registry pod logs and syslog for registry nodes

Comment 5 Abhishek Gupta 2017-07-06 17:33:41 UTC
Pushing this out of the blocker list, as we should not hit ~300 concurrent builds in prod at launch, and users can retry their builds.

Comment 6 Ben Parees 2017-10-03 01:56:21 UTC
Mike, do the rate limiting options added to the registry resolve this? (Can you configure a rate limit that allows 300 concurrent builds to succeed, though obviously they will take longer to get pushed?)

Note that we've also added automatic retry logic to build pushes, so even if pushes are failing, builds should eventually succeed unless the build gives up (it tries a total of 6 times, I believe).
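
To make the retry behaviour concrete, here is a minimal sketch of a bounded retry loop. It is not the actual OpenShift build code: the function name push_with_retry, the push callable, and the exponential backoff between attempts are all assumptions. Only the total of 6 attempts comes from the comment above, and that figure is itself stated from memory.

```python
import time


def push_with_retry(push, attempts=6, base_delay=1.0, sleep=time.sleep):
    """Call push() until it succeeds or the attempts are exhausted.

    push is any callable that raises an exception on failure.
    Returns the number of attempts used; if every attempt fails,
    the last exception is re-raised.
    """
    for attempt in range(1, attempts + 1):
        try:
            push()
            return attempt
        except Exception:
            if attempt == attempts:
                raise
            # Assumed backoff policy: base_delay, then 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Under this scheme a push that hits a transient 404 or EOF a few times still succeeds, at the cost of a longer build; a push that fails on all 6 attempts fails the build.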