Bug 1467441 - [paid][online-prod] Registry pushes failing with 404s and EOF errors for Online pro with 300 concurrent builds
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Online
Classification: Red Hat
Component: Image Registry
Version: 3.x
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Alexey Gladkov
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-07-03 22:49 UTC by Mike Fiedler
Modified: 2017-10-30 14:02 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-10-30 14:02:59 UTC
Target Upstream Version:
Embargoed:


Attachments
build log for EOF failure (6.82 KB, text/plain)
2017-07-03 22:50 UTC, Mike Fiedler
build log for 404 failure (6.37 KB, text/plain)
2017-07-03 22:50 UTC, Mike Fiedler
registry pod logs and syslog for registry nodes (11.46 MB, application/x-gzip)
2017-07-03 22:52 UTC, Mike Fiedler

Description Mike Fiedler 2017-07-03 22:49:22 UTC
Description of problem:

We ran 10, 100, 200, and 300 concurrent builds in Online Pro production.  This cluster has 60 compute nodes and 3 registries, and uses S3 as the backend storage for the registry.

At the 300 concurrent build level, a significant fraction (20%) of the builds start failing with either a 404 error on the push or a cryptic EOF message.   Sample logs from failed builds are attached.   I'll add a link to the complete build logs in a private comment.   Registry pod logs and the syslogs of the nodes where the registries are running are also attached.

Version-Release number of selected component (if applicable): 3.5.5.50


How reproducible: Always when running 300 concurrent builds (5 builds per compute)


Steps to Reproduce:
1.  Online Pro production
2.  Create 300 projects with a quickstart - we used nodejs-mongodb
3.  Run all 300 builds concurrently.   Looping over the projects with oc start-build works, or you can use the svt build_test tool:  https://github.com/openshift/svt/blob/master/openshift_performance/ose3_perf/scripts/build_test-README.md
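
For step 3, a minimal sketch of the oc start-build loop in Python. The project names (proj0 through proj299) and the build config name (nodejs-mongodb-example) are assumptions inferred from the push URL in the build log, not confirmed values, and max_workers is an arbitrary choice. It assumes oc is installed and already logged in to the cluster.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Assumed names: projects proj0..proj299 and a build config called
# nodejs-mongodb-example (inferred from the push URL in the build log).
PROJECTS = [f"proj{i}" for i in range(300)]
BUILD_CONFIG = "nodejs-mongodb-example"


def start_build_cmd(project, build_config=BUILD_CONFIG):
    """Return the oc command line that starts one build in one project."""
    return ["oc", "start-build", build_config, "-n", project]


def start_all(projects, run=subprocess.run, max_workers=50):
    """Start one build per project from a thread pool and collect the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: run(start_build_cmd(p)), projects))
```

Calling start_all(PROJECTS) would issue the 300 start requests. The run parameter exists only so the loop can be dry-run with a stand-in function before pointing it at a real cluster.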

Actual results:

20% of the builds fail with 404 errors on the push or with EOF messages like this:

2017-07-03T21:25:12.365716675Z error: build error: Failed to push image: Put https://172.30.118.67:5000/v2/proj44/nodejs-mongodb-example/blobs/uploads/c6348934-ce2a-480f-8307-d20a06a0a0d9?_state=RVk_IZciwmaWU3X7URbhnPJDO4CzFFnefRROMuybp-F7Ik5hbWUiOiJwcm9qNDQvbm9kZWpzLW1vbmdvZGItZXhhbXBsZSIsIlVVSUQiOiJjNjM0ODkzNC1jZTJhLTQ4MGYtODMwNy1kMjBhMDZhMGEwZDkiLCJPZmZzZXQiOjgwNjAsIlN0YXJ0ZWRBdCI6IjIwMTctMDctMDNUMjE6MjQ6MTJaIn0%3D&digest=sha256%3Ae8ccea6b2ec2c6cca16c742a2777017e86a223fb36961385d1ddfaba573de1b8: EOF


Expected results:

The target is a failure rate below 1%.

Comment 1 Mike Fiedler 2017-07-03 22:50:25 UTC
Created attachment 1293989
build log for EOF failure

Comment 2 Mike Fiedler 2017-07-03 22:50:48 UTC
Created attachment 1293990
build log for 404 failure

Comment 3 Mike Fiedler 2017-07-03 22:52:37 UTC
Created attachment 1293991
registry pod logs and syslog for registry nodes

Comment 5 Abhishek Gupta 2017-07-06 17:33:41 UTC
Pushing this out of the blocker list, as we should not hit ~300 concurrent builds in prod at launch and users can retry their builds.

Comment 6 Ben Parees 2017-10-03 01:56:21 UTC
Mike, do the rate limiting options added to the registry resolve this?  (Can you configure a rate limit that allows 300 concurrent builds to succeed, though obviously they will take longer to get pushed?)
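
To illustrate the idea behind such a limit, here is a generic sketch of capping concurrent writes. It is not the registry's actual implementation or configuration; WriteLimiter, max_running and max_wait_seconds are hypothetical names. Requests beyond the cap wait for a free slot for a bounded time and are rejected if none opens up.

```python
import threading


class WriteLimiter:
    """Hypothetical limiter: at most max_running handlers run at once."""

    def __init__(self, max_running, max_wait_seconds):
        self._slots = threading.BoundedSemaphore(max_running)
        self._max_wait = max_wait_seconds

    def run(self, handler):
        # Wait up to max_wait_seconds for a free slot, then give up.
        if not self._slots.acquire(timeout=self._max_wait):
            raise TimeoutError("too many concurrent writes; retry later")
        try:
            return handler()
        finally:
            # Always free the slot, even if the handler raised.
            self._slots.release()
```

With a cap like this, 300 concurrent pushes would be served a few at a time: slower overall, but the backend never sees more than max_running writes at once.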

Note that we've also added automatic retry logic to build pushes, so even if pushes are failing, builds should eventually succeed unless the build gives up (it tries a total of 6 times, I believe).
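
A rough sketch of what such retry logic can look like, for readers unfamiliar with the pattern. This is not the builder's actual code: push_with_retry, the exponential backoff, and the delay values are assumptions, and the default of 6 attempts only mirrors the unconfirmed figure above.

```python
import time


def push_with_retry(push, attempts=6, base_delay=1.0, sleep=time.sleep):
    """Call push() until it succeeds or the attempts are used up."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return push()
        except IOError as err:
            last_error = err
            if attempt < attempts:
                # Assumed backoff: 1s, 2s, 4s, ... between attempts.
                sleep(base_delay * 2 ** (attempt - 1))
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

Here push is any callable that raises IOError on a failed push, such as an EOF on the connection, and returns normally on success.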

