Description of problem:

On at least 3 of our clusters, I've seen this issue come up about once per day during a new app build:

builder.go:185] Error: build error: Failed to push image: Received unexpected HTTP status: 500 Internal Server error

In the registry pod's logs, I can see status=500 errors occurring. Sometimes the errors come with different error messages, but this is the one I see most often (URL shortened for brevity):

level=error msg="response completed with error" err.code=UNKNOWN err.detail="s3: Put https://s3.amazonaws.com/[...]/data: EOF" err.message="unknown error" go.version=go1.4.2 http.request.host="172.30.184.56:5000"

Version-Release number of selected component (if applicable):

How reproducible:
I'm uncertain how to reproduce it. When I was creating new projects and apps by hand (just the oc commands documented below), I thought it was quite reproducible, since I hit it 3 times in the same day on different clusters; there was probably a 1 in 5 chance of reproducing at that time. However, when I later tried to reproduce it using a bash loop creating 30 apps, I could not.

Steps to Reproduce:
1. Maybe have some other docker pushes going simultaneously? The first step is unknown.
2. oc new-project <projectname>
3. oc new-app cakephp-example -n <projectname>

Actual results:
Builds occasionally fail, with status=500 errors appearing in the registry pod logs.

Expected results:
Builds ideally shouldn't be failing. Perhaps the registry could handle errors it receives from S3 more gracefully?

Additional info:
I know this is a fairly vague bug to file, but I'm hoping it will help us track down this issue, since it's currently affecting one customer.
Related upstream issues:

- https://github.com/docker/distribution/issues/1288
  - error message "The request signature we calculated does not match the signature you provided."
  - setting REGISTRY_STORAGE_S3_V4AUTH=false allegedly solves the problem; however, we don't set it
- https://github.com/docker/distribution/issues/873
  - error message "The request signature we calculated does not match the signature you provided"
  - error message "error fetching signature from : net/http: transport closed before response was received"
  - reproducible under heavy load
  - supposedly caused by a [glibc bug](https://sourceware.org/bugzilla/show_bug.cgi?id=15946) (corresponding [golang bug](https://github.com/golang/go/issues/6336))
Documentation PR https://github.com/openshift/openshift-docs/pull/1900 available for review.
The glibc bug mentioned in comment 2 is out of the question. It was already fixed in the glibc-2.17-90 build on Fri May 29 2015 [1]. I've tested on RHEL 7.2 with glibc-2.17-106.el7_2.1.x86_64 (which is a bit older than the one shipped in the latest public atomic image available) and couldn't reproduce the problem using the reproducer [2] linked in the sourceware bugzilla [3] (look for CVE-2013-7423).

[1] https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=478390
[2] https://sourceware.org/bugzilla/attachment.cgi?id=8161
[3] https://sourceware.org/bugzilla/show_bug.cgi?id=15946

Related bz is bug 1194143.
I followed up with AWS about the 500 errors being received from S3. The service rep gave me some tips [1][2] and suggested we implement some kind of retry logic, since these errors are occasionally expected from the service. So I think getting 1-2 of these per day is completely normal. He said that if we were getting, say, 20% 500 errors, that would indicate an actual problem. For the most part, though, this is just normal S3 operation and we'll have to adjust our application accordingly.

[1] "500-series errors indicate that a request didn't succeed, but may be retried. Though infrequent, these errors are to be expected as part of normal interaction with the service and should be explicitly handled with an exponential backoff algorithm (ideally one that utilizes jitter). One such algorithm can be found at http://en.wikipedia.org/wiki/Truncated_binary_exponential_backoff"

[2] https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
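For illustration, here is a minimal Go sketch of the kind of retry loop AWS is suggesting: retry transient 5xx/EOF failures with exponentially growing, jittered sleeps. The helper names (putWithBackoff, retryable) and the fake S3 PUT in main are hypothetical placeholders, not the actual docker/distribution S3 driver code; a real fix would live inside the driver and inspect the returned S3 status codes.

package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retryable reports whether an error is worth retrying. In a real driver
// this would check for 5xx responses or a dropped connection (EOF);
// here it is a stand-in that retries any error.
func retryable(err error) bool {
	return err != nil
}

// putWithBackoff retries op up to maxAttempts times, sleeping for a
// jittered, exponentially growing interval between attempts.
func putWithBackoff(op func() error, maxAttempts int) error {
	base := 100 * time.Millisecond
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		if !retryable(err) {
			return err
		}
		// Full jitter: sleep a random duration in [0, base*2^attempt).
		backoff := base << uint(attempt)
		time.Sleep(time.Duration(rand.Int63n(int64(backoff))))
	}
	return fmt.Errorf("giving up after %d attempts: %v", maxAttempts, err)
}

func main() {
	// Hypothetical S3 PUT that fails twice before succeeding.
	calls := 0
	err := putWithBackoff(func() error {
		calls++
		if calls < 3 {
			return errors.New("s3: Put .../data: EOF")
		}
		return nil
	}, 5)
	fmt.Println("result:", err, "after", calls, "calls")
}

The jitter matters here: if many builds retry their pushes on the same fixed schedule, the retries themselves can arrive at S3 in synchronized bursts, which is exactly the request pattern the AWS docs in [2] warn about.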
Thank you Stefanie! I'll look into adding such a retry into upstream's driver.
Any update on this Michal?
Confirmed with fork_ami_openshift3_miminar_295; cannot reproduce this issue.
Docker registry 2.4 has been back-ported to OpenShift 3.2. https://github.com/openshift/ose/pull/314
Will confirm this issue once the latest version is synced.
Confirmed with the latest OCP 3.2.1.16 version; can't reproduce this issue.

[root@ip-172-18-6-253 ~]# openshift version
openshift v3.2.1.16
kubernetes v1.2.0-36-g4a3f9c5
etcd 2.2.5
This bug has been fixed in OCP 3.3; however, the fix will not be backported to OSE 3.2.