Bug 1324629
Summary: | S3 Registry periodically returns "500 Internal Server Error" | | |
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Stefanie Forrester <dakini> |
Component: | Image Registry | Assignee: | Michal Minar <miminar> |
Status: | CLOSED NEXTRELEASE | QA Contact: | zhou ying <yinzhou> |
Severity: | medium | Docs Contact: | |
Priority: | medium | | |
Version: | 3.2.0 | CC: | aos-bugs, ihorvath, jgoulding, mfojtik, miminar, pweil, sspeiche, yinzhou |
Target Milestone: | --- | | |
Target Release: | 3.2.1 | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2016-12-14 19:40:32 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | | | |
Bug Blocks: | 1303130 | | |
Description
Stefanie Forrester
2016-04-06 20:14:18 UTC
Related upstream issues:

- https://github.com/docker/distribution/issues/1288
  - error message "The request signature we calculated does not match the signature you provided."
  - setting REGISTRY_STORAGE_S3_V4AUTH=false allegedly solves the problem
  - however, we don't set it
- https://github.com/docker/distribution/issues/873
  - error message "The request signature we calculated does not match the signature you provided"
  - error message "error fetching signature from : net/http: transport closed before response was received"
  - reproducible under heavy load
  - supposedly caused by a [glibc bug](https://sourceware.org/bugzilla/show_bug.cgi?id=15946) (corresponding [golang bug](https://github.com/golang/go/issues/6336))

Documentation PR https://github.com/openshift/openshift-docs/pull/1900 is available for review.

The glibc bug mentioned in comment 2 can be ruled out. It was already fixed in glibc-2.17-90, built on Fri May 29 2015 [1]. I tested on RHEL 7.2 with glibc-2.17-106.el7_2.1.x86_64 (which is a bit older than the one shipped in the latest public atomic image available) and couldn't reproduce the problem using the reproducer [2] linked in the sourceware bugzilla [3] (look for CVE-2013-7423).

[1] https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=478390
[2] https://sourceware.org/bugzilla/attachment.cgi?id=8161
[3] https://sourceware.org/bugzilla/show_bug.cgi?id=15946

The related bz is bug 1194143.

I followed up with AWS about the 500 errors being received from S3. The service rep gave me some tips [1][2] and suggested we implement some kind of retry logic, since these errors are occasionally expected from the service. So I think getting 1-2 of these per day is completely normal. He said that if we were getting, say, 20% 500 errors, that would indicate an actual problem. But for the most part, this is just normal S3 operation and we'll have to adjust our application accordingly.

[1] "500-series errors indicate that a request didn't succeed, but may be retried. Though infrequent, these errors are to be expected as part of normal interaction with the service and should be explicitly handled with an exponential backoff algorithm (ideally one that utilizes jitter). One such algorithm can be found at http://en.wikipedia.org/wiki/Truncated_binary_exponential_backoff"
[2] https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html

Thank you Stefanie! I'll look into adding such a retry into upstream's driver.

Any update on this, Michal?

Confirmed with fork_ami_openshift3_miminar_295; cannot reproduce this issue.

Docker registry 2.4 has been back-ported to OpenShift 3.2. https://github.com/openshift/ose/pull/314

Will confirm this issue when the latest version syncs.

Confirmed with the latest OCP 3.2.1.16 version; can't reproduce this issue.

    [root@ip-172-18-6-253 ~]# openshift version
    openshift v3.2.1.16
    kubernetes v1.2.0-36-g4a3f9c5
    etcd 2.2.5

This bug has been fixed in OCP 3.3; however, the fix will not be backported to OSE 3.2.
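
For readers following the AWS advice quoted above (retry 500-series responses using exponential backoff with jitter), here is a minimal Python sketch of the idea. It is not the registry's S3 driver code and not the fix that shipped. `ServerError`, `backoff_delay`, `call_with_retry`, and the default values for `base`, `cap`, and `max_attempts` are all hypothetical names and numbers chosen for illustration.

```python
import random
import time


class ServerError(Exception):
    """Hypothetical exception standing in for an HTTP error response from S3."""

    def __init__(self, status):
        super().__init__(f"server error {status}")
        self.status = status


def backoff_delay(attempt, base=0.1, cap=10.0, rng=random.random):
    """Delay in seconds before retry number `attempt` (0-based), with full jitter.

    The ceiling doubles on every attempt and is truncated at `cap`; the
    actual delay is drawn uniformly from [0, ceiling).
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def call_with_retry(operation, max_attempts=5, base=0.1, cap=10.0,
                    sleep=time.sleep, rng=random.random):
    """Call `operation()` and retry it when it raises a 500-series ServerError."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ServerError as err:
            # Only 500-series errors are treated as retryable here.
            if not 500 <= err.status < 600:
                raise
            # Give up once the last allowed attempt has failed.
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, base, cap, rng))
```

The `sleep` and `rng` parameters are injectable so that the retry schedule can be checked in tests without actually waiting. Whether an operation is safe to retry (for example, a partially completed upload) depends on the caller and is not addressed by this sketch.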