Bug 1324629 - S3 Registry periodically returns "500 Internal Server Error"
Summary: S3 Registry periodically returns "500 Internal Server Error"
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 3.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.2.1
Assignee: Michal Minar
QA Contact: zhou ying
URL:
Whiteboard:
Depends On:
Blocks: OSOPS_V3
 
Reported: 2016-04-06 20:14 UTC by Stefanie Forrester
Modified: 2020-02-14 17:44 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-12-14 19:40:32 UTC
Target Upstream Version:
Embargoed:



Description Stefanie Forrester 2016-04-06 20:14:18 UTC
Description of problem:

On at least 3 of our clusters, I've seen this issue come up about once per day, during a new app build:

builder.go:185] Error: build error: Failed to push image: Received unexpected HTTP status: 500 Internal Server error

In the registry pod's logs, I can see status=500 errors occurring. Sometimes the errors occur with different error messages, but this is the one I see most often (URL shortened for brevity):

level=error msg="response completed with error" err.code=UNKNOWN err.detail="s3: Put https://s3.amazonaws.com/[...]/data: EOF" err.message="unknown error" go.version=go1.4.2 http.request.host="172.30.184.56:5000"


Version-Release number of selected component (if applicable):


How reproducible:

I'm uncertain how to reproduce it. When I was creating new projects and apps by hand (just using the oc commands documented below), I thought it was quite reproducible, since I hit it 3 times in the same day, on different clusters. There was probably a 1 in 5 chance of reproducing at that time. However, when I went to reproduce it later, using a bash loop creating 30 apps, I could not.

Steps to Reproduce:
1. Maybe have some other docker pushes going simultaneously? First step is unknown.
2. oc new-project <projectname>
3. oc new-app cakephp-example -n <projectname>

Actual results:

Builds occasionally fail with status=500 errors appearing in the registry pod logs.

Expected results:

Builds ideally shouldn't fail. Perhaps the registry could handle errors it receives from S3 more gracefully?

Additional info:

I know this is a fairly vague bug to file, but I'm hoping it will help us track down this issue, since it's currently affecting one customer.

Comment 2 Michal Minar 2016-04-07 09:26:04 UTC
Related upstream issues:

- https://github.com/docker/distribution/issues/1288
  - error message "The request signature we calculated does not match the signature you provided."
  - setting REGISTRY_STORAGE_S3_V4AUTH=false allegedly solves the problem
  - however, we don't set it
- https://github.com/docker/distribution/issues/873
  - error message "The request signature we calculated does not match the signature you provided"
  - error message "error fetching signature from : net/http: transport closed before response was received"
  - reproducible under heavy load
  - supposedly caused by a [glibc bug](https://sourceware.org/bugzilla/show_bug.cgi?id=15946) (corresponding [golang bug](https://github.com/golang/go/issues/6336))

Comment 5 Michal Minar 2016-04-14 11:30:32 UTC
Documentation PR https://github.com/openshift/openshift-docs/pull/1900 available for review.

Comment 6 Michal Minar 2016-04-20 11:40:10 UTC
The glibc bug mentioned in comment 2 can be ruled out. It was already fixed in the glibc-2.17-90 build on Fri May 29 2015 [1]. I tested on RHEL 7.2 with glibc-2.17-106.el7_2.1.x86_64 (which is a bit older than the one shipped in the latest public Atomic image available) and couldn't reproduce the problem using the reproducer [2] linked from the sourceware bugzilla [3].

[1] https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=478390 (look for CVE-2013-7423)
[2] https://sourceware.org/bugzilla/attachment.cgi?id=8161
[3] https://sourceware.org/bugzilla/show_bug.cgi?id=15946

Related bz is bug 1194143.

Comment 7 Stefanie Forrester 2016-05-05 17:49:24 UTC
I followed up with AWS about the 500 errors being received from S3. The service rep gave me some tips [1][2] and suggested we implement some kind of retry logic, since these are occasionally expected from the service. So I think getting 1-2 of these per day is completely normal. He said if we were getting, say, 20% 500 errors, then that would indicate an actual problem. But for the most part, this is just normal S3 operation and we'll have to adjust our application accordingly.

[1] "500-series errors indicate that a request didn't succeed, but may be retried. Though infrequent, these errors are to be expected as part of normal interaction with the service and should be explicitly handled with an exponential backoff algorithm (ideally one that utilizes jitter). One such algorithm can be found at http://en.wikipedia.org/wiki/Truncated_binary_exponential_backoff "

[2] https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
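
As an illustration of the retry-with-jitter approach described in [1], here is a minimal Go sketch (Go being the language the registry and its S3 driver are written in). This is only a sketch: putObject and putWithRetry are hypothetical stand-ins, not functions from docker/distribution, and the delay values are arbitrary.

```go
// Sketch: retry a transient S3 failure with truncated exponential
// backoff plus jitter. putObject is a placeholder for the real S3 PUT.
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// errTransient stands in for a retryable failure such as an S3 500 or EOF.
var errTransient = errors.New("s3: 500 Internal Server Error")

// putObject simulates an upload that fails transiently on the first two attempts.
func putObject(attempt int) error {
	if attempt < 3 {
		return errTransient
	}
	return nil
}

// putWithRetry retries putObject up to maxAttempts times, doubling the
// delay each attempt (capped at maxDelay) and sleeping a random fraction
// of it (full jitter).
func putWithRetry(maxAttempts int, base, maxDelay time.Duration) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = putObject(attempt); err == nil {
			return nil
		}
		if attempt == maxAttempts {
			break
		}
		// Truncated exponential backoff: base * 2^(attempt-1), capped at maxDelay.
		delay := base << uint(attempt-1)
		if delay > maxDelay {
			delay = maxDelay
		}
		// Full jitter: sleep a random duration in [0, delay).
		sleep := time.Duration(rand.Int63n(int64(delay)))
		fmt.Printf("attempt %d failed (%v); retrying in %v\n", attempt, err, sleep)
		time.Sleep(sleep)
	}
	return fmt.Errorf("giving up after %d attempts: %v", maxAttempts, err)
}

func main() {
	if err := putWithRetry(5, 100*time.Millisecond, 2*time.Second); err != nil {
		fmt.Println("push failed:", err)
		return
	}
	fmt.Println("push succeeded")
}
```

The jitter matters as much as the backoff: it keeps many concurrent pushes from retrying against S3 in lockstep.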

Comment 8 Michal Minar 2016-05-09 07:37:44 UTC
Thank you Stefanie! I'll look into adding such a retry into upstream's driver.

Comment 9 Michal Fojtik 2016-07-20 11:22:15 UTC
Any update on this, Michal?

Comment 13 zhou ying 2016-09-01 09:05:09 UTC
Confirmed with fork_ami_openshift3_miminar_295; cannot reproduce this issue.

Comment 14 Michal Minar 2016-10-07 14:46:32 UTC
Docker registry 2.4 has been back-ported to OpenShift 3.2.

https://github.com/openshift/ose/pull/314

Comment 16 zhou ying 2016-10-09 03:03:03 UTC
Will confirm this issue when the latest version syncs.

Comment 17 zhou ying 2016-10-12 10:01:18 UTC
Confirmed with the latest OCP 3.2.1.16 version; can't reproduce this issue.
[root@ip-172-18-6-253 ~]# openshift version
openshift v3.2.1.16
kubernetes v1.2.0-36-g4a3f9c5
etcd 2.2.5

Comment 19 Scott Dodson 2016-12-14 19:40:32 UTC
This bug has been fixed in OCP 3.3; however, the fix will not be backported to OSE 3.2.

