Bug 1335279
Summary: | 504 gateway timeouts on 'docker push' to registry.dev-preview-int | ||
---|---|---|---|
Product: | OpenShift Online | Reporter: | Steve Speicher <sspeiche> |
Component: | Image Registry | Assignee: | Michal Minar <miminar> |
Status: | CLOSED WORKSFORME | QA Contact: | Bing Li <bingli> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 3.x | CC: | abhgupta, aos-bugs, bingli, dakini, erich, mfojtik, miminar, pep, rjstoneus, twaugh, yinzhou |
Target Milestone: | --- | Keywords: | UpcomingRelease |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2017-03-28 13:06:06 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1303130 |
Description

Steve Speicher, 2016-05-11 18:35:24 UTC
Is this still happening? I'm curious whether the ELB timeouts I set last week are helping. Here is the release ticket for reference: https://github.com/openshift/online/issues/131#issuecomment-217921293

---

This bug was reported after you made those changes, and I also just reproduced it right now (as I type). According to the release ticket above, this PR may solve the issue: https://github.com/kubernetes/kubernetes/pull/24142

---

The mentioned fix won't make it into Origin until after the next rebase, not the current one mfojtik is doing.

---

I haven't hit this issue in the last week or two, so I'm not sure what has changed. We are using OSE version v3.2.1.1-1 in INT now. Maybe that version is new enough to contain the fix?

---

I was referring to registry.preview.openshift.com.

---

Moving this bug to ON_QA for QE verification.

---

I can still reproduce this issue on INT:

```
[root@dev-preview-int-master-741a5 ~]# openshift version
openshift v3.2.1.10-1-g668ed0a
kubernetes v1.2.0-36-g4a3f9c5
etcd 2.2.5

[root@dhcp-136-93 ~]# docker tag openshift/jenkins-1-centos7 registry.dev-preview-int.openshift.com/zhouy/jenkins-1-centos7
[root@dhcp-136-93 ~]# docker push registry.dev-preview-int.openshift.com/zhouy/jenkins-1-centos7
The push refers to a repository [registry.dev-preview-int.openshift.com/zhouy/jenkins-1-centos7] (len: 1)
c014669e27a0: Pushed
b9cd42fb8607: Pushed
c58f3637b60b: Pushing 1.024 kB
Received unexpected HTTP status: 504 Gateway Time-out
```

---

Not yet; I'll be looking into this this week. Steve, Stefanie, can we get the registry log out of registry.dev-preview-int.openshift.com for the time corresponding to the failed push?

---

*** Bug 1333978 has been marked as a duplicate of this bug. ***

---

The registry logs are purged each time we do an upgrade, so we'll have to gather logs the next time the issue is reproduced. Let's try that after INT has been upgraded in the next day or so.

---

I'm still working on the upgrade, so it's not quite ready for testing yet, and the old logs have already been purged.
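The push above fails with a transient 504, and later comments note that images do go through after a few retries. A minimal retry wrapper could automate that; this is a hypothetical helper sketch (not part of Docker or OpenShift), with an arbitrary attempt count and pause:

```shell
#!/bin/sh
# retry: run a command up to N times, pausing between failed attempts.
# Hypothetical helper; attempt count and sleep interval are arbitrary choices.
retry() {
  attempts=$1; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0                      # command succeeded
    fi
    if [ "$i" -lt "$attempts" ]; then
      echo "attempt $i/$attempts failed; retrying..." >&2
      sleep 1                       # brief pause before the next attempt
    fi
    i=$((i + 1))
  done
  return 1                          # all attempts failed
}

# Example use against the repository from the reproduction above:
# retry 5 docker push registry.dev-preview-int.openshift.com/zhouy/jenkins-1-centos7
```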
I ran into this problem today. Setting this seemed to help somewhat:

```
docker@default:~$ cat /var/lib/boot2docker/profile
EXTRA_ARGS='
--label provider=virtualbox
--max-concurrent-uploads=1
'
...
```

That is, I added --max-concurrent-uploads=1 (the default is 5). This did not fix all of the problems, but I was at least able to get an image uploaded after a few retries. I would like to try this again in INT after the upgrade is done.

---

@dakini Any chance we could get a hold of the registry logs from the time of the gateway timeout occurrence?

---

(In reply to Michal Minar from comment #16)
> @dakini Any chance we could get a hold of the registry logs from the time of
> the gateway timeout occurrence?

We probably can get those logs, yes. But I haven't heard of any occurrence of a gateway timeout since the latest build. We're running 3.4.0.18 in dev-preview-int currently, and we plan to upgrade to 3.4.0.19 later this evening when we get the new build. If someone can ping me when they hit the issue again, I'll retrieve the logs for it.

---

Cool, I'll give it a shot.

---

@abhgupta It seems hard for me to reproduce. This morning I tried to push 30 images in parallel on two hosts; their average size was 250 MB. I wasn't able to hit this problem: all pushes were successful. Maybe that was due to a low load on the cluster. Nevertheless, given the moderately low occurrence rate and the simple workaround of just retrying, I wouldn't consider it a blocker. I'll try again in the afternoon, expecting a higher load and a higher probability of hitting this.

---

We haven't hit this issue in our tests against Online 3.4 these days. Will try more. Thanks!

---

I'm not hitting this anymore. I'm just going to close it and reopen if it happens again.
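The boot2docker profile above passes the flag to the daemon through EXTRA_ARGS. On hosts that configure the daemon via /etc/docker/daemon.json instead, the same limit can be expressed there; this is a sketch assuming a Docker version new enough to support the max-concurrent-uploads option, not a verified fix for this bug:

```json
{
  "max-concurrent-uploads": 1
}
```

The Docker daemon must be restarted after editing the file for the setting to take effect.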