Bug 1381532
| Summary: | [dedicated][aws]error: build error: Failed to push image: Received unexpected HTTP status: 500 Internal Server Error | | |
|---|---|---|---|
| Product: | OpenShift Online | Reporter: | Steve Speicher <sspeiche> |
| Component: | Image Registry | Assignee: | Abhishek Gupta <abhgupta> |
| Status: | CLOSED WORKSFORME | QA Contact: | zhaliu |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.x | CC: | abhgupta, agladkov, aos-bugs, bingli, bparees, dakini, mfojtik, mifiedle, pweil, rjstoneus, xiuwang, xtian, yufchang, zhaliu |
| Target Milestone: | --- | Keywords: | OnlineDedicated |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-11-06 23:03:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | docker-registry pod logs (attachment 1207204) | | |
Description (Steve Speicher, 2016-10-04 11:45:19 UTC)
dakini, can you add logs from your end or maybe you know what the problem is?

Created attachment 1207204 [details]: docker-registry pod logs

I'm seeing an unusually high number of http.response.status=500 in the docker logs. It might be normal to see this once every few days, but this is much more frequent than expected. This is what we're currently running in prod:

```
atomic-openshift-3.3.0.33-1.git.0.8601ee7.el7.x86_64
registry.ops.openshift.com/openshift3/ose-docker-registry:v3.3.0.33

[root@preview-master-ed4d3 ~]# oc get pods
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-15-6oiqo   1/1       Running   0          14h
docker-registry-15-xv1eq   1/1       Running   0          14h
```

The two errors I'm seeing are these:

```
err.detail="manifest invalid: manifest invalid"
err.detail="Timeout: request did not complete within allowed duration"
```

The timeout error is the same one we're getting in the intercom reconciler pod. I know there's been some very high load on etcd since the upgrade yesterday. I'll follow up on that today. I had to update the docker-registry config during this upgrade, because the old config was not backwards-compatible during INT testing.
Here is the docker-registry config, before the upgrade:
```yaml
version: 0.1
log:
  level: debug
http:
  addr: :5000
storage:
  cache:
    layerinfo: inmemory
  delete:
    enabled: true
  s3:
    accesskey: REDACTED
    secretkey: REDACTED
    region: us-east-1
    bucket: oso-preview-docker-registry
    encrypt: true
    secure: true
    v4auth: true
    rootdirectory: /registry
auth:
  openshift:
    realm: openshift
middleware:
  repository:
    - name: openshift
      options:
        pullthrough: true
```
This is the config after the upgrade:
```yaml
version: 0.1
log:
  level: debug
http:
  addr: :5000
storage:
  cache:
    blobdescriptor: inmemory
  delete:
    enabled: true
  s3:
    accesskey: REDACTED
    secretkey: REDACTED
    region: us-east-1
    bucket: oso-preview-docker-registry
    encrypt: true
    secure: true
    v4auth: true
    rootdirectory: /registry
auth:
  openshift:
    realm: openshift
middleware:
  registry:
    - name: openshift
  repository:
    - name: openshift
      options:
        acceptschema2: false
        pullthrough: true
        enforcequota: false
        projectcachettl: 1m
        blobrepositorycachettl: 10m
  storage:
    - name: openshift
```
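As an editorial aside (not part of the original thread): a minimal sketch of how the 500 responses described above can be tallied straight from the registry pods instead of saved log files. It assumes the registry runs in the `default` project with the stock `docker-registry=default` pod label.

```bash
#!/usr/bin/env bash
# Tally HTTP 500 responses in the integrated registry logs, split by the two
# err.detail strings reported in this bug. Assumes the registry runs in the
# "default" project with the usual docker-registry=default pod label.
for pod in $(oc get pods -n default -l docker-registry=default -o name); do
  echo "== ${pod} =="
  oc logs -n default "${pod}" \
    | grep 'http.response.status=500' > /tmp/registry-500s.log || true
  printf 'timeout:          %s\n' "$(grep -c 'Timeout' /tmp/registry-500s.log)"
  printf 'manifest invalid: %s\n' "$(grep -c 'manifest invalid' /tmp/registry-500s.log)"
done
```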
I think this is the same problem I ran into earlier:

```
Robert@crimson ~/Documents/GitHub/golang-ex (master)
$ docker push registry.preview.openshift.com/rjstone/golang:1.6-onbuild
The push refers to a repository [registry.preview.openshift.com/rjstone/golang]
f96e3740a202: Layer already exists
c312ac6b8ebe: Layer already exists
c9253187e251: Layer already exists
b01c73507859: Layer already exists
b16ea3826a8b: Layer already exists
a5eb0fc1decb: Pushed
b2ac5371e0f2: Pushed
142a601d9793: Pushed
received unexpected HTTP status: 500 Internal Server Error

Robert@crimson ~/Documents/GitHub/golang-ex (master)
$ docker push registry.preview.openshift.com/rjstone/golang:1.6-onbuild
The push refers to a repository [registry.preview.openshift.com/rjstone/golang]
f96e3740a202: Layer already exists
c312ac6b8ebe: Layer already exists
c9253187e251: Layer already exists
b01c73507859: Layer already exists
b16ea3826a8b: Layer already exists
a5eb0fc1decb: Layer already exists
b2ac5371e0f2: Layer already exists
142a601d9793: Layer already exists
1.6-onbuild: digest: sha256:931aba0613140d839c60544b9a07460bddd8faa1f83f916cba323789447f5fd6 size: 11963
```

(It failed on the first try after all layer pushes succeeded, then worked on the second.)

Any update on this? Hitting it again:

```
2016-10-05T21:25:36.185770530Z npm info ok
2016-10-05T21:25:40.542307520Z
2016-10-05T21:25:47.868490410Z
2016-10-05T21:25:47.868514363Z Pushing image 172.30.47.227:5000/sspeiche/njs:latest ...
2016-10-05T21:25:48.515258649Z Pushed 4/5 layers, 80% complete
2016-10-05T21:25:53.597362788Z Pushed 5/5 layers, 100% complete
2016-10-05T21:26:25.296186600Z Registry server Address:
2016-10-05T21:26:25.296223927Z Registry server User Name: serviceaccount
2016-10-05T21:26:25.296229481Z Registry server Email: serviceaccount
2016-10-05T21:26:25.296233826Z Registry server Password: <<non-empty>>
2016-10-05T21:26:25.296238316Z error: build error: Failed to push image: Received unexpected HTTP status: 500 Internal Server Error
```

The majority of the errors are still "manifest invalid". I don't see anything on the ops side that might be causing it.

```
[root@preview-master-e69da ~]# grep status=500 docker-registry-15-6oiqo.1.log | grep -c Timeout
230
[root@preview-master-e69da ~]# grep status=500 docker-registry-15-6oiqo.1.log | grep -c 'manifest invalid'
879
```

Michal, could you check over our config and see if we're maybe missing an option that's needed? I read over these before deploying to INT/STG/Prod, but maybe I missed something: https://docs.openshift.org/latest/install_config/registry/extended_registry_configuration.html#advanced-overriding-the-registry-configuration

In the PROD env, after about 20 rebuilds there were no similar errors, but a pull error occurred several times. Steps:

1. Create a new app from nodejs-ex.
2. Repeat the build several times.

Result:

```
oc get pod
NAME                 READY     STATUS      RESTARTS   AGE
nodejs-ex-1-build    0/1       Completed   0          58m
nodejs-ex-10-build   0/1       Completed   0          26m
nodejs-ex-11-build   0/1       Error       0          26m
nodejs-ex-12-build   0/1       Error       0          25m
nodejs-ex-13-8079l   1/1       Running     0          14m
nodejs-ex-13-build   0/1       Completed   0          24m
nodejs-ex-14-build   0/1       Error       0          22m
nodejs-ex-15-build   0/1       Completed   0          22m
nodejs-ex-16-build   0/1       Error       0          20m
nodejs-ex-17-build   0/1       Completed   0          20m
nodejs-ex-18-build   0/1       Completed   0          18m
nodejs-ex-19-build   0/1       Completed   0          15m
nodejs-ex-2-build    0/1       Completed   0          44m
nodejs-ex-3-build    0/1       Error       0          41m
nodejs-ex-4-build    0/1       Completed   0          35m
nodejs-ex-5-build    0/1       Completed   0          34m
nodejs-ex-6-build    0/1       Completed   0          33m
nodejs-ex-7-build    0/1       Completed   0          30m
nodejs-ex-8-build    0/1       Completed   0          28m
nodejs-ex-9-build    0/1       Error       0          27m
```

All the logs of the pods in Error status are similar:

```
Pulling image "registry.ops.openshift.com/rhscl/nodejs-4-rhel7@sha256:d1e5e8b533a2c77936107fefd3a61d80d0bb1ecfdf8387ae919626bcf54447d3" ...
Pulling image "registry.ops.openshift.com/rhscl/nodejs-4-rhel7@sha256:d1e5e8b533a2c77936107fefd3a61d80d0bb1ecfdf8387ae919626bcf54447d3" ...
error: build error: unable to get registry.ops.openshift.com/rhscl/nodejs-4-rhel7@sha256:d1e5e8b533a2c77936107fefd3a61d80d0bb1ecfdf8387ae919626bcf54447d3
```

The pull error was caused by an issue in the ops registry. Pulls should be 100% successful as of this morning. Moving it to QE for verification since they were able to reproduce this in their testing earlier.

I have verified the issue on STG 3.3.0.35 with 19 builds, and no problems were found during the builds:

```
oc get pod
NAME                 READY     STATUS      RESTARTS   AGE
nodejs-ex-1-build    0/1       Completed   0          1h
nodejs-ex-10-build   0/1       Completed   0          42m
nodejs-ex-11-build   0/1       Completed   0          40m
nodejs-ex-12-build   0/1       Completed   0          31m
nodejs-ex-13-build   0/1       Completed   0          28m
nodejs-ex-14-build   0/1       Completed   0          26m
nodejs-ex-15-build   0/1       Completed   0          22m
nodejs-ex-16-build   0/1       Completed   0          19m
nodejs-ex-17-build   0/1       Completed   0          10m
nodejs-ex-18-build   0/1       Completed   0          8m
nodejs-ex-19-build   0/1       Completed   0          6m
nodejs-ex-19-w6xd2   1/1       Running     0          4m
nodejs-ex-2-build    0/1       Completed   0          58m
nodejs-ex-3-build    0/1       Completed   0          57m
nodejs-ex-4-build    0/1       Completed   0          56m
nodejs-ex-5-build    0/1       Completed   0          54m
nodejs-ex-6-build    0/1       Completed   0          52m
nodejs-ex-7-build    0/1       Completed   0          49m
nodejs-ex-8-build    0/1       Completed   0          46m
nodejs-ex-9-build    0/1       Completed   0          43m
```

So, is this a benign error that we just see in the logs? Just so I understand this fully, does this apply to a specific case or scenario? For example, does it happen only with images built with newer docker? If it applies to every image, should we expect to see this error once for every image push/pull attempt? Is there a performance impact where the registry has to handle additional calls/requests?

This error indicates that there was an attempt to push a schema v2 manifest. It happens only with newer docker; it is part of the algorithm for selecting the supported schema. Unfortunately we cannot remove these messages from the log.

Marking UpcomingRelease while we figure out when we can enable schema v2 for the registry [1]. The impact of enabling it on older docker clients is highlighted here [2].

[1] https://docs.openshift.org/latest/install_config/registry/extended_registry_configuration.html#docker-registry-configuration-reference-middleware
[2] https://docs.openshift.org/latest/install_config/registry/extended_registry_configuration.html#middleware-repository-acceptschema2
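As an editorial aside (not from the original thread): the change discussed above amounts to flipping the acceptschema2 option that is set to false in the post-upgrade config shown earlier. A hedged sketch of the two usual routes; the environment-variable path assumes the REGISTRY_* override mechanism of the registry image and the registry deployed as dc/docker-registry in the default project.

```bash
# Sketch only: enable schema2 support on the integrated registry.
#
# 1) Edit the registry config mounted into the pods so the repository
#    middleware reads:
#
#      middleware:
#        repository:
#          - name: openshift
#            options:
#              acceptschema2: true   # was false in the post-upgrade config
#
# 2) Or (assuming the REGISTRY_* env override mechanism) set it on the
#    deployment config and let the registry redeploy:
oc set env dc/docker-registry -n default \
  REGISTRY_MIDDLEWARE_REPOSITORY_OPENSHIFT_ACCEPTSCHEMA2=true
```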
Abhishek, we think you can enable the v2 now (ping agladkov for more info).

Created one hundred repeat builds and can't reproduce this issue in [1].

```
Server     https://api.ded-stage-aws.openshift.com:443
openshift  v3.5.5.10
kubernetes v1.5.2+43a9be4
```

This is ancient, can it be resolved?

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
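For completeness, a sketch (not from the original thread) of the kind of repeat-build loop used for the verifications above. It assumes `oc` is logged in with a project selected, builds the nodejs-ex sample from its public git URL, and uses an arbitrary iteration count.

```bash
#!/usr/bin/env bash
# Repeatedly rebuild the nodejs-ex sample and count failed builds, mirroring
# the manual verification described in the comments above.
set -u
# Create the app once; ignore the error if it already exists.
oc new-app https://github.com/openshift/nodejs-ex.git --name=nodejs-ex || true
failures=0
for i in $(seq 1 20); do
  # --wait makes oc exit non-zero when the build does not complete successfully.
  oc start-build nodejs-ex --wait || failures=$((failures + 1))
done
echo "failed builds: ${failures}/20"
oc get builds
```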