Bug 1381532
| Summary: | [dedicated][aws]error: build error: Failed to push image: Received unexpected HTTP status: 500 Internal Server Error | | |
|---|---|---|---|
| Product: | OpenShift Online | Reporter: | Steve Speicher <sspeiche> |
| Component: | Image Registry | Assignee: | Abhishek Gupta <abhgupta> |
| Status: | CLOSED WORKSFORME | QA Contact: | zhaliu |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.x | CC: | abhgupta, agladkov, aos-bugs, bingli, bparees, dakini, mfojtik, mifiedle, pweil, rjstoneus, xiuwang, xtian, yufchang, zhaliu |
| Target Milestone: | --- | Keywords: | OnlineDedicated |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-11-06 23:03:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | docker-registry pod logs (attachment 1207204) | | |
Description (Steve Speicher, 2016-10-04 11:45:19 UTC)
dakini, can you add logs from your end or maybe you know what the problem is?

Created attachment 1207204 [details]: docker-registry pod logs

I'm seeing an unusually high number of http.response.status=500 in the docker logs. It might be normal to see this once every few days, but this is much more frequent than expected. This is what we're currently running in prod:

```
atomic-openshift-3.3.0.33-1.git.0.8601ee7.el7.x86_64
registry.ops.openshift.com/openshift3/ose-docker-registry:v3.3.0.33

[root@preview-master-ed4d3 ~]# oc get pods
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-15-6oiqo   1/1       Running   0          14h
docker-registry-15-xv1eq   1/1       Running   0          14h
```

The two errors I'm seeing are these:

```
err.detail="manifest invalid: manifest invalid"
err.detail="Timeout: request did not complete within allowed duration"
```

The timeout error is the same one we're getting in the intercom reconciler pod. I know there's been some very high load on etcd since the upgrade yesterday. I'll follow up on that today. I had to update the docker-registry config during this upgrade, because the old config was not backwards-compatible during INT testing.
Here is the docker-registry config, before the upgrade:
```yaml
version: 0.1
log:
  level: debug
http:
  addr: :5000
storage:
  cache:
    layerinfo: inmemory
  delete:
    enabled: true
  s3:
    accesskey: REDACTED
    secretkey: REDACTED
    region: us-east-1
    bucket: oso-preview-docker-registry
    encrypt: true
    secure: true
    v4auth: true
    rootdirectory: /registry
auth:
  openshift:
    realm: openshift
middleware:
  repository:
    - name: openshift
      options:
        pullthrough: true
```
This is the config after the upgrade:
```yaml
version: 0.1
log:
  level: debug
http:
  addr: :5000
storage:
  cache:
    blobdescriptor: inmemory
  delete:
    enabled: true
  s3:
    accesskey: REDACTED
    secretkey: REDACTED
    region: us-east-1
    bucket: oso-preview-docker-registry
    encrypt: true
    secure: true
    v4auth: true
    rootdirectory: /registry
auth:
  openshift:
    realm: openshift
middleware:
  registry:
    - name: openshift
  repository:
    - name: openshift
      options:
        acceptschema2: false
        pullthrough: true
        enforcequota: false
        projectcachettl: 1m
        blobrepositorycachettl: 10m
  storage:
    - name: openshift
```
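As an editorial aside (not part of the original thread): a minimal sketch of how the 500 responses described above can be tallied straight from the registry pods instead of saved log files. It assumes the registry runs in the `default` project with the stock `docker-registry=default` pod label.

```bash
#!/usr/bin/env bash
# Tally HTTP 500 responses in the integrated registry logs, split by the two
# err.detail strings reported in this bug. Assumes the registry runs in the
# "default" project with the usual docker-registry=default pod label.
for pod in $(oc get pods -n default -l docker-registry=default -o name); do
  echo "== ${pod} =="
  oc logs -n default "${pod}" \
    | grep 'http.response.status=500' > /tmp/registry-500s.log || true
  printf 'timeout:          %s\n' "$(grep -c 'Timeout' /tmp/registry-500s.log)"
  printf 'manifest invalid: %s\n' "$(grep -c 'manifest invalid' /tmp/registry-500s.log)"
done
```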
I think this is the same problem I ran into earlier:

```
Robert@crimson ~/Documents/GitHub/golang-ex (master)
$ docker push registry.preview.openshift.com/rjstone/golang:1.6-onbuild
The push refers to a repository [registry.preview.openshift.com/rjstone/golang]
f96e3740a202: Layer already exists
c312ac6b8ebe: Layer already exists
c9253187e251: Layer already exists
b01c73507859: Layer already exists
b16ea3826a8b: Layer already exists
a5eb0fc1decb: Pushed
b2ac5371e0f2: Pushed
142a601d9793: Pushed
received unexpected HTTP status: 500 Internal Server Error

Robert@crimson ~/Documents/GitHub/golang-ex (master)
$ docker push registry.preview.openshift.com/rjstone/golang:1.6-onbuild
The push refers to a repository [registry.preview.openshift.com/rjstone/golang]
f96e3740a202: Layer already exists
c312ac6b8ebe: Layer already exists
c9253187e251: Layer already exists
b01c73507859: Layer already exists
b16ea3826a8b: Layer already exists
a5eb0fc1decb: Layer already exists
b2ac5371e0f2: Layer already exists
142a601d9793: Layer already exists
1.6-onbuild: digest: sha256:931aba0613140d839c60544b9a07460bddd8faa1f83f916cba323789447f5fd6 size: 11963
```

(It failed on the first try after all layer pushes succeeded, then worked on the second.)

Any update on this? Hitting it again:

```
2016-10-05T21:25:36.185770530Z npm info ok
2016-10-05T21:25:40.542307520Z
2016-10-05T21:25:47.868490410Z
2016-10-05T21:25:47.868514363Z Pushing image 172.30.47.227:5000/sspeiche/njs:latest ...
2016-10-05T21:25:48.515258649Z Pushed 4/5 layers, 80% complete
2016-10-05T21:25:53.597362788Z Pushed 5/5 layers, 100% complete
2016-10-05T21:26:25.296186600Z Registry server Address:
2016-10-05T21:26:25.296223927Z Registry server User Name: serviceaccount
2016-10-05T21:26:25.296229481Z Registry server Email: serviceaccount
2016-10-05T21:26:25.296233826Z Registry server Password: <<non-empty>>
2016-10-05T21:26:25.296238316Z error: build error: Failed to push image: Received unexpected HTTP status: 500 Internal Server Error
```

The majority of the errors are still "manifest invalid". I don't see anything on the ops side that might be causing it.

```
[root@preview-master-e69da ~]# grep status=500 docker-registry-15-6oiqo.1.log | grep -c Timeout
230
[root@preview-master-e69da ~]# grep status=500 docker-registry-15-6oiqo.1.log | grep -c 'manifest invalid'
879
```

Michal, could you check over our config and see if we're maybe missing an option that's needed? I read over these before deploying to INT/STG/Prod, but maybe I missed something: https://docs.openshift.org/latest/install_config/registry/extended_registry_configuration.html#advanced-overriding-the-registry-configuration

In the PROD env, after about 20 rebuilds there were no similar errors, but a pull error occurred several times. Steps:

1. Create a new app from nodejs-ex.
2. Repeat the build several times.

Result:

```
oc get pod
NAME                 READY     STATUS      RESTARTS   AGE
nodejs-ex-1-build    0/1       Completed   0          58m
nodejs-ex-10-build   0/1       Completed   0          26m
nodejs-ex-11-build   0/1       Error       0          26m
nodejs-ex-12-build   0/1       Error       0          25m
nodejs-ex-13-8079l   1/1       Running     0          14m
nodejs-ex-13-build   0/1       Completed   0          24m
nodejs-ex-14-build   0/1       Error       0          22m
nodejs-ex-15-build   0/1       Completed   0          22m
nodejs-ex-16-build   0/1       Error       0          20m
nodejs-ex-17-build   0/1       Completed   0          20m
nodejs-ex-18-build   0/1       Completed   0          18m
nodejs-ex-19-build   0/1       Completed   0          15m
nodejs-ex-2-build    0/1       Completed   0          44m
nodejs-ex-3-build    0/1       Error       0          41m
nodejs-ex-4-build    0/1       Completed   0          35m
nodejs-ex-5-build    0/1       Completed   0          34m
nodejs-ex-6-build    0/1       Completed   0          33m
nodejs-ex-7-build    0/1       Completed   0          30m
nodejs-ex-8-build    0/1       Completed   0          28m
nodejs-ex-9-build    0/1       Error       0          27m
```

All the logs of the pods in Error status are similar:

```
Pulling image "registry.ops.openshift.com/rhscl/nodejs-4-rhel7@sha256:d1e5e8b533a2c77936107fefd3a61d80d0bb1ecfdf8387ae919626bcf54447d3" ...
Pulling image "registry.ops.openshift.com/rhscl/nodejs-4-rhel7@sha256:d1e5e8b533a2c77936107fefd3a61d80d0bb1ecfdf8387ae919626bcf54447d3" ...
error: build error: unable to get registry.ops.openshift.com/rhscl/nodejs-4-rhel7@sha256:d1e5e8b533a2c77936107fefd3a61d80d0bb1ecfdf8387ae919626bcf54447d3
```

The pull error was caused by an issue in the ops registry. Pulls should be 100% successful as of this morning. Moving it to QE for verification since they were able to reproduce this in their testing earlier.

I have verified the issue on STG 3.3.0.35 with 19 builds, and no problems were found during the builds:

```
oc get pod
NAME                 READY     STATUS      RESTARTS   AGE
nodejs-ex-1-build    0/1       Completed   0          1h
nodejs-ex-10-build   0/1       Completed   0          42m
nodejs-ex-11-build   0/1       Completed   0          40m
nodejs-ex-12-build   0/1       Completed   0          31m
nodejs-ex-13-build   0/1       Completed   0          28m
nodejs-ex-14-build   0/1       Completed   0          26m
nodejs-ex-15-build   0/1       Completed   0          22m
nodejs-ex-16-build   0/1       Completed   0          19m
nodejs-ex-17-build   0/1       Completed   0          10m
nodejs-ex-18-build   0/1       Completed   0          8m
nodejs-ex-19-build   0/1       Completed   0          6m
nodejs-ex-19-w6xd2   1/1       Running     0          4m
nodejs-ex-2-build    0/1       Completed   0          58m
nodejs-ex-3-build    0/1       Completed   0          57m
nodejs-ex-4-build    0/1       Completed   0          56m
nodejs-ex-5-build    0/1       Completed   0          54m
nodejs-ex-6-build    0/1       Completed   0          52m
nodejs-ex-7-build    0/1       Completed   0          49m
nodejs-ex-8-build    0/1       Completed   0          46m
nodejs-ex-9-build    0/1       Completed   0          43m
```

So, is this a benign error that we just see in the logs? Just so I understand this fully, does this apply to a specific case or scenario? For example, does it happen only with images built with newer docker? If it applies to every image, should we expect to see this error once for every image push/pull attempt? Is there a performance impact where the registry has to handle additional calls/requests?

This error indicates that there was an attempt to push a schema v2 manifest. It happens only with newer docker; it is part of the algorithm for selecting the supported schema. Unfortunately we cannot remove these messages from the log.

Marking UpcomingRelease while we figure out when we can enable schema v2 for the registry [1]. The impact of enabling it on older docker clients is highlighted here [2].

[1] https://docs.openshift.org/latest/install_config/registry/extended_registry_configuration.html#docker-registry-configuration-reference-middleware
[2] https://docs.openshift.org/latest/install_config/registry/extended_registry_configuration.html#middleware-repository-acceptschema2
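As an editorial aside (not from the original thread): the change discussed above amounts to flipping the acceptschema2 option that is set to false in the post-upgrade config shown earlier. A hedged sketch of the two usual routes; the environment-variable path assumes the REGISTRY_* override mechanism of the registry image and the registry deployed as dc/docker-registry in the default project.

```bash
# Sketch only: enable schema2 support on the integrated registry.
#
# 1) Edit the registry config mounted into the pods so the repository
#    middleware reads:
#
#      middleware:
#        repository:
#          - name: openshift
#            options:
#              acceptschema2: true   # was false in the post-upgrade config
#
# 2) Or (assuming the REGISTRY_* env override mechanism) set it on the
#    deployment config and let the registry redeploy:
oc set env dc/docker-registry -n default \
  REGISTRY_MIDDLEWARE_REPOSITORY_OPENSHIFT_ACCEPTSCHEMA2=true
```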
Abhishek, we think you can enable the v2 now (ping agladkov for more info).

Created one hundred repeat builds and can't reproduce this issue in [1].

```
Server     https://api.ded-stage-aws.openshift.com:443
openshift  v3.5.5.10
kubernetes v1.5.2+43a9be4
```

This is ancient, can it be resolved?

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
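For completeness, a sketch (not from the original thread) of the kind of repeat-build loop used for the verifications above. It assumes `oc` is logged in with a project selected, builds the nodejs-ex sample from its public git URL, and uses an arbitrary iteration count.

```bash
#!/usr/bin/env bash
# Repeatedly rebuild the nodejs-ex sample and count failed builds, mirroring
# the manual verification described in the comments above.
set -u
# Create the app once; ignore the error if it already exists.
oc new-app https://github.com/openshift/nodejs-ex.git --name=nodejs-ex || true
failures=0
for i in $(seq 1 20); do
  # --wait makes oc exit non-zero when the build does not complete successfully.
  oc start-build nodejs-ex --wait || failures=$((failures + 1))
done
echo "failed builds: ${failures}/20"
oc get builds
```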