Description of problem:
Repeated builds; on the 4th build on the upgraded preview (prod) environment, the image push fails.

Version-Release number of selected component (if applicable):
OpenShift Master: v3.3.0.33
Kubernetes Master: v1.3.0+52492b4

How reproducible:
Launch nodejs-ex with defaults, then repeat the build multiple times until it fails.

Additional info:
2016-10-04T10:49:14.935343243Z npm info ok
2016-10-04T10:49:18.963139727Z
2016-10-04T10:49:25.687201638Z
2016-10-04T10:49:25.687223345Z Pushing image 172.30.47.227:5000/sspeiche/njs:latest ...
2016-10-04T10:49:26.546152209Z Pushed 4/5 layers, 80% complete
2016-10-04T10:49:31.707420179Z Pushed 5/5 layers, 100% complete
2016-10-04T10:50:03.661885255Z Registry server Address:
2016-10-04T10:50:03.661924932Z Registry server User Name: serviceaccount
2016-10-04T10:50:03.661930464Z Registry server Email: serviceaccount
2016-10-04T10:50:03.661933557Z Registry server Password: <<non-empty>>
2016-10-04T10:50:03.661937061Z error: build error: Failed to push image: Received unexpected HTTP status: 500 Internal Server Error
dakini, can you add logs from your end, or do you know what the problem might be?
Created attachment 1207204 [details]
docker-registry pod logs
I'm seeing an unusually high number of http.response.status=500 in the docker logs. It might be normal to see this once every few days, but this is much more frequent than expected.

This is what we're currently running in prod:

atomic-openshift-3.3.0.33-1.git.0.8601ee7.el7.x86_64
registry.ops.openshift.com/openshift3/ose-docker-registry:v3.3.0.33

[root@preview-master-ed4d3 ~]# oc get pods
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-15-6oiqo   1/1       Running   0          14h
docker-registry-15-xv1eq   1/1       Running   0          14h

The two errors I'm seeing are these:

err.detail="manifest invalid: manifest invalid"
err.detail="Timeout: request did not complete within allowed duration"

The timeout error is the same one we're getting in the intercom reconciler pod. I know there's been some very high load on etcd since the upgrade yesterday. I'll follow up on that today.
I had to update the docker-registry config during this upgrade, because the old config was not backwards-compatible during INT testing.

Here is the docker-registry config before the upgrade:

version: 0.1
log:
  level: debug
http:
  addr: :5000
storage:
  cache:
    layerinfo: inmemory
  delete:
    enabled: true
  s3:
    accesskey: REDACTED
    secretkey: REDACTED
    region: us-east-1
    bucket: oso-preview-docker-registry
    encrypt: true
    secure: true
    v4auth: true
    rootdirectory: /registry
auth:
  openshift:
    realm: openshift
middleware:
  repository:
    - name: openshift
      options:
        pullthrough: true

This is the config after the upgrade:

version: 0.1
log:
  level: debug
http:
  addr: :5000
storage:
  cache:
    blobdescriptor: inmemory
  delete:
    enabled: true
  s3:
    accesskey: REDACTED
    secretkey: REDACTED
    region: us-east-1
    bucket: oso-preview-docker-registry
    encrypt: true
    secure: true
    v4auth: true
    rootdirectory: /registry
auth:
  openshift:
    realm: openshift
middleware:
  registry:
    - name: openshift
  repository:
    - name: openshift
      options:
        acceptschema2: false
        pullthrough: true
        enforcequota: false
        projectcachettl: 1m
        blobrepositorycachettl: 10m
  storage:
    - name: openshift
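For quick reference, the functional differences between the old and new config (as far as I can tell from the two dumps above) come down to the renamed cache key and the expanded middleware section:

storage:
  cache:
    blobdescriptor: inmemory        # was "layerinfo: inmemory"
middleware:
  registry:                         # new: registry-level middleware
    - name: openshift
  repository:
    - name: openshift
      options:
        acceptschema2: false        # new option
        pullthrough: true           # unchanged
        enforcequota: false         # new option
        projectcachettl: 1m         # new option
        blobrepositorycachettl: 10m # new option
  storage:                          # new: storage-level middleware
    - name: openshift

Everything else (log, http, auth, and the s3 storage settings) is unchanged.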
I think this is the same problem I ran into earlier:

Robert@crimson ~/Documents/GitHub/golang-ex (master)
$ docker push registry.preview.openshift.com/rjstone/golang:1.6-onbuild
The push refers to a repository [registry.preview.openshift.com/rjstone/golang]
f96e3740a202: Layer already exists
c312ac6b8ebe: Layer already exists
c9253187e251: Layer already exists
b01c73507859: Layer already exists
b16ea3826a8b: Layer already exists
a5eb0fc1decb: Pushed
b2ac5371e0f2: Pushed
142a601d9793: Pushed
received unexpected HTTP status: 500 Internal Server Error

Robert@crimson ~/Documents/GitHub/golang-ex (master)
$ docker push registry.preview.openshift.com/rjstone/golang:1.6-onbuild
The push refers to a repository [registry.preview.openshift.com/rjstone/golang]
f96e3740a202: Layer already exists
c312ac6b8ebe: Layer already exists
c9253187e251: Layer already exists
b01c73507859: Layer already exists
b16ea3826a8b: Layer already exists
a5eb0fc1decb: Layer already exists
b2ac5371e0f2: Layer already exists
142a601d9793: Layer already exists
1.6-onbuild: digest: sha256:931aba0613140d839c60544b9a07460bddd8faa1f83f916cba323789447f5fd6 size: 11963

(The first push failed after all layers had been pushed; the second attempt worked.)
Any update on this? Hitting it again.

2016-10-05T21:25:36.185770530Z npm info ok
2016-10-05T21:25:40.542307520Z
2016-10-05T21:25:47.868490410Z
2016-10-05T21:25:47.868514363Z Pushing image 172.30.47.227:5000/sspeiche/njs:latest ...
2016-10-05T21:25:48.515258649Z Pushed 4/5 layers, 80% complete
2016-10-05T21:25:53.597362788Z Pushed 5/5 layers, 100% complete
2016-10-05T21:26:25.296186600Z Registry server Address:
2016-10-05T21:26:25.296223927Z Registry server User Name: serviceaccount
2016-10-05T21:26:25.296229481Z Registry server Email: serviceaccount
2016-10-05T21:26:25.296233826Z Registry server Password: <<non-empty>>
2016-10-05T21:26:25.296238316Z error: build error: Failed to push image: Received unexpected HTTP status: 500 Internal Server Error
The majority of the errors are still "manifest invalid". I don't see anything on the ops side that might be causing it.

[root@preview-master-e69da ~]# grep status=500 docker-registry-15-6oiqo.1.log | grep -c Timeout
230
[root@preview-master-e69da ~]# grep status=500 docker-registry-15-6oiqo.1.log | grep -c 'manifest invalid'
879

Michal, could you check over our config and see if we're maybe missing an option that's needed? I read over these docs before deploying to INT/STG/Prod, but maybe I missed something.

https://docs.openshift.org/latest/install_config/registry/extended_registry_configuration.html#advanced-overriding-the-registry-configuration
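If it helps narrow this down, here is a rough way to break all 500s in that log down by error detail instead of grepping for each string separately (just a sketch; it assumes every 500 line carries an err.detail="..." field, which may not hold for all entries):

grep status=500 docker-registry-15-6oiqo.1.log \
  | grep -o 'err.detail="[^"]*"' \
  | sort | uniq -c | sort -rn

That should list each distinct error detail with its count, so anything beyond the timeout and "manifest invalid" errors would show up as well.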
In the PROD env, after about 20 rebuilds there were no similar push errors, but a pull error occurred several times.

Steps:
1. Create new app nodejs-ex
2. Repeat the build several times

Result:

oc get pod
NAME                 READY     STATUS      RESTARTS   AGE
nodejs-ex-1-build    0/1       Completed   0          58m
nodejs-ex-10-build   0/1       Completed   0          26m
nodejs-ex-11-build   0/1       Error       0          26m
nodejs-ex-12-build   0/1       Error       0          25m
nodejs-ex-13-8079l   1/1       Running     0          14m
nodejs-ex-13-build   0/1       Completed   0          24m
nodejs-ex-14-build   0/1       Error       0          22m
nodejs-ex-15-build   0/1       Completed   0          22m
nodejs-ex-16-build   0/1       Error       0          20m
nodejs-ex-17-build   0/1       Completed   0          20m
nodejs-ex-18-build   0/1       Completed   0          18m
nodejs-ex-19-build   0/1       Completed   0          15m
nodejs-ex-2-build    0/1       Completed   0          44m
nodejs-ex-3-build    0/1       Error       0          41m
nodejs-ex-4-build    0/1       Completed   0          35m
nodejs-ex-5-build    0/1       Completed   0          34m
nodejs-ex-6-build    0/1       Completed   0          33m
nodejs-ex-7-build    0/1       Completed   0          30m
nodejs-ex-8-build    0/1       Completed   0          28m
nodejs-ex-9-build    0/1       Error       0          27m

All the pods with status Error have similar logs:

Pulling image "registry.ops.openshift.com/rhscl/nodejs-4-rhel7@sha256:d1e5e8b533a2c77936107fefd3a61d80d0bb1ecfdf8387ae919626bcf54447d3" ...
Pulling image "registry.ops.openshift.com/rhscl/nodejs-4-rhel7@sha256:d1e5e8b533a2c77936107fefd3a61d80d0bb1ecfdf8387ae919626bcf54447d3" ...
error: build error: unable to get registry.ops.openshift.com/rhscl/nodejs-4-rhel7@sha256:d1e5e8b533a2c77936107fefd3a61d80d0bb1ecfdf8387ae919626bcf54447d3
The pulling error was caused by an issue in the ops registry. Pulls should be 100% successful as of this morning.
Moving it to QE for verification since they were able to reproduce this in their testing earlier.
I have verified the issue on STG 3.3.0.35 with 19 repeated builds, and no problems were found during the builds.

oc get pod
NAME                 READY     STATUS      RESTARTS   AGE
nodejs-ex-1-build    0/1       Completed   0          1h
nodejs-ex-10-build   0/1       Completed   0          42m
nodejs-ex-11-build   0/1       Completed   0          40m
nodejs-ex-12-build   0/1       Completed   0          31m
nodejs-ex-13-build   0/1       Completed   0          28m
nodejs-ex-14-build   0/1       Completed   0          26m
nodejs-ex-15-build   0/1       Completed   0          22m
nodejs-ex-16-build   0/1       Completed   0          19m
nodejs-ex-17-build   0/1       Completed   0          10m
nodejs-ex-18-build   0/1       Completed   0          8m
nodejs-ex-19-build   0/1       Completed   0          6m
nodejs-ex-19-w6xd2   1/1       Running     0          4m
nodejs-ex-2-build    0/1       Completed   0          58m
nodejs-ex-3-build    0/1       Completed   0          57m
nodejs-ex-4-build    0/1       Completed   0          56m
nodejs-ex-5-build    0/1       Completed   0          54m
nodejs-ex-6-build    0/1       Completed   0          52m
nodejs-ex-7-build    0/1       Completed   0          49m
nodejs-ex-8-build    0/1       Completed   0          46m
nodejs-ex-9-build    0/1       Completed   0          43m
So, is this a benign error that we just see in the logs? Just so I understand this fully: does it apply to a specific case/scenario? For example, does it happen only with images built with newer docker? If it applies to every image, should we expect to see this error once for every image push/pull attempt? And is there a performance impact, since the registry has to handle additional calls/requests?
This error indicates that there was an attempt to push a schema v2 manifest. It happens only with newer docker clients, and it is part of the algorithm for selecting a supported manifest schema. Unfortunately we cannot remove these messages from the log.
Marking UpcomingRelease while we figure out when we can enable schema v2 for the registry [1]. The impact of enabling it on older docker clients is highlighted in [2].

[1] https://docs.openshift.org/latest/install_config/registry/extended_registry_configuration.html#docker-registry-configuration-reference-middleware
[2] https://docs.openshift.org/latest/install_config/registry/extended_registry_configuration.html#middleware-repository-acceptschema2
Abhishek, we think you can enable schema v2 now (ping agladkov for more info).
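If we do enable it, my understanding (a sketch based on the prod config in comment 5 and the docs linked in comment 15; please confirm with agladkov before rolling it out) is that the only config change needed is flipping acceptschema2 in the repository middleware options:

middleware:
  repository:
    - name: openshift
      options:
        acceptschema2: true         # was false; accept schema v2 manifests from newer docker clients
        pullthrough: true
        enforcequota: false
        projectcachettl: 1m
        blobrepositorycachettl: 10m

The registry pods would then need to be redeployed to pick up the new config, and per the docs in comment 15 older docker clients may be affected once v2 manifests start being served.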
Created one hundred repeated builds and can't reproduce this issue in [1].

[1]
Server https://api.ded-stage-aws.openshift.com:443
openshift v3.5.5.10
kubernetes v1.5.2+43a9be4
This is ancient, can it be resolved?
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days