Bug 1754042

Summary: [gcp] `oc adm release new` fails if upload to GCS storage fails
Product: OpenShift Container Platform
Reporter: Adam Kaplan <adam.kaplan>
Component: Image Registry
Assignee: Clayton Coleman <ccoleman>
Status: CLOSED ERRATA
QA Contact: Wenjing Zheng <wzheng>
Severity: low
Priority: unspecified
Version: 4.3.0
CC: aos-bugs, ccoleman, hwmwm, obulatov, wking
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2020-01-23 11:06:39 UTC
Type: Bug
Attachments:
Request twice log

Description Adam Kaplan 2019-09-20 15:59:45 UTC
Description of problem:
Running `oc adm release new ...` fails with an error:
```
uploading: registry.svc.ci.openshift.org/ci-op-hs1rgqjk/release-scratch sha256:c267ab2e6cb9696348a0cd8deb7892c45d83ef580f1848ed1ccab6b832a967af 24.83MiB
error: unable to push quay.io/openshift-release-dev/ocp-v4.0-art-dev: failed to upload blob sha256:794e9e8f732745cd3301be325c2579b68749ab41c96cbfe9adfee0112ee83157: received unexpected HTTP status: 500 Internal Server Error
info: Mirroring completed in 2m23.49s (34.79MB/s)
```

This is ultimately caused by an error uploading the blob to GCS storage:

```
docker-registry-13-bdklf.log:time="2019-09-19T04:12:02.774422669Z" level=error msg="unknown error completing upload: EOF" go.version=go1.10.3 http.request.host=registry.svc.ci.openshift.org http.request.id=08bcccf0-4c71-4e09-b244-e525187eff59 http.request.method=PUT http.request.remoteaddr=172.16.128.1 http.request.uri="/v2/ci-op-glb491np/release-scratch/blobs/uploads/922b5d5c-f3a6-428c-818e-1b6fd4d7e252?_state=ZqsaOQcFTnKCwyc1p4OEBVXLfMRaEZn0W4x4CgyEv-R7Ik5hbWUiOiJjaS1vcC1nbGI0OTFucC9yZWxlYXNlLXNjcmF0Y2giLCJVVUlEIjoiOTIyYjVkNWMtZjNhNi00MjhjLTgxOGUtMWI2ZmQ0ZDdlMjUyIiwiT2Zmc2V0IjozMjY5MzY5NywiU3RhcnRlZEF0IjoiMjAxOS0wOS0xOVQwNDoxMTo0OVoifQ%3D%3D&digest=sha256%3Aeb56c3e87377c893574ff6b82ce5a5bfac10fb3b7c472dcbb2bcca74ee0cc5ae" http.request.useragent=Go-http-client/1.1 instance.id=14108979-2fb5-4443-b967-8c2240a4bb5c openshift.auth.user="system:serviceaccount:ci-op-glb491np:default" openshift.auth.userid=1782dc6a-da93-11e9-bd1c-42010a8e0003 vars.name=ci-op-glb491np/release-scratch vars.uuid=922b5d5c-f3a6-428c-818e-1b6fd4d7e252
```

Version-Release number of selected component (if applicable):
3.11.z (registry)
4.2.0 (oc client)


How reproducible: Sometimes


Steps to Reproduce:
1. Run the release mirror command, as done in CI [1]

Actual results:

The mirror sometimes flakes with a blob upload error.

Expected results:

Succeeds.


Additional info:

[1] https://github.com/openshift/release/commit/29bb995a05bf00c09add552c6ebd7116f7424d68#diff-e9ff14653190666556f490397d19c553

Comment 1 Adam Kaplan 2019-09-20 16:03:36 UTC
CC Trevor King - it may be more realistic for `oc adm release new` to be updated to retry in situations like this. There's not much the registry can do if the backend storage drops out.

Comment 2 Clayton Coleman 2019-09-20 17:30:10 UTC
We need to update all the image commands to retry a few times. There's some retry logic in mirror and some here, but in general this is a broad, sweeping problem when dealing with problematic registries.

Comment 3 Clayton Coleman 2019-09-20 17:30:33 UTC
One problem is that if you get a 500 uploading a 1 GB file, do you *really* want to retry that a second time?

Comment 4 W. Trevor King 2019-09-20 17:35:14 UTC
Absolutely, just resume your push [1].

[1]: https://github.com/opencontainers/distribution-spec/blob/v1.0.0-rc0/spec.md#initiate-resumable-blob-upload
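
For reference, a minimal Go sketch of the resumable flow that spec describes: open an upload session with a POST, PATCH chunks with Content-Range, and after a dropped chunk GET the session to learn the committed offset before resuming. The function and its parameters are illustrative, not an existing oc or containers/image API; error handling is abbreviated and the Location header is assumed to be an absolute URL.

```
// Sketch of the resumable blob upload flow from the distribution spec.
// `registry` is an absolute base URL, `repo` the repository, `data` the
// blob, and `digest` its sha256 digest. Illustrative names throughout.
package blobpush

import (
	"bytes"
	"fmt"
	"net/http"
)

const chunkSize = 10 << 20 // 10 MiB per PATCH

func pushResumable(registry, repo string, data []byte, digest string) error {
	client := &http.Client{}

	// Open an upload session: the registry answers 202 Accepted with a
	// Location header (and Blob-Upload-UUID) identifying the session.
	resp, err := client.Post(registry+"/v2/"+repo+"/blobs/uploads/", "", nil)
	if err != nil {
		return err
	}
	resp.Body.Close()
	location := resp.Header.Get("Location")

	for offset := 0; offset < len(data); {
		end := offset + chunkSize
		if end > len(data) {
			end = len(data)
		}
		req, _ := http.NewRequest(http.MethodPatch, location, bytes.NewReader(data[offset:end]))
		req.Header.Set("Content-Type", "application/octet-stream")
		req.Header.Set("Content-Range", fmt.Sprintf("%d-%d", offset, end-1))
		resp, err := client.Do(req)
		if err == nil {
			resp.Body.Close()
		}
		if err != nil || resp.StatusCode != http.StatusAccepted {
			// The chunk was dropped. Ask the registry how far it got
			// (204 No Content with "Range: 0-<last committed byte>")
			// and resume from there instead of re-sending everything.
			status, serr := client.Get(location)
			if serr != nil {
				return serr
			}
			status.Body.Close()
			var committed int
			fmt.Sscanf(status.Header.Get("Range"), "0-%d", &committed)
			offset = committed + 1 // the Range header is inclusive
			continue
		}
		location = resp.Header.Get("Location") // the session URL may rotate
		offset = end
	}

	// Close the session with a zero-length PUT carrying the digest,
	// assuming the session URL already has a query string (as in the
	// log in the description above).
	req, _ := http.NewRequest(http.MethodPut, location+"&digest="+digest, nil)
	if resp, err = client.Do(req); err != nil {
		return err
	}
	resp.Body.Close()
	return nil
}
```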

Comment 5 Oleg Bulatov 2019-09-22 20:37:29 UTC
Clients usually send a layer as a single chunk, so uploads can't really be resumed (technically they can be resumed, but only from the first byte).

Comment 6 W. Trevor King 2019-09-23 16:33:21 UTC
> Clients usually send a layer as a single chunk...

Which clients?  Looks like containers/image doesn't support this yet [1], but I don't see why it couldn't learn about it.  I don't have the Quay or OpenShift registry code handy to grep through for server-side support, but Blob-Upload-UUID is clearly part of the distribution spec.

[1]: https://github.com/containers/image/search?utf8=%E2%9C%93&q=Blob-Upload-UUID&type=

Comment 7 Oleg Bulatov 2019-09-24 10:10:09 UTC
All clients that I know of.

https://github.com/docker/distribution/blob/master/registry/client/blob_writer.go (this is used by Docker; ReadFrom sends the entire blob, and it doesn't support the Content-Range header)
https://github.com/containers/image/blob/a911b201c9edde74bfaabc85a61e85cb60466dff/docker/docker_image_dest.go#L141 (chunked upload is not supported either; also, the io.Reader is not seekable, so if the last few bytes are lost on the server side, they can't be re-read and re-sent)

On the server side Distribution has problems as well due to the eventually consistent nature of S3.
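
To make that limitation concrete: resuming means repositioning the blob source at the registry's committed offset, which a bare io.Reader (as PutBlob receives in containers/image) cannot do. A hypothetical helper, not an actual containers/image API:

```
// Hypothetical sketch of why a plain io.Reader can't back a resume:
// bytes already consumed from the stream cannot be produced again, so
// only a seekable source can rewind to the server's committed offset.
package blobpush

import (
	"errors"
	"io"
)

func resumeFrom(src io.Reader, committed int64) (io.Reader, error) {
	seeker, ok := src.(io.ReadSeeker)
	if !ok {
		// The containers/image case: PutBlob gets a plain io.Reader,
		// so the only "resume" available is a restart from byte zero.
		return nil, errors.New("blob source is not seekable; restart from offset 0")
	}
	if _, err := seeker.Seek(committed, io.SeekStart); err != nil {
		return nil, err
	}
	return seeker, nil
}
```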

Comment 8 Clayton Coleman 2019-09-24 23:09:05 UTC
Yeah, without registry/client support I'm hesitant to go down that path. Note that we *should*, it's just not really a good use of time. I think we could easily do a time-based retry (track how long the previous attempt took, then retry if we're still below some bounded number of attempts). The point of retries is to reduce the overall probability of failure.
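
A sketch of that idea, not the patch that eventually landed in openshift/oc#112: cap the attempt count, and use the previous attempt's duration to decide whether another try still fits a time budget. The names and the budget scheme are illustrative.

```
// Sketch of a time-bounded retry: track how long the previous attempt
// took and only retry if a similar attempt would still fit inside the
// overall budget. Illustrative only; not the code that landed in oc.
package blobpush

import (
	"fmt"
	"time"
)

func uploadWithRetry(upload func() error, maxAttempts int, budget time.Duration) error {
	deadline := time.Now().Add(budget)
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		start := time.Now()
		if err = upload(); err == nil {
			return nil
		}
		// If another attempt of roughly the same duration would blow
		// the budget, fail now rather than retrying a huge upload.
		if took := time.Since(start); time.Now().Add(took).After(deadline) {
			break
		}
	}
	return fmt.Errorf("upload did not succeed within the retry budget: %w", err)
}
```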

Comment 11 Wenjing Zheng 2019-10-25 09:55:42 UTC
Hi Oleg, I can get a 500 internal error, but I'm not sure whether oc retries: https://privatebin-it-iso.int.open.paas.redhat.com/?dacaec7d6f1952e1#7a9UQOx7a+RxQzTZtFmKirEfduiWwWTBxS1NRVN4RKs=

Is verifying with just a 500 error enough? It seems hard to trigger other error codes.
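
One way to exercise this without waiting for a real registry flake is a local test server that injects a single 500: a retrying client should leave exactly two requests in the log. A generic sketch, not the actual oc test; `pushBlob` is a stand-in for the client under test.

```
// Generic sketch of verifying retry-on-500 with a local test server;
// pushBlob stands in for the client under test and is not oc code.
package blobpush

import (
	"errors"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"testing"
)

// pushBlob is a stand-in client that retries once after a 5xx response.
func pushBlob(url string) error {
	for attempt := 0; attempt < 2; attempt++ {
		resp, err := http.Post(url+"/v2/test/blobs/uploads/", "application/octet-stream", nil)
		if err != nil {
			return err
		}
		resp.Body.Close()
		if resp.StatusCode < 500 {
			return nil
		}
	}
	return errors.New("upload kept failing with a server error")
}

func TestRetriesOn500(t *testing.T) {
	var hits int64
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Fail the first request with the error from this bug report,
		// then let every later request succeed.
		if atomic.AddInt64(&hits, 1) == 1 {
			http.Error(w, "injected failure", http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusAccepted)
	}))
	defer srv.Close()

	if err := pushBlob(srv.URL); err != nil {
		t.Fatalf("push failed even with a retry: %v", err)
	}
	if hits != 2 {
		t.Fatalf("expected exactly one retry (2 requests), saw %d", hits)
	}
}
```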

Comment 18 Wenjing Zheng 2019-11-07 09:51:15 UTC
Created attachment 1633599 [details]
Request twice log

Comment 20 errata-xmlrpc 2020-01-23 11:06:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

Comment 21 W. Trevor King 2020-02-06 20:20:06 UTC
oc PR landed in 99a4f81 [1].  Checking to see when this went out:

$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.3.0-rc.0 | grep cli-artifacts
  cli-artifacts                                 https://github.com/openshift/oc                                            6a937dfe56ff26255d09702c69b8406040c14505
$ git log --oneline origin/release-4.3 | grep -n '6a937df\|99a4f81'
5:6a937dfe5 Merge pull request #208 from openshift-cherrypick-robot/cherry-pick-207-to-release-4.3
124:99a4f8118 Merge pull request #112 from smarterclayton/retry_error

So yeah, this was well before 4.3.0-rc.0.  And as yet no backport to 4.2.

[1]: https://github.com/openshift/oc/pull/112#event-2686554337