Bug 1755115 - Error on cluster image import never retries
Summary: Error on cluster image import never retries
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: ImageStreams
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.3.0
Assignee: Gabe Montero
QA Contact: XiuJuan Wang
URL:
Whiteboard: devex
Depends On:
Blocks:
 
Reported: 2019-09-24 19:33 UTC by Christoph Blecker
Modified: 2020-05-13 21:25 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-13 21:25:06 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-samples-operator pull 186 0 'None' closed Bug 1755115: turn on scheduled import for payload imagestreams not managed by samp… 2020-11-26 17:15:16 UTC
Github openshift openshift-apiserver pull 28 0 'None' closed Bug 1755115: allow scheduled imagestream imports for dockerimages with sha/digest … 2020-11-26 17:15:40 UTC
Github openshift openshift-apiserver pull 29 0 'None' closed Bug 1755115: allow scheduled imagestream imports for dockerimages with sha/digest … 2020-11-26 17:15:17 UTC
Github openshift openshift-controller-manager pull 29 0 'None' closed Bug 1755115: do not schedule imports for streams with sha/digest references whose … 2020-11-26 17:15:17 UTC
Github openshift openshift-controller-manager pull 32 0 'None' closed Bug 1755115: handle multiple tags when scheduling image imports 2020-11-26 17:15:41 UTC
Red Hat Product Errata RHBA-2020:0062 0 None None None 2020-05-13 21:25:06 UTC

Description Christoph Blecker 2019-09-24 19:33:10 UTC
Description of problem:
One of the system ImageStreams (openshift/cli) failed to import its image at cluster creation time. The cluster never retried the import, so the expected ImageStreamTag is not available.

Version-Release number of selected component (if applicable):
4.1.17


How reproducible:
Unknown


Steps to Reproduce:
1. Install OCP cluster
2. Receive a transient HTTP error response (e.g. HTTP 429 Too Many Requests) from Quay during image retrieval

Actual results:
```
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  creationTimestamp: "2019-09-23T14:18:26Z"
  generation: 2
  name: cli
  namespace: openshift
  resourceVersion: "15043"
  selfLink: /apis/image.openshift.io/v1/namespaces/openshift/imagestreams/cli
  uid: 0275f846-de0d-11e9-99be-0a580a810018
spec:
  lookupPolicy:
    local: false
  tags:
  - annotations: null
    from:
      kind: DockerImage
      name: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7094f088e38f18acfc3df53e0dd377db63d907dc7ecfc8859284d30767b30295
    generation: 2
    importPolicy: {}
    name: latest
    referencePolicy:
      type: Source
status:
  dockerImageRepository: image-registry.openshift-image-registry.svc:5000/openshift/cli
  publicDockerImageRepository: default-route-openshift-image-registry.apps.example.com/openshift/cli
  tags:
  - conditions:
    - generation: 2
      lastTransitionTime: "2019-09-23T14:18:27Z"
      message: 'Internal error occurred: Get https://quay.io/v2/openshift-release-dev/ocp-v4.0-art-dev/manifests/sha256:7094f088e38f18acfc3df53e0dd377db63d907dc7ecfc8859284d30767b30295:
        error parsing HTTP 429 response body: invalid character ''<'' looking for
        beginning of value: "<html>\r\n<head><title>429 Too Many Requests</title></head>\r\n<body
        bgcolor=\"white\">\r\n<center><h1>429 Too Many Requests</h1></center>\r\n<hr><center>nginx/1.12.1</center>\r\n</body>\r\n</html>\r\n"'
      reason: InternalError
      status: "False"
      type: ImportSuccess
    items: null
    tag: latest
```


Expected results:
```
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  creationTimestamp: "2019-09-23T19:45:51Z"
  generation: 2
  name: cli
  namespace: openshift
  resourceVersion: "15222"
  selfLink: /apis/image.openshift.io/v1/namespaces/openshift/imagestreams/cli
  uid: bf808254-de3a-11e9-a1db-0a580a810015
spec:
  lookupPolicy:
    local: false
  tags:
  - annotations: null
    from:
      kind: DockerImage
      name: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7094f088e38f18acfc3df53e0dd377db63d907dc7ecfc8859284d30767b30295
    generation: 2
    importPolicy: {}
    name: latest
    referencePolicy:
      type: Source
status:
  dockerImageRepository: image-registry.openshift-image-registry.svc:5000/openshift/cli
  publicDockerImageRepository: default-route-openshift-image-registry.apps.example.com/openshift/cli
  tags:
  - items:
    - created: "2019-09-23T19:45:51Z"
      dockerImageReference: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7094f088e38f18acfc3df53e0dd377db63d907dc7ecfc8859284d30767b30295
      generation: 2
      image: sha256:7094f088e38f18acfc3df53e0dd377db63d907dc7ecfc8859284d30767b30295
    tag: latest
```

Additional info:

Comment 1 Wenjing Zheng 2019-09-25 06:17:06 UTC
QE cannot reproduce on 4.1.7/4.1.8 and 4.2:
$ oc describe is cli -n openshift
Name:			cli
Namespace:		openshift
Created:		5 hours ago
Labels:			<none>
Annotations:		<none>
Image Repository:	image-registry.openshift-image-registry.svc:5000/openshift/cli
Image Lookup:		local=false
Unique Images:		1
Tags:			1

latest
  tagged from quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7094f088e38f18acfc3df53e0dd377db63d907dc7ecfc8859284d30767b30295

  * quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7094f088e38f18acfc3df53e0dd377db63d907dc7ecfc8859284d30767b30295
      5 hours ago
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.17    True        False         170m    Cluster version is 4.1.17

Comment 2 Christoph Blecker 2019-09-25 06:55:01 UTC
Sorry if I wasn't clear. The issue is that this cluster experienced an error from Quay, but then didn't retry. The error from Quay is itself transient, but the import should be retried by the client.

Comment 3 Adam Kaplan 2019-09-25 13:52:48 UTC
This imagestream and other "release payload" imagestreams are defined by the samples operator, but technically are created by the CVO. By default these only try the import once.

We're in a bit of a weird state here since the samples operator doesn't truly know about these and doesn't retry importing them (unlike the other sample imagestreams).

Moving to Samples for now.

Comment 4 Gabe Montero 2019-09-25 13:57:23 UTC
These payload imagestreams are installed via the samples operator's manifest into the openshift namespace, but are otherwise not managed by the samples operator.

Exactly how this should be treated is being actively discussed.

Comment 5 Oleg Bulatov 2019-09-25 14:49:00 UTC
Don't we have the scheduled importer enabled for these streams?

Comment 6 Gabe Montero 2019-09-25 15:25:28 UTC
We do not, Oleg.  See https://github.com/openshift/cluster-samples-operator/blob/master/manifests/08-openshift-imagestreams.yaml

Looking into that is step 1 in a sequence discussed on forum-devex.

If it does what we want with these payload images, there are concerns about repeated queries after a successful import.  To quote Clayton: "and then the bug is 'once it's imported on a digest hash, you don't have to try to import it again'"

So we have to either verify that scheduled import already handles that nuance, or update it to handle it.  Of course, if you already know one way or the other, please advise.
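For reference, a minimal sketch of what turning on scheduled import for a payload imagestream tag would look like (field names follow the image.openshift.io/v1 ImageStream API; the cli digest from this report is reused purely for illustration):

```
# Sketch only: a payload imagestream tag with scheduled import turned on.
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: cli
  namespace: openshift
spec:
  tags:
  - name: latest
    from:
      kind: DockerImage
      name: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7094f088e38f18acfc3df53e0dd377db63d907dc7ecfc8859284d30767b30295
    importPolicy:
      scheduled: true    # ask the image import controller to re-import this tag periodically
    referencePolicy:
      type: Source
```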

Longer term items were also discussed.  But that is outside the potential scope of this bugzilla.

Comment 7 Ben Parees 2019-09-25 17:40:34 UTC
Gabe's PR to potentially enable scheduled import:

https://github.com/openshift/cluster-samples-operator/pull/186


> so we have to either verify scheduled import already handles that nuance

99% certain it does not ... scheduled import just bumps the generation on the imagestream (tag?) to trigger an import, I believe.


In the meantime, a knowledge base article might be appropriate to instruct CEE/Customers about how to run "oc import-image" to fix this if they hit it.
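As a sketch of what that manual workaround might look like (assuming the affected stream is openshift/cli, as in this report):

```
# Re-trigger the import for the existing tag spec of the affected imagestream:
oc import-image cli:latest -n openshift

# Confirm the tag now resolves to an image:
oc get istag cli:latest -n openshift
```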

Comment 8 Gabe Montero 2019-09-25 18:46:24 UTC
I'll be cooking up an openshift-apiserver PR and an openshift-controller-manager PR to achieve the meets-min approach discussed between Clayton, Ben, and me on forum-devex.

Comment 12 Gabe Montero 2019-10-14 15:19:43 UTC
For QE, "reproducing" a failure in the initial import of the payload resources like cli or must-gather will be difficult.

I would simply suggest (see the sketch below):
a) running oc get is -o yaml on those imagestreams to ensure that scheduled import is configured ... i.e. https://github.com/openshift/cluster-samples-operator/blob/master/manifests/08-openshift-imagestreams.yaml#L83
b) assuming the import of those was successful, examining the output from oc get is -o yaml and confirming that there have not been subsequent updates after, say, 15 to 30 minutes ... i.e. the generation of the latest tag for cli, for example, is still 2
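A possible way to script those two checks (sketch only; the jsonpath expressions assume a single latest tag, as in the examples above):

```
# a) confirm scheduled import is configured on the payload imagestream tag
oc get is cli -n openshift -o jsonpath='{.spec.tags[0].importPolicy.scheduled}{"\n"}'

# b) after a successful import, confirm the tag generation has not been bumped
#    (check again after 15 to 30 minutes; it should still print 2)
oc get is cli -n openshift -o jsonpath='{.status.tags[0].items[0].generation}{"\n"}'
```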

Comment 13 XiuJuan Wang 2019-10-15 11:28:58 UTC
All imagestreams that are included in the payload but not managed by the samples operator have scheduled import policy enabled in 4.3.0-0.nightly-2019-10-15-020113.

Once an image referenced by digest hash is imported successfully, it will not be imported again periodically. E.g. the generation of the latest tag for the must-gather imagestream is still 2 six hours after the successful import.
$ oc get is must-gather -n openshift -o yaml
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  creationTimestamp: "2019-10-15T03:15:54Z"
  generation: 2
  name: must-gather
  namespace: openshift
  resourceVersion: "17645"
  selfLink: /apis/image.openshift.io/v1/namespaces/openshift/imagestreams/must-gather
  uid: 3ebc6332-e0ab-4bba-9a10-21fe78dbe4f9
spec:
  lookupPolicy:
    local: false
  tags:
  - annotations: null
    from:
      kind: DockerImage
      name: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:992297412229777f18359233485b99c6632742100aa79fd6ba6444278e282df2
    generation: 2
    importPolicy:
      scheduled: true
    name: latest
    referencePolicy:
      type: Source
status:
  dockerImageRepository: image-registry.openshift-image-registry.svc:5000/openshift/must-gather
  tags:
  - items:
    - created: "2019-10-15T03:15:55Z"
      dockerImageReference: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:992297412229777f18359233485b99c6632742100aa79fd6ba6444278e282df2
      generation: 2
      image: sha256:992297412229777f18359233485b99c6632742100aa79fd6ba6444278e282df2
    tag: latest


But with multiple tags, a tag whose import failed only retried once in 2 hours. I tagged cli with a new tag pointing at a nonexistent digest hash at 8:01 and checked it at 9:44:
http://pastebin.test.redhat.com/805883

There is also no retry for an imagestream I created myself. I tagged the image at 9:04 and checked again at 9:50:
http://pastebin.test.redhat.com/805884

Comment 14 Gabe Montero 2019-10-15 14:25:55 UTC
http://pastebin.test.redhat.com/805883 looks correct to me ... it has retried on your non-existent image several times,
as indicated by the 5/4 generation values in spec/status,

whereas the valid tag has generation 2 in spec/status, as expected.

Changing an existing tag that was valid to an invalid tag is not a supported scenario here.  That tag was initially valid,
so we had already punted on further scheduling before you made it invalid.

You can mark this VERIFIED IMO @XiuJuan.

thanks

Comment 15 XiuJuan Wang 2019-10-16 07:55:28 UTC
I am good with verifying that the valid tag has generation 2 in spec/status, as expected.

But I am a little confused: when we create a new imagestream, the generation in spec and status is 2, and if I add a new tag, the generation increases by 2.
IMO, in http://pastebin.test.redhat.com/805883, the tag v1 only retried once.

This example is of imagestreamtags I newly created, not a valid tag changed to an invalid one. They didn't retry importing.
$ oc get is myruby -n openshift -o yaml 
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  annotations:
    openshift.io/image.dockerRepositoryCheck: "2019-10-16T07:41:48Z"
  creationTimestamp: "2019-10-16T06:03:54Z"
  generation: 4
  name: myruby
  namespace: openshift
  resourceVersion: "98749"
  selfLink: /apis/image.openshift.io/v1/namespaces/openshift/imagestreams/myruby
  uid: d57e1a51-5857-493b-beb1-ae8201a9aaa3
spec:
  lookupPolicy:
    local: false
  tags:
  - annotations: null
    from:
      kind: DockerImage
      name: registry.redhat.io/rhscl/ruby-22-rhel7@sha256:c09034c309655169f990c49c9bbe68669fef40089c15fa51d4ca5d9603581007
    generation: 2
    importPolicy:
      scheduled: true
    name: v1
    referencePolicy:
      type: Source
  - annotations: null
    from:
      kind: DockerImage
      name: registry.redhat.io/rhscl/ruby-22-rhel7@sha256:c09034c309655169f990c49c9bbe68669fef40089c15fa51d4ca5d96035dbcdd
    generation: 4
    importPolicy:
      scheduled: true
    name: v2
    referencePolicy:
      type: Source
status:
  dockerImageRepository: image-registry.openshift-image-registry.svc:5000/openshift/myruby
  tags:
  - conditions:
    - generation: 2
      lastTransitionTime: "2019-10-16T06:03:56Z"
      message: 'Internal error occurred: content integrity error: the schema 2 manifest
        retrieved with digest sha256:c09034c309655169f990c49c9bbe68669fef40089c15fa51d4ca5d9603581007
        does not match the digest calculated from the content sha256:2e83b9e07e85960060096b6aff7ee202a5f52e0e18447641b080b1f3879e0901'
      reason: InternalError
      status: "False"
      type: ImportSuccess
    items: null
    tag: v1
  - conditions:
    - generation: 4
      lastTransitionTime: "2019-10-16T07:41:48Z"
      message: 'Internal error occurred: error parsing HTTP 404 response body: invalid
        character ''F'' looking for beginning of value: "File not found.\""'
      reason: InternalError
      status: "False"
      type: ImportSuccess
    items: null
    tag: v2

Comment 16 Gabe Montero 2019-10-16 13:59:22 UTC
I think the needinfo was intended for me

Comment 17 Gabe Montero 2019-10-16 14:09:14 UTC
I'll run through @XiuJuan's scenarios with tracing on today, and walk through the precise flow, in order to explain in detail how things work, and 
hopefully address her confusion.

Comment 18 Gabe Montero 2019-10-16 20:32:48 UTC
1) First, addressing the question on spec/status generation == 2 on create:

On imagestream create, the spec generation comes from the imagestream's k8s objectmeta generation field, so any update anywhere on the stream (selfLink, resourceVersion, whatever)
results in that getting bumped.

see the PrepareForCreate method at https://github.com/openshift/openshift-apiserver/blob/master/pkg/image/apiserver/registry/imagestream/strategy.go#L92

The fact that you see it start as 2 is a function of that.

It does not stem from the fact that we tried 2 imports.

You'll see that in all the sample imagestreams in the openshift namespace if you take a look.
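A quick way to see that relationship (sketch only) is to print the objectmeta generation next to the spec tag generation:

```
# Right after creation both values come from the same objectmeta generation
# (2 in the examples above); neither value implies two import attempts.
oc get is cli -n openshift \
  -o jsonpath='metadata: {.metadata.generation}  spec tag: {.spec.tags[0].generation}{"\n"}'
```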

2) Next, XiuJuan's edits of the cli imagestream in the openshift namespace ... those produced unpredictable results for me because that imagestream
is managed by the CVO: it will sync the spec of the imagestream with what it has in its manifest and remove the "v1" tag you added from the spec (it stays in the status, of course,
since only the imagestream controller touches that).

So yes, I would see the initial import after it is created.  But then, in my testing, it becomes a race over whether the scheduler fires before the CVO nukes the tag.

Also, the CVO updating the stream is what bumps the generation on the stream and subsequently the tag in this case.

So this is an invalid test case.

3) Finally, creating my own ruby imagestream .... first, I changed the 2.5 tag to reference a bogus registry.redhat.io/rhscl/ruby-25-rhel7@sha256:c09034c309655169f990c49c9bbe68669fef40089c15fa51d4ca5d96035dbcdd with scheduling on.

I see multiple import attempts within the expected time interval with tracing on, but the temporary bump of the spec generation is not persisted to etcd.  So you won't see it with oc get is -o yaml.

This is in fact the typical behavior for a scheduled imagestream, and I was initially fooled by the CVO stuff when I provided my prior instructions.  Apologies for the confusion.

The only way to confirm the scheduling is by examining the openshift-controller-manager pod logs (in particular the leader's) with at least a logLevel of 5 set on the cluster-wide openshiftcontrollermanager object.
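A rough sketch of how to pull that from the logs once the log level is raised (only the leader pod shows the import activity, so just grep all of them):

```
# Grep each openshift-controller-manager pod for scheduled import activity;
# the "Queued import of stream" lines come from the leader.
for p in $(oc get pods -n openshift-controller-manager -o name); do
  echo "== $p =="
  oc logs -n openshift-controller-manager "$p" | grep 'Queued import of stream'
done
```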

Comment 19 XiuJuan Wang 2019-10-17 07:30:26 UTC
@Gabe Thanks a million!! Your detailed explanation enlightens me.

I could see the scheduled retry info for imagestreams with failed imports in the openshift-controller-manager leader pod logs.

I1017 07:14:15.395443       1 image_trigger_controller.go:335] Started syncing image stream "xiu/myruby"
I1017 07:14:15.395453       1 image_trigger_controller.go:337] Finished syncing image stream "xiu/myruby" (10.514µs)
I1017 07:14:15.395481       1 image_trigger_controller.go:335] Started syncing image stream "xiu/ruby1"
I1017 07:14:15.395506       1 image_trigger_controller.go:335] Started syncing image stream "xiu/ruby2"
I1017 07:14:15.395511       1 image_trigger_controller.go:337] Finished syncing image stream "xiu/ruby1" (40.752µs)
I1017 07:14:15.395520       1 image_trigger_controller.go:337] Finished syncing image stream "xiu/ruby2" (23.497µs)
I1017 07:14:15.428455       1 imagestream_controller.go:160] Queued import of stream openshift/ruby...
I1017 07:14:15.428467       1 imagestream_controller.go:144] Finished syncing image stream "openshift/ruby" (13.714µs)
I1017 07:14:15.428482       1 imagestream_controller.go:160] Queued import of stream xiu/ruby1...
I1017 07:14:15.428485       1 imagestream_controller.go:160] Queued import of stream xiu/ruby2...
I1017 07:14:15.428492       1 imagestream_controller.go:144] Finished syncing image stream "xiu/ruby1" (11.661µs)
I1017 07:14:15.428494       1 imagestream_controller.go:144] Finished syncing image stream "xiu/ruby2" (11.393µs)
I1017 07:14:15.428467       1 imagestream_controller.go:160] Queued import of stream xiu/myruby...
I1017 07:14:15.428516       1 imagestream_controller.go:144] Finished syncing image stream "xiu/myruby" (50.165µs)

Marking this bug as verified.

Comment 20 Gabe Montero 2019-10-17 12:22:00 UTC
sounds great XiuJuan thanks

Comment 22 errata-xmlrpc 2020-05-13 21:25:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

