Description of problem:
One of the system ImageStreams (openshift/cli) failed to import its image at cluster creation time. However, the cluster never retried, so the expected ImageStreamTag is not available.

Version-Release number of selected component (if applicable): 4.1.17

How reproducible: Unknown

Steps to Reproduce:
1. Install an OCP cluster
2. Receive an HTTP error response from Quay during image retrieval

Actual results:
```
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  creationTimestamp: "2019-09-23T14:18:26Z"
  generation: 2
  name: cli
  namespace: openshift
  resourceVersion: "15043"
  selfLink: /apis/image.openshift.io/v1/namespaces/openshift/imagestreams/cli
  uid: 0275f846-de0d-11e9-99be-0a580a810018
spec:
  lookupPolicy:
    local: false
  tags:
  - annotations: null
    from:
      kind: DockerImage
      name: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7094f088e38f18acfc3df53e0dd377db63d907dc7ecfc8859284d30767b30295
    generation: 2
    importPolicy: {}
    name: latest
    referencePolicy:
      type: Source
status:
  dockerImageRepository: image-registry.openshift-image-registry.svc:5000/openshift/cli
  publicDockerImageRepository: default-route-openshift-image-registry.apps.example.com/openshift/cli
  tags:
  - conditions:
    - generation: 2
      lastTransitionTime: "2019-09-23T14:18:27Z"
      message: 'Internal error occurred: Get https://quay.io/v2/openshift-release-dev/ocp-v4.0-art-dev/manifests/sha256:7094f088e38f18acfc3df53e0dd377db63d907dc7ecfc8859284d30767b30295: error parsing HTTP 429 response body: invalid character ''<'' looking for beginning of value: "<html>\r\n<head><title>429 Too Many Requests</title></head>\r\n<body bgcolor=\"white\">\r\n<center><h1>429 Too Many Requests</h1></center>\r\n<hr><center>nginx/1.12.1</center>\r\n</body>\r\n</html>\r\n"'
      reason: InternalError
      status: "False"
      type: ImportSuccess
    items: null
    tag: latest
```

Expected results:
```
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  creationTimestamp: "2019-09-23T19:45:51Z"
  generation: 2
  name: cli
  namespace: openshift
  resourceVersion: "15222"
  selfLink: /apis/image.openshift.io/v1/namespaces/openshift/imagestreams/cli
  uid: bf808254-de3a-11e9-a1db-0a580a810015
spec:
  lookupPolicy:
    local: false
  tags:
  - annotations: null
    from:
      kind: DockerImage
      name: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7094f088e38f18acfc3df53e0dd377db63d907dc7ecfc8859284d30767b30295
    generation: 2
    importPolicy: {}
    name: latest
    referencePolicy:
      type: Source
status:
  dockerImageRepository: image-registry.openshift-image-registry.svc:5000/openshift/cli
  publicDockerImageRepository: default-route-openshift-image-registry.apps.example.com/openshift/cli
  tags:
  - items:
    - created: "2019-09-23T19:45:51Z"
      dockerImageReference: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7094f088e38f18acfc3df53e0dd377db63d907dc7ecfc8859284d30767b30295
      generation: 2
      image: sha256:7094f088e38f18acfc3df53e0dd377db63d907dc7ecfc8859284d30767b30295
    tag: latest
```

Additional info:
QE cannot reproduce on 4.1.7/4.1.8 and 4.2:

```
$ oc describe is cli -n openshift
Name:               cli
Namespace:          openshift
Created:            5 hours ago
Labels:             <none>
Annotations:        <none>
Image Repository:   image-registry.openshift-image-registry.svc:5000/openshift/cli
Image Lookup:       local=false
Unique Images:      1
Tags:               1

latest
  tagged from quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7094f088e38f18acfc3df53e0dd377db63d907dc7ecfc8859284d30767b30295

  * quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7094f088e38f18acfc3df53e0dd377db63d907dc7ecfc8859284d30767b30295
      5 hours ago

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.17    True        False         170m    Cluster version is 4.1.17
```
Sorry if I wasn't clear. The issue is that this cluster experienced an error from Quay, but then didn't retry. The error from Quay is transient, and the client should retry it.
This imagestream and other "release payload" imagestreams are defined by the samples operator, but are technically created by the CVO. By default these only try the import once. We're in a bit of a weird state here, since the samples operator doesn't truly know about these and doesn't retry importing them (unlike the other sample imagestreams). Moving to Samples for now.
These payload imagestreams are installed via the samples operator's manifest into the openshift namespace, but are otherwise not managed by samples operator. Exactly how this should be treated is being actively discussed.
Don't we have the scheduled importer enabled for these streams?
We do not, Oleg. See https://github.com/openshift/cluster-samples-operator/blob/master/manifests/08-openshift-imagestreams.yaml

Looking into that is step 1 in a sequence discussed on forum-devex. If it does what we want with these payload images, there are concerns about repeated queries after a successful import. To quote Clayton: "and then the bug is 'once it's imported on a digest hash, you don't have to try to import it again'". So we have to either verify that scheduled import already handles that nuance, or update it to handle it. Of course, if you already know one way or the other, please advise. Longer-term items were also discussed, but those are outside the potential scope of this bugzilla.
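For context, a payload imagestream entry in the linked manifest looks roughly like the sketch below (an illustrative fragment, not the authoritative manifest; the digest is the one from this report). The empty `importPolicy` is the key detail: with no `scheduled: true`, the import is attempted only once.

```yaml
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: cli
  namespace: openshift
spec:
  tags:
  - name: latest
    from:
      kind: DockerImage
      name: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7094f088e38f18acfc3df53e0dd377db63d907dc7ecfc8859284d30767b30295
    # no "scheduled: true" here, so a transient registry error is never retried
    importPolicy: {}
    referencePolicy:
      type: Source
```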
Gabe's PR to potentially enable scheduled import: https://github.com/openshift/cluster-samples-operator/pull/186

> so we have to either verify scheduled import already handles that nuance

99% certain it does not... scheduled import just bumps the generation on the imagestream (tag?) to trigger an import, I believe. In the meantime, a knowledge base article might be appropriate to instruct CEE/customers on how to run "oc import-image" to fix this if they hit it.
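As a sketch of that manual workaround (assuming the broken tag is openshift/cli:latest; substitute whichever payload stream actually failed), these commands require a live cluster and cluster-admin credentials:

```shell
# Re-trigger the failed import by hand for the affected tag
oc import-image cli:latest -n openshift --confirm

# Verify the tag now resolves to an image instead of an
# ImportSuccess=False condition
oc describe is cli -n openshift
```

If the Quay error was transient (e.g. the HTTP 429 above), the re-import should succeed immediately.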
I'll be cooking up an openshift-apiserver PR and an openshift-controller-manager PR to achieve the meets-min discussed between Clayton, Ben, and me on forum-devex.
For QE, "reproducing" a failure in the initial import of payload resources like cli or must-gather will be difficult. I would simply suggest:

a) running oc get is -o yaml on those imagestreams to ensure the scheduled import is configured, i.e. https://github.com/openshift/cluster-samples-operator/blob/master/manifests/08-openshift-imagestreams.yaml#L83

b) assuming the import of those was successful, examining the output from oc get is -o yaml and confirming there have been no subsequent updates after, say, 15 to 30 minutes, i.e. that the generation of the latest tag for cli, for example, is still 2
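A minimal sketch of those two checks, using jsonpath instead of eyeballing the full yaml (the cli stream in the openshift namespace is used as the example; the jsonpath expressions are illustrative):

```shell
# (a) confirm scheduled import is configured on the payload stream
oc get is cli -n openshift \
  -o jsonpath='{.spec.tags[?(@.name=="latest")].importPolicy.scheduled}{"\n"}'

# (b) after 15-30 minutes, confirm the latest tag's generation has not
# moved past its initial value of 2
oc get is cli -n openshift \
  -o jsonpath='{.spec.tags[?(@.name=="latest")].generation}{"\n"}'
```

Check (a) should print `true`; check (b) should still print `2` on a healthy cluster.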
All images included in the payload but not managed by the samples operator have the scheduled import policy enabled in 4.3.0-0.nightly-2019-10-15-020113. When an image with a digest hash is imported successfully, it is not imported a second time periodically. E.g. the generation of the latest tag for the must-gather imagestream is still 2 six hours after a successful import:

```
$ oc get is must-gather -n openshift -o yaml
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  creationTimestamp: "2019-10-15T03:15:54Z"
  generation: 2
  name: must-gather
  namespace: openshift
  resourceVersion: "17645"
  selfLink: /apis/image.openshift.io/v1/namespaces/openshift/imagestreams/must-gather
  uid: 3ebc6332-e0ab-4bba-9a10-21fe78dbe4f9
spec:
  lookupPolicy:
    local: false
  tags:
  - annotations: null
    from:
      kind: DockerImage
      name: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:992297412229777f18359233485b99c6632742100aa79fd6ba6444278e282df2
    generation: 2
    importPolicy:
      scheduled: true
    name: latest
    referencePolicy:
      type: Source
status:
  dockerImageRepository: image-registry.openshift-image-registry.svc:5000/openshift/must-gather
  tags:
  - items:
    - created: "2019-10-15T03:15:55Z"
      dockerImageReference: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:992297412229777f18359233485b99c6632742100aa79fd6ba6444278e282df2
      generation: 2
      image: sha256:992297412229777f18359233485b99c6632742100aa79fd6ba6444278e282df2
    tag: latest
```

But with multiple tags, a tag whose import failed only retried once in 2 hours. I tagged cli with a new tag pointing at a nonexistent digest hash at 8:01 and checked it at 9:44: http://pastebin.test.redhat.com/805883

There is also no retry for my own newly created imagestream. I tagged the image at 9:04 and checked again at 9:50: http://pastebin.test.redhat.com/805884
http://pastebin.test.redhat.com/805883 looks correct to me: it has retried your nonexistent image several times, as indicated by the 5/4 generation values in spec/status, whereas the valid tag has generation 2 in both spec and status, as expected. Changing an existing valid tag to an invalid one is not a supported scenario here; that tag was initially valid, so we punt on further scheduling before you made it invalid. You can mark this VERIFIED IMO @XiuJuan. Thanks.
I am good to verify that the valid tag has generation 2 in spec/status as expected. But I am a little confused: when we create a new imagestream, the generation in spec and status is 2, and if we add a new tag, the generation increases by 2. IMO, in http://pastebin.test.redhat.com/805883, the tag v1 only retried once.

This example is my newly created imagestreamtags, not a valid tag changed to an invalid one; they didn't retry importing:

```
$ oc get is myruby -n openshift -o yaml
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  annotations:
    openshift.io/image.dockerRepositoryCheck: "2019-10-16T07:41:48Z"
  creationTimestamp: "2019-10-16T06:03:54Z"
  generation: 4
  name: myruby
  namespace: openshift
  resourceVersion: "98749"
  selfLink: /apis/image.openshift.io/v1/namespaces/openshift/imagestreams/myruby
  uid: d57e1a51-5857-493b-beb1-ae8201a9aaa3
spec:
  lookupPolicy:
    local: false
  tags:
  - annotations: null
    from:
      kind: DockerImage
      name: registry.redhat.io/rhscl/ruby-22-rhel7@sha256:c09034c309655169f990c49c9bbe68669fef40089c15fa51d4ca5d9603581007
    generation: 2
    importPolicy:
      scheduled: true
    name: v1
    referencePolicy:
      type: Source
  - annotations: null
    from:
      kind: DockerImage
      name: registry.redhat.io/rhscl/ruby-22-rhel7@sha256:c09034c309655169f990c49c9bbe68669fef40089c15fa51d4ca5d96035dbcdd
    generation: 4
    importPolicy:
      scheduled: true
    name: v2
    referencePolicy:
      type: Source
status:
  dockerImageRepository: image-registry.openshift-image-registry.svc:5000/openshift/myruby
  tags:
  - conditions:
    - generation: 2
      lastTransitionTime: "2019-10-16T06:03:56Z"
      message: 'Internal error occurred: content integrity error: the schema 2 manifest retrieved with digest sha256:c09034c309655169f990c49c9bbe68669fef40089c15fa51d4ca5d9603581007 does not match the digest calculated from the content sha256:2e83b9e07e85960060096b6aff7ee202a5f52e0e18447641b080b1f3879e0901'
      reason: InternalError
      status: "False"
      type: ImportSuccess
    items: null
    tag: v1
  - conditions:
    - generation: 4
      lastTransitionTime: "2019-10-16T07:41:48Z"
      message: 'Internal error occurred: error parsing HTTP 404 response body: invalid character ''F'' looking for beginning of value: "File not found.\""'
      reason: InternalError
      status: "False"
      type: ImportSuccess
    items: null
    tag: v2
```
I think the needinfo was intended for me
I'll run through @XiuJuan's scenarios with tracing on today, and walk through the precise flow, in order to explain in detail how things work, and hopefully address her confusion.
1) First, addressing the question about spec/status generation == 2 on create: on imagestream create, the spec generation comes from the imagestream's k8s objectmeta generation field. So any update anywhere on the stream (selfLink, resourceVersion, whatever) results in that getting bumped. See the PrepareForCreate method at https://github.com/openshift/openshift-apiserver/blob/master/pkg/image/apiserver/registry/imagestream/strategy.go#L92. The fact that you see it start at 2 is a function of that; it does not stem from two import attempts. You'll see that in all the sample imagestreams in the openshift namespace if you take a look.

2) Next, XiuJuan's edits of the cli imagestream in the openshift namespace: those produced unpredictable results for me, because that imagestream is managed by the CVO, which syncs the spec of the imagestream with what it has in its manifest and removes the "v1" tag you added from the spec (it stays in the status, of course, since only the imagestream controller touches that). So yes, I would see the initial import after it is created, but then it becomes a race over whether the scheduler fires before the CVO nukes the tag, in my testing. Also, the CVO updating the stream is what bumps the generation on the stream, and subsequently on the tag, in this case. So this is an invalid test case.

3) Finally, creating my own ruby imagestream: first, I changed the 2.5 tag to reference a bogus registry.redhat.io/rhscl/ruby-25-rhel7@sha256:c09034c309655169f990c49c9bbe68669fef40089c15fa51d4ca5d96035dbcdd with scheduling on. I see multiple import attempts within the expected time interval with tracing on, but the temporary bump of the spec generation is not persisted to etcd, so you won't see it with oc get is -o yaml. This is in fact the typical behavior for a scheduled imagestream; I was initially fooled by the CVO behavior when I provided my prior instructions. Apologies for the confusion.
The only way to confirm the scheduling is by examining the openshift-controller-manager pod logs (in particular the leader's), with at least a logLevel of 5 set on the cluster-wide openshiftcontrollermanager object.
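A hedged sketch of enabling that logging and then grepping the controller-manager logs (the patch assumes the openshiftcontrollermanager operator object accepts a numeric spec.logLevel, as it did on these releases, and the pod label selector is an assumption; adjust both for your cluster):

```shell
# Raise the operator log level so scheduled-import activity is logged
# (assumes numeric logLevel; adjust if your release uses named levels)
oc patch openshiftcontrollermanager.operator.openshift.io cluster \
  --type=merge -p '{"spec":{"logLevel":5}}'

# Grep the controller-manager pods for imagestream scheduling lines;
# only the current leader actually emits them
oc logs -n openshift-controller-manager \
  -l app=openshift-controller-manager --tail=200 \
  | grep imagestream_controller
```

The "Queued import of stream ..." lines from imagestream_controller.go are the confirmation that the scheduler is re-attempting imports.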
@Gabe Thanks a million!! Your detailed explanation enlightens me. I can see the scheduling retry info for failed-import imagestreams in the openshift-controller-manager leader pod logs:

```
I1017 07:14:15.395443       1 image_trigger_controller.go:335] Started syncing image stream "xiu/myruby"
I1017 07:14:15.395453       1 image_trigger_controller.go:337] Finished syncing image stream "xiu/myruby" (10.514µs)
I1017 07:14:15.395481       1 image_trigger_controller.go:335] Started syncing image stream "xiu/ruby1"
I1017 07:14:15.395506       1 image_trigger_controller.go:335] Started syncing image stream "xiu/ruby2"
I1017 07:14:15.395511       1 image_trigger_controller.go:337] Finished syncing image stream "xiu/ruby1" (40.752µs)
I1017 07:14:15.395520       1 image_trigger_controller.go:337] Finished syncing image stream "xiu/ruby2" (23.497µs)
I1017 07:14:15.428455       1 imagestream_controller.go:160] Queued import of stream openshift/ruby...
I1017 07:14:15.428467       1 imagestream_controller.go:144] Finished syncing image stream "openshift/ruby" (13.714µs)
I1017 07:14:15.428482       1 imagestream_controller.go:160] Queued import of stream xiu/ruby1...
I1017 07:14:15.428485       1 imagestream_controller.go:160] Queued import of stream xiu/ruby2...
I1017 07:14:15.428492       1 imagestream_controller.go:144] Finished syncing image stream "xiu/ruby1" (11.661µs)
I1017 07:14:15.428494       1 imagestream_controller.go:144] Finished syncing image stream "xiu/ruby2" (11.393µs)
I1017 07:14:15.428467       1 imagestream_controller.go:160] Queued import of stream xiu/myruby...
I1017 07:14:15.428516       1 imagestream_controller.go:144] Finished syncing image stream "xiu/myruby" (50.165µs)
```

Marking this bug verified.
Sounds great, XiuJuan. Thanks.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062