Bug 1779413 - Python imagestream fails to completely import from registry.redhat.io
Summary: Python imagestream fails to completely import from registry.redhat.io
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Samples
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.4.0
Assignee: Gabe Montero
QA Contact: wewang
URL:
Whiteboard: devex
Depends On:
Blocks:
 
Reported: 2019-12-03 23:04 UTC by W. Trevor King
Modified: 2020-05-04 11:18 UTC
CC List: 8 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-04 11:18:30 UTC
Target Upstream Version:
Embargoed:




Links
GitHub openshift/origin pull 24263 (closed): Bug 1779413: more aggressive (than samples operator) image-imports to recover fail… (last updated 2020-05-28 03:58:00 UTC)
Red Hat Product Errata RHBA-2020:0581 (last updated 2020-05-04 11:18:49 UTC)

Description W. Trevor King 2019-12-03 23:04:28 UTC
4.3 release promotion CI [1]:

Dec 03 22:13:19.021 E kube-apiserver Kube API started failing: Get https://api.ci-op-jw2mh699-34698.origin-ci-int-gce.dev.openshift.com:6443/api/v1/namespaces/kube-system?timeout=3s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Job wrapped up with:

Failing tests:

[Conformance][templates] templateinstance readiness test  should report failed soon after an annotated objects has failed [Suite:openshift/conformance/parallel/minimal]
[Conformance][templates] templateinstance readiness test  should report ready soon after all annotated objects are ready [Suite:openshift/conformance/parallel/minimal]
[Feature:Builds][Conformance] oc new-app  should fail with a --name longer than 58 characters [Suite:openshift/conformance/parallel/minimal]
[Feature:Builds][Conformance] oc new-app  should succeed with a --name of 58 characters [Suite:openshift/conformance/parallel/minimal]
[Feature:Builds][Conformance] oc new-app  should succeed with an imagestream [Suite:openshift/conformance/parallel/minimal]
[Feature:Builds][Conformance][valueFrom] process valueFrom in build strategy environment variables  should fail resolving unresolvable valueFrom in docker build environment variable references [Suite:openshift/conformance/parallel/minimal]
[Feature:Builds][Conformance][valueFrom] process valueFrom in build strategy environment variables  should fail resolving unresolvable valueFrom in sti build environment variable references [Suite:openshift/conformance/parallel/minimal]
[Feature:Builds][Conformance][valueFrom] process valueFrom in build strategy environment variables  should successfully resolve valueFrom in docker build environment variables [Suite:openshift/conformance/parallel/minimal]
[Feature:Builds][Conformance][valueFrom] process valueFrom in build strategy environment variables  should successfully resolve valueFrom in s2i build environment variables [Suite:openshift/conformance/parallel/minimal]
[Feature:Builds][pruning] prune builds based on settings in the buildconfig  [Conformance] buildconfigs should have a default history limit set when created via the group api [Suite:openshift/conformance/parallel/minimal]
[Feature:Builds][pruning] prune builds based on settings in the buildconfig  should prune builds after a buildConfig change [Suite:openshift/conformance/parallel]
[Feature:Builds][pruning] prune builds based on settings in the buildconfig  should prune canceled builds based on the failedBuildsHistoryLimit setting [Suite:openshift/conformance/parallel]
[Feature:Builds][pruning] prune builds based on settings in the buildconfig  should prune completed builds based on the successfulBuildsHistoryLimit setting [Suite:openshift/conformance/parallel]
[Feature:Builds][pruning] prune builds based on settings in the buildconfig  should prune errored builds based on the failedBuildsHistoryLimit setting [Suite:openshift/conformance/parallel]
[Feature:Builds][pruning] prune builds based on settings in the buildconfig  should prune failed builds based on the failedBuildsHistoryLimit setting [Suite:openshift/conformance/parallel]

which may or may not be related to the Kube API going out.  Cluster was at least alive enough for a must-gather [2].  68 jobs with this message in the past 24h (5% of failed e2e jobs) [3].

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/513
[2]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/513/artifacts/e2e-gcp/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-13205e1b645979118b3ba5f60cb1a4b3c73e43a4745340e30dc441d13f646851/
[3]: https://search.svc.ci.openshift.org/chart?search=Kube%20API%20started%20failing.*Client.Timeout%20exceeded%20while%20awaiting%20headers

Comment 2 Adam Kaplan 2019-12-04 20:42:05 UTC
Builds were failing because a required imagestream (python) was failing to import due to "unauthorized: Please login to the Red Hat Registry using your Customer Portal credentials."

I'm leaning towards this being a flake on registry.redhat.io, since the tag for python 2.7 eventually did import, whereas the later versions had not imported yet.

@Trevor is this a persistent failure?

Comment 3 W. Trevor King 2019-12-04 21:21:29 UTC
> Trevor is this a persistent failure?

"Kube API started failing.*Client.Timeout exceeded while awaiting headers" is a very common failure (link to CI-search in my initial comment).  If we want to retarget this bug to not be about those kube-apiserver monitor conditions [1], and look at the build issues instead, the relevant build-log text seems to be [2]:

fail [github.com/openshift/origin/test/extended/builds/build_pruning.go:43]: Unexpected error:
    <*errors.errorString | 0xc0022bc030>: {
        s: "Failed to import expected imagestreams",
    }
    Failed to import expected imagestreams
occurred

failed: (2m51s) 2019-12-03T22:03:03 "[Feature:Builds][pruning] prune builds based on settings in the buildconfig  should prune canceled builds based on the failedBuildsHistoryLimit setting [Suite:openshift/conformance/parallel]"

We've seen that six times today [3].  But personally I think that build thing should be its own bug (I didn't see any existing bugs mentioning "Failed to import expected imagestreams") so this can be about whether the "Client.Timeout exceeded while awaiting headers" monitor output is showing a real issue or is distracting debugging noise that should be quieted.  My impression was that we want 100% uptime for the Kube API and that even short flaps like:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/513/build-log.txt | grep 'Kube API started failing\|Kube API started responding' | sort | uniq
Dec 03 22:13:19.021 E kube-apiserver Kube API started failing: Get https://api.ci-op-jw2mh699-34698.origin-ci-int-gce.dev.openshift.com:6443/api/v1/namespaces/kube-system?timeout=3s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Dec 03 22:13:19.299 I kube-apiserver Kube API started responding to GET requests

were things we wanted to fix.

[1]: https://github.com/openshift/origin/blob/9d9c044e53d4d27b64f9407f7596ba86a0f78e23/pkg/monitor/api.go#L78-L82
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/513/build-log.txt
[3]: https://ci-search-ci-search-next.svc.ci.openshift.org/chart?search=Failed%20to%20import%20expected%20imagestreams

Comment 4 Adam Kaplan 2019-12-05 03:23:24 UTC
@Trevor I've retitled this BZ and updated the component (Samples) to refer to the root cause of the build test failures. I think opening a separate BZ to investigate the Kube API server issue is the right course of action.

Comment 5 W. Trevor King 2019-12-05 03:52:40 UTC
> I think opening a separate BZ to investigate the Kube API server issue is the right course of action.

Done with bug 1779938.

Comment 6 Gabe Montero 2019-12-05 19:34:15 UTC
It should be noted that by the time the must-gather was gathered, the python imagestream successfully imported.  See https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/513/artifacts/e2e-gcp/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-13205e1b645979118b3ba5f60cb1a4b3c73e43a4745340e30dc441d13f646851/namespaces/openshift/image.openshift.io/imagestreams.yaml

If you look, the creation timestamp on the python imagestream is the same as the one seen in the imagestream dump in https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/513/build-log.txt (reference [2] from Trevor's https://bugzilla.redhat.com/show_bug.cgi?id=1779413#c3 )

It appears to have rectified about 20 minutes later based on the timestamps around the failure (last transition time) and success (created time)

See 

    - conditions:
      - generation: 2
        lastTransitionTime: "2019-12-03T21:48:46Z"
        message: 'Internal error occurred: registry.redhat.io/rhscl/python-35-rhel7:latest:
          Get https://registry.redhat.io/v2/rhscl/python-35-rhel7/manifests/latest:
          unauthorized: Please login to the Red Hat Registry using your Customer Portal
          credentials. Further instructions can be found here: https://access.redhat.com/RegistryAuthentication'
        reason: InternalError
        status: "False"
        type: ImportSuccess
      items: null
      tag: "3.5"


vs. 

    - items:
      - created: "2019-12-03T22:08:31Z"
        dockerImageReference: registry.redhat.io/rhscl/python-35-rhel7@sha256:26aa80a9db33f08b67ef8f37ea0593bac3800ccbedd2d62eabfd38b2501c8762
        generation: 4
        image: sha256:26aa80a9db33f08b67ef8f37ea0593bac3800ccbedd2d62eabfd38b2501c8762
      tag: "3.5"

Given that the generation count went from 2 to 4 and the fact that the samples operator retries on 10 minute intervals, that lines up with 2 retries to work around this registry.redhat.io outage / flake

And yes, this chalks up as such an external dependency flake ... which is not uncommon for the TBR / registry.redhat.io
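
For anyone digging into similar reports, a quick way to see the per-tag import state is to pull the imagestream status directly (a minimal sketch using stock oc commands; as in the dump above, tags that failed to import carry a condition with type: ImportSuccess and status: "False", while imported tags list the created image under items):

# full status, including per-tag ImportSuccess conditions and generations
$ oc get imagestream python -n openshift -o yaml
# per-tag summary, with import errors shown inline
$ oc describe imagestream python -n openshift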

Comment 7 Gabe Montero 2019-12-05 19:37:55 UTC
in case my #comment 6 wasn't clear

- samples operator tried to reimport at 2019-12-03T21:58 and registry.redhat.io's python repo was still down, so the import failed
- it tried again at 2019-12-03T22:08 and it finally succeeded

there were similar patterns with the other python imagestream tags that did not import the first go around

Comment 8 W. Trevor King 2019-12-05 19:39:31 UTC
> It appears to have rectified about 20 minutes later based on the timestamps around the failure (last transition time) and success (created time)
> ...
> And yes, this chalks up as such an external dependency flake ... which is not uncommon for the TBR / registry.redhat.io

Checking [1] again, there were 12 in the past 24h.  Is it worth raising timeout values on these tests or putting in some other retry logic so we can survive an upstream registry hiccup?

[1]: https://ci-search-ci-search-next.svc.ci.openshift.org/chart?search=Failed%20to%20import%20expected%20imagestreams

Comment 9 Gabe Montero 2019-12-05 20:06:30 UTC
(In reply to W. Trevor King from comment #8)
> > It appears to have rectified about 20 minutes later based on the timestamps around the failure (last transition time) and success (created time)
> > ...
> > And yes, this chalks up as such an external dependency flake ... which is not uncommon for the TBR / registry.redhat.io
> 
> Checking [1] again, there were 12 in the past 24h.  Is it worth raising
> timeout values on these tests or putting in some other retry logic so we can
> survive an upstream registry hiccup?

yeah, there has been a lot of historical tension over the build/imageeco tests taking too long as it is .... adding retries to cope with 10 / 20 minute outages like the one we just saw here is going to hit that same sore spot

After a few minutes of thinking, I can see a cheaper short-term answer and a more expensive long-term answer, in addition to continuing to live with it as is, or putting more pressure on the folks in charge of registry.redhat.io (which is above my pay grade) to make it more resilient:

a) cheaper change: if the imports are not ready, abort the test without error, but somehow log it so we can continue to track how often it happens via https://ci-search-ci-search-next.svc.ci.openshift.org and see if we want to worry about how often we are missing coverage
b) more expensive change: move off of the use of openshift namespace imagestreams, at least for the build suite, to manually defined imagestreams that leverage the docker.io/quay.io upstream versions of those sclorg images ... assuming those are "more resilient"

Ben / Adam - thoughts on those ideas, or can y'all think of other alternatives?

> 
> [1]: https://ci-search-ci-search-next.svc.ci.openshift.org/chart?search=Failed%20to%20import%20expected%20imagestreams

Comment 10 W. Trevor King 2019-12-05 20:10:34 UTC
Can we set ImageContentSourcePolicies to say "check registry.redhat.io, but if that doesn't work fall back to quay.io for $THESE_REPOS_NEEDED_BY_THE_TEST"?
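
For context, this is the shape of the mirror policy being asked about (a hypothetical sketch; the quay.io mirror path is made up, mirrors are consulted before the listed source, and, as Ben notes below, repositoryDigestMirrors only applies to pulls by digest, so it would not cover these tag-based imports):

$ oc apply -f - <<'EOF'
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: python-mirror-fallback            # hypothetical name
spec:
  repositoryDigestMirrors:
  - source: registry.redhat.io/rhscl/python-35-rhel7
    mirrors:
    - quay.io/example/python-35-rhel7     # hypothetical mirror location
EOF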

Comment 11 Ben Parees 2019-12-05 20:15:23 UTC
c) retry the imagestream import within the test logic itself, if some things are not imported yet.

Comment 12 Ben Parees 2019-12-05 20:17:24 UTC
And no, Trevor, because these are not imports by SHA (nor do I think this content is in quay, but I'm not sure).


To clarify what I mean by (c): I mean literally having the test logic run "oc import-image imagestreamname --all -n openshift" for any imagestream that has not successfully imported at the time we check them in the test code.
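
A rough sketch of what (c) could look like as a shell loop (hypothetical illustration only; the actual change was made in the test code itself, see openshift/origin pull 24263 in the Links above):

# Re-trigger an import for every imagestream in the openshift namespace; the
# real check would only do this for imagestreams whose tags have not all imported.
# "oc get ... -o name" returns names like imagestream.image.openshift.io/python,
# so strip the resource prefix before passing the name to oc import-image.
for is in $(oc get imagestreams -n openshift -o name); do
  oc import-image "${is#*/}" --all -n openshift
done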

Comment 13 Gabe Montero 2019-12-05 20:18:38 UTC
(In reply to Ben Parees from comment #11)
> c) retry the imagestream import within the test logic itself, if some things
> are not imported yet.

though in this particular case that probably would have failed for another 10 minutes at least 

but it might narrow the window some

Comment 14 Gabe Montero 2019-12-05 21:30:30 UTC
Reopening to ruminate over the 3 options Ben and I threw out here and are discussing, along with Trevor/Adam, in slack

Comment 16 Gabe Montero 2019-12-14 14:23:43 UTC
I saw hits for "a manual image-import was submitted" in CI search

I think we can mark this verified

Comment 17 XiuJuan Wang 2019-12-20 08:47:47 UTC
Although the 'a manual image-import' retry has been added, the import-failed event still exists.

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-fips-4.4/238

Comment 19 XiuJuan Wang 2019-12-23 02:24:30 UTC
Thanks Gabe, 
The text "a manual image-import was submitted" has been added.

Comment 20 Gabe Montero 2020-04-17 13:40:21 UTC
this was a test case only change - no doc update required

Comment 22 errata-xmlrpc 2020-05-04 11:18:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

