4.3 release promotion CI [1]:

Dec 03 22:13:19.021 E kube-apiserver Kube API started failing: Get https://api.ci-op-jw2mh699-34698.origin-ci-int-gce.dev.openshift.com:6443/api/v1/namespaces/kube-system?timeout=3s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Job wrapped up with:

Failing tests:

[Conformance][templates] templateinstance readiness test should report failed soon after an annotated objects has failed [Suite:openshift/conformance/parallel/minimal]
[Conformance][templates] templateinstance readiness test should report ready soon after all annotated objects are ready [Suite:openshift/conformance/parallel/minimal]
[Feature:Builds][Conformance] oc new-app should fail with a --name longer than 58 characters [Suite:openshift/conformance/parallel/minimal]
[Feature:Builds][Conformance] oc new-app should succeed with a --name of 58 characters [Suite:openshift/conformance/parallel/minimal]
[Feature:Builds][Conformance] oc new-app should succeed with an imagestream [Suite:openshift/conformance/parallel/minimal]
[Feature:Builds][Conformance][valueFrom] process valueFrom in build strategy environment variables should fail resolving unresolvable valueFrom in docker build environment variable references [Suite:openshift/conformance/parallel/minimal]
[Feature:Builds][Conformance][valueFrom] process valueFrom in build strategy environment variables should fail resolving unresolvable valueFrom in sti build environment variable references [Suite:openshift/conformance/parallel/minimal]
[Feature:Builds][Conformance][valueFrom] process valueFrom in build strategy environment variables should successfully resolve valueFrom in docker build environment variables [Suite:openshift/conformance/parallel/minimal]
[Feature:Builds][Conformance][valueFrom] process valueFrom in build strategy environment variables should successfully resolve valueFrom in s2i build environment variables [Suite:openshift/conformance/parallel/minimal]
[Feature:Builds][pruning] prune builds based on settings in the buildconfig [Conformance] buildconfigs should have a default history limit set when created via the group api [Suite:openshift/conformance/parallel/minimal]
[Feature:Builds][pruning] prune builds based on settings in the buildconfig should prune builds after a buildConfig change [Suite:openshift/conformance/parallel]
[Feature:Builds][pruning] prune builds based on settings in the buildconfig should prune canceled builds based on the failedBuildsHistoryLimit setting [Suite:openshift/conformance/parallel]
[Feature:Builds][pruning] prune builds based on settings in the buildconfig should prune completed builds based on the successfulBuildsHistoryLimit setting [Suite:openshift/conformance/parallel]
[Feature:Builds][pruning] prune builds based on settings in the buildconfig should prune errored builds based on the failedBuildsHistoryLimit setting [Suite:openshift/conformance/parallel]
[Feature:Builds][pruning] prune builds based on settings in the buildconfig should prune failed builds based on the failedBuildsHistoryLimit setting [Suite:openshift/conformance/parallel]

which may or may not be related to the Kube API going out. The cluster was at least alive enough for a must-gather [2]. 68 jobs with this message in the past 24h (5% of failed e2e jobs) [3].
[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/513
[2]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/513/artifacts/e2e-gcp/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-13205e1b645979118b3ba5f60cb1a4b3c73e43a4745340e30dc441d13f646851/
[3]: https://search.svc.ci.openshift.org/chart?search=Kube%20API%20started%20failing.*Client.Timeout%20exceeded%20while%20awaiting%20headers
Builds were failing because a required imagestream (python) was failing to import due to "unauthorized: Please login to the Red Hat Registry using your Customer Portal credentials." I'm leaning towards this being a flake on registry.redhat.io, since the tag for python 2.7 eventually did import, whereas the later versions had not imported yet. @Trevor is this a persistent failure?
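(For anyone re-checking an import like this on a live cluster, the tag-level import status can be pulled with something along these lines; the CI cluster here is already gone, so this is only a sketch:)

$ oc describe imagestream python -n openshift        # shows per-tag import errors
$ oc get imagestream python -n openshift -o yaml     # full status, including the ImportSuccess tag conditions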
> Trevor is this a persistent failure?

"Kube API started failing.*Client.Timeout exceeded while awaiting headers" is a very common failure (link to CI-search in my initial comment). If we want to retarget this bug to not be about those kube-apiserver monitor conditions [1], and look at the build issues instead, the relevant build-log text seems to be [2]:

fail [github.com/openshift/origin/test/extended/builds/build_pruning.go:43]: Unexpected error:
    <*errors.errorString | 0xc0022bc030>: {
        s: "Failed to import expected imagestreams",
    }
    Failed to import expected imagestreams
occurred

failed: (2m51s) 2019-12-03T22:03:03 "[Feature:Builds][pruning] prune builds based on settings in the buildconfig should prune canceled builds based on the failedBuildsHistoryLimit setting [Suite:openshift/conformance/parallel]"

We've seen that six times today [3]. But personally I think that build thing should be its own bug (I didn't see any existing bugs mentioning "Failed to import expected imagestreams"), so this can be about whether the "Client.Timeout exceeded while awaiting headers" monitor output is showing a real issue or is distracting debugging noise that should be quieted. My impression was that we want 100% uptime for the Kube API and that even short flaps like:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/513/build-log.txt | grep 'Kube API started failing\|Kube API started responding' | sort | uniq
Dec 03 22:13:19.021 E kube-apiserver Kube API started failing: Get https://api.ci-op-jw2mh699-34698.origin-ci-int-gce.dev.openshift.com:6443/api/v1/namespaces/kube-system?timeout=3s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Dec 03 22:13:19.299 I kube-apiserver Kube API started responding to GET requests

were things we wanted to fix.

[1]: https://github.com/openshift/origin/blob/9d9c044e53d4d27b64f9407f7596ba86a0f78e23/pkg/monitor/api.go#L78-L82
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/513/build-log.txt
[3]: https://ci-search-ci-search-next.svc.ci.openshift.org/chart?search=Failed%20to%20import%20expected%20imagestreams
@Trevor I've retitled this BZ and updated the component (Samples) to refer to the root cause of the build test failures. I think opening a separate BZ to investigate the Kube API server issue is the right course of action.
> I think opening a separate BZ to investigate the Kube API server issue is the right course of action.

Done with bug 1779938.
It should be noted that by the time the must-gather was gathered, the python imagestream had successfully imported. See
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/513/artifacts/e2e-gcp/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-13205e1b645979118b3ba5f60cb1a4b3c73e43a4745340e30dc441d13f646851/namespaces/openshift/image.openshift.io/imagestreams.yaml

If you look, the creation timestamp on the python imagestream is the same as the one seen in the imagestream dump in https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/513/build-log.txt (reference [2] from Trevor's https://bugzilla.redhat.com/show_bug.cgi?id=1779413#c3).

It appears to have rectified about 20 minutes later based on the timestamps around the failure (last transition time) and success (created time). See

  - conditions:
    - generation: 2
      lastTransitionTime: "2019-12-03T21:48:46Z"
      message: 'Internal error occurred: registry.redhat.io/rhscl/python-35-rhel7:latest:
        Get https://registry.redhat.io/v2/rhscl/python-35-rhel7/manifests/latest:
        unauthorized: Please login to the Red Hat Registry using your Customer Portal
        credentials. Further instructions can be found here: https://access.redhat.com/RegistryAuthentication'
      reason: InternalError
      status: "False"
      type: ImportSuccess
    items: null
    tag: "3.5"

vs.

  - items:
    - created: "2019-12-03T22:08:31Z"
      dockerImageReference: registry.redhat.io/rhscl/python-35-rhel7@sha256:26aa80a9db33f08b67ef8f37ea0593bac3800ccbedd2d62eabfd38b2501c8762
      generation: 4
      image: sha256:26aa80a9db33f08b67ef8f37ea0593bac3800ccbedd2d62eabfd38b2501c8762
    tag: "3.5"

Given that the generation count went from 2 to 4, and the fact that the samples operator retries on 10 minute intervals, that lines up with 2 retries to work around this registry.redhat.io outage / flake.

And yes, this chalks up as such an external dependency flake ... which is not uncommon for the TBR / registry.redhat.io.
In case my comment 6 wasn't clear:

- The samples operator tried to reimport at 2019-12-03T21:58; registry.redhat.io's python repo was still down and the import failed.
- It tried again at 2019-12-03T22:08 and it finally succeeded.

There were similar patterns with the other python imagestream tags that did not import the first go-around.
> It appears to have rectified about 20 minutes later based on the timestamps around the failure (last transition time) and success (created time)
> ...
> And yes, this chalks up as such an external dependency flake ... which is not uncommon for the TBR / registry.redhat.io

Checking [1] again, there were 12 in the past 24h. Is it worth raising timeout values on these tests or putting in some other retry logic so we can survive an upstream registry hiccup?

[1]: https://ci-search-ci-search-next.svc.ci.openshift.org/chart?search=Failed%20to%20import%20expected%20imagestreams
(In reply to W. Trevor King from comment #8)
> > It appears to have rectified about 20 minutes later based on the timestamps around the failure (last transition time) and success (created time)
> > ...
> > And yes, this chalks up as such an external dependency flake ... which is not uncommon for the TBR / registry.redhat.io
>
> Checking [1] again, there were 12 in the past 24h. Is it worth raising
> timeout values on these tests or putting in some other retry logic so we can
> survive an upstream registry hiccup?

Yeah, there has been a lot of historical tension over the build/imageeco tests taking too long as it is. Adding retries to ride out 10 / 20 minute outages like the one we just saw here is going to hit that historical sore spot.

After a few minutes of thinking, I can think of a cheaper short-term answer and a more expensive long-term answer, in addition to continuing to live with it as-is or putting more pressure on the folks in charge of registry.redhat.io (which is above my pay grade) to make it more resilient:

a) Cheaper change: if the imports are not ready, abort the test without error, but log it somehow so we can continue to track how often it happens via https://ci-search-ci-search-next.svc.ci.openshift.org and decide whether we should worry about how often we are missing coverage.

b) More expensive change: move off of the openshift-namespace imagestreams, at least for the build suite, to manually defined imagestreams that leverage the docker.io/quay.io upstream versions of those sclorg images ... assuming those are "more resilient".

Ben / Adam - thoughts on those ideas, or can y'all think of other alternatives?

> [1]: https://ci-search-ci-search-next.svc.ci.openshift.org/chart?search=Failed%20to%20import%20expected%20imagestreams
Can we set ImageContentSourcePolicies to say "check registry.redhat.io, but if that doesn't work fall back to quay.io for $THESE_REPOS_NEEDED_BY_THE_TEST"?
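I'm picturing something shaped roughly like the sketch below (the mirror repository is hypothetical; I don't know whether these images are actually published to quay.io, and as Ben notes below, ICSP mirrors only apply to by-digest pulls):

$ oc apply -f - <<'EOF'
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: rhscl-quay-fallback          # hypothetical name
spec:
  repositoryDigestMirrors:
  - source: registry.redhat.io/rhscl
    mirrors:
    - quay.io/some-mirror/rhscl      # hypothetical mirror location
EOF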
c) retry the imagestream import within the test logic itself, if some things are not imported yet.
And no, Trevor, because these are not imports by SHA (nor do I think this content is in quay, but I'm not sure).

To clarify what I mean by (c): I mean literally have the test logic run "oc import-image imagestreamname --all -n openshift" for any imagestream that has not successfully imported at the time we check them in the test code.
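A rough shell sketch of what (c) could look like, outside the actual test framework (the imagestream list and the readiness check here are approximations, not the real test code):

for is in python nodejs ruby; do    # hypothetical subset of the openshift-namespace imagestreams
  # If no tag has a successfully imported item yet, kick off a re-import of all tags.
  if ! oc get imagestream "$is" -n openshift \
      -o jsonpath='{.status.tags[*].items[*].created}' | grep -q .; then
    oc import-image "$is" --all -n openshift
  fi
done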
(In reply to Ben Parees from comment #11)
> c) retry the imagestream import within the test logic itself, if some things
> are not imported yet.

Though in this particular case that probably would have failed for at least another 10 minutes, it might narrow the window some.
Reopening to ruminate over the three options Ben and I threw out here, which we are discussing with Trevor and Adam in Slack.
I saw hits for "a manual image-import was submitted" in CI search, so I think we can mark this verified.
Although the 'a manual image-import' message has been added, the import-failed event still exists: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-fips-4.4/238
That FIPS error (https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-fips-4.4/238) is a result of https://bugzilla.redhat.com/show_bug.cgi?id=1775973 and should be addressed via https://github.com/openshift/openshift-controller-manager/pull/56. Try again with runs that include that change.
Thanks, Gabe. The text "a manual image-import was submitted" has been added.
This was a test-case-only change; no doc update is required.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581