In some recent 4.3 release jobs [1,2,3]: Asked to pull fresh copy of "registry.redhat.io/rhscl/nodejs-10-rhel7@sha256:23e9412a49fcd999fda5296c5b4c082ab29caa561645e4e9a07a366cad41c981". ... Warning: Pull failed, retrying in 5s ... F1124 02:12:48.081564 1 helpers.go:114] error: build error: After retrying 2 times, Pull image still failed due to error: unable to retrieve auth token: invalid username/password: unauthorized: Please login to the Red Hat Registry using your Customer Portal credentials. Further instructions can be found here: https://access.redhat.com/RegistryAuthentication ... fail [github.com/openshift/origin/test/extended/builds/new_app.go:56]: Unexpected error: <*errors.errorString | 0xc0026d64a0>: { s: "The build \"a234567890123456789012345678901234567890123456789012345678-1\" status is \"Failed\"", } The build "a234567890123456789012345678901234567890123456789012345678-1" status is "Failed" occurred ... failed: (1m6s) 2019-11-24T02:12:52 "[Feature:Builds][Conformance] oc new-app should succeed with a --name of 58 characters [Suite:openshift/conformance/parallel/minimal]" This seems like a flake, as mentioned (I think) in [4], but I thought I'd open a new bug to track the flake since that bug was closed as fixed. Do we know if this is a cluster-side flake or flake in registry.redhat.io's auth server or...? Looks like we get two or three of these a day in CI [5], and it would be nice to be able to drive that flake rate down. [1]: https://search.svc.ci.openshift.org/chart?search=error:%20build%20error:%20After%20retrying%202%20times,%20Pull%20image%20still%20failed%20due%20to%20error:%20unable%20to%20retrieve%20auth%20token:%20invalid%20username/password:%20unauthorized:%20Please%20login%20to%20the%20Red%20Hat%20Registry%20using%20your%20Customer%20Portal%20credentials [2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.3/73 [3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-shared-vpc-4.3/152 [4]: https://bugzilla.redhat.com/show_bug.cgi?id=1745418#c3 [5]: https://search.svc.ci.openshift.org/?search=error%3A+build+error%3A+After+retrying+2+times%2C+Pull+image+still+failed+due+to+error%3A+unable+to+retrieve+auth+token%3A+invalid+username%2Fpassword%3A+unauthorized%3A+Please+login+to+the+Red+Hat+Registry+using+your+Customer+Portal+credentials&maxAge=168h&context=0&type=build-log
Also, aren't failing tests supposed to get a single retry to punch through one-off flakes? But this is being listed as a failing test: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.3/73/build-log.txt | grep -A6 'Flaky tests' | head -n7 Flaky tests: [Feature:OpenShiftControllerManager] TestDockercfgTokenDeletedController [Suite:openshift/conformance/parallel] Failing tests: [Feature:Builds][Conformance] oc new-app should succeed with a --name of 58 characters [Suite:openshift/conformance/parallel/minimal] despite only being run once: $ curl -s 'https://storage.googleapis.com/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.3/73/build-log.txt' | grep 'name of 58 characters' started: (0/877/2128) "[Feature:Builds][Conformance] oc new-app should succeed with a --name of 58 characters [Suite:openshift/conformance/parallel/minimal]" [It] should succeed with a --name of 58 characters [Suite:openshift/conformance/parallel/minimal] failed: (1m6s) 2019-11-24T02:12:52 "[Feature:Builds][Conformance] oc new-app should succeed with a --name of 58 characters [Suite:openshift/conformance/parallel/minimal]" Nov 24 02:11:47.444 - 65s I test="[Feature:Builds][Conformance] oc new-app should succeed with a --name of 58 characters [Suite:openshift/conformance/parallel/minimal]" running Nov 24 02:12:52.988 I test="[Feature:Builds][Conformance] oc new-app should succeed with a --name of 58 characters [Suite:openshift/conformance/parallel/minimal]" failed [Feature:Builds][Conformance] oc new-app should succeed with a --name of 58 characters [Suite:openshift/conformance/parallel/minimal] [Feature:Builds][Conformance] oc new-app should succeed with a --name of 58 characters [Suite:openshift/conformance/parallel/minimal] Although the flaky test seems to only have one run as well: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.3/73/build-log.txt | grep TestDockercfgTokenDeletedController started: (1/1128/2128) "[Feature:OpenShiftControllerManager] TestDockercfgTokenDeletedController [Suite:openshift/conformance/parallel]" [It] TestDockercfgTokenDeletedController [Suite:openshift/conformance/parallel] failed: (5m13s) 2019-11-24T02:19:55 "[Feature:OpenShiftControllerManager] TestDockercfgTokenDeletedController [Suite:openshift/conformance/parallel]" Nov 24 02:14:42.785 - 312s I test="[Feature:OpenShiftControllerManager] TestDockercfgTokenDeletedController [Suite:openshift/conformance/parallel]" running Nov 24 02:19:55.773 I test="[Feature:OpenShiftControllerManager] TestDockercfgTokenDeletedController [Suite:openshift/conformance/parallel]" failed [Feature:OpenShiftControllerManager] TestDockercfgTokenDeletedController [Suite:openshift/conformance/parallel] [Feature:OpenShiftControllerManager] TestDockercfgTokenDeletedController [Suite:openshift/conformance/parallel] so I don't know what's going on there.
*** Bug 1776416 has been marked as a duplicate of this bug. ***
@Gabe - I thought we had addressed many of the flakes with this test, guess not?
So it would seem Adam. Looking at the must gather data, by the time the data is gathered, the nodejs imagestream in question has its stats/dockerImageRegistry set, so in theory the local policy should have been honored and the internal image registry accessed vs. registry.redhat.io. I also do not see any of the events / error logs we added around this scenario. So either we need to add more debug around the paths implicated by Ben's original theory, or yeah we add more retry in the test itself (assuming Ben's theory is still correct but our events etc. are getting missed) or we punt on the test again. Or a new theory is needed. At least based on Trevor's query it seems to be flaking less than before.
Actually in looking at the failures a little more closely, I think the local pull through is working. If you look at the following from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.3/73 I1124 02:12:42.423538 1 daemonless.go:61] found valid .dockercfg at /var/run/secrets/openshift.io/pull/.dockercfg Warning: Pull failed, retrying in 5s ... F1124 02:12:48.081564 1 helpers.go:114] error: build error: After retrying 2 times, Pull image still failed due to error: unable to retrieve auth token: invalid username/password: unauthorized: Please login to the Red Hat Registry using your Customer Portal credentials. Further instructions can be found here: https://access.redhat.com/RegistryAuthentication Nov 24 02:12:52.800: INFO: skipping dumping cluster info - cluster too large That "found valid .dockercfg at /var/run/secrets/openshift.io/pull/.dockercfg" entry means it is using the image registry pull secret. So either there was a flake accessing the image registry, or the secret was not ready. I saw no errors for the namespace in question at the above time frame in either the image registry nor apiserver. I noticed there is not wait for the builder service account call in the new-app test. That seems the most plausible to me at this point. I'll construct an extended test PR accordingly. But of course Ben/Adam chime in as you see fit.
Also adding a dump of the build pod yaml to confirm the translation of the local pull through spec when the error occurs
Based on https://search.svc.ci.openshift.org/?search=should+succeed+with+a+--name+of+58+characters&maxAge=336h&context=2&type=all I see no hits for the oc new-app test that was hitting premature inability to pull from registry.redhat.io I think we can mark this verified
(In reply to Gabe Montero from comment #8) > Based on > https://search.svc.ci.openshift.org/?search=should+succeed+with+a+-- > name+of+58+characters&maxAge=336h&context=2&type=all That's not the right search. With comment 0's [1], I still see a few of these in the past 24h, e.g. [2] with: 2019-12-14T05:40:32.956450913Z F1214 05:40:32.956388 1 helpers.go:114] error: build error: After retrying 2 times, Pull image still failed due to error: unable to retrieve auth token: invalid username/password: unauthorized: Please login to the Red Hat Registry using your Customer Portal credentials. Further instructions can be found here: https://access.redhat.com/RegistryAuthentication And if you just want to search for test failures, you need a "failed:" prefix like [3], otherwise you hit jobs that passed that test but failed on some other test (the build log would include "passed:.*name of 58 characters"). That search turns up failures like [4] with: fail [github.com/openshift/origin/test/extended/builds/new_app.go:62]: Unexpected error: <*errors.StatusError | 0xc0030f9720>: { ErrStatus: { TypeMeta: {Kind: "", APIVersion: ""}, ListMeta: { SelfLink: "", ResourceVersion: "", Continue: "", RemainingItemCount: nil, }, Status: "Failure", Message: "Unauthorized", Reason: "Unauthorized", Details: nil, Code: 401, }, } Unauthorize [1]: https://search.svc.ci.openshift.org/chart?search=error:%20build%20error:%20After%20retrying%202%20times,%20Pull%20image%20still%20failed%20due%20to%20error:%20unable%20to%20retrieve%20auth%20token:%20invalid%20username/password:%20unauthorized:%20Please%20login%20to%20the%20Red%20Hat%20Registry%20using%20your%20Customer%20Portal%20credentials [2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_release/6331/rehearse-6331-pull-ci-openshift-origin-master-e2e-aws/7 [3]: https://search.svc.ci.openshift.org/chart?search=failed:.*should%20succeed%20with%20a%20--name%20of%2058%20characters [4]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.4/198
(In reply to W. Trevor King from comment #9) > (In reply to Gabe Montero from comment #8) > > Based on > > https://search.svc.ci.openshift.org/?search=should+succeed+with+a+-- > > name+of+58+characters&maxAge=336h&context=2&type=all > > That's not the right search. With comment 0's [1], I still see a few of > these in the past 24h, e.g. [2] with: > > 2019-12-14T05:40:32.956450913Z F1214 05:40:32.956388 1 > helpers.go:114] error: build error: After retrying 2 times, Pull image still > failed due to error: unable to retrieve auth token: invalid > username/password: unauthorized: Please login to the Red Hat Registry using > your Customer Portal credentials. Further instructions can be found here: > https://access.redhat.com/RegistryAuthentication Just to be clear, the change associated with this PR does not prevent the failures from that query from occurring. They can still occur if registry.redhat.io is in fact flaky The change was simply trying to narrow the window. > > And if you just want to search for test failures, you need a "failed:" > prefix like [3], otherwise you hit jobs that passed that test but failed on > some other test (the build log would include "passed:.*name of 58 > characters"). That search turns up failures like [4] with: > > fail [github.com/openshift/origin/test/extended/builds/new_app.go:62]: > Unexpected error: > <*errors.StatusError | 0xc0030f9720>: { > ErrStatus: { > TypeMeta: {Kind: "", APIVersion: ""}, > ListMeta: { > SelfLink: "", > ResourceVersion: "", > Continue: "", > RemainingItemCount: nil, > }, > Status: "Failure", > Message: "Unauthorized", > Reason: "Unauthorized", > Details: nil, > Code: 401, > }, > } > Unauthorize > > [1]: > https://search.svc.ci.openshift.org/chart?search=error:%20build%20error: > %20After%20retrying%202%20times, > %20Pull%20image%20still%20failed%20due%20to%20error: > %20unable%20to%20retrieve%20auth%20token:%20invalid%20username/password: > %20unauthorized: > %20Please%20login%20to%20the%20Red%20Hat%20Registry%20using%20your%20Customer > %20Portal%20credentials > [2]: > https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/ > openshift_release/6331/rehearse-6331-pull-ci-openshift-origin-master-e2e-aws/ > 7 > [3]: > https://search.svc.ci.openshift.org/chart?search=failed:. > *should%20succeed%20with%20a%20--name%20of%2058%20characters that said, yes thanks for this corrected search, which is what I was trying to do. And yes, I'm seeing some failures still where we are trying to pull from the registry.redhat.io vs. pull through via the imagestream on a build. Looking at failure https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.4/319 currently. > [4]: > https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release- > openshift-ocp-installer-e2e-aws-upi-4.4/198
I noticed in the failed e2e above the build was triggered by an image change ... we now have a theory that a timing gap with those and localref imagestreams exists. If that does not prove to be the thing, we'll punt and starting linking the necessary TBR secret to the builder SA for these tests.
Passed in 4.4.0-0.nightly-2019-12-18-015816 e2e test, need monitoring more jobs, and will add result here. https://search.svc.ci.openshift.org/?search=error%3A+build+error%3A+After+retrying+2+times%2C+Pull+image+still+failed+due+to+error%3A+unable+to+retrieve+auth+token%3A+invalid+username%2Fpassword%3A+unauthorized%3A+Please+login+to+the+Red+Hat+Registry+using+your+Customer+Portal+credentials&maxAge=168h&context=0&type=build-log
After dozen of ci(4.4.0-0.nightly-2019-12-18-064553 - 4.4.0-0.nightly-2019-12-19-223334) jobs triggered, no this error "After retrying 2 times, Pull image still failed due to error: unable to retrieve auth token: invalid username/password: unauthorized: Please login to the Red Hat Registry using your Customer Portal credentials" retrived again. Mark this bug to verified.
Can we backport the fixes to 4.3? Looks like it's happening there too, e.g. in 4.3.0-0.nightly-2019-12-20-005113 [1]: F1220 02:02:40.578281 1 helpers.go:114] error: build error: After retrying 2 times, Pull image still failed due to error: unable to retrieve auth token: invalid username/password: unauthorized: Please login to the Red Hat Registry using your Customer Portal credentials. Further instructions can be found here: https://access.redhat.com/RegistryAuthentication [1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/627
4.3.z clone https://bugzilla.redhat.com/show_bug.cgi?id=1785845 created
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581