Bug 1775973 - CI: build error: After retrying 2 times, Pull image still failed due to error: unable to retrieve auth token: invalid username/password
Summary: CI: build error: After retrying 2 times, Pull image still failed due to error...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Samples
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
: 4.4.0
Assignee: Gabe Montero
QA Contact: XiuJuan Wang
URL:
Whiteboard:
: 1776416 (view as bug list)
Depends On:
Blocks: 1752313 1785845
TreeView+ depends on / blocked
 
Reported: 2019-11-24 05:59 UTC by W. Trevor King
Modified: 2020-05-04 11:17 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: the local ref setting on imagestreams can be ignored if imagestream access occurs very soon after cluster startup or restart of the openshift controller manager because the openshift controller manager might have stale imagestreams its cache with incomplete metadata Consequence: if the imagestream was backed by a private registry which the internal image registry had credentials for, but the consumer did not have credentials for, the subsequent image pull would fail Fix: the openshift controller manager was updated to refresh its imagestream cache if it was apparent the metadata initialization was incomplete Result: local ref imagestream policy is preserved even during the timing windows that occur immediately after cluster start or openshift controller manager restart
Clone Of:
: 1785845 (view as bug list)
Environment:
Last Closed: 2020-05-04 11:16:28 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Github openshift openshift-controller-manager pull 56 None closed Bug 1775973: use OCM config registry hostname with imagechangetrigger in case imag… 2020-11-18 21:55:45 UTC
Github openshift origin pull 24249 'None' closed Bug 1775973: make sure builder/default SAs ready for new-app extended tests image … 2020-11-18 21:55:45 UTC
Red Hat Product Errata RHBA-2020:0581 None None None 2020-05-04 11:17:26 UTC

Description W. Trevor King 2019-11-24 05:59:02 UTC
In some recent 4.3 release jobs [1,2,3]:

Asked to pull fresh copy of "registry.redhat.io/rhscl/nodejs-10-rhel7@sha256:23e9412a49fcd999fda5296c5b4c082ab29caa561645e4e9a07a366cad41c981".
...
Warning: Pull failed, retrying in 5s ...
F1124 02:12:48.081564       1 helpers.go:114] error: build error: After retrying 2 times, Pull image still failed due to error: unable to retrieve auth token: invalid username/password: unauthorized: Please login to the Red Hat Registry using your Customer Portal credentials. Further instructions can be found here: https://access.redhat.com/RegistryAuthentication
...
fail [github.com/openshift/origin/test/extended/builds/new_app.go:56]: Unexpected error:
    <*errors.errorString | 0xc0026d64a0>: {
        s: "The build \"a234567890123456789012345678901234567890123456789012345678-1\" status is \"Failed\"",
    }
    The build "a234567890123456789012345678901234567890123456789012345678-1" status is "Failed"
occurred
...
failed: (1m6s) 2019-11-24T02:12:52 "[Feature:Builds][Conformance] oc new-app  should succeed with a --name of 58 characters [Suite:openshift/conformance/parallel/minimal]"

This seems like a flake, as mentioned (I think) in [4], but I thought I'd open a new bug to track the flake since that bug was closed as fixed.  Do we know if this is a cluster-side flake or flake in registry.redhat.io's auth server or...?  Looks like we get two or three of these a day in CI [5], and it would be nice to be able to drive that flake rate down.

[1]: https://search.svc.ci.openshift.org/chart?search=error:%20build%20error:%20After%20retrying%202%20times,%20Pull%20image%20still%20failed%20due%20to%20error:%20unable%20to%20retrieve%20auth%20token:%20invalid%20username/password:%20unauthorized:%20Please%20login%20to%20the%20Red%20Hat%20Registry%20using%20your%20Customer%20Portal%20credentials
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.3/73
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-shared-vpc-4.3/152
[4]: https://bugzilla.redhat.com/show_bug.cgi?id=1745418#c3
[5]: https://search.svc.ci.openshift.org/?search=error%3A+build+error%3A+After+retrying+2+times%2C+Pull+image+still+failed+due+to+error%3A+unable+to+retrieve+auth+token%3A+invalid+username%2Fpassword%3A+unauthorized%3A+Please+login+to+the+Red+Hat+Registry+using+your+Customer+Portal+credentials&maxAge=168h&context=0&type=build-log

Comment 1 W. Trevor King 2019-11-24 06:12:54 UTC
Also, aren't failing tests supposed to get a single retry to punch through one-off flakes?  But this is being listed as a failing test:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.3/73/build-log.txt | grep -A6 'Flaky tests' | head -n7
  Flaky tests:

  [Feature:OpenShiftControllerManager] TestDockercfgTokenDeletedController [Suite:openshift/conformance/parallel]

  Failing tests:

  [Feature:Builds][Conformance] oc new-app  should succeed with a --name of 58 characters [Suite:openshift/conformance/parallel/minimal]

despite only being run once:

  $ curl -s 'https://storage.googleapis.com/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.3/73/build-log.txt' | grep 'name of 58 characters'
  started: (0/877/2128) "[Feature:Builds][Conformance] oc new-app  should succeed with a --name of 58 characters [Suite:openshift/conformance/parallel/minimal]"
  [It] should succeed with a --name of 58 characters [Suite:openshift/conformance/parallel/minimal]
  failed: (1m6s) 2019-11-24T02:12:52 "[Feature:Builds][Conformance] oc new-app  should succeed with a --name of 58 characters [Suite:openshift/conformance/parallel/minimal]"
  Nov 24 02:11:47.444 - 65s   I test="[Feature:Builds][Conformance] oc new-app  should succeed with a --name of 58 characters [Suite:openshift/conformance/parallel/minimal]" running
  Nov 24 02:12:52.988 I test="[Feature:Builds][Conformance] oc new-app  should succeed with a --name of 58 characters [Suite:openshift/conformance/parallel/minimal]" failed
  [Feature:Builds][Conformance] oc new-app  should succeed with a --name of 58 characters [Suite:openshift/conformance/parallel/minimal]
  [Feature:Builds][Conformance] oc new-app  should succeed with a --name of 58 characters [Suite:openshift/conformance/parallel/minimal]

Although the flaky test seems to only have one run as well:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.3/73/build-log.txt | grep TestDockercfgTokenDeletedController
  started: (1/1128/2128) "[Feature:OpenShiftControllerManager] TestDockercfgTokenDeletedController [Suite:openshift/conformance/parallel]"
  [It] TestDockercfgTokenDeletedController [Suite:openshift/conformance/parallel]
  failed: (5m13s) 2019-11-24T02:19:55 "[Feature:OpenShiftControllerManager] TestDockercfgTokenDeletedController [Suite:openshift/conformance/parallel]"
  Nov 24 02:14:42.785 - 312s  I test="[Feature:OpenShiftControllerManager] TestDockercfgTokenDeletedController [Suite:openshift/conformance/parallel]" running
  Nov 24 02:19:55.773 I test="[Feature:OpenShiftControllerManager] TestDockercfgTokenDeletedController [Suite:openshift/conformance/parallel]" failed
  [Feature:OpenShiftControllerManager] TestDockercfgTokenDeletedController [Suite:openshift/conformance/parallel]
  [Feature:OpenShiftControllerManager] TestDockercfgTokenDeletedController [Suite:openshift/conformance/parallel]

so I don't know what's going on there.

Comment 2 W. Trevor King 2019-11-26 00:25:25 UTC
*** Bug 1776416 has been marked as a duplicate of this bug. ***

Comment 3 Adam Kaplan 2019-11-26 16:20:35 UTC
@Gabe - I thought we had addressed many of the flakes with this test, guess not?

Comment 4 Gabe Montero 2019-12-01 21:03:34 UTC
So it would seem Adam.

Looking at the must gather data, by the time the data is gathered, the nodejs imagestream in question has its stats/dockerImageRegistry set, so in theory the local policy should have been honored and the internal image registry accessed vs. registry.redhat.io.

I also do not see any of the events / error logs we added around this scenario.

So either we need to add more debug around the paths implicated by Ben's original theory, or yeah we add more retry in the test itself (assuming Ben's theory is still correct but our events etc. are getting missed) or we punt on the test again.

Or a new theory is needed.

At least based on Trevor's query it seems to be flaking less than before.

Comment 5 Gabe Montero 2019-12-02 20:36:53 UTC
Actually in looking at the failures a little more closely, I think the local pull through is working.  If you look at the following from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.3/73

I1124 02:12:42.423538       1 daemonless.go:61] found valid .dockercfg at /var/run/secrets/openshift.io/pull/.dockercfg
Warning: Pull failed, retrying in 5s ...
F1124 02:12:48.081564       1 helpers.go:114] error: build error: After retrying 2 times, Pull image still failed due to error: unable to retrieve auth token: invalid username/password: unauthorized: Please login to the Red Hat Registry using your Customer Portal credentials. Further instructions can be found here: https://access.redhat.com/RegistryAuthentication

Nov 24 02:12:52.800: INFO: skipping dumping cluster info - cluster too large


That "found valid .dockercfg at /var/run/secrets/openshift.io/pull/.dockercfg" entry means it is using the image registry pull secret.

So either there was a flake accessing the image registry, or the secret was not ready.

I saw no errors for the namespace in question at the above time frame in either the image registry nor apiserver.

I noticed there is not wait for the builder service account call in the new-app test.

That seems the most plausible to me at this point.

I'll construct an extended test PR accordingly.

But of course Ben/Adam chime in as you see fit.

Comment 6 Gabe Montero 2019-12-02 20:49:31 UTC
Also adding a dump of the build pod yaml to confirm the translation of the local pull through spec when the error occurs

Comment 8 Gabe Montero 2019-12-14 14:21:07 UTC
Based on https://search.svc.ci.openshift.org/?search=should+succeed+with+a+--name+of+58+characters&maxAge=336h&context=2&type=all

I see no hits for the oc new-app test that was hitting premature inability to pull from registry.redhat.io

I think we can mark this verified

Comment 9 W. Trevor King 2019-12-14 16:42:29 UTC
(In reply to Gabe Montero from comment #8)
> Based on
> https://search.svc.ci.openshift.org/?search=should+succeed+with+a+--
> name+of+58+characters&maxAge=336h&context=2&type=all

That's not the right search.  With comment 0's [1], I still see a few of these in the past 24h, e.g. [2] with:

  2019-12-14T05:40:32.956450913Z F1214 05:40:32.956388       1 helpers.go:114] error: build error: After retrying 2 times, Pull image still failed due to error: unable to retrieve auth token: invalid username/password: unauthorized: Please login to the Red Hat Registry using your Customer Portal credentials. Further instructions can be found here: https://access.redhat.com/RegistryAuthentication

And if you just want to search for test failures, you need a "failed:" prefix like [3], otherwise you hit jobs that passed that test but failed on some other test (the build log would include "passed:.*name of 58 characters").  That search turns up failures like [4] with:

  fail [github.com/openshift/origin/test/extended/builds/new_app.go:62]: Unexpected error:
    <*errors.StatusError | 0xc0030f9720>: {
        ErrStatus: {
            TypeMeta: {Kind: "", APIVersion: ""},
            ListMeta: {
                SelfLink: "",
                ResourceVersion: "",
                Continue: "",
                RemainingItemCount: nil,
            },
            Status: "Failure",
            Message: "Unauthorized",
            Reason: "Unauthorized",
            Details: nil,
            Code: 401,
        },
    }
    Unauthorize

[1]: https://search.svc.ci.openshift.org/chart?search=error:%20build%20error:%20After%20retrying%202%20times,%20Pull%20image%20still%20failed%20due%20to%20error:%20unable%20to%20retrieve%20auth%20token:%20invalid%20username/password:%20unauthorized:%20Please%20login%20to%20the%20Red%20Hat%20Registry%20using%20your%20Customer%20Portal%20credentials
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_release/6331/rehearse-6331-pull-ci-openshift-origin-master-e2e-aws/7
[3]: https://search.svc.ci.openshift.org/chart?search=failed:.*should%20succeed%20with%20a%20--name%20of%2058%20characters
[4]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.4/198

Comment 10 Gabe Montero 2019-12-16 17:37:22 UTC
(In reply to W. Trevor King from comment #9)
> (In reply to Gabe Montero from comment #8)
> > Based on
> > https://search.svc.ci.openshift.org/?search=should+succeed+with+a+--
> > name+of+58+characters&maxAge=336h&context=2&type=all
> 
> That's not the right search.  With comment 0's [1], I still see a few of
> these in the past 24h, e.g. [2] with:
> 
>   2019-12-14T05:40:32.956450913Z F1214 05:40:32.956388       1
> helpers.go:114] error: build error: After retrying 2 times, Pull image still
> failed due to error: unable to retrieve auth token: invalid
> username/password: unauthorized: Please login to the Red Hat Registry using
> your Customer Portal credentials. Further instructions can be found here:
> https://access.redhat.com/RegistryAuthentication

Just to be clear, the change associated with this PR does not prevent the failures
from that query from occurring.

They can still occur if registry.redhat.io is in fact flaky

The change was simply trying to narrow the window.

> 
> And if you just want to search for test failures, you need a "failed:"
> prefix like [3], otherwise you hit jobs that passed that test but failed on
> some other test (the build log would include "passed:.*name of 58
> characters").  That search turns up failures like [4] with:
> 
>   fail [github.com/openshift/origin/test/extended/builds/new_app.go:62]:
> Unexpected error:
>     <*errors.StatusError | 0xc0030f9720>: {
>         ErrStatus: {
>             TypeMeta: {Kind: "", APIVersion: ""},
>             ListMeta: {
>                 SelfLink: "",
>                 ResourceVersion: "",
>                 Continue: "",
>                 RemainingItemCount: nil,
>             },
>             Status: "Failure",
>             Message: "Unauthorized",
>             Reason: "Unauthorized",
>             Details: nil,
>             Code: 401,
>         },
>     }
>     Unauthorize
> 
> [1]:
> https://search.svc.ci.openshift.org/chart?search=error:%20build%20error:
> %20After%20retrying%202%20times,
> %20Pull%20image%20still%20failed%20due%20to%20error:
> %20unable%20to%20retrieve%20auth%20token:%20invalid%20username/password:
> %20unauthorized:
> %20Please%20login%20to%20the%20Red%20Hat%20Registry%20using%20your%20Customer
> %20Portal%20credentials
> [2]:
> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/
> openshift_release/6331/rehearse-6331-pull-ci-openshift-origin-master-e2e-aws/
> 7
> [3]:
> https://search.svc.ci.openshift.org/chart?search=failed:.
> *should%20succeed%20with%20a%20--name%20of%2058%20characters

that said, yes thanks for this corrected search, which is what I was trying to do.

And yes, I'm seeing some failures still where we are trying to pull from the registry.redhat.io
vs. pull through via the imagestream on a build.

Looking at failure https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.4/319
currently.


> [4]:
> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-
> openshift-ocp-installer-e2e-aws-upi-4.4/198

Comment 11 Gabe Montero 2019-12-16 18:48:04 UTC
I noticed in the failed e2e above the build was triggered by an image change ... we now have a theory that a timing gap with those and localref imagestreams exists.

If that does not prove to be the thing, we'll punt and starting linking the necessary TBR secret to the builder SA for these tests.

Comment 15 XiuJuan Wang 2019-12-20 03:32:00 UTC
After dozen of ci(4.4.0-0.nightly-2019-12-18-064553 - 4.4.0-0.nightly-2019-12-19-223334) jobs triggered, no this error "After retrying 2 times, Pull image still failed due to error: unable to retrieve auth token: invalid username/password: unauthorized: Please login to the Red Hat Registry using your Customer Portal credentials" retrived again.

Mark this bug to verified.

Comment 16 W. Trevor King 2019-12-20 04:09:15 UTC
Can we backport the fixes to 4.3?  Looks like it's happening there too, e.g. in 4.3.0-0.nightly-2019-12-20-005113 [1]:

F1220 02:02:40.578281       1 helpers.go:114] error: build error: After retrying 2 times, Pull image still failed due to error: unable to retrieve auth token: invalid username/password: unauthorized: Please login to the Red Hat Registry using your Customer Portal credentials. Further instructions can be found here: https://access.redhat.com/RegistryAuthentication

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/627

Comment 17 Gabe Montero 2019-12-21 15:24:07 UTC
4.3.z clone https://bugzilla.redhat.com/show_bug.cgi?id=1785845 created

Comment 19 errata-xmlrpc 2020-05-04 11:16:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581


Note You need to log in before you can comment on or make changes to this bug.