Bug 1694878 - Unexpected `Unauthorized` errors in e2e extended tests when openshift-apiserver available==true
Summary: Unexpected `Unauthorized` errors in e2e extended tests when openshift-apiserver available==true
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Auth
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.1.0
Assignee: Mo
QA Contact: Chuan Yu
URL:
Whiteboard: buildcop
Duplicates: 1695048 1702103 (view as bug list)
Depends On:
Blocks:
 
Reported: 2019-04-01 22:51 UTC by Gabe Montero
Modified: 2019-06-04 10:46 UTC (History)
9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:46:50 UTC
Target Upstream Version:


Attachments


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 None None None 2019-06-04 10:46:57 UTC

Internal Links: 1702743

Description Gabe Montero 2019-04-01 22:51:39 UTC
Description of problem:

After discussions with David Eads, and a little bit with Erica VB as well, this is an offshoot from the master team's https://bugzilla.redhat.com/show_bug.cgi?id=1692832, which is tracking down `Unauthorized` errors on client access when the openshift-apiserver is reporting as up with available==true and failing/progressing==false.


Version-Release number of selected component (if applicable): 4.1


How reproducible:


We've been seeing this in e2e-aws-builds and e2e-aws-jenkins jobs for a couple of weeks now, even after some prior reports of unexpected client `Unauthorized` errors were addressed.

Here is some triage of the last two instances:

1) https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22435/pull-ci-openshift-origin-master-e2e-aws-builds/1173


The client side error details:

```
Mar 29 20:47:52.061: INFO: Creating project "e2e-test-build-proxy-pc7b9"
Mar 29 20:47:52.208: INFO: Waiting on permissions in project "e2e-test-build-proxy-pc7b9" ...
Mar 29 20:47:52.227: INFO: Waiting for ServiceAccount "default" to be provisioned...
[AfterEach] [Feature:Builds][Slow] builds should support proxies
  /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/util/cli.go:88
STEP: Collecting events from namespace "e2e-test-build-proxy-pc7b9".
STEP: Found 0 events.
Mar 29 20:48:28.736: INFO: skipping dumping cluster info - cluster too large
STEP: Deleting namespaces
Mar 29 20:48:28.776: INFO: namespace : e2e-test-build-proxy-pc7b9 api call to delete is complete 
STEP: Waiting for namespaces to vanish
[AfterEach] [Feature:Builds][Slow] builds should support proxies
  /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:148
Mar 29 20:48:40.816: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
Mar 29 20:48:40.853: INFO: Running AfterSuite actions on all node
Mar 29 20:48:40.853: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/util/cli.go:249]: Expected error:
    <*errors.StatusError | 0xc421fe8fc0>: {
        ErrStatus: {
            TypeMeta: {Kind: "", APIVersion: ""},
            ListMeta: {SelfLink: "", ResourceVersion: "", Continue: ""},
            Status: "Failure",
            Message: "Unauthorized",
            Reason: "Unauthorized",
            Details: nil,
            Code: 401,
        },
    }
    Unauthorized
not to have occurred
```

Errors in the openshift-apiservers that jumped out at me, at least somewhat near the relevant timeframe:

- `E0329 20:49:08.628602       1 webhook.go:192] Failed to make webhook authorizer request: .authorization.k8s.io "" is invalid: spec.user: Invalid value: "": at least one of user or group must be specified` from the 56qbb replica ... similar ones in the drtb9 and wqkc4 replicas


Error in the openshift-authentication operator:

- it said it was unhappy for a second, but then very quickly said things were fine, about 20 seconds before the e2e error:

```
I0329 20:47:33.680407       1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"openshift-authentication-operator", UID:"45ed3d1f-5263-11e9-bdd7-0eed886f6fe6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for operator authentication changed: {"conditions":[{"type":"Failing","status":"True","lastTransitionTime":"2019-03-29T20:44:43Z","reason":"Failing","message":"Failing: error checking payload readiness: unable to check route health: failed to GET route: dial tcp: lookup openshift-authentication-openshift-authentication.apps.ci-op-7vwrzsvc-85706.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host"},{"type":"Progressing","status":"True","lastTransitionTime":"2019-03-29T20:46:19Z","reason":"Progressing","message":"Progressing: got '404 Not Found' status while trying to GET the OAuth well-known https://172.30.0.1/.well-known/oauth-authorization-server endpoint data"},{"type":"Available","status":"False","lastTransitionTime":"2019-03-29T20:43:51Z","reason":"Available"},{"type":"Upgradeable","status":"Unknown","lastTransitionTime":"2019-03-29T20:43:37Z","reason":"NoData"}],"versions":[{"name":"integrated-oauth-server","version":"0.0.1-2019-03-29-201731_openshift"}

A: ],"relatedObjects":[{"group":"operator.openshift.io","resource":"authentications","name":"cluster"},{"group":"config.openshift.io","resource":"authentications","name":"cluster"},{"group":"config.openshift.io","resource":"infrastructures","name":"cluster"},{"group":"config.openshift.io","resource":"oauths","name":"cluster"},{"group":"","resource":"namespaces","name":"openshift-config"},{"group":"","resource":"namespaces","name":"openshift-config-managed"},{"group":"","resource":"namespaces","name":"openshift-authentication"},{"group":"","resource":"namespaces","name":"openshift-authentication-operator"}],"extension":null}


B: ,{"name":"operator","version":"0.0.1-2019-03-29-201731"}],"relatedObjects":[{"group":"operator.openshift.io","resource":"authentications","name":"cluster"},{"group":"config.openshift.io","resource":"authentications","name":"cluster"},{"group":"config.openshift.io","resource":"infrastructures","name":"cluster"},{"group":"config.openshift.io","resource":"oauths","name":"cluster"},{"group":"","resource":"namespaces","name":"openshift-config"},{"group":"","resource":"namespaces","name":"openshift-config-managed"},{"group":"","resource":"namespaces","name":"openshift-authentication"},{"group":"","resource":"namespaces","name":"openshift-authentication-operator"}],"extension":null}


I0329 20:47:33.683818       1 status_controller.go:150] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2019-03-29T20:47:33Z","reason":"AsExpected","status":"False","type":"Failing"},{"lastTransitionTime":"2019-03-29T20:47:33Z","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2019-03-29T20:47:33Z","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2019-03-29T20:43:37Z","reason":"NoData","status":"Unknown","type":"Upgradeable"}]}}
I0329 20:47:33.690194       1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"openshift-authentication-operator", UID:"45ed3d1f-5263-11e9-bdd7-0eed886f6fe6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for operator authentication changed: Failing changed from True to False (""),Progressing changed from True to False (""),Available changed from False to True ("")
```

No relevant errors in the openshift authentication servers themselves that I could see, other than an occasional `http: TLS handshake error from ... remote error: tls: bad certificate`

2) https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22435/pull-ci-openshift-origin-master-e2e-aws-builds/1180

client side error details:

```
STEP: create application build configs for 3 strategies
Apr  1 15:19:32.315: INFO: Running 'oc create --config=/tmp/configfile584211276 --namespace=e2e-test-forcepull-zs2qt -f /tmp/fixture-testdata-dir104908641/test/extended/testdata/forcepull-test.json'
Apr  1 15:19:40.533: INFO: Error running &{/usr/bin/oc [oc create --config=/tmp/configfile584211276 --namespace=e2e-test-forcepull-zs2qt -f /tmp/fixture-testdata-dir104908641/test/extended/testdata/forcepull-test.json] []   buildconfig.build.openshift.io/ruby-sample-build-td created
buildconfig.build.openshift.io/ruby-sample-build-ts created
error: You must be logged in to the server (Unauthorized)
 buildconfig.build.openshift.io/ruby-sample-build-td created
buildconfig.build.openshift.io/ruby-sample-build-ts created
error: You must be logged in to the server (Unauthorized)
 [] <nil> 0xc422836f00 exit status 1 <nil> <nil> true [0xc422a26e00 0xc422a26e28 0xc422a26e28] [0xc422a26e00 0xc422a26e28] [0xc422a26e08 0xc422a26e20] [0x8b8ab0 0x8b8bb0] 0xc42238e5a0 <nil>}:
buildconfig.build.openshift.io/ruby-sample-build-td created
buildconfig.build.openshift.io/ruby-sample-build-ts created
error: You must be logged in to the server (Unauthorized)
buildconfig.build.openshift.io/ruby-sample-build-td created
buildconfig.build.openshift.io/ruby-sample-build-ts created
error: You must be logged in to the server (Unauthorized)
[AfterEach] 
  /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/builds/forcepull.go:99
Apr  1 15:19:40.534: INFO: Running 'oc delete --config=/tmp/admin.kubeconfig --namespace=e2e-test-forcepull-zs2qt clusterrolebinding custombuildaccess-e2e-test-forcepull-zs2qt-user'
``` 

Errors in openshift-apiservers:
- soon after, more of those: `E0401 15:19:52.465974       1 webhook.go:192] Failed to make webhook authorizer request: .authorization.k8s.io "" is invalid: spec.user: Invalid value: "": at least one of user or group must be specified`

Errors in openshift authentication operator:
- a couple of minutes before, some churn:
```
{"lastTransitionTime":"2019-04-01T15:14:04Z","reason":"NoData","status":"Unknown","type":"Upgradeable"}]}}
I0401 15:17:20.493001       1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"openshift-authentication-operator", UID:"c0c48da4-5490-11e9-9101-0a4b15f660f8", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for operator authentication changed: Progressing changed from False to True ("Progressing: got '404 Not Found' status while trying to GET the OAuth well-known https://172.30.0.1/.well-known/oauth-authorization-server endpoint data"),Available changed from True to False ("")
I0401 15:17:22.706773       1 status_controller.go:150] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2019-04-01T15:17:18Z","reason":"AsExpected","status":"False","type":"Failing"},{"lastTransitionTime":"2019-04-01T15:17:22Z","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2019-04-01T15:17:22Z","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2019-04-01T15:14:04Z","reason":"NoData","status":"Unknown","type":"Upgradeable"}]}}
I0401 15:17:22.714099       1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"openshift-authentication-operator", UID:"c0c48da4-5490-11e9-9101-0a4b15f660f8", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for operator authentication changed: Progressing changed from True to False (""),Available changed from False to True ("")
```

Errors from openshift auth servers:
- some more intermittent `http: TLS handshake error from 10.131.0.5:49028: remote error: tls: bad certificate` throughout

Comment 1 Standa Laznicka 2019-04-03 11:48:04 UTC
The apiserver message is unfortunately just a red herring, caused by sending an empty user.Info structure to SAR in the SCC admission plugin. This should probably be fixed, but it is not the root cause of the trouble.
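For illustration only, here is a minimal Python sketch of why that message appears; the real check is the apiserver's Go validation of SubjectAccessReview, not this function, and the dict shape here is a simplification of the actual SAR spec:

```python
# Minimal sketch, for illustration only: the real check lives in the
# apiserver's Go validation of SubjectAccessReview. It shows why the empty
# user.Info sent by the SCC admission plugin trips the exact message quoted
# in the apiserver logs above.

def validate_sar_spec(spec):
    """Return validation errors for a SubjectAccessReview spec (as a dict)."""
    errors = []
    # A SAR must identify the subject: at least one of user or groups.
    if not spec.get("user") and not spec.get("groups"):
        errors.append('spec.user: Invalid value: "": '
                      'at least one of user or group must be specified')
    return errors

# The empty user.Info the SCC admission plugin sends fails validation:
print(validate_sar_spec({"user": "", "groups": []}))
# A request that names a subject passes:
print(validate_sar_spec({"user": "system:serviceaccount:ns:builder"}))  # → []
```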

"TLS handshake error from 10.131.0.5:49028: remote error: tls: bad certificate throughout" looks bad, though, I'll try to investigate that next.

Comment 2 Gabe Montero 2019-04-03 14:31:54 UTC
The bug https://bugzilla.redhat.com/show_bug.cgi?id=1695048 that Devan opened yesterday looks an awful lot like this one as well.

Comment 3 Gabe Montero 2019-04-03 18:21:12 UTC
So I found the damning 401 audit log entry for an `Unauthorized` failure on an `oc start-build ...`

ip-10-0-137-162.ec2.internal-audit.log:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"c846969e-c563-4dce-b3de-11685224ca3f","stage":"ResponseComplete","requestURI":"/apis/build.openshift.io/v1/namespaces/e2e-test-cli-start-build-c4jjg/buildconfigs/sample-build/webhooks/mysecret/generic","verb":"create","user":{"username":"system:anonymous","groups":["system:unauthenticated"]},"sourceIPs":["10.0.6.236"],"userAgent":"curl/7.29.0","objectRef":{"resource":"buildconfigs","namespace":"e2e-test-cli-start-build-c4jjg","name":"sample-build","apiGroup":"build.openshift.io","apiVersion":"v1","subresource":"webhooks"},"responseStatus":{"metadata":{},"status":"Failure","reason":"Unauthorized","code":401},"requestReceivedTimestamp":"2019-04-02T21:00:02.514046Z","stageTimestamp":"2019-04-02T21:00:02.565089Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:webhooks\" of ClusterRole \"system:webhook\" to Group \"system:unauthenticated\""}}
ip-10-0-172-97.ec2.internal-audit-2019-04-02T21-13-56.645.log:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"e0f84536-0250-4071-82e0-4436be5e98b6","stage":"ResponseComplete","requestURI":"/apis/build.openshift.io/v1/namespaces/e2e-test-cli-start-build-c4jjg/buildconfigs/sample-build/webhooks/mysecret/generic","verb":"create","user":{"username":"system:anonymous","groups":["system:unauthenticated"]},"sourceIPs":["10.0.6.236"],"userAgent":"curl/7.29.0","objectRef":{"resource":"buildconfigs","namespace":"e2e-test-cli-start-build-c4jjg","name":"sample-build","apiGroup":"build.openshift.io","apiVersion":"v1","subresource":"webhooks"},"responseStatus":{"metadata":{},"code":401},"requestReceivedTimestamp":"2019-04-02T21:00:02.507874Z","stageTimestamp":"2019-04-02T21:00:02.602991Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:webhooks\" of ClusterRole \"system:webhook\" to Group \"system:unauthenticated\""}}
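As an aside, a small filter like the following can surface anonymous 401s from audit logs with one JSON event per line, as in the two entries above. This is a hypothetical helper for triage, not part of the test suite:

```python
import json

def anonymous_401s(lines):
    """Yield (username, requestURI) for audit events that got a 401 response."""
    for line in lines:
        event = json.loads(line)  # each audit log line is one JSON Event
        if event.get("responseStatus", {}).get("code") == 401:
            username = event.get("user", {}).get("username", "<none>")
            yield username, event.get("requestURI", "")

# Trimmed-down version of the first audit entry above:
sample = json.dumps({
    "kind": "Event",
    "requestURI": "/apis/build.openshift.io/v1/namespaces/"
                  "e2e-test-cli-start-build-c4jjg/buildconfigs/sample-build/"
                  "webhooks/mysecret/generic",
    "user": {"username": "system:anonymous",
             "groups": ["system:unauthenticated"]},
    "responseStatus": {"status": "Failure", "reason": "Unauthorized",
                       "code": 401},
})
for username, uri in anonymous_401s([sample]):
    print(username, uri)
```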

In conversations with Mo, it looks like the authorization provided for the build config webhook is no bueno

So there appears to be some sort of webhook flavor to this, even if the empty user error message from before is in fact benign.

Mo said he is working on another debug PR.

Be sure to run
/test e2e-aws-builds
/test e2e-aws-jenkins 

in that PR.  Getting an `Unauthorized` should be pretty likely.

I've triggered them in Standa's debug PR.

Comment 4 Gabe Montero 2019-04-04 19:26:31 UTC
So neither Mo's nor my debug PRs turned up any more unexpected 401s on the bc webhook flows (my debug run did uncover one of our e2e's that expectedly induces an error in the bc webhook flow :-) ).

Today, I saw several `Unauthorized` errors that mapped to 404s on the user lookup during the e2e's.

I posted details on one in https://github.com/openshift/origin/pull/22482#issuecomment-480028888

I'll start trying to figure out how to add missing user diagnostics to my debug PR.

Of course suggestions/comments welcome.

Comment 5 Erica von Buelow 2019-04-05 13:22:21 UTC
*** Bug 1695048 has been marked as a duplicate of this bug. ***

Comment 6 Erica von Buelow 2019-04-23 00:17:38 UTC
*** Bug 1702103 has been marked as a duplicate of this bug. ***

Comment 7 Erica von Buelow 2019-04-23 00:19:51 UTC
Sally - can you add information about what your initial investigations today found?

Comment 8 Clayton Coleman 2019-04-23 01:25:41 UTC
Marking this as urgent since half of CI runs fail in part due to the flakes this causes. This generates a huge amount of noise, impairing our ability to find other issues.

Comment 9 Mo 2019-04-23 02:52:59 UTC
I will update this tomorrow on my efforts over the last few weeks trying to track this down.

Sally, please focus on other BZs.

Comment 10 Adam Kaplan 2019-04-24 14:38:21 UTC
It seems we have a newer manifestation of this issue when pushing imagestreamtags to the registry. We are getting lots of these errors in the latest registry and build tests:

error: build error: Failed to push image: error copying layers and metadata from "containers-storage:[overlay@/var/lib/containers/storage+/var/run/containers/storage:overlay.skip_mount_home=false]image-registry.openshift-image-registry.svc/e2e-test-build-multistage-l2px4/multi-stage:v1" to "docker://image-registry.openshift-image-registry.svc/e2e-test-build-multistage-l2px4/multi-stage:v1": Error trying to reuse blob sha256:8ba884070f611d31cb2c42eddb691319dc9facf5e0ec67672fcfa135181ab3df at destination: Error checking whether a blob sha256:8ba884070f611d31cb2c42eddb691319dc9facf5e0ec67672fcfa135181ab3df exists in image-registry.openshift-image-registry.svc/e2e-test-build-multistage-l2px4/multi-stage: unauthorized: authentication required

This is critical as all build tests are failing on this.

Comment 12 Mo 2019-04-27 13:13:03 UTC
https://github.com/openshift/origin/pull/22679 is merged and that will fix this issue.

https://github.com/openshift/cluster-authentication-operator/pull/118 is related but not required to fix this issue (so QA can begin testing immediately). #118 fixes a different issue where the CAO reports available too soon, which made the bug fixed by #22679 more common (as it is a race).

Comment 14 Gabe Montero 2019-04-29 20:14:15 UTC
FWIW, after consistently seeing the various forms of `Unauthorized` throughout last week in my various PRs, since Mo's PRs merged 2 days ago I have had a double-digit number of runs across e2e-aws-build/jenkins/image-ecosystem in my various PRs come back `Unauthorized`-free.

Bug reporters from dev do not mark bugs verified, but things are looking good from my end. Thanks Mo.

Comment 15 Chuan Yu 2019-04-30 04:34:51 UTC
Verified.

Ran the e2e tests locally; no such issue now.

Comment 17 errata-xmlrpc 2019-06-04 10:46:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

