Description of problem:

After discussions with David Eads, and a little bit with Erica VB as well, this is an offshoot from the master team's https://bugzilla.redhat.com/show_bug.cgi?id=1692832, which is tracking down `Unauthorized` errors on client access while the openshift-apiserver is reporting as up, with available == true and failing/progressing == false.

Version-Release number of selected component (if applicable): 4.1

How reproducible:

We've been seeing this in e2e-aws-builds and e2e-aws-jenkins jobs for a couple of weeks now, even after some prior reports of unexpected client `Unauthorized` errors were addressed. Here is some triage of the last two instances:

1) https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22435/pull-ci-openshift-origin-master-e2e-aws-builds/1173

The client-side error details:

```
Mar 29 20:47:52.061: INFO: Creating project "e2e-test-build-proxy-pc7b9"
Mar 29 20:47:52.208: INFO: Waiting on permissions in project "e2e-test-build-proxy-pc7b9" ...
Mar 29 20:47:52.227: INFO: Waiting for ServiceAccount "default" to be provisioned...
[AfterEach] [Feature:Builds][Slow] builds should support proxies
  /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/util/cli.go:88
STEP: Collecting events from namespace "e2e-test-build-proxy-pc7b9".
STEP: Found 0 events.
Mar 29 20:48:28.736: INFO: skipping dumping cluster info - cluster too large
STEP: Deleting namespaces
Mar 29 20:48:28.776: INFO: namespace : e2e-test-build-proxy-pc7b9 api call to delete is complete
STEP: Waiting for namespaces to vanish
[AfterEach] [Feature:Builds][Slow] builds should support proxies
  /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:148
Mar 29 20:48:40.816: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
Mar 29 20:48:40.853: INFO: Running AfterSuite actions on all node
Mar 29 20:48:40.853: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/util/cli.go:249]: Expected error:
    <*errors.StatusError | 0xc421fe8fc0>: {
        ErrStatus: {
            TypeMeta: {Kind: "", APIVersion: ""},
            ListMeta: {SelfLink: "", ResourceVersion: "", Continue: ""},
            Status: "Failure",
            Message: "Unauthorized",
            Reason: "Unauthorized",
            Details: nil,
            Code: 401,
        },
    }
    Unauthorized
not to have occurred
```

Errors in the openshift-apiservers that jumped out at me, at least somewhat near the relevant timeframe:

- `E0329 20:49:08.628602 1 webhook.go:192] Failed to make webhook authorizer request: .authorization.k8s.io "" is invalid: spec.user: Invalid value: "": at least one of user or group must be specified` from the 56qbb replica ...
- similar ones in the drtb9 and wqkc4 replicas

Error in the openshift-authentication operator:

- it said it was unhappy for a second, but then very quickly said things were fine, about 20 seconds before the e2e error:

```
I0329 20:47:33.680407 1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"openshift-authentication-operator", UID:"45ed3d1f-5263-11e9-bdd7-0eed886f6fe6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for operator authentication changed: {"conditions":[{"type":"Failing","status":"True","lastTransitionTime":"2019-03-29T20:44:43Z","reason":"Failing","message":"Failing: error checking payload readiness: unable to check route health: failed to GET route: dial tcp: lookup openshift-authentication-openshift-authentication.apps.ci-op-7vwrzsvc-85706.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host"},{"type":"Progressing","status":"True","lastTransitionTime":"2019-03-29T20:46:19Z","reason":"Progressing","message":"Progressing: got '404 Not Found' status while trying to GET the OAuth well-known https://172.30.0.1/.well-known/oauth-authorization-server endpoint data"},{"type":"Available","status":"False","lastTransitionTime":"2019-03-29T20:43:51Z","reason":"Available"},{"type":"Upgradeable","status":"Unknown","lastTransitionTime":"2019-03-29T20:43:37Z","reason":"NoData"}],"versions":[{"name":"integrated-oauth-server","version":"0.0.1-2019-03-29-201731_openshift"}
A:
],"relatedObjects":[{"group":"operator.openshift.io","resource":"authentications","name":"cluster"},{"group":"config.openshift.io","resource":"authentications","name":"cluster"},{"group":"config.openshift.io","resource":"infrastructures","name":"cluster"},{"group":"config.openshift.io","resource":"oauths","name":"cluster"},{"group":"","resource":"namespaces","name":"openshift-config"},{"group":"","resource":"namespaces","name":"openshift-config-managed"},{"group":"","resource":"namespaces","name":"openshift-authentication"},{"group":"","resource":"namespaces","name":"openshift-authentication-operator"}],"extension":null} B: ,{"name":"operator","version":"0.0.1-2019-03-29-201731"}],"relatedObjects":[{"group":"operator.openshift.io","resource":"authentications","name":"cluster"},{"group":"config.openshift.io","resource":"authentications","name":"cluster"},{"group":"config.openshift.io","resource":"infrastructures","name":"cluster"},{"group":"config.openshift.io","resource":"oauths","name":"cluster"},{"group":"","resource":"namespaces","name":"openshift-config"},{"group":"","resource":"namespaces","name":"openshift-config-managed"},{"group":"","resource":"namespaces","name":"openshift-authentication"},{"group":"","resource":"namespaces","name":"openshift-authentication-operator"}],"extension":null} I0329 20:47:33.683818 1 status_controller.go:150] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2019-03-29T20:47:33Z","reason":"AsExpected","status":"False","type":"Failing"},{"lastTransitionTime":"2019-03-29T20:47:33Z","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2019-03-29T20:47:33Z","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2019-03-29T20:43:37Z","reason":"NoData","status":"Unknown","type":"Upgradeable"}]}} I0329 20:47:33.690194 1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", 
Name:"openshift-authentication-operator", UID:"45ed3d1f-5263-11e9-bdd7-0eed886f6fe6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for operator authentication changed: Failing changed from True to False (""),Progressing changed from True to False (""),Available changed from False to True ("")
```

No relevant errors in the openshift authentication servers themselves that I could see, other than an occasional `http: TLS handshake error from ... remote error: tls: bad certificate`.

2) https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22435/pull-ci-openshift-origin-master-e2e-aws-builds/1180

Client-side error details:

```
STEP: create application build configs for 3 strategies
Apr 1 15:19:32.315: INFO: Running 'oc create --config=/tmp/configfile584211276 --namespace=e2e-test-forcepull-zs2qt -f /tmp/fixture-testdata-dir104908641/test/extended/testdata/forcepull-test.json'
Apr 1 15:19:40.533: INFO: Error running &{/usr/bin/oc [oc create --config=/tmp/configfile584211276 --namespace=e2e-test-forcepull-zs2qt -f /tmp/fixture-testdata-dir104908641/test/extended/testdata/forcepull-test.json] []
buildconfig.build.openshift.io/ruby-sample-build-td created
buildconfig.build.openshift.io/ruby-sample-build-ts created
error: You must be logged in to the server (Unauthorized)
buildconfig.build.openshift.io/ruby-sample-build-td created
buildconfig.build.openshift.io/ruby-sample-build-ts created
error: You must be logged in to the server (Unauthorized)
 [] <nil> 0xc422836f00 exit status 1 <nil> <nil> true [0xc422a26e00 0xc422a26e28 0xc422a26e28] [0xc422a26e00 0xc422a26e28] [0xc422a26e08 0xc422a26e20] [0x8b8ab0 0x8b8bb0] 0xc42238e5a0 <nil>}:
buildconfig.build.openshift.io/ruby-sample-build-td created
buildconfig.build.openshift.io/ruby-sample-build-ts created
error: You must be logged in to the server (Unauthorized)
buildconfig.build.openshift.io/ruby-sample-build-td created
buildconfig.build.openshift.io/ruby-sample-build-ts created
error: You must be logged in to the server (Unauthorized)
[AfterEach]
  /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/builds/forcepull.go:99
Apr 1 15:19:40.534: INFO: Running 'oc delete --config=/tmp/admin.kubeconfig --namespace=e2e-test-forcepull-zs2qt clusterrolebinding custombuildaccess-e2e-test-forcepull-zs2qt-user'
```

Errors in the openshift-apiservers:

- soon after, more of those: `E0401 15:19:52.465974 1 webhook.go:192] Failed to make webhook authorizer request: .authorization.k8s.io "" is invalid: spec.user: Invalid value: "": at least one of user or group must be specified`

Errors in the openshift authentication operator:

- a couple of minutes before, some churn:

```
{"lastTransitionTime":"2019-04-01T15:14:04Z","reason":"NoData","status":"Unknown","type":"Upgradeable"}]}}
I0401 15:17:20.493001 1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"openshift-authentication-operator", UID:"c0c48da4-5490-11e9-9101-0a4b15f660f8", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for operator authentication changed: Progressing changed from False to True ("Progressing: got '404 Not Found' status while trying to GET the OAuth well-known https://172.30.0.1/.well-known/oauth-authorization-server endpoint data"),Available changed from True to False ("")
I0401 15:17:22.706773 1 status_controller.go:150] clusteroperator/authentication diff
{"status":{"conditions":[{"lastTransitionTime":"2019-04-01T15:17:18Z","reason":"AsExpected","status":"False","type":"Failing"},{"lastTransitionTime":"2019-04-01T15:17:22Z","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2019-04-01T15:17:22Z","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2019-04-01T15:14:04Z","reason":"NoData","status":"Unknown","type":"Upgradeable"}]}}
I0401 15:17:22.714099 1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"openshift-authentication-operator", UID:"c0c48da4-5490-11e9-9101-0a4b15f660f8", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for operator authentication changed: Progressing changed from True to False (""),Available changed from False to True ("")
```

Errors from the openshift auth servers:

- some more intermittent `http: TLS handshake error from 10.131.0.5:49028: remote error: tls: bad certificate` throughout
The apiserver message is unfortunately just a red herring, caused by sending an empty user.Info structure to SAR in the SCC admission plugin. This should probably be fixed, but it is not the root cause of the trouble. The "TLS handshake error from 10.131.0.5:49028: remote error: tls: bad certificate" entries look bad, though; I'll try to investigate those next.
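The red-herring message can be illustrated with a minimal sketch (Python here for brevity; the real check is Go validation code in the apiserver, and the function name below is hypothetical):

```python
# Hypothetical sketch of the SubjectAccessReview spec validation that emits
# "at least one of user or group must be specified". This is illustrative
# only -- it just shows why an admission plugin sending an empty user.Info
# trips the check, regardless of whether authentication itself is healthy.

def validate_sar_spec(user, groups):
    """Return an error string if the SAR spec is invalid, else None."""
    if not user and not groups:
        return ('spec.user: Invalid value: "": '
                'at least one of user or group must be specified')
    return None

# An empty user.Info, as sent by the SCC admission plugin, fails validation:
print(validate_sar_spec("", []))
# A spec with either field populated passes:
print(validate_sar_spec("system:anonymous", ["system:unauthenticated"]))
```

So the log line tells us a caller passed an empty subject, not that the authorizer itself is broken.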
The bug https://bugzilla.redhat.com/show_bug.cgi?id=1695048 that Devan opened yesterday looks an awful lot like this one as well.
So I found the damning 401 audit log entries for an `Unauthorized` failure on an `oc start-build ...`:

```
ip-10-0-137-162.ec2.internal-audit.log:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"c846969e-c563-4dce-b3de-11685224ca3f","stage":"ResponseComplete","requestURI":"/apis/build.openshift.io/v1/namespaces/e2e-test-cli-start-build-c4jjg/buildconfigs/sample-build/webhooks/mysecret/generic","verb":"create","user":{"username":"system:anonymous","groups":["system:unauthenticated"]},"sourceIPs":["10.0.6.236"],"userAgent":"curl/7.29.0","objectRef":{"resource":"buildconfigs","namespace":"e2e-test-cli-start-build-c4jjg","name":"sample-build","apiGroup":"build.openshift.io","apiVersion":"v1","subresource":"webhooks"},"responseStatus":{"metadata":{},"status":"Failure","reason":"Unauthorized","code":401},"requestReceivedTimestamp":"2019-04-02T21:00:02.514046Z","stageTimestamp":"2019-04-02T21:00:02.565089Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:webhooks\" of ClusterRole \"system:webhook\" to Group \"system:unauthenticated\""}}
```
```
ip-10-0-172-97.ec2.internal-audit-2019-04-02T21-13-56.645.log:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"e0f84536-0250-4071-82e0-4436be5e98b6","stage":"ResponseComplete","requestURI":"/apis/build.openshift.io/v1/namespaces/e2e-test-cli-start-build-c4jjg/buildconfigs/sample-build/webhooks/mysecret/generic","verb":"create","user":{"username":"system:anonymous","groups":["system:unauthenticated"]},"sourceIPs":["10.0.6.236"],"userAgent":"curl/7.29.0","objectRef":{"resource":"buildconfigs","namespace":"e2e-test-cli-start-build-c4jjg","name":"sample-build","apiGroup":"build.openshift.io","apiVersion":"v1","subresource":"webhooks"},"responseStatus":{"metadata":{},"code":401},"requestReceivedTimestamp":"2019-04-02T21:00:02.507874Z","stageTimestamp":"2019-04-02T21:00:02.602991Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:webhooks\" of ClusterRole \"system:webhook\" to Group \"system:unauthenticated\""}}
```

In conversations with Mo, it looks like the authorization provided for the build config webhook is no bueno. So there appears to be some sort of webhook flavor to this, even if the empty-user error message from before is in fact benign. Mo said he is working on another debug PR. Be sure to run /test e2e-aws-builds and /test e2e-aws-jenkins in that PR; getting an `Unauthorized` should be pretty likely. I've triggered them in Standa's debug PR.
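What makes these entries damning is the combination of fields: RBAC explicitly allowed the webhook request, yet the response was still a 401. A quick sketch of pulling those fields out of an abridged copy of the audit event above:

```python
import json

# Abridged from the audit entry above; only the fields relevant to the
# contradiction are kept.
event = json.loads('''{
  "user": {"username": "system:anonymous", "groups": ["system:unauthenticated"]},
  "responseStatus": {"status": "Failure", "reason": "Unauthorized", "code": 401},
  "annotations": {"authorization.k8s.io/decision": "allow"}
}''')

decision = event["annotations"]["authorization.k8s.io/decision"]
code = event["responseStatus"]["code"]

# Authorization said "allow", but the response was still 401 -- so the
# rejection happened somewhere other than the RBAC authorization decision.
print(decision, code)  # allow 401
```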
So neither Mo's nor my debug PRs turned up any more unexpected 401s on the bc webhook flows (my debugging did uncover one of our e2e's that is expected to induce an error in the bc webhook flow :-) ). Today, I saw several `Unauthorized` errors that mapped to 404s on the user lookup during the e2e's. I posted details on one in https://github.com/openshift/origin/pull/22482#issuecomment-480028888. I'll start trying to figure out how to add missing-user diagnostics to my debug PR. Of course, suggestions/comments are welcome.
*** Bug 1695048 has been marked as a duplicate of this bug. ***
*** Bug 1702103 has been marked as a duplicate of this bug. ***
Sally - can you add information about what your initial investigations today found?
Marking this as urgent since 1 in 2 CI runs fail in part due to the flakes this causes. It generates a huge amount of noise, impairing our ability to find other issues.
I will update this tomorrow on my efforts over the last few weeks trying to track this down. Sally, please focus on other BZs.
It seems we have a newer manifestation of this issue when pushing imagestreamtags to the registry. We're getting lots of these errors in the latest registry and build tests:

```
error: build error: Failed to push image: error copying layers and metadata from "containers-storage:[overlay@/var/lib/containers/storage+/var/run/containers/storage:overlay.skip_mount_home=false]image-registry.openshift-image-registry.svc/e2e-test-build-multistage-l2px4/multi-stage:v1" to "docker://image-registry.openshift-image-registry.svc/e2e-test-build-multistage-l2px4/multi-stage:v1": Error trying to reuse blob sha256:8ba884070f611d31cb2c42eddb691319dc9facf5e0ec67672fcfa135181ab3df at destination: Error checking whether a blob sha256:8ba884070f611d31cb2c42eddb691319dc9facf5e0ec67672fcfa135181ab3df exists in image-registry.openshift-image-registry.svc/e2e-test-build-multistage-l2px4/multi-stage: unauthorized: authentication required
```

This is critical, as all build tests are failing on it.
https://github.com/openshift/origin/pull/22679 is merged, and that will fix this issue. https://github.com/openshift/cluster-authentication-operator/pull/118 is related but not required to fix this issue (so QA can begin testing immediately). #118 fixes a different issue where the CAO reports Available too soon, which made the bug fixed by #22679 more common (since it is a race).
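The race that #118 narrows can be illustrated with a toy model (pure illustration, not OpenShift code): an operator that flips Available=True as soon as any replica is serving can leave clients hitting a replica that still rejects them.

```python
# Toy model of the race: the operator reports Available while one replica
# has not finished rolling out its auth configuration, so a client that
# trusts the Available condition can still receive a 401.

replicas_ready = {"replica-a": True, "replica-b": False}  # b still rolling out

def operator_available(replicas):
    """Buggy readiness: reports Available once *any* replica is up."""
    return any(replicas.values())

def request(replica):
    """HTTP status a client would see from a given replica."""
    return 200 if replicas_ready[replica] else 401

print(operator_available(replicas_ready))  # True -- operator says Available
print(request("replica-b"))                # 401  -- lagging replica rejects
```

Reporting Available only once all replicas are ready (or once the well-known endpoint actually serves data) closes the window in which clients see spurious 401s.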
FWIW, after consistently seeing the various forms of `Unauthorized` throughout last week in my various PRs, since Mo's PRs merged 2 days ago I have had a double-digit number of runs across e2e-aws-build/jenkins/image-ecosystem in my various PRs come back `Unauthorized`-free. Bug reporters from dev do not mark bugs verified, but things are looking good from my end. Thanks Mo.
Verified. Ran the e2e tests locally; no such issue now.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758