Bug 1404024 - Builds failing with "Failed to push image: unauthorized" when running simultaneous builds
Summary: Builds failing with "Failed to push image: unauthorized" when running simultaneous builds
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 3.4.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Oleg Bulatov
QA Contact: Mike Fiedler
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-12-12 21:11 UTC by Mike Fiedler
Modified: 2018-01-11 13:33 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-01-11 13:33:18 UTC
Target Upstream Version:


Attachments
Registry logs (3.96 MB, application/x-gzip), 2016-12-12 21:11 UTC, Mike Fiedler
Failed build log (784 bytes, text/plain), 2016-12-12 21:11 UTC, Mike Fiedler
Application template (7.68 KB, text/plain), 2016-12-12 21:12 UTC, Mike Fiedler

Description Mike Fiedler 2016-12-12 21:11:06 UTC
Created attachment 1230947 [details]
Registry logs

Description of problem:

When doing simultaneous builds with 2 docker registry pods and S3 storage for the registry, about 20% of the builds are failing with the following error:

error: build error: Failed to push image: unauthorized: authentication required 

The registry logs contain so many error messages that it is difficult to tell what's real. I do see this for a failed build: "error authorizing context: access denied". Search the attached logs for svt925/cakephp5; that is an example of a failed build. The log for that build is attached as well.
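
For reference, a quick way to enumerate the failed builds and pull an individual build log (a sketch; the build name below is illustrative):

    # count failed builds across all projects
    oc get builds --all-namespaces | grep -c Failed

    # fetch the log of one failed build, e.g. the svt925/cakephp5 example above
    oc logs build/cakephp5-1 -n svt925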


Version-Release number of selected component (if applicable):  3.4.0.34


How reproducible:   Always for some portion of the simultaneous builds.

Environment:

1 m4.4xlarge master
1 m4.4xlarge etcd
2 m4.4xlarge registry nodes configured to use S3 storage
15 m4.xlarge builder nodes
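
For completeness, a minimal sketch of how a two-pod registry like this is scaled and checked (assuming the default docker-registry deployment config in the default project):

    oc scale dc/docker-registry --replicas=2 -n default
    oc get pods -n default -o wide | grep docker-registry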


Steps to Reproduce:
1.  Create 1000 projects, each with 10 copies of the attached template. The template creates 10 dc (0 replicas), 10 svc, 10 routes, and 10 bc for a cakephp application. 1000 projects x 10 apps = 10000 applications (see the sketch after these steps).
2.  Use https://github.com/openshift/svt/blob/master/openshift_performance/ose3_perf/scripts/build_test.py to build the applications.   I run it with this syntax:

python -u build_test.py -m https://ec2-54-218-88-227.us-west-2.compute.amazonaws.com:8443 -u redhat -p redhat -n 1 -a -b 60 -l

This will build all 10000 apps in batches of 60 (4 simultaneous builds per builder node).  
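
A rough sketch of step 1, assuming the attached template is saved as cakephp-template.json and exposes a NAME-style parameter (the actual parameter names are in the attachment):

    # create 1000 projects with 10 cakephp apps each (10000 applications total)
    for i in $(seq 0 999); do
      oc new-project svt$i
      for j in $(seq 0 9); do
        oc new-app -f cakephp-template.json -p NAME=cakephp$j -n svt$i
      done
    done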


Actual results:

About 20% of the builds are failing with:

error: build error: Failed to push image: unauthorized: authentication required 


Expected results:

All builds succeed (or very few failures)


Additional info:

Comment 1 Mike Fiedler 2016-12-12 21:11:42 UTC
Created attachment 1230948 [details]
Failed build log

Comment 2 Mike Fiedler 2016-12-12 21:12:15 UTC
Created attachment 1230949 [details]
Application template

Comment 3 Wang Haoran 2016-12-13 03:34:08 UTC
(In reply to Mike Fiedler from comment #2)
> Created attachment 1230949 [details]
> Application template

Mike, could you please provide the registry pod log?

Comment 4 Michal Minar 2016-12-13 10:08:37 UTC
@Wang it's already in the attachments [1].

[1] https://bugzilla.redhat.com/attachment.cgi?id=1230947

Comment 5 Michal Fojtik 2016-12-13 13:33:42 UTC
Can we also get a full build log (BUILD_LOGLEVEL=5)?

Comment 6 Mike Fiedler 2016-12-13 14:26:52 UTC
I am reproducing later this morning.   Will set all builds to loglevel 5.
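
For reference, one way to do that per BuildConfig (the bc name and project here are illustrative):

    oc set env bc/cakephp5 BUILD_LOGLEVEL=5 -n svt925
    oc start-build cakephp5 -n svt925 --follow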

Comment 7 Michal Minar 2016-12-13 15:24:29 UTC
This seems to be the culprit:

    level=error msg="OpenShift client error: the server has asked for the client to provide credentials (post localSubjectAccessReviews)" go.version=go1.7.3 http.request.host="172.27.9.119:5000" http.request.id=c79a91bf-d23f-4729-bd9f-1dbc2a93561e http.request.method=PUT http.request. remoteaddr="172.20.6.1:54238" http.request.uri="/v2/svt925/cakephp5/blobs/uploads/6a93eab3-a771-4d5f-9261-6a253d9f71d7?_state=N3qXqw1Bkk16hGTLpcrQtvkBSQL3vyRSB2IW10KSAVV7Ik5hbWUiOiJzdnQ5MjUvY2FrZXBocDUiLCJVVUlEIjoiNmE5M2VhYjMtYTc3MS00ZDVmLTkyNjEtNmEyNTNkOWY3MWQ3IiwiT2Zmc2V0Ijo2ODY0NzU0MCwiU3RhcnRlZEF0IjoiMjAxNi0xMi0xMlQyMDo0MDozM1oifQ%3D%3D&digest=sha256%3Aaad5d4c7bba900248b9ece846ee73cab334f5f2687b0773de9b1f2b634bf0190"

The token is refused (if provided at all) even though the previous request

    PATCH /v2/svt925/cakephp5/blobs/uploads/6a93eab3-a771-4d5f-9261-6a253d9f71d7?_state=zabmoUWgHohTaxVZqxJenpQP4M3QAjRfABxMMq-7Zwl7Ik5hbWUiOiJzdnQ5MjUvY2FrZXBocDUiLCJVVUlEIjoiNmE5M2VhYjMtYTc3MS00ZDVmLTkyNjEtNmEyNTNkOWY3MWQ3IiwiT2Zmc2V0IjowLCJTdGFydGVkQXQiOiIyMDE2LTEyLTEyVDIwOjQwOjMzLjA3ODA5OTAyOVoifQ%3D%3D HTTP/1.1" 202

succeeded. No idea why.

I'd very much like to see the master log from that time.

I'll try to reproduce locally and see what I can get.
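
One way to chase this further is to correlate the failing request across the logs using the request id and repository from the error line above (the master unit name below is an assumption for a default 3.4 install):

    # registry side: all log lines for the refused PUT
    grep c79a91bf-d23f-4729-bd9f-1dbc2a93561e registry*.log

    # master side: auth/SAR errors around the same timestamp
    journalctl -u atomic-openshift-master --since "2016-12-12 20:40" | grep -i 'subjectaccessreview\|credentials'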

Comment 8 Michal Fojtik 2016-12-13 16:42:57 UTC
(In reply to Michal Minar from comment #7)
> This seems to be the culprit:
> 
>     level=error msg="OpenShift client error: the server has asked for the
> client to provide credentials (post localSubjectAccessReviews)"
> go.version=go1.7.3 http.request.host="172.27.9.119:5000"
> http.request.id=c79a91bf-d23f-4729-bd9f-1dbc2a93561e http.request.method=PUT
> http.request. remoteaddr="172.20.6.1:54238"
> http.request.uri="/v2/svt925/cakephp5/blobs/uploads/6a93eab3-a771-4d5f-9261-
> 6a253d9f71d7?_state=N3qXqw1Bkk16hGTLpcrQtvkBSQL3vyRSB2IW10KSAVV7Ik5hbWUiOiJzd
> nQ5MjUvY2FrZXBocDUiLCJVVUlEIjoiNmE5M2VhYjMtYTc3MS00ZDVmLTkyNjEtNmEyNTNkOWY3MW
> Q3IiwiT2Zmc2V0Ijo2ODY0NzU0MCwiU3RhcnRlZEF0IjoiMjAxNi0xMi0xMlQyMDo0MDozM1oifQ%
> 3D%3D&digest=sha256%3Aaad5d4c7bba900248b9ece846ee73cab334f5f2687b0773de9b1f2b
> 634bf0190"
> 
> The token is refused (if provided at all) even though the previous request
> 
>     PATCH
> /v2/svt925/cakephp5/blobs/uploads/6a93eab3-a771-4d5f-9261-
> 6a253d9f71d7?_state=zabmoUWgHohTaxVZqxJenpQP4M3QAjRfABxMMq-
> 7Zwl7Ik5hbWUiOiJzdnQ5MjUvY2FrZXBocDUiLCJVVUlEIjoiNmE5M2VhYjMtYTc3MS00ZDVmLTky
> NjEtNmEyNTNkOWY3MWQ3IiwiT2Zmc2V0IjowLCJTdGFydGVkQXQiOiIyMDE2LTEyLTEyVDIwOjQwO
> jMzLjA3ODA5OTAyOVoifQ%3D%3D HTTP/1.1" 202
> 
> succeeded. No idea why.
> 
> I'd very like to see master log from that time.
> 
> I'll try to reproduce locally and see what I can get.

Maybe it is because the authentication endpoint is stressed and times out for some requests. Do we retry the requests that time out? As David suggested, we should look at the remote authenticator to see if we are timing out there.

Do we have any logs in the registry that might reveal that the server timed out?

Comment 9 Mike Fiedler 2016-12-14 01:23:52 UTC
I don't think this is a registry issue, and it may not be an issue at all. Given comment 7 and comment 8 above, I found a correlation between the errors and some etcd timeouts (I was using a single etcd). I've moved etcd to a 3-node cluster, and the same level of stress (and even higher) is not producing these pervasive errors. I have a handful of builds that have failed with 500 errors in the build logs, but not like what was seen before.

This needs some additional investigation, as we need to get to 10K+ builds over 1000 namespaces, but the reported error is not currently seen.

I propose closing this bug unless Michal F and Michal M have seen anything that they think needs to be investigated on the registry side.
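
For the record, the etcd-side checks behind this correlation look roughly like the following (the cert paths and master unit name are assumptions for a standard OCP 3.4 install):

    # etcd cluster health
    etcdctl -C https://$(hostname):2379 --ca-file=/etc/etcd/ca.crt \
        --cert-file=/etc/etcd/peer.crt --key-file=/etc/etcd/peer.key cluster-health

    # master log entries pointing at slow or failed etcd requests
    journalctl -u atomic-openshift-master | grep -iE 'etcd.*(timeout|timed out)'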

Comment 10 Michal Fojtik 2016-12-14 09:11:46 UTC
Thanks Mike, moving this off the blocker list.

Comment 11 Michal Minar 2016-12-14 15:05:41 UTC
I'd very much like to see a different message (e.g. "timeout communicating with master API"). Unfortunately, the registry relies on the master API and its client library, which needs to generate proper errors.

I'm switching this to the Auth component and hope you don't mind.

Comment 12 Jordan Liggitt 2016-12-19 16:41:58 UTC
Is this a timeout error encountered when the registry is performing the subjectaccessreview check? If so, the registry can return a different error in that case.

Comment 13 Jordan Liggitt 2016-12-19 16:42:40 UTC
I think the registry still has the information it needs to return a better error. The API doesn't return unauthorized errors in timeout cases.

Comment 14 Oleg Bulatov 2018-01-09 20:15:26 UTC
$ ag c2a698bb-3558-4198-a2bb-5edab2658cd8 ./registry1.log
75105:time="2016-12-12T20:40:30.275116212Z" level=error msg="OpenShift access denied: User \"system:serviceaccount:svt925:builder\" cannot get imagestreams/layers in project \"svt581\"" go.version=go1.7.3 http.request.host="172.27.9.119:5000" http.request.id=c2a698bb-3558-4198-a2bb-5edab2658cd8 http.request.method=POST http.request.remoteaddr="172.20.6.1:53966" http.request.uri="/v2/svt925/cakephp5/blobs/uploads/?from=svt581%2Fcakephp0&mount=sha256%3A2cd00c647ebf035b4e71a116ddcce480cfd6c8406b0b39ebdaad50b8edcc557f" http.request.useragent="docker/1.12.3 go/go1.6.2 git-commit/8b91553-redhat kernel/3.10.0-514.el7.x86_64 os/linux arch/amd64 UpstreamClient(go-dockerclient)" instance.id=ede6f3ba-4dd2-444d-a68a-ba0665ff93a6 vars.name="svt925/cakephp5"

The error came from the master API side.

Is it still reproducible?
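
For reference, the denied cross-namespace access in that log line can be checked (and, if desired, granted) roughly like this; the service account and projects are taken from the quoted error:

    # who may read image layers in the source project of the cross-repo mount?
    oc adm policy who-can get imagestreams/layers -n svt581

    # letting svt925's builder pull from svt581 would permit the mount
    oc adm policy add-role-to-user system:image-puller system:serviceaccount:svt925:builder -n svt581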

Comment 15 Oleg Bulatov 2018-01-09 20:21:05 UTC
It also might be a cross-mount problem. It's fixed by:

> commit 860fa7f54f78603891afc4d5918ca615015c44d9
> Author: Jordan Liggitt <jliggitt>
> Date:   Tue Jun 28 01:01:53 2016 -0400
> 
>     Delegate enforcement of cross-repo mount auth errors to blobstore

and a bunch of additional fixes by Michal Minar.

Comment 16 Mike Fiedler 2018-01-10 12:56:46 UTC
Re-testing on 3.9 today.

Comment 17 Mike Fiedler 2018-01-10 18:18:25 UTC
Verified the issue does not occur in 3.9.0-0.16.0. Do you want to close this as WORKSFORME? UPSTREAM? Or I can mark it VERIFIED if you want to put it ON_QA.

