Bug 1462542 - API token is not present at build time
Summary: API token is not present at build time
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 3.5.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.6.z
Assignee: David Eads
QA Contact: DeShuai Ma
URL:
Whiteboard:
Duplicates: 1447386 (view as bug list)
Depends On:
Blocks:
 
Reported: 2017-06-18 16:34 UTC by Stefanie Forrester
Modified: 2023-09-14 03:59 UTC (History)
14 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-26 04:11:59 UTC
Target Upstream Version:
Embargoed:


Attachments
Zabbix IOPS graph (527.69 KB, image/png) - 2017-06-19 14:01 UTC, Stefanie Forrester
Zabbix app-create graph (371.82 KB, image/png) - 2017-06-19 14:01 UTC, Stefanie Forrester
Zabbix app-creates since June 1 (495.39 KB, image/png) - 2017-06-19 14:02 UTC, Stefanie Forrester
Logs from suggest /metrics while creating project (3.84 KB, text/plain) - 2017-06-27 20:20 UTC, Eric Paris
watching etcd while creating a new project (9.58 KB, text/plain) - 2017-06-27 22:28 UTC, Eric Paris
journalctl --since='-60m' -b -u atomic-openshift-master-api | grep 'ended with:' (3.15 KB, text/plain) - 2017-06-28 18:24 UTC, Eric Paris
journalctl --since='-60m' -b -u systemd-journald -u atomic-openshift-master-api | grep -E '(ended with:)|(Suppress)' | grep master (7.19 KB, text/plain) - 2017-06-28 18:32 UTC, Eric Paris
cat /etc/origin/master/master-config.yaml (9.24 KB, text/plain) - 2017-06-28 19:30 UTC, Eric Paris
journalctl --since='-60m' -u atomic-openshift-master-controllers | grep 'reflector.go' (9.53 KB, text/plain) - 2017-06-28 19:48 UTC, Eric Paris

Description Stefanie Forrester 2017-06-18 16:34:26 UTC
Description of problem:

Builds are failing in starter-us-east-1 because of a delay in creating the dockercfg token used for builds.

Version-Release number of selected component (if applicable):

atomic-openshift-3.5.5.19-1.git.0.aef7e02.el7.x86_64
etcd-3.1.9-1.el7.x86_64

How reproducible:

20 out of 20 test builds in starter-us-east-1 hit this issue.

Steps to Reproduce:
1. Create a new project and any app that does a build. For example, 'oc new-project <name>; oc new-app cakephp-mysql-persistent -n <name>'

Actual results:

The deployer and builder tokens are not present in time for the first build attempt to succeed. Subsequent builds succeed after the token has been created.

Expected results:

The first build should succeed.

Additional info:

Most of the time, this error was encountered:

'No API token found for service account "deployer", retry after the token is automatically created and added to the service account'

Occasionally, this error would appear too:

'Error creating: No API token found for service account "builder", retry after the token is automatically created and added to the service account'

Also, 'oc get sa' shows the docker token missing from the build sa in the project. After some time, the docker token shows up and builds will succeed. See attached logs for details.
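
For anyone hitting this, a minimal client-side workaround sketch (assuming bash and a logged-in oc client; the secret-name pattern and the 120-second timeout are assumptions, not tested values) is to wait for the builder dockercfg secret before starting the first build:

# Hypothetical workaround: poll until the builder service account references a
# dockercfg secret, then kick off the build (names in <> are placeholders).
for i in $(seq 1 120); do
  oc get sa builder -n <name> -o yaml 2>/dev/null | grep -q 'builder-dockercfg-' && break
  sleep 1
done
oc start-build <buildconfig-name> -n <name>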

Comment 2 Eric Paris 2017-06-18 19:01:13 UTC
Can we identify when this started happening and what changes might have happened around that time? Did we upgrade any versions? Make config changes? Anything like that?

Comment 3 Jordan Liggitt 2017-06-19 13:52:20 UTC
If new-app is run immediately after a project is created, it races with service account token generation. This has been the case since 3.0 (cf. https://bugzilla.redhat.com/show_bug.cgi?id=1318917#c5)

The time it takes for the service account and tokens to be provisioned has been greatly improved, but the fundamental race still exists.

@stefanie - how long after running `oc new-project` are all three service accounts and all nine secrets present in the namespace?
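
For reference, one rough way to answer that is a timing loop (a sketch assuming bash and a logged-in oc client; the thresholds of three service accounts and nine secrets come from the question above):

# Hypothetical timing sketch: measure seconds from project creation until the
# expected service accounts and secrets all exist.
oc new-project token-timing-test
start=$(date +%s)
while true; do
  sa=$(oc get sa -n token-timing-test --no-headers 2>/dev/null | wc -l)
  secrets=$(oc get secrets -n token-timing-test --no-headers 2>/dev/null | wc -l)
  [ "$sa" -ge 3 ] && [ "$secrets" -ge 9 ] && break
  sleep 1
done
echo "provisioned after $(( $(date +%s) - start )) seconds"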

Comment 4 Stefanie Forrester 2017-06-19 14:00:53 UTC
I was on-call this weekend and got paged about it. That's the first incident that I'm aware of. But according to Zabbix, there has been an increasing number of app-create failures since June 2nd. Overall, it just appears to be getting worse over time. I'm also seeing an increasing number of IOPS used over time on this Zabbix graph.

The last upgrade was etcd on June 9th. atomic-openshift was upgraded on May 25th. I'll attach graphs and test how long it takes secrets to appear next...

Comment 5 Stefanie Forrester 2017-06-19 14:01:30 UTC
Created attachment 1289114 [details]
Zabbix IOPS graph

Comment 6 Stefanie Forrester 2017-06-19 14:01:57 UTC
Created attachment 1289115 [details]
Zabbix app-create graph

Comment 7 Stefanie Forrester 2017-06-19 14:02:26 UTC
Created attachment 1289116 [details]
Zabbix app-creates since June 1

Comment 8 Stefanie Forrester 2017-06-19 14:14:06 UTC
I was keeping track of the time by watching the age of the rc's in affected projects. Some of the rc's were 4 minutes old before the dockercfg secrets were created.

Comment 9 Jordan Liggitt 2017-06-19 14:19:15 UTC
Is this a cluster that is otherwise healthy, with controllers running as expected?

Comment 16 Jordan Liggitt 2017-06-19 17:09:21 UTC
> I wasn't able to find any instances of 'service' and 'token' on the same line. Here's the full output of 'oc get --raw /metrics'.

Oops, those metrics are served on the controller metrics port, which is 8444 by default. Something like this:

oc get --server=https://<master>:8444 --raw /metrics 2>&1 | grep serviceaccount_tokens

Comment 18 Stefanie Forrester 2017-06-19 20:28:40 UTC
*** Bug 1447386 has been marked as a duplicate of this bug. ***

Comment 19 Jordan Liggitt 2017-06-19 20:32:14 UTC
In the controllers unit file, what is the --listen address passed to the controllers? That dictates where the metrics are available.

Also, you would need to check the metrics against the controller that currently holds the active lease.

Comment 20 Stefanie Forrester 2017-06-19 20:37:13 UTC
Thanks, I'm on the active controller master now.

[root@starter-us-east-1-master-25064 ~]# grep OPTIONS /etc/sysconfig/atomic-openshift-master-controllers
OPTIONS=--loglevel=2 --listen=https://0.0.0.0:8444

I can see you were right about the port it listens on. Using that, the command to grab metrics is working:


[root@starter-us-east-1-master-25064 ~]# oc get --server=https://ip-172-31-54-162.ec2.internal:8444 --raw /metrics 2>&1 | grep serviceaccount_tokens
# HELP serviceaccount_tokens_secret_adds Total number of adds handled by workqueue: serviceaccount_tokens_secret
# TYPE serviceaccount_tokens_secret_adds counter
serviceaccount_tokens_secret_adds 3.86200587e+08
# HELP serviceaccount_tokens_secret_depth Current depth of workqueue: serviceaccount_tokens_secret
# TYPE serviceaccount_tokens_secret_depth gauge
serviceaccount_tokens_secret_depth 0
# HELP serviceaccount_tokens_secret_queue_latency How long an item stays in workqueueserviceaccount_tokens_secret before being requested.
# TYPE serviceaccount_tokens_secret_queue_latency summary
serviceaccount_tokens_secret_queue_latency{quantile="0.5"} 306044
serviceaccount_tokens_secret_queue_latency{quantile="0.9"} 538312
serviceaccount_tokens_secret_queue_latency{quantile="0.99"} 682085
serviceaccount_tokens_secret_queue_latency_sum 1.26759309437759e+14
serviceaccount_tokens_secret_queue_latency_count 3.86200587e+08
# HELP serviceaccount_tokens_secret_retries Total number of retries handled by workqueue: serviceaccount_tokens_secret
# TYPE serviceaccount_tokens_secret_retries counter
serviceaccount_tokens_secret_retries 0
# HELP serviceaccount_tokens_secret_work_duration How long processing an item from workqueueserviceaccount_tokens_secret takes.
# TYPE serviceaccount_tokens_secret_work_duration summary
serviceaccount_tokens_secret_work_duration{quantile="0.5"} 11
serviceaccount_tokens_secret_work_duration{quantile="0.9"} 85
serviceaccount_tokens_secret_work_duration{quantile="0.99"} 286
serviceaccount_tokens_secret_work_duration_sum 1.2379943135e+10
serviceaccount_tokens_secret_work_duration_count 3.86200587e+08
# HELP serviceaccount_tokens_service_adds Total number of adds handled by workqueue: serviceaccount_tokens_service
# TYPE serviceaccount_tokens_service_adds counter
serviceaccount_tokens_service_adds 51794
# HELP serviceaccount_tokens_service_depth Current depth of workqueue: serviceaccount_tokens_service
# TYPE serviceaccount_tokens_service_depth gauge
serviceaccount_tokens_service_depth 0
# HELP serviceaccount_tokens_service_queue_latency How long an item stays in workqueueserviceaccount_tokens_service before being requested.
# TYPE serviceaccount_tokens_service_queue_latency summary
serviceaccount_tokens_service_queue_latency{quantile="0.5"} 10
serviceaccount_tokens_service_queue_latency{quantile="0.9"} 100645
serviceaccount_tokens_service_queue_latency{quantile="0.99"} 106200
serviceaccount_tokens_service_queue_latency_sum 1.300959845899e+12
serviceaccount_tokens_service_queue_latency_count 51794
# HELP serviceaccount_tokens_service_retries Total number of retries handled by workqueue: serviceaccount_tokens_service
# TYPE serviceaccount_tokens_service_retries counter
serviceaccount_tokens_service_retries 18
# HELP serviceaccount_tokens_service_work_duration How long processing an item from workqueueserviceaccount_tokens_service takes.
# TYPE serviceaccount_tokens_service_work_duration summary
serviceaccount_tokens_service_work_duration{quantile="0.5"} 59733
serviceaccount_tokens_service_work_duration{quantile="0.9"} 135333
serviceaccount_tokens_service_work_duration{quantile="0.99"} 144715
serviceaccount_tokens_service_work_duration_sum 1.532152388e+09
serviceaccount_tokens_service_work_duration_count 51794
[root@starter-us-east-1-master-25064 ~]#
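
(For reference, assuming these workqueue summaries are reported in microseconds, as was the convention in Kubernetes of this vintage: the cumulative averages work out to roughly 1.268e14 / 3.862e8 ≈ 3.3e5 µs ≈ 0.33 s per item for the secret queue, but 1.301e12 / 51794 ≈ 2.5e7 µs ≈ 25 s per item for the service queue. The quantiles reflect a recent window, so they need not match these lifetime averages.)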

Comment 22 Eric Paris 2017-06-27 16:30:35 UTC
The logs are so full of stuff (think a gig per hour) that it's nearly impossible to find any needles in the haystack. What sort of things should I grep for? 'Just looking' is not really possible.

We know this etcd is taking a beating. We know that when it snapshots, which happens every 20 seconds, it can take up to 1 second to respond to GETs.

But I don't see much indication of things literally failing, just being slow.

Comment 24 Eric Paris 2017-06-27 20:20:28 UTC
Created attachment 1292450 [details]
Logs from suggest /metrics while creating project

Comment 25 Jordan Liggitt 2017-06-27 21:59:12 UTC
Analysis shows secret watch events are not being delivered:

1. The namespace is created.

2. Immediately, the service accounts are created, so the namespace watch event was received.

3. Immediately, populated tokens (from the token controller) and unpopulated tokens (from the dockercfg controller) are created, so the service account watch events were received.

4. Then there is a long, long silence. Metrics show the token controller's queue sitting at 0 almost the whole time, so we're not backed up. Eventually, the secrets appear.

Verified there is an issue with watch event delivery:
1. Manually watched service accounts and secrets:
oc get secrets -w
oc get sa -w

2. Deleted the service accounts and secrets:
oc delete secrets --all
oc delete sa --all

3. Verified the service accounts and secrets were recreated.

4. Service account watch events were delivered; secret events were not.

Comment 26 Eric Paris 2017-06-27 22:28:01 UTC
Created attachment 1292476 [details]
watching etcd while creating a new project

Comment 27 Eric Paris 2017-06-27 22:42:18 UTC
Looking at those last logs for default-token-*

create token 7zj2d WITH data
create token rgs4g WITHOUT data
compareAndDelete token 7zj2d
create token 69bt0 WITH data
*wait a minute*
compareAndSwap rgs4g so that it has data.

similar pattern for builder and deployer. Very odd.

Comment 28 Jordan Liggitt 2017-06-27 22:55:43 UTC
That's expected behavior in 3.5: the two token-creating controllers both update the service account immediately, and the one creating the populated token reacts to the conflict by deleting and requeuing.

This was improved in 1.7, but it is unrelated to the watch delivery issue, which is what causes the unpopulated token to remain unpopulated for 1-4 minutes.

Comment 29 Jordan Liggitt 2017-06-28 04:35:18 UTC
Transferring to David while I'm out.

Eric, I'd want to pin down whether the issue was in etcd or in the watch cache.

If a list of secrets takes ~1 minute from etcd, it seems possible that the event window could expire before the lister/watcher that feeds the watch cache could establish a watch.

That might make the watch cache devolve into full lists once a minute, which could explain both the load on the server and the extreme lag or absence of secret watch events.

Comment 30 Eric Paris 2017-06-28 18:24:14 UTC
Created attachment 1292688 [details]
journalctl --since='-60m' -b -u atomic-openshift-master-api | grep 'ended with:'

Comment 31 Eric Paris 2017-06-28 18:32:54 UTC
Created attachment 1292690 [details]
journalctl --since='-60m' -b -u systemd-journald -u atomic-openshift-master-api | grep -E '(ended with:)|(Suppress)' | grep master

Just to make it clear: we might have lost many, many more of these messages.

Comment 32 Eric Paris 2017-06-28 19:30:29 UTC
Created attachment 1292694 [details]
cat /etc/origin/master/master-config.yaml

Comment 33 Eric Paris 2017-06-28 19:31:51 UTC
Jun 28 19:21:40 ip-172-31-54-162.ec2.internal atomic-openshift-master-controllers[26071]: W0628 19:21:40.177584   26071 reflector.go:319] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/serviceaccount/admission.go:119: watch of *api.Secret ended with: too old resource version: 743222542 (743230476)
Jun 28 19:22:59 ip-172-31-54-162.ec2.internal atomic-openshift-master-controllers[26071]: W0628 19:22:59.120664   26071 reflector.go:319] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/serviceaccount/admission.go:119: watch of *api.Secret ended with: too old resource version: 743237322 (743243970)
Jun 28 19:24:14 ip-172-31-54-162.ec2.internal atomic-openshift-master-controllers[26071]: W0628 19:24:14.734873   26071 reflector.go:319] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/serviceaccount/admission.go:119: watch of *api.Secret ended with: too old resource version: 743250581 (743256981)
Jun 28 19:25:35 ip-172-31-54-162.ec2.internal atomic-openshift-master-controllers[26071]: W0628 19:25:35.053245   26071 reflector.go:319] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/serviceaccount/admission.go:119: watch of *api.Secret ended with: too old resource version: 743263795 (743271318)
Jun 28 19:26:54 ip-172-31-54-162.ec2.internal atomic-openshift-master-controllers[26071]: W0628 19:26:54.888254   26071 reflector.go:319] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/serviceaccount/admission.go:119: watch of *api.Secret ended with: too old resource version: 743278042 (743284459)
Jun 28 19:28:14 ip-172-31-54-162.ec2.internal atomic-openshift-master-controllers[26071]: W0628 19:28:14.683790   26071 reflector.go:319] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/serviceaccount/admission.go:119: watch of *api.Secret ended with: too old resource version: 743291479 (743298150)
Jun 28 19:29:32 ip-172-31-54-162.ec2.internal atomic-openshift-master-controllers[26071]: W0628 19:29:32.212156   26071 reflector.go:319] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/serviceaccount/admission.go:119: watch of *api.Secret ended with: too old resource version: 743304981 (743311696)
Jun 28 19:30:53 ip-172-31-54-162.ec2.internal atomic-openshift-master-controllers[26071]: W0628 19:30:53.243054   26071 reflector.go:319] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/serviceaccount/admission.go:119: watch of *api.Secret ended with: too old resource version: 743319014 (743326934)

Comment 34 Eric Paris 2017-06-28 19:32:17 UTC
# time oc get secrets --all-namespaces | wc -l
107814

real	2m32.015s
user	1m51.743s
sys	0m25.743s

Comment 35 Eric Paris 2017-06-28 19:43:24 UTC
comment #33 is, I believe, from an INACTIVE controller

Comment 36 Eric Paris 2017-06-28 19:48:33 UTC
Created attachment 1292695 [details]
journalctl --since='-60m' -u atomic-openshift-master-controllers | grep 'reflector.go'

Comment 37 Eric Paris 2017-06-28 23:17:42 UTC
etcd is not disk starved, but it clearly is on fire:

# time etcdctl --endpoints=https://172.31.54.162:2379 --ca-file=/etc/origin/master/master.etcd-ca.crt --cert-file=/etc/origin/master/master.etcd-client.crt --key-file=/etc/origin/master/master.etcd-client.key get --quorum /kubernetes.io/secrets/eparis/default-token-n3hh7 > /dev/null
2017-06-28 22:56:16.077912 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated

real	0m0.057s
user	0m0.038s
sys	0m0.014s



# time etcdctl --endpoints=https://172.31.54.162:2379 --ca-file=/etc/origin/master/master.etcd-ca.crt --cert-file=/etc/origin/master/master.etcd-client.crt --key-file=/etc/origin/master/master.etcd-client.key get --quorum /kubernetes.io/secrets/eparis/default-token-n3hh7 > /dev/null

real	0m2.818s
user	0m0.038s
sys	0m0.024s
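
To quantify how often that happens rather than catching a single slow run, here is a sampling sketch reusing the same etcdctl flags as above; the 20-sample count and 1-second spacing are arbitrary choices:

# Hypothetical sampling loop: repeat the same quorum GET and watch for stalls.
for i in $(seq 1 20); do
  time etcdctl --endpoints=https://172.31.54.162:2379 \
       --ca-file=/etc/origin/master/master.etcd-ca.crt \
       --cert-file=/etc/origin/master/master.etcd-client.crt \
       --key-file=/etc/origin/master/master.etcd-client.key \
       get --quorum /kubernetes.io/secrets/eparis/default-token-n3hh7 > /dev/null
  sleep 1
done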

Comment 38 David Eads 2017-06-29 18:50:31 UTC
Alright, got a theory.

 1. At some point, the storage cache's etcd watch expires.
 2. The storage cache terminates all API secret watches.
 3. The storage cache holds all future API secret watches until the relist is complete.
 4. The storage cache reflector does a relist; this takes one to two minutes.
 5. The storage cache accepts the API secret watches, but has no data for them.
 6. The storage cache reflector starts a watch, but gets rejected because its resource version is too old (remember, its list RV is old).
 7. Go back to step 2.

Try increasing the number of resource versions (RVs) available in etcd.
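
If the window in question turns out to be the API server's secret watch cache rather than etcd itself, one concrete knob is the kube-apiserver watch-cache-sizes flag, which OpenShift exposes through apiServerArguments in master-config.yaml. This is a hypothetical sketch; the flag placement and the size value are assumptions to verify against the installed version, not a tested recommendation:

# In /etc/origin/master/master-config.yaml, under kubernetesMasterConfig:
#
#   apiServerArguments:
#     watch-cache-sizes:
#     - secrets#10000
#
# then restart the API server on each master:
systemctl restart atomic-openshift-master-api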

Comment 44 Red Hat Bugzilla 2023-09-14 03:59:22 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

